HUME Workflows
Hume is a machine reading system developed by BBN. It extracts events and causal relations from text and serves as one of the components of the integrated system. In addition to reading documents, it supports the integrated ontology construction workflow.
Integrated system
In this workflow, Hume runs as a containerized service within the integrated system.
System Preparation
For this mode, you will need to prepare a config file.
You will also need to create a new working folder; we will use runtime as an example in this document.
Docker Config File
The config file is a JSON file that must be visible at runtime as /extra/config.json inside the container. It is adapted from https://github.com/twosixlabs-dart/python-kafka-consumer/blob/master/pyconsumer/resources/env/test.json. An example is provided below:
Note: Do not change unannotated fields in the example unless you really know what you’re doing.
{
"kafka.bootstrap.servers": "str,required: example is wm-ingest-pipeline-streaming-1.prod.dart.worldmodelers.com:9093",
"auth": {
"username": "str,required. If auth is not required, please delete auth dict directly",
"password": "str,required"
},
"app": {
"id": "hume",
"auto_offset_reset": "earliest",
"enable_auto_commit": false
},
"topic": {
"from": "str,required: topic to listen to from kafka"
},
"CDR_retrieval": "str,required: example is https://wm-ingest-pipeline-rest-1.prod.dart.worldmodelers.com/dart/api/v1/cdrs",
"DART_upload": "str,required: example is https://wm-ingest-pipeline-rest-1.prod.dart.worldmodelers.com/dart/api/v1/readers",
"Ontology_retrieval": "str,required: example is https://wm-ingest-pipeline-rest-1.prod.dart.worldmodelers.com/dart/api/v1/ontologies",
"DART_upload_labels": [
"from_docker"
],
"hume.domain": "WM",
"hume.num_of_vcpus": "int, required: Number of cpu cores available to you. It at least needs to be 2, and please only include number of physical cores instead of SMT cores.",
"hume.tmp_dir": "required,str: a bind point for sharing data in between your local system and docker instance",
"hume.streaming.mini_batch_size": "int, required: For streaming mode, this deterimines the size of a mini batch (which is used for improved efficiency)",
"hume.streaming.mini_batch_wait_time": "int, required: For streaming mode, even if we received fewer documents than mini_batch_size, the system will start processing after mini_batch_wait_time seconds.",
"hume.streaming.keep_pipeline_data": "bool, optional: When enabled all intermediate files during processing will be preserved. This is mainly a debugging switch for developer.",
"hume.streaming.skip_processed": "bool, optional: When enabled, we'll keep a record of processed (document_id, ontology_id) tuples in a file under hume.tmp_dir/processed.log and if we see the same kafka reading request, we'll ignore it. ",
"hume.use_regrounding_cache": "bool, optional: When enabled, the system will save intermediate files of processing result into a directory that if we see the same document later, we can skip certain processing steps and only kick off regrounding pipeline.",
"hume.regrounding_cache_path": "str, required when hume.use_regrounding_cache is true: a persist directory for hosting regrounding cache. Ideally to be a persist storage that can be shared in between docker instances. Also please don't reuse grounding cache from OIAD.",
"hume.cdr_cache_dir": "str or null, optional: When specified, Hume will use the path from inside docker to cache CDRs locally so when new reading requests come as long as it's found in local cache, it won't goto API for fetching it again"
}
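For reference, a filled-in config might look like the following sketch. All values are illustrative placeholders: the topic name, credentials, core count, and paths are assumptions about your environment, and the URLs simply repeat the examples given above.
{
  "kafka.bootstrap.servers": "wm-ingest-pipeline-streaming-1.prod.dart.worldmodelers.com:9093",
  "auth": {
    "username": "my-dart-username",
    "password": "my-dart-password"
  },
  "app": {
    "id": "hume",
    "auto_offset_reset": "earliest",
    "enable_auto_commit": false
  },
  "topic": {
    "from": "dart.cdr.streaming.updates"
  },
  "CDR_retrieval": "https://wm-ingest-pipeline-rest-1.prod.dart.worldmodelers.com/dart/api/v1/cdrs",
  "DART_upload": "https://wm-ingest-pipeline-rest-1.prod.dart.worldmodelers.com/dart/api/v1/readers",
  "Ontology_retrieval": "https://wm-ingest-pipeline-rest-1.prod.dart.worldmodelers.com/dart/api/v1/ontologies",
  "DART_upload_labels": [
    "from_docker"
  ],
  "hume.domain": "WM",
  "hume.num_of_vcpus": 8,
  "hume.tmp_dir": "/extra/tmp",
  "hume.streaming.mini_batch_size": 20,
  "hume.streaming.mini_batch_wait_time": 300,
  "hume.streaming.keep_pipeline_data": false,
  "hume.streaming.skip_processed": true,
  "hume.use_regrounding_cache": false,
  "hume.cdr_cache_dir": "/extra/cdr_cache"
}
In this sketch hume.regrounding_cache_path is omitted because hume.use_regrounding_cache is disabled; set both if you want the cache.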
Please copy the config file to runtime/config.json
To Run the Hume System
Run Hume with the following command:
docker run -it -v runtime:/extra docker.io/wmbbn/hume:R2022_03_21 /usr/local/envs/py3-jni/bin/python3 /wm_rootfs/git/Hume/src/python/dart_integration/streaming_processing.py
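Note that Docker treats a bare name such as runtime in -v runtime:/extra as a named volume rather than a local folder. If you intend to bind-mount the runtime directory created above, pass its absolute path, for example:
docker run -it -v "$(pwd)/runtime":/extra docker.io/wmbbn/hume:R2022_03_21 /usr/local/envs/py3-jni/bin/python3 /wm_rootfs/git/Hume/src/python/dart_integration/streaming_processing.py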
Upon successful deployment, any resulting JSON-LD CAGs will be uploaded back to DART at DART_upload, and the system will continue to listen for Kafka messages until you stop it.
Reading and assembly
In this workflow, INDRA World is used to process the output files generated by Hume. Follow the instructions below to produce the output files first, and then follow the INDRA World instructions to run assembly.
Reading only
In this workflow, Hume runs as a standalone reading system.
System Preparation
For this mode, you will need to prepare a folder of CDR files (a richer article representation from DART), an ontology metadata file, and a config file.
You will also need to create a new working folder; we will use runtime as an example in this document.
CDR Files
Detailed information about the CDR schema can be found on the DART side. For Hume's purposes, the fields below are used:
{
"document_id": "required,str: unique identifier of article",
"source_uri": "optional,str: how to fetch the document",
"extracted_metadata": {
"Author": "optional,str: author of article",
"CreationDate": "optional,str: in '%Y-%m-%d' without quote format. Will help time resolution if provided",
"Pages": "optional,int: Will help for genre determination"
},
"extracted_text": "required,str: original text of article",
"content_type": "optional,str: mime_type string, will be used genre determination"
}
Files should be named [document_id].json, where [document_id] is replaced with the actual document_id.
Copy all JSON files into your runtime/corpus directory.
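For illustration, a minimal CDR file (for example runtime/corpus/document_123.json) might look like the following; the document_id, metadata, and text are placeholders:
{
  "document_id": "document_123",
  "source_uri": "https://example.com/articles/document_123.html",
  "extracted_metadata": {
    "Author": "Jane Doe",
    "CreationDate": "2021-06-15",
    "Pages": 3
  },
  "extracted_text": "Heavy rainfall in the region caused flooding, which reduced crop yields and increased food insecurity.",
  "content_type": "text/html"
}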
Ontology Metadata File
The ontology metadata file should be formatted according to the following schema: https://github.com/WorldModelers/Ontologies/blob/master/CompositionalOntology_metadata.yml
Docker Config File
The config file is a JSON file that must be visible at runtime as /extra/config.json inside the container. An example is provided below:
{
"hume.domain": "WM",
"hume.num_of_vcpus": "int, required: Number of cpu cores available to you. It at least needs to be 2, and please only include number of physical cores instead of SMT cores.",
"hume.tmp_dir": "required,str: a bind point for sharing data in between your local system and docker instance",
"hume.manual.cdr_dir": "required,str: A path, from inside docker that contains dir of docs you want to process, if you're following above, it should be /extra/corpus",
"hume.manual.keep_pipeline_data": "optional,bool default to false: when enabled, we'll keep intermediate process file in between runs so you don't start from beginning. But if your previous run is finished and you'll kick off the new run, please set it to false for removing intermediate process file and run everything from scratch.",
"hume.external_ontology_path": "optional, str: When provided, hume will use the ontology metadata file you provided instead of pre-shipped one. This path needs to be accessible from inside docker",
"hume.external_ontology_version": "optional, str, but mandatory when you specify hume.external_ontology_path: id of the external ontology",
"hume.use_regrounding_cache": "bool, optional: When enabled, the system will save intermediate files of processing result into a directory that if we see the same document later, we can skip certain processing steps and only kick off regrounding pipeline.",
"hume.regrounding_cache_path": "str, required when hume.use_regrounding_cache is true: a persist directory for hosting regrounding cache. Ideally to be a persist storage that can be shared in between docker instances. Also please don't reuse grounding cache from OIAD."
}
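For reference, a filled-in reading-only config might look like the sketch below; the core count, paths, and ontology version ID are placeholders for your environment, and the external ontology entries can be dropped to use the pre-shipped ontology:
{
  "hume.domain": "WM",
  "hume.num_of_vcpus": 8,
  "hume.tmp_dir": "/extra/tmp",
  "hume.manual.cdr_dir": "/extra/corpus",
  "hume.manual.keep_pipeline_data": false,
  "hume.external_ontology_path": "/extra/ontology_metadata.yml",
  "hume.external_ontology_version": "example-ontology-id",
  "hume.use_regrounding_cache": false
}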
Please copy the config file to runtime/config.json
To Run the Hume System
Run Hume with the following command:
docker run -it -v runtime:/extra docker.io/wmbbn/hume:R2022_03_21 /usr/local/envs/py3-jni/bin/python3 /wm_rootfs/git/Hume/src/python/dart_integration/manual_processing.py
After the run finishes, the results will be accessible at [hume.tmp_dir]/results/[TIMESTAMP]/results. The resulting output will be in JSON-LD format, as described here.