Causemos
Causemos is the main HMI for the World Modelers program, built and maintained by Uncharted Software. It is an ecosystem that consists of the Causemos web application plus a suite of services and utilities. The essential ones for handling unstructured data are:
- Atlas: initial setup, schemas and mappings
- Anansi: knowledge ingestion and incremental assembly
- Causemos: the Causemos web app
BYOD (Bring Your Own Documents, for workflow W5) - Requires the infrastructure parts:
Recommendation/curation - optional:
Documentation for running and using Causemos can be found in TopDownModelingAndHMI.
Workflows
W4 Document management + reading + integration/assembly + HMI
In this workflow, Causemos ingests, combines and enriches an INDRA statements dataset and a DART CDR dataset to create a new Knowledge Base dataset.
Initial setup
Ensure that Elasticsearch is set up and that its mappings are installed as described in Atlas.
Install the mappings:
ES=<host:port> python es_mapper.py
Running data ingestion
Perform a one-time load of geolocation references; this upserts into a geo index. Per the documentation in the script, you will need to download and extract the following:
http://download.geonames.org/export/dump/allCountries.zip
http://clulab.cs.arizona.edu/models/gadm_woredas.txt
Once extracted, run
ES=<es_url> ES_USER=<user> ES_PASSWORD=<password> python geo_loader.py
Assuming you have the INDRA and DART datasets on the file system, the ingestion process can be kicked off with the following snippet.
Note: For all intents and purposes here, SOURCE and TARGET should have the same values.
Note: DART_DATA is expected to be in JSONL format, one CDR per line.
#!/usr/bin/env bash
SOURCE_ES=xyz \
SOURCE_USERNAME=xyz \
SOURCE_PASSWORD=xyz \
TARGET_ES=xyz \
TARGET_USERNAME=xyz \
TARGET_PASSWORD=xyz \
DART_DATA=<path_to_dart_cdr.json> \
INDRA_DATASET=<path_to_indra_directory> \
python src/knowledge_pipeline.py
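Since DART_DATA is expected to be JSONL with one CDR per line, it can be worth checking the file before starting a long ingestion run. The following is a minimal sketch, not part of anansi: the function name `check_jsonl` is hypothetical, and it only checks that each non-blank line parses as a JSON object. It does not validate any CDR fields, since the CDR schema is not described here.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Return (valid_count, bad_line_numbers) for a JSONL file.

    A line counts as valid when it parses as a JSON object. Blank lines
    are skipped. No CDR-specific fields are inspected (hypothetical helper,
    not part of the anansi pipeline).
    """
    valid = 0
    bad_lines = []
    with Path(path).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines, e.g. a trailing newline
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(number)
                continue
            if isinstance(record, dict):
                valid += 1
            else:
                # Valid JSON, but not an object (e.g. a bare list or number)
                bad_lines.append(number)
    return valid, bad_lines
```

Calling `check_jsonl` on the path you intend to pass as DART_DATA returns the number of usable lines and the 1-based line numbers that failed to parse, so a pretty-printed (multi-line) JSON file shows up immediately as a list of bad lines.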
Once done, build and start the Causemos application. You will find the new Knowledge Base under “New Analysis Project”; it will appear with the name given by INDRA’s metadata.
If you have running INDRA/DART service instances, you can:
- Download INDRA datasets via
scripts/download_indra_s3.py
- Download DART CDR dataset via
scripts/build_dart.sh
Post processing - Optional
Causemos can also generate recommendation indices, which can be used as suggestions for bulk curation. For more information, please see this repository.
W5 Document management + reading + integration/assembly + HMI + BYOD
In this workflow, it is assumed that both INDRA and DART are running as web services.
Initial setup
BYOD + incremental assembly processing takes place outside of the Causemos app due to heavy data processing and higher latency. This process uses the Prefect infrastructure for task scheduling and runs the incremental_pipeline.py script in the anansi project.
For full instructions, please see the READMEs for the incremental pipeline and the Prefect setup.
For setting up Prefect infrastructure
- Instructions here
To create a python-env
- Ssh to Prefect-server
- Run
conda create -n prefect-seq -c conda-forge "python>=3.8.0" prefect "elasticsearch==7.11.0" "boto3==1.17.18" "smart_open==5.0.0" python-dateutil requests
You also need an env/config file on the Prefect server with connection credentials for DART, INDRA, and the other services:
export SOURCE_ES=
export SOURCE_USERNAME=
export SOURCE_PASSWORD=
export TARGET_ES=
export TARGET_USERNAME=
export TARGET_PASSWORD=
export DART_HOST=
export DART_USER=
export DART_PASS=
export INDRA_HOST=
# optional
export CURATION_HOST=
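A missing or empty variable in this file is easy to overlook until the pipeline fails partway through. As a minimal sketch, the check below lists which of the variables named above are unset or empty. The helper `missing_settings` and the `REQUIRED`/`OPTIONAL` tuples are hypothetical names introduced for illustration; they simply mirror the exports listed above and are not part of anansi.

```python
import os

# Variable names copied from the env/config file above.
REQUIRED = (
    "SOURCE_ES", "SOURCE_USERNAME", "SOURCE_PASSWORD",
    "TARGET_ES", "TARGET_USERNAME", "TARGET_PASSWORD",
    "DART_HOST", "DART_USER", "DART_PASS",
    "INDRA_HOST",
)
OPTIONAL = ("CURATION_HOST",)


def missing_settings(env=None, required=REQUIRED):
    """Return the required variable names that are unset or empty.

    `env` defaults to the current process environment; pass a dict to
    check some other mapping (hypothetical helper, for illustration only).
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

After sourcing the env/config file, running `missing_settings()` in the same shell session's Python interpreter should return an empty list; any names it returns still need values.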
To set up the Prefect agent
- Copy anansi/src to the Prefect-server
- Copy credentials to Prefect-server
- Ssh to Prefect-server
- Stop Prefect local-agent
- Source env:
source <env/config file>
- Restart local-agent:
PREFECT__ENGINE__EXECUTOR__DEFAULT_CLASS="prefect.executors.LocalExecutor" PYTHONPATH="${PYTHONPATH}:<path_to_anansi_src>" prefect agent local start --api "http://<Prefect-server>:4200/graphql" --label "non-dask"
To register incremental_pipeline.py:
- Ssh to Prefect-server
- Set the “shouldRegister” flag to True in the python file
- Activate the env
conda activate prefect-seq
- (Re)register
python incremental_pipeline.py
Running Causemos with BYOD
BYOD is an optional feature in Causemos that integrates with the INDRA and DART services. To enable this feature, the Causemos server needs to be started with “dart” passed to the --schedules command line option; this enables periodic synchronization with DART and INDRA.
# Usage: yarn start-server --schedules foo,bar
yarn start-server --schedules dart
When a document is uploaded through Causemos, the request is sent to DART. Behind the scenes, the Causemos server then polls DART to see what new reader outputs are available on the Kafka queues, cross-references them against the Causemos internal document upload tracker, and sends the valid entries to INDRA for incremental assembly.
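The cross-referencing step can be illustrated with a small sketch. This is not the Causemos server code: the function `select_for_assembly`, the record fields `document_id` and `reader`, and the shape of the upload tracker (a set of document ids) are all assumptions made for illustration. It keeps only reader outputs whose document was uploaded through Causemos and that have not been forwarded before.

```python
def select_for_assembly(reader_outputs, uploaded_ids, already_sent=frozenset()):
    """Pick the reader outputs that should be forwarded for assembly.

    reader_outputs: iterable of dicts with hypothetical keys
                    "document_id" and "reader".
    uploaded_ids:   set of document ids from the upload tracker (assumed shape).
    already_sent:   set of (document_id, reader) pairs forwarded earlier.
    """
    seen = set(already_sent)
    selected = []
    for output in reader_outputs:
        key = (output.get("document_id"), output.get("reader"))
        if key[0] not in uploaded_ids:
            continue  # document was not uploaded through Causemos
        if key in seen:
            continue  # this reader's output was already forwarded
        seen.add(key)
        selected.append(output)
    return selected
```

Keying on the (document, reader) pair rather than the document alone is a design choice in this sketch: it lets outputs from several readers for the same document pass through while still dropping repeats seen on later polls.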