From fadfde5bcd25a52058ed309bc02bdc11adc117c9 Mon Sep 17 00:00:00 2001
From: Alexander Schlemmer <alexander@mail-schlemmer.de>
Date: Fri, 27 Jan 2023 12:32:21 +0100
Subject: [PATCH] DOC: new document describing the typical crawler workflow

---
 src/doc/workflow.rst | 124 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)
 create mode 100644 src/doc/workflow.rst

diff --git a/src/doc/workflow.rst b/src/doc/workflow.rst
new file mode 100644
index 00000000..0ffd50ec
--- /dev/null
+++ b/src/doc/workflow.rst
@@ -0,0 +1,124 @@
+Crawler Workflow
+================
+
+The CaosDB crawler aims to provide a very flexible framework for synchronizing
+data on file systems (or potentially other sources of information) with a
+running CaosDB instance. The workflow that is used in a scientific environment
+should be chosen according to the users' needs. It is also possible to combine
+multiple workflows or to use them in parallel.
+
+In this document we describe several workflows for crawler operation.
+
+Local Crawler Operation
+-----------------------
+
+A very simple setup, which can also be used reliably for testing (e.g. in local
+docker containers), runs the crawler on a local computer. The files that are
+being crawled need to be visible to both the local computer and the machine
+running CaosDB.
+
+Prerequisites
+++++++++++++++
+
+- Make sure that CaosDB is running, that your computer has a network connection
+  to CaosDB and that your pycaosdb.ini points to the correct instance of
+  CaosDB. Please refer to the pylib manual for questions related to the
+  configuration in pycaosdb.ini
+  (https://docs.indiscale.com/caosdb-pylib/README_SETUP.html).
+- Make sure that caosdb-crawler and caosdb-advanced-user-tools are installed
+  (e.g. using pip).
+- Make sure that you have created:
+
+  - The data model needed for the crawler.
+  - A file "identifiables.yml" describing the identifiables.
+  - A cfood file, e.g. cfood.yml.
+
+  Illustrative sketches of these three files are given at the end of this
+  document.
+
+Running the crawler
++++++++++++++++++++++
+
+Running the crawler currently involves two steps:
+
+- Inserting the files
+- Running the crawler program
+
+Inserting the files
+)))))))))))))))))))
+
+This can be done using the module "loadFiles" from caosadvancedtools
+(see https://docs.indiscale.com/caosdb-advanced-user-tools/ for installation).
+
+The generic syntax is::
+
+  python3 -m caosadvancedtools.loadFiles -p <prefix-in-caosdb-file-system> <path-to-crawled-folder>
+
+Important: The <path-to-crawled-folder> is the location of the files **as seen
+by CaosDB**. For a CaosDB instance running in a docker container (e.g. see:
+https://gitlab.com/caosdb/caosdb-docker) the command line could look like::
+
+  python3 -m caosadvancedtools.loadFiles -p / /opt/caosdb/mnt/extroot/ExperimentalData
+
+This command loads the folder "ExperimentalData" contained in the extroot
+folder within the docker container to the CaosDB prefix "/", which is the root
+prefix.
+
+Running the crawler program
+))))))))))))))))))))))))))))
+
+The following command line assumes that the extroot folder visible in the
+CaosDB docker container is located in "../extroot"::
+
+  caosdb-crawler -i identifiables.yml --prefix /extroot --debug --provenance=provenance.yml -s update cfood.yml ../extroot/ExperimentalData/
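+
+Example: pycaosdb.ini
++++++++++++++++++++++++
+
+The following is a minimal sketch of a pycaosdb.ini. The URL, user name and
+certificate path are placeholders that have to be adapted to your
+installation; the pylib manual linked above documents the available options::
+
+  [Connection]
+  # URL of the CaosDB server (placeholder):
+  url=https://localhost:10443/
+  # Certificate of the server, if it is not in your system's trust store:
+  cacert=/path/to/caosdb.cert.pem
+  username=admin
+  # Ask for the password interactively instead of storing it in this file:
+  password_method=input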
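+
+Example: identifiables.yml
+++++++++++++++++++++++++++++
+
+A minimal sketch of an identifiables definition. The record type "Experiment"
+and its properties are assumptions for illustration only; each entry lists the
+properties whose values identify a record of that type::
+
+  # Two "Experiment" records are considered the same record if both
+  # "date" and "project" match (hypothetical data model):
+  Experiment:
+    - date
+    - project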
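+
+Example: cfood.yml
++++++++++++++++++++
+
+A minimal sketch of a cfood for the folder crawled above. The folder layout,
+the record type "Experiment" and its property "date" are assumptions for
+illustration; see the caosdb-crawler documentation for the full converter
+syntax::
+
+  # Hypothetical layout: ExperimentalData/<date>/... where every dated
+  # subfolder becomes one "Experiment" record.
+  ExperimentalData:
+    type: Directory
+    match: ExperimentalData
+    subtree:
+      ExperimentFolder:
+        type: Directory
+        # The named group "date" becomes the variable $date:
+        match: (?P<date>\d{4}-\d{2}-\d{2})
+        records:
+          Experiment:
+            date: $date
-- 
GitLab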