DOC: new document describing the typical crawler workflow

fadfde5b · Alexander Schlemmer · 9ed646f4 · fadfde5b
Commit fadfde5b authored 2 years ago by Alexander Schlemmer
--- a/src/doc/workflow.rst
+++ b/src/doc/workflow.rst
+Crawler Workflow
+================
+
+The CaosDB crawler aims to provide a very flexible framework for synchronizing
+data on file systems (or potentially other sources of information) with a
+running CaosDB instance. The workflow that is used in the scientific environment
+should be chosen according to the users' needs. It is also possible to combine
+multiple workflows or use them in parallel.
+
+In this document we will describe several workflows for crawler operation.
+
+Local Crawler Operation
+-----------------------
+
+A very simple setup, which can also be used reliably for testing (e.g. in local
+docker containers), sets up the crawler on a local computer. The files that
+are being crawled need to be visible to both the local computer and the
+machine running CaosDB.
+
+Prerequisites
++++++++++++++
+
+- Make sure that CaosDB is running, that your computer has a network connection to CaosDB and
+  that your pycaosdb.ini is pointing to the correct instance of CaosDB. Please refer to the
+  pylib manual for questions related to the configuration in pycaosdb.ini
+  (https://docs.indiscale.com/caosdb-pylib/README_SETUP.html).
+- Make sure that caosdb-crawler and caosdb-advanced-user-tools are installed (e.g. using pip).
+- Make sure that you have created:
+
+  - The data model needed by the crawler.
+  - A file "identifiables.yml" describing the identifiables.
+  - A cfood file, e.g. cfood.yml.
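+
+The installation and file prerequisites above can be checked with a short
+script before the first crawler run. This is only a sketch: the import name
+``caoscrawler`` for the caosdb-crawler package is an assumption, and the helper
+``missing_prerequisites`` is hypothetical and not part of any CaosDB package.
+It does not check the connection to CaosDB or the data model.
+

```python
import importlib.util
from pathlib import Path


def missing_prerequisites(modules, files):
    """Return (modules that cannot be found, files that do not exist)."""
    # find_spec returns None for a top-level module that is not installed.
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_files = [str(f) for f in files if not Path(f).is_file()]
    return missing_modules, missing_files


if __name__ == "__main__":
    # "caoscrawler" is an assumed import name; "caosadvancedtools" is the
    # module name used for loadFiles in this document.
    mods, files = missing_prerequisites(
        ["caoscrawler", "caosadvancedtools"],
        ["identifiables.yml", "cfood.yml"],
    )
    for name in mods:
        print(f"module not installed: {name}")
    for name in files:
        print(f"file not found: {name}")
```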
+
+Running the crawler
+++++++++++++++++++
+
+Running the crawler currently involves two steps:
+
+- Inserting the files
+- Running the crawler program
+
+Inserting the files
+)))))))))))))))))))
+
+This can be done using the module "loadFiles" from caosadvancedtools.
+(See https://docs.indiscale.com/caosdb-advanced-user-tools/ for installation.)
+
+The generic syntax is::
+
+    python3 -m caosadvancedtools.loadFiles -p <prefix-in-caosdb-file-system> <path-to-crawled-folder>
+
+Important: The ``<path-to-crawled-folder>`` is the location of the files **as seen by CaosDB**. For example, for a CaosDB instance running in a docker container (see e.g. https://gitlab.com/caosdb/caosdb-docker), the command line could look like::
+
+    python3 -m caosadvancedtools.loadFiles -p / /opt/caosdb/mnt/extroot/ExperimentalData
+
+This command loads the folder "ExperimentalData", which is contained in the extroot folder within the docker container, into the CaosDB prefix "/", which is the root prefix.
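+
+When the insertion step is run repeatedly, it may be convenient to assemble
+the call in a small script. The following sketch builds exactly the command
+line shown above and runs it with ``subprocess``; the function names
+``loadfiles_command`` and ``insert_files`` are hypothetical helpers, and
+``sys.executable`` is used in place of ``python3`` on the assumption that
+caosadvancedtools is installed for the interpreter running the script.
+

```python
import subprocess
import sys


def loadfiles_command(prefix, path_seen_by_caosdb):
    """Build the argument list for the loadFiles module call."""
    return [
        sys.executable, "-m", "caosadvancedtools.loadFiles",
        "-p", prefix,
        # Location of the files as seen by CaosDB, not by the local computer.
        path_seen_by_caosdb,
    ]


def insert_files(prefix, path_seen_by_caosdb):
    """Run loadFiles; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(loadfiles_command(prefix, path_seen_by_caosdb), check=True)


if __name__ == "__main__":
    insert_files("/", "/opt/caosdb/mnt/extroot/ExperimentalData")
```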
+
+Running the crawler program
+)))))))))))))))))))))))))))
+
+The following command line assumes that the extroot folder visible in the CaosDB docker container is located in "../extroot"::
+
+    caosdb-crawler -i identifiables.yml --prefix /extroot --debug --provenance=provenance.yml -s update cfood.yml ../extroot/ExperimentalData/
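+
+The crawler call can be assembled in the same way. The sketch below only
+reproduces the options that appear in the command line above, using them as
+default values; the function ``crawler_command`` and its parameter names are
+hypothetical, and the meaning of each option should be taken from the
+crawler's own help output rather than from this example.
+

```python
import shlex
import subprocess


def crawler_command(cfood, crawled_dir, identifiables="identifiables.yml",
                    prefix="/extroot", provenance="provenance.yml",
                    s_value="update", debug=True):
    """Build the caosdb-crawler argument list used in this document."""
    cmd = ["caosdb-crawler", "-i", identifiables, "--prefix", prefix]
    if debug:
        cmd.append("--debug")
    # s_value is passed to -s unchanged, as in the command line above.
    cmd += [f"--provenance={provenance}", "-s", s_value, cfood, crawled_dir]
    return cmd


if __name__ == "__main__":
    cmd = crawler_command("cfood.yml", "../extroot/ExperimentalData/")
    print(shlex.join(cmd))  # show the command before running it
    subprocess.run(cmd, check=True)
```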