Commit 42ce7d33 authored by Henrik tom Wörden, committed by Quazgar

DOC: usage of crawler
# CaosDB Crawler
The [CaosDB crawler](https://gitlab.com/caosdb/caosdb-advanced-user-tools/blob/master/src/caosadvancedtools/crawler.py) is a tool for automated insertion and updates of entities in CaosDB.
## Introduction
In simple terms, it is a program that scans a directory structure, identifies files
that shall be treated, and generates corresponding Entities in CaosDB, possibly filling in metadata.
During this process the crawler can also open files and derive content from them, for example reading
CSV tables and processing their individual rows.
![](crawler_fingerprint.svg)
As shown in the figure, the general principle of the crawler framework is the following:
- The crawler walks through the file structure and matches file names using regular expressions.
- Based on the matched files' fingerprints, so-called `Identifiables` are created.
- CaosDB is queried for the `Identifiables`:
  - If an `Identifiable` is found, it may be updated by the crawler.
  - If an `Identifiable` does not yet exist, a new one is inserted.
That is, the `Identifiable` (or fingerprint) allows the crawler to decide automatically
whether to insert a new Record or update an existing one. This logic of
the crawler is specified in C-Foods (pun intended! :-)). These are Python
classes that are
loaded by `crawl.py` and allow for customized crawling and indexing code.
More details on the different components of the CaosDB Crawler can
be found under [Concepts](#concepts) below.
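The file-name matching step can be sketched with Python's `re` module. The pattern below is written in the style of a C-Food's `get_re()`; the paths and the pattern itself are made-up illustrations, not part of any shipped C-Food:

```python
import re

# Pattern in the style of a C-Food's get_re(); 'species' and 'date' are
# named groups that can later feed the fingerprint of an Identifiable.
# (Paths and pattern are hypothetical illustrations.)
pattern = re.compile(
    r".*/(?P<species>[^/]+)/(?P<date>\d{4}-\d{2}-\d{2})/README\.md")

paths = [
    "/TestData/mouse/2020-04-19/README.md",
    "/TestData/mouse/2020-04-19/notes.txt",
]

matches = {}
for path in paths:
    m = pattern.fullmatch(path)
    if m:
        matches[path] = m.groupdict()

# Only the README.md matches; its captured groups identify the experiment.
```

Files that do not match any C-Food's pattern are simply skipped by the crawler.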
If you are happy with our suggestion of a standard crawler, feel free to use it.
The standard crawler lives in this git repository maintained by Henrik tom Wörden:
https://gitlab.com/henrik_indiscale/scifolder
## Usage
Typically, the crawler can be invoked in two ways: via the web interface and
directly as a Python script.
In both cases, if the crawler has a problem with some file (e.g. columns in a table (tsv, xls, ...) are named incorrectly),
the problem should be indicated by a warning that is returned. You can fix the
problem and run the crawler again. This does not cause any problems, since the
crawler can recognize what has already been processed (see the description of fingerprints in the [Introduction](#Introduction)).
However, pay **attention** when you change a property that is used for the
fingerprint: the crawler will not be able to identify the previous version
with the changed one, since the fingerprint is different. This often means that entities
in the database need to be changed or removed. As a rule of thumb, you should be
pretty sure that properties that are used as fingerprints will not change after
the crawler has run for the first time. This prevents complications.
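The effect can be illustrated with a toy fingerprint; the tuples below are a made-up representation for illustration, not the crawler's actual internal data structure:

```python
# Toy fingerprints: parent name plus the identifying properties.
# Changing a fingerprint property yields a different key, so the crawler
# would insert a new Record instead of updating the existing one.
old_fingerprint = ("Experiment", (("Exp-No", 9), ("date", "2020-04-19")))
new_fingerprint = ("Experiment", (("Exp-No", 9), ("date", "2020-04-20")))

same_record = old_fingerprint == new_fingerprint  # False: treated as distinct
```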
### Invocation from the Web Interface
If enabled, the crawler can be called using a menu entry in the web interface.
This opens a form where the path of the directory that shall be crawled
needs to be given. After the execution, information about what was done and
which problems might exist is printed in the web interface.
Note that some changes might be pending authorization (if indicated in the
messages).
### Invocation as Python Script
The crawler can be executed directly via a Python script (usually called `crawl.py`).
The script prints the progress and reports potential problems.
The exact behavior depends on your setup. However, you can have a look at the example in
the [tests](https://gitlab.com/caosdb/caosdb-advanced-user-tools/-/blob/master/integrationtests/full_test/crawl.py).
Call `python3 crawl.py --help` to see what parameters can be provided. Typically,
an invocation looks like:
```
python3 crawl.py "/TestData/"
```
In this case `/TestData/` identifies the path to be crawled
**within the CaosDB file system**. You can browse the CaosDB file system by
opening the WebUI of your CaosDB instance and clicking on "File System".
In the backend, `crawl.py` starts a CQL query `FIND Files WHICH ARE STORED AT /TestData/**` and crawls these files according to your customized `C-Foods`.
Crawling may consist of two distinct steps:
1. Insertion of files (use function `loadFiles`)
2. The actual crawling (use program `crawl.py`)
However, the first step may be included in `crawl.py`. Otherwise, you can only crawl
files that were previously inserted by `loadFiles`.
#### loadFiles
After installation of the `caosadvancedtools` you can simply
call the function `loadFiles` contained in the package:
```
python3 -m caosadvancedtools.loadFiles /opt/caosdb/mnt/extroot
```
`/opt/caosdb/mnt/extroot` is the root of the file system to be crawled as seen
by the CaosDB server (the actual path may vary; this is the path used in the LinkAhead
distribution of CaosDB). In this case, the root file system as seen from within
the CaosDB docker process is used.
The crawler has the CaosDB Python client as a dependency, so make sure to install [caosdb-pylib](manuals/pylib/Setting-up-caosdb-pylib).
## Extending the Crawlers
Work in Progress
### Identifiable
As described above, the main feature of an `identifiable` is that it has sufficient properties to identify an existing Record in CaosDB that should be updated by the Crawler instead of inserting a new one. Obviously, this is necessary to allow running the Crawler twice on the same file structure without duplicating the data in CaosDB.
An `identifiable` is a Python Record object with the features needed to identify the correct Record in CaosDB. This object is used to create a query in order to determine whether the Record exists. If it does not, the Python object is used to insert the Record. Thus, after this step it is certain that a Record with the features of the `identifiable` exists in CaosDB (newly created or existing from before).
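How such a query could be derived from an `identifiable` can be sketched as follows; `identifiable_to_query` is a hypothetical helper written for illustration, not part of `caosadvancedtools`, and the real query construction may differ:

```python
def identifiable_to_query(parent, properties):
    """Build a CQL-style FIND query from an identifiable's parent and
    identifying properties (illustration only)."""
    conditions = " AND ".join(
        f"'{name}'='{value}'" for name, value in sorted(properties.items()))
    return f"FIND Record {parent} WITH {conditions}"

query = identifiable_to_query(
    "Experiment", {"date": "2020-04-19", "Exp-No": 9})
```

If the query returns exactly one Record, its ID is reused; if it returns none, the Record is inserted.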
The Crawler also provides a local cache for `identifiables`: once the Crawler knows the ID of the CaosDB Record of an `identifiable`, the ID is stored and CaosDB does not need to be queried again.
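A minimal sketch of the caching idea (not the actual `caosadvancedtools` implementation): the identifying properties become a hashable key, so each `identifiable` is resolved against CaosDB at most once:

```python
class IdentifiableCache:
    """Toy cache mapping fingerprints of identifiables to Record IDs."""

    def __init__(self):
        self._ids = {}

    @staticmethod
    def fingerprint(parent, properties):
        # Hashable, order-independent key, e.g.
        # ("Experiment", (("Exp-No", 9), ("date", "2020-04-19")))
        return (parent, tuple(sorted(properties.items())))

    def get(self, parent, properties):
        # Returns the cached ID, or None if CaosDB must still be queried.
        return self._ids.get(self.fingerprint(parent, properties))

    def store(self, parent, properties, record_id):
        self._ids[self.fingerprint(parent, properties)] = record_id


cache = IdentifiableCache()
cache.store("Experiment", {"date": "2020-04-19", "Exp-No": 9}, 1234)
# Property order does not matter for the lookup:
cached = cache.get("Experiment", {"Exp-No": 9, "date": "2020-04-19"})
```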
### C-Food
A `C-Food` is a logical unit in the insertion process. It should be independent of other data and basically defines two steps:
1. Define what is needed to do the operation, i.e. create `identifiables`.
2. Update the `identifiables` according to the data.
An example: An experiment might be uniquely identified by the date when it was conducted and a number, and the `identifiable` might look like the following:
```
<Record>
  <Parent name="Experiment"/>
  <Property name="date">2020-04-19</Property>
  <Property name="Exp-No">9</Property>
</Record>
```
Thus, after the first step an `Experiment` with those properties exists. In the second step, further properties might be added to this Record, e.g. references to data files that were recorded in that experiment or to the person who did the experiment.
A C-Food may also involve multiple `identifiables`, e.g. when the `Experiment` shall reference the `Project` that it belongs to.
### Example C-Food
```python
import caosdb as db

from .cfood import AbstractCFood, assure_has_property


class ExampleCFood(AbstractCFood):
    @staticmethod
    def get_re():
        return (r".*/(?P<species>[^/]+)/"
                r"(?P<date>\d{4}-\d{2}-\d{2})/README.md")

    def create_identifiables(self):
        self.experiment = db.Record()
        self.experiment.add_parent(name="Experiment")
        self.experiment.add_property(
            name="date",
            value=self.match.group('date'))
        self.identifiables.append(self.experiment)
```
```python
def update_identifiables(self):
    assure_has_property(
        self.experiment,
        "species",
        self.match.group('species'))
```
### ACQ C-Food
An example of a very specialized C-Food:
```python
class ACQCFood(BMPGExperimentCFood):
    @staticmethod
    def get_re():
        return (BMPGExperimentCFood.exp_folder_pattern
                + r"acqknowledge/.*\.acq")

    def create_identifiables(self):
        self.experiment = create_identifiable_experiment(self.match)
        self.acq = db.Record("ACQRecording")
        self.header = get_acq_header(access(self.crawled_file.path))
        self.acq.add_property("time", self.header["starttime"])
        self.identifiables.append(self.experiment)

    def update_identifiables(self):
        assure_has_parent(self.crawled_file, "ACQRawData")
        assure_object_is_in_list(self.crawled_file.id,
                                 self.experiment,
                                 "ACQRawData")
        self.acq.add_property("duration", self.header["duration"])
```
### Crawler
The Crawler is the unit that coordinates the insertion process. It iterates e.g. through a file structure and over the `C-Foods`. It also collects output and errors.
## Standard Crawler
See: https://doi.org/10.3390/data5020043
```yaml
responsible: M. Musterfrau
description: Videos of cardiomyocytes on glass surrounded by collagen.
results:
- filename: "*.avi"
  description: raw videos of the cell culture
- filename: "*.csv"
  description: velocities for different times
```
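The `filename` entries are glob patterns; how such patterns select files can be sketched with Python's `fnmatch` module (the file names below are made up for illustration):

```python
import fnmatch

# Hypothetical contents of a crawled results directory.
files = ["video1.avi", "video2.avi", "velocities.csv", "notes.txt"]

# Patterns taken from the README entries above.
patterns = ["*.avi", "*.csv"]

# For each pattern, collect the matching files.
matched = {p: fnmatch.filter(files, p) for p in patterns}
```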
## Sources
Source of the fingerprint picture: https://svgsilh.com/image/1298040.html
# Getting started with pycaosdb #
1. Install
2. import
3. enjoy
## Installation
The program can be installed (under Linux) with:
```
# Clone the repository:
git clone 'https://gitlab.com/caosdb/caosdb-advanced-user-tools'
# cd into the directory:
cd caosdb-advanced-user-tools
# Use pip to install the package:
pip install --user .
```
## import
## enjoy
Welcome to caosadvancedtools' documentation!
============================================
Welcome to the advanced Python tools for CaosDB!
This documentation helps you to :doc:`get started<getting_started>`, explains the most important
:doc:`concepts<concepts>` and offers a range of :doc:`tutorials<tutorials>`.
.. toctree::
   :maxdepth: 2
   :caption: Contents:
   :hidden:

   Getting started <getting_started>
   Concepts <concepts>
   tutorials
   Caosdb-Crawler <crawler>
   _apidoc/modules
Indices and tables
==================