The [CaosDB crawler](https://gitlab.com/caosdb/caosdb-advanced-user-tools/blob/master/src/caosadvancedtools/crawler.py) is a tool for the automated insertion or update of entities in CaosDB.
## Introduction
In simple terms, the crawler is a program that scans a directory
structure, identifies files that will be treated, and generates
corresponding Entities in CaosDB, possibly filling meta data. During
this process the crawler can also open files and derive content from
within, for example reading CSV tables and processing individual rows
of these tables.

As shown in the figure, the general principle of the crawler framework is the following:
- The crawler walks through the file structure and matches file names using regular expressions.
- Based on the matched files, finger prints, so-called `Identifiables`, are created.
- CaosDB is queried for Records that match the `Identifiables`:
  - If an `Identifiable` is found, the corresponding Record may be updated by the crawler.
  - If an `Identifiable` does not yet exist, a new Record will be inserted.

I.e. the `Identifiables` (or finger prints) allow the crawler to automatically decide
whether to insert a new Record or update an existing one. This logic of
...
...
In this case `/TestData/` identifies the path to be crawled
**within the CaosDB file system**. You can browse the CaosDB file system by
opening the WebUI of your CaosDB instance and clicking on "File System".
In the backend, `crawl.py` starts a CQL query `FIND File WHICH IS
STORED AT /TestData/**` and crawls the resulting files according to
your customized `C-Foods`.
Crawling may consist of two distinct steps:
1. Insertion of files (use function `loadFiles`)
...
...
distribution of CaosDB). In this case the root file system as seen from within
the CaosDB docker process is used.
The crawler depends on the CaosDB Python client, so make sure to
install [caosdb-pylib](manuals/pylib/Setting-up-caosdb-pylib).
In most use cases the crawler needs to be tailored to specific needs. This
section explains how this can be done.
Work in Progress
### Identifiable
As described above, the main feature of an `identifiable` is that it has sufficient properties to identify an existing Record in CaosDB that should be updated by the Crawler instead of inserting a new one. Obviously, this is necessary to allow running the Crawler twice on the same file structure without duplicating the data in CaosDB.
An `identifiable` is a Python Record object with the features needed to identify the correct Record in CaosDB. This object is used to create a query in order to determine whether the Record exists. If it does not, the Python object is used to insert the Record. Thus, after this step it is certain that a Record with the features of the `identifiable` exists in CaosDB (newly created or existing from before).
The Crawler also provides a local cache for `identifiables`, i.e. once the Crawler knows the ID of the CaosDB Record of an `identifiable`, the ID is stored and CaosDB does not need to be queried again.
The behavior and rules of the crawler are defined in logical units called CFoods.
In order to extend the crawler you need to extend an existing CFood or create a new one.
### C-Food - Introduction
A `C-Food` is a Python class that inherits from the base class [AbstractCFood](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood). It should be independent of other data and basically define two steps:
1. define what is needed to do the operation, i.e. create `identifiables`
2. update the `identifiables` according to the data
An example: An experiment might be uniquely identified by the date when it was conducted and a number. The `identifiable` might then look like the following:
```
<Record>
<Parent name="Experiment"/>
...
...
</Record>
```
Thus, after the first step, an `Experiment` with those properties will exist in CaosDB.
In the second step, further properties might be added to this Record, e.g. references to data files that were recorded in that experiment or to the person who did the experiment.
Let's look at the following example:
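The following is a minimal sketch of such a CFood; the class name, the file pattern, and the use of the file-matching base class `AbstractFileCFood` are assumptions for illustration:
```python
import caosdb as db
from caosadvancedtools.cfood import AbstractFileCFood, assure_has_property


class ExperimentCFood(AbstractFileCFood):
    # The regular expression is illustrative; it matches file names such as
    # "2020-01-01_mouse.dat" and captures the date and the species.
    @staticmethod
    def get_re():
        return r".*/(?P<date>\d{4}-\d{2}-\d{2})_(?P<species>\w+)\.dat$"

    def create_identifiables(self):
        # The Experiment is identified solely by its date (see below).
        self.experiment = db.Record()
        self.experiment.add_parent(name="Experiment")
        self.experiment.add_property("date", self.match.group("date"))
        # identifiables have to be added to the self.identifiables list
        self.identifiables.append(self.experiment)

    def update_identifiables(self):
        # Add a property that is not part of the identifiable.
        assure_has_property(self.experiment,
                            "species",
                            self.match.group("species"))
```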
[`create_identifiables`](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood.create_identifiables) defines the `identifiables` that are needed and [`update_identifiables`](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood.update_identifiables) applies additional changes.
Here, an `Experiment` Record is identified using solely the date. This implies that there must NOT
exist two `Experiment` Records with the same date. If this might occur, an additional property needs
to be added to the identifiable. The `identifiables` have to be added to the `self.identifiables`
list.
After the correct Record has been identified (or created if none existed), an additional property is
added that describes the species.

A C-Food may also involve multiple `identifiables`, e.g. when the `Experiment` shall reference the `Project` that it belongs to.
Your CFood needs to be passed to the crawler instance in the `crawl.py` file that you use to run the crawler.
CFoods have some additional features in order to cope with complex scenarios.
For example, what if multiple files are needed together to create some Record?
Multiple data files recorded in an experiment could be one example. CFoods may
define the [`collect_information`](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood.collect_information) function. In this function additional information can be collected by accessing files
or querying the database. One particular use case is to add file paths to the `attached_filenames` property.
By default, all files that are located at those paths are also treated by this CFood.
This also means that the crawler does not list those files as "untreated".
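As a sketch, assuming `attached_filenames` is a plain list of paths within the CaosDB file system (the paths here are illustrative), a `collect_information` implementation might look like this:
```python
def collect_information(self):
    # Gather additional information, e.g. by accessing files or querying
    # the database. Here, further data files are attached so that this
    # CFood also treats them and the crawler does not list them as
    # "untreated".
    self.attached_filenames.extend([
        "/TestData/calibration_a.dat",
        "/TestData/calibration_b.dat",
    ])
```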
One special case is the existence of multiple, very similar files. Imagine that
you want to treat a range of calibration images with a CFood. You can write a
regular expression that matches all of the files, but it might be hard to match one particular file. In this
case, the CFood can use the `CMeal` mix-in. This will assure that the first match will create a CFood and all
following ones are attached to the same instance. For further information,
please consult the [API documentation](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.CMeal).
As the crawler may run in different environments, the way in which files can be accessed may differ.
This can be defined using the [File Guide](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.FileGuide).
In the `crawl.py` file, you should set this appropriately:
```python
>>> from caosadvancedtools.cfood import fileguide
>>> import os
>>> fileguide.access = lambda path: "/main/data/" + path
```
This prefixes all paths that are used in CaosDB with "/main/data/". In CFoods,
files can then be accessed using the fileguide as follows:
```python
with open(fileguide.access("/some/path")):
    # do stuff
    pass
```
### Changing data in CaosDB
As described above, a Record matching the identifiable will be inserted if no such
Record existed before. This is typically unproblematic. However, what if existing
Records need to be modified? Many manipulations have the potential of overwriting
changes made in CaosDB. Thus, unless the data being crawled is the single source of
truth for the information in CaosDB (and changes to the respective data in CaosDB
should thus not be possible), changes have to be made with some consideration.
Use the `assure_has_xyz` functions defined in the [cfood module](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood) to add a given property only if it does not exist yet, and use the `assure_xyz_is` functions to
force the value of a property (see the remarks above).
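For illustration, a sketch contrasting the two kinds of functions; `assure_property_is` is assumed here as the `assure_xyz_is` variant for plain properties, and the Record and values are illustrative:
```python
from caosadvancedtools.cfood import assure_has_property, assure_property_is

# Inside a CFood's update_identifiables:
# add the property only if the Record does not have it yet
assure_has_property(self.experiment, "species", "mouse",
                    to_be_updated=self.to_be_updated)

# Force the property to this value, overwriting an existing value
# (only appropriate if the crawled data is the single source of truth).
assure_property_is(self.experiment, "species", "mouse",
                   to_be_updated=self.to_be_updated)
```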
To further assure that changes are correct, the crawler comes with an authorization
mechanism. When running the crawler with the `crawl` function, a security level
can be provided:
```python
>>> from caosadvancedtools.guard import INSERT
>>> c = Crawler(...,                # construction of the crawler instance is elided here
...             interactive=False)  # the crawler runs without asking intermediate questions
>>> c.crawl(security_level=INSERT)
```
This assures that every manipulation of data in CaosDB that is done via the functions
provided by the [`guard`](_apidoc/caosadvancedtools.html#caosadvancedtools.guard) class
is checked against the provided security level:
- "RETRIEVE": allows only to retrieve data from CaosDB. No manipulation is allowed
- "INSERT": allows only to insert new entities and the manipulation of those newly inserted ones
- "UPDATE": allows all manipulations
This implies that all data manipulation done by the crawler should use the functions that are
checked by the guard. When writing a CFood, you should stick to the above-mentioned `assure_has_xyz` and
`assure_xyz_is` functions, which use the respective data manipulation functions.
If you provide the `to_be_updated` member variable of CFoods to those `assure...` functions,
the crawler provides another convenient feature: when an update is prevented due
to the security level, the update is saved and can subsequently be authorized.
If the crawler runs on the CaosDB server, it will try to send a mail that allows
authorizing the change. If it runs as a local script, it will notify you that there
are unauthorized changes and provide a code with which the crawler can be started to
authorize them.

### Crawler
The Crawler is the unit that coordinates the insertion process. It iterates, e.g., through a file structure and over the `C-Foods`. It also collects output and errors.
## Real World Example
A crawler implementation exists that can crawl a file structure that adheres to the rules
defined in this [Data publication](https://doi.org/10.3390/data5020043).
The project is of moderate size and shows how a set of CFoods can be defined to
deal with a complex file structure.
You can find detailed information on how the files need to be structured [here](https://gitlab.com/salexan/check-sfs/-/blob/f-software/filesystem_structure.md), as well as the source code of the CFoods.