Commit b6c66c1c authored by Henrik tom Wörden, committed by Florian Spreckelsen

DOC: Explain how to extend the crawler

parent 904872c8
Merge request !22: Release 0.3
@@ -36,4 +36,4 @@ RUN cd /git && pip3 install .
WORKDIR /git/integrationtests
CMD /wait-for-it.sh caosdb-server:10443 -t 500 -- ./test.sh
# At least recommonmark 0.6 required.
RUN pip3 install recommonmark sphinx-rtd-theme
@@ -22,6 +22,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- two utility functions when working with files: NameCollector and
get_file_via_download
- Automated documentation builds: `make doc`
- Crawler documentation
### Changed ###
@@ -19,6 +19,8 @@
# -- Project information -----------------------------------------------------
import sphinx_rtd_theme
project = 'caosadvancedtools'
copyright = '2020, IndiScale GmbH'
author = 'Daniel Hornung'
@@ -43,6 +45,7 @@ extensions = [
'sphinx.ext.intersphinx',
'sphinx.ext.napoleon', # For Google style docstrings
"recommonmark", # For markdown files.
'sphinx_rtd_theme'
]
# Add any paths that contain templates here, relative to this directory.
@@ -82,7 +85,8 @@ pygments_style = None
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# CaosDB Crawler
The [CaosDB
crawler](https://gitlab.com/caosdb/caosdb-advanced-user-tools/blob/master/src/caosadvancedtools/crawler.py)
is a tool for the automated insertion or update of entities in CaosDB.
## Introduction
In simple terms, the crawler is a program that scans a directory
structure, identifies files that are to be treated, and generates
corresponding Entities in CaosDB, possibly filling in metadata. During
this process the crawler can also open files and derive content from
them, for example by reading CSV tables and processing individual rows
of these tables.
![](crawler_fingerprint.svg)
As shown in the figure, the general principle of the crawler framework is the following:
- The crawler walks through the file structure and matches file names using regular expressions
- Based on the matched files, finger prints (so-called `Identifiables`) are created
- CaosDB is queried for Records that match the `Identifiables`:
- If an `Identifiable` is found, the corresponding Record may be
updated by the crawler.
- If an `Identifiable` does not yet exist, a new Record will be
inserted.
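For illustration, a finger print of this kind, expressed with caosdb-pylib, might look like the following minimal sketch (the parent and property names are assumptions):

```python
import caosdb as db

# A finger print ("Identifiable"): just enough properties to uniquely
# identify one Record in CaosDB.
identifiable = db.Record()
identifiable.add_parent(name="Experiment")
identifiable.add_property(name="date", value="2020-01-01")

# The crawler queries CaosDB for a Record matching this finger print and
# decides: update the matching Record if it exists, insert a new one otherwise.
```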
I.e. the `Identifiables` (or finger prints) allow the crawler to automatically decide
whether to insert a Record or update an existing one. This logic of
@@ -67,7 +74,9 @@ In this case `/TestData/` identifies the path to be crawled
**within the CaosDB file system**. You can browse the CaosDB file system by
opening the WebUI of your CaosDB instance and clicking on "File System".
In the backend, `crawl.py` starts a CQL query `FIND File WHICH IS
STORED AT /TestData/**` and crawls the resulting files according to
your customized `C-Foods`.
Crawling may consist of two distinct steps:
1. Insertion of files (use function `loadFiles`)
@@ -90,25 +99,26 @@ distribution of CaosDB). In this case the root file system as seen from within
the CaosDB docker process is used.
The crawler depends on the CaosDB python client, so make sure to
install [caosdb-pylib](manuals/pylib/Setting-up-caosdb-pylib).
## Extending the Crawlers
In most use cases the crawler needs to be tailored to specific needs. This
section explains how this can be done.
Work in Progress
The behavior and rules of the crawler are defined in logical units called CFoods.
In order to extend the crawler you need to extend an existing CFood or create a new one.
### C-Food - Introduction
A `C-Food` is a Python class that inherits from the base class [AbstractCFood](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood). It should be independent of other data and basically define two steps:
1. define what is needed to do the operation, i.e. create `identifiables`
2. update the `identifiables` according to the data
As described above, the main feature of an `identifiable` is that it has sufficient properties to identify an existing Record in CaosDB that should be updated by the Crawler instead of inserting a new one. Obviously, this is necessary to allow running the Crawler twice on the same file structure without duplicating the data in CaosDB.
An `identifiable` is a python Record object with the features needed to identify the correct Record in CaosDB. This object is used to create a query in order to determine whether the Record exists. If it does not, the python object is used to insert the Record. Thus, after this step it is certain that a Record with the features of the `identifiable` exists in CaosDB (newly created or from before). The Crawler also provides a local cache for `identifiables`: once the Crawler knows the ID of the CaosDB Record of an `identifiable`, the ID is stored and CaosDB does not need to be queried again.
An example: An experiment might be uniquely identified by the date when it was conducted and a number. The `identifiable` might then look like the following:
```
<Record>
<Parent name="Experiment"/>
<Property name="date">...</Property>
<Property name="number">...</Property>
</Record>
```
Thus, after the first step, an `Experiment` with those properties will exist in CaosDB.
In the second step, further properties might be added to this Record, e.g. references to data files that were recorded in that experiment or to the person that did the experiment.
A C-Food may also involve multiple `identifiables`, e.g. when the `Experiment` shall reference the `Project` that it belongs to.
```python
>>> # Example C-Food
>>> from caosadvancedtools.cfood import AbstractFileCFood, assure_has_property
>>> import caosdb as db
>>>
>>> class ExampleCFood(AbstractFileCFood):
...     @staticmethod
...     def get_re():
...         return (r".*/(?P<species>[^/]+)/"
...                 r"(?P<date>\d{4}-\d{2}-\d{2})/README.md")
...
...     def create_identifiables(self):
...         self.experiment = db.Record()
...         self.experiment.add_parent(name="Experiment")
...         self.experiment.add_property(
...             name="date",
...             value=self.match.group('date'))
...         self.identifiables.append(self.experiment)
...
...     def update_identifiables(self):
...         assure_has_property(
...             self.experiment,
...             "species",
...             self.match.group('species'))
>>> # check whether the definition is valid
>>> cf = ExampleCFood('')
```
Every child of `AbstractFileCFood` (`AbstractFileCFood` is for crawling files; other kinds of data can be crawled as well)
needs to implement the functions `get_re`, `create_identifiables`, and `update_identifiables`.
The function
[`get_re`](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractFileCFood.get_re)
defines which files shall be treated by this CFood. The function needs to return a string
containing a regular expression. Here, the expression matches any "README.md" file that is located
two folder levels below some base path, as in `/any/path/whale/2020-01-01/README.md`. Note that the groups defined in
the regular expression (`date` and `species`) can later be used via `self.match.group('name')`.
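To see this matching in isolation, outside of the crawler, plain Python `re` can be used with the pattern from the example above:

```python
import re

pattern = (r".*/(?P<species>[^/]+)/"
           r"(?P<date>\d{4}-\d{2}-\d{2})/README.md")
match = re.match(pattern, "/any/path/whale/2020-01-01/README.md")
print(match.group('species'), match.group('date'))  # prints: whale 2020-01-01
```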
[`create_identifiables`](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood.create_identifiables)
defines the `identifiables` that are needed and [`update_identifiables`](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood.update_identifiables) applies additional changes.
Here, an `Experiment` Record is identified using solely the date. This implies that there must NOT
exist two `Experiment` Records with the same date. If this might occur, an additional property needs
to be added to the identifiable. The `identifiables` have to be added to the `self.identifiables`
list.
After the correct Record has been identified (or created if none existed) an additional property is
added that describes the species.
Your CFood needs to be passed to the crawler instance in the `crawl.py` file that you
use for crawling, for example like this:
```python
c = FileCrawler(files=files, cfood_types=[ExampleCFood])
```
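A minimal `crawl.py` built around this call might look like the following sketch; the query string is taken from above, while the module name `mycfoods` is a made-up placeholder for wherever your CFoods live:

```python
import caosdb as db
from caosadvancedtools.crawler import FileCrawler
from mycfoods import ExampleCFood  # hypothetical module containing your CFoods

# collect the File entities that shall be crawled
files = db.execute_query("FIND File WHICH IS STORED AT /TestData/**")

c = FileCrawler(files=files, cfood_types=[ExampleCFood])
c.crawl()
```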
### C-Food - Advanced
CFoods have some additional features in order to cope with complex scenarios.
For example, multiple files might be needed together to create some Record,
such as several data files recorded in one experiment. For such cases CFoods may
define the [`collect_information`](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood.collect_information) function. In this function additional information can be collected by accessing files
or querying the database. One particular use case is to add file paths to the `attached_filenames` property.
By default, all files that are located at those paths are also treated by this CFood.
This also means that the crawler does not list those files as "untreated".
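A sketch of what this could look like: `collect_information` and `attached_filenames` are named above, while the folder logic, the file suffix, the attribute name `crawled_path`, and the assumption that `attached_filenames` is a plain list of paths are illustrative only:

```python
import os

from caosadvancedtools.cfood import AbstractFileCFood, fileguide


class ExperimentCFood(AbstractFileCFood):
    # get_re, create_identifiables and update_identifiables as before ...

    def collect_information(self):
        # also treat all .dat files next to the matched file with this CFood
        folder = os.path.dirname(self.crawled_path)
        for name in os.listdir(fileguide.access(folder)):
            if name.endswith(".dat"):
                self.attached_filenames.append(os.path.join(folder, name))
```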
One special case is the existence of multiple, very similar files. Imagine that
you want to treat a range of calibration images with a CFood. You can write a
regular expression that matches all of these files, but it might be hard to match exactly one of them. In this
case, you should use the
[`CMeal`](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.CMeal)
mix-in. This assures that the first match creates a CFood and all
following ones are attached to the same instance. For further information,
please consult the [API documentation](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.CMeal).
As the crawler may run in different environments, the way files can be accessed may differ.
This can be defined using the [File Guide](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.FileGuide).
In the `crawl.py` file, you should set this appropriately:
```python
>>> from caosadvancedtools.cfood import fileguide
>>> import os
>>> fileguide.access = lambda path: "/main/data/" + path
```
This prefixes all paths that are used in CaosDB with "/main/data/". In CFoods,
files can then be accessed using the fileguide as follows:
```python
with open(fileguide.access("/some/path")):
    # do stuff
    pass
```
### Changing data in CaosDB
As described above, a Record matching the identifiable will be inserted if no such
Record existed before. This is typically unproblematic. However, what if existing
Records need to be modified? Many manipulations have the potential of overwriting
changes made in CaosDB. Thus, unless the data being crawled is the single source of
truth for the information in CaosDB (and changes to the respective data in CaosDB
should thus not be possible), changes have to be made with some consideration.
Use the `assure_has_xyz` functions defined in the [cfood module](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood) to add a given property only if it does not exist yet, and use the `assure_xyz_is` functions to
force the value of a property (see the remarks above).
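For example, inside `update_identifiables` (a sketch: `assure_property_is` is one member of the `assure_xyz_is` family, and the property names and values are made up):

```python
from caosadvancedtools.cfood import assure_has_property, assure_property_is

def update_identifiables(self):
    # adds "species" only if the Record does not have this property yet
    assure_has_property(self.experiment, "species", "whale")
    # forces the value, potentially overwriting manual changes in CaosDB;
    # passing to_be_updated lets the crawler collect and authorize the change
    assure_property_is(self.experiment, "status", "analyzed",
                       to_be_updated=self.to_be_updated)
```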
To further assure that changes are correct, the crawler comes with an authorization
mechanism. When running the crawler with the `crawl` function, a security level
can be given.
```python
>>> from caosadvancedtools.crawler import FileCrawler
>>> from caosadvancedtools.guard import RETRIEVE, INSERT, UPDATE
>>> files = [] # put files to be crawled in this list
>>> c = FileCrawler(
... files=files,
... cfood_types=[ExampleCFood],
... interactive=False) # the crawler runs without asking intermediate questions
>>> c.crawl(security_level=INSERT)
```
This assures that every manipulation of data in CaosDB that is done via the functions
provided by the [`guard`](_apidoc/caosadvancedtools.html#caosadvancedtools.guard) class
is checked against the provided security level:
- "RETRIEVE": only allows data to be retrieved from CaosDB; no manipulation is allowed
- "INSERT": only allows new entities to be inserted and allows manipulation of those newly inserted ones
- "UPDATE": allows all manipulations
This implies that all data manipulation by the crawler should use the functions that are
checked by the guard. When writing a CFood you should stick to the above-mentioned `assure_has_xyz` and
`assure_xyz_is` functions, which use the respective data manipulation functions.
If you provide the `to_be_updated` member variable of CFoods to those `assure...` functions,
the crawler provides another convenient feature: when an update is prevented due
to the security level, the update is saved and can subsequently be authorized.
If the crawler runs on the CaosDB server, it will try to send an email that allows you to
authorize the change. If it runs as a local script, it will notify you that there
are unauthorized changes and provide a code with which the crawler can be started again to
authorize them.
## Real World Example
A crawler implementation exists that can crawl a file structure that adheres to the rules
defined in this [Data publication](https://doi.org/10.3390/data5020043).
The project is of moderate size and shows how a set of CFoods can be defined to
deal with a complex file structure.
You can find detailed information on how the files need to be structured [here](https://gitlab.com/salexan/check-sfs/-/blob/f-software/filesystem_structure.md) and the source code of the CFoods
[here](https://gitlab.com/henrik_indiscale/scifolder).
## Standard Crawler
See: https://doi.org/10.3390/data5020043
```yaml
responsible: M. Musterfrau
description: Videos of cardiomyocytes on glass surrounded by collagen.
results:
- filename: *.avi
description: raw videos of the cell culture
- filename: *.csv
description: velocities for different times
```
## Sources
@@ -3,6 +3,7 @@ Welcome to caosadvancedtools' documentation!
Welcome to the advanced Python tools for CaosDB!
This documentation helps you to :doc:`get started<getting_started>`, explains the most important
:doc:`concepts<concepts>` and offers a range of :doc:`tutorials<tutorials>`.
......