The [CaosDB crawler](https://gitlab.com/caosdb/caosdb-advanced-user-tools/blob/master/src/caosadvancedtools/crawler.py) is a tool for the automated insertion or update of entities in CaosDB.
## Introduction
In simple terms, the crawler is a program that scans a directory
structure, identifies files that will be treated, and generates
corresponding Entities in CaosDB, possibly filling meta data. During
this process the crawler can also open files and derive content from
within, for example reading CSV tables and processing individual rows
of these tables.

As shown in the figure, the general principle of the crawler framework is the following:
- The crawler walks through the file structure and matches file names using regular expressions.
- Based on the matched files, finger prints, so-called `Identifiables`, are created.
- CaosDB is queried for Records that match the `Identifiables`:
  - If an `Identifiable` is found, the corresponding Record may be updated by the crawler.
  - If an `Identifiable` does not yet exist, a new Record will be inserted.

I.e. the `Identifiables` (or finger prints) allow the crawler to automatically decide
whether to insert a new Record or update an existing one. This logic of
...
...
In this case `/TestData/` identifies the path to be crawled
**within the CaosDB file system**. You can browse the CaosDB file system by
opening the WebUI of your CaosDB instance and clicking on "File System".
In the backend, `crawl.py` starts a CQL query `FIND File WHICH IS
STORED AT /TestData/**` and crawls the resulting files according to
your customized `C-Foods`.
Crawling may consist of two distinct steps:
1. Insertion of files (use function `loadFiles`)
...
...
distribution of CaosDB). In this case the root file system as seen from within
the CaosDB docker process is used.
The crawler depends on the CaosDB Python client, so make sure to
install [caosdb-pylib](manuals/pylib/Setting-up-caosdb-pylib).
In most use cases the crawler needs to be tailored to specific needs. This
section explains how this can be done.
Work in Progress
### Identifiable
As described above, the main feature of an `identifiable` is that it has sufficient properties to identify an existing Record in CaosDB that should be updated by the Crawler instead of inserting a new one. Obviously, this is necessary to allow running the Crawler twice on the same file structure without duplicating the data in CaosDB.
An `identifiable` is a Python Record object with the features needed to identify the correct Record in CaosDB. This object is used to create a query in order to determine whether the Record exists. If it does not, the Python object is used to insert the Record. Thus, after this step it is certain that a Record with the features of the `identifiable` exists in CaosDB (newly created or existing from before).
The Crawler also provides a local cache for `identifiables`, i.e. once the Crawler knows the ID of the CaosDB Record of an `identifiable`, the ID is stored and CaosDB does not need to be queried again.
The behavior and rules of the crawler are defined in logical units called CFoods.
In order to extend the crawler you need to extend an existing CFood or create a new one.
### C-Food - Introduction
A `C-Food` is a Python class that inherits from the base class [AbstractCFood](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood). It should be independent of other data and basically define two steps:
1. define what is needed to do the operation, i.e. create `identifiables`
2. update the `identifiables` according to the data
An example: An experiment might be uniquely identified by the date when it was conducted and a number. The `identifiable` might then look like the following:
```
<Record>
<Parent name="Experiment"/>
...
...
</Record>
```
Thus, after the first step, an `Experiment` with those properties will exist in CaosDB.
In the second step, further properties might be added to this Record, e.g. references to data files that were recorded in that experiment or to the person who did the experiment.
Let's look at the following example:
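The following is a minimal sketch of such a CFood; the class name, the file pattern, and the use of the file-matching base class `AbstractFileCFood` are assumptions for illustration:
```python
import caosdb as db
from caosadvancedtools.cfood import AbstractFileCFood, assure_has_property


class ExperimentCFood(AbstractFileCFood):
    # The regular expression is illustrative; it matches file names such as
    # "2020-01-01_mouse.dat" and captures the date and the species.
    @staticmethod
    def get_re():
        return r".*/(?P<date>\d{4}-\d{2}-\d{2})_(?P<species>\w+)\.dat$"

    def create_identifiables(self):
        # The Experiment is identified solely by its date (see below).
        self.experiment = db.Record()
        self.experiment.add_parent(name="Experiment")
        self.experiment.add_property("date", self.match.group("date"))
        # identifiables have to be added to the self.identifiables list
        self.identifiables.append(self.experiment)

    def update_identifiables(self):
        # Add a property that is not part of the identifiable.
        assure_has_property(self.experiment,
                            "species",
                            self.match.group("species"))
```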
[`create_identifiables`](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood.create_identifiables) defines the `identifiables` that are needed and [`update_identifiables`](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood.update_identifiables) applies additional changes.
Here, an `Experiment` Record is identified using solely the date. This implies that there must NOT
exist two `Experiment` Records with the same date. If this might occur, an additional property needs
to be added to the identifiable. The `identifiables` have to be added to the `self.identifiables`
list.
After the correct Record has been identified (or created if none existed), an additional property is
added that describes the species.

A C-Food may also involve multiple `identifiables`, e.g. when the `Experiment` shall reference the `Project` that it belongs to.
Your CFood needs to be passed to the crawler instance in the `crawl.py` file that you use to run the crawler.
CFoods have some additional features in order to cope with complex scenarios.
For example, what if multiple files are needed together to create some Record?
Multiple data files recorded in an experiment could be one example. CFoods may
define the [`collect_information`](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.AbstractCFood.collect_information) function. In this function additional information can be collected by accessing files
or querying the database. One particular use case is to add file paths to the `attached_filenames` property.
By default, all files that are located at those paths are also treated by this CFood.
This also means that the crawler does not list those files as "untreated".
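As a sketch, assuming `attached_filenames` is a plain list of paths within the CaosDB file system (the paths here are illustrative), a `collect_information` implementation might look like this:
```python
def collect_information(self):
    # Gather additional information, e.g. by accessing files or querying
    # the database. Here, further data files are attached so that this
    # CFood also treats them and the crawler does not list them as
    # "untreated".
    self.attached_filenames.extend([
        "/TestData/calibration_a.dat",
        "/TestData/calibration_b.dat",
    ])
```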
One special case is the existence of multiple, very similar files. Imagine that
you want to treat a range of calibration images with a CFood. You can write a
regular expression that matches all of the files, but it might be hard to match one particular file. In this
case, the CFood can use the `CMeal` mix-in. This will assure that the first match will create a CFood and all
following ones are attached to the same instance. For further information,
please consult the [API documentation](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.CMeal).
As the crawler may run in different environments, the way in which files can be accessed may differ.
This can be defined using the [File Guide](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood.FileGuide).
In the `crawl.py` file, you should set this appropriately:
```python
>>> from caosadvancedtools.cfood import fileguide
>>> import os
>>> fileguide.access = lambda path: "/main/data/" + path
```
This prefixes all paths that are used in CaosDB with "/main/data/". In CFoods,
files can then be accessed using the fileguide as follows:
```python
with open(fileguide.access("/some/path")):
    # do stuff
    pass
```
### Changing data in CaosDB
As described above, a Record matching the identifiable will be inserted if no such
Record existed before. This is typically unproblematic. However, what if existing
Records need to be modified? Many manipulations have the potential of overwriting
changes made in CaosDB. Thus, unless the data being crawled is the single source of
truth for the information in CaosDB (and changes to the respective data in CaosDB
should thus not be possible), changes have to be made with some consideration.
Use the `assure_has_xyz` functions defined in the [cfood module](_apidoc/caosadvancedtools.html#caosadvancedtools.cfood) to add a given property only if it does not exist yet, and use the `assure_xyz_is` functions to
force the value of a property (see the remarks above).
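For illustration, a sketch contrasting the two kinds of functions; `assure_property_is` is assumed here as the `assure_xyz_is` variant for plain properties, and the Record and values are illustrative:
```python
from caosadvancedtools.cfood import assure_has_property, assure_property_is

# Inside a CFood's update_identifiables:
# add the property only if the Record does not have it yet
assure_has_property(self.experiment, "species", "mouse",
                    to_be_updated=self.to_be_updated)

# Force the property to this value, overwriting an existing value
# (only appropriate if the crawled data is the single source of truth).
assure_property_is(self.experiment, "species", "mouse",
                   to_be_updated=self.to_be_updated)
```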
To further assure that changes are correct, the crawler comes with an authorization
mechanism. When running the crawler with the `crawl` function, a security level
can be provided:
```python
>>> from caosadvancedtools.guard import INSERT
>>> c = Crawler(...,                # construction of the crawler instance is elided here
...             interactive=False)  # the crawler runs without asking intermediate questions
>>> c.crawl(security_level=INSERT)
```
This assures that every manipulation of data in CaosDB that is done via the functions
provided by the [`guard`](_apidoc/caosadvancedtools.html#caosadvancedtools.guard) class
is checked against the provided security level:
- "RETRIEVE": allows only to retrieve data from CaosDB. No manipulation is allowed
- "INSERT": allows only to insert new entities and the manipulation of those newly inserted ones
- "UPDATE": allows all manipulations
This implies that all data manipulation done by the crawler should use the functions that are
checked by the guard. When writing a CFood, you should stick to the above-mentioned `assure_has_xyz` and
`assure_xyz_is` functions, which use the respective data manipulation functions.
If you provide the `to_be_updated` member variable of CFoods to those `assure...` functions,
the crawler provides another convenient feature: when an update is prevented due
to the security level, the update is saved and can subsequently be authorized.
If the crawler runs on the CaosDB server, it will try to send a mail that allows
authorizing the change. If it runs as a local script, it will notify you that there
are unauthorized changes and provide a code with which the crawler can be started to
authorize them.

### Crawler
The Crawler is the unit that coordinates the insertion process. It iterates, e.g., through a file structure and over the `C-Foods`. It also collects output and errors.
## Real World Example
A crawler implementation exists that can crawl a file structure that adheres to the rules
defined in this [Data publication](https://doi.org/10.3390/data5020043).
The project is of moderate size and shows how a set of CFoods can be defined to
deal with a complex file structure.
You can find detailed information on how the files need to be structured [here](https://gitlab.com/salexan/check-sfs/-/blob/f-software/filesystem_structure.md), as well as the source code of the CFoods.