Commit 42ce7d33 authored by Henrik tom Wörden, committed by Quazgar

DOC: usage of crawler
# CaosDB Crawler
The [CaosDB crawler](https://gitlab.com/caosdb/caosdb-advanced-user-tools/blob/master/src/caosadvancedtools/crawler.py) is a tool for automated insertion and updates of entities in CaosDB.
## Introduction
In simple terms, it is a program that scans a directory structure, identifies files
that shall be treated, and generates corresponding Entities in CaosDB, possibly filling in metadata.
During this process the crawler can also open files and derive content from them, for example reading
CSV tables and processing their individual rows.
![](crawler_fingerprint.svg)
As shown in the figure, the general principle of the crawler framework is the following:
- The crawler walks through the file structure and matches file names using regular expressions.
- Based on the matched files' fingerprints, so-called `Identifiables` are created.
- CaosDB is queried for the `Identifiables`:
  - If an `Identifiable` is found, it may be updated by the crawler.
  - If an `Identifiable` does not yet exist, a new one is inserted.
That is, the `Identifiable` (or fingerprint) allows the crawler to decide automatically
whether to insert a new Record or update an existing one. This logic of
the crawler is specified in C-Foods (pun intended! :-)). These are Python
classes that are
loaded by `crawl.py` and allow for customized crawling and indexing code.
More details on the different components of the CaosDB Crawler can
be found under [Concepts](#concepts) below.
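The file-name matching step can be sketched with Python's `re` module. The pattern below is written in the style of a C-Food's `get_re()`; the paths and the pattern itself are made-up illustrations, not part of any shipped C-Food:

```python
import re

# Pattern in the style of a C-Food's get_re(); 'species' and 'date' are
# named groups that can later feed the fingerprint of an Identifiable.
# (Paths and pattern are hypothetical illustrations.)
pattern = re.compile(
    r".*/(?P<species>[^/]+)/(?P<date>\d{4}-\d{2}-\d{2})/README\.md")

paths = [
    "/TestData/mouse/2020-04-19/README.md",
    "/TestData/mouse/2020-04-19/notes.txt",
]

matches = {}
for path in paths:
    m = pattern.fullmatch(path)
    if m:
        matches[path] = m.groupdict()

# Only the README.md matches; its captured groups identify the experiment.
```

Files that do not match any C-Food's pattern are simply skipped by the crawler.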
If you are happy with our suggestion of a standard crawler, feel free to use it.
The standard crawler lives in this git repository maintained by Henrik tom Wörden:
https://gitlab.com/henrik_indiscale/scifolder
## Usage
Typically, the crawler can be invoked in two ways: via the web interface and
directly as a Python script.
In both cases, if the crawler has a problem with some file (e.g. columns in a table (tsv, xls, ...) are named incorrectly),
the problem should be indicated by a warning that is returned. You can fix the
problem and run the crawler again. This does not cause any problems, since the
crawler can recognize what has already been processed (see the description of fingerprints in the [Introduction](#Introduction)).
However, pay **attention** when you change a property that is used for the
fingerprint: the crawler will not be able to identify the previous version
with the changed one, since the fingerprint is different. This often means that entities
in the database need to be changed or removed. As a rule of thumb, you should be
pretty sure that properties that are used as fingerprints will not change after
the crawler has run for the first time. This prevents complications.
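The effect can be illustrated with a toy fingerprint; the tuples below are a made-up representation for illustration, not the crawler's actual internal data structure:

```python
# Toy fingerprints: parent name plus the identifying properties.
# Changing a fingerprint property yields a different key, so the crawler
# would insert a new Record instead of updating the existing one.
old_fingerprint = ("Experiment", (("Exp-No", 9), ("date", "2020-04-19")))
new_fingerprint = ("Experiment", (("Exp-No", 9), ("date", "2020-04-20")))

same_record = old_fingerprint == new_fingerprint  # False: treated as distinct
```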
### Invocation from the Web Interface
If enabled, the crawler can be called using a menu entry in the web interface.
This opens a form where the path of the directory that shall be crawled
needs to be given. After the execution, information about what was done and
which problems might exist is printed in the web interface.
Note that some changes might be pending authorization (if indicated in the
messages).
### Invocation as Python Script
The crawler can be executed directly via a Python script (usually called `crawl.py`).
The script prints the progress and reports potential problems.
The exact behavior depends on your setup. However, you can have a look at the example in
the [tests](https://gitlab.com/caosdb/caosdb-advanced-user-tools/-/blob/master/integrationtests/full_test/crawl.py).
Call `python3 crawl.py --help` to see what parameters can be provided. Typically,
an invocation looks like:
```
python3 crawl.py "/TestData/"
```
In this case `/TestData/` identifies the path to be crawled
**within the CaosDB file system**. You can browse the CaosDB file system by
opening the WebUI of your CaosDB instance and clicking on "File System".
In the backend, `crawl.py` starts a CQL query `FIND Files WHICH ARE STORED AT /TestData/**` and crawls these files according to your customized `C-Foods`.
Crawling may consist of two distinct steps:
1. Insertion of files (use function `loadFiles`)
2. The actual crawling (use program `crawl.py`)
However, the first step may be included in `crawl.py`. Otherwise, you can only crawl
files that were previously inserted by `loadFiles`.
#### loadFiles
After installation of the `caosadvancedtools` you can simply
call the function `loadFiles` contained in the package:
```
python3 -m caosadvancedtools.loadFiles /opt/caosdb/mnt/extroot
```
`/opt/caosdb/mnt/extroot` is the root of the file system to be crawled as seen
by the CaosDB server (the actual path may vary; this is the path used in the LinkAhead
distribution of CaosDB). In this case, the root file system as seen from within
the CaosDB docker process is used.
The crawler has the CaosDB Python client as a dependency, so make sure to install [caosdb-pylib](manuals/pylib/Setting-up-caosdb-pylib).
## Extending the Crawlers
Work in Progress
### Identifiable
As described above, the main feature of an `identifiable` is that it has sufficient properties to identify an existing Record in CaosDB that should be updated by the Crawler instead of inserting a new one. Obviously, this is necessary to allow running the Crawler twice on the same file structure without duplicating the data in CaosDB.
An `identifiable` is a Python Record object with the features needed to identify the correct Record in CaosDB. This object is used to create a query in order to determine whether the Record exists. If it does not, the Python object is used to insert the Record. Thus, after this step it is certain that a Record with the features of the `identifiable` exists in CaosDB (newly created or existing from before).
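How such a query could be derived from an `identifiable` can be sketched as follows; `identifiable_to_query` is a hypothetical helper written for illustration, not part of `caosadvancedtools`, and the real query construction may differ:

```python
def identifiable_to_query(parent, properties):
    """Build a CQL-style FIND query from an identifiable's parent and
    identifying properties (illustration only)."""
    conditions = " AND ".join(
        f"'{name}'='{value}'" for name, value in sorted(properties.items()))
    return f"FIND Record {parent} WITH {conditions}"

query = identifiable_to_query(
    "Experiment", {"date": "2020-04-19", "Exp-No": 9})
```

If the query returns exactly one Record, its ID is reused; if it returns none, the Record is inserted.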
The Crawler also provides a local cache for `identifiables`: once the Crawler knows the ID of the CaosDB Record of an `identifiable`, the ID is stored and CaosDB does not need to be queried again.
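A minimal sketch of the caching idea (not the actual `caosadvancedtools` implementation): the identifying properties become a hashable key, so each `identifiable` is resolved against CaosDB at most once:

```python
class IdentifiableCache:
    """Toy cache mapping fingerprints of identifiables to Record IDs."""

    def __init__(self):
        self._ids = {}

    @staticmethod
    def fingerprint(parent, properties):
        # Hashable, order-independent key, e.g.
        # ("Experiment", (("Exp-No", 9), ("date", "2020-04-19")))
        return (parent, tuple(sorted(properties.items())))

    def get(self, parent, properties):
        # Returns the cached ID, or None if CaosDB must still be queried.
        return self._ids.get(self.fingerprint(parent, properties))

    def store(self, parent, properties, record_id):
        self._ids[self.fingerprint(parent, properties)] = record_id


cache = IdentifiableCache()
cache.store("Experiment", {"date": "2020-04-19", "Exp-No": 9}, 1234)
# Property order does not matter for the lookup:
cached = cache.get("Experiment", {"Exp-No": 9, "date": "2020-04-19"})
```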
### C-Food
A `C-Food` is a logical unit in the insertion process. It should be independent of other data and basically defines two steps:
1. Define what is needed to do the operation, i.e. create `identifiables`.
2. Update the `identifiables` according to the data.
An example: An experiment might be uniquely identified by the date when it was conducted and a number, and the `identifiable` might look like the following:
```
<Record>
  <Parent name="Experiment"/>
  <Property name="date">2020-04-19</Property>
  <Property name="Exp-No">9</Property>
</Record>
```
Thus, after the first step an `Experiment` with those properties exists. In the second step, further properties might be added to this Record, e.g. references to data files that were recorded in that experiment or to the person who did the experiment.
A C-Food may also involve multiple `identifiables`, e.g. when the `Experiment` shall reference the `Project` that it belongs to.
### Example C-Food
```python
import caosdb as db

from .cfood import AbstractCFood, assure_has_property


class ExampleCFood(AbstractCFood):
    @staticmethod
    def get_re():
        return (r".*/(?P<species>[^/]+)/"
                r"(?P<date>\d{4}-\d{2}-\d{2})/README.md")

    def create_identifiables(self):
        self.experiment = db.Record()
        self.experiment.add_parent(name="Experiment")
        self.experiment.add_property(
            name="date",
            value=self.match.group('date'))
        self.identifiables.append(self.experiment)
```
```python
def update_identifiables(self):
    assure_has_property(
        self.experiment,
        "species",
        self.match.group('species'))
```
### ACQ C-Food
An example of a very specialized C-Food:
```python
class ACQCFood(BMPGExperimentCFood):
    @staticmethod
    def get_re():
        return (BMPGExperimentCFood.exp_folder_pattern
                + r"acqknowledge/.*\.acq")

    def create_identifiables(self):
        self.experiment = create_identifiable_experiment(self.match)
        self.acq = db.Record("ACQRecording")
        self.header = get_acq_header(access(self.crawled_file.path))
        self.acq.add_property("time", self.header["starttime"])
        self.identifiables.append(self.experiment)

    def update_identifiables(self):
        assure_has_parent(self.crawled_file, "ACQRawData")
        assure_object_is_in_list(self.crawled_file.id,
                                 self.experiment,
                                 "ACQRawData")
        self.acq.add_property("duration", self.header["duration"])
```
### Crawler
The Crawler is the unit that coordinates the insertion process. It iterates e.g. through a file structure and over the `C-Foods`. It also collects output and errors.
## Standard Crawler
See: https://doi.org/10.3390/data5020043
```yaml
responsible: M. Musterfrau
description: Videos of cardiomyocytes on glass surrounded by collagen.
results:
- filename: "*.avi"
  description: raw videos of the cell culture
- filename: "*.csv"
  description: velocities for different times
```
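The `filename` entries are glob patterns; how such patterns select files can be sketched with Python's `fnmatch` module (the file names below are made up for illustration):

```python
import fnmatch

# Hypothetical contents of a crawled results directory.
files = ["video1.avi", "video2.avi", "velocities.csv", "notes.txt"]

# Patterns taken from the README entries above.
patterns = ["*.avi", "*.csv"]

# For each pattern, collect the matching files.
matched = {p: fnmatch.filter(files, p) for p in patterns}
```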
## Sources
Source of the fingerprint picture: https://svgsilh.com/image/1298040.html
# Getting started with pycaosdb #
1. Install
2. import
3. enjoy
## Installation
The program can be installed (under Linux) with:
```
# Clone the repository:
git clone 'https://gitlab.com/caosdb/caosdb-advanced-user-tools'
# cd into the directory:
cd caosdb-advanced-user-tools
# Use pip to install the package:
pip install --user .
```
## import
## enjoy
Welcome to caosadvancedtools' documentation!
============================================
Welcome to the advanced Python tools for CaosDB!
This documentation helps you to :doc:`get started<getting_started>`, explains the most important
:doc:`concepts<concepts>` and offers a range of :doc:`tutorials<tutorials>`.
.. toctree::
   :maxdepth: 2
   :caption: Contents:
   :hidden:

   Getting started <getting_started>
   Concepts <concepts>
   tutorials
   Caosdb-Crawler <crawler>
   _apidoc/modules
Indices and tables
==================