Commit 678bfdbf authored by Alexander Schlemmer

cherry-pick-continue

parent 5c335d2f
2 merge requests: !53 Release 0.1, !42 F doc
Pipeline #28753 passed
# Getting started with the CaosDB Crawler #
## Installation ##
### Requirements ###
### How to install ###
#### Linux ####
Make sure that Python (at least version 3.6) and pip are installed, using your system tools and
documentation.
Then open a terminal and continue in the [Generic installation](#generic-installation) section.
#### Windows ####
If a Python distribution is not yet installed, we recommend Anaconda Python, which you can download
for free from [https://www.anaconda.com](https://www.anaconda.com). The "Anaconda Individual Edition" provides most of the
packages you will ever need out of the box. If you prefer, you may also install the leaner
"Miniconda" installer, which allows you to install packages as you need them.
After installation, open an Anaconda prompt from the Windows menu and continue in the [Generic
installation](#generic-installation) section.
#### MacOS ####
If there is no Python 3 installed yet, there are two main ways to
obtain it: Either get the binary package from
[python.org](https://www.python.org/downloads/) or, for advanced
users, install via [Homebrew](https://brew.sh/). After installation
from python.org, it is recommended to also update the TLS certificates
for Python (this requires administrator rights for your user):
```sh
# Replace this with your Python version number:
cd /Applications/Python\ 3.9/
# This needs administrator rights:
sudo ./Install\ Certificates.command
```
After these steps, you may continue with the [Generic
installation](#generic-installation).
#### Generic installation ####
---
Obtain the sources from GitLab and install from there (`git` must be installed for
this option):
```sh
git clone https://gitlab.com/caosdb/caosdb-crawler
cd caosdb-crawler
pip3 install --user .
```
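The Python requirement stated above (at least version 3.6) can be checked before installing. The following is a minimal sketch; the helper name `python_is_supported` is hypothetical and not part of the crawler.

```python
import sys

# Minimum version taken from the requirement stated in this README.
MINIMUM = (3, 6)


def python_is_supported(version_info=None):
    """Return True if the given (or the running) interpreter is new enough."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) >= MINIMUM


if __name__ == "__main__":
    if python_is_supported():
        print("Python version is sufficient.")
    else:
        print("Python 3.6 or newer is required.")
```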
## Configuration ##
## Try it out ##
## Run Unit Tests ##
## Documentation ##
Build documentation in `src/doc` with `make html`.
### Requirements ###
- `sphinx`
- `sphinx-autoapi`
- `recommonmark`
### Troubleshooting ###
CFood-Definition
================
The crawler specification is called CFood-definition. It is stored inside a yaml file or, more precisely, inside one or two yaml documents within a single yaml file.
The specification consists of three separate parts:
#. Metadata and macro definitions
#. Custom converter registrations
#. The converter tree specification
In the simplest case, there is just one yaml file with a single document that includes at least
the converter tree specification (see :ref:`example 1<example_1>`). The custom converter part may also be included in
this single document (for historical reasons, see :ref:`example 2<example_2>`), but it is recommended to put it in a separate
document together with the metadata and :doc:`macro<macros>` definitions (see :ref:`below<example_4>`).
If metadata and macro definitions are provided, there **must** be a second document preceding the
converter tree specification, which includes these definitions.
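The one-document and two-document layouts described above can be told apart with a few lines of Python. This is only a sketch under simplifying assumptions: it splits the text at lines that consist solely of ``---`` and lists unindented keys, instead of using a real yaml parser. The function names ``split_documents`` and ``top_level_keys`` are hypothetical and not part of the crawler.

```python
def split_documents(text):
    """Split yaml text into documents at lines consisting only of '---'."""
    documents, current = [], []
    for line in text.splitlines():
        if line.strip() == "---":
            # Close the current document if it has any content.
            if any(part.strip() for part in current):
                documents.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        documents.append("\n".join(current))
    return documents


def top_level_keys(document):
    """Return the unindented mapping keys of one yaml document."""
    keys = []
    for line in document.splitlines():
        if line and not line[0].isspace() and not line.startswith("#") and ":" in line:
            keys.append(line.split(":", 1)[0])
    return keys


CFOOD = """\
---
metadata:
  name: Datascience CFood
Converters:
  CustomConverter_1:
    package: mypackage.converters
    converter: CustomConverter1
---
extroot:
  type: Directory
  match: ^extroot$
"""

documents = split_documents(CFOOD)
layout = [top_level_keys(document) for document in documents]
print(layout)
```

Here the first document carries the metadata and converter registrations, and the second one carries the converter tree.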
Examples
++++++++
A single document with a converter tree specification:
.. _example_1:
.. code-block:: yaml

   extroot:
     type: Directory
     match: ^extroot$
     subtree:
       DataAnalysis:
         type: Directory
         match: DataAnalysis
         # (...)

A single document with a converter tree specification, but also including a custom converters section:
.. _example_2:
.. code-block:: yaml

   Converters:
     CustomConverter_1:
       package: mypackage.converters
       converter: CustomConverter1
     CustomConverter_2:
       package: mypackage.converters
       converter: CustomConverter2

   extroot:
     type: Directory
     match: ^extroot$
     subtree:
       DataAnalysis:
         type: Directory
         match: DataAnalysis
         # (...)

A yaml multi-document, defining metadata and some macros in the first document and declaring
two custom converters in the second document (**not recommended**, see the recommended version :ref:`below<example_4>`). Please note that two separate yaml documents can be defined using the ``---`` syntax:
.. _example_3:
.. code-block:: yaml

   ---
   metadata:
     name: Datascience CFood
     description: CFood for data from the local data science work group
     macros:
     - !defmacro
       name: SimulationDatasetFile
       params:
         match: null
         recordtype: null
         nodename: null
       definition:
         # (...)
   ---
   Converters:
     CustomConverter_1:
       package: mypackage.converters
       converter: CustomConverter1
     CustomConverter_2:
       package: mypackage.converters
       converter: CustomConverter2

   extroot:
     type: Directory
     match: ^extroot$
     subtree:
       DataAnalysis:
         type: Directory
         match: DataAnalysis
         # (...)

The **recommended way** of defining metadata, custom converters, macros and the main cfood specification is shown in the following code example:
.. _example_4:
.. code-block:: yaml

   ---
   metadata:
     name: Datascience CFood
     description: CFood for data from the local data science work group
     macros:
     - !defmacro
       name: SimulationDatasetFile
       params:
         match: null
         recordtype: null
         nodename: null
       definition:
         # (...)
   Converters:
     CustomConverter_1:
       package: mypackage.converters
       converter: CustomConverter1
     CustomConverter_2:
       package: mypackage.converters
       converter: CustomConverter2
   ---
   extroot:
     type: Directory
     match: ^extroot$
     subtree:
       DataAnalysis:
         type: Directory
         match: DataAnalysis
         # (...)

Concepts
))))))))
Structure Elements
++++++++++++++++++
This hierarchical structure is assumed to consist of a tree of
StructureElements. The tree is created on the fly by so-called Converters, which
are defined in a yaml file. The tree of StructureElements is a model
of the existing data. For example, a tree of Python file objects
(StructureElements) could represent a file tree that exists on some file server.
Relevant sources in:
src/structure_elements.py
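As an illustration of the idea, the following sketch builds a small tree of elements from a nested dictionary that stands in for a file tree. The class and function names are hypothetical and much simpler than the classes in ``src/structure_elements.py``; nested dictionaries are assumed to represent directories and all other values files.

```python
class StructureElement:
    """Hypothetical, minimal stand-in for a node in the tree."""

    def __init__(self, name):
        self.name = name
        self.children = []


def build_tree(name, content):
    """Create an element for `name`; nested dicts become child elements."""
    element = StructureElement(name)
    if isinstance(content, dict):
        for child_name, child_content in content.items():
            element.children.append(build_tree(child_name, child_content))
    return element


def count_elements(element):
    """Count the element itself plus all of its descendants."""
    return 1 + sum(count_elements(child) for child in element.children)


# A file tree modelled as nested dictionaries (files map to None).
file_tree = {"DataAnalysis": {"README.md": None, "results.csv": None}}
root = build_tree("extroot", file_tree)
print(count_elements(root))
```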
Converters
++++++++++
Converters treat StructureElements and thereby create the StructureElements that
are the children of the treated StructureElement. Converters therefore create
the tree described above. The definition of a Converter also states which
Converters shall be used to treat the generated child-StructureElements. The
definition is therefore a tree itself. (Question: Should there be global Converters
that are always checked when treating a StructureElement? Should Converters be
associated with generated child-StructureElements? Currently, all children are
created and checked against all Converters. It could be that one would like to
check file-StructureElements against one set of Converters and
directory-StructureElements against another.)
Each StructureElement in the tree has a set of data values, i.e. a dictionary of
key-value pairs.
Some of those values are set due to the kind of StructureElement. For example,
a file could have the file name as such a key-value pair: 'filename': <sth>.
Converters may define additional functions that create further values. For
example, a regular expression could be used to get a date from a file name.
A converter is defined via a yaml file or part of it. The definition states
what kind of StructureElement it treats (typically one).
It also defines how children of the current StructureElement are
created and what Converters shall be used to treat those.
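The interplay of type restriction, regular expression match and generated values can be sketched as follows. This is not the crawler's implementation; the class ``SketchConverter``, its ``treat`` method and the example file name are assumptions made only for illustration.

```python
import re


class SketchConverter:
    """Hypothetical converter: treats one element type and matches its name."""

    def __init__(self, name, element_type, match):
        self.name = name
        self.element_type = element_type
        self.pattern = re.compile(match)

    def treat(self, element_type, element_name):
        """Return the data values for a matching element, otherwise None."""
        if element_type != self.element_type:
            return None
        matched = self.pattern.match(element_name)
        if matched is None:
            return None
        # Value that exists because of the kind of element ...
        values = {"filename": element_name}
        # ... plus further values taken from named match groups.
        values.update(matched.groupdict())
        return values


converter = SketchConverter(
    "dated-file", "File", r"(?P<date>\d{4}-\d{2}-\d{2})_.*\.csv$"
)
values = converter.treat("File", "2021-03-14_measurement.csv")
print(values)
```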
The yaml definition looks like the following:
TODO: outdated, see cfood-schema.yml
.. code-block:: yaml

   converter-name:
     type: <StructureElement Type>
     match: ".*"
     records:
       Experiment1:
         parents:
         - Experiment
         - Blablabla
         date: $DATUM
         (...)
       Experiment2:
         parents:
         - Experiment
     subtree:
       (...)

   records:
     Measurement:  # automatically becomes a value in the valueStore
       run_number: 25
     Experiment1:
       Measurement: +Measurement  # element in list (list is cleared before run)
                    *Measurement  # multi property (properties are removed before run)
                    Measurement   # overwrite

The UPDATE stage checks, for example, whether lists are equal (which may be defined as
all elements being present, but not necessarily in the right order).
We may not need this, because the crawler is deterministic anyway.

The converter-name is a description of what it represents (e.g.
'experiment-folder') and is used as identifier.

The type restricts what kind of StructureElements are treated.

The match is by default a regular expression that is matched against the
name of StructureElements. Discussion: StructureElements might not have a
name (e.g. a dict); should a name be created artificially if necessary
(e.g. "root-dict")? It might make sense to allow keywords like "always" and
other kinds of checks. For example, a dictionary could be checked against a
json-schema definition.

recordtypes is a list of definitions that define the semantic structure
(see details below).

valuegenerators provide additional functionality that creates
data values in addition to the ones given by default via the
StructureElement. This can be, for example, a match group of a regular
expression applied to the filename.
It should be possible to access the values of parent nodes. For example,
the name of a parent node could be accessed with $converter-name.name.
Discussion: This can introduce conflicts if the key <converter-name>
already exists. An alternative would be to mark those lookups, e.g.
$$converter-name.name (2x$).

childrengenerators denotes how StructureElements shall be created that are
children of the current one.

subtree contains a list of Converter definitions that look like the one
described here.

Those keywords should be allowed but not required, i.e. if no
valuegenerators shall be defined, the keyword may be omitted.

Relevant sources in:
src/converters.py
Identifiables
+++++++++++++
Relevant sources in:
src/identifiable_adapters.py
The Crawler
+++++++++++
The crawler can be considered the main program doing the synchronization in basically two steps:
#. Based on a yaml-specification, scan the file system (or other sources) and create a set of CaosDB Entities that are supposed to be inserted or updated in a CaosDB instance.
#. Compare the current state of the CaosDB instance with the set of CaosDB Entities created in step 1, taking into account the :ref:`registered identifiables<Identifiables>`. Insert or update entities accordingly.
Relevant sources in:
src/crawl.py
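The second step can be pictured as a comparison keyed on the identifying properties of each record. The sketch below works on plain dictionaries and is only an assumption about the general idea; the function ``synchronize`` and the property names are hypothetical and do not reflect the code in ``src/crawl.py``.

```python
def synchronize(crawled, existing, identifying_keys):
    """Split crawled records into those to insert and those to update."""

    def identity(record):
        # The identifying properties decide whether two records are "the same".
        return tuple(record[key] for key in identifying_keys)

    known = {identity(record): record for record in existing}
    to_insert, to_update = [], []
    for record in crawled:
        match = known.get(identity(record))
        if match is None:
            to_insert.append(record)   # not yet in the instance
        elif match != record:
            to_update.append(record)   # present, but with different values
    return to_insert, to_update


existing = [
    {"name": "exp1", "date": "2021-01-01", "notes": "old"},
    {"name": "exp2", "date": "2021-02-01", "notes": "same"},
]
crawled = [
    {"name": "exp1", "date": "2021-01-01", "notes": "new"},
    {"name": "exp2", "date": "2021-02-01", "notes": "same"},
    {"name": "exp3", "date": "2021-03-01", "notes": "first"},
]
to_insert, to_update = synchronize(crawled, existing, ("name", "date"))
print(to_insert, to_update)
```

Records that are identical to the stored state appear in neither list, so nothing is written for them.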
Special Cases
=============
Variable Precedence
+++++++++++++++++++
Let's assume the following situation:

.. code-block:: yaml

   description:
     type: DictTextElement
     match_value: (?P<description>.*)
     match_name: description

Making use of the ``$description`` variable could refer to two different variables created here:

1. The structure element path.
2. The value of the matched expression.

The matched expression takes precedence over the structure element path and shadows it.
If you want to be able to use the structure element path, make sure to give the variables
unique names, like:
.. code-block:: yaml

   description_text_block:
     type: DictTextElement
     match_value: (?P<description>.*)
     match_name: description

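The shadowing can be illustrated with a plain dictionary in which the match groups are written after the entry for the node name. This is a sketch of the precedence rule only; the function ``collect_variables`` and the example path are hypothetical.

```python
def collect_variables(node_name, element_path, match_groups):
    """Store the element path under the node name, then add match groups."""
    variables = {node_name: element_path}
    # Match groups are applied last, so an equally named group shadows the path.
    variables.update(match_groups)
    return variables


groups = {"description": "A short text"}

# Node name and match group share the name "description": the path is lost.
shadowed = collect_variables("description", "/data/description", groups)

# A unique node name keeps both values accessible.
distinct = collect_variables("description_text_block", "/data/description", groups)

print(shadowed)
print(distinct)
```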
Scopes
========
Example:
.. code-block:: yaml

   DicomFile:
     type: SimpleDicomFile
     match: (?P<filename>.*)\.dicom
     records:
       DicomRecord:
         name: $filename
     subtree:  # header of dicom file
       PatientID:
         type: DicomHeaderElement
         match_name: PatientName
         match_value: (?P<patient>.*)
         records:
           Patient:
             name: $patient
             dicom_name: $filename  # $filename is in same scope!
   ExperimentFile:
     type: MarkdownFile
     match: ^readme.md$
     records:
       Experiment:
         dicom_name: $filename  # does NOT work, because $filename is out of scope!

# can variables be used within regexp?
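The scoping behaviour of the example above resembles nested lookup tables: a node sees its own variables and those of its ancestors, but not those of sibling branches. The sketch below models this with ``collections.ChainMap`` from the Python standard library; it is an analogy under that assumption, not the crawler's variable store, and the values are made up.

```python
from collections import ChainMap

# Scope of the root of the converter tree.
root_scope = ChainMap()

# DicomFile defines $filename; PatientID is a child and defines $patient.
dicom_scope = root_scope.new_child({"filename": "scan01"})
patient_scope = dicom_scope.new_child({"patient": "Doe"})

# ExperimentFile is a sibling of DicomFile and starts from the root scope.
experiment_scope = root_scope.new_child({})

print(patient_scope["filename"])       # found in the parent scope
print("filename" in experiment_scope)  # not visible in the sibling branch
```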
File Objects
============
......@@ -53,6 +53,7 @@ extensions = [
'sphinx.ext.autosectionlabel',
'sphinx.ext.intersphinx',
'sphinx.ext.napoleon', # For Google style docstrings
"recommonmark", # For markdown files.
"sphinx_rtd_theme",
]
......@@ -61,7 +62,7 @@ templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
-source_suffix = ['.rst']
+source_suffix = ['.rst', '.md']
# The master toctree document.
master_doc = 'index'
......@@ -71,7 +72,7 @@ master_doc = 'index'
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
-language = None
+language = "en"
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
......@@ -99,7 +100,7 @@ html_theme = "sphinx_rtd_theme"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
-html_static_path = ['_static']
+html_static_path = []  # ['_static']
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
......
Converters
))))))))))
Standard Converters
+++++++++++++++++++
Directory Converter
===================
Simple File Converter
=====================
Markdown File Converter
=======================
Dict Converter
==============
Typical Subtree converters
--------------------------
DictBooleanElementConverter
DictFloatElementConverter
DictTextElementConverter
DictIntegerElementConverter
DictListElementConverter
DictDictElementConverter
YAMLFileConverter
=================
A specialized Dict Converter for yaml files: Yaml files are opened and the contents are
converted into dictionaries that can be further converted using the typical subtree converters
of dict converter.
**WARNING**: Currently unfinished implementation.
JSONFileConverter
=================
TextElementConverter
TableConverter
=================
A generic converter (abstract) for files containing tables.
Currently, there are two specialized implementations for xlsx-files and csv-files.
All table converters generate a subtree that can be converted with DictDictElementConverters:
For each row in the table a DictDictElement (structure element) is generated. The key of the
element is the row number. The value of the element is a dict containing the mapping of
column names to values of the respective cell.
Example:
.. code-block:: yaml

   subtree:
     TABLE:
       type: CSVTableConverter
       match: ^test_table.csv$
       records:
         (...)  # Records edited for the whole table file
       subtree:
         ROW:
           type: DictDictElement
           match_name: .*
           match_value: .*
           records:
             (...)  # Records edited for each row
           subtree:
             COLUMN:
               type: DictFloatElement
               match_name: measurement  # Name of the column in the table file
               match_value: (?P<column_value>.*)
               records:
                 (...)  # Records edited for each cell

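The row structure described above (row number as key, a dict of column names to cell values as value) can be reproduced with the ``csv`` module of the Python standard library. This is a sketch of the resulting data shape only; the function name ``table_to_row_elements`` is hypothetical, counting rows from 0 is an assumption, and cell values are left as strings.

```python
import csv
import io


def table_to_row_elements(csv_text):
    """Map each row number to a dict of column name -> cell value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row_number: dict(row) for row_number, row in enumerate(reader)}


rows = table_to_row_elements("measurement,unit\n1.5,V\n2.0,V\n")
print(rows)
```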
XLSXTableConverter
==================
CSVTableConverter
=================
Custom Converters
+++++++++++++++++
Crawler 2.0 Documentation
=========================
Introduction
------------
.. toctree::
   :maxdepth: 2
   :caption: Contents:
   :hidden:

   Getting started<README_SETUP>
   Concepts<concepts>
   Converters<converters>
   CFoods (Crawler Definitions)<cfood>
   Macros<macros>
   Tutorials<tutorials/index>
   API documentation<_apidoc/modules>

This is the documentation for the crawler (previously known as crawler 2.0) for CaosDB, ``caosdb-crawler``.
The crawler is the main data integration tool for CaosDB.
Its task is to automatically synchronize data found on file systems or in other
......@@ -15,291 +30,15 @@ The hierarchical sturcture can be for example a file tree. However it can be
also something different like the contents of a json file or a file tree with
json files.
This documentation helps you to :doc:`get started<README_SETUP>`, explains the most important
:doc:`concepts<concepts>` and offers a range of :doc:`tutorials<tutorials/index>`.
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
Macros
------
Macros greatly simplify the writing of complex :doc:`CFoods<cfood>`. Consider the following common example:
.. _example_files:
.. code-block:: yaml

   ExperimentalData:
     type: Directory
     match: ExperimentalData
     subtree:
       README:
         type: SimpleFile
         match: ^README.md$
         records:
           ReadmeFile:
             parents:
             - MarkdownFile
             role: File
             path: $README
             file: $README

This example just inserts a file called ``README.md`` contained in the folder ``ExperimentalData/`` into CaosDB, assigns the parent (RecordType) ``MarkdownFile`` and allows this entity to be referenced later within the cfood. Because file objects are created in the cfood specification using the ``records`` section with the special role ``File``, defining and using many files can become cumbersome and make the cfood file difficult to read.
The same example using cfood macros could be defined as follows:
.. _example_files_2:
.. code-block:: yaml

   ---
   metadata:
     macros:
     - !defmacro
       name: MarkdownFile
       params:
         name: null
         filename: null
       definition:
         ${name}_filename:
           type: SimpleFile
           match: $filename
           records:
             $name:
               parents:
               - MarkdownFile
               role: File
               path: ${name}_filename
               file: ${name}_filename
   ---
   ExperimentalData:
     type: Directory
     match: ExperimentalData
     subtree: !macro
       MarkdownFile:
       - name: README
         filename: ^README.md$

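The placeholder syntax used in the macro definition (``$name`` and ``${name}``) matches that of ``string.Template`` from the Python standard library, which makes it easy to picture what an expansion produces. The sketch below assumes that a macro call simply substitutes its parameters into the definition text; whether the crawler expands macros exactly this way is not stated here, and the shortened definition is for illustration only.

```python
from string import Template

# Shortened macro definition using the placeholders from the example above.
DEFINITION = """\
${name}_filename:
  type: SimpleFile
  match: $filename
  records:
    $name:
      parents:
      - MarkdownFile
      role: File
"""

# Parameters as given in the macro call of the example.
expanded = Template(DEFINITION).substitute(name="README", filename="^README.md$")
print(expanded)
```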
Complex Example
===============
.. _example_1:
.. code-block:: yaml

   macros:
   - !defmacro
     name: SimulationDatasetFile
     params:
       match: null
       recordtype: null
       nodename: null
     definition:
       $nodename:
         match: $match
         type: SimpleFile
         records:
           File:
             parents:
             - $recordtype
             role: File
             path: $$$nodename
             file: $$$nodename
           Simulation:
             $recordtype: +$File

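The triple dollar sign in ``$$$nodename`` can be read as an escaped ``$`` followed by the macro parameter ``$nodename``, so that the expanded text again contains a crawler variable reference. The sketch below shows this reading with ``string.Template`` from the Python standard library; that the crawler uses the same escaping rules is an assumption, and the parameter values are made up.

```python
from string import Template

parameters = {"nodename": "Dataset", "recordtype": "DatasetFile"}

# "$$" is an escaped dollar sign, "$nodename" is replaced by the parameter.
path_line = Template("path: $$$nodename").substitute(parameters)

# safe_substitute leaves unknown placeholders such as $File untouched.
reference_line = Template("$recordtype: +$File").safe_substitute(parameters)

print(path_line)
print(reference_line)
```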
Tutorials
+++++++++