Commit 2d48821c authored by Florian Spreckelsen's avatar Florian Spreckelsen

Merge branch 'release-v0.7.1' into 'main'

DOC WIP: Tutorial: Single structured file

See merge request !162
parents b13c5d49 21b8e5e9
Tag: v0.7.1
Pipeline #49033 passed
Showing 640 additions and 112 deletions
@@ -10,7 +10,7 @@ RUN apt-get update && \
     python3-sphinx \
     tox \
     -y
-RUN pip3 install recommonmark sphinx-rtd-theme
+RUN pip3 install pylint recommonmark sphinx-rtd-theme
 COPY .docker/wait-for-it.sh /wait-for-it.sh
 ARG PYLIB
 ADD https://gitlab.indiscale.com/api/v4/projects/97/repository/commits/${PYLIB} \
...
@@ -279,7 +279,7 @@ cert:
   - cd .docker
   - CAOSHOSTNAME=caosdb-server ./cert.sh

-style:
+code-style:
   tags: [docker]
   stage: style
   image: $CI_REGISTRY_IMAGE
@@ -290,6 +290,17 @@ style:
   - autopep8 -r --diff --exit-code .
   allow_failure: true

+pylint:
+  tags: [docker]
+  stage: style
+  image: $CI_REGISTRY_IMAGE
+  needs:
+    - job: build-testenv
+      optional: true
+  allow_failure: true
+  script:
+    - pylint --unsafe-load-any-extension=y -d all -e E,F src/caoscrawler
+
 # Build the sphinx documentation and make it ready for deployment by Gitlab Pages
 # Special job for serving a static website. See https://docs.gitlab.com/ee/ci/yaml/README.html#pages
 # Based on: https://gitlab.indiscale.com/caosdb/src/caosdb-pylib/-/ci/editor?branch_name=main
...
@@ -5,6 +5,15 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.7.1] - 2024-03-21 ##
+
+### Fixed ###
+
+* `crawler_main` no longer needs the deprecated `debug=True` to write a
+  provenance file when the `provenance_file` parameter is provided.
+* [indiscale#129](https://gitlab.indiscale.com/caosdb/src/caosdb-crawler/-/issues/129)
+  missing packaging dependency.
+
 ## [0.7.0] - 2024-03-04 ##

 ### Added ###
...
@@ -17,6 +17,6 @@ authors:
     given-names: Alexander
     orcid: https://orcid.org/0000-0003-4124-9649
 title: CaosDB - Crawler
-version: 0.7.0
+version: 0.7.1
 doi: 10.3390/data9020024
-date-released: 2023-03-04
+date-released: 2023-03-21
\ No newline at end of file
@@ -24,24 +24,22 @@
 """
 an integration test module that runs a test against a (close to) real world example
 """
-from caosdb.utils.register_tests import clear_database, set_test_key
-import logging
 import json
+import logging
 import os
-import pytest
-import sys
-import caosdb as db
-from caosdb.cached import cache_clear
+import sys
+import linkahead as db
+import pytest
+from caosadvancedtools.loadFiles import loadpath
+from caosadvancedtools.models.parser import parse_model_from_json_schema, parse_model_from_yaml
+from linkahead.cached import cache_clear
+from linkahead.utils.register_tests import clear_database, set_test_key
 from caoscrawler.crawl import Crawler, crawler_main
 from caoscrawler.identifiable_adapters import CaosDBIdentifiableAdapter
-from caoscrawler.structure_elements import Directory
-from caosadvancedtools.models.parser import parse_model_from_json_schema, parse_model_from_yaml
-from caosadvancedtools.loadFiles import loadpath
 from caoscrawler.scanner import load_definition, scan_structure_elements, create_converter_registry
+from caoscrawler.structure_elements import Directory

 set_test_key("10b128cf8a1372f30aa3697466bb55e76974e0c16a599bb44ace88f19c8f61e2")
@@ -91,15 +89,6 @@ def usemodel():
     dataset_inherits.sync_data_model(noquestion=True)

-@pytest.fixture
-def clear_database():
-    # TODO(fspreck): Remove once the corresponding advancedtools function can
-    # be used.
-    ents = db.execute_query("FIND ENTITY WITH ID>99")
-    if ents:
-        ents.delete()

 def create_identifiable_adapter():
     ident = CaosDBIdentifiableAdapter()
     ident.load_from_yaml_definition(os.path.join(DATADIR, "identifiables.yml"))
...
 [metadata]
 name = caoscrawler
-version = 0.7.0
+version = 0.7.1
 author = Alexander Schlemmer
 author_email = alexander.schlemmer@ds.mpg.de
 description = A new crawler for caosdb
@@ -19,14 +19,15 @@ package_dir =
 packages = find:
 python_requires = >=3.7
 install_requires =
-    importlib-resources
     caosadvancedtools >= 0.7.0
+    importlib-resources
+    importlib_metadata;python_version<'3.8'
     linkahead > 0.13.2
-    yaml-header-tools >= 0.2.1
-    pyyaml
     odfpy #make optional
+    packaging
     pandas
-    importlib_metadata;python_version<'3.8'
+    pyyaml
+    yaml-header-tools >= 0.2.1

 [options.packages.find]
 where = src
...
@@ -389,8 +389,8 @@ class Converter(object, metaclass=ABCMeta):
         Extract information from the structure element and store them as values in the
         general store.

-        Parameters:
-        ------------
+        Parameters
+        ----------
         values: GeneralStore
             The GeneralStore to store values in.
@@ -409,8 +409,8 @@ class Converter(object, metaclass=ABCMeta):
         Check if transformers are defined using the "transform" keyword.
         Then apply the transformers to the variables defined in GeneralStore "values".

-        Parameters:
-        ------------
+        Parameters
+        ----------
         values: GeneralStore
             The GeneralStore to store values in.
@@ -765,6 +765,12 @@ schema_resource:
 class DictElementConverter(Converter):
+    """
+    **Operates on:** :py:class:`caoscrawler.structure_elements.DictElement`
+
+    **Generates:** :py:class:`caoscrawler.structure_elements.StructureElement`
+    """

     def create_children(self, generalStore: GeneralStore, element: StructureElement):
         # TODO: See comment on types and inheritance
         if not isinstance(element, DictElement):
@@ -1154,6 +1160,12 @@ class TableConverter(Converter):
 class XLSXTableConverter(TableConverter):
+    """
+    **Operates on:** :py:class:`caoscrawler.structure_elements.File`
+
+    **Generates:** :py:class:`caoscrawler.structure_elements.DictElement`
+    """

     def get_options(self):
         return self._get_options([
             ("sheet_name", str),
...
@@ -1504,7 +1504,7 @@ def crawler_main(crawled_directory_path: str,
                  dry_run: bool = False,
                  prefix: str = "",
                  securityMode: SecurityMode = SecurityMode.UPDATE,
-                 unique_names=True,
+                 unique_names: bool = True,
                  restricted_path: Optional[list[str]] = None,
                  remove_prefix: Optional[str] = None,
                  add_prefix: Optional[str] = None,
@@ -1520,9 +1520,9 @@ def crawler_main(crawled_directory_path: str,
     identifiables_definition_file : str
         filename of an identifiable definition yaml file
     debug : bool
-        DEPRECATED, whether or not to run in debug mode
+        DEPRECATED, use a provenance file instead.
     provenance_file : str
-        provenance information will be stored in a file with given filename
+        Provenance information will be stored in a file with given filename
     dry_run : bool
         do not commit any changes to the server
     prefix : str
@@ -1562,7 +1562,7 @@ def crawler_main(crawled_directory_path: str,
     _fix_file_paths(crawled_data, add_prefix, remove_prefix)
     _check_record_types(crawled_data)

-    if provenance_file is not None and debug:
+    if provenance_file is not None:
         crawler.save_debug_data(debug_tree=debug_tree, filename=provenance_file)

     if identifiables_definition_file is not None:
@@ -1599,6 +1599,7 @@ def crawler_main(crawled_directory_path: str,
         logger.debug(err)

         if "SHARED_DIR" in os.environ:
+            # pylint: disable=E0601
             domain = get_config_setting("public_host_url")
             logger.error("Unexpected Error: Please tell your administrator about this and provide the"
                          f" following path.\n{domain}/Shared/" + debuglog_public)
...
@@ -25,7 +25,9 @@
 #
 """
-This is the scanner, the original "_crawl" function from crawl.py.
+This is the scanner.
+
+This is where the ``_crawl(...)`` function from ``crawl.py`` was formerly located.
 This is just the functionality that extracts data from the file system.
 """
@@ -257,31 +259,31 @@ def scanner(items: list[StructureElement],
             restricted_path: Optional[list[str]] = None,
             crawled_data: Optional[list[db.Record]] = None,
             debug_tree: Optional[DebugTree] = None,
-            registered_transformer_functions: Optional[dict] = None):
+            registered_transformer_functions: Optional[dict] = None) -> list[db.Record]:
     """Crawl a list of StructureElements and apply any matching converters.

-    Formerly known as "_crawl".
+    Formerly known as ``_crawl(...)``.

     Parameters
     ----------
-    items:
+    items: list[StructureElement]
        structure_elements (e.g. files and folders on one level on the hierarchy)
-    converters:
+    converters: list[Converter]
        locally defined converters for treating structure elements. A locally
        defined converter could be one that is only valid for a specific subtree
        of the originally crawled StructureElement structure.
-    general_store, record_store:
+    general_store, record_store: GeneralStore, RecordStore, optional
        This recursion of the crawl function should only operate on copies of
        the global stores of the Crawler object.
-    restricted_path : list of strings, optional
+    restricted_path : list[str], optional
        traverse the data tree only along the given path. For example, when a
-       directory contains files a, b and c and b is given as restricted_path, a
-       and c will be ignroed by the crawler. When the end of the given path is
+       directory contains files a, b and c, and b is given as ``restricted_path``, a
+       and c will be ignored by the crawler. When the end of the given path is
        reached, traverse the full tree as normal. The first element of the list
-       provided by restricted_path should be the name of the StructureElement
+       provided by ``restricted_path`` should be the name of the StructureElement
        at this level, i.e. denoting the respective element in the items
        argument.
@@ -292,7 +294,8 @@ def scanner(items: list[StructureElement],
        Each function is a dictionary:

-       - The key is the name of the function to be looked up in the dictionary of registered transformer functions.
+       - The key is the name of the function to be looked up in the dictionary of registered
+         transformer functions.
       - The value is the function which needs to be of the form:

            def func(in_value: Any, in_parameters: dict) -> Any:
                pass
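As an aside, a transformer function following the form described in the docstring above could look like this. This is a sketch only: the function name and its behavior (parsing a decimal-comma number) are made up for illustration, and only the `(in_value, in_parameters)` signature comes from the documentation.

```python
from typing import Any

def parse_decimal_comma(in_value: Any, in_parameters: dict) -> Any:
    """Hypothetical transformer: turn a German-style decimal string like '3,14' into a float."""
    try:
        return float(str(in_value).replace(",", "."))
    except ValueError:
        # Leave values that are not numbers untouched.
        return in_value

# A cfood would reference such a function by its registered name; here we
# call it directly to show the in_value/in_parameters contract.
value = parse_decimal_comma("3,14", {})
```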
@@ -457,7 +460,8 @@ def scan_structure_elements(items: Union[list[StructureElement], StructureElement],
                             converter_registry: dict,
                             restricted_path: Optional[list[str]] = None,
                             debug_tree: Optional[DebugTree] = None,
-                            registered_transformer_functions: Optional[dict] = None):
+                            registered_transformer_functions: Optional[dict] = None) -> (
+                                list[db.Record]):
     """
     Start point of the crawler recursion.
@@ -471,14 +475,14 @@ def scan_structure_elements(items: Union[list[StructureElement], StructureElement],
     crawler_definition : dict
         A dictionary representing the crawler definition, possibly from a yaml
         file.
-    restricted_path: optional, list of strings
+    restricted_path: list[str], optional
         Traverse the data tree only along the given path. When the end of the
         given path is reached, traverse the full tree as normal. See docstring
         of 'scanner' for more details.

     Returns
     -------
-    crawled_data : list
+    crawled_data : list[db.Record]
         the final list with the target state of Records.
     """
...
@@ -28,9 +28,16 @@ import warnings
 class StructureElement(object):
-    """ base class for elements in the hierarchical data structure """
+    """Base class for elements in the hierarchical data structure.
+
+    Parameters
+    ----------
+    name: str
+        The name of the StructureElement. May be used for pattern matching by CFood rules.
+    """

-    def __init__(self, name):
+    def __init__(self, name: str):
         # Used to store usage information for debugging:
         self.metadata: tDict[str, set[str]] = {
             "usage": set()
@@ -46,6 +53,18 @@ class StructureElement(object):
 class FileSystemStructureElement(StructureElement):
+    """StructureElement representing an element of a file system, like a directory or a simple file.
+
+    Parameters
+    ----------
+    name: str
+        The name of the StructureElement. May be used for pattern matching by CFood rules.
+    path: str
+        The path to the file or directory.
+    """

     def __init__(self, name: str, path: str):
         super().__init__(name)
         self.path = path
@@ -65,6 +84,7 @@ class Directory(FileSystemStructureElement):
 class File(FileSystemStructureElement):
+    """StructureElement representing a file."""
     pass
...
 Concepts
-))))))))
+========

-The CaosDB Crawler can handle any kind of hierarchical data structure. The typical use case is
+The CaosDB Crawler can handle any kind of hierarchical data structure. The typical use case is a
 directory tree that is traversed. We use the following terms/concepts to describe how the CaosDB
 Crawler works.

 Structure Elements
 ++++++++++++++++++

-This hierarchical structure is assumed to be consituted of a tree of
-StructureElements. The tree is created on the fly by so called Converters which
-are defined in a yaml file. The tree of StructureElements is a model
-of the existing data (For example could a tree of Python file objects
-(StructureElements) represent a file tree that exists on some file server).
+The crawled hierarchical structure is represented by a tree of *StructureElements*. This tree is
+generated on the fly by so-called Converters which are defined in a yaml file (usually called
+``cfood.yml``). This generated tree of StructureElements is a model of the existing data. For
+example, a tree of Python *file objects* (StructureElements) could correspond to a file system tree.

 Relevant sources in:
@@ -23,29 +22,28 @@ Relevant sources in:
 Converters
 ++++++++++

-Converters treat StructureElements and thereby create the StructureElement that
-are the children of the treated StructureElement. Converters therefore create
-the above named tree. The definition of a Converter also contains what
-Converters shall be used to treat the generated child-StructureElements. The
-definition is therefore a tree itself.
+Converters treat a StructureElement and during this process create a number of new
+StructureElements: the children of the initially treated StructureElement. Thus, by treatment of
+existing StructureElements, Converters create a tree of StructureElements.
+
+.. image:: img/converter.png
+  :height: 170

 See :std:doc:`converters<converters>` for details.

 Relevant sources in:

 - ``src/converters.py``
 Identifiables
 +++++++++++++

-An Identifiable of a Record is like the fingerprint of a Record.
-The identifiable contains the information that is used by the CaosDB Crawler to identify Records.
-For example, in order to check whether a Record exists in the CaosDB Server, the CaosDB Crawler creates a query
-using the information contained in the Identifiable.
+An *Identifiable* of a Record is like the fingerprint of a Record.
+The Identifiable contains the information that is used by the CaosDB Crawler to identify Records.
+For example, the CaosDB Crawler may create a query using the information contained in the
+Identifiable in order to check whether a Record exists in the CaosDB Server.

 Suppose a certain experiment is done at most once per day; then the identifiable could
 consist of the RecordType "SomeExperiment" (as a parent) and the Property "date" with the respective value.
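The "fingerprint" idea can be sketched as building a query from the RecordType and the identifying properties. This is an illustrative sketch only: the query string format below mimics CaosDB query syntax, and this helper is not part of the crawler's actual API.

```python
def identifiable_query(record_type: str, properties: dict) -> str:
    """Hypothetical helper: build a CaosDB-style query from an identifiable.

    record_type is the parent RecordType; properties maps identifying
    property names to their values.
    """
    conditions = " AND ".join(f"'{name}'='{value}'" for name, value in properties.items())
    return f"FIND RECORD {record_type} WITH {conditions}"

# The SomeExperiment/date example from the text:
query = identifiable_query("SomeExperiment", {"date": "2024-03-21"})
```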
@@ -100,7 +98,9 @@ The Crawler
 +++++++++++

 The crawler can be considered the main program, doing the synchronization in basically two steps:

 #. Based on a yaml specification, scan the file system (or other sources) and create a set of CaosDB Entities that are supposed to be inserted or updated in a CaosDB instance.
 #. Compare the current state of the CaosDB instance with the set of CaosDB Entities created in step 1, taking into account the :ref:`registered identifiables<Identifiables>`. Insert or update entities accordingly.

 Relevant sources in:
...
@@ -33,10 +33,10 @@ copyright = '2024, IndiScale'
 author = 'Alexander Schlemmer'

 # The short X.Y version
-version = '0.7.0'
+version = '0.7.1'
 # The full version, including alpha/beta/rc tags
 # release = '0.5.2-rc2'
-release = '0.7.0'
+release = '0.7.1'

 # -- General configuration ---------------------------------------------------
...
 Converters
 ))))))))))

-Converters treat StructureElements and thereby create the StructureElement that
-are the children of the treated StructureElement. Converters therefore create
-the tree of structure elements. The definition of a Converter also contains what
-Converters shall be used to treat the generated child-StructureElements. The
-definition is therefore a tree itself.
+Converters treat a StructureElement and during this process create a number of new
+StructureElements: the children of the initially treated StructureElement. Thus, by treatment of
+existing StructureElements, Converters create a tree of StructureElements.
+
+.. image:: img/converter.png
+  :height: 170
+
+The ``cfood.yml`` definition also describes which
+Converters shall be used to treat the generated child StructureElements. The
+definition therefore itself also defines a tree.

-Each StructureElement in the tree has a set of data values, i.e a dictionary of
-key value pairs.
-Some of those values are set due to the kind of StructureElement. For example,
-a file could have the file name as such a key value pair: 'filename': <sth>.
+Each StructureElement in the tree has a set of properties, organized as
+key-value pairs.
+Some of those properties are specified by the type of StructureElement. For example,
+a file could have the file name as property: ``'filename': myfile.dat``.

 Converters may define additional functions that create further values. For
-example, a regular expresion could be used to get a date from a file name.
+example, a regular expression could be used to get a date from a file name.
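Named regex groups are the usual way to pull such values out of a name; each named group becomes a value that later rules can reference. A minimal sketch (the pattern and file name here are made up for illustration, not taken from any shipped cfood):

```python
import re

# A match pattern with named groups, as a converter's ``match`` field might use.
pattern = re.compile(r"^(?P<experiment>[a-z]+)_(?P<date>\d{4}-\d{2}-\d{2})\.dat$")

m = pattern.match("conductivity_2024-03-21.dat")
# Each named group becomes a key-value pair available to subsequent rules.
values = m.groupdict() if m else {}
```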
 A converter is defined via a yml file or part of it. The definition states
@@ -20,7 +25,7 @@ what kind of StructureElement it treats (typically one).
 Also, it defines how children of the current StructureElement are
 created and what Converters shall be used to treat those.

-The yaml definition looks like the following:
+The yaml definition may look like this:

 TODO: outdated, see cfood-schema.yml
@@ -53,8 +58,9 @@ to generate records (see :py:meth:`~caoscrawler.converters.Converter.create_records`).

 **records** is a dict of definitions that define the semantic structure
 (see details below).

-Subtree contains a list of Converter defnitions that look like the one
-described here.
+**subtree** makes the yaml recursive: It contains a list of new Converter
+definitions, which work on the StructureElements that are returned by the
+current Converter.

 Transform Functions
 +++++++++++++++++++
@@ -108,6 +114,9 @@ them to the cfood definition (see :doc:`CFood Documentation<cfood>`).

 Standard Converters
 +++++++++++++++++++

+These are the standard converters that exist in a default installation. For writing and applying
+*custom converters*, see :ref:`below <Custom Converters>`.
+
 Directory Converter
 ===================

 The Directory Converter creates StructureElements for each File and Directory
@@ -126,11 +135,14 @@ children elements according to the structure of the header.

 DictElement Converter
 =====================

+DictElement → StructureElement
+
 Creates a child StructureElement for each key in the dictionary.

 Typical Subtree converters
 --------------------------

-The following StructureElement are typically created:
+The following StructureElement types are typically created by the DictElement converter:

 - BooleanElement
 - FloatElement
@@ -155,12 +167,12 @@ behavior can be adjusted with the fields `accept_text`, `accept_int`,

 The following denotes what kind of StructureElements are accepted by default
 (they are defined in `src/caoscrawler/converters.py`):

-- DictBooleanElementConverter: bool, int
-- DictFloatElementConverter: int, float
-- DictTextElementConverter: text, bool, int, float
-- DictIntegerElementConverter: int
-- DictListElementConverter: list
-- DictDictElementConverter: dict
+- BooleanElementConverter: bool, int
+- FloatElementConverter: int, float
+- TextElementConverter: text, bool, int, float
+- IntegerElementConverter: int
+- ListElementConverter: list
+- DictElementConverter: dict

 YAMLFileConverter
 =================
@@ -180,11 +192,13 @@ JSONFileConverter

 TableConverter
 ==============

+Table → DictElement
+
 A generic converter (abstract) for files containing tables.
-Currently, there are two specialized implementations for xlsx-files and csv-files.
+Currently, there are two specialized implementations for XLSX files and CSV files.

-All table converters generate a subtree that can be converted with DictDictElementConverters:
-For each row in the table a DictDictElement (structure element) is generated. The key of the
+All table converters generate a subtree of dicts, which in turn can be converted with DictElementConverters:
+For each row in the table the TableConverter generates a DictElement (structure element). The key of the
 element is the row number. The value of the element is a dict containing the mapping of
 column names to values of the respective cell.
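The row-to-dict structure described above can be sketched in plain Python. This mimics the shape of the TableConverter's output (row number as key, column-name-to-cell mapping as value); it is not the crawler's actual implementation, and the table content is made up.

```python
import csv
import io

# A tiny in-memory CSV table standing in for a crawled file.
table = "measurement,unit\n1.2,mV\n3.4,mV\n"

reader = csv.DictReader(io.StringIO(table))
# One dict per row, keyed by the row number, mapping column names to cells,
# as described in the text above.
rows = {str(i): dict(row) for i, row in enumerate(reader)}
```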
...@@ -193,21 +207,21 @@ Example: ...@@ -193,21 +207,21 @@ Example:
.. code-block:: yaml .. code-block:: yaml
subtree: subtree:
TABLE: TABLE: # Any name for the table as a whole
type: CSVTableConverter type: CSVTableConverter
match: ^test_table.csv$ match: ^test_table.csv$
records: records:
(...) # Records edited for the whole table file (...) # Records edited for the whole table file
subtree: subtree:
ROW: ROW: # Any name for a data row in the table
type: DictDictElement type: DictElement
match_name: .* match_name: .*
match_value: .* match_value: .*
records: records:
(...) # Records edited for each row (...) # Records edited for each row
subtree: subtree:
COLUMN: COLUMN: # Any name for a specific type of column in the table
type: DictFloatElement type: FloatElement
match_name: measurement # Name of the column in the table file match_name: measurement # Name of the column in the table file
match_value: (?P<column_value).*) match_value: (?P<column_value).*)
records: records:
...@@ -217,9 +231,13 @@ Example: ...@@ -217,9 +231,13 @@ Example:
XLSXTableConverter XLSXTableConverter
================== ==================
XLSX File → DictElement
CSVTableConverter CSVTableConverter
================= =================
CSV File → DictElement
Further converters Further converters
++++++++++++++++++ ++++++++++++++++++
...@@ -322,11 +340,15 @@ file in a text property, the name of which can be configured with the ...@@ -322,11 +340,15 @@ file in a text property, the name of which can be configured with the
Custom Converters Custom Converters
+++++++++++++++++ +++++++++++++++++
As mentioned before, it is possible to create custom converters.
These custom converters can be used to integrate arbitrary data extraction and ETL capabilities
into the LinkAhead crawler and make these extensions available to any yaml specification.
Tell the crawler about a custom converter
=========================================
To use a custom converter, it must be defined in the ``Converters`` section of the CFood yaml file.
The basic syntax for adding a custom converter to a definition file is:
.. code-block:: yaml

   Converters:
     <NameOfConverter>:
       package: <python>.<module>.<name>
       converter: <PythonClassName>
The ``Converters`` section can be put into either the first or the second document of the cfood yaml file.
It can also be part of a single-document yaml cfood file. Please refer to :doc:`the cfood documentation<cfood>` for more details.
Details:
- **<python>.<module>.<name>**: The name of the module where the converter class resides.
- **<PythonClassName>**: Within this specified module there must be a class inheriting from base class :py:class:`caoscrawler.converters.Converter`.
Implementing a custom converter
===============================
Converters inherit from the :py:class:`~caoscrawler.converters.Converter` class.
The following methods are abstract and need to be overwritten by your custom converter to make it work:
- :py:meth:`~caoscrawler.converters.Converter.create_children`:
  Return a list of child StructureElement objects.
- :py:meth:`~caoscrawler.converters.Converter.match`
- :py:meth:`~caoscrawler.converters.Converter.typecheck`
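For illustration, here is a minimal sketch of the shape such a converter can take. Note that ``Converter`` and ``StructureElement`` below are simplified stand-ins for the real classes from :py:mod:`caoscrawler` (which provide considerably more machinery), and the ``KeyValueFileConverter`` name and its key-value file format are invented for this example:

```python
import re


class StructureElement:
    """Simplified stand-in for caoscrawler's StructureElement."""

    def __init__(self, name, value=None):
        self.name = name
        self.value = value


class Converter:
    """Simplified stand-in for caoscrawler.converters.Converter."""

    def __init__(self, definition):
        self.definition = definition


class KeyValueFileConverter(Converter):
    """Hypothetical converter treating each 'key = value' line as a child."""

    def match(self, element):
        # Return a dict of matched variables on success, None otherwise.
        m = re.match(self.definition.get("match", ".*"), element.name)
        return m.groupdict() if m is not None else None

    def create_children(self, values, element):
        # Return one child StructureElement per key-value line.
        children = []
        for line in element.value.splitlines():
            key, sep, val = line.partition("=")
            if sep:
                children.append(StructureElement(key.strip(), val.strip()))
        return children


converter = KeyValueFileConverter({"match": r".*\.cfg$"})
element = StructureElement("run1.cfg", "voltage = 3.2\ncurrent = 0.7")
assert converter.match(element) is not None
for child in converter.create_children({}, element):
    print(child.name, child.value)  # one line per key-value pair
```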
Further reading
===============
- A simple `documented example <https://gitlab.com/caosdb/documented-crawler-example>`_ which
  demonstrates the crawler usage.
- Some useful examples can be found in the `integration tests
  <https://gitlab.com/caosdb/caosdb-crawler/-/tree/main/integrationtests>`_ (and to a certain extent
  in the unit tests).
Getting Started
===============

   prerequisites
   helloworld
   optionalfeatures
   furtherreading

This section will help you get going, from the first installation steps to the first simple crawl.
Macros
------
Macros greatly simplify the writing of complex :doc:`CFoods<cfood>`. Consider the following common
example:
.. _example_files:

.. code-block:: yaml
This chapter contains a collection of tutorials.
   Parameter File<parameterfile>
   Scientific Data Folder<scifolder>
   WIP: Single Structured File <single_file>