Commit f25e21e7 authored by Florian Spreckelsen

DOC: Split converters/index.rst into separate files
CFood definition
++++++++++++++++
Converter application to data is specified via a tree-like yml file (called ``cfood.yml``, by
convention). The yml file specifies which Converters shall be used on which StructureElements, and
how to treat the generated *child* StructureElements.
The yaml definition may look like this:

.. todo::

    This is outdated, see ``cfood-schema.yml`` for the current specification of a ``cfood.yml``.

.. code-block:: yaml

    <NodeName>:
      type: <ConverterName>
      match: ".*"
      records:
        Experiment1:
          parents:
          - Experiment
          - Blablabla
          date: $DATUM
          (...)
        Experiment2:
          parents:
          - Experiment
      subtree:
        (...)

The **<NodeName>** is a description of what the current block represents (e.g.
``experiment-folder``) and is used as an identifier.
**<type>** selects the converter that is going to be matched against the current structure
element. If the structure element matches (this is a combination of a typecheck and a detailed
match, see the :py:class:`~caoscrawler.converters.Converter` source documentation for details), the
converter will:
- generate records (with :py:meth:`~caoscrawler.converters.Converter.create_records`)
- possibly process a subtree (with :py:meth:`~caoscrawler.converters.Converter.create_children`)

**match** contains a regular expression that is matched against the current
StructureElement (typically its name, e.g. a file or directory name).
**records** is a dict of definitions that define the semantic structure
(see details below).
**subtree** makes the yaml recursive: It contains a list of new Converter
definitions, which work on the StructureElements that are returned by the
current Converter.
Custom Converters
+++++++++++++++++
As mentioned before it is possible to create custom converters.
These custom converters can be used to integrate arbitrary data extraction and ETL capabilities
into the LinkAhead crawler and make these extensions available to any yaml specification.
Tell the crawler about a custom converter
=========================================
To use a custom converter, it must be defined in the ``Converters`` section of the CFood yaml file.
The basic syntax for adding a custom converter to a definition file is:

.. code-block:: yaml

    Converters:
      <NameOfTheConverterInYamlFile>:
        package: <python>.<module>.<name>
        converter: <PythonClassName>

The Converters section can be put into either the first or the second
document of the cfood yaml file. It can also be part of a
single-document yaml cfood file. Please refer to :doc:`the cfood
documentation<../cfood>` for more details.
Details:
- **<NameOfTheConverterInYamlFile>**: This is the name of the converter as it is going to be used in the present yaml file.
- **<python>.<module>.<name>**: The name of the module where the converter class resides.
- **<PythonClassName>**: Within this specified module there must be a class inheriting from base class :py:class:`caoscrawler.converters.Converter`.
Implementing a custom converter
===============================
Converters inherit from the :py:class:`~caoscrawler.converters.Converter` class.
The following methods are abstract and need to be overwritten by your custom converter to make it work:

- :py:meth:`~caoscrawler.converters.Converter.create_children`:
  Return a list of child StructureElement objects.
- :py:meth:`~caoscrawler.converters.Converter.match`
- :py:meth:`~caoscrawler.converters.Converter.typecheck`
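A framework-free sketch of this pattern, assuming nothing about the real base class beyond what is stated above (the ``Converter`` stand-in and the ``KeywordConverter`` example are invented for illustration):

```python
from abc import ABC, abstractmethod


class Converter(ABC):
    """Minimal stand-in for caoscrawler.converters.Converter (illustration only)."""

    @abstractmethod
    def create_children(self, general_store, element):
        """Return a list of child StructureElement objects."""


class KeywordConverter(Converter):
    """Toy converter: splits a comma-separated text value into child elements."""

    def create_children(self, general_store, element):
        # In a real converter the children would be StructureElement instances;
        # plain strings keep this sketch self-contained.
        return [part.strip() for part in element.split(",")]


children = KeywordConverter().create_children(None, "alpha, beta, gamma")
print(children)  # ['alpha', 'beta', 'gamma']
```

Because ``create_children`` is abstract, instantiating the base class directly fails with a ``TypeError``, which is exactly the contract a custom converter must fulfill.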
Example
=======
In the following, we will explain the process of adding a custom converter to a yaml file using
a SourceResolver that is able to attach a source element to another entity.
**Note**: This example might become a standard crawler soon, as part of the scifolder specification. See https://doi.org/10.3390/data5020043 for details. In this documentation example we will, therefore, add it to a package called "scifolder".
First we will create our package and module structure, which might be:

.. code-block::

    scifolder_package/
      README.md
      setup.cfg
      setup.py
      Makefile
      tox.ini
      src/
        scifolder/
          __init__.py
          converters/
            __init__.py
            sources.py  # <- the actual file containing
                        #    the converter class
      doc/
      unittests/

Now we need to create a class called "SourceResolver" in the file "sources.py". In this - more advanced - example, we will not inherit our converter directly from :py:class:`~caoscrawler.converters.Converter`, but use :py:class:`~caoscrawler.converters.TextElementConverter`. The latter already implements :py:meth:`~caoscrawler.converters.Converter.match` and :py:meth:`~caoscrawler.converters.Converter.typecheck`, so only an implementation for :py:meth:`~caoscrawler.converters.Converter.create_children` has to be provided by us.
Furthermore we will customize the method :py:meth:`~caoscrawler.converters.Converter.create_records` which allows us to specify a more complex record generation procedure than the standard implementation provides. One specific limitation of the standard implementation is that only a fixed
number of records can be generated by the yaml definition. So for any application - like here - that requires an arbitrary number of records to be created, a customized implementation of :py:meth:`~caoscrawler.converters.Converter.create_records` is recommended.
In this context it is recommended to make use of the function :func:`caoscrawler.converters.create_records` that implements creation of record objects from python dictionaries of the same structure
that would be given using a yaml definition (see next section below).

.. code-block:: python

    import re

    from caoscrawler.stores import GeneralStore, RecordStore
    from caoscrawler.converters import TextElementConverter, create_records
    from caoscrawler.structure_elements import StructureElement, TextElement


    class SourceResolver(TextElementConverter):
        """
        This resolver uses a source list element (e.g. from the markdown readme file)
        to link sources correctly.
        """

        def __init__(self, definition: dict, name: str,
                     converter_registry: dict):
            """
            Initialize a new directory converter.
            """
            super().__init__(definition, name, converter_registry)

        def create_children(self, generalStore: GeneralStore,
                            element: StructureElement):
            # The source resolver does not create children:
            return []

        def create_records(self, values: GeneralStore,
                           records: RecordStore,
                           element: StructureElement,
                           file_path_prefix):

            if not isinstance(element, TextElement):
                raise RuntimeError("SourceResolver only supports TextElements.")

            # This function must return a list containing tuples, each one for a modified
            # property: (name_of_entity, name_of_property)
            keys_modified = []

            # This is the name of the entity where the source is going to be attached:
            attach_to_scientific_activity = self.definition["scientific_activity"]
            rec = records[attach_to_scientific_activity]

            # The "source" is a path to a source project, so it should have the form:
            # /<Category>/<project>/<scientific_activity>/
            # Obtain this information from the structure element:
            val = element.value
            regexp = (r'/(?P<category>(SimulationData)|(ExperimentalData)|(DataAnalysis))'
                      r'/(?P<project_date>.*?)_(?P<project_identifier>.*)'
                      r'/(?P<date>[0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2})(_(?P<identifier>.*))?/')
            res = re.match(regexp, val)
            if res is None:
                raise RuntimeError("Source cannot be parsed correctly.")

            # Mapping of categories on the file system to corresponding record types in CaosDB:
            cat_map = {
                "SimulationData": "Simulation",
                "ExperimentalData": "Experiment",
                "DataAnalysis": "DataAnalysis"}
            linkrt = cat_map[res.group("category")]

            keys_modified.extend(create_records(values, records, {
                "Project": {
                    "date": res.group("project_date"),
                    "identifier": res.group("project_identifier"),
                },
                linkrt: {
                    "date": res.group("date"),
                    "identifier": res.group("identifier"),
                    "project": "$Project"
                },
                attach_to_scientific_activity: {
                    "sources": "+$" + linkrt
                }}, file_path_prefix))

            # Process the records section of the yaml definition:
            keys_modified.extend(
                super().create_records(values, records, element, file_path_prefix))

            # The create_records function must return the modified keys to make it compatible
            # to the crawler functions:
            return keys_modified

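The path-parsing regular expression used in ``create_records`` above can be tried out in isolation; the source path in this snippet is a hypothetical example following the ``/<Category>/<project>/<scientific_activity>/`` layout:

```python
import re

# Same pattern as in SourceResolver.create_records above:
regexp = (r'/(?P<category>(SimulationData)|(ExperimentalData)|(DataAnalysis))'
          r'/(?P<project_date>.*?)_(?P<project_identifier>.*)'
          r'/(?P<date>[0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2})(_(?P<identifier>.*))?/')

# Hypothetical source path:
res = re.match(regexp, "/ExperimentalData/2023_SpeedOfLight/2023-01-01_TimeOfFlight/")

print(res.group("category"))            # ExperimentalData
print(res.group("project_date"))        # 2023
print(res.group("project_identifier"))  # SpeedOfLight
print(res.group("date"))                # 2023-01-01
print(res.group("identifier"))          # TimeOfFlight
```

Note how the lazy ``.*?`` in ``project_date`` stops at the first underscore, splitting ``2023_SpeedOfLight`` into date and identifier parts.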
If the recommended (python) package structure is used, the package containing the converter
definition can just be installed using ``pip install .`` or ``pip install -e .`` from the
``scifolder_package`` directory.
The following yaml block will register the converter in a yaml file:

.. code-block:: yaml

    Converters:
      SourceResolver:
        package: scifolder.converters.sources
        converter: SourceResolver

Using the ``create_records`` API function
=========================================
The function :func:`caoscrawler.converters.create_records` was already mentioned above and it is
the recommended way to create new records from custom converters. Let's have a look at the
function signature:

.. code-block:: python

    def create_records(values: GeneralStore,  # <- pass the current variables store here
                       records: RecordStore,  # <- pass the current store of CaosDB records here
                       def_records: dict):    # <- This is the actual definition of new records!

``def_records`` is the actual definition of new records according to the yaml cfood specification
(work in progress, in the docs). Essentially, you can do everything here that you could do
in the yaml document as well, but using python source code.
Let's have a look at a few examples:

.. code-block:: yaml

    DirConverter:
      type: Directory
      match: (?P<dir_name>.*)
      records:
        Experiment:
          identifier: $dir_name

This block will just create a new record with parent ``Experiment`` and one property
``identifier`` with a value derived from the matching regular expression.

Let's formulate that using ``create_records``:

.. code-block:: python

    dir_name = "directory name"

    record_def = {
        "Experiment": {
            "identifier": dir_name
        }
    }

    keys_modified = create_records(values, records,
                                   record_def)

The ``dir_name`` is set explicitly here; everything else is identical to the yaml statements.
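In the crawler, ``dir_name`` would instead be bound by the named group in the ``match`` expression. That step can be reproduced by hand (the directory name below is invented):

```python
import re

# The "match" expression from the yaml block, applied to a hypothetical directory name:
m = re.match(r"(?P<dir_name>.*)", "2023-01-01_TimeOfFlight")
dir_name = m.group("dir_name")
print(dir_name)  # 2023-01-01_TimeOfFlight
```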
The role of ``keys_modified``
=============================
You probably have noticed already that :func:`caoscrawler.converters.create_records` returns
``keys_modified``, which is a list of tuples. Each element of ``keys_modified`` has two elements:

- Element 0 is the name of the record that is modified (as used in the record store ``records``).
- Element 1 is the name of the property that is modified.

It is important that the correct list of modified keys is returned by
:py:meth:`~caoscrawler.converters.Converter.create_records` to make the crawler process work.
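The shape of such a list can be illustrated in plain Python (the record and property names below are made up):

```python
# Each tuple is (name_of_record_in_record_store, name_of_modified_property):
keys_modified = [
    ("Project", "identifier"),
    ("Experiment", "identifier"),
    ("ProjectGroup", "projects"),
]

for record_name, property_name in keys_modified:
    print(f"{record_name}.{property_name} was modified")
```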
So, a sketch of a typical implementation within a custom converter could look like this:

.. code-block:: python

    def create_records(self, values: GeneralStore,
                       records: RecordStore,
                       element: StructureElement,
                       file_path_prefix: str):

        # Modify some records:
        record_def = {
            # ...
        }
        keys_modified = create_records(values, records,
                                       record_def)

        # You can of course do it multiple times:
        keys_modified.extend(create_records(values, records,
                                            record_def))

        # You can also process the records section of the yaml definition:
        keys_modified.extend(
            super().create_records(values, records, element, file_path_prefix))
        # This essentially allows users of your converter to customize the creation of records
        # by providing a custom "records" section additionally to the modifications provided
        # in this implementation of the Converter.

        # Important: Return the list of modified keys!
        return keys_modified

More complex example
====================
Let's have a look at a more complex example defining multiple records:

.. code-block:: yaml

    DirConverter:
      type: Directory
      match: (?P<dir_name>.*)
      records:
        Project:
          identifier: project_name
        Experiment:
          identifier: $dir_name
          Project: $Project
        ProjectGroup:
          projects: +$Project

This block will create two new Records:
- A project with a constant identifier
- An experiment with an identifier, derived from a regular expression and a reference to the new project.
Furthermore a Record ``ProjectGroup`` will be edited (its initial definition is not given in the
yaml block): The project that was just created will be added as a list element to the property
``projects``.

Let's formulate that using ``create_records`` (again, ``dir_name`` is constant here):

.. code-block:: python

    dir_name = "directory name"

    record_def = {
        "Project": {
            "identifier": "project_name",
        },
        "Experiment": {
            "identifier": dir_name,
            "Project": "$Project",
        },
        "ProjectGroup": {
            "projects": "+$Project",
        }
    }

    keys_modified = create_records(values, records,
                                   record_def)

Debugging
=========
You can add the key ``debug_match`` to the definition of a Converter in order to create debugging
output for the match step. The following snippet illustrates this:

.. code-block:: yaml

    DirConverter:
      type: Directory
      match: (?P<dir_name>.*)
      debug_match: True
      records:
        Project:
          identifier: project_name

Whenever this Converter tries to match a StructureElement, it logs what was matched against
what, and what the result was.
Further converters
++++++++++++++++++
More converters, together with cfood definitions and examples can be found in
the `LinkAhead Crawler Extensions Subgroup
<https://gitlab.com/linkahead/crawler-extensions>`_ on gitlab. In the following,
we list converters that are shipped with the crawler library itself but are not
part of the set of standard converters and may require this library to be
installed with additional optional dependencies.
HDF5 Converters
===============
For treating `HDF5 Files
<https://docs.hdfgroup.org/hdf5/develop/_s_p_e_c.html>`_, there are in total
four individual converters corresponding to the internal structure of HDF5
files: the :ref:`H5FileConverter` which opens the file itself and creates
further structure elements from HDF5 groups, datasets, and included
multi-dimensional arrays that are in turn treated by the
:ref:`H5GroupConverter`, the :ref:`H5DatasetConverter`, and the
:ref:`H5NdarrayConverter`, respectively. You need to install the LinkAhead
crawler with its optional ``h5-crawler`` dependency for using these converters.
The basic idea when crawling HDF5 files is to treat them very similar to
:ref:`dictionaries <DictElement Converter>` in which the attributes on root,
group, or dataset level are essentially treated like ``BooleanElement``,
``TextElement``, ``FloatElement``, and ``IntegerElement`` in a dictionary: They
are appended as children and can be accessed via the ``subtree``. The file
itself and the groups within may contain further groups and datasets, which can
have their own attributes, subgroups, and datasets, very much like
``DictElements`` within a dictionary. The main difference to any other
dictionary type is the presence of multi-dimensional arrays within HDF5
datasets. Since LinkAhead doesn't have any datatype corresponding to these, and
since it isn't desirable to store these arrays directly within LinkAhead for
reasons of performance and of searchability, we wrap them within a specific
Record as explained :ref:`below <H5NdarrayConverter>`, together with more
metadata and their internal path within the HDF5 file. Users can thus query for
datasets and their arrays according to their metadata within LinkAhead and then
use the internal path information to access the dataset within the file
directly. The type of this record and the property for storing the internal path
need to be reflected in the datamodel. Using the default names, you would need a
datamodel like

.. code-block:: yaml

    H5Ndarray:
      obligatory_properties:
        internal_hdf5-path:
          datatype: TEXT

although the names of both property and record type can be configured within the
cfood definition.
A simple example of a cfood definition for HDF5 files can be found in the `unit
tests
<https://gitlab.com/linkahead/linkahead-crawler/-/blob/main/unittests/h5_cfood.yml?ref_type=heads>`_
and shows how the individual converters are used in order to crawl a `simple
example file
<https://gitlab.com/linkahead/linkahead-crawler/-/blob/main/unittests/hdf5_dummy_file.hdf5?ref_type=heads>`_
containing groups, subgroups, and datasets, together with their respective
attributes.
H5FileConverter
---------------
This is an extension of the
:py:class:`~caoscrawler.converters.SimpleFileConverter` class. It opens the HDF5
file and creates children for any contained group or dataset. Additionally, the
root-level attributes of the HDF5 file are accessible as children.
H5GroupConverter
----------------
This is an extension of the
:py:class:`~caoscrawler.converters.DictElementConverter` class. Children are
created for all subgroups and datasets in this HDF5 group. Additionally, the
group-level attributes are accessible as children.
H5DatasetConverter
------------------
This is an extension of the
:py:class:`~caoscrawler.converters.DictElementConverter` class. Most
importantly, it stores the array data contained in the HDF5 dataset in an
:py:class:`~caoscrawler.hdf5_converter.H5NdarrayElement`, which is added to its
children, together with the dataset's attributes.
H5NdarrayConverter
------------------
This converter creates a wrapper record for the contained dataset. The name of
this record needs to be specified in the cfood definition of this converter via
the ``recordname`` option. The RecordType of this record can be configured with
the ``array_recordtype_name`` option and defaults to ``H5Ndarray``. Via the
given ``recordname``, this record can be used within the cfood. Most
importantly, this record stores the internal path of this array within the HDF5
file in a text property, the name of which can be configured with the
``internal_path_property_name`` option which defaults to ``internal_hdf5_path``.
Standard Converters
+++++++++++++++++++
These are the standard converters that exist in a default installation. For writing and applying
*custom converters*, see :ref:`below <Custom Converters>`.
Directory Converter
===================
The Directory Converter creates StructureElements for each File and Directory
inside the current Directory. You can match a regular expression against the
directory name using the ``match`` key.
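The matching step can be sketched in plain Python (the ``match`` expression and directory names below are invented):

```python
import re

# A "match" expression like one would write in a cfood definition:
match = r"(?P<date>[0-9]{4}-[0-9]{2}-[0-9]{2})_(?P<identifier>.*)"

# Hypothetical directory names as the Directory Converter would see them:
names = ["2023-01-01_TimeOfFlight", "README.md"]

# Only names matching the expression would be processed further:
matched = [n for n in names if re.match(match, n)]
print(matched)  # ['2023-01-01_TimeOfFlight']
```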
Simple File Converter
=====================
The Simple File Converter does not create any children and is usually used if
a file shall be used as it is and be inserted and referenced by other entities.
Markdown File Converter
=======================
Reads a YAML header from Markdown files (if such a header exists) and creates
children elements according to the structure of the header.
DictElement Converter
=====================
DictElement → StructureElement
Creates a child StructureElement for each key in the dictionary.
Typical Subtree converters
--------------------------
The following StructureElement types are typically created by the DictElement converter:
- BooleanElement
- FloatElement
- TextElement
- IntegerElement
- ListElement
- DictElement
Note that you may use ``TextElement`` for anything that exists in a text format that can be
interpreted by the server, such as date and datetime strings in ISO-8601 format.
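For example, ISO-8601 date and datetime strings as they might occur in such a ``TextElement`` can be parsed with the Python standard library (the values below are made up):

```python
from datetime import date, datetime

# ISO-8601 strings as they might appear in crawled text:
d = date.fromisoformat("2023-01-01")
dt = datetime.fromisoformat("2023-01-01T12:30:00")
print(d.year, dt.hour)  # 2023 12
```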
Scalar Value Converters
=======================
``BooleanElementConverter``, ``FloatElementConverter``, ``TextElementConverter``, and
``IntegerElementConverter`` behave very similarly.

These converters expect ``match_name`` and ``match_value`` in their definition,
which allow matching the key and the value, respectively.

Note that there are defaults for accepting other types. For example,
FloatElementConverter also accepts IntegerElements. The default
behavior can be adjusted with the fields ``accept_text``, ``accept_int``,
``accept_float``, and ``accept_bool``.

The following denotes what kind of StructureElements are accepted by default
(they are defined in ``src/caoscrawler/converters.py``):
- BooleanElementConverter: bool, int
- FloatElementConverter: int, float
- TextElementConverter: text, bool, int, float
- IntegerElementConverter: int
- ListElementConverter: list
- DictElementConverter: dict
YAMLFileConverter
=================
A specialized DictElement Converter for yaml files: Yaml files are opened and their contents are
converted into dictionaries that can be further converted using the typical subtree converters
of the DictElement converter.
**WARNING**: Currently unfinished implementation.
JSONFileConverter
=================
Analogous to the YAMLFileConverter: JSON files are opened and their contents are converted
into dictionaries which can be further converted using the typical subtree converters.
TableConverter
==============
Table → DictElement
A generic converter (abstract) for files containing tables.
Currently, there are two specialized implementations for XLSX files and CSV files.
All table converters generate a subtree of dicts, which in turn can be converted with DictElementConverters:
For each row in the table the TableConverter generates a DictElement (structure element). The key of the
element is the row number. The value of the element is a dict containing the mapping of
column names to values of the respective cell.
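This row-to-dict structure can be sketched with the standard library alone (the table content is invented and caoscrawler's actual classes are not used here):

```python
import csv
import io

# A small stand-in for a CSV table file:
table = "measurement,unit\n1.2,m\n3.4,m\n"

# One dict per row: key = row number, value = mapping of column names to cell values:
rows = {str(i): dict(row) for i, row in enumerate(csv.DictReader(io.StringIO(table)))}
print(rows["0"])  # {'measurement': '1.2', 'unit': 'm'}
```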
Example:

.. code-block:: yaml

    subtree:
      TABLE:  # Any name for the table as a whole
        type: CSVTableConverter
        match: ^test_table.csv$
        records:
          (...)  # Records edited for the whole table file
        subtree:
          ROW:  # Any name for a data row in the table
            type: DictElement
            match_name: .*
            match_value: .*
            records:
              (...)  # Records edited for each row
            subtree:
              COLUMN:  # Any name for a specific type of column in the table
                type: FloatElement
                match_name: measurement  # Name of the column in the table file
                match_value: (?P<column_value>.*)
                records:
                  (...)  # Records edited for each cell

XLSXTableConverter
==================
XLSX File → DictElement
CSVTableConverter
=================
CSV File → DictElement
PropertiesFromDictConverter
===========================
The :py:class:`~caoscrawler.converters.PropertiesFromDictConverter` is
a specialization of the
:py:class:`~caoscrawler.converters.DictElementConverter` and offers
all its functionality. It is meant to operate on dictionaries (e.g.,
from reading in a json or a table file), the keys of which correspond
closely to properties in a LinkAhead datamodel. This is especially
handy in cases where properties may be added to the data model and
data sources that are not yet known when writing the cfood definition.
The converter definition of the
:py:class:`~caoscrawler.converters.PropertiesFromDictConverter` has an
additional required entry ``record_from_dict`` which specifies the
Record to which the properties extracted from the dict are attached
to. This Record is identified by its ``variable_name`` by which it can
be referred to further down the subtree. You can also use the name of
a Record that was specified earlier in the CFood definition in order
to extend it by the properties extracted from a dict. Let's have a
look at a simple example. A CFood definition

.. code-block:: yaml

    PropertiesFromDictElement:
      type: PropertiesFromDictElement
      match: ".*"
      record_from_dict:
        variable_name: MyRec
        parents:
        - MyType1
        - MyType2

applied to a dictionary

.. code-block:: json

    {
      "name": "New name",
      "a": 5,
      "b": ["a", "b", "c"],
      "author": {
        "full_name": "Silvia Scientist"
      }
    }

will create a Record ``New name`` with parents ``MyType1`` and
``MyType2``. It has a scalar property ``a`` with value 5, a list
property ``b`` with values "a", "b" and "c", and an ``author``
property which references an ``author`` with a ``full_name`` property
with value "Silvia Scientist":
.. image:: ../img/properties-from-dict-records-author.png
:height: 210
Note how the different dictionary keys are handled differently
depending on their types: scalar and list values are understood
automatically, and a dictionary-valued entry like ``author`` is
translated into a reference to an ``author`` Record automatically.
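This distinction between scalar, list and reference values can be sketched as follows (this mirrors the description above, not the converter's actual code):

```python
def classify(value):
    # Dict-valued entries become references to sub-Records,
    # lists become list properties, everything else a scalar property.
    if isinstance(value, dict):
        return "reference"
    if isinstance(value, list):
        return "list property"
    return "scalar property"


data = {"a": 5, "b": ["a", "b", "c"], "author": {"full_name": "Silvia Scientist"}}
print({key: classify(value) for key, value in data.items()})
```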
You can further specify how references are treated with an optional
``references`` key in ``record_from_dict``. Let's assume that in the
above example, we have an ``author`` **Property** with datatype
``Person`` in our data model. We could add this information by
extending the above example definition by

.. code-block:: yaml

    PropertiesFromDictElement:
      type: PropertiesFromDictElement
      match: ".*"
      record_from_dict:
        variable_name: MyRec
        parents:
        - MyType1
        - MyType2
        references:
          author:
            parents:
            - Person

so that now, a ``Person`` record with a ``full_name`` property with
value "Silvia Scientist" is created as the value of the ``author``
property:
.. image:: ../img/properties-from-dict-records-person.png
:height: 200
For the time being, only the parents of the referenced record can be
set via this option. More complicated treatments can be implemented
via the ``referenced_record_callback`` (see below).
Properties can be blacklisted with the ``properties_blacklist``
keyword, i.e., all keys listed under ``properties_blacklist`` will be
excluded from automated treatment. Since the
:py:class:`~caoscrawler.converters.PropertiesFromDictConverter` has
all the functionality of the
:py:class:`~caoscrawler.converters.DictElementConverter`, individual
properties can still be used in a subtree. Together with
``properties_blacklist`` this can be used to add custom treatment to
specific properties by blacklisting them in ``record_from_dict`` and
then treating them in the subtree the same as you would do it in the
standard
:py:class:`~caoscrawler.converters.DictElementConverter`. Note that
the blacklisted keys are excluded on **all** levels of the dictionary,
i.e., also when they occur in a referenced entity.
For further customization, the
:py:class:`~caoscrawler.converters.PropertiesFromDictConverter` can be
used as a basis for :ref:`custom converters<Custom Converters>` which
can make use of its ``referenced_record_callback`` argument. The
``referenced_record_callback`` can be a callable object which takes
exactly a Record as an argument and needs to return that Record after
doing whatever custom treatment is needed. Additionally, it is given
the ``RecordStore`` and the ``ValueStore`` in order to be able to
access the records and values that have already been defined from
within ``referenced_record_callback``. Such a function might look like the
following:

.. code-block:: python

    def my_callback(rec: db.Record, records: RecordStore, values: GeneralStore):
        # do something with rec, possibly using other records or values from the stores...
        rec.description = "This was updated in a callback"
        return rec

It is applied to all Records that are created from the dictionary and
it can be used to, e.g., transform values of some properties, or add
special treatment to all Records of a specific
type. ``referenced_record_callback`` is applied **after** the
properties from the dictionary have been applied as explained above.
XML Converters
==============
There are the following converters for XML content:
XMLFileConverter
----------------
This is a converter that loads an XML file and creates an XMLElement containing the
root element of the XML tree. It can be matched in the subtree using the XMLTagConverter.
XMLTagConverter
---------------
The XMLTagConverter is a generic converter for XMLElements with the following main features:

- It allows matching a combination of tag name, attribute names and text contents using the keys:

  - ``match_tag``: regexp, default empty string
  - ``match_attrib``: dictionary of key-regexp and value-regexp
    pairs. Each key matches an attribute name and the corresponding
    value matches its attribute value.
  - ``match_text``: regexp, default empty string

- It allows traversing the tree using XPath (using Python lxml's xpath functions):

  - The key ``xpath`` is used to set the xpath expression and has a
    default of ``child::*``, which generates just the list of
    sub nodes of the current node. The result of the xpath expression
    is used to generate structure elements as children. It furthermore
    uses the keys ``tags_as_children``, ``attribs_as_children`` and
    ``text_as_children`` to decide which information from the found
    nodes will be used as children:

    - ``tags_as_children``: (default ``true``) For each xml tag element
      found by the xpath expression, generate one XMLTag structure
      element. Its name is the full path to the tag using the function
      ``getelementpath`` from ``lxml``.
    - ``attribs_as_children``: (default ``false``) For each xml tag element
      found by the xpath expression, generate one XMLAttributeNode
      structure element for each of its attributes. The name of the
      respective attribute node has the form ``<full path of the tag> @
      <name of the attribute>``. **Please note:** Currently, there is no
      converter implemented that can match XMLAttributeNodes.
    - ``text_as_children``: (default ``false``) For each xml tag element
      found by the xpath expression, generate one XMLTextNode structure
      element containing the text content of the tag element. Note that
      in case of multiple text elements, only the first one is
      added. The name of the respective text node has the form
      ``<full path of the tag> /text()``. **Please note:** Currently, there is
      no converter implemented that can match XMLTextNodes.
Namespaces
**********
The default is to take the namespace map from the current node and use
it in xpath queries. Because default namespaces cannot be handled by
xpath, it is possible to remap the default namespace using the key
``default_namespace``. The key ``nsmap`` can be used to define
additional nsmap entries.
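The underlying XPath limitation can be demonstrated with the Python standard library (the namespace URI below is invented): an unprefixed name does not match elements living in a default namespace, so the namespace has to be remapped to a prefix first.

```python
import xml.etree.ElementTree as ET

doc = '<root xmlns="http://example.org/ns"><child>hi</child></root>'
root = ET.fromstring(doc)

# Without a prefix, the element in the default namespace is not found:
print(root.findall("child"))  # []

# Remapping the default namespace to a prefix makes the element addressable:
nodes = root.findall("d:child", {"d": "http://example.org/ns"})
print(nodes[0].text)  # hi
```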
XMLTextNodeConverter
--------------------
In the future, this converter can be used to match XMLTextNodes that
are generated by the XMLTagConverter.
Transform Functions
+++++++++++++++++++
Often the situation arises that you cannot use a value as it is found. Maybe a value should be
increased by an offset or a string should be split into a list of pieces. In order to allow such
simple conversions, transform functions can be named in the converter definition; they are then
applied to the respective variables when the converter is executed.

.. code-block:: yaml

    <NodeName>:
      type: <ConverterName>
      match: ".*"
      transform:
        <TransformNodeName>:
          in: $<in_var_name>
          out: $<out_var_name>
          functions:
          - <func_name>:                      # name of the function to be applied
              <func_arg1>: <func_arg1_value>  # key value pairs that are passed as parameters
              <func_arg2>: <func_arg2_value>
              # ...

An example that splits the variable ``a`` and puts the generated list in ``b`` is the following:

.. code-block:: yaml

    Experiment:
      type: Dict
      match: ".*"
      transform:
        param_split:
          in: $a
          out: $b
          functions:
          - split:         # split is a function that is defined by default
              marker: "|"  # its only parameter is the marker that is used to split the string
      records:
        Report:
          tags: $b

This splits the string in ``$a`` and stores the resulting list in ``$b``. Here, this is used to
add a list-valued property to the ``Report`` Record.
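What the built-in ``split`` function does here can be reproduced in plain Python (the tag string is made up):

```python
# The value of $a before the transformation (hypothetical):
a = "fast|reliable|published"

# Applying "split" with marker "|" corresponds to:
b = a.split("|")
print(b)  # ['fast', 'reliable', 'published']
```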
There are a number of transform functions that are defined by default (see
``src/caoscrawler/default_transformers.yml``). You can define custom transform functions by adding
them to the cfood definition (see :doc:`CFood Documentation<../cfood>`).