Commit f25e21e7 authored by Florian Spreckelsen

DOC: Split converters/index.rst into separate files
CFood definition
++++++++++++++++
Converter application to data is specified via a tree-like yml file (called ``cfood.yml``, by
convention). The yml file specifies which Converters shall be used on which StructureElements, and
how to treat the generated *child* StructureElements.
The yaml definition may look like this:

.. todo::

    This is outdated, see ``cfood-schema.yml`` for the current specification of a ``cfood.yml``.

.. code-block:: yaml

    <NodeName>:
      type: <ConverterName>
      match: ".*"
      records:
        Experiment1:
          parents:
          - Experiment
          - Blablabla
          date: $DATUM
          (...)
        Experiment2:
          parents:
          - Experiment
      subtree:
        (...)

The **<NodeName>** is a description of what the current block represents (e.g.
``experiment-folder``) and is used as an identifier.
**<type>** selects the converter that is going to be matched against the current structure
element. If the structure element matches (this is a combination of a typecheck and a detailed
match, see the :py:class:`~caoscrawler.converters.Converter` source documentation for details), the
converter will:
- generate records (with :py:meth:`~caoscrawler.converters.Converter.create_records`)
- possibly process a subtree (with :py:meth:`~caoscrawler.converters.Converter.create_children`)

**match** contains a regular expression that is matched against the current
StructureElement (typically its name, e.g. a file or directory name).
**records** is a dict of definitions that define the semantic structure
(see details below).
**subtree** makes the yaml recursive: It contains a list of new Converter
definitions, which work on the StructureElements that are returned by the
current Converter.
Custom Converters
+++++++++++++++++
As mentioned before it is possible to create custom converters.
These custom converters can be used to integrate arbitrary data extraction and ETL capabilities
into the LinkAhead crawler and make these extensions available to any yaml specification.
Tell the crawler about a custom converter
=========================================
To use a custom converter, it must be defined in the ``Converters`` section of the CFood yaml file.
The basic syntax for adding a custom converter to a definition file is:

.. code-block:: yaml

    Converters:
      <NameOfTheConverterInYamlFile>:
        package: <python>.<module>.<name>
        converter: <PythonClassName>

The Converters section can be put into either the first or the second
document of the cfood yaml file. It can also be part of a
single-document yaml cfood file. Please refer to :doc:`the cfood
documentation<../cfood>` for more details.
Details:
- **<NameOfTheConverterInYamlFile>**: This is the name of the converter as it is going to be used in the present yaml file.
- **<python>.<module>.<name>**: The name of the module where the converter class resides.
- **<PythonClassName>**: Within this specified module there must be a class inheriting from base class :py:class:`caoscrawler.converters.Converter`.
Implementing a custom converter
===============================
Converters inherit from the :py:class:`~caoscrawler.converters.Converter` class.
The following methods are abstract and need to be overwritten by your custom converter to make it work:

- :py:meth:`~caoscrawler.converters.Converter.create_children`:
  Return a list of child StructureElement objects.
- :py:meth:`~caoscrawler.converters.Converter.match`
- :py:meth:`~caoscrawler.converters.Converter.typecheck`
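A framework-free sketch of this pattern, assuming nothing about the real base class beyond what is stated above (the ``Converter`` stand-in and the ``KeywordConverter`` example are invented for illustration):

```python
from abc import ABC, abstractmethod


class Converter(ABC):
    """Minimal stand-in for caoscrawler.converters.Converter (illustration only)."""

    @abstractmethod
    def create_children(self, general_store, element):
        """Return a list of child StructureElement objects."""


class KeywordConverter(Converter):
    """Toy converter: splits a comma-separated text value into child elements."""

    def create_children(self, general_store, element):
        # In a real converter the children would be StructureElement instances;
        # plain strings keep this sketch self-contained.
        return [part.strip() for part in element.split(",")]


children = KeywordConverter().create_children(None, "alpha, beta, gamma")
print(children)  # ['alpha', 'beta', 'gamma']
```

Because ``create_children`` is abstract, instantiating the base class directly fails with a ``TypeError``, which is exactly the contract a custom converter must fulfill.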
Example
=======
In the following, we will explain the process of adding a custom converter to a yaml file using
a SourceResolver that is able to attach a source element to another entity.
**Note**: This example might become a standard crawler soon, as part of the scifolder specification. See https://doi.org/10.3390/data5020043 for details. In this documentation example we will, therefore, add it to a package called "scifolder".
First we will create our package and module structure, which might be:

.. code-block::

    scifolder_package/
      README.md
      setup.cfg
      setup.py
      Makefile
      tox.ini
      src/
        scifolder/
          __init__.py
          converters/
            __init__.py
            sources.py  # <- the actual file containing
                        #    the converter class
      doc/
      unittests/

Now we need to create a class called "SourceResolver" in the file "sources.py". In this - more advanced - example, we will not inherit our converter directly from :py:class:`~caoscrawler.converters.Converter`, but use :py:class:`~caoscrawler.converters.TextElementConverter`. The latter already implements :py:meth:`~caoscrawler.converters.Converter.match` and :py:meth:`~caoscrawler.converters.Converter.typecheck`, so only an implementation for :py:meth:`~caoscrawler.converters.Converter.create_children` has to be provided by us.
Furthermore we will customize the method :py:meth:`~caoscrawler.converters.Converter.create_records` which allows us to specify a more complex record generation procedure than the standard implementation provides. One specific limitation of the standard implementation is that only a fixed
number of records can be generated by the yaml definition. So for any application - like here - that requires an arbitrary number of records to be created, a customized implementation of :py:meth:`~caoscrawler.converters.Converter.create_records` is recommended.
In this context it is recommended to make use of the function :func:`caoscrawler.converters.create_records` that implements creation of record objects from python dictionaries of the same structure
that would be given using a yaml definition (see next section below).

.. code-block:: python

    import re

    from caoscrawler.stores import GeneralStore, RecordStore
    from caoscrawler.converters import TextElementConverter, create_records
    from caoscrawler.structure_elements import StructureElement, TextElement


    class SourceResolver(TextElementConverter):
        """
        This resolver uses a source list element (e.g. from the markdown readme file)
        to link sources correctly.
        """

        def __init__(self, definition: dict, name: str,
                     converter_registry: dict):
            """
            Initialize a new directory converter.
            """
            super().__init__(definition, name, converter_registry)

        def create_children(self, generalStore: GeneralStore,
                            element: StructureElement):
            # The source resolver does not create children:
            return []

        def create_records(self, values: GeneralStore,
                           records: RecordStore,
                           element: StructureElement,
                           file_path_prefix):

            if not isinstance(element, TextElement):
                raise RuntimeError("SourceResolver only supports TextElements.")

            # This function must return a list containing tuples, each one for a modified
            # property: (name_of_entity, name_of_property)
            keys_modified = []

            # This is the name of the entity where the source is going to be attached:
            attach_to_scientific_activity = self.definition["scientific_activity"]
            rec = records[attach_to_scientific_activity]

            # The "source" is a path to a source project, so it should have the form:
            # /<Category>/<project>/<scientific_activity>/
            # Obtain this information from the structure element:
            val = element.value
            regexp = (r'/(?P<category>(SimulationData)|(ExperimentalData)|(DataAnalysis))'
                      r'/(?P<project_date>.*?)_(?P<project_identifier>.*)'
                      r'/(?P<date>[0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2})(_(?P<identifier>.*))?/')
            res = re.match(regexp, val)
            if res is None:
                raise RuntimeError("Source cannot be parsed correctly.")

            # Mapping of categories on the file system to corresponding record types in CaosDB:
            cat_map = {
                "SimulationData": "Simulation",
                "ExperimentalData": "Experiment",
                "DataAnalysis": "DataAnalysis"}
            linkrt = cat_map[res.group("category")]

            keys_modified.extend(create_records(values, records, {
                "Project": {
                    "date": res.group("project_date"),
                    "identifier": res.group("project_identifier"),
                },
                linkrt: {
                    "date": res.group("date"),
                    "identifier": res.group("identifier"),
                    "project": "$Project"
                },
                attach_to_scientific_activity: {
                    "sources": "+$" + linkrt
                }}, file_path_prefix))

            # Process the records section of the yaml definition:
            keys_modified.extend(
                super().create_records(values, records, element, file_path_prefix))

            # The create_records function must return the modified keys to make it compatible
            # to the crawler functions:
            return keys_modified

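The path-parsing regular expression used in ``create_records`` above can be tried out in isolation; the source path in this snippet is a hypothetical example following the ``/<Category>/<project>/<scientific_activity>/`` layout:

```python
import re

# Same pattern as in SourceResolver.create_records above:
regexp = (r'/(?P<category>(SimulationData)|(ExperimentalData)|(DataAnalysis))'
          r'/(?P<project_date>.*?)_(?P<project_identifier>.*)'
          r'/(?P<date>[0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2})(_(?P<identifier>.*))?/')

# Hypothetical source path:
res = re.match(regexp, "/ExperimentalData/2023_SpeedOfLight/2023-01-01_TimeOfFlight/")

print(res.group("category"))            # ExperimentalData
print(res.group("project_date"))        # 2023
print(res.group("project_identifier"))  # SpeedOfLight
print(res.group("date"))                # 2023-01-01
print(res.group("identifier"))          # TimeOfFlight
```

Note how the lazy ``.*?`` in ``project_date`` stops at the first underscore, splitting ``2023_SpeedOfLight`` into date and identifier parts.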
If the recommended (python) package structure is used, the package containing the converter
definition can just be installed using ``pip install .`` or ``pip install -e .`` from the
``scifolder_package`` directory.
The following yaml block will register the converter in a yaml file:

.. code-block:: yaml

    Converters:
      SourceResolver:
        package: scifolder.converters.sources
        converter: SourceResolver

Using the ``create_records`` API function
=========================================
The function :func:`caoscrawler.converters.create_records` was already mentioned above and it is
the recommended way to create new records from custom converters. Let's have a look at the
function signature:

.. code-block:: python

    def create_records(values: GeneralStore,  # <- pass the current variables store here
                       records: RecordStore,  # <- pass the current store of CaosDB records here
                       def_records: dict):    # <- This is the actual definition of new records!

``def_records`` is the actual definition of new records according to the yaml cfood specification
(work in progress, in the docs). Essentially, you can do everything here that you could do
in the yaml document as well, but using python source code.
Let's have a look at a few examples:

.. code-block:: yaml

    DirConverter:
      type: Directory
      match: (?P<dir_name>.*)
      records:
        Experiment:
          identifier: $dir_name

This block will just create a new record with parent ``Experiment`` and one property
``identifier`` with a value derived from the matching regular expression.

Let's formulate that using ``create_records``:

.. code-block:: python

    dir_name = "directory name"

    record_def = {
        "Experiment": {
            "identifier": dir_name
        }
    }

    keys_modified = create_records(values, records,
                                   record_def)

The ``dir_name`` is set explicitly here; everything else is identical to the yaml statements.
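In the crawler, ``dir_name`` would instead be bound by the named group in the ``match`` expression. That step can be reproduced by hand (the directory name below is invented):

```python
import re

# The "match" expression from the yaml block, applied to a hypothetical directory name:
m = re.match(r"(?P<dir_name>.*)", "2023-01-01_TimeOfFlight")
dir_name = m.group("dir_name")
print(dir_name)  # 2023-01-01_TimeOfFlight
```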
The role of ``keys_modified``
=============================
You probably have noticed already that :func:`caoscrawler.converters.create_records` returns
``keys_modified``, which is a list of tuples. Each element of ``keys_modified`` has two elements:

- Element 0 is the name of the record that is modified (as used in the record store ``records``).
- Element 1 is the name of the property that is modified.

It is important that the correct list of modified keys is returned by
:py:meth:`~caoscrawler.converters.Converter.create_records` to make the crawler process work.
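The shape of such a list can be illustrated in plain Python (the record and property names below are made up):

```python
# Each tuple is (name_of_record_in_record_store, name_of_modified_property):
keys_modified = [
    ("Project", "identifier"),
    ("Experiment", "identifier"),
    ("ProjectGroup", "projects"),
]

for record_name, property_name in keys_modified:
    print(f"{record_name}.{property_name} was modified")
```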
So, a sketch of a typical implementation within a custom converter could look like this:

.. code-block:: python

    def create_records(self, values: GeneralStore,
                       records: RecordStore,
                       element: StructureElement,
                       file_path_prefix: str):

        # Modify some records:
        record_def = {
            # ...
        }
        keys_modified = create_records(values, records,
                                       record_def)

        # You can of course do it multiple times:
        keys_modified.extend(create_records(values, records,
                                            record_def))

        # You can also process the records section of the yaml definition:
        keys_modified.extend(
            super().create_records(values, records, element, file_path_prefix))
        # This essentially allows users of your converter to customize the creation of records
        # by providing a custom "records" section additionally to the modifications provided
        # in this implementation of the Converter.

        # Important: Return the list of modified keys!
        return keys_modified

More complex example
====================
Let's have a look at a more complex example defining multiple records:

.. code-block:: yaml

    DirConverter:
      type: Directory
      match: (?P<dir_name>.*)
      records:
        Project:
          identifier: project_name
        Experiment:
          identifier: $dir_name
          Project: $Project
        ProjectGroup:
          projects: +$Project

This block will create two new Records:
- A project with a constant identifier
- An experiment with an identifier, derived from a regular expression and a reference to the new project.
Furthermore a Record ``ProjectGroup`` will be edited (its initial definition is not given in the
yaml block): The project that was just created will be added as a list element to the property
``projects``.

Let's formulate that using ``create_records`` (again, ``dir_name`` is constant here):

.. code-block:: python

    dir_name = "directory name"

    record_def = {
        "Project": {
            "identifier": "project_name",
        },
        "Experiment": {
            "identifier": dir_name,
            "Project": "$Project",
        },
        "ProjectGroup": {
            "projects": "+$Project",
        }
    }

    keys_modified = create_records(values, records,
                                   record_def)

Debugging
=========
You can add the key ``debug_match`` to the definition of a Converter in order to create debugging
output for the match step. The following snippet illustrates this:

.. code-block:: yaml

    DirConverter:
      type: Directory
      match: (?P<dir_name>.*)
      debug_match: True
      records:
        Project:
          identifier: project_name

Whenever this Converter tries to match a StructureElement, it logs what was matched against
what, and what the result was.
Further converters
++++++++++++++++++
More converters, together with cfood definitions and examples can be found in
the `LinkAhead Crawler Extensions Subgroup
<https://gitlab.com/linkahead/crawler-extensions>`_ on gitlab. In the following,
we list converters that are shipped with the crawler library itself but are not
part of the set of standard converters and may require this library to be
installed with additional optional dependencies.
HDF5 Converters
===============
For treating `HDF5 Files
<https://docs.hdfgroup.org/hdf5/develop/_s_p_e_c.html>`_, there are in total
four individual converters corresponding to the internal structure of HDF5
files: the :ref:`H5FileConverter` which opens the file itself and creates
further structure elements from HDF5 groups, datasets, and included
multi-dimensional arrays that are in turn treated by the
:ref:`H5GroupConverter`, the :ref:`H5DatasetConverter`, and the
:ref:`H5NdarrayConverter`, respectively. You need to install the LinkAhead
crawler with its optional ``h5-crawler`` dependency for using these converters.
The basic idea when crawling HDF5 files is to treat them very similar to
:ref:`dictionaries <DictElement Converter>` in which the attributes on root,
group, or dataset level are essentially treated like ``BooleanElement``,
``TextElement``, ``FloatElement``, and ``IntegerElement`` in a dictionary: They
are appended as children and can be accessed via the ``subtree``. The file
itself and the groups within may contain further groups and datasets, which can
have their own attributes, subgroups, and datasets, very much like
``DictElements`` within a dictionary. The main difference to any other
dictionary type is the presence of multi-dimensional arrays within HDF5
datasets. Since LinkAhead doesn't have any datatype corresponding to these, and
since it isn't desirable to store these arrays directly within LinkAhead for
reasons of performance and of searchability, we wrap them within a specific
Record as explained :ref:`below <H5NdarrayConverter>`, together with more
metadata and their internal path within the HDF5 file. Users can thus query for
datasets and their arrays according to their metadata within LinkAhead and then
use the internal path information to access the dataset within the file
directly. The type of this record and the property for storing the internal path
need to be reflected in the datamodel. Using the default names, you would need a
datamodel like

.. code-block:: yaml

    H5Ndarray:
      obligatory_properties:
        internal_hdf5-path:
          datatype: TEXT

although the names of both property and record type can be configured within the
cfood definition.
A simple example of a cfood definition for HDF5 files can be found in the `unit
tests
<https://gitlab.com/linkahead/linkahead-crawler/-/blob/main/unittests/h5_cfood.yml?ref_type=heads>`_
and shows how the individual converters are used in order to crawl a `simple
example file
<https://gitlab.com/linkahead/linkahead-crawler/-/blob/main/unittests/hdf5_dummy_file.hdf5?ref_type=heads>`_
containing groups, subgroups, and datasets, together with their respective
attributes.
H5FileConverter
---------------
This is an extension of the
:py:class:`~caoscrawler.converters.SimpleFileConverter` class. It opens the HDF5
file and creates children for any contained group or dataset. Additionally, the
root-level attributes of the HDF5 file are accessible as children.
H5GroupConverter
----------------
This is an extension of the
:py:class:`~caoscrawler.converters.DictElementConverter` class. Children are
created for all subgroups and datasets in this HDF5 group. Additionally, the
group-level attributes are accessible as children.
H5DatasetConverter
------------------
This is an extension of the
:py:class:`~caoscrawler.converters.DictElementConverter` class. Most
importantly, it stores the array data contained in the HDF5 dataset in an
:py:class:`~caoscrawler.hdf5_converter.H5NdarrayElement`, which is added to its
children, together with the dataset's attributes.
H5NdarrayConverter
------------------
This converter creates a wrapper record for the contained dataset. The name of
this record needs to be specified in the cfood definition of this converter via
the ``recordname`` option. The RecordType of this record can be configured with
the ``array_recordtype_name`` option and defaults to ``H5Ndarray``. Via the
given ``recordname``, this record can be used within the cfood. Most
importantly, this record stores the internal path of this array within the HDF5
file in a text property, the name of which can be configured with the
``internal_path_property_name`` option which defaults to ``internal_hdf5_path``.
Standard Converters
+++++++++++++++++++
These are the standard converters that exist in a default installation. For writing and applying
*custom converters*, see :ref:`below <Custom Converters>`.
Directory Converter
===================
The Directory Converter creates StructureElements for each File and Directory
inside the current Directory. You can match a regular expression against the
directory name using the ``match`` key.
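The matching step can be sketched in plain Python (the ``match`` expression and directory names below are invented):

```python
import re

# A "match" expression like one would write in a cfood definition:
match = r"(?P<date>[0-9]{4}-[0-9]{2}-[0-9]{2})_(?P<identifier>.*)"

# Hypothetical directory names as the Directory Converter would see them:
names = ["2023-01-01_TimeOfFlight", "README.md"]

# Only names matching the expression would be processed further:
matched = [n for n in names if re.match(match, n)]
print(matched)  # ['2023-01-01_TimeOfFlight']
```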
Simple File Converter
=====================
The Simple File Converter does not create any children and is usually used if
a file shall be used as it is and be inserted and referenced by other entities.
Markdown File Converter
=======================
Reads a YAML header from Markdown files (if such a header exists) and creates
children elements according to the structure of the header.
DictElement Converter
=====================
DictElement → StructureElement
Creates a child StructureElement for each key in the dictionary.
Typical Subtree converters
--------------------------
The following StructureElement types are typically created by the DictElement converter:
- BooleanElement
- FloatElement
- TextElement
- IntegerElement
- ListElement
- DictElement
Note that you may use ``TextElement`` for anything that exists in a text format that can be
interpreted by the server, such as date and datetime strings in ISO-8601 format.
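For example, ISO-8601 date and datetime strings as they might occur in such a ``TextElement`` can be parsed with the Python standard library (the values below are made up):

```python
from datetime import date, datetime

# ISO-8601 strings as they might appear in crawled text:
d = date.fromisoformat("2023-01-01")
dt = datetime.fromisoformat("2023-01-01T12:30:00")
print(d.year, dt.hour)  # 2023 12
```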
Scalar Value Converters
=======================
``BooleanElementConverter``, ``FloatElementConverter``, ``TextElementConverter``, and
``IntegerElementConverter`` behave very similarly.

These converters expect ``match_name`` and ``match_value`` in their definition,
which allow matching the key and the value, respectively.

Note that there are defaults for accepting other types. For example,
FloatElementConverter also accepts IntegerElements. The default
behavior can be adjusted with the fields ``accept_text``, ``accept_int``,
``accept_float``, and ``accept_bool``.

The following denotes what kind of StructureElements are accepted by default
(they are defined in ``src/caoscrawler/converters.py``):
- BooleanElementConverter: bool, int
- FloatElementConverter: int, float
- TextElementConverter: text, bool, int, float
- IntegerElementConverter: int
- ListElementConverter: list
- DictElementConverter: dict
YAMLFileConverter
=================
A specialized DictElement Converter for yaml files: Yaml files are opened and their contents are
converted into dictionaries that can be further converted using the typical subtree converters
of the DictElement converter.
**WARNING**: Currently unfinished implementation.
JSONFileConverter
=================
Analogous to the YAMLFileConverter: JSON files are opened and their contents are converted
into dictionaries which can be further converted using the typical subtree converters.
TableConverter
==============
Table → DictElement
A generic converter (abstract) for files containing tables.
Currently, there are two specialized implementations for XLSX files and CSV files.
All table converters generate a subtree of dicts, which in turn can be converted with DictElementConverters:
For each row in the table the TableConverter generates a DictElement (structure element). The key of the
element is the row number. The value of the element is a dict containing the mapping of
column names to values of the respective cell.
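This row-to-dict structure can be sketched with the standard library alone (the table content is invented and caoscrawler's actual classes are not used here):

```python
import csv
import io

# A small stand-in for a CSV table file:
table = "measurement,unit\n1.2,m\n3.4,m\n"

# One dict per row: key = row number, value = mapping of column names to cell values:
rows = {str(i): dict(row) for i, row in enumerate(csv.DictReader(io.StringIO(table)))}
print(rows["0"])  # {'measurement': '1.2', 'unit': 'm'}
```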
Example:

.. code-block:: yaml

    subtree:
      TABLE:  # Any name for the table as a whole
        type: CSVTableConverter
        match: ^test_table.csv$
        records:
          (...)  # Records edited for the whole table file
        subtree:
          ROW:  # Any name for a data row in the table
            type: DictElement
            match_name: .*
            match_value: .*
            records:
              (...)  # Records edited for each row
            subtree:
              COLUMN:  # Any name for a specific type of column in the table
                type: FloatElement
                match_name: measurement  # Name of the column in the table file
                match_value: (?P<column_value>.*)
                records:
                  (...)  # Records edited for each cell

XLSXTableConverter
==================
XLSX File → DictElement
CSVTableConverter
=================
CSV File → DictElement
PropertiesFromDictConverter
===========================
The :py:class:`~caoscrawler.converters.PropertiesFromDictConverter` is
a specialization of the
:py:class:`~caoscrawler.converters.DictElementConverter` and offers
all its functionality. It is meant to operate on dictionaries (e.g.,
from reading in a json or a table file), the keys of which correspond
closely to properties in a LinkAhead datamodel. This is especially
handy in cases where properties may be added to the data model and
data sources that are not yet known when writing the cfood definition.
The converter definition of the
:py:class:`~caoscrawler.converters.PropertiesFromDictConverter` has an
additional required entry ``record_from_dict`` which specifies the
Record to which the properties extracted from the dict are attached
to. This Record is identified by its ``variable_name`` by which it can
be referred to further down the subtree. You can also use the name of
a Record that was specified earlier in the CFood definition in order
to extend it by the properties extracted from a dict. Let's have a
look at a simple example. A CFood definition

.. code-block:: yaml

    PropertiesFromDictElement:
      type: PropertiesFromDictElement
      match: ".*"
      record_from_dict:
        variable_name: MyRec
        parents:
        - MyType1
        - MyType2

applied to a dictionary

.. code-block:: json

    {
      "name": "New name",
      "a": 5,
      "b": ["a", "b", "c"],
      "author": {
        "full_name": "Silvia Scientist"
      }
    }

will create a Record ``New name`` with parents ``MyType1`` and
``MyType2``. It has a scalar property ``a`` with value 5, a list
property ``b`` with values "a", "b" and "c", and an ``author``
property which references an ``author`` with a ``full_name`` property
with value "Silvia Scientist":
.. image:: ../img/properties-from-dict-records-author.png
:height: 210
Note how the different dictionary keys are handled differently
depending on their types: scalar and list values are understood
automatically, and a dictionary-valued entry like ``author`` is
translated into a reference to an ``author`` Record automatically.
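This distinction between scalar, list and reference values can be sketched as follows (this mirrors the description above, not the converter's actual code):

```python
def classify(value):
    # Dict-valued entries become references to sub-Records,
    # lists become list properties, everything else a scalar property.
    if isinstance(value, dict):
        return "reference"
    if isinstance(value, list):
        return "list property"
    return "scalar property"


data = {"a": 5, "b": ["a", "b", "c"], "author": {"full_name": "Silvia Scientist"}}
print({key: classify(value) for key, value in data.items()})
```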
You can further specify how references are treated with an optional
``references`` key in ``record_from_dict``. Let's assume that in the
above example, we have an ``author`` **Property** with datatype
``Person`` in our data model. We could add this information by
extending the above example definition by

.. code-block:: yaml

    PropertiesFromDictElement:
      type: PropertiesFromDictElement
      match: ".*"
      record_from_dict:
        variable_name: MyRec
        parents:
        - MyType1
        - MyType2
        references:
          author:
            parents:
            - Person

so that now, a ``Person`` record with a ``full_name`` property with
value "Silvia Scientist" is created as the value of the ``author``
property:
.. image:: ../img/properties-from-dict-records-person.png
:height: 200
For the time being, only the parents of the referenced record can be
set via this option. More complicated treatments can be implemented
via the ``referenced_record_callback`` (see below).
Properties can be blacklisted with the ``properties_blacklist``
keyword, i.e., all keys listed under ``properties_blacklist`` will be
excluded from automated treatment. Since the
:py:class:`~caoscrawler.converters.PropertiesFromDictConverter` has
all the functionality of the
:py:class:`~caoscrawler.converters.DictElementConverter`, individual
properties can still be used in a subtree. Together with
``properties_blacklist`` this can be used to add custom treatment to
specific properties by blacklisting them in ``record_from_dict`` and
then treating them in the subtree the same as you would do it in the
standard
:py:class:`~caoscrawler.converters.DictElementConverter`. Note that
the blacklisted keys are excluded on **all** levels of the dictionary,
i.e., also when they occur in a referenced entity.
For further customization, the
:py:class:`~caoscrawler.converters.PropertiesFromDictConverter` can be
used as a basis for :ref:`custom converters<Custom Converters>` which
can make use of its ``referenced_record_callback`` argument. The
``referenced_record_callback`` can be a callable object which takes
exactly a Record as an argument and needs to return that Record after
doing whatever custom treatment is needed. Additionally, it is given
the ``RecordStore`` and the ``ValueStore`` in order to be able to
access the records and values that have already been defined from
within ``referenced_record_callback``. Such a function might look like the
following:

.. code-block:: python

    def my_callback(rec: db.Record, records: RecordStore, values: GeneralStore):
        # do something with rec, possibly using other records or values from the stores...
        rec.description = "This was updated in a callback"
        return rec

It is applied to all Records that are created from the dictionary and
it can be used to, e.g., transform values of some properties, or add
special treatment to all Records of a specific
type. ``referenced_record_callback`` is applied **after** the
properties from the dictionary have been applied as explained above.
XML Converters
==============
There are the following converters for XML content:
XMLFileConverter
----------------
This is a converter that loads an XML file and creates an XMLElement containing the
root element of the XML tree. It can be matched in the subtree using the XMLTagConverter.
XMLTagConverter
---------------
The XMLTagConverter is a generic converter for XMLElements with the following main features:

- It allows matching a combination of tag name, attribute names and text contents using the keys:

  - ``match_tag``: regexp, default empty string
  - ``match_attrib``: dictionary of key-regexp and value-regexp
    pairs. Each key matches an attribute name and the corresponding
    value matches its attribute value.
  - ``match_text``: regexp, default empty string

- It allows traversing the tree using XPath (using Python lxml's xpath functions):

  - The key ``xpath`` is used to set the xpath expression and has a
    default of ``child::*``, which generates just the list of
    sub nodes of the current node. The result of the xpath expression
    is used to generate structure elements as children. It furthermore
    uses the keys ``tags_as_children``, ``attribs_as_children`` and
    ``text_as_children`` to decide which information from the found
    nodes will be used as children:

    - ``tags_as_children``: (default ``true``) For each xml tag element
      found by the xpath expression, generate one XMLTag structure
      element. Its name is the full path to the tag using the function
      ``getelementpath`` from ``lxml``.
    - ``attribs_as_children``: (default ``false``) For each xml tag element
      found by the xpath expression, generate one XMLAttributeNode
      structure element for each of its attributes. The name of the
      respective attribute node has the form ``<full path of the tag> @
      <name of the attribute>``. **Please note:** Currently, there is no
      converter implemented that can match XMLAttributeNodes.
    - ``text_as_children``: (default ``false``) For each xml tag element
      found by the xpath expression, generate one XMLTextNode structure
      element containing the text content of the tag element. Note that
      in case of multiple text elements, only the first one is
      added. The name of the respective text node has the form
      ``<full path of the tag> /text()``. **Please note:** Currently, there is
      no converter implemented that can match XMLTextNodes.
Namespaces
**********
The default is to take the namespace map from the current node and use
it in xpath queries. Because default namespaces cannot be handled by
xpath, it is possible to remap the default namespace using the key
``default_namespace``. The key ``nsmap`` can be used to define
additional nsmap entries.
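The underlying XPath limitation can be demonstrated with the Python standard library (the namespace URI below is invented): an unprefixed name does not match elements living in a default namespace, so the namespace has to be remapped to a prefix first.

```python
import xml.etree.ElementTree as ET

doc = '<root xmlns="http://example.org/ns"><child>hi</child></root>'
root = ET.fromstring(doc)

# Without a prefix, the element in the default namespace is not found:
print(root.findall("child"))  # []

# Remapping the default namespace to a prefix makes the element addressable:
nodes = root.findall("d:child", {"d": "http://example.org/ns"})
print(nodes[0].text)  # hi
```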
XMLTextNodeConverter
--------------------
In the future, this converter can be used to match XMLTextNodes that
are generated by the XMLTagConverter.
Transform Functions
+++++++++++++++++++
Often the situation arises that you cannot use a value as it is found. Maybe a value should be
increased by an offset or a string should be split into a list of pieces. In order to allow such
simple conversions, transform functions can be named in the converter definition; they are then
applied to the respective variables when the converter is executed.

.. code-block:: yaml

    <NodeName>:
      type: <ConverterName>
      match: ".*"
      transform:
        <TransformNodeName>:
          in: $<in_var_name>
          out: $<out_var_name>
          functions:
          - <func_name>:                      # name of the function to be applied
              <func_arg1>: <func_arg1_value>  # key value pairs that are passed as parameters
              <func_arg2>: <func_arg2_value>
              # ...

An example that splits the variable ``a`` and puts the generated list in ``b`` is the following:

.. code-block:: yaml

    Experiment:
      type: Dict
      match: ".*"
      transform:
        param_split:
          in: $a
          out: $b
          functions:
          - split:         # split is a function that is defined by default
              marker: "|"  # its only parameter is the marker that is used to split the string
      records:
        Report:
          tags: $b

This splits the string in ``$a`` and stores the resulting list in ``$b``. Here, this is used to
add a list-valued property to the ``Report`` Record.
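What the built-in ``split`` function does here can be reproduced in plain Python (the tag string is made up):

```python
# The value of $a before the transformation (hypothetical):
a = "fast|reliable|published"

# Applying "split" with marker "|" corresponds to:
b = a.split("|")
print(b)  # ['fast', 'reliable', 'published']
```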
There are a number of transform functions that are defined by default (see
``src/caoscrawler/default_transformers.yml``). You can define custom transform functions by adding
them to the cfood definition (see :doc:`CFood Documentation<../cfood>`).