In the following, we explain how to add a custom converter that can be used from a YAML definition, taking as
an example a ``SourceResolver`` that attaches a source element to another entity.
**Note**: This example might become a standard crawler soon, as part of the scifolder specification. See https://doi.org/10.3390/data5020043 for details. In this documentation example we will, therefore, add it to a package called "scifolder".
First we will create our package and module structure, which might be:

.. code-block::

   scifolder_package/
     README.md
     setup.cfg
     setup.py
     Makefile
     tox.ini
     src/
       scifolder/
         __init__.py
         converters/
           __init__.py
           sources.py  # <- the actual file containing
                       #    the converter class
     doc/
     unittests/

Now we need to create a class called "SourceResolver" in the file "sources.py". In this - more advanced - example, we will not inherit our converter directly from :py:class:`~caoscrawler.converters.Converter`, but use :py:class:`~caoscrawler.converters.TextElementConverter`. The latter already implements :py:meth:`~caoscrawler.converters.Converter.match` and :py:meth:`~caoscrawler.converters.Converter.typecheck`, so only an implementation for :py:meth:`~caoscrawler.converters.Converter.create_children` has to be provided by us.
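The inheritance pattern described above can be sketched without the crawler itself. The following is a minimal illustration using only the standard library. All class names (``SketchConverter``, ``SketchTextConverter``, ``SketchSourceResolver``) and their signatures are hypothetical stand-ins, not the caoscrawler API. The sketch also assumes that a source entry has no children.

```python
import re
from abc import ABC, abstractmethod


class SketchConverter(ABC):
    """Hypothetical stand-in for a generic converter base class."""

    @abstractmethod
    def typecheck(self, element):
        """Return True if the element has a type this converter handles."""

    @abstractmethod
    def match(self, element):
        """Return a dict of matched values, or None if there is no match."""

    @abstractmethod
    def create_children(self, values, element):
        """Return the list of child elements found below this element."""


class SketchTextConverter(SketchConverter):
    """Hypothetical text converter: typecheck and match are already provided."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern)

    def typecheck(self, element):
        return isinstance(element, str)

    def match(self, element):
        m = self.pattern.match(element)
        return None if m is None else m.groupdict()


class SketchSourceResolver(SketchTextConverter):
    """Only create_children remains to be implemented by the subclass."""

    def create_children(self, values, element):
        # Assumption: a source entry has no further substructure.
        return []
```

Because ``SketchTextConverter`` leaves ``create_children`` abstract, it cannot be instantiated directly. The subclass becomes usable as soon as it supplies that one method.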
Furthermore, we will customize the method :py:meth:`~caoscrawler.converters.Converter.create_records`, which allows us to specify a more complex record generation procedure than the standard implementation provides. One specific limitation of the standard implementation is that only a fixed
number of records can be generated by the YAML definition. For applications that, like this one, need to create an arbitrary number of records, a customized implementation of :py:meth:`~caoscrawler.converters.Converter.create_records` is therefore recommended.
In this context, we recommend using the function :func:`caoscrawler.converters.create_records`, which creates record objects from Python dictionaries that have the same structure
as the one that would be given in a YAML definition.
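To illustrate why a dictionary built at run time lifts the fixed-number limitation, here is a minimal sketch using only the standard library. The function ``build_record_definitions``, the pattern ``SOURCE_PATTERN``, the record names ``Source_<n>`` and the keys used in each entry are assumptions made for illustration; they are not taken from caoscrawler.

```python
import re

# Hypothetical pattern for source paths of the form
# /<category>/<project>/<activity>/
SOURCE_PATTERN = re.compile(
    r"/(?P<category>[^/]+)/(?P<project>[^/]+)/(?P<activity>[^/]+)/"
)


def build_record_definitions(source_paths):
    """Build one record definition per source path.

    The number of entries depends on the input, which a static
    definition with a fixed set of record names cannot express.
    """
    definitions = {}
    for index, path in enumerate(source_paths):
        m = SOURCE_PATTERN.match(path)
        if m is None:
            raise ValueError(f"Source cannot be parsed: {path!r}")
        # Key names below are illustrative assumptions.
        definitions[f"Source_{index}"] = {
            "parents": [m.group("category")],
            "project": m.group("project"),
            "identifier": m.group("activity"),
        }
    return definitions
```

In a real converter, a dictionary of this kind would be handed to :func:`caoscrawler.converters.create_records` from within the customized ``create_records`` method; that call is left out of this sketch.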
.. code-block:: python
import re
from caoscrawler.stores import GeneralStore, RecordStore
from caoscrawler.converters import TextElementConverter, create_records
from caoscrawler.structure_elements import StructureElement, TextElement
class SourceResolver(TextElementConverter):
"""
This resolver uses a source list element (e.g. from the markdown readme file)