From 56b733987e8e4047ef262d856cad3c7713e7c892 Mon Sep 17 00:00:00 2001
From: fspreck <f.spreckelsen@indiscale.com>
Date: Fri, 16 Feb 2024 16:40:41 +0100
Subject: [PATCH] DOC: Explain datamodel requirements

---
 src/doc/converters.rst | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/src/doc/converters.rst b/src/doc/converters.rst
index 8cda5f2c..cf0d26bf 100644
--- a/src/doc/converters.rst
+++ b/src/doc/converters.rst
@@ -243,7 +243,34 @@ arrays that are in turn treated by the :ref:`H5GroupConverter`, the
 need to install the LinkAhead crawler with its optional ``h5crawler`` dependency
 for using these converters.
 
+The basic idea when crawling HDF5 files is to treat them very similarly to
+:ref:`dictionaries <DictElement Converter>` in which the attributes on root,
+group, or dataset level are essentially treated like ``BooleanElement``,
+``TextElement``, ``FloatElement``, and ``IntegerElement`` in a dictionary: They
+are appended as children and can be accessed via the ``subtree``. The file
+itself and the groups within may contain further groups and datasets, which can
+have their own attributes, subgroups, and datasets, very much like
+``DictElements`` within a dictionary. The main difference from any other
+dictionary type is the presence of multi-dimensional arrays within HDF5
+datasets. Since LinkAhead doesn't have any datatype corresponding to these, and
+since it isn't desirable to store these arrays directly within LinkAhead for
+reasons of performance and of searchability, we wrap them within a specific
+Record as explained :ref:`below <H5NdarrayConverter>`, together with more
+metadata and their internal path within the HDF5 file. Users can thus query for
+datasets and their arrays by their metadata in LinkAhead and then use the
+internal path to access the dataset within the file directly. The type of this
+record and the property storing the internal path must be reflected in the
+datamodel. Using the default names, you would need a datamodel like
 
+.. code-block:: yaml
+
+   H5Ndarray:
+     obligatory_properties:
+       internal_hdf5_path:
+         datatype: TEXT
+
+although the names of both property and record type can be configured within the
+cfood definition.
 
 H5FileConverter
 ---------------
@@ -267,7 +294,7 @@ H5DatasetConverter
 This is an extension of the
 :py:class:`~caoscrawler.converters.DictElementConverter` class. Most
 importantly, it stores the array data in HDF5 dataset into
-:py:class:`~caoscrawler.hdf5_converters.H5NdarrayElement` which is added to its
+:py:class:`~caoscrawler.hdf5_converter.H5NdarrayElement` which is added to its
 children, as well as the dataset attributes.
 
 H5NdarrayConverter
-- 
GitLab