Crawler 2.0

The current CaosDB crawler has several limitations. For example, the concept of identifiables cannot express conditions on referencing entities: an identifiable can only use entities that the record references, not entities that reference the record (the other direction). Another aspect is that crawler setup should be easier. This should probably result in less code (since custom and possibly untested code is error-prone). Ideally, setup/configuration can be done using a visual tool or is (in part) automated.

One approach to these goals would be to:

  1. generalize some aspects of the crawler (e.g. the identifiable)
  2. use a more configuration-based approach that requires as little programming as possible

The data structures that we encountered in the past were inherently hierarchical:

  • folder structures
  • standardized containers, like HDF5 files
  • ASCII "container" formats, like JSON files

The Crawler 2.0 should be able to treat arbitrary hierarchical structures and convert them to interconnected Records that are consistent with a predefined semantic data model.

The configuration must define:

  • How the structure is created. Example: Does the content of a file need to be considered and added to the tree?
  • How the structure and its contained data are mapped to the semantic data model. Example: The Record "Experiment" will store the date from the folder name and the email address from a JSON file as CaosDB properties.

Structure Mapping

The following describes, on an abstract level, how the above can be done.

The hierarchical structure is assumed to consist of a tree of StructureElements. The tree is created on the fly by so-called Converters, which are defined in the configuration. The tree of StructureElements is a model of the existing data. Example: A tree of Python file objects (StructureElements) could represent a file tree that exists on some file server.

Converters treat StructureElements and thereby create the StructureElements that are the children of the treated StructureElement. Example: A StructureElement represents a folder and a Converter defines that for each file in the folder another StructureElement is created. In this way, the Converters create the tree of StructureElements. The definition of a Converter also specifies which Converters shall be used to treat the generated child StructureElements. The Converter definition is therefore a tree itself.

Alex: The previous paragraph is difficult to understand. The reference "above named" is a little unclear.
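As one way to make the paragraph on Converters more concrete, the mechanism can be sketched in Python. This is a minimal sketch, not the crawler's actual implementation: the names StructureElement and Converter come from the text, but the attributes (payload, matches, create_children, subconverters) and the nested dictionary that stands in for a folder are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StructureElement:
    """One node of the tree that models the existing data."""
    name: str
    payload: object = None  # assumption: a dict stands in for a folder
    children: list[StructureElement] = field(default_factory=list)


class Converter:
    """Creates the children of an element and names the Converters for them."""

    def __init__(self, matches, create_children, subconverters=None):
        self.matches = matches                  # does this Converter apply?
        self.create_children = create_children  # element -> list of children
        self.subconverters = list(subconverters or [])

    def treat(self, element):
        element.children = self.create_children(element)
        for child in element.children:
            # Only the Converters named in this definition are tried;
            # the first one that matches treats the child.
            for converter in self.subconverters:
                if converter.matches(child):
                    converter.treat(child)
                    break


def is_folder(element):
    return isinstance(element.payload, dict)


def list_entries(element):
    return [StructureElement(name, payload)
            for name, payload in element.payload.items()]


folder_converter = Converter(is_folder, list_entries)
# Folders may contain folders, so the definition refers to itself.
folder_converter.subconverters.append(folder_converter)

root = StructureElement(
    "data", {"exp_2021": {"notes.json": "{}", "raw": {}}, "README": "text"}
)
folder_converter.treat(root)
```

After the call, root holds two children and the folder "exp_2021" has been expanded further, while "README" stays a leaf because no Converter in the definition matches it.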

Side discussion. Question: Should there be global Converters that are always checked when treating a StructureElement? Should Converters be associated with generated child StructureElements? Currently, all children are created and checked against all Converters. It could be that one would like to check file StructureElements against one set of Converters and directory StructureElements against another.

Alex's opinion: I would rather go for a macro/variable/template-based solution, so that the use of a globally predefined Converter is explicitly mentioned instead of being applied "silently and automatically".

Each StructureElement in the tree has a set of data values, i.e. a dictionary of key-value pairs. Some of those values may be set due to the kind of StructureElement. For example, a file could always have its file name as such a key-value pair, stored under the key 'filename'. Converters may define additional functions that create further values. For example, a regular expression could be used to get a date from a file name.
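The regular-expression example could look like the following sketch. The function name date_from_filename, the key 'date' and the assumed YYYY-MM-DD naming scheme are illustrative assumptions; only the key 'filename' is taken from the text.

```python
import re
from datetime import date

# Assumption: file names contain a date in the form YYYY-MM-DD.
DATE_PATTERN = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")


def date_from_filename(values):
    """Return a copy of ``values`` with a 'date' entry if one can be parsed."""
    match = DATE_PATTERN.search(values["filename"])
    if match is None:
        return dict(values)
    try:
        parsed = date(int(match["year"]), int(match["month"]), int(match["day"]))
    except ValueError:
        # The digits matched the pattern but are not a valid calendar date.
        return dict(values)
    return {**values, "date": parsed}


values = date_from_filename({"filename": "2021-03-01_measurement.csv"})
```

The original dictionary is left untouched, so a Converter could apply several such functions one after another without side effects.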

Identifiables

The concept of an identifiable should be broadened to describe how an entity can be identified. Suggestion: Definition through a unique query. Example: "FIND RECORD Fish WITH FishNumber=A AND WHICH IS REFERENCED BY B". Note that the second part cannot be specified as a condition with the old identifiable concept. The query must return zero or one entity. If no entity is returned, the respective object may be created; if one is returned, it is the one we were looking for. If more than one is returned, there is a mistake in the definition or in the data set. It is the responsibility of the designer of the query for the identifiable to make sure that it returns either zero or one entity.
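The zero-or-one rule can be sketched as a small helper. This is only an illustration under stated assumptions: resolve_identifiable, AmbiguousIdentifiableError and the run_query callable are hypothetical names, and run_query stands in for whatever function sends the query to the server and returns the matching entities.

```python
class AmbiguousIdentifiableError(Exception):
    """Raised when an identifiable query matches more than one entity."""


def resolve_identifiable(query, run_query):
    """Return the single matching entity, or None if it does not exist yet.

    ``run_query`` is a hypothetical callable that takes a query string and
    returns a sequence of matching entities.
    """
    results = run_query(query)
    if len(results) > 1:
        raise AmbiguousIdentifiableError(
            f"{len(results)} entities match {query!r}; "
            "check the identifiable definition or the data set"
        )
    return results[0] if results else None


# Stand-in for the server: a fixed mapping from query to result list.
FAKE_SERVER = {
    "FIND RECORD Fish WITH FishNumber=A AND WHICH IS REFERENCED BY B": ["fish-A"],
    "FIND RECORD Fish WITH FishNumber=C": [],
    "FIND RECORD Fish": ["fish-A", "fish-D"],
}


def fake_run_query(query):
    return FAKE_SERVER[query]


found = resolve_identifiable(
    "FIND RECORD Fish WITH FishNumber=A AND WHICH IS REFERENCED BY B",
    fake_run_query,
)
missing = resolve_identifiable("FIND RECORD Fish WITH FishNumber=C", fake_run_query)
```

A return value of None tells the caller that the entity may be created, while the exception surfaces a faulty definition or data set instead of silently picking one of several matches.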

Entity Construction

In the simplest case, an entity is constructed at a given node from its key-value pairs. However, the data for a given entity might be distributed over different levels of the tree.

Two different approaches are possible:

  1. During the construction of an entity at a given node, key-value pairs from other nodes are also used. For example, key-value pairs from parent nodes might be made accessible. Or key-value pairs might be accessed by providing the path to them in the tree.
  2. Information is added to an entity at other nodes. The simplest case uses the identifiable definition to add information: it is checked whether the respective entity already exists on the server; if not, it is inserted, and then the information is added. Additionally, it could be made possible to add information to entities that are constructed in other nodes without the use of the identifiable. For example, it could be allowed to add information to entities that were created at parent nodes.

Alex: I haven't really understood variant 2.
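To contrast the two approaches, here is a minimal sketch of each. All names (Node, visible_values, add_information) are hypothetical, and a plain dictionary keyed by a tuple stands in for the server and for the identifiable lookup.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    values: dict
    parent: Optional[Node] = None


def visible_values(node):
    """Approach 1: the node also sees the key-value pairs of its ancestors.

    Values set closer to the node override those set further up.
    """
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    merged = {}
    for ancestor in reversed(chain):  # root first, the node itself last
        merged.update(ancestor.values)
    return merged


def add_information(server, identifiable, properties):
    """Approach 2, simplest case: find the entity via its identifiable,
    insert it if it does not exist yet, then add the information."""
    entity = server.setdefault(identifiable, {})
    entity.update(properties)
    return entity


folder = Node({"experiment": "exp42", "date": "2021-03-01"})
json_file = Node({"email": "a@example.org"}, parent=folder)

# Approach 1: the entity is built once, at the file node.
built_at_file = visible_values(json_file)

# Approach 2: each node contributes to the same entity separately.
server = {}
add_information(server, ("Experiment", "exp42"), {"date": "2021-03-01"})
add_information(server, ("Experiment", "exp42"), {"email": "a@example.org"})
```

In approach 1 a single node gathers everything it needs. In approach 2 the folder node and the file node each add their part, and the identifiable ensures that both parts end up in the same entity.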

Value computation

It is quite straightforward to set a Property of a Record with a value that is contained in the hierarchical structure. However, the example with the regular expression illustrates that the desired value might not be present directly. For example, the desired value might be firstname+" "+lastname. Since the computation might not be trivial, it is likely that writing code for these computations will be necessary. Still, these would be tiny parts that can probably be unit tested easily. There is also no immediate security risk, since the configuration plus code replace the old scripts (i.e. only code). One could define small functions that are rigorously unit tested and whose names are used in the configuration.
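Referring to small functions by name from the configuration could be sketched like this. The registry, the decorator and the shape of the configuration dictionary are assumptions chosen for illustration, not a proposed configuration format.

```python
# Registry of small, separately unit-tested functions, looked up by name.
FUNCTIONS = {}


def register(func):
    """Make ``func`` available to the configuration under its own name."""
    FUNCTIONS[func.__name__] = func
    return func


@register
def full_name(values):
    return values["firstname"] + " " + values["lastname"]


# Hypothetical configuration: for each Record type, map a Property name
# to the name of the function that computes its value.
CONFIG = {"Person": {"name": "full_name"}}


def compute_properties(record_type, values, config=CONFIG):
    """Compute all configured Properties for one Record from ``values``."""
    return {
        prop: FUNCTIONS[function_name](values)
        for prop, function_name in config[record_type].items()
    }


person = compute_properties("Person", {"firstname": "Ada", "lastname": "Lovelace"})
```

Because the configuration only contains names, an unknown function fails with a KeyError at lookup time, and each registered function can be unit tested on plain dictionaries without running the crawler.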