Commit fb35bab5
authored 3 years ago by Henrik tom Wörden
update
parent 537323b6
1 merge request: !53 Release 0.1
Showing 3 changed files with 85 additions and 2 deletions:
- concept.md (+80, -0)
- src/newcrawler/crawl-alt.py (+3, -2)
- tests/test_crawl.py (+2, -0)
concept.md (new file, mode 100644, +80 −0)
# Crawler 2.0
The current CaosDB crawler has several limitations. For example, the concept of
identifiables cannot incorporate conditions such as referencing entities (only
entities that are being referenced, i.e. the other direction). Another aspect is
that crawler setup shall become easier. This probably means less code (since
coding is error prone). Optimally, setup/configuration can be done using a
visual tool or is (in part) automated.
One approach to these goals would be to

1. generalize some aspects of the crawler (e.g. the identifiable)
2. use a more configuration-based approach that requires as little programming
   as possible
The data structures that we encountered in the past were inherently hierarchical:
folder structures, HDF5 files, JSON files, etc.
The Crawler 2.0 shall be able to treat arbitrary hierarchical structures and
convert them to interconnected Records that are consistent with a predefined
semantic data model.
The configuration must define how the structure is created (for example, does
the content of a file need to be considered and added to the tree?) and how the
structure and its contained data are mapped to the semantic data model (e.g.
the experiment Record uses the data from the folder name and the email address
from a JSON file).
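As a rough, purely illustrative sketch (none of these keys are the project's actual configuration format), such a configuration could look like this:

```python
# Hypothetical configuration sketch, expressed as a Python dict for brevity.
# It only illustrates the two responsibilities named above: (a) how the tree
# is built and (b) how its data maps onto the semantic data model.
EXAMPLE_CONFIG = {
    "ExperimentDir": {                       # treat folders matching this pattern
        "match": r"\d{4}-\d{2}-\d{2}_.*",
        "record": "Experiment",              # create an Experiment Record
        "properties": {
            "date": "taken from the folder name",
        },
        "children": {
            "metadata.json": {               # descend into a JSON file
                "properties": {
                    "email": "taken from a key in the JSON file",
                },
            },
        },
    },
}
```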
## Structure Mapping
In the following, we describe on an abstract level how the above can be done.
The hierarchical structure is assumed to be constituted of a tree of
StructureElements. The tree is created on the fly by so-called Converters which
are defined in the configuration. The tree of StructureElements is a model
of the existing data (for example, a tree of Python file objects
(StructureElements) could represent a file tree that exists on some file
server). Converters treat StructureElements and thereby create the
StructureElements that are the children of the treated StructureElement
(example: a StructureElement represents a folder and a Converter defines that
for each file in the folder another StructureElement is created). Converters
therefore create the above-named tree. The definition of a Converter also
contains which Converters shall be used to treat the generated
child-StructureElements. The definition is therefore a tree itself.
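A minimal Python sketch of this relationship is given below; all class and method names are assumptions made for illustration, not the project's actual API.

```python
# Minimal sketch with hypothetical class names: a Converter expands one
# StructureElement into its child StructureElements.
import os
from abc import ABC, abstractmethod
from typing import List


class StructureElement:
    """A node in the tree that models the existing data."""

    def __init__(self, values: dict):
        self.values = values                       # key-value pairs of this node
        self.children: List["StructureElement"] = []


class Converter(ABC):
    """Treats a StructureElement and creates its child StructureElements."""

    def __init__(self, definition: dict):
        self.definition = definition               # configuration entry, incl. child Converters

    @abstractmethod
    def create_children(self, element: StructureElement) -> List[StructureElement]:
        ...


class DirectoryConverter(Converter):
    """Example: creates one child StructureElement per file in a folder."""

    def create_children(self, element: StructureElement) -> List[StructureElement]:
        path = element.values["path"]
        return [StructureElement({"path": os.path.join(path, name), "filename": name})
                for name in sorted(os.listdir(path))]
```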
> Side discussion
> Question: Should there be global Converters
> that are always checked when treating a StructureElement? Should Converters be
> associated with generated child-StructureElements? Currently, all children are
> created and checked against all Converters. It could be that one would like to
> check file-StructureElements against one set of Converters and
> directory-StructureElements against another.
Each StructureElement in the tree has a set of data values, i.e. a dictionary
of key-value pairs.
Some of those values may be set due to the kind of StructureElement. For example,
a file could always have the file name as such a key-value pair: `'filename': <sth>`.
Converters may define additional functions that create further values. For
example, a regular expression could be used to get a date from a file name.
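As a hedged sketch (the function and key names are invented for illustration), such a value-creating function could look like this:

```python
import re

# Illustrative only: extract a date such as "2021-05-03" from the file name
# and add it as an additional value of the StructureElement.
def add_date_value(values: dict) -> dict:
    match = re.search(r"\d{4}-\d{2}-\d{2}", values["filename"])
    if match:
        values["date"] = match.group(0)
    return values

# add_date_value({"filename": "2021-05-03_measurement.dat"})
# -> {"filename": "2021-05-03_measurement.dat", "date": "2021-05-03"}
```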
## Identifiables
The concept of an identifiable should be broadened to the question of how an
entity can be identified. Suggestion: a unique query defines it.
Example: "FIND RECORD Fish WITH FishNumber=A AND WHICH IS REFERENCED BY B"
Note that the second part would not be a usable condition with the old
identifiable concept.
The query must return 1 or 0 entities. If no entity is returned, the respective
object may be created; if one is returned, it is the one we were looking for.
If more than one is returned, then there is a mistake in the definition or in
the data set.
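The intended semantics could be sketched as follows; `execute_query` is a hypothetical placeholder for however the unique query is actually run against CaosDB, not an existing API.

```python
# Sketch of the proposed identifiable semantics.
def resolve_identifiable(query: str, execute_query):
    """Return the identified entity, or None if it may still be created."""
    results = execute_query(query)  # e.g. "FIND RECORD Fish WITH FishNumber=A AND WHICH IS REFERENCED BY B"
    if len(results) == 0:
        return None                 # no entity yet; the respective object may be created
    if len(results) == 1:
        return results[0]           # the entity we were looking for
    raise ValueError("Identifiable query returned more than one entity: "
                     "the definition or the data set is inconsistent.")
```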
## Value computation
It is quite straightforward how to set a Property of a Record with a value
that is contained in the hierarchical structure. However, the example with the
regular expression illustrates that the desired value might not be directly
present. For example, the desired value might be `firstname+" "+lastname`.
Since the computation might not be trivial, it is likely that writing code for
these computations will be necessary. Still, these would be tiny parts that can
probably be unit-tested easily. There is also no immediate security risk since
the configuration plus code replace the old scripts (i.e. only code). One could
define small functions that are rigorously unit-tested and whose names are used
in the configuration.
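A sketch of how such small, separately tested functions could be referenced by name from the configuration (the registry and all names are hypothetical):

```python
# Illustrative only: tiny, unit-testable value functions that the
# configuration refers to by name.
def full_name(values: dict) -> str:
    return values["firstname"] + " " + values["lastname"]

VALUE_FUNCTIONS = {
    "full_name": full_name,
}

# A configuration entry could then state something like 'compute: full_name'
# instead of embedding code, and full_name() is trivially unit-tested:
assert full_name({"firstname": "Ada", "lastname": "Lovelace"}) == "Ada Lovelace"
```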
src/newcrawler/crawl-alt.py (+3 −2)
...
...
@@ -33,7 +33,7 @@ json files.
This hierarchical structure is assumed to be consituted of a tree of
StructureElements. The tree is created on the fly by so called Converters which
-are defined in a yaml file. The tree of StructureElements is there for a model
+are defined in a yaml file. The tree of StructureElements is a model
of the existing data (For example could a tree of Python file objects
(StructureElements) represent a file tree that exists on some file server).
...
...
@@ -48,7 +48,8 @@ created and checked against all Converters. It could be that one would like to
check file-StructureElements against one set of Converters and
directory-StructureElements against another)
-Each StructureElement in the tree has a set of data values, i.e a dictionary.
+Each StructureElement in the tree has a set of data values, i.e a dictionary of
+key value pairs.
Some of those values are set due to the kind of StructureElement. For example,
a file could have the file name as such a key value pair: 'filename': <sth>.
Converters may define additional functions that create further values. For
...
...
tests/test_crawl.py (+2 −0)
...
...
@@ -147,6 +147,8 @@ toplevel:
        if r.name == "second-exp":
            self.assertEqual(r.get_property("stuff"), None)

    def test_three_level(self):
        definition = """
experiment:
  type: dictionary
...
...