diff --git a/concept.md b/concept.md
new file mode 100644
index 0000000000000000000000000000000000000000..694e4d6363444e0c43f14494d140ad2ffcdf6ffb
--- /dev/null
+++ b/concept.md
@@ -0,0 +1,80 @@
+# Crawler 2.0
+The current CaosDB crawler has several limitations. The concept of
+identifiables, for example, cannot incorporate conditions on
+referencing entities (only on entities that are being referenced, i.e. the other direction).
+Another aspect is that crawler setup should be easier. This should probably
+mean less code (since coding is error prone). Optimally, setup/configuration
+can be done using a visual tool or is (in part) automated.
+
+One approach to these goals would be to
+1. generalize some aspects of the crawler (e.g. the identifiable)
+2. use a more configuration based approach that requires as little programming
+   as possible
+The data structures that we encountered in the past were inherently hierarchical:
+folder structures, HDF5 files, JSON files, etc.
+The Crawler 2.0 shall be able to treat arbitrary hierarchical structures and convert them
+to interconnected Records that are consistent with a predefined semantic data
+model.
+
+The configuration must define how the structure is created (for example, does
+the content of a file need to be considered and added to the tree?) and how
+the structure and its contained data are mapped to the semantic data model (e.g.
+the experiment Record uses the data from the folder name and the email address
+from a JSON file).
+
+
+## Structure Mapping
+In the following, it is described how the above can be done on an abstract level.
+
+The hierarchical structure is assumed to consist of a tree of
+StructureElements. The tree is created on the fly by so-called Converters which
+are defined in the configuration. The tree of StructureElements is a model
+of the existing data (for example, a tree of Python file objects
+(StructureElements) could represent a file tree that exists on some file server).
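A minimal sketch of such a tree, assuming a StructureElement carries a name, a dictionary of data values, and a list of children. The field names and the helper `build_tree` are hypothetical and not taken from the crawler code. The tree is built eagerly here for brevity, whereas the concept foresees Converters creating it on the fly.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class StructureElement:
    """One node of the hierarchical structure (all field names are assumptions)."""
    name: str
    data: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def build_tree(path: Path) -> StructureElement:
    """Recursively mirror a directory as a tree of StructureElements."""
    if path.is_dir():
        element = StructureElement(name=path.name, data={"kind": "directory"})
        # Sort the entries so that the order of children is deterministic.
        for child in sorted(path.iterdir()):
            element.children.append(build_tree(child))
        return element
    # A file always carries its file name as a data value.
    return StructureElement(
        name=path.name,
        data={"kind": "file", "filename": path.name},
    )
```

Calling `build_tree(Path("some/folder"))` returns the root element. A configuration-driven crawler would replace the hard-coded directory logic with Converters.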
+
+Converters treat StructureElements and thereby create the StructureElements that
+are the children of the treated StructureElement (example: a StructureElement
+represents a folder and a Converter defines that for each file in the folder
+another StructureElement is created). Converters therefore create
+the above-named tree. The definition of a Converter also contains which
+Converters shall be used to treat the generated child-StructureElements. The
+definition is therefore a tree itself.
+
+> Side discussion
+> Question: Should there be global Converters
+> that are always checked when treating a StructureElement? Should Converters be
+> associated with generated child-StructureElements? Currently, all children are
+> created and checked against all Converters. It could be that one would like to
+> check file-StructureElements against one set of Converters and
+> directory-StructureElements against another.
+
+Each StructureElement in the tree has a set of data values, i.e. a dictionary
+of key-value pairs.
+Some of those values may be set due to the kind of StructureElement. For example,
+a file could always have the file name as such a key-value pair: 'filename': <sth>.
+Converters may define additional functions that create further values. For
+example, a regular expression could be used to get a date from a file name.
+
+## Identifiables
+The concept of an identifiable should be broadened to the question of how an entity can be
+identified. Suggestion: A unique query defines it.
+Example: "FIND RECORD Fish WITH FishNumber=A AND WHICH IS REFERENCED BY B"
+Note that the second part would not be a usable condition with the old
+identifiable concept.
+The query must return 1 or 0 entities. If no entity is returned, the respective
+object may be created, and if one is returned, it is the one we were looking for.
+If more than one is returned, then there is a mistake in the definition or in
+the data set.
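The zero/one/many rule above could be sketched as a small helper. Because this concept does not specify how a query is executed, the sketch takes the query runner as a callable. `resolve_identifiable`, `AmbiguousIdentifiable`, `run_query`, and `create` are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence, TypeVar

Entity = TypeVar("Entity")


class AmbiguousIdentifiable(Exception):
    """Raised when an identifying query matches more than one entity."""


def resolve_identifiable(
    query: str,
    run_query: Callable[[str], Sequence[Entity]],
    create: Callable[[], Entity],
) -> Entity:
    """Return the entity identified by `query`, creating it if none exists."""
    results = run_query(query)
    if len(results) == 0:
        # Nothing matches: the respective object may be created.
        return create()
    if len(results) == 1:
        # Exactly one match: this is the entity we were looking for.
        return results[0]
    # More than one match: mistake in the definition or in the data set.
    raise AmbiguousIdentifiable(f"{len(results)} entities match: {query}")
```

Passing the query runner in keeps the rule testable without a server. A real implementation would presumably hand the query string to the database client instead.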
+
+## Value computation
+It is quite straightforward to set a Property of a Record to a value
+that is contained in the hierarchical structure. However, the example with the
+regular expression illustrates that the desired value might not be present directly.
+For example, the desired value might be `firstname+" "+lastname`. Since the
+computation might not be trivial, it is likely that writing code for these
+computations will be necessary. Still, these would be tiny parts that probably
+can easily be unit tested. There is also no immediate security risk since the
+configuration plus code replace the old scripts (i.e. only code). One could
+define small functions that are rigorously unit tested and whose names
+are used in the configuration.
+
diff --git a/src/newcrawler/crawl-alt.py b/src/newcrawler/crawl-alt.py
index f71d4dae0e24ae38609b7c7443c990f6f894b9fe..021c3f0af8080adb5e83cf09f9ef9f87888fa29f 100644
--- a/src/newcrawler/crawl-alt.py
+++ b/src/newcrawler/crawl-alt.py
@@ -33,7 +33,7 @@ json files.
 
 This hierarchical structure is assumed to be consituted of a tree of
 StructureElements. The tree is created on the fly by so called Converters which
-are defined in a yaml file. The tree of StructureElements is there for a model
+are defined in a yaml file. The tree of StructureElements is a model
 of the existing data (For example could a tree of Python file objects
 (StructureElements) represent a file tree that exists on some file server).
 
@@ -48,7 +48,8 @@ created and checked against all Converters. It could be that one would like to
 check file-StructureElements against one set of Converters and
 directory-StructureElements against another)
 
-Each StructureElement in the tree has a set of data values, i.e a dictionary.
+Each StructureElement in the tree has a set of data values, i.e a dictionary of
+key value pairs.
 Some of those values are set due to the kind of StructureElement.
For example, a file could have the file name as such a key value pair: 'filename': <sth>. Converters may define additional functions that create further values. For diff --git a/tests/test_crawl.py b/tests/test_crawl.py index 1ccba198b4a918a002fc08b4658dad2f6f04f109..8bc4d14f3884be1cb8d0e76192ad652b2668837f 100644 --- a/tests/test_crawl.py +++ b/tests/test_crawl.py @@ -147,6 +147,8 @@ toplevel: if r.name == "second-exp": self.assertEqual(r.get_property("stuff"), None) + + def test_three_level(self): definition = """ experiment: type: dictionary