Commit 79399f16 authored by Alexander Schlemmer

comments by Alex to concept.md

parent 1d9d7db5
1 merge request: !53 Release 0.1
@@ -2,43 +2,50 @@
The current CaosDB crawler has several limitations. The concept of
identifiables is, for example, not able to incorporate conditions on
referencing entities (only on entities that are being referenced, i.e. the other direction).
Another aspect is that crawler setup should be easier. This should probably
result in less code (since custom and possibly untested code is error-prone).
Optimally, setup/configuration can be done using a visual tool or is (in part) automated.

One approach to these goals would be to:
1. generalize some aspects of the crawler (e.g. the identifiable)
2. use a more configuration-based approach that requires as little programming
   as possible

The data structures that we encountered in the past were inherently hierarchical:
- folder structures
- standardized containers, like HDF5 files
- ASCII "container" formats, like JSON files

The Crawler 2.0 should be able to treat arbitrary hierarchical structures and
convert them to interconnected Records that are consistent with a predefined
semantic data model.

The configuration must define:
- How the structure is created.
  Example: Does the content of a file need to be considered and added to the tree?
- How the structure and its contained data are mapped to the semantic data model.
  Example: The Record "Experiment" will store the data from the folder name and the
  email address from a JSON file as CaosDB properties.

## Structure Mapping

In the following, it is described how the above can be done on an abstract level.

The hierarchical structure is assumed to be constituted of a tree of
StructureElements. The tree is created on the fly by so-called Converters which
are defined in the configuration. The tree of StructureElements is a model
of the existing data.
Example: A tree of Python file objects (StructureElements) could represent a file tree
that exists on some file server.

Converters treat StructureElements and thereby create the StructureElements that
are the children of the treated StructureElement.
Example: A StructureElement represents a folder and a Converter defines that for each file in the folder
another StructureElement is created.
Converters therefore create the above named tree. The definition of a Converter also contains which
Converters shall be used to treat the generated child-StructureElements. The definition is therefore a tree itself.

> Alex: The previous paragraph is difficult to understand. The reference "above named" is a little unclear.

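The interplay of Converters and StructureElements described above might be sketched as follows. This is only an illustration under stated assumptions: all class and function names (`StructureElement`, `Converter`, `DirectoryConverter`, `FileConverter`, `build_tree`) are hypothetical and not part of an existing CaosDB API, and a nested dict stands in for a file tree on a file server.

```python
from dataclasses import dataclass, field


@dataclass
class StructureElement:
    """A node in the model of the existing data (hypothetical class)."""
    name: str
    payload: object = None              # e.g. a folder listing or file content
    values: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


class Converter:
    """Treats one StructureElement and creates its children (hypothetical)."""

    def __init__(self, child_converters=None):
        # The definition names the Converters for the generated children,
        # so the definition is itself a tree.
        self.child_converters = child_converters or []

    def matches(self, element):
        raise NotImplementedError

    def create_children(self, element):
        return []


class DirectoryConverter(Converter):
    def matches(self, element):
        return isinstance(element.payload, dict)

    def create_children(self, element):
        # One child StructureElement per entry in the folder.
        return [StructureElement(name, payload)
                for name, payload in element.payload.items()]


class FileConverter(Converter):
    def matches(self, element):
        return isinstance(element.payload, str)


def build_tree(element, converters):
    """Create the StructureElement tree on the fly."""
    for converter in converters:
        if converter.matches(element):
            element.children = converter.create_children(element)
            for child in element.children:
                build_tree(child, converter.child_converters)
            break
    return element


# Assumed stand-in for a file tree: dicts are folders, strings are file contents.
file_tree = {"2021_experiment": {"readme.json": '{"email": "a@example.org"}'}}

# Recursive definition: folders may contain folders or files.
dir_converter = DirectoryConverter()
dir_converter.child_converters = [dir_converter, FileConverter()]

root = build_tree(StructureElement("root", file_tree), [dir_converter])
```

In this sketch the Converter definition refers to itself for nested folders, which is one way to read "the definition is therefore a tree itself" for structures of unknown depth.
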
> Side discussion
> Question: Should there be global Converters
@@ -47,41 +54,50 @@ definition is there a tree itself.
> created and checked against all Converters. It could be that one would like to
> check file-StructureElements against one set of Converters and
> directory-StructureElements against another)
>
> Alex' opinion: I would rather go for a macro/variable/template-based solution, so that the employment of a globally predefined
> converter is explicitly mentioned instead of "silently and automatically" applied.

Each StructureElement in the tree has a set of data values, i.e. a dictionary
of key-value pairs.
Some of those values may be set due to the kind of StructureElement. For example,
a file could always have the file name as such a key-value pair: 'filename': <sth>.
Converters may define additional functions that create further values. For
example, a regular expression could be used to get a date from a file name.

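A minimal sketch of such key-value pairs, assuming a file-name convention in which the name contains an ISO date. The function name `file_values` and the key `date` are chosen for illustration only.

```python
import re

# Assumed file-name convention: the name contains an ISO date (YYYY-MM-DD).
DATE_PATTERN = re.compile(r"(?P<date>\d{4}-\d{2}-\d{2})")


def file_values(filename):
    """Return the key-value pairs of a file StructureElement (hypothetical)."""
    values = {"filename": filename}   # set due to the kind of StructureElement
    match = DATE_PATTERN.search(filename)
    if match is not None:             # the desired value might not be present
        values["date"] = match.group("date")
    return values


with_date = file_values("2021-03-15_measurement.dat")
without_date = file_values("notes.txt")
```
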
## Identifiables

The concept of an identifiable should be broadened to how an entity can be
identified. Suggestion: Definition through a unique query.
Example: "FIND RECORD Fish WITH FishNumber=A AND WHICH IS REFERENCED BY B"
Note that the second part cannot be specified as a condition with the old
identifiable concept.
The query must return 1 or 0 entities. If no entity is returned, the respective
object may be created, and if one is returned, it is the one we were looking for.
If more than one is returned, then there is a mistake in the definition or in
the data set. It is the responsibility of the designer of the Query for the identifiable
to make sure that it returns either zero or one Entity.

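The zero-or-one rule could be enforced along the following lines. This is a sketch, not an existing API: `resolve_identifiable`, `AmbiguousIdentifiable` and the `run_query` callable are hypothetical, a dict stands in for the server, and the second and third query strings are made up for illustration.

```python
class AmbiguousIdentifiable(Exception):
    """Raised when an identifying query matches more than one entity."""


def resolve_identifiable(query, run_query):
    """Return the identified entity, or None if it still has to be created.

    `run_query` is a hypothetical callable that executes the query string
    and returns a list of matching entities.
    """
    results = run_query(query)
    if len(results) > 1:
        raise AmbiguousIdentifiable(
            f"{len(results)} entities match {query!r}; "
            "check the identifiable definition or the data set")
    return results[0] if results else None


# Stand-in for the server: maps query strings to result lists.
fake_server = {
    "FIND RECORD Fish WITH FishNumber=A AND WHICH IS REFERENCED BY B": ["fish-1"],
    "FIND RECORD Fish WITH FishNumber=C": [],
    "FIND RECORD Fish": ["fish-1", "fish-2"],
}


def run_fake_query(query):
    return fake_server[query]


found = resolve_identifiable(
    "FIND RECORD Fish WITH FishNumber=A AND WHICH IS REFERENCED BY B",
    run_fake_query)
missing = resolve_identifiable("FIND RECORD Fish WITH FishNumber=C",
                               run_fake_query)
```

Raising an exception for more than one result keeps the mistake visible instead of silently picking one of the matches.
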
## Entity Construction

In the simplest case an entity is constructed at a given node from its
key-value pairs. However, the data for a given entity might be distributed over different levels of
the tree.

Two different approaches are possible:
1. During the construction of an entity at a given node also key-value pairs
   from other nodes are used. For example, key-value pairs from parent nodes might
   be made accessible. Or key-value pairs might be accessed by providing the path
   to them in the tree.
2. Information is added to an entity at other nodes. The simplest case uses the
   identifiable definition to add information. I.e. it is checked whether the
   respective entity does already exist in the server, if not it is inserted and
   then the information is added.
   Additionally, it could be made possible to add information to entities that are
   constructed in other nodes without the use of the identifiable. For example,
   it could be allowed to add information to entities that were created at parent
   nodes.

> Alex: I haven't really understood the variant at 2.

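One possible reading of the two approaches, sketched in Python. Everything here is an assumption made for illustration: entities are plain dicts, a dict keyed by an identifiable string stands in for the server, and the function names are hypothetical.

```python
# Approach 1: the node's own key-value pairs plus pairs from its ancestors.
def construct_entity(path_values):
    """Merge key-value pairs along the path from the root to the current node.

    `path_values` is a list of dicts, root first; values of deeper nodes
    take precedence (an assumption made for this sketch).
    """
    entity = {}
    for values in path_values:
        entity.update(values)
    return entity


# Approach 2: another node adds information to an entity that is found
# (or inserted) via its identifiable.
def add_information(server, identifiable, new_values):
    """`server` is a dict standing in for the database (hypothetical)."""
    entity = server.setdefault(identifiable, {})   # insert if it does not exist
    entity.update(new_values)                      # then add the information
    return entity


experiment = construct_entity([
    {"experiment_name": "2021_experiment"},        # from the folder node
    {"email": "a@example.org"},                    # from the JSON file node
])

server = {}
add_information(server, "Experiment 2021_experiment", {"email": "a@example.org"})
add_information(server, "Experiment 2021_experiment", {"date": "2021-03-15"})
```

In the first approach a single node pulls data together; in the second, several nodes each push their part to the same entity, which is located through the identifiable.
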
## Value computation

It is quite straightforward how to set a Property of a Record with a value
@@ -90,8 +106,8 @@ regular expression illustrates that the desired value might not be present.
For example, the desired value might be `firstname+" "+lastname`. Since the
computation might not be trivial, it is likely that writing code for these
computations might be necessary. Still, these would be tiny parts that probably
can easily be unit tested. There is also no immediate security risk since the
configuration plus code replace the old scripts (i.e. only code). One could
define small functions that are rigorously unit tested and the function names
are used in the configuration.
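
Referring to small functions by name from the configuration might look like the following sketch. The registry `VALUE_FUNCTIONS`, the decorator `value_function` and the layout of `config_entry` are assumptions for illustration, not an existing configuration format.

```python
# Registry of small, individually unit-testable functions (hypothetical).
VALUE_FUNCTIONS = {}


def value_function(func):
    """Register `func` under its name so the configuration can refer to it."""
    VALUE_FUNCTIONS[func.__name__] = func
    return func


@value_function
def full_name(values):
    return values["firstname"] + " " + values["lastname"]


def compute_value(config_entry, values):
    """Look up the function named in the configuration and apply it."""
    return VALUE_FUNCTIONS[config_entry["function"]](values)


# Assumed configuration fragment: only the function name appears in it.
config_entry = {"property": "name", "function": "full_name"}
result = compute_value(config_entry, {"firstname": "Ada", "lastname": "Lovelace"})
```

Because the configuration can only name registered functions, the code that runs stays limited to the small, tested parts.
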