Commit fb35bab5
authored 3 years ago by Henrik tom Wörden
update
parent 537323b6
1 merge request: !53 Release 0.1
Showing 3 changed files with 85 additions and 2 deletions:
- concept.md (+80, -0)
- src/newcrawler/crawl-alt.py (+3, -2)
- tests/test_crawl.py (+2, -0)
concept.md (new file, mode 100644, +80 −0)
# Crawler 2.0
The current CaosDB crawler has several limitations. For example, the concept of
identifiables cannot incorporate conditions such as referencing entities (only
entities that are being referenced, i.e. the other direction). Another aspect is
that crawler setup shall become easier. This probably means less code (since
coding is error prone). Optimally, setup/configuration can be done using a
visual tool or is (in part) automated.
One approach to these goals would be to

1. generalize some aspects of the crawler (e.g. the identifiable)
2. use a more configuration-based approach that requires as little programming
   as possible
The data structures that we encountered in the past were inherently hierarchical:
folder structures, HDF5 files, JSON files, etc.
The Crawler 2.0 shall be able to treat arbitrary hierarchical structures and
convert them to interconnected Records that are consistent with a predefined
semantic data model.
The configuration must define how the structure is created (for example, does
the content of a file need to be considered and added to the tree?) and how the
structure and its contained data are mapped to the semantic data model (e.g.
the experiment Record uses the data from the folder name and the email address
from a JSON file).
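As a rough, purely illustrative sketch (none of these keys are the project's actual configuration format), such a configuration could look like this:

```python
# Hypothetical configuration sketch, expressed as a Python dict for brevity.
# It only illustrates the two responsibilities named above: (a) how the tree
# is built and (b) how its data maps onto the semantic data model.
EXAMPLE_CONFIG = {
    "ExperimentDir": {                       # treat folders matching this pattern
        "match": r"\d{4}-\d{2}-\d{2}_.*",
        "record": "Experiment",              # create an Experiment Record
        "properties": {
            "date": "taken from the folder name",
        },
        "children": {
            "metadata.json": {               # descend into a JSON file
                "properties": {
                    "email": "taken from a key in the JSON file",
                },
            },
        },
    },
}
```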
## Structure Mapping
In the following, we describe on an abstract level how the above can be done.
The hierarchical structure is assumed to be constituted of a tree of
StructureElements. The tree is created on the fly by so-called Converters which
are defined in the configuration. The tree of StructureElements is a model
of the existing data (for example, a tree of Python file objects
(StructureElements) could represent a file tree that exists on some file
server). Converters treat StructureElements and thereby create the
StructureElements that are the children of the treated StructureElement
(example: a StructureElement represents a folder and a Converter defines that
for each file in the folder another StructureElement is created). Converters
therefore create the above-named tree. The definition of a Converter also
contains which Converters shall be used to treat the generated
child-StructureElements. The definition is therefore a tree itself.
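A minimal Python sketch of this relationship is given below; all class and method names are assumptions made for illustration, not the project's actual API.

```python
# Minimal sketch with hypothetical class names: a Converter expands one
# StructureElement into its child StructureElements.
import os
from abc import ABC, abstractmethod
from typing import List


class StructureElement:
    """A node in the tree that models the existing data."""

    def __init__(self, values: dict):
        self.values = values                       # key-value pairs of this node
        self.children: List["StructureElement"] = []


class Converter(ABC):
    """Treats a StructureElement and creates its child StructureElements."""

    def __init__(self, definition: dict):
        self.definition = definition               # configuration entry, incl. child Converters

    @abstractmethod
    def create_children(self, element: StructureElement) -> List[StructureElement]:
        ...


class DirectoryConverter(Converter):
    """Example: creates one child StructureElement per file in a folder."""

    def create_children(self, element: StructureElement) -> List[StructureElement]:
        path = element.values["path"]
        return [StructureElement({"path": os.path.join(path, name), "filename": name})
                for name in sorted(os.listdir(path))]
```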
> Side discussion
> Question: Should there be global Converters
> that are always checked when treating a StructureElement? Should Converters be
> associated with generated child-StructureElements? Currently, all children are
> created and checked against all Converters. It could be that one would like to
> check file-StructureElements against one set of Converters and
> directory-StructureElements against another.
Each StructureElement in the tree has a set of data values, i.e. a dictionary
of key-value pairs.
Some of those values may be set due to the kind of StructureElement. For example,
a file could always have the file name as such a key-value pair: `'filename': <sth>`.
Converters may define additional functions that create further values. For
example, a regular expression could be used to get a date from a file name.
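As a hedged sketch (the function and key names are invented for illustration), such a value-creating function could look like this:

```python
import re

# Illustrative only: extract a date such as "2021-05-03" from the file name
# and add it as an additional value of the StructureElement.
def add_date_value(values: dict) -> dict:
    match = re.search(r"\d{4}-\d{2}-\d{2}", values["filename"])
    if match:
        values["date"] = match.group(0)
    return values

# add_date_value({"filename": "2021-05-03_measurement.dat"})
# -> {"filename": "2021-05-03_measurement.dat", "date": "2021-05-03"}
```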
## Identifiables
The concept of an identifiable should be broadened to the question of how an
entity can be identified. Suggestion: a unique query defines it.
Example: "FIND RECORD Fish WITH FishNumber=A AND WHICH IS REFERENCED BY B"
Note that the second part would not be a usable condition with the old
identifiable concept.
The query must return 1 or 0 entities. If no entity is returned, the respective
object may be created; if one is returned, it is the one we were looking for.
If more than one is returned, then there is a mistake in the definition or in
the data set.
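The intended semantics could be sketched as follows; `execute_query` is a hypothetical placeholder for however the unique query is actually run against CaosDB, not an existing API.

```python
# Sketch of the proposed identifiable semantics.
def resolve_identifiable(query: str, execute_query):
    """Return the identified entity, or None if it may still be created."""
    results = execute_query(query)  # e.g. "FIND RECORD Fish WITH FishNumber=A AND WHICH IS REFERENCED BY B"
    if len(results) == 0:
        return None                 # no entity yet; the respective object may be created
    if len(results) == 1:
        return results[0]           # the entity we were looking for
    raise ValueError("Identifiable query returned more than one entity: "
                     "the definition or the data set is inconsistent.")
```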
## Value computation
It is quite straightforward how to set a Property of a Record with a value
that is contained in the hierarchical structure. However, the example with the
regular expression illustrates that the desired value might not be directly
present. For example, the desired value might be `firstname+" "+lastname`.
Since the computation might not be trivial, it is likely that writing code for
these computations will be necessary. Still, these would be tiny parts that can
probably be unit-tested easily. There is also no immediate security risk since
the configuration plus code replace the old scripts (i.e. only code). One could
define small functions that are rigorously unit-tested and whose names are used
in the configuration.
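A sketch of how such small, separately tested functions could be referenced by name from the configuration (the registry and all names are hypothetical):

```python
# Illustrative only: tiny, unit-testable value functions that the
# configuration refers to by name.
def full_name(values: dict) -> str:
    return values["firstname"] + " " + values["lastname"]

VALUE_FUNCTIONS = {
    "full_name": full_name,
}

# A configuration entry could then state something like 'compute: full_name'
# instead of embedding code, and full_name() is trivially unit-tested:
assert full_name({"firstname": "Ada", "lastname": "Lovelace"}) == "Ada Lovelace"
```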
src/newcrawler/crawl-alt.py (+3 −2)
...
...
@@ -33,7 +33,7 @@ json files.
This hierarchical structure is assumed to be consituted of a tree of
StructureElements. The tree is created on the fly by so called Converters which
-are defined in a yaml file. The tree of StructureElements is there for a model
+are defined in a yaml file. The tree of StructureElements is a model
of the existing data (For example could a tree of Python file objects
(StructureElements) represent a file tree that exists on some file server).
...
...
@@ -48,7 +48,8 @@ created and checked against all Converters. It could be that one would like to
check file-StructureElements against one set of Converters and
directory-StructureElements against another)
-Each StructureElement in the tree has a set of data values, i.e a dictionary.
+Each StructureElement in the tree has a set of data values, i.e a dictionary of
+key value pairs.
Some of those values are set due to the kind of StructureElement. For example,
a file could have the file name as such a key value pair: 'filename': <sth>.
Converters may define additional functions that create further values. For
...
...
tests/test_crawl.py (+2 −0)
...
...
@@ -147,6 +147,8 @@ toplevel:
        if r.name == "second-exp":
            self.assertEqual(r.get_property("stuff"), None)

    def test_three_level(self):
        definition = """
experiment:
  type: dictionary
...
...