Commit 41a8de9d authored by Henrik tom Wörden's avatar Henrik tom Wörden

Merge branch 'dev' into f-json-specification-doc

parents 3ca144bd 0bdf4ed0
Related merge requests: !222 Release 0.12.0, !74 F json specification doc
Showing 994 additions and 501 deletions
...@@ -17,3 +17,4 @@ src/doc/_apidoc/ ...@@ -17,3 +17,4 @@ src/doc/_apidoc/
start_caosdb_docker.sh start_caosdb_docker.sh
src/doc/_apidoc src/doc/_apidoc
/dist/ /dist/
*.egg-info
...@@ -120,10 +120,10 @@ unittest_py3.9: ...@@ -120,10 +120,10 @@ unittest_py3.9:
script: script:
- tox - tox
unittest_py3.8: unittest_py3.7:
tags: [cached-dind] tags: [cached-dind]
stage: test stage: test
image: python:3.8 image: python:3.7
script: &python_test_script script: &python_test_script
# install dependencies # install dependencies
- pip install pytest pytest-cov - pip install pytest pytest-cov
...@@ -135,12 +135,24 @@ unittest_py3.8: ...@@ -135,12 +135,24 @@ unittest_py3.8:
- caosdb-crawler --help - caosdb-crawler --help
- pytest --cov=caosdb -vv ./unittests - pytest --cov=caosdb -vv ./unittests
unittest_py3.8:
tags: [cached-dind]
stage: test
image: python:3.8
script: *python_test_script
unittest_py3.10: unittest_py3.10:
tags: [cached-dind] tags: [cached-dind]
stage: test stage: test
image: python:3.10 image: python:3.10
script: *python_test_script script: *python_test_script
unittest_py3.11:
tags: [cached-dind]
stage: test
image: python:3.11
script: *python_test_script
inttest: inttest:
tags: [docker] tags: [docker]
services: services:
...@@ -277,3 +289,27 @@ style: ...@@ -277,3 +289,27 @@ style:
script: script:
- autopep8 -r --diff --exit-code . - autopep8 -r --diff --exit-code .
allow_failure: true allow_failure: true
# Build the sphinx documentation and make it ready for deployment by Gitlab Pages
# Special job for serving a static website. See https://docs.gitlab.com/ee/ci/yaml/README.html#pages
# Based on: https://gitlab.indiscale.com/caosdb/src/caosdb-pylib/-/ci/editor?branch_name=main
pages_prepare: &pages_prepare
tags: [ cached-dind ]
stage: deploy
needs: []
image: $CI_REGISTRY/caosdb/src/caosdb-pylib/testenv:latest
only:
refs:
- /^release-.*$/i
script:
- echo "Deploying documentation"
- make doc
- cp -r build/doc/html public
artifacts:
paths:
- public
pages:
<<: *pages_prepare
only:
refs:
- main
...@@ -7,12 +7,43 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ...@@ -7,12 +7,43 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased] ## ## [Unreleased] ##
### Added ###
- DateElementConverter: allows interpreting text as a date object
- The restricted_path argument allows crawling only a subtree
### Changed ###
### Deprecated ###
### Removed ###
### Fixed ###
- An empty string as a name is treated as no name, matching the server's behavior. This fixes
  queries for identifiables, which would otherwise contain "WITH name=''",
  an impossible condition. If your cfoods relied on this case, they were ill-defined.
### Security ###
### Documentation ###
## [0.3.0] - 2022-01-30 ##
(Florian Spreckelsen)
### Added ### ### Added ###
- Identifiable class to represent the information used to identify Records. - Identifiable class to represent the information used to identify Records.
- Added some StructureElements: BooleanElement, FloatElement, IntegerElement, - Added some StructureElements: BooleanElement, FloatElement, IntegerElement,
ListElement, DictElement ListElement, DictElement
- String representation for Identifiables - String representation for Identifiables
- [#43](https://gitlab.com/caosdb/caosdb-crawler/-/issues/43) the crawler
version can now be specified in the `metadata` section of the cfood
definition. It is checked against the installed version upon loading of the
definition.
- JSON schema validation can also be used in the DictElementConverter
- YAMLFileConverter class to parse YAML files
- Variables can now be substituted within the definition of yaml macros
- debugging option for the match step of Converters
- Re-introduced support for Python 3.7
### Changed ### ### Changed ###
...@@ -20,23 +51,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ...@@ -20,23 +51,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Dict, DictElement and DictDictElement were merged into DictElement. - Dict, DictElement and DictDictElement were merged into DictElement.
- DictTextElement and TextElement were merged into TextElement. The "match" - DictTextElement and TextElement were merged into TextElement. The "match"
keyword is now invalid for TextElements. keyword is now invalid for TextElements.
- JSONFileConverter creates another level of StructureElements (see "How to upgrade" in the docs)
- create_flat_list function now collects entities in a set and also adds the entities
contained in the given list directly
### Deprecated ### ### Deprecated ###
- The DictXYElements are now depricated and are now synonyms for the - The DictXYElements are now depricated and are now synonyms for the
XYElements. XYElements.
### Removed ###
### Fixed ### ### Fixed ###
- [#39](https://gitlab.com/caosdb/caosdb-crawler/-/issues/39) Merge conflicts in - [#39](https://gitlab.com/caosdb/caosdb-crawler/-/issues/39) Merge conflicts in
`split_into_inserts_and_updates` when cached entity references a record `split_into_inserts_and_updates` when cached entity references a record
without id without id
- Queries for identifiables with boolean properties are now created correctly.
### Security ###
### Documentation ###
## [0.2.0] - 2022-11-18 ## ## [0.2.0] - 2022-11-18 ##
(Florian Spreckelsen) (Florian Spreckelsen)
......
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Fitschen
given-names: Timm
orcid: https://orcid.org/0000-0002-4022-432X
- family-names: Schlemmer
given-names: Alexander
orcid: https://orcid.org/0000-0003-4124-9649
- family-names: Hornung
given-names: Daniel
orcid: https://orcid.org/0000-0002-7846-6375
- family-names: tom Wörden
given-names: Henrik
orcid: https://orcid.org/0000-0002-5549-578X
- family-names: Parlitz
given-names: Ulrich
orcid: https://orcid.org/0000-0003-3058-1435
- family-names: Luther
given-names: Stefan
orcid: https://orcid.org/0000-0001-7214-8125
title: CaosDB - Crawler
version: 0.3.0
doi: 10.3390/data4020083
date-released: 2023-01-30
\ No newline at end of file
...@@ -24,6 +24,7 @@ guidelines of the CaosDB Project ...@@ -24,6 +24,7 @@ guidelines of the CaosDB Project
- `version` variables in `src/doc/conf.py` - `version` variables in `src/doc/conf.py`
- Version in [setup.cfg](./setup.cfg): Check the `MAJOR`, `MINOR`, `MICRO`, `PRE` variables and set - Version in [setup.cfg](./setup.cfg): Check the `MAJOR`, `MINOR`, `MICRO`, `PRE` variables and set
`ISRELEASED` to `True`. Use the possibility to issue pre-release versions for testing. `ISRELEASED` to `True`. Use the possibility to issue pre-release versions for testing.
- `CITATION.cff` (update version and date)
5. Merge the release branch into the main branch. 5. Merge the release branch into the main branch.
......
...@@ -31,6 +31,10 @@ Data: ...@@ -31,6 +31,10 @@ Data:
type: JSONFile type: JSONFile
match: .dataspace.json match: .dataspace.json
validate: schema/dataspace.schema.json validate: schema/dataspace.schema.json
subtree:
jsondict:
type: DictElement
match: .*
subtree: subtree:
dataspace_id_element: dataspace_id_element:
type: IntegerElement type: IntegerElement
...@@ -150,6 +154,10 @@ Data: ...@@ -150,6 +154,10 @@ Data:
type: JSONFile type: JSONFile
match: metadata.json match: metadata.json
validate: schema/dataset.schema.json validate: schema/dataset.schema.json
subtree:
jsondict:
type: DictElement
match: .*
subtree: subtree:
title_element: title_element:
type: TextElement type: TextElement
......
[pytest]
testpaths=unittests
[metadata] [metadata]
name = caoscrawler name = caoscrawler
version = 0.2.1 version = 0.3.1
author = Alexander Schlemmer author = Alexander Schlemmer
author_email = alexander.schlemmer@ds.mpg.de author_email = alexander.schlemmer@ds.mpg.de
description = A new crawler for caosdb description = A new crawler for caosdb
...@@ -17,15 +17,16 @@ classifiers = ...@@ -17,15 +17,16 @@ classifiers =
package_dir = package_dir =
= src = src
packages = find: packages = find:
python_requires = >=3.8 python_requires = >=3.7
install_requires = install_requires =
importlib-resources importlib-resources
caosdb > 0.10.0 caosdb >= 0.11.0
caosadvancedtools >= 0.6.0 caosadvancedtools >= 0.6.0
yaml-header-tools >= 0.2.1 yaml-header-tools >= 0.2.1
pyyaml pyyaml
odfpy #make optional odfpy #make optional
pandas pandas
importlib_metadata;python_version<'3.8'
[options.packages.find] [options.packages.find]
where = src where = src
......
from .crawl import Crawler, SecurityMode from .crawl import Crawler, SecurityMode
from .version import CfoodRequiredVersionError, version as __version__
...@@ -27,6 +27,7 @@ cfood: ...@@ -27,6 +27,7 @@ cfood:
- BooleanElement - BooleanElement
- Definitions - Definitions
- Dict - Dict
- Date
- JSONFile - JSONFile
- CSVTableConverter - CSVTableConverter
- XLSXTableConverter - XLSXTableConverter
......
This diff is collapsed.
...@@ -55,7 +55,7 @@ from caosdb.apiutils import (compare_entities, EntityMergeConflictError, ...@@ -55,7 +55,7 @@ from caosdb.apiutils import (compare_entities, EntityMergeConflictError,
merge_entities) merge_entities)
from caosdb.common.datatype import is_reference from caosdb.common.datatype import is_reference
from .converters import Converter, DirectoryConverter from .converters import Converter, DirectoryConverter, ConverterValidationError
from .identifiable import Identifiable from .identifiable import Identifiable
from .identifiable_adapters import (IdentifiableAdapter, from .identifiable_adapters import (IdentifiableAdapter,
LocalStorageIdentifiableAdapter, LocalStorageIdentifiableAdapter,
...@@ -63,7 +63,8 @@ from .identifiable_adapters import (IdentifiableAdapter, ...@@ -63,7 +63,8 @@ from .identifiable_adapters import (IdentifiableAdapter,
from .identified_cache import IdentifiedCache from .identified_cache import IdentifiedCache
from .macros import defmacro_constructor, macro_constructor from .macros import defmacro_constructor, macro_constructor
from .stores import GeneralStore, RecordStore from .stores import GeneralStore, RecordStore
from .structure_elements import StructureElement, Directory from .structure_elements import StructureElement, Directory, NoneElement
from .version import check_cfood_version
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
...@@ -255,12 +256,17 @@ class Crawler(object): ...@@ -255,12 +256,17 @@ class Crawler(object):
if len(crawler_definitions) == 1: if len(crawler_definitions) == 1:
# Simple case, just one document: # Simple case, just one document:
crawler_definition = crawler_definitions[0] crawler_definition = crawler_definitions[0]
metadata = {}
elif len(crawler_definitions) == 2: elif len(crawler_definitions) == 2:
metadata = crawler_definitions[0]["metadata"] if "metadata" in crawler_definitions[0] else {
}
crawler_definition = crawler_definitions[1] crawler_definition = crawler_definitions[1]
else: else:
raise RuntimeError( raise RuntimeError(
"Crawler definition must not contain more than two documents.") "Crawler definition must not contain more than two documents.")
check_cfood_version(metadata)
# TODO: at this point this function can already load the cfood schema extensions # TODO: at this point this function can already load the cfood schema extensions
# from the crawler definition and add them to the yaml schema that will be # from the crawler definition and add them to the yaml schema that will be
# tested in the next lines of code: # tested in the next lines of code:
...@@ -275,8 +281,8 @@ class Crawler(object): ...@@ -275,8 +281,8 @@ class Crawler(object):
schema["cfood"]["$defs"]["converter"]["properties"]["type"]["enum"].append( schema["cfood"]["$defs"]["converter"]["properties"]["type"]["enum"].append(
key) key)
if len(crawler_definitions) == 2: if len(crawler_definitions) == 2:
if "Converters" in crawler_definitions[0]["metadata"]: if "Converters" in metadata:
for key in crawler_definitions[0]["metadata"]["Converters"]: for key in metadata["Converters"]:
schema["cfood"]["$defs"]["converter"]["properties"]["type"]["enum"].append( schema["cfood"]["$defs"]["converter"]["properties"]["type"]["enum"].append(
key) key)
...@@ -300,6 +306,8 @@ class Crawler(object): ...@@ -300,6 +306,8 @@ class Crawler(object):
definition[key] = os.path.join( definition[key] = os.path.join(
os.path.dirname(definition_path), value) os.path.dirname(definition_path), value)
if not os.path.isfile(definition[key]): if not os.path.isfile(definition[key]):
# TODO(henrik) capture this in `crawler_main` similar to
# `ConverterValidationError`.
raise FileNotFoundError( raise FileNotFoundError(
f"Couldn't find validation file {definition[key]}") f"Couldn't find validation file {definition[key]}")
elif isinstance(value, dict): elif isinstance(value, dict):
...@@ -339,6 +347,9 @@ class Crawler(object): ...@@ -339,6 +347,9 @@ class Crawler(object):
"JSONFile": { "JSONFile": {
"converter": "JSONFileConverter", "converter": "JSONFileConverter",
"package": "caoscrawler.converters"}, "package": "caoscrawler.converters"},
"YAMLFile": {
"converter": "YAMLFileConverter",
"package": "caoscrawler.converters"},
"CSVTableConverter": { "CSVTableConverter": {
"converter": "CSVTableConverter", "converter": "CSVTableConverter",
"package": "caoscrawler.converters"}, "package": "caoscrawler.converters"},
...@@ -363,6 +374,9 @@ class Crawler(object): ...@@ -363,6 +374,9 @@ class Crawler(object):
"TextElement": { "TextElement": {
"converter": "TextElementConverter", "converter": "TextElementConverter",
"package": "caoscrawler.converters"}, "package": "caoscrawler.converters"},
"Date": {
"converter": "DateElementConverter",
"package": "caoscrawler.converters"},
"DictIntegerElement": { "DictIntegerElement": {
"converter": "IntegerElementConverter", "converter": "IntegerElementConverter",
"package": "caoscrawler.converters"}, "package": "caoscrawler.converters"},
...@@ -406,11 +420,16 @@ class Crawler(object): ...@@ -406,11 +420,16 @@ class Crawler(object):
value["class"] = getattr(module, value["converter"]) value["class"] = getattr(module, value["converter"])
return converter_registry return converter_registry
def crawl_directory(self, dirname: str, crawler_definition_path: str): def crawl_directory(self, dirname: str, crawler_definition_path: str,
restricted_path: Optional[list[str]] = None):
""" Crawl a single directory. """ Crawl a single directory.
Convenience function that starts the crawler (calls start_crawling) Convenience function that starts the crawler (calls start_crawling)
with a single directory as the StructureElement. with a single directory as the StructureElement.
restricted_path: optional, list of strings
Traverse the data tree only along the given path. When the end of the given path
is reached, traverse the full tree as normal.
""" """
crawler_definition = self.load_definition(crawler_definition_path) crawler_definition = self.load_definition(crawler_definition_path)
...@@ -433,7 +452,9 @@ class Crawler(object): ...@@ -433,7 +452,9 @@ class Crawler(object):
self.start_crawling(Directory(dir_structure_name, self.start_crawling(Directory(dir_structure_name,
dirname), dirname),
crawler_definition, crawler_definition,
converter_registry) converter_registry,
restricted_path=restricted_path
)
@staticmethod @staticmethod
def initialize_converters(crawler_definition: dict, converter_registry: dict): def initialize_converters(crawler_definition: dict, converter_registry: dict):
...@@ -461,7 +482,8 @@ class Crawler(object): ...@@ -461,7 +482,8 @@ class Crawler(object):
def start_crawling(self, items: Union[list[StructureElement], StructureElement], def start_crawling(self, items: Union[list[StructureElement], StructureElement],
crawler_definition: dict, crawler_definition: dict,
converter_registry: dict): converter_registry: dict,
restricted_path: Optional[list[str]] = None):
""" """
Start point of the crawler recursion. Start point of the crawler recursion.
...@@ -473,6 +495,9 @@ class Crawler(object): ...@@ -473,6 +495,9 @@ class Crawler(object):
crawler_definition : dict crawler_definition : dict
A dictionary representing the crawler definition, possibly from a yaml A dictionary representing the crawler definition, possibly from a yaml
file. file.
restricted_path: optional, list of strings
Traverse the data tree only along the given path. When the end of the given path
is reached, traverse the full tree as normal.
Returns Returns
------- -------
...@@ -489,14 +514,18 @@ class Crawler(object): ...@@ -489,14 +514,18 @@ class Crawler(object):
items = [items] items = [items]
self.run_id = uuid.uuid1() self.run_id = uuid.uuid1()
local_converters = Crawler.initialize_converters( local_converters = Crawler.initialize_converters(crawler_definition, converter_registry)
crawler_definition, converter_registry)
# This recursive crawling procedure generates the update list: # This recursive crawling procedure generates the update list:
self.crawled_data: list[db.Record] = [] self.crawled_data: list[db.Record] = []
self._crawl(items, local_converters, self.generalStore, self._crawl(
self.recordStore, [], []) items=items,
local_converters=local_converters,
generalStore=self.generalStore,
recordStore=self.recordStore,
structure_elements_path=[],
converters_path=[],
restricted_path=restricted_path)
if self.debug: if self.debug:
self.debug_converters = local_converters self.debug_converters = local_converters
...@@ -546,14 +575,20 @@ class Crawler(object): ...@@ -546,14 +575,20 @@ class Crawler(object):
return False return False
@staticmethod @staticmethod
def create_flat_list(ent_list: list[db.Entity], flat: list[db.Entity]): def create_flat_list(ent_list: list[db.Entity], flat: Optional[list[db.Entity]] = None):
""" """
Recursively adds all properties contained in entities from ent_list to Recursively adds entities and all their properties contained in ent_list to
the output list flat. Each element will only be added once to the list. the output list flat.
TODO: This function will be moved to pylib as it is also needed by the TODO: This function will be moved to pylib as it is also needed by the
high level API. high level API.
""" """
# Note: A set would be useful here, but we do not want a random order.
if flat is None:
flat = list()
for el in ent_list:
if el not in flat:
flat.append(el)
for ent in ent_list: for ent in ent_list:
for p in ent.properties: for p in ent.properties:
# For lists append each element that is of type Entity to flat: # For lists append each element that is of type Entity to flat:
...@@ -567,6 +602,7 @@ class Crawler(object): ...@@ -567,6 +602,7 @@ class Crawler(object):
if p.value not in flat: if p.value not in flat:
flat.append(p.value) flat.append(p.value)
Crawler.create_flat_list([p.value], flat) Crawler.create_flat_list([p.value], flat)
return flat
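A minimal usage sketch of the new single-argument form (the record names and the
"ref" property are made up for illustration):

    import caosdb as db
    from caoscrawler import Crawler

    r2 = db.Record(name="B")
    r1 = db.Record(name="A")
    r1.add_property(name="ref", value=r2)
    # Collects the given entities and every entity referenced by their properties,
    # each exactly once, preserving order:
    flat = Crawler.create_flat_list([r1, r2])   # -> [r1, r2]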
def _has_missing_object_in_references(self, ident: Identifiable, referencing_entities: list): def _has_missing_object_in_references(self, ident: Identifiable, referencing_entities: list):
""" """
...@@ -736,9 +772,7 @@ class Crawler(object): ...@@ -736,9 +772,7 @@ class Crawler(object):
def split_into_inserts_and_updates(self, ent_list: list[db.Entity]): def split_into_inserts_and_updates(self, ent_list: list[db.Entity]):
to_be_inserted: list[db.Entity] = [] to_be_inserted: list[db.Entity] = []
to_be_updated: list[db.Entity] = [] to_be_updated: list[db.Entity] = []
flat = list(ent_list) flat = Crawler.create_flat_list(ent_list)
# assure all entities are direct members TODO Can this be removed at some point?Check only?
Crawler.create_flat_list(ent_list, flat)
# TODO: can the following be removed at some point # TODO: can the following be removed at some point
for ent in flat: for ent in flat:
...@@ -1142,11 +1176,14 @@ ____________________\n""".format(i + 1, len(pending_changes)) + str(el[3])) ...@@ -1142,11 +1176,14 @@ ____________________\n""".format(i + 1, len(pending_changes)) + str(el[3]))
with open(filename, "w") as f: with open(filename, "w") as f:
f.write(yaml.dump(paths, sort_keys=False)) f.write(yaml.dump(paths, sort_keys=False))
def _crawl(self, items: list[StructureElement], def _crawl(self,
items: list[StructureElement],
local_converters: list[Converter], local_converters: list[Converter],
generalStore: GeneralStore, generalStore: GeneralStore,
recordStore: RecordStore, recordStore: RecordStore,
structure_elements_path: list[str], converters_path: list[str]): structure_elements_path: list[str],
converters_path: list[str],
restricted_path: Optional[list[str]] = None):
""" """
Crawl a list of StructureElements and apply any matching converters. Crawl a list of StructureElements and apply any matching converters.
...@@ -1155,16 +1192,31 @@ ____________________\n""".format(i + 1, len(pending_changes)) + str(el[3])) ...@@ -1155,16 +1192,31 @@ ____________________\n""".format(i + 1, len(pending_changes)) + str(el[3]))
treating structure elements. A locally defined converter could be treating structure elements. A locally defined converter could be
one that is only valid for a specific subtree of the originally one that is only valid for a specific subtree of the originally
cralwed StructureElement structure. cralwed StructureElement structure.
generalStore and recordStore: This recursion of the crawl function should only operate on copies of the generalStore and recordStore: This recursion of the crawl function should only operate on
global stores of the Crawler object. copies of the global stores of the Crawler object.
restricted_path: optional, list of strings, traverse the data tree only along the given
path. For example, when a directory contains the files a, b and c, and b is
given in restricted_path, a and c will be ignored by the crawler.
When the end of the given path is reached, traverse the full tree as
normal. The first element of the list provided by restricted_path should
be the name of the StructureElement at this level, i.e. denoting the
respective element in the items argument.
""" """
# This path_found variable stores whether the path given by restricted_path was found in the
# data tree
path_found = False
if restricted_path is not None and len(restricted_path) == 0:
restricted_path = None
for element in items: for element in items:
for converter in local_converters: for converter in local_converters:
# type is something like "matches files", replace isinstance with "type_matches" # type is something like "matches files", replace isinstance with "type_matches"
# match function tests regexp for example # match function tests regexp for example
if (converter.typecheck(element) and if (converter.typecheck(element) and (
converter.match(element) is not None): restricted_path is None or element.name == restricted_path[0])
and converter.match(element) is not None):
path_found = True
generalStore_copy = generalStore.create_scoped_copy() generalStore_copy = generalStore.create_scoped_copy()
recordStore_copy = recordStore.create_scoped_copy() recordStore_copy = recordStore.create_scoped_copy()
...@@ -1179,8 +1231,8 @@ ____________________\n""".format(i + 1, len(pending_changes)) + str(el[3])) ...@@ -1179,8 +1231,8 @@ ____________________\n""".format(i + 1, len(pending_changes)) + str(el[3]))
keys_modified = converter.create_records( keys_modified = converter.create_records(
generalStore_copy, recordStore_copy, element) generalStore_copy, recordStore_copy, element)
children = converter.create_children( children = converter.create_children(generalStore_copy, element)
generalStore_copy, element)
if self.debug: if self.debug:
# add provenance information for each varaible # add provenance information for each varaible
self.debug_tree[str(element)] = ( self.debug_tree[str(element)] = (
...@@ -1205,7 +1257,12 @@ ____________________\n""".format(i + 1, len(pending_changes)) + str(el[3])) ...@@ -1205,7 +1257,12 @@ ____________________\n""".format(i + 1, len(pending_changes)) + str(el[3]))
self._crawl(children, converter.converters, self._crawl(children, converter.converters,
generalStore_copy, recordStore_copy, generalStore_copy, recordStore_copy,
structure_elements_path + [element.get_name()], structure_elements_path + [element.get_name()],
converters_path + [converter.name]) converters_path + [converter.name],
restricted_path[1:] if restricted_path is not None else None)
if restricted_path and not path_found:
raise RuntimeError("A 'restricted_path' argument was given that is not contained in "
"the data tree")
# if the crawler is running out of scope, copy all records in # if the crawler is running out of scope, copy all records in
# the recordStore, that were created in this scope # the recordStore, that were created in this scope
# to the general update container. # to the general update container.
...@@ -1236,6 +1293,7 @@ def crawler_main(crawled_directory_path: str, ...@@ -1236,6 +1293,7 @@ def crawler_main(crawled_directory_path: str,
prefix: str = "", prefix: str = "",
securityMode: SecurityMode = SecurityMode.UPDATE, securityMode: SecurityMode = SecurityMode.UPDATE,
unique_names=True, unique_names=True,
restricted_path: Optional[list[str]] = None
): ):
""" """
...@@ -1259,6 +1317,9 @@ def crawler_main(crawled_directory_path: str, ...@@ -1259,6 +1317,9 @@ def crawler_main(crawled_directory_path: str,
securityMode of Crawler securityMode of Crawler
unique_names : bool unique_names : bool
whether or not to update or insert entities inspite of name conflicts whether or not to update or insert entities inspite of name conflicts
restricted_path: optional, list of strings
Traverse the data tree only along the given path. When the end of the given path
is reached, traverse the full tree as normal.
Returns Returns
------- -------
...@@ -1266,8 +1327,12 @@ def crawler_main(crawled_directory_path: str, ...@@ -1266,8 +1327,12 @@ def crawler_main(crawled_directory_path: str,
0 if successful 0 if successful
""" """
crawler = Crawler(debug=debug, securityMode=securityMode) crawler = Crawler(debug=debug, securityMode=securityMode)
crawler.crawl_directory(crawled_directory_path, cfood_file_name) try:
if provenance_file is not None: crawler.crawl_directory(crawled_directory_path, cfood_file_name, restricted_path)
except ConverterValidationError as err:
print(err)
return 1
if provenance_file is not None and debug:
crawler.save_debug_data(provenance_file) crawler.save_debug_data(provenance_file)
if identifiables_definition_file is not None: if identifiables_definition_file is not None:
...@@ -1328,6 +1393,15 @@ def parse_args(): ...@@ -1328,6 +1393,15 @@ def parse_args():
formatter_class=RawTextHelpFormatter) formatter_class=RawTextHelpFormatter)
parser.add_argument("cfood_file_name", parser.add_argument("cfood_file_name",
help="Path name of the cfood yaml file to be used.") help="Path name of the cfood yaml file to be used.")
mg = parser.add_mutually_exclusive_group()
mg.add_argument("-r", "--restrict", nargs="*",
help="Restrict the crawling to the subtree at the end of the given path."
"I.e. for each level that is given the crawler only treats the element "
"with the given name.")
mg.add_argument("--restrict-path", help="same as restrict; instead of a list, this takes a "
"single string that is interpreded as file system path. Note that a trailing"
"separator (e.g. '/') will be ignored. Use --restrict if you need to have "
"empty strings.")
parser.add_argument("--provenance", required=False, parser.add_argument("--provenance", required=False,
help="Path name of the provenance yaml file. " help="Path name of the provenance yaml file. "
"This file will only be generated if this option is set.") "This file will only be generated if this option is set.")
...@@ -1359,6 +1433,15 @@ def parse_args(): ...@@ -1359,6 +1433,15 @@ def parse_args():
return parser.parse_args() return parser.parse_args()
def split_restricted_path(path):
elements = []
while path not in ("", "/"):  # stop at the root, or once a relative path is fully consumed
path, el = os.path.split(path)
if el != "":
elements.insert(0, el)
return elements
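A short sketch of how this helper relates the two CLI options above (the path is
illustrative): "--restrict-path /data/2022/experiment_1/" yields the same
restriction as "--restrict data 2022 experiment_1".

    split_restricted_path("/data/2022/experiment_1/")
    # -> ["data", "2022", "experiment_1"]   (a trailing separator is ignored)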
def main(): def main():
args = parse_args() args = parse_args()
...@@ -1374,6 +1457,12 @@ def main(): ...@@ -1374,6 +1457,12 @@ def main():
if args.add_cwd_to_path: if args.add_cwd_to_path:
sys.path.append(os.path.abspath(".")) sys.path.append(os.path.abspath("."))
restricted_path = None
if args.restrict_path:
restricted_path = split_restricted_path(args.restrict_path)
if args.restrict:
restricted_path = args.restrict
sys.exit(crawler_main( sys.exit(crawler_main(
crawled_directory_path=args.crawled_directory_path, crawled_directory_path=args.crawled_directory_path,
cfood_file_name=args.cfood_file_name, cfood_file_name=args.cfood_file_name,
...@@ -1386,6 +1475,7 @@ def main(): ...@@ -1386,6 +1475,7 @@ def main():
"insert": SecurityMode.INSERT, "insert": SecurityMode.INSERT,
"update": SecurityMode.UPDATE}[args.security_mode], "update": SecurityMode.UPDATE}[args.security_mode],
unique_names=args.unique_names, unique_names=args.unique_names,
restricted_path=restricted_path
)) ))
......
...@@ -62,6 +62,8 @@ class Identifiable(): ...@@ -62,6 +62,8 @@ class Identifiable():
self.path = path self.path = path
self.record_type = record_type self.record_type = record_type
self.name = name self.name = name
if name == "":
self.name = None
self.properties: dict = {} self.properties: dict = {}
if properties is not None: if properties is not None:
self.properties = properties self.properties = properties
......
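A sketch of the effect of this normalization (the record type and property are
made up for illustration):

    from caoscrawler.identifiable import Identifiable

    ident = Identifiable(name="", record_type="Experiment",
                         properties={"date": "2023-01-30"})
    ident.name   # -> None, so the impossible "WITH name=''" condition cannot
                 #    end up in the identifiable query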
...@@ -27,6 +27,7 @@ from __future__ import annotations ...@@ -27,6 +27,7 @@ from __future__ import annotations
import yaml import yaml
from datetime import datetime from datetime import datetime
from typing import Any
from .identifiable import Identifiable from .identifiable import Identifiable
import caosdb as db import caosdb as db
import logging import logging
...@@ -35,14 +36,14 @@ from .utils import has_parent ...@@ -35,14 +36,14 @@ from .utils import has_parent
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
def convert_value(value): def convert_value(value: Any):
""" Returns a string representation of the value that is suitable """ Returns a string representation of the value that is suitable
to be used in the query to be used in the query
looking for the identified record. looking for the identified record.
Parameters Parameters
---------- ----------
value : The property of which the value shall be returned. value : Any type, the value that shall be returned and potentially converted.
Returns Returns
------- -------
...@@ -54,11 +55,13 @@ def convert_value(value): ...@@ -54,11 +55,13 @@ def convert_value(value):
return str(value.id) return str(value.id)
elif isinstance(value, datetime): elif isinstance(value, datetime):
return value.isoformat() return value.isoformat()
elif type(value) == str: elif isinstance(value, bool):
return str(value).upper()
elif isinstance(value, str):
# replace single quotes, otherwise they may break the queries # replace single quotes, otherwise they may break the queries
return value.replace("\'", "\\'") return value.replace("\'", "\\'")
else: else:
return f"{value}" return str(value)
class IdentifiableAdapter(metaclass=ABCMeta): class IdentifiableAdapter(metaclass=ABCMeta):
...@@ -97,7 +100,7 @@ class IdentifiableAdapter(metaclass=ABCMeta): ...@@ -97,7 +100,7 @@ class IdentifiableAdapter(metaclass=ABCMeta):
whether the required record already exists. whether the required record already exists.
""" """
query_string = "FIND Record " query_string = "FIND RECORD "
if ident.record_type is not None: if ident.record_type is not None:
query_string += ident.record_type query_string += ident.record_type
for ref in ident.backrefs: for ref in ident.backrefs:
......
...@@ -135,6 +135,7 @@ def macro_constructor(loader, node): ...@@ -135,6 +135,7 @@ def macro_constructor(loader, node):
raise RuntimeError("params type not supported") raise RuntimeError("params type not supported")
else: else:
raise RuntimeError("params type must not be None") raise RuntimeError("params type must not be None")
params = substitute_dict(params, params)
definition = substitute_dict(macro.definition, params) definition = substitute_dict(macro.definition, params)
res.update(definition) res.update(definition)
else: else:
...@@ -146,6 +147,7 @@ def macro_constructor(loader, node): ...@@ -146,6 +147,7 @@ def macro_constructor(loader, node):
params.update(params_setter) params.update(params_setter)
else: else:
raise RuntimeError("params type not supported") raise RuntimeError("params type not supported")
params = substitute_dict(params, params)
definition = substitute_dict(macro.definition, params) definition = substitute_dict(macro.definition, params)
res.update(definition) res.update(definition)
else: else:
......
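A sketch of what the added self-substitution of params enables, assuming the
usual $-placeholder syntax of the macro system (the values are made up):

    params = {"name": "dataset1", "path": "data/$name"}
    params = substitute_dict(params, params)
    # -> {"name": "dataset1", "path": "data/dataset1"}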
...@@ -56,6 +56,10 @@ class FileSystemStructureElement(StructureElement): ...@@ -56,6 +56,10 @@ class FileSystemStructureElement(StructureElement):
return "{}: {}, {}".format(class_name_short, self.name, self.path) return "{}: {}, {}".format(class_name_short, self.name, self.path)
class NoneElement(StructureElement):
pass
class Directory(FileSystemStructureElement): class Directory(FileSystemStructureElement):
pass pass
......
#
# This file is a part of the CaosDB Project.
#
# Copyright (C) 2022 Indiscale GmbH <info@indiscale.com>
# Copyright (C) 2022 Florian Spreckelsen <f.spreckelsen@indiscale.com>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
try:
from importlib import metadata as importlib_metadata
except ImportError:  # Python<3.8 doesn't support this, so use
import importlib_metadata
from packaging.version import parse as parse_version
from warnings import warn
# Read in version of locally installed caoscrawler package
version = importlib_metadata.version("caoscrawler")
class CfoodRequiredVersionError(RuntimeError):
"""The installed crawler version is older than the version specified in the
cfood's metadata.
"""
def check_cfood_version(metadata: dict):
if not metadata or "crawler-version" not in metadata:
msg = """
No crawler version specified in cfood definition, so there is no guarantee that
the cfood definition matches the installed crawler version.
Specifying a version is highly recommended to ensure that the definition works
as expected with the installed version of the crawler.
"""
warn(msg, UserWarning)
return
installed_version = parse_version(version)
cfood_version = parse_version(metadata["crawler-version"])
if cfood_version > installed_version:
msg = f"""
Your cfood definition requires a newer version of the CaosDB crawler. Please
update the crawler to the required version.
Crawler version specified in cfood: {cfood_version}
Crawler version installed on your system: {installed_version}
"""
raise CfoodRequiredVersionError(msg)
elif cfood_version < installed_version:
# only warn if major or minor of installed version are newer than
# specified in cfood
if (cfood_version.major < installed_version.major) or (cfood_version.minor < installed_version.minor):
msg = f"""
The cfood was written for a previous crawler version. Running the crawler in a
newer version than specified in the cfood definition may lead to unwanted or
unexpected behavior. Please visit the CHANGELOG
(https://gitlab.com/caosdb/caosdb-crawler/-/blob/main/CHANGELOG.md) and check
for any relevant changes.
Crawler version specified in cfood: {cfood_version}
Crawler version installed on your system: {installed_version}
"""
warn(msg, UserWarning)
return
# At this point, the version is either equal or the installed crawler
# version is newer just by an increase in the patch version, so still
# compatible. We can safely proceed.
return
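Illustrative behavior of check_cfood_version, assuming the locally installed
crawler version is 0.3.1:

    check_cfood_version({})                              # warns: no crawler-version given
    check_cfood_version({"crawler-version": "0.4.0"})    # raises CfoodRequiredVersionError
    check_cfood_version({"crawler-version": "0.2.0"})    # warns: cfood written for an older crawler
    check_cfood_version({"crawler-version": "0.3.0"})    # ok: only the patch level differs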
...@@ -16,6 +16,9 @@ document together with the metadata and :doc:`macro<macros>` definitions (see :r ...@@ -16,6 +16,9 @@ document together with the metadata and :doc:`macro<macros>` definitions (see :r
If metadata and macro definitions are provided, there **must** be a second document preceeding the If metadata and macro definitions are provided, there **must** be a second document preceeding the
converter tree specification, including these definitions. converter tree specification, including these definitions.
It is highly recommended to specify, in the metadata section, the version of the
CaosDB crawler for which the cfood is written; see :ref:`below<example_3>`.
Examples Examples
++++++++ ++++++++
...@@ -69,6 +72,7 @@ two custom converters in the second document (**not recommended**, see the recom ...@@ -69,6 +72,7 @@ two custom converters in the second document (**not recommended**, see the recom
metadata: metadata:
name: Datascience CFood name: Datascience CFood
description: CFood for data from the local data science work group description: CFood for data from the local data science work group
crawler-version: 0.2.1
macros: macros:
- !defmacro - !defmacro
name: SimulationDatasetFile name: SimulationDatasetFile
...@@ -108,6 +112,7 @@ The **recommended way** of defining metadata, custom converters, macros and the ...@@ -108,6 +112,7 @@ The **recommended way** of defining metadata, custom converters, macros and the
metadata: metadata:
name: Datascience CFood name: Datascience CFood
description: CFood for data from the local data science work group description: CFood for data from the local data science work group
crawler-version: 0.2.1
macros: macros:
- !defmacro - !defmacro
name: SimulationDatasetFile name: SimulationDatasetFile
......
...@@ -33,10 +33,10 @@ copyright = '2021, MPIDS' ...@@ -33,10 +33,10 @@ copyright = '2021, MPIDS'
author = 'Alexander Schlemmer' author = 'Alexander Schlemmer'
# The short X.Y version # The short X.Y version
version = '0.2.1' version = '0.3.1'
# The full version, including alpha/beta/rc tags # The full version, including alpha/beta/rc tags
# release = '0.5.2-rc2' # release = '0.5.2-rc2'
release = '0.2.1-dev' release = '0.3.1-dev'
# -- General configuration --------------------------------------------------- # -- General configuration ---------------------------------------------------
......
...@@ -77,7 +77,7 @@ Reads a YAML header from Markdown files (if such a header exists) and creates ...@@ -77,7 +77,7 @@ Reads a YAML header from Markdown files (if such a header exists) and creates
children elements according to the structure of the header. children elements according to the structure of the header.
DictElement Converter DictElement Converter
============== =====================
Creates a child StructureElement for each key in the dictionary. Creates a child StructureElement for each key in the dictionary.
Typical Subtree converters Typical Subtree converters
...@@ -483,3 +483,22 @@ Let's formulate that using `create_records` (again, `dir_name` is constant here) ...@@ -483,3 +483,22 @@ Let's formulate that using `create_records` (again, `dir_name` is constant here)
keys_modified = create_records(values, records, keys_modified = create_records(values, records,
record_def) record_def)
Debugging
=========
You can add the key `debug_match` to the definition of a Converter in order to create debugging
output for the match step. The following snippet illustrates this:
.. code-block:: yaml
DirConverter:
type: Directory
match: (?P<dir_name>.*)
debug_match: True
records:
Project:
identifier: project_name
Whenever this Converter tries to match a StructureElement, it logs what was matched against
what and what the result was.
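The match information is emitted through Python's standard ``logging`` module, so it only
becomes visible if logging output is enabled verbosely enough; a minimal sketch:

.. code-block:: python

   import logging

   # Show all log output, including the per-converter match information
   # produced when debug_match is set.
   logging.basicConfig(level=logging.DEBUG)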