Verified Commit 7c7b1016 authored by Daniel Hornung's avatar Daniel Hornung

Merge branch 'dev' into f-small-doc-fixes

parents fbdf2293 2aa524bf
2 merge requests!162DOC WIP: Tutorial: Single structured file,!129Documentation: many small changes
Pipeline #44923 failed
......@@ -7,8 +7,10 @@ RUN apt-get update && \
python3-autopep8 \
python3-pip \
python3-pytest \
python3-sphinx \
tox \
-y
RUN pip3 install recommonmark sphinx-rtd-theme
COPY .docker/wait-for-it.sh /wait-for-it.sh
ARG PYLIB
ADD https://gitlab.indiscale.com/api/v4/projects/97/repository/commits/${PYLIB} \
......
......@@ -296,8 +296,9 @@ style:
pages_prepare: &pages_prepare
tags: [ cached-dind ]
stage: deploy
needs: []
image: $CI_REGISTRY/caosdb/src/caosdb-pylib/testenv:latest
needs:
- job: build-testenv
image: $CI_REGISTRY_IMAGE
only:
refs:
- /^release-.*$/i
......
......@@ -10,12 +10,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added ###
### Changed ###
- If the `parents` key is used in a cfood at a lower level for a Record that
  already has a Parent (either explicitly given or set as the default Parent),
  the old Parent(s) are now overwritten with the value belonging to the
  `parents` key.
- If a registered identifiable states that a reference by a Record with parent
  RT1 is needed, references from Records that have a child of RT1 as parent
  are now also accepted.
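The second change can be illustrated with a small, self-contained sketch (plain Python sets instead of real LinkAhead RecordTypes; the `CHILDREN` map and both helper names are hypothetical stand-ins, not crawler API):

```python
# Hypothetical parent -> children map; in LinkAhead this hierarchy lives
# on the server and would be resolved via a RecordType query.
CHILDREN = {"RT1": ["RT1Child"]}


def accepted_parents(rtname):
    """RT1 itself plus all of its children satisfy a 'referenced by RT1'."""
    return {rtname, *CHILDREN.get(rtname, [])}


def backref_satisfied(required_rt, referencing_parents):
    """True if the referencing Record has RT1 or a child of RT1 as parent."""
    return bool(accepted_parents(required_rt) & set(referencing_parents))
```

Previously only a Record with parent `RT1` itself would have satisfied the registered identifiable; now a Record whose parent is a child of `RT1` does too.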
### Deprecated ###
### Removed ###
### Fixed ###
- Empty Records can now be created (https://gitlab.com/caosdb/caosdb-crawler/-/issues/27)
- [#58](https://gitlab.com/caosdb/caosdb-crawler/-/issues/58) Documentation builds API docs in pipeline now.
### Security ###
......
# CaosDB-Crawler
## Welcome
This is the repository of the CaosDB-Crawler, a tool for automatic data
insertion into [CaosDB](https://gitlab.com/caosdb/caosdb-meta).
This is the repository of the LinkAhead Crawler, a tool for automatic data
insertion into [LinkAhead](https://gitlab.com/linkahead/linkahead).
This is a new implementation resolving problems of the original implementation
in [caosdb-advancedtools](https://gitlab.com/caosdb/caosdb-advanced-user-tools)
in [LinkAhead Python Advanced User Tools](https://gitlab.com/caosdb/caosdb-advanced-user-tools)
## Setup
......@@ -16,20 +15,23 @@ setup this code.
## Further Reading
Please refer to the [official documentation](https://docs.indiscale.com/caosdb-crawler/) of the CaosDB-Crawler for more information.
Please refer to the [official documentation](https://docs.indiscale.com/caosdb-crawler/) of the LinkAhead Crawler for more information.
## Contributing
Thank you very much to all contributors—[past, present](https://gitlab.com/caosdb/caosdb/-/blob/dev/HUMANS.md), and prospective ones.
Thank you very much to all contributors—[past,
present](https://gitlab.com/linkahead/linkahead/-/blob/main/HUMANS.md), and prospective
ones.
### Code of Conduct
By participating, you are expected to uphold our [Code of Conduct](https://gitlab.com/caosdb/caosdb/-/blob/dev/CODE_OF_CONDUCT.md).
By participating, you are expected to uphold our [Code of
Conduct](https://gitlab.com/linkahead/linkahead/-/blob/main/CODE_OF_CONDUCT.md).
### How to Contribute
* You found a bug, have a question, or want to request a feature? Please
[create an issue](https://gitlab.com/caosdb/caosdb-crawler).
[create an issue](https://gitlab.com/linkahead/linkahead-crawler/-/issues).
* You want to contribute code?
* **Forking:** Please fork the repository and create a merge request in GitLab and choose this repository as
target. Make sure to select "Allow commits from members who can merge the target branch" under
......@@ -38,9 +40,8 @@ By participating, you are expected to uphold our [Code of Conduct](https://gitla
* **Code style:** This project adheres to the PEP8 recommendations; you can test your code style
using the `autopep8` tool (`autopep8 -i -r ./`). Please write your doc strings following the
[NumpyDoc](https://numpydoc.readthedocs.io/en/latest/format.html) conventions.
* You can also contact us at **info (AT) caosdb.de** and join the
CaosDB community on
[#caosdb:matrix.org](https://matrix.to/#/!unwwlTfOznjEnMMXxf:matrix.org).
* You can also join the LinkAhead community on
[#linkahead:matrix.org](https://matrix.to/#/!unwwlTfOznjEnMMXxf:matrix.org).
There is the file `unittests/records.xml` that serves as a dummy for a server state with files.
......
......@@ -114,7 +114,7 @@ def test_issue_23(clear_database):
assert rec_crawled.get_property("identifying_prop").value == "identifier"
assert rec_crawled.get_property("prop_b") is not None
assert rec_crawled.get_property("prop_b").value == "something_else"
# no interaction with the database yet, so the rrecord shouldn't have a prop_a yet
# no interaction with the database yet, so the record shouldn't have a prop_a yet
assert rec_crawled.get_property("prop_a") is None
# synchronize with database and update the record
......@@ -133,3 +133,78 @@ def test_issue_23(clear_database):
"identifying_prop").value == rec_crawled.get_property("identifying_prop").value
assert rec_retrieved.get_property(
"prop_b").value == rec_crawled.get_property("prop_b").value
def test_issue_83(clear_database):
"""https://gitlab.com/linkahead/linkahead-crawler/-/issues/83. Test that
names don't need to be unique for referenced entities if they are not part
of the identifiable.
"""
# Very simple data model
identifying_prop = db.Property(name="IdentifyingProp", datatype=db.INTEGER).insert()
referenced_type = db.RecordType(name="ReferencedType").add_property(
name=identifying_prop.name, importance=db.OBLIGATORY).insert()
referencing_type = db.RecordType(name="ReferencingType").add_property(
name=referenced_type.name, datatype=db.LIST(referenced_type.name)).insert()
# Define identifiables. ReferencingType by name, ReferencedType by
# IdentifyingProp and not by name.
ident = CaosDBIdentifiableAdapter()
ident.register_identifiable(referenced_type.name, db.RecordType().add_parent(
name=referenced_type.name).add_property(name=identifying_prop.name))
ident.register_identifiable(referencing_type.name, db.RecordType().add_parent(
name=referencing_type.name).add_property(name="name"))
crawler = Crawler(identifiableAdapter=ident)
ref_target1 = db.Record(name="RefTarget").add_parent(
name=referenced_type.name).add_property(name=identifying_prop.name, value=1)
ref_target2 = db.Record(name="RefTarget").add_parent(
name=referenced_type.name).add_property(name=identifying_prop.name, value=2)
referencing1 = db.Record(name="Referencing1").add_parent(
name=referencing_type.name).add_property(name=referenced_type.name, value=[ref_target1])
referencing2 = db.Record(name="Referencing2").add_parent(
name=referencing_type.name).add_property(name=referenced_type.name, value=[ref_target2])
referencing3 = db.Record(name="Referencing3").add_parent(name=referencing_type.name).add_property(
name=referenced_type.name, value=[ref_target1, ref_target2])
records = db.Container().extend(
[ref_target1, ref_target2, referencing1, referencing2, referencing3])
ins, ups = crawler.synchronize(crawled_data=records, unique_names=False)
assert len(ins) == len(records)
assert len(ups) == 0
retrieved_target1 = db.execute_query(
f"FIND {referenced_type.name} WITH {identifying_prop.name}=1", unique=True)
retrieved_target2 = db.execute_query(
f"FIND {referenced_type.name} WITH {identifying_prop.name}=2", unique=True)
assert retrieved_target2.name == retrieved_target1.name
assert retrieved_target1.name == ref_target1.name
assert retrieved_target1.id != retrieved_target2.id
retrieved_referencing1 = db.execute_query(
f"FIND {referencing_type.name} WITH name={referencing1.name}", unique=True)
assert retrieved_referencing1.get_property(referenced_type.name) is not None
assert retrieved_referencing1.get_property(referenced_type.name).value == [
retrieved_target1.id]
assert retrieved_referencing1.get_property(referenced_type.name).value != [
retrieved_target2.id]
retrieved_referencing2 = db.execute_query(
f"FIND {referencing_type.name} WITH name={referencing2.name}", unique=True)
assert retrieved_referencing2.get_property(referenced_type.name) is not None
assert retrieved_referencing2.get_property(referenced_type.name).value == [
retrieved_target2.id]
assert retrieved_referencing2.get_property(referenced_type.name).value != [
retrieved_target1.id]
retrieved_referencing3 = db.execute_query(
f"FIND {referencing_type.name} WITH name={referencing3.name}", unique=True)
assert retrieved_referencing3.get_property(referenced_type.name) is not None
assert len(retrieved_referencing3.get_property(referenced_type.name).value) == 2
assert retrieved_target1.id in retrieved_referencing3.get_property(referenced_type.name).value
assert retrieved_target2.id in retrieved_referencing3.get_property(referenced_type.name).value
......@@ -20,8 +20,8 @@ packages = find:
python_requires = >=3.7
install_requires =
importlib-resources
caosdb > 0.11.2
caosadvancedtools >= 0.7.0
linkahead >= 0.13.1
yaml-header-tools >= 0.2.1
pyyaml
odfpy #make optional
......
......@@ -17,7 +17,7 @@
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
import caosdb as db
import linkahead as db
DEFAULTS = {
"send_crawler_notifications": False,
......
......@@ -24,29 +24,29 @@
#
from __future__ import annotations
from jsonschema import validate, ValidationError
import os
import re
import datetime
import caosdb as db
import json
import logging
import os
import re
import warnings
from .utils import has_parent
from .stores import GeneralStore, RecordStore
from .structure_elements import (StructureElement, Directory, File, DictElement, JSONFile,
IntegerElement, BooleanElement, FloatElement, NoneElement,
TextElement, TextElement, ListElement)
from typing import List, Optional, Tuple, Union
from abc import ABCMeta, abstractmethod
from string import Template
import yaml_header_tools
from typing import List, Optional, Tuple, Union
import caosdb as db
import pandas as pd
import logging
import yaml
import yaml_header_tools
from jsonschema import ValidationError, validate
from .stores import GeneralStore, RecordStore
from .structure_elements import (BooleanElement, DictElement, Directory, File,
FloatElement, IntegerElement, JSONFile,
ListElement, NoneElement, StructureElement,
TextElement)
from .utils import has_parent
# These are special properties which are (currently) treated differently
# by the converters:
......@@ -235,6 +235,12 @@ def create_records(values: GeneralStore, records: RecordStore, def_records: dict
keys_modified = []
for name, record in def_records.items():
# If only a name was given (like this:
# Experiment:
# ) set record to an empty dict / empty configuration
if record is None:
record = {}
role = "Record"
# This allows us to create e.g. Files
if "role" in record:
......@@ -300,6 +306,7 @@ def create_records(values: GeneralStore, records: RecordStore, def_records: dict
# no matter whether the record existed in the record store or not,
# parents will be added when they aren't present in the record yet:
if "parents" in record:
c_record.parents.clear()
for parent in record["parents"]:
# Do the variables replacement:
var_replaced_parent = replace_variables(parent, values)
......
......@@ -36,6 +36,7 @@ import importlib
import logging
import os
import sys
import traceback
import uuid
import warnings
from argparse import RawTextHelpFormatter
......@@ -407,12 +408,12 @@ class Crawler(object):
if p.value.path != cached.path:
raise RuntimeError(
"The cached and the referenced entity are not identical.\n"
f"Cached:\n{cached}\nRefernced:\n{el}"
f"Cached:\n{cached}\nReferenced:\n{el}"
)
else:
raise RuntimeError(
"The cached and the referenced entity are not identical.\n"
f"Cached:\n{cached}\nRefernced:\n{el}"
f"Cached:\n{cached}\nReferenced:\n{el}"
)
lst.append(cached)
else:
......@@ -428,12 +429,12 @@ class Crawler(object):
if p.value.path != cached.path:
raise RuntimeError(
"The cached and the referenced entity are not identical.\n"
f"Cached:\n{cached}\nRefernced:\n{p.value}"
f"Cached:\n{cached}\nReferenced:\n{p.value}"
)
else:
raise RuntimeError(
"The cached and the referenced entity are not identical.\n"
f"Cached:\n{cached}\nRefernced:\n{p.value}"
f"Cached:\n{cached}\nReferenced:\n{p.value}"
)
p.value = cached
......@@ -783,6 +784,8 @@ class Crawler(object):
for i in reversed(range(len(crawled_data))):
if not check_identical(crawled_data[i], identified_records[i]):
logger.debug("Scheduled update because of the following diff:\n"
+ str(compare_entities(crawled_data[i], identified_records[i])))
actual_updates.append(crawled_data[i])
return actual_updates
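The filtering step above can be sketched without any LinkAhead dependencies; `check_identical` is passed in as a plain callable here and the record dicts are made up for illustration:

```python
def filter_actual_updates(crawled_data, identified_records, check_identical):
    """Keep only those crawled records that differ from their identified
    server-side counterpart; identical records need no update."""
    actual_updates = []
    for crawled, identified in zip(crawled_data, identified_records):
        if not check_identical(crawled, identified):
            actual_updates.append(crawled)
    return actual_updates
```

With plain dicts and equality as the identity check, only the record whose value changed ends up in the update list.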
......@@ -1335,14 +1338,17 @@ def crawler_main(crawled_directory_path: str,
_update_status_record(crawler.run_id, len(inserts), len(updates), status="OK")
return 0
except ForbiddenTransaction as err:
logger.debug(traceback.format_exc())
logger.error(err)
_update_status_record(crawler.run_id, 0, 0, status="FAILED")
return 1
except ConverterValidationError as err:
logger.debug(traceback.format_exc())
logger.error(err)
_update_status_record(crawler.run_id, 0, 0, status="FAILED")
return 1
except Exception as err:
logger.debug(traceback.format_exc())
logger.debug(err)
if "SHARED_DIR" in os.environ:
......
......@@ -40,6 +40,12 @@ from .utils import has_parent
logger = logging.getLogger(__name__)
def get_children_of_rt(rtname):
"""Return the name of the given RecordType and the names of all of its
children in a list."""
return [p.name for p in db.execute_query(f"FIND RECORDTYPE {rtname}")]
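As a rough, dependency-free sketch of what this helper returns (the `CHILDREN_OF` dict is a made-up stand-in for the server-side RecordType hierarchy that the real `FIND RECORDTYPE` query resolves):

```python
# Hypothetical in-memory RecordType hierarchy; in the real code this
# information comes from a server-side `FIND RECORDTYPE {rtname}` query.
CHILDREN_OF = {
    "RT1": ["RT1a", "RT1b"],
    "RT1a": [],
    "RT1b": ["RT1b1"],
    "RT1b1": [],
}


def get_children_of_rt_sketch(rtname):
    """Return the name of `rtname` and of all its transitive children."""
    result = [rtname]
    for child in CHILDREN_OF.get(rtname, []):
        result.extend(get_children_of_rt_sketch(child))
    return result
```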
def convert_value(value: Any):
""" Returns a string representation of the value that is suitable
to be used in the query
......@@ -212,11 +218,16 @@ identifiabel, identifiable and identified record) for a Record.
# TODO: similar to the Identifiable class, Registered Identifiable should be a
# separate class too
if prop.name.lower() == "is_referenced_by":
for rtname in prop.value:
if (id(record) in referencing_entities
and rtname in referencing_entities[id(record)]):
identifiable_backrefs.extend(referencing_entities[id(record)][rtname])
else:
for givenrt in prop.value:
rt_and_children = get_children_of_rt(givenrt)
found = False
for rtname in rt_and_children:
if (id(record) in referencing_entities
and rtname in referencing_entities[id(record)]):
identifiable_backrefs.extend(
referencing_entities[id(record)][rtname])
found = True
if not found:
# TODO: is this the appropriate error?
raise NotImplementedError(
f"The following record is missing an identifying property:"
......
......@@ -270,6 +270,8 @@ Parameters
converters_path = []
for element in items:
element_path = os.path.join(*(structure_elements_path + [element.get_name()]))
logger.debug(f"Dealing with {element_path}")
for converter in converters:
# type is something like "matches files", replace isinstance with "type_matches"
......@@ -282,8 +284,7 @@ Parameters
record_store_copy = record_store.create_scoped_copy()
# Create an entry for this matched structure element that contains the path:
general_store_copy[converter.name] = (
os.path.join(*(structure_elements_path + [element.get_name()])))
general_store_copy[converter.name] = element_path
# extracts values from structure element and stores them in the
# variable store
......
# How to upgrade
## 0.6.x to 0.7.0
If you added Parents to Records at multiple places in the CFood, you must now
do this at a single location, because the `parents` key now overwrites any
previously set parents.
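A minimal sketch of the new behavior, using plain Python classes rather than the real crawler API (both names below are hypothetical):

```python
class RecordSketch:
    """Tiny stand-in for a crawler Record: only a name and a parent list."""

    def __init__(self, name, parents=None):
        self.name = name
        self.parents = list(parents or [])


def apply_parents_key(record, cfood_definition):
    """Since 0.7.0 a `parents` key replaces any previously set parents
    instead of appending to them."""
    if "parents" in cfood_definition:
        record.parents.clear()
        record.parents.extend(cfood_definition["parents"])
    return record
```

With a default parent `Experiment` already set, a lower-level `parents: ["Exp"]` entry now leaves the record with `Exp` as its only parent.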
## 0.5.x to 0.6.0
[#41](https://gitlab.com/caosdb/caosdb-crawler/-/issues/41) was fixed. This
means that you previously used the name of Entities as an identifying
......
......@@ -607,7 +607,7 @@ def test_create_flat_list():
assert c in flat
@ pytest.fixture
@pytest.fixture
def crawler_mocked_for_backref_test():
crawler = Crawler()
# mock retrieval of registered identifiables: return Record with just a parent
......@@ -651,6 +651,8 @@ def test_validation_error_print(caplog):
caplog.clear()
@patch("caoscrawler.identifiable_adapters.get_children_of_rt",
new=Mock(side_effect=lambda x: [x]))
def test_split_into_inserts_and_updates_backref(crawler_mocked_for_backref_test):
crawler = crawler_mocked_for_backref_test
identlist = [Identifiable(name="A", record_type="BR"),
......@@ -685,6 +687,8 @@ def test_split_into_inserts_and_updates_backref(crawler_mocked_for_backref_test)
assert insert[0].name == "B"
@patch("caoscrawler.identifiable_adapters.get_children_of_rt",
new=Mock(side_effect=lambda x: [x]))
def test_split_into_inserts_and_updates_mult_backref(crawler_mocked_for_backref_test):
# test whether multiple references of the same record type are correctly used
crawler = crawler_mocked_for_backref_test
......@@ -705,6 +709,8 @@ def test_split_into_inserts_and_updates_mult_backref(crawler_mocked_for_backref_
assert len(insert) == 2
@patch("caoscrawler.identifiable_adapters.get_children_of_rt",
new=Mock(side_effect=lambda x: [x]))
def test_split_into_inserts_and_updates_diff_backref(crawler_mocked_for_backref_test):
# test whether multiple references of different record types are correctly used
crawler = crawler_mocked_for_backref_test
......
......@@ -20,6 +20,8 @@ def clear_cache():
cache_clear()
@patch("caoscrawler.identifiable_adapters.get_children_of_rt",
new=Mock(side_effect=id))
@patch("caoscrawler.identifiable_adapters.cached_get_entity_by",
new=Mock(side_effect=mock_get_entity_by))
def test_file_identifiable():
......
---
metadata:
crawler-version: 0.6.1
---
Definitions:
type: Definitions
data:
type: Dict
match_name: '.*'
records:
Experiment:
Projekt:
parents: ["project"]
name: "p"
Campaign:
name: "c"
Stuff:
name: "s"
subtree:
Experiment:
type: DictElement
match: '.*'
records:
Experiment:
parents: ["Exp"]
name: "e"
Projekt:
parents: ["Projekt"]
Campaign:
parents: ["Cap"]
Stuff:
name: "s"
Experiment2:
type: DictElement
match: '.*'
records:
Campaign:
parents: ["Cap2"]
# encoding: utf-8
#
# This file is a part of the CaosDB Project.
#
# Copyright (C) 2021 Henrik tom Wörden <h.tomwoerden@indiscale.com>
# 2021-2023 Research Group Biomedical Physics,
# Max-Planck-Institute for Dynamics and Self-Organization Göttingen
# Alexander Schlemmer <alexander.schlemmer@ds.mpg.de>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
import json
import logging
......@@ -276,3 +298,33 @@ def test_variable_deletion_problems():
assert record.get_property("var2").value == "test"
else:
raise RuntimeError("Wrong name")
def test_record_parents():
""" Test that the scanner returns the correct list of records """
data = {
'Experiments': {}
}
crawler_definition = load_definition(UNITTESTDIR / "test_parent_cfood.yml")
converter_registry = create_converter_registry(crawler_definition)
records = scan_structure_elements(DictElement(name="", value=data), crawler_definition,
converter_registry)
assert len(records) == 4
for rec in records:
if rec.name == 'e':
assert rec.parents[0].name == 'Exp' # default parent was overwritten
assert len(rec.parents) == 1
elif rec.name == 'c':
assert rec.parents[0].name == 'Cap2' # default parent was overwritten by second
# converter
assert len(rec.parents) == 1
elif rec.name == 'p':
assert rec.parents[0].name == 'Projekt' # top level set parent was overwritten
assert len(rec.parents) == 1
elif rec.name == 's':
assert rec.parents[0].name == 'Stuff' # default parent stays if no parent is given on
# lower levels
assert len(rec.parents) == 1