Verified Commit 7c7b1016 authored by Daniel Hornung's avatar Daniel Hornung

Merge branch 'dev' into f-small-doc-fixes

parents fbdf2293 2aa524bf
2 merge requests!162DOC WIP: Tutorial: Single structured file,!129Documentation: many small changes
Pipeline #44923 failed
......@@ -7,8 +7,10 @@ RUN apt-get update && \
python3-autopep8 \
python3-pip \
python3-pytest \
python3-sphinx \
tox \
-y
RUN pip3 install recommonmark sphinx-rtd-theme
COPY .docker/wait-for-it.sh /wait-for-it.sh
ARG PYLIB
ADD https://gitlab.indiscale.com/api/v4/projects/97/repository/commits/${PYLIB} \
......
......@@ -296,8 +296,9 @@ style:
pages_prepare: &pages_prepare
tags: [ cached-dind ]
stage: deploy
needs: []
image: $CI_REGISTRY/caosdb/src/caosdb-pylib/testenv:latest
needs:
- job: build-testenv
image: $CI_REGISTRY_IMAGE
only:
refs:
- /^release-.*$/i
......
......@@ -10,12 +10,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added ###
### Changed ###
- If the `parents` key is used in a cfood at a lower level for a Record that
  already has a Parent (either explicitly given or set as the default Parent),
  the old Parent(s) are now overwritten with the value belonging to the
  `parents` key.
- If a registered identifiable states that a reference by a Record with parent
  RT1 is needed, references from Records that have a child of RT1 as parent
  are now also accepted.
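The second change can be illustrated with a small, self-contained sketch (plain Python sets instead of real LinkAhead RecordTypes; the `CHILDREN` map and both helper names are hypothetical stand-ins, not crawler API):

```python
# Hypothetical parent -> children map; in LinkAhead this hierarchy lives
# on the server and would be resolved via a RecordType query.
CHILDREN = {"RT1": ["RT1Child"]}


def accepted_parents(rtname):
    """RT1 itself plus all of its children satisfy a 'referenced by RT1'."""
    return {rtname, *CHILDREN.get(rtname, [])}


def backref_satisfied(required_rt, referencing_parents):
    """True if the referencing Record has RT1 or a child of RT1 as parent."""
    return bool(accepted_parents(required_rt) & set(referencing_parents))
```

Previously only a Record with parent `RT1` itself would have satisfied the registered identifiable; now a Record whose parent is a child of `RT1` does too.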
### Deprecated ###
### Removed ###
### Fixed ###
- Empty Records can now be created (https://gitlab.com/caosdb/caosdb-crawler/-/issues/27)
- [#58](https://gitlab.com/caosdb/caosdb-crawler/-/issues/58) Documentation builds API docs in pipeline now.
### Security ###
......
# CaosDB-Crawler
## Welcome
This is the repository of the CaosDB-Crawler, a tool for automatic data
insertion into [CaosDB](https://gitlab.com/caosdb/caosdb-meta).
This is the repository of the LinkAhead Crawler, a tool for automatic data
insertion into [LinkAhead](https://gitlab.com/linkahead/linkahead).
This is a new implementation resolving problems of the original implementation
in [caosdb-advancedtools](https://gitlab.com/caosdb/caosdb-advanced-user-tools)
in [LinkAhead Python Advanced User Tools](https://gitlab.com/caosdb/caosdb-advanced-user-tools)
## Setup
......@@ -16,20 +15,23 @@ setup this code.
## Further Reading
Please refer to the [official documentation](https://docs.indiscale.com/caosdb-crawler/) of the CaosDB-Crawler for more information.
Please refer to the [official documentation](https://docs.indiscale.com/caosdb-crawler/) of the LinkAhead Crawler for more information.
## Contributing
Thank you very much to all contributors—[past, present](https://gitlab.com/caosdb/caosdb/-/blob/dev/HUMANS.md), and prospective ones.
Thank you very much to all contributors—[past,
present](https://gitlab.com/linkahead/linkahead/-/blob/main/HUMANS.md), and prospective
ones.
### Code of Conduct
By participating, you are expected to uphold our [Code of Conduct](https://gitlab.com/caosdb/caosdb/-/blob/dev/CODE_OF_CONDUCT.md).
By participating, you are expected to uphold our [Code of
Conduct](https://gitlab.com/linkahead/linkahead/-/blob/main/CODE_OF_CONDUCT.md).
### How to Contribute
* You found a bug, have a question, or want to request a feature? Please
[create an issue](https://gitlab.com/caosdb/caosdb-crawler).
[create an issue](https://gitlab.com/linkahead/linkahead-crawler/-/issues).
* You want to contribute code?
* **Forking:** Please fork the repository and create a merge request in GitLab and choose this repository as
target. Make sure to select "Allow commits from members who can merge the target branch" under
......@@ -38,9 +40,8 @@ By participating, you are expected to uphold our [Code of Conduct](https://gitla
* **Code style:** This project adheres to the PEP8 recommendations; you can test your code style
using the `autopep8` tool (`autopep8 -i -r ./`). Please write your doc strings following the
[NumpyDoc](https://numpydoc.readthedocs.io/en/latest/format.html) conventions.
* You can also contact us at **info (AT) caosdb.de** and join the
CaosDB community on
[#caosdb:matrix.org](https://matrix.to/#/!unwwlTfOznjEnMMXxf:matrix.org).
* You can also join the LinkAhead community on
[#linkahead:matrix.org](https://matrix.to/#/!unwwlTfOznjEnMMXxf:matrix.org).
There is the file `unittests/records.xml` that serves as a dummy for a server state with files.
......
......@@ -114,7 +114,7 @@ def test_issue_23(clear_database):
assert rec_crawled.get_property("identifying_prop").value == "identifier"
assert rec_crawled.get_property("prop_b") is not None
assert rec_crawled.get_property("prop_b").value == "something_else"
# no interaction with the database yet, so the rrecord shouldn't have a prop_a yet
# no interaction with the database yet, so the record shouldn't have a prop_a yet
assert rec_crawled.get_property("prop_a") is None
# synchronize with database and update the record
......@@ -133,3 +133,78 @@ def test_issue_23(clear_database):
"identifying_prop").value == rec_crawled.get_property("identifying_prop").value
assert rec_retrieved.get_property(
"prop_b").value == rec_crawled.get_property("prop_b").value
def test_issue_83(clear_database):
"""https://gitlab.com/linkahead/linkahead-crawler/-/issues/83. Test that
names don't need to be unique for referenced entities if they are not part
of the identifiable.
"""
# Very simple data model
identifying_prop = db.Property(name="IdentifyingProp", datatype=db.INTEGER).insert()
referenced_type = db.RecordType(name="ReferencedType").add_property(
name=identifying_prop.name, importance=db.OBLIGATORY).insert()
referencing_type = db.RecordType(name="ReferencingType").add_property(
name=referenced_type.name, datatype=db.LIST(referenced_type.name)).insert()
# Define identifiables. ReferencingType by name, ReferencedType by
# IdentifyingProp and not by name.
ident = CaosDBIdentifiableAdapter()
ident.register_identifiable(referenced_type.name, db.RecordType().add_parent(
name=referenced_type.name).add_property(name=identifying_prop.name))
ident.register_identifiable(referencing_type.name, db.RecordType().add_parent(
name=referencing_type.name).add_property(name="name"))
crawler = Crawler(identifiableAdapter=ident)
ref_target1 = db.Record(name="RefTarget").add_parent(
name=referenced_type.name).add_property(name=identifying_prop.name, value=1)
ref_target2 = db.Record(name="RefTarget").add_parent(
name=referenced_type.name).add_property(name=identifying_prop.name, value=2)
referencing1 = db.Record(name="Referencing1").add_parent(
name=referencing_type.name).add_property(name=referenced_type.name, value=[ref_target1])
referencing2 = db.Record(name="Referencing2").add_parent(
name=referencing_type.name).add_property(name=referenced_type.name, value=[ref_target2])
referencing3 = db.Record(name="Referencing3").add_parent(name=referencing_type.name).add_property(
name=referenced_type.name, value=[ref_target1, ref_target2])
records = db.Container().extend(
[ref_target1, ref_target2, referencing1, referencing2, referencing3])
ins, ups = crawler.synchronize(crawled_data=records, unique_names=False)
assert len(ins) == len(records)
assert len(ups) == 0
retrieved_target1 = db.execute_query(
f"FIND {referenced_type.name} WITH {identifying_prop.name}=1", unique=True)
retrieved_target2 = db.execute_query(
f"FIND {referenced_type.name} WITH {identifying_prop.name}=2", unique=True)
assert retrieved_target2.name == retrieved_target1.name
assert retrieved_target1.name == ref_target1.name
assert retrieved_target1.id != retrieved_target2.id
retrieved_referencing1 = db.execute_query(
f"FIND {referencing_type.name} WITH name={referencing1.name}", unique=True)
assert retrieved_referencing1.get_property(referenced_type.name) is not None
assert retrieved_referencing1.get_property(referenced_type.name).value == [
retrieved_target1.id]
assert retrieved_referencing1.get_property(referenced_type.name).value != [
retrieved_target2.id]
retrieved_referencing2 = db.execute_query(
f"FIND {referencing_type.name} WITH name={referencing2.name}", unique=True)
assert retrieved_referencing2.get_property(referenced_type.name) is not None
assert retrieved_referencing2.get_property(referenced_type.name).value == [
retrieved_target2.id]
assert retrieved_referencing2.get_property(referenced_type.name).value != [
retrieved_target1.id]
retrieved_referencing3 = db.execute_query(
f"FIND {referencing_type.name} WITH name={referencing3.name}", unique=True)
assert retrieved_referencing3.get_property(referenced_type.name) is not None
assert len(retrieved_referencing3.get_property(referenced_type.name).value) == 2
assert retrieved_target1.id in retrieved_referencing3.get_property(referenced_type.name).value
assert retrieved_target2.id in retrieved_referencing3.get_property(referenced_type.name).value
......@@ -20,8 +20,8 @@ packages = find:
python_requires = >=3.7
install_requires =
importlib-resources
caosdb > 0.11.2
caosadvancedtools >= 0.7.0
linkahead >= 0.13.1
yaml-header-tools >= 0.2.1
pyyaml
odfpy #make optional
......
......@@ -17,7 +17,7 @@
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
import caosdb as db
import linkahead as db
DEFAULTS = {
"send_crawler_notifications": False,
......
......@@ -24,29 +24,29 @@
#
from __future__ import annotations
from jsonschema import validate, ValidationError
import os
import re
import datetime
import caosdb as db
import json
import logging
import os
import re
import warnings
from .utils import has_parent
from .stores import GeneralStore, RecordStore
from .structure_elements import (StructureElement, Directory, File, DictElement, JSONFile,
IntegerElement, BooleanElement, FloatElement, NoneElement,
TextElement, TextElement, ListElement)
from typing import List, Optional, Tuple, Union
from abc import ABCMeta, abstractmethod
from string import Template
import yaml_header_tools
from typing import List, Optional, Tuple, Union
import caosdb as db
import pandas as pd
import logging
import yaml
import yaml_header_tools
from jsonschema import ValidationError, validate
from .stores import GeneralStore, RecordStore
from .structure_elements import (BooleanElement, DictElement, Directory, File,
FloatElement, IntegerElement, JSONFile,
ListElement, NoneElement, StructureElement,
TextElement)
from .utils import has_parent
# These are special properties which are (currently) treated differently
# by the converters:
......@@ -235,6 +235,12 @@ def create_records(values: GeneralStore, records: RecordStore, def_records: dict
keys_modified = []
for name, record in def_records.items():
# If only a name was given (like this:
# Experiment:
# ) set record to an empty dict / empty configuration
if record is None:
record = {}
role = "Record"
# This allows us to create e.g. Files
if "role" in record:
......@@ -300,6 +306,7 @@ def create_records(values: GeneralStore, records: RecordStore, def_records: dict
# no matter whether the record existed in the record store or not,
# parents will be added when they aren't present in the record yet:
if "parents" in record:
c_record.parents.clear()
for parent in record["parents"]:
# Do the variables replacement:
var_replaced_parent = replace_variables(parent, values)
......
......@@ -36,6 +36,7 @@ import importlib
import logging
import os
import sys
import traceback
import uuid
import warnings
from argparse import RawTextHelpFormatter
......@@ -407,12 +408,12 @@ class Crawler(object):
if p.value.path != cached.path:
raise RuntimeError(
"The cached and the referenced entity are not identical.\n"
f"Cached:\n{cached}\nRefernced:\n{el}"
f"Cached:\n{cached}\nReferenced:\n{el}"
)
else:
raise RuntimeError(
"The cached and the referenced entity are not identical.\n"
f"Cached:\n{cached}\nRefernced:\n{el}"
f"Cached:\n{cached}\nReferenced:\n{el}"
)
lst.append(cached)
else:
......@@ -428,12 +429,12 @@ class Crawler(object):
if p.value.path != cached.path:
raise RuntimeError(
"The cached and the referenced entity are not identical.\n"
f"Cached:\n{cached}\nRefernced:\n{p.value}"
f"Cached:\n{cached}\nReferenced:\n{p.value}"
)
else:
raise RuntimeError(
"The cached and the referenced entity are not identical.\n"
f"Cached:\n{cached}\nRefernced:\n{p.value}"
f"Cached:\n{cached}\nReferenced:\n{p.value}"
)
p.value = cached
......@@ -783,6 +784,8 @@ class Crawler(object):
for i in reversed(range(len(crawled_data))):
if not check_identical(crawled_data[i], identified_records[i]):
logger.debug("Scheduled update because of the following diff:\n"
+ str(compare_entities(crawled_data[i], identified_records[i])))
actual_updates.append(crawled_data[i])
return actual_updates
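The filtering step above can be sketched without any LinkAhead dependencies; `check_identical` is passed in as a plain callable here and the record dicts are made up for illustration:

```python
def filter_actual_updates(crawled_data, identified_records, check_identical):
    """Keep only those crawled records that differ from their identified
    server-side counterpart; identical records need no update."""
    actual_updates = []
    for crawled, identified in zip(crawled_data, identified_records):
        if not check_identical(crawled, identified):
            actual_updates.append(crawled)
    return actual_updates
```

With plain dicts and equality as the identity check, only the record whose value changed ends up in the update list.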
......@@ -1335,14 +1338,17 @@ def crawler_main(crawled_directory_path: str,
_update_status_record(crawler.run_id, len(inserts), len(updates), status="OK")
return 0
except ForbiddenTransaction as err:
logger.debug(traceback.format_exc())
logger.error(err)
_update_status_record(crawler.run_id, 0, 0, status="FAILED")
return 1
except ConverterValidationError as err:
logger.debug(traceback.format_exc())
logger.error(err)
_update_status_record(crawler.run_id, 0, 0, status="FAILED")
return 1
except Exception as err:
logger.debug(traceback.format_exc())
logger.debug(err)
if "SHARED_DIR" in os.environ:
......
......@@ -40,6 +40,12 @@ from .utils import has_parent
logger = logging.getLogger(__name__)
def get_children_of_rt(rtname):
"""Return the name of the given RecordType and the names of all of its
children in a list."""
return [p.name for p in db.execute_query(f"FIND RECORDTYPE {rtname}")]
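As a rough, dependency-free sketch of what this helper returns (the `CHILDREN_OF` dict is a made-up stand-in for the server-side RecordType hierarchy that the real `FIND RECORDTYPE` query resolves):

```python
# Hypothetical in-memory RecordType hierarchy; in the real code this
# information comes from a server-side `FIND RECORDTYPE {rtname}` query.
CHILDREN_OF = {
    "RT1": ["RT1a", "RT1b"],
    "RT1a": [],
    "RT1b": ["RT1b1"],
    "RT1b1": [],
}


def get_children_of_rt_sketch(rtname):
    """Return the name of `rtname` and of all its transitive children."""
    result = [rtname]
    for child in CHILDREN_OF.get(rtname, []):
        result.extend(get_children_of_rt_sketch(child))
    return result
```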
def convert_value(value: Any):
""" Returns a string representation of the value that is suitable
to be used in the query
......@@ -212,11 +218,16 @@ identifiabel, identifiable and identified record) for a Record.
# TODO: similar to the Identifiable class, Registered Identifiable should be a
# separate class too
if prop.name.lower() == "is_referenced_by":
for rtname in prop.value:
if (id(record) in referencing_entities
and rtname in referencing_entities[id(record)]):
identifiable_backrefs.extend(referencing_entities[id(record)][rtname])
else:
for givenrt in prop.value:
rt_and_children = get_children_of_rt(givenrt)
found = False
for rtname in rt_and_children:
if (id(record) in referencing_entities
and rtname in referencing_entities[id(record)]):
identifiable_backrefs.extend(
referencing_entities[id(record)][rtname])
found = True
if not found:
# TODO: is this the appropriate error?
raise NotImplementedError(
f"The following record is missing an identifying property:"
......
......@@ -270,6 +270,8 @@ Parameters
converters_path = []
for element in items:
element_path = os.path.join(*(structure_elements_path + [element.get_name()]))
logger.debug(f"Dealing with {element_path}")
for converter in converters:
# type is something like "matches files", replace isinstance with "type_matches"
......@@ -282,8 +284,7 @@ Parameters
record_store_copy = record_store.create_scoped_copy()
# Create an entry for this matched structure element that contains the path:
general_store_copy[converter.name] = (
os.path.join(*(structure_elements_path + [element.get_name()])))
general_store_copy[converter.name] = element_path
# extracts values from structure element and stores them in the
# variable store
......
# How to upgrade
## 0.6.x to 0.7.0
If you added Parents to Records at multiple places in the CFood, you must now
do this at a single location, because the `parents` key now overwrites any
previously set parents.
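A minimal sketch of the new behavior, using plain Python classes rather than the real crawler API (both names below are hypothetical):

```python
class RecordSketch:
    """Tiny stand-in for a crawler Record: only a name and a parent list."""

    def __init__(self, name, parents=None):
        self.name = name
        self.parents = list(parents or [])


def apply_parents_key(record, cfood_definition):
    """Since 0.7.0 a `parents` key replaces any previously set parents
    instead of appending to them."""
    if "parents" in cfood_definition:
        record.parents.clear()
        record.parents.extend(cfood_definition["parents"])
    return record
```

With a default parent `Experiment` already set, a lower-level `parents: ["Exp"]` entry now leaves the record with `Exp` as its only parent.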
## 0.5.x to 0.6.0
[#41](https://gitlab.com/caosdb/caosdb-crawler/-/issues/41) was fixed. This
means that you previously used the name of Entities as an identifying
......
......@@ -607,7 +607,7 @@ def test_create_flat_list():
assert c in flat
@ pytest.fixture
@pytest.fixture
def crawler_mocked_for_backref_test():
crawler = Crawler()
# mock retrieval of registered identifiables: return Record with just a parent
......@@ -651,6 +651,8 @@ def test_validation_error_print(caplog):
caplog.clear()
@patch("caoscrawler.identifiable_adapters.get_children_of_rt",
new=Mock(side_effect=lambda x: [x]))
def test_split_into_inserts_and_updates_backref(crawler_mocked_for_backref_test):
crawler = crawler_mocked_for_backref_test
identlist = [Identifiable(name="A", record_type="BR"),
......@@ -685,6 +687,8 @@ def test_split_into_inserts_and_updates_backref(crawler_mocked_for_backref_test)
assert insert[0].name == "B"
@patch("caoscrawler.identifiable_adapters.get_children_of_rt",
new=Mock(side_effect=lambda x: [x]))
def test_split_into_inserts_and_updates_mult_backref(crawler_mocked_for_backref_test):
# test whether multiple references of the same record type are correctly used
crawler = crawler_mocked_for_backref_test
......@@ -705,6 +709,8 @@ def test_split_into_inserts_and_updates_mult_backref(crawler_mocked_for_backref_
assert len(insert) == 2
@patch("caoscrawler.identifiable_adapters.get_children_of_rt",
new=Mock(side_effect=lambda x: [x]))
def test_split_into_inserts_and_updates_diff_backref(crawler_mocked_for_backref_test):
# test whether multiple references of different record types are correctly used
crawler = crawler_mocked_for_backref_test
......
......@@ -20,6 +20,8 @@ def clear_cache():
cache_clear()
@patch("caoscrawler.identifiable_adapters.get_children_of_rt",
new=Mock(side_effect=id))
@patch("caoscrawler.identifiable_adapters.cached_get_entity_by",
new=Mock(side_effect=mock_get_entity_by))
def test_file_identifiable():
......
---
metadata:
crawler-version: 0.6.1
---
Definitions:
type: Definitions
data:
type: Dict
match_name: '.*'
records:
Experiment:
Projekt:
parents: ["project"]
name: "p"
Campaign:
name: "c"
Stuff:
name: "s"
subtree:
Experiment:
type: DictElement
match: '.*'
records:
Experiment:
parents: ["Exp"]
name: "e"
Projekt:
parents: ["Projekt"]
Campaign:
parents: ["Cap"]
Stuff:
name: "s"
Experiment2:
type: DictElement
match: '.*'
records:
Campaign:
parents: ["Cap2"]
# encoding: utf-8
#
# This file is a part of the CaosDB Project.
#
# Copyright (C) 2021 Henrik tom Wörden <h.tomwoerden@indiscale.com>
# 2021-2023 Research Group Biomedical Physics,
# Max-Planck-Institute for Dynamics and Self-Organization Göttingen
# Alexander Schlemmer <alexander.schlemmer@ds.mpg.de>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#
import json
import logging
......@@ -276,3 +298,33 @@ def test_variable_deletion_problems():
assert record.get_property("var2").value == "test"
else:
raise RuntimeError("Wrong name")
def test_record_parents():
""" Test that the scanner returns the correct list of records """
data = {
'Experiments': {}
}
crawler_definition = load_definition(UNITTESTDIR / "test_parent_cfood.yml")
converter_registry = create_converter_registry(crawler_definition)
records = scan_structure_elements(DictElement(name="", value=data), crawler_definition,
converter_registry)
assert len(records) == 4
for rec in records:
if rec.name == 'e':
assert rec.parents[0].name == 'Exp' # default parent was overwritten
assert len(rec.parents) == 1
elif rec.name == 'c':
assert rec.parents[0].name == 'Cap2' # default parent was overwritten by second
# converter
assert len(rec.parents) == 1
elif rec.name == 'p':
assert rec.parents[0].name == 'Projekt' # top level set parent was overwritten
assert len(rec.parents) == 1
elif rec.name == 's':
assert rec.parents[0].name == 'Stuff' # default parent stays if no parent is given on
# lower levels
assert len(rec.parents) == 1