To find the state of this project's repository at the time of any of these versions, check out the tags.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.8.0] - 2024-08-23

Added

  • Support for Python 3.12 and experimental support for 3.13
  • CFood macros now accept complex objects as values, not just strings.
  • More options for the CSVTableConverter
  • New converters:
    • DatetimeElementConverter
    • SPSSConverter
  • New scripts:
    • spss_to_datamodel
    • csv_to_datamodel
  • New transformer functions:
    • date_parse
    • datetime_parse
  • New PropertiesFromDictConverter, which allows property values to be created automatically from dictionary keys.

Changed

  • CFood macros do not render everything into strings now.
  • Better internal handling of identifiable/reference resolving and merging of entities. This also includes more understandable output for users.
  • Better handling of missing imports, with nice messages for users.
  • No longer use configuration of advancedtools to set to and from email addresses

Removed

  • Support for Python 3.7

Fixed

  • #93 cfood.yaml does not allow umlaut in $expression
  • #96 Do not fail silently on transaction errors

Security

Documentation

  • General improvement of the documentation, in many small places.
  • The API documentation should now also include documentation of the constructors.

[0.7.1] - 2024-03-21

Fixed

  • crawler_main doesn't need the deprecated debug=True anymore to put out a provenance file if the provenance_file parameter is provided.
  • indiscale#129 missing packaging dependency.

[0.7.0] - 2024-03-04

Added

  • transform sections can be added to a CFood to apply functions to values stored in variables.
  • default transform functions: submatch, split and replace.
  • * can now be used as a wildcard in the identifiables parameter file to denote that any Record may reference the identified one.
  • crawl.TreatedRecordLookUp class replacing the old (and slow) identified_cache module. The new class now handles all records identified by id, path, or identifiable simultaneously. See API docs for more info on how to add to and get from the new lookup class.
  • identifiable_adapters.IdentifiableAdapter.get_identifying_referencing_entities and identifiable_adapters.IdentifiableAdapter.get_identifying_referenced_entities static methods to return the referencing or referenced entities belonging to a registered identifiable, respectively.
  • #70: Optional converters for HDF5 files. They require this package to be installed with its h5-crawler dependency.
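
The transform functions listed above (submatch, split and replace) apply a function to a value stored in a variable. The following is a minimal sketch of that idea in plain Python. It does not use the crawler's API: the function signatures, the parameter names (match, groupname, marker, remove, insert) and the helper apply_transforms are assumptions made for illustration only.

```python
import re


# Hypothetical stand-ins for the default transform functions; the real
# signatures and parameter names in the crawler may differ.
def submatch(value, params):
    """Return the named group of a regex match against value."""
    m = re.match(params["match"], value)
    if m is None:
        raise ValueError(f"{value!r} does not match {params['match']!r}")
    return m.group(params["groupname"])


def split(value, params):
    """Split a string at the given marker."""
    return value.split(params["marker"])


def replace(value, params):
    """Replace every occurrence of one substring with another."""
    return value.replace(params["remove"], params["insert"])


TRANSFORMS = {"submatch": submatch, "split": split, "replace": replace}


def apply_transforms(value, steps):
    """Apply a list of (function name, parameters) steps in order."""
    for name, params in steps:
        value = TRANSFORMS[name](value, params)
    return value


result = apply_transforms(
    "2024-03-04_alice,bob",
    [
        ("submatch", {"match": r"\d{4}-\d{2}-\d{2}_(?P<people>.*)",
                      "groupname": "people"}),
        ("replace", {"remove": "bob", "insert": "carol"}),
        ("split", {"marker": ","}),
    ],
)
print(result)  # ['alice', 'carol']
```

Note that the last step turns the string into a list, so a transformed value does not have to remain a string.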

Changed

  • If the parents key is used in a cfood at a lower level for a Record that already has a Parent (because it was explicitly given or the default Parent), the old Parent(s) are now overwritten with the value belonging to the parents key.
  • If a registered identifiable states that a reference by a Record with parent RT1 is needed, then references from Records that have a child of RT1 as parent are now also accepted.
  • More aggressive caching.
  • The identifiable_adapters.IdentifiableAdapter now creates (possibly empty) reference lists for all records in create_reference_mapping. This allows functions like get_identifiable to be called only with the subset of the referencing entities belonging to a specific Record.
  • The identifiable_adapters.IdentifiableAdapter uses entity ids (negative for entities that don't exist remotely) instead of entity objects for keeping track of references.
  • Log output is either written to $SHARED_DIR/ (when this variable is set) or just to the terminal.

Deprecated

  • IdentifiableAdapter.get_file

Removed

  • identified_cache module which was replaced by the crawl.TreatedRecordLookUp class.

Fixed

  • Empty Records can now be created (https://gitlab.com/caosdb/caosdb-crawler/-/issues/27)
  • #58 Documentation builds API docs in pipeline now.
  • #117 replace_variable no longer changes the type unnecessarily. Values stored in variables in a CFood can now have other types.
  • indiscale#113 Resolving referenced entities fails in some corner cases. The crawler now correctly handles cases in which entities retrieved from the server have to be merged with local entities that both reference another, already existing entity.
  • A corner case in split_into_inserts_and_updates whereby two records created in different places in the cfood definition would not be merged if both were identified by the same LinkAhead id
  • #87 Handle long strings more gracefully. The crawler sometimes runs into linkahead-server#101, this is now mitigated.
  • indiscale#128 Yet another corner case of referencing resolution resolved.

[0.6.0] - 2023-06-23

(Florian Spreckelsen)

Added

  • Standard logging for server side execution
  • Email notification if the pycaosdb.ini contains a [caoscrawler] section with send_crawler_notifications=True.
  • Creation of CrawlerRun Records that contain status information about data integration of the crawler if the pycaosdb.ini contains a [caoscrawler] section with create_crawler_status_records=True.
  • The Crawler synchronize function now takes a list of RecordType names. Records that have the given names as parents are excluded from inserts or updates.
  • Crawler.synchronize now takes an optional path_for_authorized_run argument that specifies the path with which the crawler can be rerun to authorize pending changes.

Fixed

  • Query generation when there are only backrefs or backrefs and a name
  • Query generation when there are spaces or ' in RecordType or Identifiable names
  • usage of ID when looking for identified records
  • #41

Documentation

  • Expanded documentation, which now also has (better) tutorials.

[0.5.0] - 2023-03-28

(Florian Spreckelsen)

Changed

  • Refactoring of the crawl.py module: Now there is a separate scanner module handling the collecting of information that is independent of CaosDB itself.
  • The signature of the function save_debug_data was changed to explicitly take the debug_tree as its first argument. This change was necessary, as the debug_tree is no longer saved as a member field of the Crawler class.

Deprecated

  • The functions load_definition, initialize_converters and load_converters are deprecated. Please use the functions load_definition, initialize_converters and create_converter_registry from the scanner module instead.
  • The function start_crawling is deprecated. The function scan_structure_elements in the scanner module mostly covers its functionality.

[0.4.0] - 2023-03-22

(Florian Spreckelsen)

Added

  • DateElementConverter: allows text to be interpreted as a date object
  • the restricted_path argument allows crawling only a subtree
  • logging that provides a summary of what is inserted and updated
  • You can now access the file system path of a structure element (if it has one) using the variable name <converter name>.path
  • add_prefix and remove_prefix arguments for the command line interface and the crawler_main function for the adding/removal of path prefixes when creating file entities.
  • More strict checking of identifiables.yaml.
  • Better error messages when server does not conform to expected data model.

Changed

  • The definitions for the default converters were removed from crawl.py and placed into a separate yaml file called default_converters.yml. There is a new test testing for the correct loading behavior of that file.
  • JSONFileConverter, YAMLFileConverter and MarkdownFileConverter now inherit from SimpleFileConverter. Behavior is unchanged, except that the MarkdownFileConverter now raises a ConverterValidationError when the YAML header cannot be read instead of silently not matching.

Deprecated

  • The prefix argument of crawler_main is deprecated. Use the new argument remove_prefix instead.

Removed

  • The command line argument --prefix. Use the new argument --remove-prefix instead.

Fixed

  • an empty string as name is treated as no name (as the server does). This fixes queries for identifiables, which would otherwise contain "WITH name=''", an impossible condition. If your cfoods contained this case, they are ill-defined.
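
The effect of this fix on query generation can be sketched as follows. This is an illustration, not the crawler's implementation: the helpers name_condition and build_query are hypothetical, the query string is simplified, and quoting of special characters in names is not handled.

```python
def name_condition(name):
    """Return the query fragment for a name, or '' if there is no name."""
    # Hypothetical helper: None and the empty string both mean "no name",
    # so neither produces a "WITH name=''" condition.
    if not name:
        return ""
    return f" WITH name='{name}'"


def build_query(record_type, name=None):
    """Assemble a simplified query string for an identifiable."""
    return f"FIND RECORD {record_type}{name_condition(name)}"


print(build_query("Experiment", ""))      # FIND RECORD Experiment
print(build_query("Experiment", "run1"))  # FIND RECORD Experiment WITH name='run1'
```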

[0.3.0] - 2023-01-30

(Florian Spreckelsen)

Added

  • Identifiable class to represent the information used to identify Records.
  • Added some StructureElements: BooleanElement, FloatElement, IntegerElement, ListElement, DictElement
  • String representation for Identifiables
  • #43 the crawler version can now be specified in the metadata section of the cfood definition. It is checked against the installed version upon loading of the definition.
  • JSON schema validation can also be used in the DictElementConverter
  • YAMLFileConverter class; to parse YAML files
  • Variables can now be substituted within the definition of yaml macros
  • debugging option for the match step of Converters
  • Re-introduced support for Python 3.7

Changed

  • Some StructureElements changed (see "How to upgrade" in the docs):
    • Dict, DictElement and DictDictElement were merged into DictElement.
    • DictTextElement and TextElement were merged into TextElement. The "match" keyword is now invalid for TextElements.
  • JSONFileConverter creates another level of StructureElements (see "How to upgrade" in the docs)
  • create_flat_list function now collects entities in a set and also adds the entities contained in the given list directly

Deprecated

  • The DictXYElements are now deprecated and are synonyms for the XYElements.

Fixed

  • #39 Merge conflicts in split_into_inserts_and_updates when cached entity references a record without id
  • Queries for identifiables with boolean properties are now created correctly.

[0.2.0] - 2022-11-18

(Florian Spreckelsen)

Added

  • the -c/--add-cwd-to-path option allows placing, for example, custom converter modules in the current working directory (cwd), since the cwd is added to the Python path.

Changed

  • Converters often used in dicts (DictFloatElementConverter, DictIntegerElementConverter, ...) now accept other StructureElements by default. For example, a DictIntegerElement is accepted by default instead of a DictFloatElement. This behavior can be changed (see converter documentation). Note: This might lead to additional matches compared to previous versions.
  • _AbstractDictElementConverter uses re.DOTALL for match_value
  • The "fallback" parent, the name of the element in the cfood, is only used when the object is created and only if there are no parents given.
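
For readers unfamiliar with re.DOTALL, mentioned above for match_value: the flag makes "." also match newline characters, so a pattern can span multi-line values. The snippet below only demonstrates the standard library flag; the sample value and pattern are made up, and how the converter applies match_value internally is not shown.

```python
import re

# Made-up multi-line value and pattern, for illustration only.
value = "first line\nsecond line"
pattern = r"first.*second line"

# Without re.DOTALL, "." stops at the newline and the match fails.
without_dotall = re.match(pattern, value)

# With re.DOTALL, "." also matches "\n", so the pattern spans both lines.
with_dotall = re.match(pattern, value, re.DOTALL)

print(without_dotall is None)   # True
print(with_dotall is not None)  # True
```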

Fixed

  • #31 Identified cache: Hash is the same for Records without IDs
  • #30
  • #23 Crawler may overwrite and delete existing data in case of manually added properties
  • #10 floats can be interpreted as integers and vice versa. There are defaults for allowing other types, and this can be changed per converter.

[0.1.0] - 2022-10-11

(Florian Spreckelsen)

Added

  • Everything
  • Added new converters for tables: CSVTableConverter and XLSXTableConverter
  • Possibility to authorize updates as in the old crawler
  • Allow authorization of inserts
  • Allow splitting cfoods into multiple yaml documents
  • Implemented macros
  • Converters can now filter the list of children
  • You can now crawl data with name conflicts: synchronize(unique_names=False)

Changed

  • MAINT: Renamed module from newcrawler to caoscrawler
  • MAINT: Removed global converters from crawl.py

Fixed

  • FIX: #12
  • FIX: #14
  • FIX: Variables are now also replaced when the value is given as a list.
  • FIX: #35 Parent cannot be set from value
  • #6: Fixed many type hints to be compatible with Python 3.8
  • #9: Scalars of types different than string can now be given in cfood definitions