


To find the state of this project's repository at the time of any of these versions, check out the tags.


Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[0.8.0] - 2024-08-23

Added

Support for Python 3.12 and experimental support for 3.13
CFood macros now accept complex objects as values, not just strings.
More options for the CSVTableConverter

New converters:

DatetimeElementConverter
SPSSConverter


New scripts:

spss_to_datamodel
csv_to_datamodel


New transformer functions:

date_parse
datetime_parse


New PropertiesFromDictConverter, which allows property values to be created
automatically from dictionary keys.
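As a rough illustration of what the new date_parse and datetime_parse transformer functions do, the sketch below parses strings with the standard library. The (in_value, params) signature, the parameter keys date_format and datetime_format, and the default formats are assumptions made for this example and may differ from the crawler's actual implementation.

```python
import datetime


def date_parse(in_value: str, params: dict) -> datetime.date:
    """Parse a string into a date (hypothetical stand-in, not the crawler's code)."""
    # Assumed parameter key and default format.
    fmt = params.get("date_format", "%Y-%m-%d")
    return datetime.datetime.strptime(in_value, fmt).date()


def datetime_parse(in_value: str, params: dict) -> datetime.datetime:
    """Parse a string into a datetime (hypothetical stand-in, not the crawler's code)."""
    # Assumed parameter key and default format.
    fmt = params.get("datetime_format", "%Y-%m-%dT%H:%M:%S")
    return datetime.datetime.strptime(in_value, fmt)
```

In a CFood, such functions would be referenced from a transform section; the exact YAML keys are described in the crawler documentation.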


Changed

CFood macros do not render everything into strings now.
Better internal handling of identifiable/reference resolving and merging of entities.  This also
includes more understandable output for users.
Better handling of missing imports, with nice messages for users.
The configuration of advancedtools is no longer used to set the to and from email addresses.


Removed

Support for Python 3.7


Fixed


#93 cfood.yaml does not allow umlaut in $expression

#96 Do not fail silently on transaction errors


Security

Documentation

General improvement of the documentation, in many small places.
The API documentation should now also include documentation of the constructors.


[0.7.1] - 2024-03-21

Fixed


crawler_main no longer needs the deprecated debug=True to write a
provenance file if the provenance_file parameter is provided.

indiscale#129
missing packaging dependency.


[0.7.0] - 2024-03-04

Added


transform sections can be added to a CFood to apply functions to values stored in variables.
default transform functions: submatch, split and replace.

* can now be used as a wildcard in the identifiables parameter file to denote
that any Record may reference the identified one.

crawl.TreatedRecordLookUp class replacing the old (and slow)
identified_cache module. The new class now handles all records identified by
id, path, or identifiable simultaneously. See API docs for more info on how to
add to and get from the new lookup class.

identifiable_adapters.IdentifiableAdapter.get_identifying_referencing_entities
and
identifiable_adapters.IdentifiableAdapter.get_identifying_referenced_entities
static methods to return the referencing or referenced entities belonging to a
registered identifiable, respectively.

#70: Optional converters for HDF5 files. They require this package to be
installed with its h5-crawler dependency.
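The default transform functions named above (submatch, split and replace) can be pictured as small functions that take a value and a parameter dictionary. The following is a minimal sketch under assumptions: the (in_value, in_parameters) signature, the parameter keys (match, groupnum, marker, remove, insert) and the apply_transforms helper are hypothetical and only illustrate the idea of chaining functions over a variable's value.

```python
import re


def submatch(in_value: str, in_parameters: dict) -> str:
    """Return one capture group of a regular expression search (assumed keys)."""
    found = re.search(in_parameters["match"], in_value)
    if found is None:
        raise ValueError(f"{in_parameters['match']!r} does not match {in_value!r}")
    return found.group(in_parameters["groupnum"])


def split(in_value: str, in_parameters: dict) -> list:
    """Split the value at a marker string (assumed key: marker)."""
    return in_value.split(in_parameters["marker"])


def replace(in_value: str, in_parameters: dict) -> str:
    """Replace one substring by another (assumed keys: remove, insert)."""
    return in_value.replace(in_parameters["remove"], in_parameters["insert"])


# Hypothetical registry and helper, used only for this illustration.
TRANSFORMS = {"submatch": submatch, "split": split, "replace": replace}


def apply_transforms(value, steps):
    """Apply (function name, parameters) steps to a value, in order."""
    for name, parameters in steps:
        value = TRANSFORMS[name](value, parameters)
    return value
```

For example, apply_transforms("a_b;c_d", [("replace", {"remove": "_", "insert": "-"}), ("split", {"marker": ";"})]) first rewrites the underscores and then splits the result at the semicolon.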


Changed

If the parents key is used in a cfood at a lower level for a Record that
already has a Parent (because it was explicitly given or the default Parent),
the old Parent(s) are now overwritten with the value belonging to the
parents key.
If a registered identifiable states that a reference by a Record with parent
RT1 is needed, then references from Records that have a child of RT1
as parent are now also accepted.
More aggressive caching.
The identifiable_adapters.IdentifiableAdapter now creates (possibly empty)
reference lists for all records in create_reference_mapping. This allows
functions like get_identifiable to be called only with the subset of the
referencing entities belonging to a specific Record.
The identifiable_adapters.IdentifiableAdapter uses entity ids (negative for
entities that don't exist remotely) instead of entity objects for keeping
track of references.
Log output is either written to $SHARED_DIR/ (when this variable is set) or just to the terminal.
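The two IdentifiableAdapter changes above (reference lists for all records, and ids instead of entity objects) can be sketched as follows. This is not the crawler's code: entities are modelled as plain dictionaries, and ensure_id and this standalone create_reference_mapping are hypothetical helpers that only illustrate tracking references by id, with negative ids for entities that do not exist remotely.

```python
import itertools

# Counter yielding -1, -2, ... for entities without a remote id (assumption).
_temporary_ids = itertools.count(start=-1, step=-1)


def ensure_id(entity: dict) -> int:
    """Return the entity's id, assigning a negative temporary id if it has none."""
    if entity.get("id") is None:
        entity["id"] = next(_temporary_ids)
    return entity["id"]


def create_reference_mapping(entities: list) -> dict:
    """Map each entity id to the ids of the entities that reference it.

    Every entity gets an entry, possibly an empty list, so that lookups for
    unreferenced entities do not fail.
    """
    mapping = {ensure_id(entity): [] for entity in entities}
    for entity in entities:
        for target in entity.get("references", []):
            mapping.setdefault(ensure_id(target), []).append(entity["id"])
    return mapping
```

Keeping only integers in the mapping avoids comparing or hashing whole entity objects when references are resolved.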


Deprecated

IdentifiableAdapter.get_file


Removed


identified_cache module which was replaced by the crawl.TreatedRecordLookUp class.


Fixed

Empty Records can now be created (https://gitlab.com/caosdb/caosdb-crawler/-/issues/27)

#58 Documentation builds API docs in pipeline now.

#117 replace_variable no longer unnecessarily changes the type. Values stored
in variables in a CFood can now have other types.

indiscale#113
Resolving referenced entities fails in some corner cases. The crawler now
handles cases correctly in which entities retrieved from the server have to be
merged with local entities that both reference another, already existing
entity
A corner case in split_into_inserts_and_updates whereby two records created
in different places in the cfood definition would not be merged if both were
identified by the same LinkAhead id

#87 Handle long strings more gracefully.  The crawler sometimes runs into
linkahead-server#101, this is now mitigated.

indiscale#128 Yet another corner case of referencing resolution resolved.


[0.6.0] - 2023-06-23
(Florian Spreckelsen)

Added

Standard logging for server side execution
Email notification if the pycaosdb.ini contains a [caoscrawler] section with
send_crawler_notifications=True.
Creation of CrawlerRun Records that contain status information about data
integration of the crawler if the pycaosdb.ini contains a [caoscrawler] section
with create_crawler_status_records=True.
The Crawler synchronize function now takes a list of RecordType names.
Records that have the given names as parents are excluded from inserts or
updates.

Crawler.synchronize now takes an optional path_for_authorized_run argument
that specifies the path with which the crawler can be rerun to authorize
pending changes.
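The two pycaosdb.ini options mentioned above live in a [caoscrawler] section. The sketch below shows, under assumptions, how such flags could be read with Python's configparser. The section and option names come from the entries above; the read_crawler_flags helper and the default of False for missing options are hypothetical and not necessarily how the crawler reads its configuration.

```python
import configparser

EXAMPLE_INI = """
[caoscrawler]
send_crawler_notifications=True
create_crawler_status_records=True
"""


def read_crawler_flags(ini_text: str) -> dict:
    """Read the two boolean crawler options from ini-formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    flags = {}
    for option in ("send_crawler_notifications", "create_crawler_status_records"):
        # fallback=False covers both a missing section and a missing option.
        flags[option] = parser.getboolean("caoscrawler", option, fallback=False)
    return flags
```

With EXAMPLE_INI as input, both flags are read as True; with an empty configuration, both fall back to False.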


Fixed

Query generation when there are only backrefs or backrefs and a name
Query generation when there are spaces or ' in RecordType or Identifiable
names
usage of ID when looking for identified records
#41


Documentation

Expanded documentation, which now also has (better) tutorials.


[0.5.0] - 2023-03-28
(Florian Spreckelsen)

Changed

Refactoring of the crawl.py module: Now there is a separate scanner module handling the
collecting of information that is independent of CaosDB itself.
The signature of the function save_debug_data was changed to explicitly
take the debug_tree as its first argument. This change was necessary, as
the debug_tree is no longer saved as a member field of the Crawler class.


Deprecated

The functions load_definition, initialize_converters and
load_converters are deprecated. Please use the functions
load_definition, initialize_converters and
create_converter_registry from the scanner module instead.
The function start_crawling is deprecated. The function
scan_structure_elements in the scanner module mostly covers its
functionality.


[0.4.0] - 2023-03-22
(Florian Spreckelsen)

Added

DateElementConverter: allows text to be interpreted as a date object
the restricted_path argument allows crawling only a subtree
logging that provides a summary of what is inserted and updated
You can now access the file system path of a structure element (if it has one) using the variable
name <converter name>.path


add_prefix and remove_prefix arguments for the command line interface
and the crawler_main function for the adding/removal of path prefixes when
creating file entities.
More strict checking of identifiables.yaml.
Better error messages when server does not conform to expected data model.


Changed

The definitions for the default converters were removed from crawl.py and placed into
a separate yaml file called default_converters.yml. There is a new test testing for
the correct loading behavior of that file.
JSONFileConverter, YAMLFileConverter and MarkdownFileConverter now inherit from
SimpleFileConverter. Behavior is unchanged, except that the MarkdownFileConverter now raises a
ConverterValidationError when the YAML header cannot be read instead of silently not matching.


Deprecated

The prefix argument of crawler_main is deprecated. Use the new argument
remove_prefix instead.


Removed

The command line argument --prefix. Use the new argument --remove-prefix instead.


Fixed

An empty string as name is treated as no name (as the server does). This fixes
queries for identifiables, which would otherwise contain "WITH name=''",
an impossible condition. If your cfoods contained this case, they are ill-defined.
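The empty-name fix can be pictured with a small sketch of condition building. The function names and the exact condition syntax are assumptions made for this example; only the idea of skipping the name condition for an empty string is taken from the entry above.

```python
from typing import Optional


def name_condition(name: Optional[str]) -> Optional[str]:
    """Return a name condition, or None when there is no usable name."""
    # An empty string is treated exactly like a missing name.
    if name is None or name == "":
        return None
    return f"name='{name}'"


def build_conditions(name: Optional[str], properties: dict) -> str:
    """Join the name condition (if any) and property conditions with AND."""
    parts = []
    condition = name_condition(name)
    if condition is not None:
        parts.append(condition)
    for key, value in properties.items():
        parts.append(f"'{key}'='{value}'")
    return " AND ".join(parts)
```

With this approach an identifiable with an empty name yields only its property conditions, so the impossible name='' condition never reaches the server.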


[0.3.0] - 2022-01-30
(Florian Spreckelsen)

Added

Identifiable class to represent the information used to identify Records.
Added some StructureElements: BooleanElement, FloatElement, IntegerElement,
ListElement, DictElement
String representation for Identifiables

#43 the crawler version can now be specified in the metadata section of the
cfood definition. It is checked against the installed version upon loading of
the definition.
JSON schema validation can also be used in the DictElementConverter
YAMLFileConverter class to parse YAML files
Variables can now be substituted within the definition of yaml macros
debugging option for the match step of Converters
Re-introduced support for Python 3.7


Changed

Some StructureElements changed (see "How to upgrade" in the docs):

Dict, DictElement and DictDictElement were merged into DictElement.
DictTextElement and TextElement were merged into TextElement. The "match"
keyword is now invalid for TextElements.


JSONFileConverter creates another level of StructureElements (see "How to upgrade" in the docs)
create_flat_list function now collects entities in a set and also adds the entities
contained in the given list directly


Deprecated

The DictXYElements are now deprecated and are synonyms for the
XYElements.


Fixed


#39 Merge conflicts in split_into_inserts_and_updates when cached entity
references a record without id
Queries for identifiables with boolean properties are now created correctly.


[0.2.0] - 2022-11-18
(Florian Spreckelsen)

Added

the -c/--add-cwd-to-path option allows placing, for example, custom converter
modules in the current working directory (cwd), since the cwd is added to
the Python path.
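A minimal sketch of how such an option can work is given below. The option names are taken from the entry above; the parse_args and extend_path helpers, and the choice to put the cwd at the front of the path, are assumptions for illustration and not the crawler's actual command line code.

```python
import argparse
import os


def parse_args(argv: list) -> argparse.Namespace:
    """Parse a command line that supports -c/--add-cwd-to-path."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--add-cwd-to-path", action="store_true",
                        help="add the current working directory to the Python path")
    return parser.parse_args(argv)


def extend_path(args: argparse.Namespace, path_list: list) -> list:
    """Insert the cwd into path_list (for example sys.path) if requested."""
    if args.add_cwd_to_path:
        cwd = os.getcwd()
        if cwd not in path_list:
            # Assumption: the cwd takes precedence over other entries.
            path_list.insert(0, cwd)
    return path_list
```

Calling extend_path(parse_args(["-c"]), sys.path) would make modules in the cwd importable, which is what allows custom converter modules to be found there.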


Changed

Converters often used in dicts (DictFloatElementConverter,
DictIntegerElementConverter, ...) now accept other StructureElements by
default. For example a DictIntegerElement is accepted by default instead of a
DictFloatElement. This behavior can be changed (see converter documentation).
Note: This might lead to additional matches compared to previous versions.

_AbstractDictElementConverter uses re.DOTALL for match_value

The "fallback" parent, the name of the element in the cfood, is only used
when the object is created and only if there are no parents given.
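The effect of the re.DOTALL change for match_value can be shown with the standard library alone: without the flag, "." does not match a newline, so patterns fail on multi-line values. The sample value and pattern below are made up for this illustration.

```python
import re

# A multi-line value, as it may occur in a text element.
value = "first line\nsecond line"
pattern = r"first.*second line"

# Without re.DOTALL, "." stops at the newline and the match fails.
without_dotall = re.match(pattern, value)

# With re.DOTALL, "." also matches the newline and the pattern matches.
with_dotall = re.match(pattern, value, re.DOTALL)
```

Patterns that relied on "." stopping at line breaks may therefore match more than they did in earlier versions.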


Fixed


#31 Identified cache:
Hash is the same for Records without IDs
#30

#23 Crawler may overwrite and delete existing data in case of manually added
properties

#10 floats can be interpreted as integers and vice versa, there are defaults
for allowing other types and this can be changed per converter


[0.1.0] - 2022-10-11
(Florian Spreckelsen)

Added

Everything
Added new converters for tables: CSVTableConverter and XLSXTableConverter
Possibility to authorize updates as in the old crawler
Allow authorization of inserts
Allow splitting cfoods into multiple yaml documents
Implemented macros
Converters can now filter the list of children
You can now crawl data with name conflicts: synchronize(unique_names=False)


Changed

MAINT: Renamed module from newcrawler to caoscrawler

MAINT: Removed global converters from crawl.py


Fixed

FIX: #12

FIX: #14

FIX: Variables are now also replaced when the value is given as a list.
FIX: #35 Parent cannot be set from value

#6: Fixed many type hints to be compatible with Python 3.8

#9: Scalars of types other than string can now be given in cfood definitions