Florian Spreckelsen authored
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[0.8.0] - 2024-08-23
Added
- Support for Python 3.12 and experimental support for 3.13
- CFood macros now accept complex objects as values, not just strings.
- More options for the `CSVTableConverter`
- New converters:
  - `DatetimeElementConverter`
  - `SPSSConverter`
- New scripts:
  - `spss_to_datamodel`
  - `csv_to_datamodel`
- New transformer functions:
  - `date_parse`
  - `datetime_parse`
- New `PropertiesFromDictConverter` which allows automatically creating property values from dictionary keys.
Changed
- CFood macros do not render everything into strings now.
- Better internal handling of identifiable/reference resolving and merging of entities. This also includes more understandable output for users.
- Better handling of missing imports, with nice messages for users.
- The configuration of advancedtools is no longer used to set the to and from email addresses.
Removed
- Support for Python 3.7
Fixed
Security
Documentation
- General improvement of the documentation, in many small places.
- The API documentation should now also include documentation of the constructors.
[0.7.1] - 2024-03-21
Fixed
- `crawler_main` doesn't need the deprecated `debug=True` anymore to put out a provenance file if the `provenance_file` parameter is provided.
- indiscale#129 missing packaging dependency.
[0.7.0] - 2024-03-04
Added
- `transform` sections can be added to a CFood to apply functions to values stored in variables.
- default transform functions: submatch, split and replace.
- `*` can now be used as a wildcard in the identifiables parameter file to denote that any Record may reference the identified one.
- `crawl.TreatedRecordLookUp` class replacing the old (and slow) `identified_cache` module. The new class now handles all records identified by id, path, or identifiable simultaneously. See API docs for more info on how to add to and get from the new lookup class.
- `identifiable_adapters.IdentifiableAdapter.get_identifying_referencing_entities` and `identifiable_adapters.IdentifiableAdapter.get_identifying_referenced_entities` static methods to return the referencing or referenced entities belonging to a registered identifiable, respectively.
- #70: Optional converters for HDF5 files. They require this package to be installed with its `h5-crawler` dependency.
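The idea behind `transform` sections can be sketched in plain Python. This sketch does not use the crawler's API. The function names mirror the default transform functions listed above, but their signatures, the parameter names (`match`, `groupname`, `marker`, `remove`, `insert`) and the `apply_transforms` helper are assumptions made for illustration.

```python
import re

# Hypothetical stand-ins for the default transform functions.
# The real signatures in the crawler may differ.


def submatch(value, params):
    """Return the named regex group params['groupname'] found in value."""
    found = re.search(params["match"], value)
    if found is None:
        raise ValueError(f"{params['match']!r} does not match {value!r}")
    return found.group(params["groupname"])


def split(value, params):
    """Split value at every occurrence of params['marker']."""
    return value.split(params["marker"])


def replace(value, params):
    """Replace params['remove'] by params['insert'] in value."""
    return value.replace(params["remove"], params["insert"])


FUNCTIONS = {"submatch": submatch, "split": split, "replace": replace}


def apply_transforms(variables, name, steps):
    """Apply each (function name, params) step in order to variables[name].

    The result is returned; `variables` itself is left unchanged.
    """
    value = variables[name]
    for func_name, params in steps:
        value = FUNCTIONS[func_name](value, params)
    return value


# Hypothetical variable store, as it might look after matching a directory.
variables = {"dir_name": "2024_sample-a,sample-b"}
samples = apply_transforms(
    variables,
    "dir_name",
    [
        ("submatch", {"match": r"\d{4}_(?P<rest>.*)", "groupname": "rest"}),
        ("replace", {"remove": "sample-", "insert": ""}),
        ("split", {"marker": ","}),
    ],
)
print(samples)
```

Because the steps are applied in order, a later function may receive a value of a different type than the original string (here, `split` turns the string into a list).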
Changed
- If the `parents` key is used in a cfood at a lower level for a Record that already has a Parent (because it was explicitly given or the default Parent), the old Parent(s) are now overwritten with the value belonging to the `parents` key.
- If a registered identifiable states that a reference by a Record with parent RT1 is needed, then references from Records that have a child of RT1 as parent are now also accepted.
- More aggressive caching.
- The `identifiable_adapters.IdentifiableAdapter` now creates (possibly empty) reference lists for all records in `create_reference_mapping`. This allows functions like `get_identifiable` to be called only with the subset of the referencing entities belonging to a specific Record.
- The `identifiable_adapters.IdentifiableAdapter` uses entity ids (negative for entities that don't exist remotely) instead of entity objects for keeping track of references.
- Log output is either written to $SHARED_DIR/ (when this variable is set) or just to the terminal.
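The log destination rule in the last item can be sketched as follows. This is not the crawler's own logging setup. The helper name `make_log_handler`, the logger name and the log file name `crawler.log` are assumptions. Only the `SHARED_DIR` variable comes from the changelog entry.

```python
import logging
import os
import sys


def make_log_handler(logfile_name="crawler.log"):
    """Return a file handler inside $SHARED_DIR if that variable is set,
    otherwise a handler that writes to the terminal (stderr).

    `logfile_name` is a hypothetical file name chosen for this sketch.
    """
    shared_dir = os.environ.get("SHARED_DIR")
    if shared_dir:
        return logging.FileHandler(os.path.join(shared_dir, logfile_name))
    return logging.StreamHandler(sys.stderr)


if __name__ == "__main__":
    logger = logging.getLogger("example_crawler")
    logger.setLevel(logging.INFO)
    logger.addHandler(make_log_handler())
    logger.info("crawler started")
```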
Deprecated
- `IdentifiableAdapter.get_file`
Removed
- `identified_cache` module which was replaced by the `crawl.TreatedRecordLookUp` class.
Fixed
- Empty Records can now be created (https://gitlab.com/caosdb/caosdb-crawler/-/issues/27)
- #58 Documentation builds API docs in pipeline now.
- #117 `replace_variable` no longer unnecessarily changes the type. Values stored in variables in a CFood can now have other types.
- indiscale#113 Resolving referenced entities fails in some corner cases. The crawler now correctly handles cases in which entities retrieved from the server have to be merged with local entities that both reference another, already existing entity.
- A corner case in `split_into_inserts_and_updates` whereby two records created in different places in the cfood definition would not be merged if both were identified by the same LinkAhead id.
- #87 Handle long strings more gracefully. The crawler sometimes runs into linkahead-server#101; this is now mitigated.
- indiscale#128 Yet another corner case of referencing resolution resolved.
[0.6.0] - 2023-06-23
(Florian Spreckelsen)
Added
- Standard logging for server side execution
- Email notification if the `pycaosdb.ini` contains a `[caoscrawler]` section with `send_crawler_notifications=True`.
- Creation of CrawlerRun Records that contain status information about data integration of the crawler if the `pycaosdb.ini` contains a `[caoscrawler]` section with `create_crawler_status_records=True`.
- The Crawler `synchronize` function now takes a list of RecordType names. Records that have the given names as parents are excluded from inserts or updates.
- `Crawler.synchronize` now takes an optional `path_for_authorized_run` argument that specifies the path with which the crawler can be rerun to authorize pending changes.
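As an illustration of the two `pycaosdb.ini` options above, the following sketch reads them with Python's standard `configparser`. The section and option names are taken from the entries above. The helper `crawler_flag` and the fallback value of `False` are assumptions and say nothing about how the crawler itself reads its configuration.

```python
import configparser

# Example content of a pycaosdb.ini that enables both features.
EXAMPLE_INI = """
[caoscrawler]
send_crawler_notifications=True
create_crawler_status_records=True
"""


def crawler_flag(config, option):
    """Return a boolean option from the [caoscrawler] section.

    Falls back to False if the section or the option is missing
    (an assumption made for this sketch).
    """
    return config.getboolean("caoscrawler", option, fallback=False)


config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)

print(crawler_flag(config, "send_crawler_notifications"))
print(crawler_flag(config, "create_crawler_status_records"))
```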
Fixed
- Query generation when there are only backrefs or backrefs and a name
- Query generation when there are spaces or `'` in RecordType or Identifiable names
- usage of ID when looking for identified records
- #41
Documentation
- Expanded documentation, also has (better) tutorials now.
[0.5.0] - 2023-03-28
(Florian Spreckelsen)
Changed
- Refactoring of the crawl.py module: Now there is a separate scanner module handling the collecting of information that is independent of CaosDB itself.
- The signature of the function `save_debug_data` was changed to explicitly take the `debug_tree` as its first argument. This change was necessary, as the `debug_tree` is no longer saved as member field of the Crawler class.
Deprecated
- The functions `load_definition`, `initialize_converters` and `load_converters` are deprecated. Please use the functions `load_definition`, `initialize_converters` and `create_converter_registry` from the scanner module instead.
- The function `start_crawling` is deprecated. The function `scan_structure_elements` in the scanner module mostly covers its functionality.
[0.4.0] - 2023-03-22
(Florian Spreckelsen)
Added
- DateElementConverter: allows interpreting text as a date object
- the restricted_path argument allows crawling only a subtree
- logging that provides a summary of what is inserted and updated
- You can now access the file system path of a structure element (if it has one) using the variable name `<converter name>.path`
- `add_prefix` and `remove_prefix` arguments for the command line interface and the `crawler_main` function for the adding/removal of path prefixes when creating file entities.
- More strict checking of `identifiables.yaml`.
- Better error messages when server does not conform to expected data model.
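The effect of the `add_prefix` and `remove_prefix` arguments on a file path can be pictured with a small sketch. The helper `adjust_path`, the example paths, the error on a non-matching prefix and the order of operations (removal first, then addition) are assumptions for illustration, not the crawler's implementation.

```python
def adjust_path(path, remove_prefix=None, add_prefix=None):
    """Strip `remove_prefix` from the start of `path`, then prepend `add_prefix`.

    Raises ValueError if `remove_prefix` is given but `path` does not
    start with it (a choice made for this sketch).
    """
    if remove_prefix:
        if not path.startswith(remove_prefix):
            raise ValueError(f"{path!r} does not start with {remove_prefix!r}")
        path = path[len(remove_prefix):]
    if add_prefix:
        path = add_prefix + path
    return path


# Hypothetical local path mapped to a hypothetical path of the file entity.
print(adjust_path("/mnt/data/experiment/run1.csv",
                  remove_prefix="/mnt/data",
                  add_prefix="/archive"))
```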
Changed
- The definitions for the default converters were removed from crawl.py and placed into a separate yaml file called `default_converters.yml`. There is a new test testing for the correct loading behavior of that file.
- JSONFileConverter, YAMLFileConverter and MarkdownFileConverter now inherit from SimpleFileConverter. Behavior is unchanged, except that the MarkdownFileConverter now raises a ConverterValidationError when the YAML header cannot be read instead of silently not matching.
Deprecated
- The `prefix` argument of `crawler_main` is deprecated. Use the new argument `remove_prefix` instead.
Removed
- The command line argument `--prefix`. Use the new argument `--remove-prefix` instead.
Fixed
- an empty string as name is treated as no name (as does the server). This fixes queries for identifiables, since they would otherwise contain "WITH name=''", which is an impossible condition. If your cfoods contained this case, they are ill defined.
[0.3.0] - 2023-01-30
(Florian Spreckelsen)
Added
- Identifiable class to represent the information used to identify Records.
- Added some StructureElements: BooleanElement, FloatElement, IntegerElement, ListElement, DictElement
- String representation for Identifiables
- #43 the crawler version can now be specified in the `metadata` section of the cfood definition. It is checked against the installed version upon loading of the definition.
- JSON schema validation can also be used in the DictElementConverter
- YAMLFileConverter class; to parse YAML files
- Variables can now be substituted within the definition of yaml macros
- debugging option for the match step of Converters
- Re-introduced support for Python 3.7
Changed
- Some StructureElements changed (see "How to upgrade" in the docs):
  - Dict, DictElement and DictDictElement were merged into DictElement.
  - DictTextElement and TextElement were merged into TextElement. The "match" keyword is now invalid for TextElements.
- JSONFileConverter creates another level of StructureElements (see "How to upgrade" in the docs)
- create_flat_list function now collects entities in a set and also adds the entities contained in the given list directly
Deprecated
- The DictXYElements are now deprecated and are now synonyms for the XYElements.
Fixed
- #39 Merge conflicts in `split_into_inserts_and_updates` when cached entity references a record without id
- Queries for identifiables with boolean properties are now created correctly.
[0.2.0] - 2022-11-18
(Florian Spreckelsen)
Added
- the -c/--add-cwd-to-path option allows placing, for example, custom converter modules in the current working directory (cwd), since the cwd is added to the Python path.
Changed
- Converters often used in dicts (DictFloatElementConverter, DictIntegerElementConverter, ...) now accept other StructureElements by default. For example a DictIntegerElement is accepted by default instead of a DictFloatElement. This behavior can be changed (see converter documentation). Note: This might lead to additional matches compared to previous versions.
- `_AbstractDictElementConverter` uses `re.DOTALL` for `match_value`
- The "fallback" parent, the name of the element in the cfood, is only used when the object is created and only if there are no parents given.
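The practical effect of the `re.DOTALL` flag mentioned above can be seen with Python's standard `re` module: with the flag, `.` also matches newline characters, so a pattern can span multi-line values. The example value and pattern are made up for illustration. This is not converter code.

```python
import re

# A multi-line value, as it might appear in a dictionary entry.
value = "first line\nsecond line"
pattern = r"first.*second"

# Without the flag, "." stops at the newline and the pattern cannot match.
without_dotall = re.match(pattern, value)

# With re.DOTALL, "." also matches "\n", so the pattern spans both lines.
with_dotall = re.match(pattern, value, re.DOTALL)

print(without_dotall is None)
print(with_dotall is not None)
```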
Fixed
- #31 Identified cache: Hash is the same for Records without IDs
- #30
- #23 Crawler may overwrite and delete existing data in case of manually added properties
- #10 floats can be interpreted as integers and vice versa, there are defaults for allowing other types and this can be changed per converter
[0.1.0] - 2022-10-11
(Florian Spreckelsen)
Added
- Everything
- Added new converters for tables: CSVTableConverter and XLSXTableConverter
- Possibility to authorize updates as in the old crawler
- Allow authorization of inserts
- Allow splitting cfoods into multiple yaml documents
- Implemented macros
- Converters can now filter the list of children
- You can now crawl data with name conflicts: `synchronize(unique_names=False)`
Changed
- MAINT: Renamed module from `newcrawler` to `caoscrawler`
- MAINT: Removed global converters from `crawl.py`
Fixed
- FIX: #12
- FIX: #14
- FIX: Variables are now also replaced when the value is given as a list.
- FIX: #35 Parent cannot be set from value
- #6: Fixed many type hints to be compatible with Python 3.8
- #9: Scalars of types different than string can now be given in cfood definitions