Compare revisions: caosdb/src/caosdb-crawler

Changes are shown as if the source revision was being merged into the target revision.

Commits on Source (66)
Showing with 807 additions and 186 deletions
......@@ -9,13 +9,44 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added ###
### Changed ###
### Deprecated ###
### Removed ###
### Fixed ###
### Security ###
### Documentation ###
## [0.8.0] - 2024-08-23 ##
### Added ###
* Support for Python 3.12 and experimental support for 3.13
* `spss_to_datamodel` script.
* `SPSSConverter` class
* CFood macros now accept complex objects as values, not just strings.
* More options for the `CSVTableConverter`
* New converters:
* `DatetimeElementConverter`
* `SPSSConverter`
* New scripts:
* `spss_to_datamodel`
* `csv_to_datamodel`
* New transformer functions:
* `date_parse`
* `datetime_parse`
* New ``PropertiesFromDictConverter`` which can automatically
  create property values from dictionary keys.
### Changed ###
* CFood macros no longer render everything into strings.
* Better internal handling of identifiable/reference resolving and merging of entities. This also
  includes more understandable output for users.
* Better handling of missing imports, with nice messages for users.
* No longer use the configuration of advancedtools to set the "from" and "to" email addresses.
### Deprecated ###
### Removed ###
......@@ -24,11 +55,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Fixed ###
* [93](https://gitlab.com/linkahead/linkahead-crawler/-/issues/93) cfood.yaml does not allow umlaut in $expression
* [96](https://gitlab.com/linkahead/linkahead-crawler/-/issues/96) Do not fail silently on transaction errors
### Security ###
### Documentation ###
* General improvement of the documentation, in many small places.
* The API documentation should now also include documentation of the constructors.
## [0.7.1] - 2024-03-21 ##
### Fixed ###
......@@ -170,6 +205,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- ``add_prefix`` and ``remove_prefix`` arguments for the command line interface
and the ``crawler_main`` function for the adding/removal of path prefixes when
creating file entities.
- More strict checking of `identifiables.yaml`.
- Better error messages when server does not conform to expected data model.
### Changed ###
......@@ -218,7 +255,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Some StructureElements changed (see "How to upgrade" in the docs):
- Dict, DictElement and DictDictElement were merged into DictElement.
- DictTextElement and TextElement were merged into TextElement. The "match"
keyword is now invalid for TextElements.
keyword is now invalid for TextElements.
- JSONFileConverter creates another level of StructureElements (see "How to upgrade" in the docs)
- create_flat_list function now collects entities in a set and also adds the entities
contained in the given list directly
......
......@@ -17,6 +17,6 @@ authors:
given-names: Alexander
orcid: https://orcid.org/0000-0003-4124-9649
title: CaosDB - Crawler
version: 0.7.1
version: 0.8.0
doi: 10.3390/data9020024
date-released: 2023-03-21
\ No newline at end of file
date-released: 2024-08-23
\ No newline at end of file
......@@ -32,7 +32,7 @@ import sys
from argparse import RawTextHelpFormatter
from pathlib import Path
import caosdb as db
import linkahead as db
import pytest
import yaml
from caosadvancedtools.crawler import Crawler as OldCrawler
......@@ -42,8 +42,8 @@ from caoscrawler.debug_tree import DebugTree
from caoscrawler.identifiable import Identifiable
from caoscrawler.identifiable_adapters import CaosDBIdentifiableAdapter
from caoscrawler.scanner import scan_directory
from caosdb import EmptyUniqueQueryError
from caosdb.utils.register_tests import clear_database, set_test_key
from linkahead import EmptyUniqueQueryError
from linkahead.utils.register_tests import clear_database, set_test_key
set_test_key("10b128cf8a1372f30aa3697466bb55e76974e0c16a599bb44ace88f19c8f61e2")
......
......@@ -27,12 +27,12 @@ import os
import pytest
from subprocess import run
import caosdb as db
import linkahead as db
from caosadvancedtools.loadFiles import loadpath
from caosdb.cached import cache_clear
from linkahead.cached import cache_clear
from caosadvancedtools.models import parser as parser
from caoscrawler.crawl import crawler_main
from caosdb.utils.register_tests import clear_database, set_test_key
from linkahead.utils.register_tests import clear_database, set_test_key
set_test_key("10b128cf8a1372f30aa3697466bb55e76974e0c16a599bb44ace88f19c8f61e2")
......
[metadata]
name = caoscrawler
version = 0.7.2
version = 0.8.1
author = Alexander Schlemmer
author_email = alexander.schlemmer@ds.mpg.de
description = A new crawler for caosdb
......
cfood:
type: object
properties:
Converters:
description: Definition of custom converters
type: object
additionalProperties:
type: object
properties:
converter:
type: string
package:
type: string
required:
- converter
- package
macros:
description: Macro definitions
type: array
Transformers:
description: Variable transformer definition
type: object
additionalProperties:
type: object
properties:
function:
type: string
package:
type: string
required:
- package
- function
additionalProperties:
$ref:
"#/$defs/converter"
$defs:
parents:
description: Parents for this record are given here as a list of names.
type: array
items:
type: string
converter:
properties:
type:
......@@ -28,7 +63,9 @@ cfood:
- Definitions
- Dict
- Date
- Datetime
- JSONFile
- YAMLFile
- CSVTableConverter
- XLSXTableConverter
- SPSSFile
......@@ -36,6 +73,7 @@ cfood:
- H5Dataset
- H5Group
- H5Ndarray
- PropertiesFromDictElement
description: Type of this converter node.
match:
description: typically a regexp which is matched to a structure element name
......@@ -46,15 +84,46 @@ cfood:
match_value:
description: a regexp that is matched to the value of a key-value pair
type: string
records:
description: This field is used to define new records or to modify records which have been defined on a higher level.
record_from_dict:
description: Only relevant for PropertiesFromDictElement. Specify the root record which is generated from the contained dictionary.
type: object
required:
- variable_name
properties:
parents:
description: Parents for this record are given here as a list of names.
variable_name:
description: |
Name of the record by which it can be accessed in the
cfood definition. Can also be the name of an existing
record in which case that record will be updated by
the PropertiesFromDictConverter.
type: string
properties_blacklist:
description: List of keys to be ignored in the automatic treatment. They will be ignored on all levels of the dictionary.
type: array
items:
type: string
references:
description: List of keys that will be transformed into named reference properties.
type: object
additionalProperties:
type: object
properties:
parents:
$ref:
"#/$defs/parents"
name:
description: Name of this record. If none is given, variable_name is used.
type: string
parents:
$ref:
"#/$defs/parents"
records:
description: This field is used to define new records or to modify records which have been defined on a higher level.
type: object
properties:
parents:
$ref:
"#/$defs/parents"
additionalProperties:
oneOf:
- type: object
......@@ -76,3 +145,15 @@ cfood:
additionalProperties:
$ref:
"#/$defs/converter"
if:
properties:
type:
const:
"PropertiesFromDictElement"
then:
required:
- type
- record_from_dict
else:
required:
- type
......@@ -432,6 +432,7 @@ class Converter(object, metaclass=ABCMeta):
return
for transformer_key, transformer in self.definition["transform"].items():
in_value = replace_variables(transformer["in"], values)
out_value = in_value
for tr_func_el in transformer["functions"]:
if not isinstance(tr_func_el, dict):
......@@ -817,6 +818,180 @@ class DictElementConverter(Converter):
return match_name_and_value(self.definition, element.name, element.value)
class PropertiesFromDictConverter(DictElementConverter):
"""Extend the :py:class:`DictElementConverter` by a heuristic to set
property values from the dictionary keys.
"""
def __init__(self, definition: dict, name: str, converter_registry: dict,
referenced_record_callback: Optional[callable] = None):
super().__init__(definition, name, converter_registry)
self.referenced_record_callback = referenced_record_callback
def _recursively_create_records(self, subdict: dict, root_record: db.Record,
root_rec_name: str,
values: GeneralStore, records: RecordStore,
referenced_record_callback: callable,
keys_modified: list = []
):
"""Create a record form the given `subdict` and recursively create referenced records."""
        blacklisted_keys = self.definition["record_from_dict"].get("properties_blacklist", [])
        special_references = self.definition["record_from_dict"].get("references", [])
for key, value in subdict.items():
if key in blacklisted_keys:
# We ignore this in the automated property generation
continue
if isinstance(value, list):
if not any([isinstance(val, dict) for val in value]):
# no dict in list, i.e., no references, so this is simple
root_record.add_property(name=key, value=value)
else:
if not all([isinstance(val, dict) for val in value]):
# if this is not an error (most probably it is), this
# needs to be handled manually for now.
raise ValueError(
f"{key} in {subdict} contains a mixed list of references and scalars.")
ref_recs = []
for ii, ref_dict in enumerate(value):
ref_var_name = f"{root_rec_name}.{key}.{ii+1}"
ref_rec, keys_modified = self._create_ref_rec(
ref_var_name,
key,
ref_dict,
special_references,
records,
values,
keys_modified,
referenced_record_callback
)
ref_recs.append(ref_rec)
root_record.add_property(name=key, value=ref_recs)
elif isinstance(value, dict):
# Treat scalar reference
ref_var_name = f"{root_rec_name}.{key}"
ref_rec, keys_modified = self._create_ref_rec(
ref_var_name,
key,
value,
special_references,
records,
values,
keys_modified,
referenced_record_callback
)
root_record.add_property(key, ref_rec)
else:
# All that remains are scalar properties which may or
# may not be special attributes like name.
if key.lower() in SPECIAL_PROPERTIES:
setattr(root_record, key.lower(), value)
else:
root_record.add_property(name=key, value=value)
keys_modified.append((root_rec_name, key))
if referenced_record_callback:
root_record = referenced_record_callback(root_record, records, values)
return keys_modified
def _create_ref_rec(
self,
name: str,
key: str,
subdict: dict,
special_references: dict,
records: RecordStore,
values: GeneralStore,
keys_modified: list,
referenced_record_callback: callable
):
"""Create the referenced Record and forward the stores etc. to
``_recursively_create_records``.
        Parameters
        ----------
name : str
name of the referenced record to be created in RecordStore and Value Store.
key : str
name of the key this record's definition had in the original dict.
subdict : dict
subdict containing this record's definition from the original dict.
special_references : dict
special treatment of referenced records from the converter definition.
records : RecordStore
RecordStore for entering new Records
values : GeneralStore
ValueStore for entering new Records
keys_modified : list
List for keeping track of changes
referenced_record_callback : callable
Advanced treatment of referenced records as given in the
converter initialization.
"""
ref_rec = db.Record()
if key in special_references:
for par in special_references[key]["parents"]:
ref_rec.add_parent(par)
else:
ref_rec.add_parent(key)
records[name] = ref_rec
values[name] = ref_rec
keys_modified = self._recursively_create_records(
subdict=subdict,
root_record=ref_rec,
root_rec_name=name,
values=values,
records=records,
referenced_record_callback=referenced_record_callback,
keys_modified=keys_modified
)
return ref_rec, keys_modified
def create_records(self, values: GeneralStore, records: RecordStore,
element: StructureElement):
keys_modified = []
rfd = self.definition["record_from_dict"]
if rfd["variable_name"] not in records:
rec = db.Record()
if "name" in rfd:
rec.name = rfd["name"]
if "parents" in rfd:
for par in rfd["parents"]:
rec.add_parent(par)
else:
rec.add_parent(rfd["variable_name"])
records[rfd["variable_name"]] = rec
values[rfd["variable_name"]] = rec
else:
rec = records[rfd["variable_name"]]
keys_modified = self._recursively_create_records(
subdict=element.value,
root_record=rec,
root_rec_name=rfd["variable_name"],
values=values,
records=records,
referenced_record_callback=self.referenced_record_callback,
keys_modified=keys_modified,
)
keys_modified.extend(super().create_records(
values=values, records=records, element=element))
return keys_modified
class DictConverter(DictElementConverter):
def __init__(self, *args, **kwargs):
warnings.warn(DeprecationWarning(
......@@ -1240,11 +1415,12 @@ class DateElementConverter(TextElementConverter):
"""allows to convert different text formats of dates to Python date objects.
The text to be parsed must be contained in the "date" group. The format string can be supplied
under "dateformat" in the Converter definition. The library used is datetime so see its
under "date_format" in the Converter definition. The library used is datetime so see its
documentation for information on how to create the format string.
"""
# TODO make `date` parameter name configurable
def match(self, element: StructureElement):
matches = super().match(element)
if matches is not None and "date" in matches:
......@@ -1253,3 +1429,24 @@ class DateElementConverter(TextElementConverter):
self.definition["date_format"] if "date_format" in self.definition else "%Y-%m-%d"
).date()})
return matches
class DatetimeElementConverter(TextElementConverter):
"""Convert text so that it is formatted in a way that LinkAhead can understand it.
The text to be parsed must be in the ``val`` parameter. The format string can be supplied in the
``datetime_format`` node. This class uses the ``datetime`` module, so ``datetime_format`` must
    follow this specification:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
"""
# TODO make `val` parameter name configurable
def match(self, element: StructureElement):
matches = super().match(element)
if matches is not None and "val" in matches:
fmt_default = "%Y-%m-%dT%H:%M:%S"
fmt = self.definition.get("datetime_format", fmt_default)
dt_str = datetime.datetime.strptime(matches["val"], fmt).strftime(fmt_default)
matches.update({"val": dt_str})
return matches
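For orientation, here is a sketch of the normalization this converter performs, using only the standard ``datetime`` module; the raw value and format string below are hypothetical examples:

```python
import datetime

# Sketch of the normalization done in DatetimeElementConverter.match above.
fmt_default = "%Y-%m-%dT%H:%M:%S"
raw, fmt = "01.12.2024 13:15", "%d.%m.%Y %H:%M"  # made-up input and datetime_format
normalized = datetime.datetime.strptime(raw, fmt).strftime(fmt_default)
assert normalized == "2024-12-01T13:15:00"
```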
......@@ -55,6 +55,9 @@ from linkahead.apiutils import (compare_entities,
merge_entities)
from linkahead.cached import cache_clear, cached_get_entity_by
from linkahead.common.datatype import get_list_datatype, is_reference
from linkahead.exceptions import (
TransactionError,
)
from linkahead.utils.escape import escape_squoted_text
from .config import get_config_setting
......@@ -746,9 +749,31 @@ one with the entities that need to be updated and the other with entities to be
def inform_about_pending_changes(pending_changes, run_id, path, inserts=False):
    # Sending an email with a link to a form to authorize updates is only done if
    # crawler notifications are enabled:
if get_config_setting("send_crawler_notifications"):
filename = OldCrawler.save_form(
[el[3] for el in pending_changes], path, run_id)
OldCrawler.send_mail([el[3] for el in pending_changes], filename)
filename = OldCrawler.save_form([el[3] for el in pending_changes], path, run_id)
text = """Dear Curator,
there were changes that need your authorization. Please check the following
carefully and, if the changes are OK, click on the following link:
{url}/Shared/{filename}
{changes}
""".format(url=db.configuration.get_config()["Connection"]["url"],
filename=filename,
changes="\n".join([el[3] for el in pending_changes]))
try:
fro = get_config_setting("sendmail_from_address")
to = get_config_setting("sendmail_to_address")
except KeyError:
logger.error("Server Configuration is missing a setting for "
"sending mails. The administrator should check "
"'from_mail' and 'to_mail'.")
return
send_mail(
from_addr=fro,
to=to,
subject="Crawler Update",
body=text)
for i, el in enumerate(pending_changes):
......@@ -859,6 +884,7 @@ def _notify_about_inserts_and_updates(n_inserts, n_updates, logfile, run_id):
The email contains some basic information and a link to the log and the CrawlerRun Record.
"""
if not get_config_setting("send_crawler_notifications"):
logger.debug("Crawler email notifications are disabled.")
return
if n_inserts == 0 and n_updates == 0:
return
......@@ -869,8 +895,8 @@ the CaosDB Crawler successfully crawled the data and
"""
domain = get_config_setting("public_host_url")
if get_config_setting("create_crawler_status_records"):
domain = get_config_setting("public_host_url")
text += ("You can checkout the CrawlerRun Record for more information:\n"
f"{domain}/Entity/?P=0L10&query=find%20crawlerrun%20with%20run_id=%27{run_id}%27\n\n")
text += (f"You can download the logfile here:\n{domain}/Shared/" + logfile)
......@@ -1056,6 +1082,10 @@ def crawler_main(crawled_directory_path: str,
ident = CaosDBIdentifiableAdapter()
ident.load_from_yaml_definition(identifiables_definition_file)
crawler.identifiableAdapter = ident
else:
# TODO
# raise ValueError("An identifiable file is needed.")
pass
remove_prefix = _treat_deprecated_prefix(prefix, remove_prefix)
......@@ -1081,15 +1111,24 @@ def crawler_main(crawled_directory_path: str,
logger.error(err)
_update_status_record(crawler.run_id, 0, 0, status="FAILED")
return 1
except TransactionError as err:
logger.debug(traceback.format_exc())
logger.error(err)
logger.error("Transaction error details:")
for suberr in err.errors:
logger.error("---")
logger.error(suberr.msg)
logger.error(suberr.entity)
return 1
except Exception as err:
logger.debug(traceback.format_exc())
logger.debug(err)
logger.error(err)
if "SHARED_DIR" in os.environ:
# pylint: disable=E0601
domain = get_config_setting("public_host_url")
logger.error("Unexpected Error: Please tell your administrator about this and provide the"
f" following path.\n{domain}/Shared/" + debuglog_public)
logger.error("Unexpected Error: Please tell your administrator about this and provide "
f"the following path.\n{domain}/Shared/" + debuglog_public)
_update_status_record(crawler.run_id, 0, 0, status="FAILED")
return 1
......
......@@ -8,9 +8,15 @@ BooleanElement:
Date:
converter: DateElementConverter
package: caoscrawler.converters
Datetime:
converter: DatetimeElementConverter
package: caoscrawler.converters
Dict:
converter: DictElementConverter
package: caoscrawler.converters
PropertiesFromDictElement:
converter: PropertiesFromDictConverter
package: caoscrawler.converters
FloatElement:
converter: FloatElementConverter
package: caoscrawler.converters
......
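Each registry entry maps a cfood ``type`` name to a converter class and the package it lives in. A minimal sketch of how such an entry could be resolved (the actual loading code in caoscrawler may differ):

```python
import importlib

# Hypothetical resolution of a registry entry like the ones above.
entry = {"converter": "DatetimeElementConverter", "package": "caoscrawler.converters"}
module = importlib.import_module(entry["package"])
converter_class = getattr(module, entry["converter"])
```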
# Lookup table for matching functions and cfood yaml node names.
submatch:
package: caoscrawler.transformer_functions
......@@ -9,3 +9,9 @@ split:
replace:
package: caoscrawler.transformer_functions
function: replace
date_parse:
package: caoscrawler.transformer_functions
function: date_parse
datetime_parse:
package: caoscrawler.transformer_functions
function: datetime_parse
......@@ -27,15 +27,6 @@ class ForbiddenTransaction(Exception):
pass
class MissingReferencingEntityError(Exception):
"""Thrown if the identifiable requires that some entity references the given entity but there
is no such reference """
def __init__(self, *args, rts=None, **kwargs):
self.rts = rts
super().__init__(self, *args, **kwargs)
class ImpossibleMergeError(Exception):
"""Thrown if due to identifying information, two SyncNodes or two Properties of SyncNodes
should be merged, but there is conflicting information that prevents this.
......@@ -47,8 +38,29 @@ class ImpossibleMergeError(Exception):
super().__init__(self, *args, **kwargs)
class InvalidIdentifiableYAML(Exception):
"""Thrown if the identifiable definition is invalid."""
pass
class MissingIdentifyingProperty(Exception):
"""Thrown if a SyncNode does not have the properties required by the corresponding registered
identifiable
"""
pass
class MissingRecordType(Exception):
"""Thrown if an record type can not be found although it is expected that it exists on the
server.
"""
pass
class MissingReferencingEntityError(Exception):
"""Thrown if the identifiable requires that some entity references the given entity but there
is no such reference """
def __init__(self, *args, rts=None, **kwargs):
self.rts = rts
super().__init__(self, *args, **kwargs)
......@@ -36,7 +36,12 @@ import yaml
from linkahead.cached import cached_get_entity_by, cached_query
from linkahead.utils.escape import escape_squoted_text
from .exceptions import MissingIdentifyingProperty, MissingReferencingEntityError
from .exceptions import (
InvalidIdentifiableYAML,
MissingIdentifyingProperty,
MissingRecordType,
MissingReferencingEntityError,
)
from .identifiable import Identifiable
from .sync_node import SyncNode
from .utils import has_parent
......@@ -48,7 +53,10 @@ def get_children_of_rt(rtname):
"""Supply the name of a recordtype. This name and the name of all children RTs are returned in
a list"""
escaped = escape_squoted_text(rtname)
return [p.name for p in cached_query(f"FIND RECORDTYPE '{escaped}'")]
recordtypes = [p.name for p in cached_query(f"FIND RECORDTYPE '{escaped}'")]
if not recordtypes:
raise MissingRecordType(f"Record type could not be found on server: {rtname}")
return recordtypes
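Illustration of the new behavior (record type names are hypothetical, and a running server is assumed):

```python
# "Dataset" exists with child "ImageDataset"; both names are returned.
get_children_of_rt("Dataset")     # -> ["Dataset", "ImageDataset"]
# An unknown record type now raises instead of returning an empty list.
get_children_of_rt("NoSuchType")  # raises MissingRecordType
```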
def convert_value(value: Any) -> str:
......@@ -165,7 +173,10 @@ class IdentifiableAdapter(metaclass=ABCMeta):
"""
if node.registered_identifiable is None:
if raise_exception:
raise RuntimeError("no registered_identifiable")
parents = [p.name for p in node.parents]
parents_str = "\n".join(f"- {p}" for p in parents)
raise RuntimeError("No registered identifiable for node with these parents:\n"
+ parents_str)
else:
return False
for prop in node.registered_identifiable.properties:
......@@ -576,19 +587,32 @@ class CaosDBIdentifiableAdapter(IdentifiableAdapter):
"""Load identifiables defined in a yaml file"""
with open(path, "r", encoding="utf-8") as yaml_f:
identifiable_data = yaml.safe_load(yaml_f)
self.load_from_yaml_object(identifiable_data)
for key, value in identifiable_data.items():
rt = db.RecordType().add_parent(key)
for prop_name in value:
def load_from_yaml_object(self, identifiable_data):
"""Load identifiables defined in a yaml object.
"""
for rt_name, id_list in identifiable_data.items():
rt = db.RecordType().add_parent(rt_name)
if not isinstance(id_list, list):
raise InvalidIdentifiableYAML(
f"Identifiable contents must be lists, but this was not: {rt_name}")
for prop_name in id_list:
if isinstance(prop_name, str):
rt.add_property(name=prop_name)
elif isinstance(prop_name, dict):
for k, v in prop_name.items():
if k == "is_referenced_by" and not isinstance(v, list):
raise InvalidIdentifiableYAML(
f"'is_referenced_by' must be a list. Found in: {rt_name}")
rt.add_property(name=k, value=v)
else:
NotImplementedError("YAML is not structured correctly")
raise InvalidIdentifiableYAML(
"Identifiable properties must be str or dict, but this one was not:\n"
f" {rt_name}/{prop_name}")
self.register_identifiable(key, rt)
self.register_identifiable(rt_name, rt)
def register_identifiable(self, name: str, definition: db.RecordType):
self._registered_identifiables[name] = definition
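A minimal sketch of an identifiable definition that passes the validation above (record type and property names are hypothetical):

```python
from caoscrawler.identifiable_adapters import CaosDBIdentifiableAdapter

# Equivalent to an identifiables.yaml with one entry for "Experiment".
identifiable_data = {
    "Experiment": [
        "date",                             # identifying property, plain string
        {"is_referenced_by": ["Project"]},  # special key; its value must be a list
    ],
}
adapter = CaosDBIdentifiableAdapter()
adapter.load_from_yaml_object(identifiable_data)
```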
......
......@@ -25,12 +25,17 @@
# Function to expand a macro in yaml
# A. Schlemmer, 05/2022
import re
from dataclasses import dataclass
from typing import Any, Dict
from copy import deepcopy
from string import Template
_SAFE_SUBST_PAT = re.compile(r"^\$(?P<key>\w+)$")
_SAFE_SUBST_PAT_BRACES = re.compile(r"^\$\{(?P<key>\w+)}$")
@dataclass
class MacroDefinition:
"""
......@@ -53,6 +58,12 @@ def substitute(propvalue, values: dict):
Substitution of variables in strings using the variable substitution
library from python's standard library.
"""
    # Exact variable matches are replaced by the raw dict entry, preserving the value's type.
if match := (_SAFE_SUBST_PAT.fullmatch(propvalue)
or _SAFE_SUBST_PAT_BRACES.fullmatch(propvalue)):
key = match.group("key")
if key in values:
return values[key]
propvalue_template = Template(propvalue)
return propvalue_template.safe_substitute(**values)
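A sketch of the resulting behavior (dictionary contents are hypothetical):

```python
values = {"data": {"a": 1}, "name": "Crawler"}
substitute("$data", values)         # -> {"a": 1}: the raw dict entry, type preserved
substitute("${data}", values)       # -> {"a": 1}
substitute("run by $name", values)  # -> "run by Crawler": plain string substitution
```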
......
......@@ -104,17 +104,27 @@ metadata:
directory: # corresponds to the directory given to the crawler
type: Directory
match: .* # we do not care how it is named here
records:
DirRecord: # One record for each directory.
subtree:
# This is the file
thisfile:
type: []{file}
match: []{match}
records:
DatFileRecord: # One record for each matching file
role: File
path: $thisfile
file: $thisfile
subtree:
entry:
type: Dict
match: .* # Name is irrelevant
records:
MyParent:
BaseElement: # One BaseElement record for each row in the CSV/TSV file
DatFileRecord: $DatFileRecord
DirRecord:
BaseElement: +$BaseElement
subtree: !macro
"""
......@@ -196,8 +206,24 @@ cfood: str
defs.append(def_str)
del defs
sep = repr(sniffed.delimiter)
sep = f'"{sep[1:-1]}"'
match_str = f"""'.*[ct]sv'
sep: {sep}
# "header": [int]
# "names": [str]
# "index_col": [int]
# "usecols": [int]
# "true_values": [str]
# "false_values": [str]
# "na_values": [str]
# "skiprows": [int]
# "nrows": [int]
# "keep_default_na": [bool]
"""
cfood_str = (_CustomTemplate(CFOOD_TEMPLATE).substitute({"file": "CSVTableConverter",
"match": ".*\\[ct]sv"})
"match": match_str})
+ prefix[2:] + "ColumnValue:\n" + "".join(defs_col_value)
+ prefix[2:] + "ColumnValueReference:\n" + "".join(defs_col_value_ref)
)
......
......@@ -20,9 +20,14 @@
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Definition of default transformer functions.
See https://docs.indiscale.com/caosdb-crawler/converters.html#transform-functions for more
information.
"""
Defnition of default transformer functions.
"""
import datetime
import re
from typing import Any
......@@ -61,3 +66,36 @@ def replace(in_value: Any, in_parameters: dict):
if not isinstance(in_value, str):
raise RuntimeError("must be string")
return in_value.replace(in_parameters['remove'], in_parameters['insert'])
def date_parse(in_value: str, params: dict) -> str:
"""Transform text so that it is formatted in a way that LinkAhead can understand it.
Parameters
==========
- date_format: str, optional
      A format string using the ``datetime`` specification:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
"""
fmt_default = "%Y-%m-%d"
fmt = params.get("date_format", fmt_default)
dt_str = datetime.datetime.strptime(in_value, fmt).strftime(fmt_default)
return dt_str
def datetime_parse(in_value: str, params: dict) -> str:
"""Transform text so that it is formatted in a way that LinkAhead can understand it.
Parameters
==========
- datetime_format: str, optional
      A format string using the ``datetime`` specification:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
"""
fmt_default = "%Y-%m-%dT%H:%M:%S"
fmt = params.get("datetime_format", fmt_default)
dt_str = datetime.datetime.strptime(in_value, fmt).strftime(fmt_default)
return dt_str
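A usage sketch of the two new functions (the input values are made up):

```python
from caoscrawler.transformer_functions import date_parse, datetime_parse

date_parse("01.12.2024", {"date_format": "%d.%m.%Y"})
# -> "2024-12-01"
datetime_parse("2024-12-01 13:15", {"datetime_format": "%Y-%m-%d %H:%M"})
# -> "2024-12-01T13:15:00"
```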
......@@ -13,7 +13,10 @@ see INSTALL.md
We use Sphinx to create the documentation. Docstrings in the code should comply
with the Google style (see link below).
Build documentation in `src/doc` with `make html`.
Build documentation in `src/doc` with `make doc`. Note that for the
automatic generation of the complete API documentation, it is
necessary to first install this library with all its optional
dependencies, i.e., `pip install .[h5-crawler,spss]`.
### Requirements ###
......
......@@ -33,10 +33,10 @@ copyright = '2024, IndiScale'
author = 'Alexander Schlemmer'
# The short X.Y version
version = '0.7.2'
version = '0.8.1'
# The full version, including alpha/beta/rc tags
# release = '0.5.2-rc2'
release = '0.7.2-dev'
release = '0.8.1-dev'
# -- General configuration ---------------------------------------------------
......
......@@ -31,20 +31,20 @@ The yaml definition may look like this:
.. code-block:: yaml
<NodeName>:
type: <ConverterName>
match: ".*"
records:
Experiment1:
parents:
- Experiment
- Blablabla
date: $DATUM
(...)
Experiment2:
parents:
- Experiment
subtree:
(...)
type: <ConverterName>
match: ".*"
records:
Experiment1:
parents:
- Experiment
- Blablabla
date: $DATUM
(...)
Experiment2:
parents:
- Experiment
subtree:
(...)
The **<NodeName>** is a description of what the current block represents (e.g.
``experiment-folder``) and is used as an identifier.
......@@ -76,35 +76,35 @@ applied to the respective variables when the converter is executed.
.. code-block:: yaml
<NodeName>:
type: <ConverterName>
match: ".*"
transform:
<TransformNodeName>:
in: $<in_var_name>
out: $<out_var_name>
functions:
- <func_name>: # name of the function to be applied
<func_arg1>: <func_arg1_value> # key value pairs that are passed as parameters
<func_arg2>: <func_arg2_value>
# ...
type: <ConverterName>
match: ".*"
transform:
<TransformNodeName>:
in: $<in_var_name>
out: $<out_var_name>
functions:
- <func_name>: # name of the function to be applied
<func_arg1>: <func_arg1_value> # key value pairs that are passed as parameters
<func_arg2>: <func_arg2_value>
# ...
An example that splits the variable ``a`` and puts the generated list in ``b`` is the following:
.. code-block:: yaml
Experiment:
type: Dict
match: ".*"
transform:
param_split:
in: $a
out: $b
functions:
- split: # split is a function that is defined by default
marker: "|" # its only parameter is the marker that is used to split the string
records:
Report:
tags: $b
type: Dict
match: ".*"
transform:
param_split:
in: $a
out: $b
functions:
- split: # split is a function that is defined by default
marker: "|" # its only parameter is the marker that is used to split the string
records:
Report:
tags: $b
This splits the string in ``$a`` and stores the resulting list in ``$b``. Here it is used to add a
list-valued property to the Report Record.
......@@ -218,21 +218,21 @@ Example:
type: CSVTableConverter
match: ^test_table.csv$
records:
(...) # Records edited for the whole table file
(...) # Records edited for the whole table file
subtree:
ROW: # Any name for a data row in the table
type: DictElement
match_name: .*
match_value: .*
records:
(...) # Records edited for each row
subtree:
COLUMN: # Any name for a specific type of column in the table
type: FloatElement
match_name: measurement # Name of the column in the table file
        match_value: (?P<column_value>.*)
records:
(...) # Records edited for each cell
ROW: # Any name for a data row in the table
type: DictElement
match_name: .*
match_value: .*
records:
(...) # Records edited for each row
subtree:
COLUMN: # Any name for a specific type of column in the table
type: FloatElement
match_name: measurement # Name of the column in the table file
          match_value: (?P<column_value>.*)
records:
(...) # Records edited for each cell
XLSXTableConverter
......@@ -245,6 +245,140 @@ CSVTableConverter
CSV File → DictElement
PropertiesFromDictConverter
===========================
The :py:class:`~caoscrawler.converters.PropertiesFromDictConverter` is
a specialization of the
:py:class:`~caoscrawler.converters.DictElementConverter` and offers
all its functionality. It is meant to operate on dictionaries (e.g.,
from reading in a JSON or a table file) whose keys correspond
closely to properties in a LinkAhead datamodel. This is especially
handy in cases where properties that are not yet known when writing
the cfood definition may later be added to the data model and data sources.
The converter definition of the
:py:class:`~caoscrawler.converters.PropertiesFromDictConverter` has an
additional required entry ``record_from_dict`` which specifies the
Record to which the properties extracted from the dict are
attached. This Record is identified by its ``variable_name``, by which it can
be referred to further down the subtree. You can also use the name of
a Record that was specified earlier in the CFood definition in order
to extend it by the properties extracted from a dict. Let's have a
look at a simple example. A CFood definition
.. code-block:: yaml
PropertiesFromDictElement:
type: PropertiesFromDictElement
match: ".*"
record_from_dict:
variable_name: MyRec
parents:
- MyType1
- MyType2
applied to a dictionary
.. code-block:: json
{
"name": "New name",
"a": 5,
"b": ["a", "b", "c"],
"author": {
"full_name": "Silvia Scientist"
}
}
will create a Record ``New name`` with parents ``MyType1`` and
``MyType2``. It has a scalar property ``a`` with value 5, a list
property ``b`` with values "a", "b" and "c", and an ``author``
property which references an ``author`` with a ``full_name`` property
with value "Silvia Scientist":
.. image:: img/properties-from-dict-records-author.png
:height: 210
Note how the different dictionary keys are handled differently
depending on their types: scalar and list values are understood
automatically, and a dictionary-valued entry like ``author`` is
translated into a reference to an ``author`` Record automatically.
You can further specify how references are treated with an optional
``references`` key in ``record_from_dict``. Let's assume that in the
above example, we have an ``author`` **Property** with datatype
``Person`` in our data model. We could add this information by
extending the above example definition by
.. code-block:: yaml
PropertiesFromDictElement:
type: PropertiesFromDictElement
match: ".*"
record_from_dict:
variable_name: MyRec
parents:
- MyType1
- MyType2
references:
author:
parents:
- Person
so that now, a ``Person`` record with a ``full_name`` property with
value "Silvia Scientist" is created as the value of the ``author``
property:
.. image:: img/properties-from-dict-records-person.png
:height: 200
For the time being, only the parents of the referenced record can be
set via this option. More complicated treatments can be implemented
via the ``referenced_record_callback`` (see below).
Properties can be blacklisted with the ``properties_blacklist``
keyword, i.e., all keys listed under ``properties_blacklist`` will be
excluded from automated treatment. Since the
:py:class:`~caoscrawler.converters.PropertiesFromDictConverter` has
all the functionality of the
:py:class:`~caoscrawler.converters.DictElementConverter`, individual
properties can still be used in a subtree. Together with
``properties_blacklist`` this can be used to add custom treatment to
specific properties by blacklisting them in ``record_from_dict`` and
then treating them in the subtree the same as you would do it in the
standard
:py:class:`~caoscrawler.converters.DictElementConverter`. Note that
the blacklisted keys are excluded on **all** levels of the dictionary,
i.e., also when they occur in a referenced entity.
For further customization, the
:py:class:`~caoscrawler.converters.PropertiesFromDictConverter` can be
used as a basis for :ref:`custom converters<Custom Converters>` which
can make use of its ``referenced_record_callback`` argument. The
``referenced_record_callback`` must be a callable that takes the
Record as its first argument and returns that Record after
doing whatever custom treatment is needed. Additionally, it is given
the ``RecordStore`` and the ``GeneralStore`` in order to be able to
access the records and values that have already been defined from
within ``referenced_record_callback``. Such a function might look like
the following:
.. code-block:: python
def my_callback(rec: db.Record, records: RecordStore, values: GeneralStore):
# do something with rec, possibly using other records or values from the stores...
rec.description = "This was updated in a callback"
return rec
It is applied to all Records that are created from the dictionary and
it can be used to, e.g., transform values of some properties, or add
special treatment to all Records of a specific
type. ``referenced_record_callback`` is applied **after** the
properties from the dictionary have been applied as explained above.
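A minimal sketch of how a :ref:`custom converter<Custom Converters>` might wire in such a callback (the class name ``MyDictConverter`` is hypothetical; the constructor signature follows the converter code in this merge):

.. code-block:: python

    from caoscrawler.converters import PropertiesFromDictConverter

    def my_callback(rec, records, values):
        # Custom treatment for every Record created from the dictionary.
        rec.description = "This was updated in a callback"
        return rec

    class MyDictConverter(PropertiesFromDictConverter):
        """Hypothetical converter that always applies my_callback."""

        def __init__(self, definition, name, converter_registry):
            super().__init__(definition, name, converter_registry,
                             referenced_record_callback=my_callback)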
Further converters
++++++++++++++++++
......@@ -293,7 +427,7 @@ datamodel like
H5Ndarray:
obligatory_properties:
internal_hdf5-path:
datatype: TEXT
datatype: TEXT
although the names of both property and record type can be configured within the
cfood definition.
......@@ -407,11 +541,11 @@ First we will create our package and module structure, which might be:
tox.ini
src/
scifolder/
__init__.py
converters/
__init__.py
sources.py # <- the actual file containing
# the converter class
__init__.py
converters/
__init__.py
sources.py # <- the actual file containing
# the converter class
doc/
unittests/
......@@ -436,74 +570,74 @@ that would be given using a yaml definition (see next section below).
"""
def __init__(self, definition: dict, name: str,
converter_registry: dict):
"""
Initialize a new directory converter.
"""
super().__init__(definition, name, converter_registry)
converter_registry: dict):
"""
Initialize a new directory converter.
"""
super().__init__(definition, name, converter_registry)
def create_children(self, generalStore: GeneralStore,
element: StructureElement):
element: StructureElement):
# The source resolver does not create children:
# The source resolver does not create children:
return []
return []
def create_records(self, values: GeneralStore,
records: RecordStore,
element: StructureElement,
file_path_prefix):
if not isinstance(element, TextElement):
raise RuntimeError()
# This function must return a list containing tuples, each one for a modified
# property: (name_of_entity, name_of_property)
keys_modified = []
# This is the name of the entity where the source is going to be attached:
attach_to_scientific_activity = self.definition["scientific_activity"]
rec = records[attach_to_scientific_activity]
# The "source" is a path to a source project, so it should have the form:
# /<Category>/<project>/<scientific_activity>/
# obtain these information from the structure element:
val = element.value
regexp = (r'/(?P<category>(SimulationData)|(ExperimentalData)|(DataAnalysis))'
'/(?P<project_date>.*?)_(?P<project_identifier>.*)'
'/(?P<date>[0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2})(_(?P<identifier>.*))?/')
res = re.match(regexp, val)
if res is None:
raise RuntimeError("Source cannot be parsed correctly.")
# Mapping of categories on the file system to corresponding record types in CaosDB:
cat_map = {
"SimulationData": "Simulation",
"ExperimentalData": "Experiment",
"DataAnalysis": "DataAnalysis"}
linkrt = cat_map[res.group("category")]
keys_modified.extend(create_records(values, records, {
"Project": {
"date": res.group("project_date"),
"identifier": res.group("project_identifier"),
},
linkrt: {
"date": res.group("date"),
"identifier": res.group("identifier"),
"project": "$Project"
},
attach_to_scientific_activity: {
"sources": "+$" + linkrt
}}, file_path_prefix))
# Process the records section of the yaml definition:
keys_modified.extend(
super().create_records(values, records, element, file_path_prefix))
# The create_records function must return the modified keys to make it compatible
# to the crawler functions:
return keys_modified
records: RecordStore,
element: StructureElement,
file_path_prefix):
if not isinstance(element, TextElement):
raise RuntimeError()
# This function must return a list containing tuples, each one for a modified
# property: (name_of_entity, name_of_property)
keys_modified = []
# This is the name of the entity where the source is going to be attached:
attach_to_scientific_activity = self.definition["scientific_activity"]
rec = records[attach_to_scientific_activity]
# The "source" is a path to a source project, so it should have the form:
# /<Category>/<project>/<scientific_activity>/
# obtain these information from the structure element:
val = element.value
regexp = (r'/(?P<category>(SimulationData)|(ExperimentalData)|(DataAnalysis))'
'/(?P<project_date>.*?)_(?P<project_identifier>.*)'
'/(?P<date>[0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2})(_(?P<identifier>.*))?/')
res = re.match(regexp, val)
if res is None:
raise RuntimeError("Source cannot be parsed correctly.")
# Mapping of categories on the file system to corresponding record types in CaosDB:
cat_map = {
"SimulationData": "Simulation",
"ExperimentalData": "Experiment",
"DataAnalysis": "DataAnalysis"}
linkrt = cat_map[res.group("category")]
keys_modified.extend(create_records(values, records, {
"Project": {
"date": res.group("project_date"),
"identifier": res.group("project_identifier"),
},
linkrt: {
"date": res.group("date"),
"identifier": res.group("identifier"),
"project": "$Project"
},
attach_to_scientific_activity: {
"sources": "+$" + linkrt
}}, file_path_prefix))
# Process the records section of the yaml definition:
keys_modified.extend(
super().create_records(values, records, element, file_path_prefix))
# The create_records function must return the modified keys to make it compatible
# to the crawler functions:
return keys_modified
If the recommended (python) package structure is used, the package containing the converter
......@@ -530,8 +664,8 @@ function signature:
.. code-block:: python
def create_records(values: GeneralStore, # <- pass the current variables store here
records: RecordStore, # <- pass the current store of CaosDB records here
def_records: dict): # <- This is the actual definition of new records!
records: RecordStore, # <- pass the current store of CaosDB records here
def_records: dict): # <- This is the actual definition of new records!
`def_records` is the actual definition of new records according to the yaml cfood specification
......@@ -547,7 +681,7 @@ Let's have a look at a few examples:
match: (?P<dir_name>.*)
records:
Experiment:
identifier: $dir_name
identifier: $dir_name
This block will just create a new record with parent `Experiment` and one property
`identifier` with a value derived from the matching regular expression.
......@@ -565,7 +699,7 @@ Let's formulate that using `create_records`:
}
keys_modified = create_records(values, records,
record_def)
record_def)
The `dir_name` is set explicitly here; everything else is identical to the yaml statements.
......@@ -588,9 +722,9 @@ So, a sketch of a typical implementation within a custom converter could look li
.. code-block:: python
def create_records(self, values: GeneralStore,
records: RecordStore,
element: StructureElement,
file_path_prefix: str):
records: RecordStore,
element: StructureElement,
file_path_prefix: str):
# Modify some records:
record_def = {
......@@ -598,15 +732,15 @@ So, a sketch of a typical implementation within a custom converter could look li
}
keys_modified = create_records(values, records,
record_def)
record_def)
# You can of course do it multiple times:
keys_modified.extend(create_records(values, records,
record_def))
record_def))
# You can also process the records section of the yaml definition:
keys_modified.extend(
super().create_records(values, records, element, file_path_prefix))
super().create_records(values, records, element, file_path_prefix))
# This essentially allows users of your converter to customize the creation of records
    # by providing a custom "records" section in addition to the modifications provided
# in this implementation of the Converter.
......@@ -627,12 +761,12 @@ Let's have a look at a more complex examples, defining multiple records:
match: (?P<dir_name>.*)
records:
Project:
identifier: project_name
identifier: project_name
Experiment:
identifier: $dir_name
Project: $Project
identifier: $dir_name
Project: $Project
ProjectGroup:
projects: +$Project
projects: +$Project
This block will create two new Records:
......@@ -665,7 +799,7 @@ Let's formulate that using `create_records` (again, `dir_name` is constant here)
}
keys_modified = create_records(values, records,
record_def)
record_def)
Debugging
=========
......@@ -681,7 +815,7 @@ output for the match step. The following snippet illustrates this:
debug_match: True
records:
Project:
identifier: project_name
identifier: project_name
Whenever this Converter tries to match a StructureElement, it logs what was tried to match against
......
......@@ -33,7 +33,7 @@ Then you can do the following interactively in (I)Python. But we recommend that
you copy the code into a script and execute it to spare yourself typing.
```python
import caosdb as db
import linkahead as db
from datetime import datetime
from caoscrawler import Crawler, SecurityMode
from caoscrawler.identifiable_adapters import CaosDBIdentifiableAdapter
......
......@@ -30,6 +30,13 @@ to decide what tool is used for sending mails (use the upper one if you
want to actually send mails). See the ``sendmail`` configuration in the
LinkAhead docs.
You can even supply the name of a custom CSS file that shall be used:
.. code:: ini
[advancedtools]
crawler.customcssfile = theme-research.css
Crawler Status Records
----------------------
......