diff --git a/src/doc/getting_started/helloworld.md b/src/doc/getting_started/helloworld.md
new file mode 100644
index 0000000000000000000000000000000000000000..aa8f72ceda16398d4b8541eb5fceb65dd6106105
--- /dev/null
+++ b/src/doc/getting_started/helloworld.md
@@ -0,0 +1,89 @@
+# Hello World
+
+For this example, we need a very simple data model. You can insert it into your
+CaosDB instance by saving the following to a file called `model.yml`:
+
+```yaml
+HelloWorld:
+  recommended_properties:
+    time:
+      datatype: DATETIME
+    note:
+      datatype: TEXT
+```
+
+and insert the model using
+
+```sh
+python -m caosadvancedtools.models.parser model.yml --sync
+```
+
+Let's first look at how the CaosDB Crawler synchronizes Records that are
+created locally with those that might already exist on the CaosDB server.
+
+You can do the following interactively in (I)Python, but we recommend that you
+copy the code into a script and execute it to spare yourself the typing.
+
+```python
+import caosdb as db
+from datetime import datetime
+from caoscrawler import Crawler, SecurityMode
+from caoscrawler.identifiable_adapters import CaosDBIdentifiableAdapter
+
+
+# Create a Record that will be synced
+hello_rec = db.Record(name="My first Record")
+hello_rec.add_parent("HelloWorld")
+hello_rec.add_property(name="time", value=datetime.now().isoformat())
+
+# Create a Crawler instance that we will use for synchronization
+crawler = Crawler(securityMode=SecurityMode.UPDATE)
+# This defines how Records on the server are identified with the ones we have locally
+identifiables_definition_file = "identifiables.yml"
+ident = CaosDBIdentifiableAdapter()
+ident.load_from_yaml_definition(identifiables_definition_file)
+crawler.identifiableAdapter = ident
+
+# Here we synchronize the Record
+inserts, updates = crawler.synchronize(commit_changes=True, unique_names=True,
+                                       crawled_data=[hello_rec])
+print(f"Inserted {len(inserts)} Records")
+print(f"Updated {len(updates)} Records")
+```
+
+You also need a file called `identifiables.yml` with the following content:
+
+```yml
+HelloWorld:
+  - name
+```
+
+Now, start by executing the code. What happens? The output suggests that one
+entity was inserted. Please go to the web interface of your instance and have a
+look; you can use the query `FIND HelloWorld`. You should see a brand new
+Record with a current time stamp.
+
+So, how did this happen? In our script, we created a "HelloWorld" Record and
+gave it to the Crawler. The Crawler checks how "HelloWorld" Records are
+identified. We told the Crawler with our `identifiables.yml` that it should
+use the name. The Crawler thus checked whether a "HelloWorld" Record with our
+name exists on the server. It did not. Therefore, the Record that we provided
+was inserted into the server.
+
+Now, run the script again. What happens? There is an update! This time, a
+Record with the required name existed, thus the "time" Property was updated.
+
+The Crawler does not touch Properties that are not present in the local data.
+Thus, if you add a "note" Property to the Record on the server (e.g. with the
+edit mode in the web interface) and run the script again, this Property is
+kept unchanged. This means that you can extend Records that were created using
+the Crawler.
+
+Note that if you change the name of the "HelloWorld" Record in the script and
+run it again, a new Record is inserted by the Crawler. This is because we told
+the Crawler that it should use the name to check whether a "HelloWorld" Record
+already exists on the server.
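+
+If you would rather check the result from Python than in the web interface,
+you can also run the query with the Python client. This is just a small sketch
+using the generic query interface of caosdb-pylib:
+
+```python
+import caosdb as db
+
+# retrieve all "HelloWorld" Records and print their "time" Property
+for rec in db.execute_query("FIND RECORD HelloWorld"):
+    print(rec.name, rec.get_property("time").value)
+```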
+
+So far, you saw how the Crawler handles synchronization in a very simple
+scenario. In the following tutorials, you will learn what this looks like when
+there are multiple connected Records involved which might not simply be
+identified using the name. Also, we created the Record "manually" in this
+example, while the typical use case is to create it automatically from some
+file or directory. How this is done will also be shown in the following
+chapters.
diff --git a/src/doc/getting_started/helloworld.rst b/src/doc/getting_started/helloworld.rst
deleted file mode 100644
index ef4a1398322b59d7983b7dff384534cfa501b660..0000000000000000000000000000000000000000
--- a/src/doc/getting_started/helloworld.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-
-Prerequisites
-)))))))))))))
-
-TODO Describe the smallest possible crawler run
diff --git a/src/doc/getting_started/prerequisites.md b/src/doc/getting_started/prerequisites.md
new file mode 100644
index 0000000000000000000000000000000000000000..ad6cb72d3088a55c31f403b0bdf1d0e6423e7588
--- /dev/null
+++ b/src/doc/getting_started/prerequisites.md
@@ -0,0 +1,30 @@
+# Prerequisites
+
+The CaosDB Crawler is a utility to create CaosDB Records from some data
+structure, e.g. files, and to synchronize these Records with a CaosDB server.
+Thus, two prerequisites for using the CaosDB Crawler are clear:
+1. You need access to a running CaosDB instance (see the
+   [documentation](https://docs.indiscale.com/caosdb-deploy/index.html)).
+2. You need access to the data that you want to insert, i.e. the files or
+   the table from which you want to create Records.
+
+Make sure that you configured your Python client to speak
+to the correct CaosDB instance (see the
+[configuration docs](https://docs.indiscale.com/caosdb-pylib/configuration.html)).
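+
+For illustration, a minimal `pycaosdb.ini` could look like the following
+sketch; the URL, username and certificate path are placeholders that you need
+to adapt to your instance:
+
+```ini
+[Connection]
+# URL of your CaosDB server (placeholder)
+url=https://localhost:10443/
+# path to the server's SSL certificate (placeholder)
+cacert=/path/to/caosdb.cert.pem
+username=admin
+# ask for the password interactively
+password_method=input
+```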
+
+We would like to make another prerequisite explicit that is related to the
+first point above: You need a data model. Typically, if you want to insert
+data into an actively used CaosDB instance, there is already a data model.
+However, if there is not yet a data model, you can define one using the
+[edit mode](https://docs.indiscale.com/caosdb-webui/tutorials/edit_mode.html)
+or the [YAML format](https://docs.indiscale.com/caosdb-advanced-user-tools/yaml_interface.html).
+We will provide small data models for the examples to come.
+
+Also, it is recommended (and necessary for the following chapters) that you
+have some experience with the CaosDB Python client. If you don't, you can
+start with the
+[tutorials](https://docs.indiscale.com/caosdb-pylib/tutorials/index.html).
+
+If you want to use the possibility to write CaosDB Crawler configuration files
+(so-called CFoods), it helps if you know regular expressions. If you don't,
+don't worry: we keep it simple in this tutorial.
diff --git a/src/doc/getting_started/prerequisites.rst b/src/doc/getting_started/prerequisites.rst
deleted file mode 100644
index dc8022b6cad99a8508f19f47dc01c601fb676c5b..0000000000000000000000000000000000000000
--- a/src/doc/getting_started/prerequisites.rst
+++ /dev/null
@@ -1,6 +0,0 @@
-
-Prerequisites
-)))))))))))))
-
-TODO Describe what you need to actually do a crawler run: data, CaosDB, ...
-
diff --git a/src/doc/index.rst b/src/doc/index.rst
index d319bf4d24a05a3033b1ae5bbf80433c5ef3646b..20f335f7885971b65caf91dfe723f867e46b8595
--- a/src/doc/index.rst
+++ b/src/doc/index.rst
@@ -31,7 +31,7 @@ The hierarchical structure can be for example a file tree. However
 it can be also something different like the contents of a JSON file or a file
 tree with JSON files.
 
-This documentation helps you to :doc:`get started<README_SETUP>`, explains the most important
+This documentation helps you to :doc:`get started<getting_started/index>`, explains the most important
 :doc:`concepts<concepts>` and offers a range of :doc:`tutorials<tutorials/index>`.
@@ -40,4 +40,3 @@ Indices and tables
 
 * :ref:`genindex`
 * :ref:`modindex`
-* :ref:`search`
diff --git a/src/doc/tutorials/example.rst b/src/doc/tutorials/example.rst
deleted file mode 100644
index a1adee7008f3b004e6b441573798b2e57f9a4384..0000000000000000000000000000000000000000
--- a/src/doc/tutorials/example.rst
+++ /dev/null
@@ -1,108 +0,0 @@
-Example CFood
-=============
-
-Let's walk through an example cfood that makes use of a simple directory structure. We assume
-the structure which is supposed to be crawled to have the following form:
-
-.. code-block::
-
-   ExperimentalData/
-
-     2022_ProjectA/
-
-       2022-02-17_TestDataset/
-         file1.dat
-         file2.dat
-         ...
-       ...
-
-     2023_ProjectB/
-       ...
-
-   ...
-
-This file structure conforms to the one described in our article "Guidelines for a Standardized Filesystem Layout for Scientific Data" (https://doi.org/10.3390/data5020043). As a simplified example
-we want to write a crawler that creates "Project" and "Measurement" records in CaosDB and set
-some reasonable properties stemming from the file and directory names. Furthermore, we want
-to link the ficticious dat files to the Measurement records.
-
-Let's first clarify the terms we are using:
-
-.. code-block::
-
-   ExperimentalData/                <--- Category level (level 0)
-
-     2022_ProjectA/                 <--- Project level (level 1)
-
-       2022-02-17_TestDataset/      <--- Activity / Measurement level (level 2)
-         file1.dat                  <--- Files on level 3
-         file2.dat
-         ...
-       ...
-
-     2023_ProjectB/                 <--- Project level (level 1)
-       ...
-
-   ...
-
-So we can see, that the three-level folder structure, described in the paper is replicated.
-We are using the term "Activity level" here, instead of the terms used in the article, as
-it can be used in a more general way.
-
-The following yaml cfood is able to match and insert / update the records accordingly:
-
-
-.. code-block:: yaml
-
-   ExperimentalData:  # Converter for the category level
-     type: Directory
-     match: ^ExperimentalData$  # The name of the matched folder is given here!
-
-     subtree:
-       project_dir:  # Converter for the project level
-         type: Directory
-         match: (?P<date>.*?)_(?P<identifier>.*)
-
-         records:
-           Project:
-             parents:
-               - Project
-             date: $date
-             identifier: $identifier
-
-         subtree:
-           measurement:  # Converter for the activity / measurement level
-             type: Directory
-             match: (?P<date>[0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2})(_(?P<identifier>.*))?
-
-             records:
-               Measurement:
-                 date: $date
-                 identifier: $identifier
-                 project: $Project
-
-             subtree:
-               datFile:  # Converter for the files
-                 type: SimpleFile
-                 match: ^(.*)\.dat$  # The file extension is matched using a regular expression.
-
-                 records:
-                   datFileRecord:
-                     role: File
-                     path: $datFile
-                     file: $datFile
-                   Measurement:
-                     output: +$datFileRecord
-
-Here, we provide a detailled explanation of the specific parts of the yaml definition:
-
-.. image:: example_crawler.svg
-
diff --git a/src/doc/tutorials/index.rst b/src/doc/tutorials/index.rst
index 02371de196cc139776416882aff31bd6fa4dabbe..b6f0fab511f3646f3ec6a7a320299e72a2c20038
--- a/src/doc/tutorials/index.rst
+++ b/src/doc/tutorials/index.rst
@@ -7,5 +7,6 @@ This chapter contains a collection of tutorials.
    :maxdepth: 2
    :caption: Contents:
 
-   Example CFood<example>
+   Parameter File<parameterfile>
+   Scientific Data Folder<scifolder>
 
diff --git a/src/doc/tutorials/parameterfile.rst b/src/doc/tutorials/parameterfile.rst
new file mode 100644
index 0000000000000000000000000000000000000000..2ce6309a59e518d6054ba20b04795ff57ceec55f
--- /dev/null
+++ b/src/doc/tutorials/parameterfile.rst
@@ -0,0 +1,127 @@
+Tutorial: Parameter File
+========================
+
+In the "Hello World" example, the Record that was synchronized with the server
+was created "manually" using the Python client. Now, we want to have a look at
+how the Crawler can be told to do this for us.
+
+The Crawler needs some instructions on what kind of Records it should create
+given the data that we provide. This is done using so-called "CFood" YAML
+files.
+
+Let's start again with something simple. A common scenario is that we want to
+insert the contents of some parameter file. Suppose the parameter file is
+named ``params_2022-02-02.json`` and looks like the following:
+
+.. code:: json
+
+   {
+     "frequency": 0.5,
+     "resolution": 0.01
+   }
+
+Suppose these are two Properties of an Experiment and the date in the file
+name is the date of the Experiment. Thus, the data model could be described in
+a ``model.yml`` like this:
+
+.. code:: yaml
+
+   Experiment:
+     recommended_properties:
+       frequency:
+         datatype: DOUBLE
+       resolution:
+         datatype: DOUBLE
+       date:
+         datatype: DATETIME
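+
+As in the "Hello World" example, you can insert this data model with the YAML
+parser from caosadvancedtools:
+
+.. code:: sh
+
+   python -m caosadvancedtools.models.parser model.yml --sync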
+
+The following section tells the Crawler that the key-value pair
+``frequency: 0.5`` shall be used to set the Property "frequency" of an
+"Experiment" Record:
+
+.. code:: yaml
+
+   frequency:                      # just the name of this section
+     type: FloatElement            # it is a float value
+     match_name: ^frequency$       # regular expression: the key is 'frequency'
+     match_value: ^(?P<value>.*)$  # regular expression: we match any value
+     records:
+       Experiment:
+         frequency: $value
+
+The first part of this section defines what kind of data element shall be
+considered (here: a key-value pair with a float value and the key
+"frequency"), and then we use this to set the "frequency" Property.
+
+But how does the actual assignment of the value work? Let's look at what the
+regular expressions do:
+
+- ``^frequency$`` assures that the key is exactly "frequency". "^" matches the
+  beginning of the string and "$" the end.
+- ``^(?P<value>.*)$`` creates a match group with the name "value"; the pattern
+  of this group is ".*". The dot matches any character and the star means that
+  it can occur zero, one or multiple times. Thus, this regular expression
+  matches anything and puts it in the group with the name "value".
+
+We can use the groups from the regular expressions that are used for matching.
+Here, we use the "value" group to assign the "frequency" value to the
+"Experiment".
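+
+If you want to convince yourself of how such named groups behave, you can try
+the patterns in plain Python. This is only an illustration; the Crawler does
+this matching for you:
+
+.. code:: python
+
+   import re
+
+   # the same patterns as in the CFood section above
+   assert re.match(r"^frequency$", "frequency") is not None
+   assert re.match(r"^frequency$", "resolution") is None
+
+   # the named group "value" captures the matched string
+   match = re.match(r"^(?P<value>.*)$", "0.5")
+   print(match.group("value"))  # prints: 0.5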
+
+Since we will not pass this key-value pair on its own to the Crawler, we need
+to embed it into its context. The full CFood file for this example might look
+like the following:
+
+.. code:: yaml
+
+   ---
+   metadata:
+     crawler-version: 0.5.0
+   ---
+   directory:                # corresponds to the directory given to the crawler
+     type: Directory
+     match: .*               # we do not care how it is named here
+     subtree:
+       parameterfile:        # corresponds to our parameter file
+         type: JSONFile
+         match: params_(?P<date>\d+-\d+-\d+)\.json  # the naming pattern of the parameter file
+         records:
+           Experiment:       # one Experiment is associated with the file
+             date: $date     # the date is taken from the file name
+         subtree:
+           dict:             # the JSON contains a dictionary
+             type: Dict
+             match: .*       # the dictionary does not have a meaningful name
+             subtree:
+               frequency:    # here we parse the frequency ...
+                 type: FloatElement
+                 match_name: frequency
+                 match_value: (?P<val>.*)
+                 records:
+                   Experiment:
+                     frequency: $val
+               resolution:   # ... and here the resolution
+                 type: FloatElement
+                 match_name: resolution
+                 match_value: (?P<val>.*)
+                 records:
+                   Experiment:
+                     resolution: $val
+
+You do not need to understand every aspect of this right now; we will cover it
+later in greater depth. You might think: "Ohh... this is lengthy." Well, yes,
+BUT this is a very generic approach that allows data integration from ANY
+hierarchical data structure (directory trees, JSON, YAML, HDF5, DICOM, ... and
+combinations of those!), and as you will see in later chapters, there are ways
+to write this in a more condensed way!
+
+For now, we want to see it running!
+
+The Crawler can then be run with the following command (assuming that the
+parameter file lies in the current working directory):
+
+.. code:: sh
+
+   caosdb-crawler -s update -i identifiables.yml cfood.yml .
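+
+The ``-i`` option expects an identifiable definition, just as in the "Hello
+World" example. A minimal ``identifiables.yml`` for this tutorial could look
+like the following sketch, assuming that an Experiment is identified by its
+date:
+
+.. code:: yaml
+
+   Experiment:
+     - date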
diff --git a/src/doc/tutorials/scifolder.rst b/src/doc/tutorials/scifolder.rst
new file mode 100644
index 0000000000000000000000000000000000000000..18e27622442e315fc3a58d55b4abcb586c5cda60
--- /dev/null
+++ b/src/doc/tutorials/scifolder.rst
@@ -0,0 +1,103 @@
+Example CFood
+=============
+
+Let's walk through a more elaborate example of using the CaosDB Crawler that
+makes use of a simple directory structure. We assume that the structure which
+is supposed to be crawled has the following form:
+
+.. code-block::
+
+   ExperimentalData/
+
+     2022_ProjectA/
+
+       2022-02-17_TestDataset/
+         file1.dat
+         file2.dat
+         ...
+       ...
+
+     2023_ProjectB/
+       ...
+
+   ...
+
+This file structure conforms to the one described in our article "Guidelines
+for a Standardized Filesystem Layout for Scientific Data"
+(https://doi.org/10.3390/data5020043). As a simplified example, we want to
+write a crawler that creates "Project" and "Measurement" Records in CaosDB and
+sets some reasonable Properties stemming from the file and directory names.
+Furthermore, we want to link the fictional data files to the Measurement
+Records.
+
+Let's first clarify the terms we are using:
+
+.. code-block::
+
+   ExperimentalData/               <--- Category level (level 0)
+
+     2022_ProjectA/                <--- Project level (level 1)
+
+       2022-02-17_TestDataset/     <--- Activity / Measurement level (level 2)
+         file1.dat                 <--- Files on level 3
+         file2.dat
+         ...
+       ...
+
+     2023_ProjectB/                <--- Project level (level 1)
+       ...
+
+   ...
+
+So we can see that the three-level folder structure described in the paper is
+replicated. We are using the term "Activity level" here instead of the terms
+used in the article, as it can be used in a more general way.
+
+The following YAML CFood is able to match and insert / update the Records
+accordingly. We added a detailed explanation of the specific parts of the
+YAML definition:
+
+.. image:: example_crawler.svg
+
+If you want to try this out yourself, you can do so by
+
+- copying the folder with example data somewhere (you can find it
+  `here <https://gitlab.indiscale.com/caosdb/src/caosdb-crawler/-/tree/main/unittests/test_directories/examples_article>`__),
+- adding the files to the server (see below),
+- copying the CFood (you can find it
+  `here <https://gitlab.indiscale.com/caosdb/src/caosdb-crawler/-/blob/main/unittests/scifolder_cfood.yml>`__), and
+- adding the model to the server (you can find it
+  `here <https://gitlab.indiscale.com/caosdb/src/caosdb-crawler/-/blob/main/integrationtests/basic_example/model.yml>`__).
+
+If the Records that are created shall be referenced by CaosDB File Entities,
+you (currently) need to make the files accessible in CaosDB in advance. For
+example, you may have a folder with experimental data and want those files to
+be referenced (for example, by an Experiment Record). The best option here is
+to have the file system where the data resides mounted into your CaosDB
+instance and then add the corresponding files using ``loadFiles`` of the
+Python library:
+
+.. code-block:: sh
+
+   python -m caosadvancedtools.loadFiles /opt/caosdb/mnt/extroot/mount_point_name
+
+(The path is the one that the CaosDB server needs, which is not necessarily the
+same as the one on your local machine. The prefix ``/opt/caosdb/mnt/extroot/``
+is correct for all LinkAhead instances. If you are in doubt, please ask your
+administrator for the correct path.)
+For more information on ``loadFiles``, call
+``python -m caosadvancedtools.loadFiles --help``.
+
+We still need the identifiable definition for this use case. Store the
+following in a file called ``identifiables.yml``:
+
+.. code-block:: yaml
+
+   Person:
+     - last_name
+   Measurement:
+     - date
+     - project
+   Project:
+     - date
+     - identifier
+
+Run the crawler with:
+
+.. code-block:: sh
+
+   caosdb-crawler -s update -i identifiables.yml scifolder_cfood.yml extroot
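+
+Afterwards, you can inspect the result on the server, either with queries in
+the web interface or via the Python client. A small sketch with count queries:
+
+.. code-block:: python
+
+   import caosdb as db
+
+   # count the Records created from the folder structure
+   print(db.execute_query("COUNT Project"))
+   print(db.execute_query("COUNT Measurement"))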