Verified Commit c8ea9f90 authored by Daniel Hornung

Merge branch 'dev' into f-convert-xlsx-to-json-next

parents 49c1e0d4 b49aa866
2 merge requests: !107 Release v0.11.0, !103 xlsx -> json conversion
Pipeline #50702 passed
@@ -64,6 +64,7 @@ Build documentation in `build/` with `make doc`.
 - `sphinx`
 - `sphinx-autoapi`
+- `sphinx-rtd-theme`
 - `recommonmark >= 0.6.0`
 ### How to contribute ###
......
@@ -103,8 +103,6 @@ Look at ``xlsx_utils.get_path_position`` for the specification of the "proper na
 data_column_paths = {col.index: col.path for col in data_columns.values()}
 # Parent path, insert in correct order.
 parent, proper_name = xlsx_utils.get_path_position(sheet)
-# print(parent, proper_name, sheet.title)
-# breakpoint()
 if parent:
     parent_sheetname = xlsx_utils.get_worksheet_for_path(parent, self._defining_path_index)
     if parent_sheetname not in self._handled_sheets:
@@ -148,7 +146,6 @@ Look at ``xlsx_utils.get_path_position`` for the specification of the "proper na
     value = self._validate_and_convert(value, path)
     _set_in_nested(mydict=data, path=path, value=value, prefix=parent, skip=1)
     continue
-continue
 # Find current position in tree
 parent_dict = self._get_parent_dict(parent_path=parent, foreign=foreign)
@@ -157,11 +154,7 @@ Look at ``xlsx_utils.get_path_position`` for the specification of the "proper na
     if proper_name not in parent_dict:
         parent_dict[proper_name] = []
     parent_dict[proper_name].append(data)
-    # breakpoint()
-    # if sheet.title == "Training.Organisation":
-    # breakpoint()
     self._handled_sheets.add(sheet.title)
-    # print(f"Added sheet: {sheet.title}")
 def _is_multiple_choice(self, path: list[str]) -> bool:
     """Test if the path belongs to a multiple choice section."""
@@ -309,7 +302,7 @@ mydict: dict
 path: list
   A list of keys, denoting the location of the value.
 value
-  The value inside the dict.
+  The value which shall be set inside the dict.
 prefix: list
   A list of keys which shall be removed from ``path``. A KeyError is raised if ``path`` does not
   start with the elements of ``prefix``.
......
# Conversion between LinkAhead data models, JSON schema, and XLSX (and vice versa) #
This file describes the conversion between JSON schema files and XLSX templates, and between JSON
data files following a given schema and XLSX files with data. This conversion is handled by the
Python modules in the `table_json_conversion` library.
Requirements: When converting from a JSON schema, the top level of the JSON schema must be a
dict. The keys of the dict are RecordType names.
## Data models in JSON Schema and JSON data ##
The data model in LinkAhead defines the types of records present in a LinkAhead instance and their
structure. This data model can also be represented in a JSON Schema, which defines the structure of
JSON files containing records pertaining to the data model.
For example, the following JSON can describe a single "Person" Record:
```JSON
{
"Person": [
{
"family_name": "Steve",
"given_name": "Stevie"
}
]
}
```
A *JSON Schema* specifies a concrete structure, and the associated JSON files can be used to
represent data for specific record structures. For instance, one could create a JSON Schema allowing
the storage of "Training" Records containing information about conducted trainings. This is
particularly valuable for data import and export. One could generate web forms from the JSON Schema
or use it to export objects stored in LinkAhead as JSON.
### Note: Data models and data arrays ###
The schema as created by ``json_schema_exporter.recordtype_to_json_schema(...)`` is, broadly
speaking, a dict with all the top-level RecordTypes (the RecordType names are the keys). While this
is appropriate for the generation of user input forms, data often consists of multiple entries of
the same type. XLSX files are no exception: users expect that they may enter multiple rows of data.
Since the data model schema does not match multiple data sets, there is a utility function which
creates a *data array* schema out of the *data model* schema: it replaces the top-level entries of
the data model with lists which may contain data.
A **short example** illustrates this well. Consider a *data model* schema which matches this data
content:
```JSON
{
"Person": {
"name": "Charly"
}
}
```
Now the automatically generated *data array* schema would accept the following data:
```JSON
{
"Person": [
{
"name": "Charly"
},
{
"name": "Sam"
}
]
}
```
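The transformation from a *data model* schema to a *data array* schema can be sketched in a few
lines of Python. This is only an illustration under assumptions: the function name
`to_data_array_schema` is hypothetical and is not the library's utility function, and the schema
is assumed to keep its RecordTypes under a top-level `properties` key.

```python
import copy


def to_data_array_schema(model_schema: dict) -> dict:
    """Return a copy of the schema in which each top-level entry is a list.

    Hypothetical helper for illustration only, not the library's own function.
    """
    array_schema = copy.deepcopy(model_schema)
    array_schema["properties"] = {
        # Each RecordType schema becomes the ``items`` schema of an array.
        name: {"type": "array", "items": copy.deepcopy(subschema)}
        for name, subschema in model_schema.get("properties", {}).items()
    }
    return array_schema


# Assumed minimal data model schema matching the "Person" example above.
model_schema = {
    "type": "object",
    "properties": {
        "Person": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
        }
    },
}
array_schema = to_data_array_schema(model_schema)
```

The input schema is left unchanged, so the same data model schema can still be used for form
generation.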
## From JSON to XLSX: Data Representation ##
The following describes how JSON files representing LinkAhead records are converted into XLSX files,
or how JSON files with records are created from XLSX files.
The attribute name (e.g., "Person" above) determines the RecordType, and the value of this attribute
can either be an object or a list. If it is an object (as in the example above), a single record is
represented. In the case of a list, multiple records sharing the same RecordType as the parent are
represented.
The *Properties* of the record (e.g., `family_name` and `given_name` above) become *columns* in the
XLSX file. These properties have an attribute name and a value. The value can be:
a. A primitive (text, number, boolean, ...)
b. A record
c. A list of primitive types
d. A list of unique enums (multiple choice)
e. A list of records
In cases *a.* and *c.*, a cell is created in the column corresponding to the property in the XLSX
file. In case *b.*, columns are created for the Properties of the record, where for each of the
Properties the cases *a.* - *e.* are considered recursively. Case *d.* leads to a number of
columns, one for each of the possible choices.
For case *e.* however, the two-dimensional structure of an XLSX sheet is not sufficient. Therefore,
for such cases, *new* XLSX sheets/tables are created.
In these sheets/tables, the referenced records are treated as described above (new columns for the
Properties). However, there are now additional columns that indicate from which "external" record
these records are referenced.
Let's now consider these five cases in detail and with examples:
### a. Properties with primitive data types ###
```JSON
{
"Training": [
{
"date": "2023-01-01",
"url": "www.indiscale.com",
"duration": 1.0,
"participants": 1,
"remote": false
},
{
"date": "2023-06-15",
"url": "www.indiscale.com/next",
"duration": 2.5,
"participants": null,
"remote": true
}
]
}
```
These entries will be represented in an XLSX sheet with the following content:
| date | url | duration | participants | remote |
|------------|------------------------|----------|--------------|--------|
| 2023-01-01 | www.indiscale.com | 1.0 | 1 | false |
| 2023-06-15 | www.indiscale.com/next | 2.5 | | true |
### b. Property referencing a record ###
```JSON
{
"Training": [
{
"date": "2023-01-01",
"supervisor": {
"family_name": "Stevenson",
"given_name": "Stevie"
}
}
]
}
```
This entry will be represented in an XLSX sheet with the following content:
| date | `supervisor.family_name` | `supervisor.given_name` |
|------------|--------------------------|-------------------------|
| 2023-01-01 | Stevenson | Stevie |
Note that column names may be renamed. The mapping of columns to properties of records is ensured
through the content of hidden rows. (See below for the definition of hidden rows.)
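The flattening of a referenced record into dotted column names (case *b.*) could look roughly
like the following sketch. The function name `flatten_record` is hypothetical; the dotted names
only mirror the visible column titles in the table above, whereas the real mapping is stored in
the hidden rows.

```python
def flatten_record(record: dict, prefix: str = "") -> dict:
    """Flatten nested records into a single row with dotted column names.

    Illustration only: lists (cases c. to e.) are not handled specially here.
    """
    row = {}
    for key, value in record.items():
        column = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Case b.: recurse into the referenced record.
            row.update(flatten_record(value, prefix=column))
        else:
            # Case a.: primitive value, one cell.
            row[column] = value
    return row


training = {
    "date": "2023-01-01",
    "supervisor": {"family_name": "Stevenson", "given_name": "Stevie"},
}
row = flatten_record(training)
```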
### c. Properties containing lists of primitive data types ###
```JSON
{
"Training": [
{
"url": "www.indiscale.com",
"subjects": ["Math", "Physics"]
}
]
}
```
This entry would be represented in an XLSX sheet with the following content:
| url | subjects |
|-------------------|--------------|
| www.indiscale.com | Math;Physics |
The list elements are written into the cell separated by `;` (semicolon). If the elements contain
the separator `;`, it is escaped with `\\`.
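A minimal sketch of writing and reading such list cells is given below. It assumes that a
separator inside an element is escaped with a single backslash; the helper names `join_list` and
`split_cell` are hypothetical and not part of the library.

```python
import re


def join_list(elements) -> str:
    """Join list elements into one cell, escaping separators inside elements."""
    return ";".join(str(element).replace(";", "\\;") for element in elements)


def split_cell(cell: str) -> list:
    """Split a cell at unescaped semicolons and undo the escaping."""
    parts = re.split(r"(?<!\\);", cell)
    return [part.replace("\\;", ";") for part in parts]


cell = join_list(["Math", "Physics"])
escaped = join_list(["Math; advanced", "Physics"])
```

Note that this simple scheme cannot distinguish an element that ends in a literal backslash from
an escaped separator; a full implementation would need to escape backslashes as well.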
### d. Multiple choice properties ###
```JSON
{
"Training": [
{
"date": "2024-04-17",
"skills": [
"Planning",
"Evaluation"
]
}
]
}
```
If the `skills` list is denoted as an `enum` array with `"uniqueItems": true` in the JSON schema,
this entry would be represented like this in an XLSX file:
| date | skills.Planning | skills.Communication | skills.Evaluation |
|------------|-----------------|----------------------|-------------------|
| 2024-04-17 | x | | x |
Note that this example assumes that the list of possible choices, as given in the JSON schema, was
"Planning, Communication, Evaluation".
### e. Properties containing lists with references ###
```JSON
{
"Training": [
{
"date": "2023-01-01",
"coach": [
{
"family_name": "Sky",
"given_name": "Max"
},
{
"family_name": "Sky",
"given_name": "Min"
}
]
}
]
}
```
Since the two coaches cannot be represented properly in a single cell, another worksheet is needed
to contain the properties of the coaches.
The sheet for the Trainings in this example only contains the "date" column:
| date |
|------------|
| 2023-01-01 |
Additionally, there is *another* sheet where the coaches are stored. Here, it is crucial to define
how the correct element is chosen from potentially multiple "Trainings". In this case, it means that
the "date" must be unique.
Note: This uniqueness requirement is not strictly checked right now. It is your responsibility as a
user to ensure that such "foreign properties" are truly unique.
The second sheet looks like this:
| date | `coach.family_name` | `coach.given_name` |
|------------|---------------------|--------------------|
| 2023-01-01 | Sky | Max |
| 2023-01-01 | Sky | Min |
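The split into two sheets can be sketched as follows. This is an illustration under assumptions:
the function `split_sheets` and its parameters are hypothetical, rows are modelled as plain
dicts, and `date` is taken to be the unique "foreign" property as in the example above.

```python
def split_sheets(trainings, list_property="coach", foreign_key="date"):
    """Split records into rows for the main sheet and rows for the child sheet."""
    main_rows = []
    child_rows = []
    for training in trainings:
        # Main sheet: everything except the list of referenced records.
        main_rows.append({k: v for k, v in training.items() if k != list_property})
        for child in training.get(list_property, []):
            # Child sheet: the foreign key identifies the parent row.
            row = {foreign_key: training[foreign_key]}
            row.update({f"{list_property}.{k}": v for k, v in child.items()})
            child_rows.append(row)
    return main_rows, child_rows


trainings = [
    {
        "date": "2023-01-01",
        "coach": [
            {"family_name": "Sky", "given_name": "Max"},
            {"family_name": "Sky", "given_name": "Min"},
        ],
    }
]
main_rows, child_rows = split_sheets(trainings)
```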
## Data in XLSX: Hidden automation logic ##
### First column: Marker for row types ###
The first column in each sheet will be hidden and it will contain an entry in each row that needs
special treatment. The following values are used:
- ``IGNORE``: This row is ignored. It can be used for explanatory texts or layout.
- ``COL_TYPE``: Typically the first row that is not `IGNORE`. It indicates the row that defines the
type of columns (`FOREIGN`, `SCALAR`, `LIST`, `MULTIPLE_CHOICE`, `IGNORE`). This row must occur
exactly once per sheet.
- ``PATH``: Indicates that the row is used to define the path within the JSON. These rows are
typically hidden for users.
An example table could look like this:
| `IGNORE` | | Welcome | to this | file! | |
| `IGNORE` | | Please | enter your | data here: | |
| `COL_TYPE` | `IGNORE` | `SCALAR` | `SCALAR` | `LIST` | `SCALAR` |
| `PATH` | | `Training` | `Training` | `Training` | `Training` |
| `PATH` | | `url` | `date` | `subjects` | `supervisor` |
| `PATH` | | | | | `email` |
| `IGNORE` | Please enter one training per line. | Training URL | Training date | Subjects | Supervisor's email |
|------------|-------------------------------------|----------------|---------------|--------------|--------------------|
| | | example.com/mp | 2024-02-27 | Math;Physics | steve@example.com |
| | | example.com/m | 2024-02-27 | Math | stella@example.com |
### Parsing XLSX data ###
To extract the value of a given cell, we traverse all path elements (in ``PATH`` rows) from top to
bottom. The final element of the path is the name of the Property to which the value belongs. In
the example above, `steve@example.com` is the value of the `email` Property in the path
`["Training", "supervisor", "email"]`.
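The top-to-bottom traversal of the ``PATH`` rows can be illustrated with a short sketch. It
assumes the sheet has already been read into a list of rows (lists of cell values, with `None`
for empty cells); the function name `column_paths` is hypothetical.

```python
def column_paths(rows):
    """Collect the path of each column from the ``PATH`` rows, top to bottom."""
    paths = {}
    for row in rows:
        if row[0] != "PATH":
            continue  # Only rows marked PATH in the first column contribute.
        for index, element in enumerate(row[1:], start=1):
            if element:
                paths.setdefault(index, []).append(element)
    return paths


# Shortened version of the example table above.
rows = [
    ["IGNORE", None, "Welcome", "to this", "file!", None],
    ["COL_TYPE", "IGNORE", "SCALAR", "SCALAR", "LIST", "SCALAR"],
    ["PATH", None, "Training", "Training", "Training", "Training"],
    ["PATH", None, "url", "date", "subjects", "supervisor"],
    ["PATH", None, None, None, None, "email"],
    [None, None, "example.com/mp", "2024-02-27", "Math;Physics", "steve@example.com"],
]
paths = column_paths(rows)
```

With these rows, the last column (index 5) gets the path `["Training", "supervisor", "email"]`,
while the `IGNORE` column (index 1) has no path at all.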
The path elements are sufficient to identify the object within a JSON, at least if the corresponding
JSON element is a single object. If the JSON element is an array, the appropriate object within the
array needs to be selected.
For this selection additional ``FOREIGN`` columns are used. The paths in these columns must all have
the same *base* and one additional *unique key* component. For example, two `FOREIGN` columns could
be `["Training", "date"]` and `["Training", "url"]`, where `["Training"]` is the *base path* and
`"date"` and `"url"` are the *unique keys*.
The base path defines the table (or recordtype) to which the entries belong, and the values of the
unique keys define the actual rows to which data belongs.
For example, this table defines three coaches for the two trainings from the last table:
| `COL_TYPE` | `FOREIGN` | `FOREIGN` | `SCALAR` |
| `PATH` | `Training` | `Training` | `Training` |
| `PATH` | `date` | `url` | `coach` |
| `PATH` | | | `given_name` |
| `IGNORE` | Date of training | URL of training | The coach's given name |
| `IGNORE` | from sheet 'Training' | from sheet 'Training' | |
|------------|-----------------------|-----------------------|------------------------|
| | 2024-02-27 | example.com/mp | Ada |
| | 2024-02-27 | example.com/mp | Berta |
| | 2024-02-27 | example.com/m | Chris |
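How the `FOREIGN` columns select the parent object can be sketched like this. The code is an
illustration only: the function `attach_coaches` is hypothetical, the rows are given as plain
tuples, and the combination of `date` and `url` is assumed to be unique per training.

```python
def attach_coaches(trainings, coach_rows):
    """Append each coach to the training identified by its foreign key values."""
    # Index the parent records by the values of their unique keys.
    index = {(t["date"], t["url"]): t for t in trainings}
    for date, url, given_name in coach_rows:
        parent = index[(date, url)]  # A KeyError signals a missing parent row.
        parent.setdefault("coach", []).append({"given_name": given_name})
    return trainings


trainings = [
    {"date": "2024-02-27", "url": "example.com/mp"},
    {"date": "2024-02-27", "url": "example.com/m"},
]
coach_rows = [
    ("2024-02-27", "example.com/mp", "Ada"),
    ("2024-02-27", "example.com/mp", "Berta"),
    ("2024-02-27", "example.com/m", "Chris"),
]
attach_coaches(trainings, coach_rows)
```

Note that `date` alone would not be sufficient here, because both trainings took place on the
same day; only together with `url` do the keys identify a single row.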
#### Special case: multiple choice "checkboxes" ####
As a special case, enum arrays with `"uniqueItems": true` can be represented as multiple columns,
with one column per choice. The choices are denoted as the last `PATH` component, and the column
type must be `MULTIPLE_CHOICE`.
Stored data is denoted as an "x" character in the respective cell; empty cells denote that the item
was not selected. Additionally, the implementation also allows `TRUE` or `1` for selected items,
and `FALSE`, `0` or cells with only whitespace characters for deselected items:
| `COL_TYPE` | `MULTIPLE_CHOICE` | `MULTIPLE_CHOICE` | `MULTIPLE_CHOICE` |
| `PATH` | `skills` | `skills` | `skills` |
| `PATH` | `Planning` | `Communication` | `Evaluation` |
| `IGNORE` | skills.Planning | skills.Communication | skills.Evaluation |
|------------|-------------------|----------------------|-------------------|
| | x | | X |
| | `" "` | `TRUE` | `FALSE` |
| | 0 | x | 1 |
These rows correspond to:
1. Planning, Evaluation
2. Communication
3. Communication, Evaluation
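The interpretation of the cell values can be sketched as follows. This is a hedged illustration,
not the library's implementation: the function names `is_selected` and `selected_choices` are
hypothetical, and accepting an uppercase `X` is an assumption based on the example table.

```python
def is_selected(cell) -> bool:
    """Interpret a MULTIPLE_CHOICE cell value as selected or deselected."""
    if cell is None:
        return False
    if isinstance(cell, bool):
        return cell
    text = str(cell).strip()
    if text in ("", "0", "FALSE"):
        return False
    if text.lower() == "x" or text in ("1", "TRUE"):
        return True
    raise ValueError(f"Unexpected value in multiple choice cell: {cell!r}")


def selected_choices(choices, row):
    """Return the choices whose cells are marked as selected."""
    return [choice for choice, cell in zip(choices, row) if is_selected(cell)]


choices = ["Planning", "Communication", "Evaluation"]
table = [
    ["x", None, "X"],
    [" ", "TRUE", "FALSE"],
    [0, "x", 1],
]
results = [selected_choices(choices, row) for row in table]
```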
## Current limitations ##
The current implementation still lacks the following:
- File handling is not implemented yet.