Unverified Commit 0cd0ef48 authored by Daniel

DOC: Benchmarking doc moved to this repository.

parent 8827a23e
# Benchmarking CaosDB #
Benchmarking CaosDB may encompass several distinct areas: How much time is spent in the server's
Java code, how much time is spent inside the SQL backend, and are the same costly methods called more
than once? This documentation tries to answer some questions connected with these benchmarking
aspects and to give you the tools to answer your own questions.
## Tools for the benchmarking ##
For averaging over many runs of comparable requests and for putting the database into a
representative state, Python scripts are used. The scripts can be found in the `caosdb-dev-tools`
repository, located at [https://gitlab.indiscale.com/caosdb/src/caosdb-dev-tools](https://gitlab.indiscale.com/caosdb/src/caosdb-dev-tools) in the folder
`benchmarking`:
### `fill_database.py` ###
This command line script fills the database with enough data to represent an actual
real-life case; it can easily create hundreds of thousands of Entities.
The script inserts predefined numbers of randomized Entities (RecordTypes, Properties and Records)
into the database. Each Record has a random (but with a defined average) number of Properties,
some of which may be references to other Records which have been inserted before. The actual
insertion of the Entities into CaosDB is done in chunks of a defined size.
Users can tell the script to store the times needed for the insertion of each chunk into a TSV file.
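As a sketch of how this can be used, the following snippet drives the script from Python to compare
different chunk sizes. The option names (`-t`, `-p`, `-r`, `-s`, `-T`, `--clean`) are the ones
described in this document; the concrete values are only examples, and the script is assumed to be
in the current directory.
```python
# Sketch: run fill_database.py for several chunk sizes and store the
# per-chunk insertion times in separate TSV files for later comparison.
# Option names are taken from this document; the values are only examples.
import subprocess

for chunk_size in (10, 100, 1000):
    subprocess.run(
        ["./fill_database.py",
         "-t", "50",                 # number of RecordTypes
         "-p", "100",                # number of Properties
         "-r", "2000",               # number of Records
         "-s", str(chunk_size),      # chunk size for the insertion
         "-T", f"insert_times_{chunk_size}.tsv",  # per-chunk timing log
         "--clean"],
        check=True,
    )
```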
### `measure_execution_time.py` ###
A somewhat outdated script which executes a given query a number of times and then saves statistics
about the `TransactionBenchmark` readings (see below for more information about the transaction
benchmarks) delivered by the server.
### Benchmarking SQL commands ###
MariaDB and MySQL have a feature to log SQL queries together with the time at which they were
received. This logging must be enabled on the SQL server. For the LinkAhead Docker environment,
this can be done with `linkahead mysqllog {on,off,store}`.
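Outside of that wrapper, the same logging can also be switched on manually on the SQL server. The
following sketch assumes direct database access with the third-party `pymysql` package and uses
placeholder credentials; adapt host, user and password to your setup.
```python
# Sketch: enable the MariaDB/MySQL general query log directly.
# Host, user and password are placeholders for your own setup.
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="secret")
with conn.cursor() as cur:
    # Write the log into the mysql.general_log table instead of a file.
    cur.execute("SET GLOBAL log_output = 'TABLE'")
    cur.execute("SET GLOBAL general_log = 'ON'")
    # ... run the requests you want to benchmark against CaosDB ...
    cur.execute("SELECT event_time, argument FROM mysql.general_log")
    for event_time, argument in cur.fetchall():
        print(event_time, argument)
conn.close()
```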
### External JVM profilers ###
In addition to the transaction benchmarks, the server execution can be benchmarked with
external Java profilers. For example, [VisualVM](https://visualvm.github.io/) can connect to JVMs running locally or remotely
(e.g. in a Docker container). To enable this in LinkAhead's Docker environment, set:
```yaml
devel:
  profiler: true
```
Most profilers, such as VisualVM, only gather cumulative data for call trees; they do not provide
complete call graphs (as callgrind/kcachegrind would do). They also do not differentiate between
calls with different query strings, as long as the Java process flow is the same (for example, `FIND
Record 1234` and `FIND Record A WHICH HAS A Property B WHICH HAS A Property C>100` would be handled
equally).
## How to set up a representative database ##
For reproducible results, it makes sense to start off with an empty database and fill it using the
`fill_database.py` script, for example like this:
```sh
./fill_database.py -t 500 -p 700 -r 10000 -s 100 --clean
```
The `--clean` argument is not strictly necessary when the database was empty before, but it may make
sense when there have been previous runs of the command. This example would create 500 RecordTypes,
700 Properties and 10000 Records with randomized properties; everything is inserted in chunks of 100
Entities.
## How to measure request times ##
If the execution of the Java components is of interest, the VisualVM profiler should be started and
connected to the server before any requests are sent to the server.
When doing performance tests which are used for detailed analysis, it is important that
1. CaosDB is in a reproducible state, which should be documented, and
2. all measurements are repeated several times to account for inevitable variance in access (for
example file system caching, network variability etc.); a sketch of such a repetition follows below.
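The following snippet sketches the second point by repeating a single query several times and
reporting simple statistics. It assumes a configured CaosDB Python client (pycaosdb); the query
string is the example used further below.
```python
# Sketch: repeat one query several times to average out caching and
# network effects. Assumes a configured CaosDB Python client (pycaosdb).
import statistics
import time

import caosdb as db

QUERY = "FIND MusicalInstrument WHICH IS REFERENCED BY Analysis"  # example query

timings = []
for _ in range(10):
    start = time.perf_counter()
    db.execute_query(QUERY)
    timings.append(time.perf_counter() - start)

print(f"mean:  {statistics.mean(timings):.3f} s")
print(f"stdev: {statistics.stdev(timings):.3f} s")
```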
### Filling the database ###
By simply adding the option `-T logfile.tsv` to the `fill_database.py` command above, the times for
inserting the records are stored in a TSV file and can be analyzed later.
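The exact column layout of that file is determined by the script; assuming it contains one column
with the time per chunk (hypothetically called `time` here), the readings can be summarized with
standard library tools, for example:
```python
# Sketch: summarize the per-chunk insertion times from the TSV log.
# The column name "time" is an assumption; check the actual header of the
# file produced by fill_database.py.
import csv
import statistics

with open("logfile.tsv", newline="") as f:
    times = [float(row["time"]) for row in csv.DictReader(f, delimiter="\t")]

print(f"chunks: {len(times)}")
print(f"mean:   {statistics.mean(times):.3f} s")
print(f"median: {statistics.median(times):.3f} s")
print(f"max:    {max(times):.3f} s")
```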
### Obtain statistics about a query ###
To repeat single queries a number of times, `measure_execution_time.py` can be used, for example:
```sh
./measure_execution_time.py -n 120 -q "FIND MusicalInstrument WHICH IS REFERENCED BY Analysis"
```
This command executes the query 120 times; additional arguments can even plot the
TransactionBenchmark results directly.
## What is measured ##
For a consistent interpretation, the exact definitions of the measured times are as follows:
### SQL logs ###
As per the MariaDB [general query log documentation](https://mariadb.com/kb/en/general-query-log), the logs store only the time at which the SQL
server received a query, not the duration of the query.
#### Possible future enhancements ####
- The `query_response_time` plugin may additionally be used in the future, see
[https://mariadb.com/kb/en/query-response-time-plugin](https://mariadb.com/kb/en/query-response-time-plugin)
### Transaction benchmarks ###
Transaction benchmarking manually collects timing information for each transaction. At defined
points, different measurements can be made, accumulated and will finally be returned to the client.
Benchmark objects may consist of sub-benchmarks and have a number of measurement objects which
contain the actual statistics.
Because transaction benchmarks must be manually added to the server code, they only monitor those
code paths where they are added. On the other hand, their manual nature allows for a more
abstracted analysis of performance bottlenecks.
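Conceptually, and only as an illustration (this is not the server's actual Java API), the structure
of benchmarks, sub-benchmarks and accumulated measurements could be pictured like this:
```python
# Conceptual sketch only: benchmarks with sub-benchmarks and accumulated
# measurements. The real implementation is part of the server's Java code
# and has a different API.
import time
from collections import defaultdict


class Benchmark:
    def __init__(self, name):
        self.name = name
        self.sub_benchmarks = {}                 # name -> Benchmark
        self.measurements = defaultdict(list)    # name -> list of durations

    def sub(self, name):
        """Return (and create if needed) a sub-benchmark."""
        return self.sub_benchmarks.setdefault(name, Benchmark(name))

    def measure(self, name, func, *args, **kwargs):
        """Run func and accumulate its duration under the given name."""
        start = time.perf_counter()
        result = func(*args, **kwargs)
        self.measurements[name].append(time.perf_counter() - start)
        return result
```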
### Java profiler ###
For each thread, VisualVM records the call tree, in particular which methods were called how often
and how much time was spent inside these methods.
### Global requests ###
Python scripts may measure the global time needed for the execution of each request.
`fill_database.py` obtains its numbers this way.
@@ -12,6 +12,7 @@ Welcome to caosdb-server's documentation!
 Concepts <concepts>
 tutorials
 administration/*
+development/*
 API documentation<_apidoc/packages>

Welcome to the CaosDB, the flexible semantic data management toolkit!