diff --git a/src/doc/development/benchmarking.md b/src/doc/development/benchmarking.md
new file mode 100644
index 0000000000000000000000000000000000000000..868c4cc72cb4205980315532ef1bfee011ddf377
--- /dev/null
+++ b/src/doc/development/benchmarking.md
@@ -0,0 +1,130 @@
+
+# Benchmarking CaosDB #
+
+Benchmarking CaosDB may encompass several distinct areas: How much time is spent in the server's
+Java code, how much time is spent inside the SQL backend, and are the same costly methods called
+more than once? This documentation tries to answer some questions connected with these
+benchmarking aspects and give you the tools to answer your own questions.
+
+## Tools for benchmarking ##
+
+For averaging over many runs of comparable requests and for putting the database into a
+representative state, Python scripts are used. The scripts can be found in the `caosdb-dev-tools`
+repository, located at
+[https://gitlab.indiscale.com/caosdb/src/caosdb-dev-tools](https://gitlab.indiscale.com/caosdb/src/caosdb-dev-tools),
+in the folder `benchmarking`:
+
+### `fill_database.py` ###
+
+This command line script is meant for filling the database with enough data to represent an actual
+real-life case; it can easily create hundreds of thousands of Entities.
+
+The script inserts predefined amounts of randomized Entities into the database: RecordTypes,
+Properties and Records. Each Record has a random (but with a defined average) number of
+Properties, some of which may be references to other Records which have been inserted before. The
+actual insertion of the Entities into CaosDB is done in chunks of a defined size.
+
+Users can tell the script to store the times needed for the insertion of each chunk in a tsv file.
+
+### `measure_execution_time.py` ###
+
+A somewhat outdated script which executes a given query a number of times and then saves
+statistics about the `TransactionBenchmark` readings (see below for more information about the
+transaction benchmarks) delivered by the server.
+
+### Benchmarking SQL commands ###
+
+MariaDB and MySQL have a feature to log the times of SQL queries (the general query log). This
+logging must be enabled on the SQL server. For the LinkAhead Docker environment, this can be done
+with `linkahead mysqllog {on,off,store}`.
+
+### External JVM profilers ###
+
+In addition to the transaction benchmarks, it is possible to benchmark the server execution via
+external Java profilers. For example, [VisualVM](https://visualvm.github.io/) can connect to JVMs
+running locally or remotely (e.g. in a Docker container). To enable this in LinkAhead's Docker
+environment, set
+
+```yaml
+devel:
+  profiler: true
+```
+
+Most profilers, such as VisualVM, only gather cumulative data for call trees; they do not provide
+complete call graphs (as callgrind/kcachegrind would do). They also do not differentiate between
+calls with different query strings, as long as the Java process flow is the same (for example,
+`FIND Record 1234` and `FIND Record A WHICH HAS A Property B WHICH HAS A Property C>100` would be
+handled equally).
+
+## How to set up a representative database ##
+
+For reproducible results, it makes sense to start off with an empty database and fill it using the
+`fill_database.py` script, for example like this:
+
+```sh
+./fill_database.py -t 500 -p 700 -r 10000 -s 100 --clean
+```
+
+This example would create 500 RecordTypes, 700 Properties and 10000 Records with randomized
+properties; everything is inserted in chunks of 100 Entities. The `--clean` argument is not
+strictly necessary when the database was empty before, but it may make sense when there have been
+previous runs of the command.
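+
+As a quick sanity check of the resulting database state (and to help document that state for later
+comparison), the number of inserted Entities can be counted, for example with a `COUNT` query. The
+following snippet is only a sketch: it assumes that the CaosDB Python client (caosdb-pylib) is
+installed and configured for the benchmarked server, and that `COUNT` queries return just the
+number of matching entities:
+
+```python
+# Assumption: caosdb-pylib is installed and configured for the server under test.
+import caosdb
+
+print(caosdb.execute_query("COUNT Record"))  # prints the number of Records in the database
+```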
+
+## How to measure request times ##
+
+If the execution of the Java components is of interest, the VisualVM profiler should be started
+and connected to the server before any requests are sent to the server.
+
+When doing performance tests which are used for detailed analysis, it is important that
+
+1. CaosDB is in a reproducible state, which should be documented
+2. all measurements are repeated several times to account for inevitable variance in access (for
+   example file system caching, network variability, etc.)
+
+### Filling the database ###
+
+By simply adding the option `-T logfile.tsv` to the `fill_database.py` command above, the times
+for inserting the records are stored in a tsv file and can be analyzed later.
+
+### Obtaining statistics about a query ###
+
+To repeat single queries a number of times, `measure_execution_time.py` can be used, for example:
+
+```sh
+./measure_execution_time.py -n 120 -q "FIND MusicalInstrument WHICH IS REFERENCED BY Analysis"
+```
+
+This command executes the query 120 times; additional arguments can even plot the
+TransactionBenchmark results directly.
+
+## What is measured ##
+
+For a consistent interpretation, the exact definitions of the measured times are as follows:
+
+### SQL logs ###
+
+As per https://mariadb.com/kb/en/general-query-log, the logs store only the time at which the SQL
+server received a query, not the duration of the query.
+
+#### Possible future enhancements ####
+
+- The `query_response_time` plugin may additionally be used in the future, see
+  https://mariadb.com/kb/en/query-response-time-plugin
+
+### Transaction benchmarks ###
+
+Transaction benchmarking manually collects timing information for each transaction. At defined
+points, different measurements can be made and accumulated; the results are finally returned to
+the client. Benchmark objects may consist of sub-benchmarks and have a number of measurement
+objects, which contain the actual statistics.
+
+Because transaction benchmarks must be manually added to the server code, they only monitor those
+code paths where they are added. On the other hand, their manual nature allows for a more
+abstracted analysis of performance bottlenecks.
+
+### Java profiler ###
+
+VisualVM records the call tree for each thread, specifically which methods were called how often
+and how much time was spent inside these methods.
+
+### Global requests ###
+
+Python scripts may measure the global time needed for the execution of each request.
+`fill_database.py` obtains its numbers this way.
diff --git a/src/doc/index.rst b/src/doc/index.rst
index e9734cbbcc009bc3625216b755258250a9a04d58..6edb6b40dfd613b444c5f1219a0fe4e69c7a9ee6 100644
--- a/src/doc/index.rst
+++ b/src/doc/index.rst
@@ -12,6 +12,7 @@ Welcome to caosdb-server's documentation!
    Concepts <concepts>
    tutorials
    administration/*
+   development/*
    API documentation<_apidoc/packages>
 
 Welcome to the CaosDB, the flexible semantic data management toolkit!
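
As a rough illustration of the "Global requests" timing described in `benchmarking.md`, the
following sketch measures the round-trip time of a single query from the client side, in the same
spirit as the numbers obtained by `fill_database.py`. It is only a sketch: it assumes that
caosdb-pylib is installed and configured (e.g. via `~/.pycaosdb.ini`) for the server under test,
and the query string is merely an example.

```python
#!/usr/bin/env python3
# Sketch: measure the global (client-side) time of a single request.
# Assumes caosdb-pylib is installed and configured (e.g. via ~/.pycaosdb.ini)
# for the server under test; the query below is only an example.
import time

import caosdb as db

query = "FIND Record MusicalInstrument"
start = time.perf_counter()
result = db.execute_query(query)
elapsed = time.perf_counter() - start
print(f"{query!r} took {elapsed:.3f} s and returned {len(result)} entities")
```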