100 Days of MDBX - part 1

2025 Feb 27 See all posts



I want to read and explore more open source projects and learn from them. I picked MDBX for this exercise, as it's something we use at work in Erigon, where we've been building a database geared towards storing and serving archive blockchain data.

This is pretty much an unedited dump of notes taken. My blog is my weird corner of the internet, and I don't want to have a dependency on editing before publishing these notes. Each part contains 10 days of study. Here's Part 1.

day 1

got repo from erthink/libmdbx

building:

make all
make check

cmake - build generator, targeting build systems like make, ninja, visual studio; it generates build files from a single config

ninja - alternative to make; a small, fast build system designed for speed... parallel builds, automatically handles dependencies, often used as a backend for cmake; used in large projects like chrome and android

amalgamated sources – combining multiple files into 1 big file... make check gives an error – probably because you can't test on amalgamated sources (it says this). git fetch --tags --force --prune – should get the complete source, with which I can do make check

abandoning trying to run make check – it's meant for library developers, plus it's usually long-running tests. Weird that the "git fetch" didn't fix it, though. I'm running make test which is quicker... and I want to use it to add or see simple tests around creating tables or playing around with values etc. Actually I'm mistaken: make test seems to run stochastic tests... which is long. stochastic.sh has a loops param, which is 2 for test or 42 for long-test. make test completed and took 10 minutes or so... there's also a make smoke which uses mdbx_test and is shorter.

looking at the files involved in making the mdbx_test binary... starting with test/main.c++, the main file for the binary. First thing I saw was MDBX_NORETURN.

day 2

looking at mdbx-go... might provide smaller context. Difficult to see which tests to run...

PNL - page number list, a sorted array of IDs (mdbx_chk.c). pgno_t – page number type; 32-bit page numbers, which for 4k pages gives a db of size up to 2^44 bytes...

first step: create environment mdbx_env_create(MDBX_env**)

MDBX checks if lock-free atomics are provided for 32/64-bit integers. TIL about lock-free atomics. See this

basically, locks are used to synchronize between threads; can we do better than that? Yes: lock-free atomics. These use Compare and Swap (CAS), and can be used for complicated things like structs as well. You optimistically (and atomically) load the value, which creates a local copy, update it, and then attempt to swap it in.

C = atomic_load(A);

_Bool atomic_compare_exchange_weak(volatile A *object, C *expected, C desired);

in expected you put the initially loaded original, while desired is the updated value you want to replace it with; object is the value being shared across multiple threads. If atomic_compare_exchange_weak fails, you retry (so there's a while loop). CAS needs to deal with the ABA problem, which happens because a bunch of threads can combine ops and end up writing the same value back to a memory location, fooling an interrupted thread into thinking that the exchange succeeded. An additional tag (say one which keeps an "update count") can instead cause the exchange to (correctly) fail. Deferred reclamation – in which an update creates a new version, while existing readers can continue reading the old value (which is not GC'ed yet) – is a technique that can avoid the ABA problem, but it only works in automatic-GC environments that are lock-free (else the lock-free data structure is dependent on a locking GC, thereby making it locking too!). The CAS should happen atomically, btw. It does have hardware support – e.g. x86 implements it as the CMPXCHG instruction.
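A minimal retry loop using C++'s std::atomic, just to make the load/modify/swap cycle concrete (the increment is my own toy example, not anything from MDBX):

#include <atomic>

// lock-free increment: load a local copy, compute, then try to swap it in
void increment(std::atomic<int> &a) {
  int expected = a.load();  // optimistic local copy
  int desired = expected + 1;
  // on failure, compare_exchange_weak refreshes 'expected' with the current
  // value, so we recompute and retry until no other thread races us
  while (!a.compare_exchange_weak(expected, desired))
    desired = expected + 1;
}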

Environment flags (mdbx.h) – mdbx_env_open sets these flags.

exclusive and cooperative mode – in exclusive mode (MDBX_EXCLUSIVE) a single process owns the environment; in the default cooperative mode multiple processes can share it, coordinating through the lock file.

day 3

environment - supports multiple kv tables (aka kv maps/spaces/subdbs), all residing in the same shared-memory map.

TIL: c++ unions and their use cases. It's basically a special class type where all members share the same memory location.

union MyUnion {
    int i;
    float f;
    char str[20];
};  // All members share the same memory

// Usage:
MyUnion u;
u.i = 42;        // Uses memory as int
u.f = 3.14;      // Now the same memory is used as float
                 // (and i's value is no longer valid)

use cases: reinterpreting memory; memory optimization; tagged unions (with enums)

tracks – MDBX_env structure; geometry (mdbx_env_set_geometry()); page number lists

MDBX_env structure

quick browse –

pages – not exactly defined anywhere. Seems like a logical page, which can be of different types depending on what part of the B-tree it stores. The types are branch / leaf / large (overflow page) / meta / dupfix (MDBX_DUPFIXED records) / subp (MDBX_DUPSORT subpages) etc. (page_type_t)

readers table (ref reader_slot in layout-lck.h)

so finding the slot is a per-thread operation (and not a per-tx operation). Once the slot in the reader table is found, things like txnid, thread_id, owner_process_id etc. are written to this slot (reader_slot), and its address is recorded in thread-local storage (so it can be reused by other txs started in the same thread). Plus, the reader slots are aligned with the processor's cache line, to ensure there's no cache thrashing or contention when readers update their own slots.
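For my own notes, a toy picture of such a slot (field names are illustrative; the real definition in layout-lck.h is different and more involved):

#include <atomic>
#include <cstdint>

// one cache line per reader, so updates to a slot never invalidate a neighbour's line
struct alignas(64) ReaderSlotSketch {
  std::atomic<uint64_t> txnid;  // the snapshot this reader pins
  uint64_t thread_id;           // owning thread
  uint32_t owner_pid;           // owning process
};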

day 4

look a bit more into MVCC

transactional memory – it's a hardware/software solution to simplify a common concurrent-programming requirement: accessing a region of memory in an atomic way (by multiple readers and writers). Transactional memory provides optimistic concurrency control (no use of locks), allowing threads to run in parallel. It enforces atomicity, consistency and isolation in the region of code marked as a transaction. Since no lock mechanisms are used, transactional memory doesn't suffer from deadlocks. Transactional memory might be limited in its use cases, since it needs a shared-memory abstraction. It can be implemented in hardware, but usually this is limited; more often transactional memory is implemented at the software level, using synchronizing hardware primitives like atomic compare-and-swap (see lock-free atomics on day 2). Why is a lock-free solution desired? Read-write locks are known to create contention, especially between long-standing read transactions and new write transactions.

MVCC is a solution which provides transactional memory. It's non-locking: it maintains versions of the data read, and a write to a particular "slot" produces a new version. The version that each tx sees depends on the isolation level implemented. For MVCC the most common is snapshot isolation (as in MDBX), in which the tx sees the version of the data as of when the tx started. In MVCC the data items are "tagged" with either a tx timestamp or a transaction id, which is used to figure out what state of the data to read. Read and write txs are thus isolated from each other and need no locking. MVCC introduces the problem of purging older data versions that are no longer needed. It can do so by a "stop the world" process, like postgres' VACUUM FREEZE, for example. Some dbs also use undo logs to get to previous versions, but this is inefficient for write-intensive workloads.
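As a note to self, a toy sketch of snapshot reads over version chains (just the general idea, nothing to do with MDBX's actual layout):

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Version { uint64_t txid; std::string value; };  // value tagged with the writer's txid
using VersionedStore = std::map<std::string, std::vector<Version>>;

// a reader sees the newest version written at or before its snapshot txid
const std::string *snapshot_read(const VersionedStore &db,
                                 const std::string &key, uint64_t snapshot_txid) {
  auto it = db.find(key);
  if (it == db.end()) return nullptr;
  const std::string *result = nullptr;
  for (const Version &v : it->second)  // versions appended in txid order
    if (v.txid <= snapshot_txid) result = &v.value;
  return result;
}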

MVCC implementation:

MDBX has a more pragmatic implementation of MVCC:


DBI - database tables; DBI sequence numbers - the subdbs are assigned monotonically increasing numbers.

src: mdbx.h. mdbx_dbi_open – open or create a subdb and return the handle. Then there's mdbx_dbi_close() to close. Bunch of settings which define the subdb, e.g.

day 5

environments

mdbx_env_create() // create an environment (collection of subdbs, residing in same shared memory map)
mdbx_env_open()
mdbx_env_close()

these create a lock file (LCK-file) and storage file (aka DXB-file).
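To make the lifecycle concrete, a minimal sketch (error handling mostly skipped; the path, flags and key/value are just for illustration):

#include "libmdbx/mdbx.h"
#include <cstdio>
#include <cstring>

int main() {
  MDBX_env *env = nullptr;
  mdbx_env_create(&env);  // allocate the env handle
  // MDBX_NOSUBDIR: "./testdb" is the DXB-file itself; the LCK-file appears next to it
  int rc = mdbx_env_open(env, "./testdb", MDBX_NOSUBDIR, 0664);
  if (rc != MDBX_SUCCESS) {
    fprintf(stderr, "open failed: %s\n", mdbx_strerror(rc));
    return 1;
  }

  MDBX_txn *txn = nullptr;
  mdbx_txn_begin(env, nullptr, MDBX_TXN_READWRITE, &txn);
  MDBX_dbi dbi;
  mdbx_dbi_open(txn, nullptr, MDBX_DB_DEFAULTS, &dbi);  // default (unnamed) table

  MDBX_val key{(void *)"greeting", strlen("greeting")};
  MDBX_val val{(void *)"hello", strlen("hello")};
  mdbx_put(txn, dbi, &key, &val, MDBX_UPSERT);
  mdbx_txn_commit(txn);

  mdbx_env_close(env);
  return 0;
}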

transactions

mdbx_txn_begin(MDBX_env*, MDBX_txn *parent, MDBX_txn_flags_t, MDBX_txn **txn)

mdbx_txn_set_userctx(MDBX_txn *txn, void *ctx)

mdbx_txn_break()

mdbx_txn_commit_ex() - commit a tx, and return latency information
mdbx_txn_env() – returns the tx's env

mdbx_txn_park(MDBX_txn*, autounpark bool)

mdbx_txn_reset()

mdbx_txn_renew()
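The reset/renew pair hints at a read-txn reuse pattern; a sketch of how I understand it (env assumed from the earlier snippet):

MDBX_txn *rtxn = nullptr;
mdbx_txn_begin(env, nullptr, MDBX_TXN_RDONLY, &rtxn);  // pins a snapshot, occupies a reader slot
// ... read some values ...
mdbx_txn_reset(rtxn);  // release the snapshot but keep the handle (cheaper than abort + begin)
// ... later ...
mdbx_txn_renew(rtxn);  // re-arm the same handle against the current snapshot
// ... read again ...
mdbx_txn_abort(rtxn);  // finally free the handle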

MVCC

a question around MVCC – create a read-write tx, do some writes, and then create a nested read tx. Does the read tx see the new values? Or are the values only visible when there's a commit?
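A sketch of the experiment I'd run (I believe nested txns have to be read-write, so the child here is a write txn; env, dbi, key, val assumed from earlier):

MDBX_txn *parent = nullptr, *child = nullptr;
mdbx_txn_begin(env, nullptr, MDBX_TXN_READWRITE, &parent);
mdbx_put(parent, dbi, &key, &val, MDBX_UPSERT);  // uncommitted write

mdbx_txn_begin(env, parent, MDBX_TXN_READWRITE, &child);
MDBX_val out;
int rc = mdbx_get(child, dbi, &key, &out);  // does the child see the parent's write?
mdbx_txn_abort(child);
mdbx_txn_abort(parent);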

Also, MDBX_NOSTICKYTHREADS is weird – it tells MDBX to decouple txns from threads as much as possible. So a txn can be called from multiple threads in this case? – yes, the API functions don't check whether the tx and the current thread match. Txs and cursors can be called from any thread. But this also means that errors arising from simultaneous use of a tx/cursor in different threads are hard to detect.

database/dbi/subdb

tables api

mdbx_dbi_open_ex(): create or open a named table in the environment with custom comparison functions (for keys and values)
mdbx_dbi_rename(): renames a table given dbi handle

day 6

CRUD

mdbx_canary:

mdbx_cmp_func:

mdbx_put flags

cursors

mdbx_cursor_count(): gives the count for the current key (only makes sense in a MDBX_DUPSORT table)
mdbx_cursor_count_ex(): equivalent of the above, with additional nested b-tree page statistics for the current key

mdbx_cursor_delete(MDBX_cursor*, MDBX_put_flags_t)

mdbx_cursor_get(MDBX_cursor*, MDBX_val *key, MDBX_val *data, MDBX_cursor_op)

mdbx_cursor_get_batch() – batch get for non-dupsort tables. For dupsort tables use the above api with specific flag (MDBX_GET_MULTIPLE)

mdbx_cursor_put(cursor, key, value, MDBX_put_flags_t)
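A sketch of the basic iteration loop with a cursor (txn and dbi assumed from the day-5 snippet):

#include <cstdio>

MDBX_cursor *cur = nullptr;
mdbx_cursor_open(txn, dbi, &cur);
MDBX_val key{}, val{};
int rc = mdbx_cursor_get(cur, &key, &val, MDBX_FIRST);
while (rc == MDBX_SUCCESS) {
  printf("%.*s -> %.*s\n", (int)key.iov_len, (const char *)key.iov_base,
         (int)val.iov_len, (const char *)val.iov_base);
  rc = mdbx_cursor_get(cur, &key, &val, MDBX_NEXT);  // MDBX_NOTFOUND ends the loop
}
mdbx_cursor_close(cur);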

day 7

mdbx_cursor_scan(MDBX_cursor*, predicate_fn, *context, MDBX_cursor_op start_op, MDBX_cursor_op turn_op, void *arg)

This is more efficient than manually looping over the elements because it saves the accumulated cost of crossing the DSO (dynamic shared object) boundary. mdbx-go doesn't provide an api for this... might be interesting to look into, as scanning is a common operation.
A similar function is mdbx_cursor_scan_from, where you can provide a starting key-value pair for cursor positioning.
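My reading of the signature is that the predicate is called per item and the scan stops once it returns MDBX_RESULT_TRUE; a hedged sketch (the predicate's exact shape is my assumption from skimming mdbx.h; cur is the cursor from the earlier snippet):

// find the first value longer than some threshold
static int longer_than(void *context, MDBX_val *key, MDBX_val *value, void *arg) {
  (void)key; (void)arg;
  size_t threshold = *(size_t *)context;
  return value->iov_len > threshold ? MDBX_RESULT_TRUE : MDBX_RESULT_FALSE;
}

size_t threshold = 100;
int rc = mdbx_cursor_scan(cur, longer_than, &threshold,
                          MDBX_FIRST, MDBX_NEXT, /*arg=*/nullptr);
// rc == MDBX_RESULT_TRUE: cursor positioned on the match; MDBX_RESULT_FALSE: scanned to the end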

sequences

mdbx_dbi_sequence(txn, dbi, uint64 *result, uint64 increment)
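e.g. something like this to pull the next id (my understanding is that *result gets the value from before the increment):

uint64_t id = 0;
int rc = mdbx_dbi_sequence(txn, dbi, &id, 1);  // atomically bumps the per-table sequence
// id now holds the pre-increment value, if I'm reading the docs right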

mdbx_dcmp

mdbx_del(key, data) – delete item from table

mdbx_drop – empty/delete/close the table

mdbx_get(MDBX_txn*, MDBX_dbi, MDBX_val *key, MDBX_val *data)

mdbx_get_equal_or_great()

mdbx_put(txn, dbi, key, data, flags)

mdbx_replace(txn, dbi, key, *new_data, *old_data, flags)

cursors
cursor: an opaque structure for navigating through a table. You can bind the cursor to a txn and dbi (kv table), and then use it to traverse the table.

mdbx_cursor_create(void *context)

mdbx_cursor_copy(src, dest) – copy cursor position and state

bunch of basic queries about cursor's position

mdbx_cursor_renew(txn, cursor)

mdbx_cursor_reset(cursor)

mdbx_txn_release_all_cursors(txn, bool unbind)
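These suggest a cursor-reuse pattern mirroring the txn reset/renew one; a sketch under that assumption (txn1/txn2/dbi are placeholders):

MDBX_cursor *cur = mdbx_cursor_create(nullptr);  // unbound cursor, optional user context
mdbx_cursor_bind(txn1, cur, dbi);                // attach it to a txn + table
// ... use it ...
mdbx_cursor_renew(txn2, cur);                    // re-bind the same handle to a later txn
// ... use it again ...
mdbx_cursor_close(cur);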

day 8

I feel bored surveying the apis. I'm skipping the settings section. I want to write some code.
I can't do this in golang, since I'm interested in internal implementation details. So I need a way to run/debug c code? Yes, c is better; c++ debugging is a nightmare. Maybe I don't have that option... c and c++ code exist together, and I don't know what factor decides which code is in c and which in c++. Maybe just availability of STL libraries and other additional functionality like synchronization mechanisms.

cmake – build system generator (typically for c/c++ projects, but it can also work for other languages). I want to now write some unit tests based on libmdbx, and I'll use libmdbx as a submodule, as recommended by the documentation.

project is here

most of the time is spent on figuring out CMakeLists.txt...

With this I can write some tests:

#include <cstdio>
#include <gtest/gtest.h>
#include "libmdbx/mdbx.h"

TEST(EnvTests, OpenEnv) {
  MDBX_env *env = NULL;
  int rc = mdbx_env_create(&env);
  ASSERT_EQ(rc, 0);
  ASSERT_NE(env, nullptr);
  mdbx_env_close(env);  // release the handle so the test doesn't leak it
}

I created a Makefile which can build/test the code: make build and make test; there's also make cmake to build off CMakeLists.txt.

exposed a debug build version: make cmake-dbg.

day 9

well, looking at sync modes today. Got to revise the section "Controlling Kernel Buffering for File I/O" from TLPI. This is lower level, but still relevant to the MDBX sync modes...

The difference is that SIFI (synchronized I/O file integrity completion) needs 2 sync operations (data plus all file metadata), while SIDI (synchronized I/O data integrity completion) might just need 1. e.g. if the file size doesn't change, SIDI can get by with 1 call. This is crucial in db applications.
So that's the synchronization machinery linux provides. In MDBX, on top of this, there are various sync modes. The "data" is the MVCC snapshots of the B-tree; each snapshot has a separate root. The pointer to the root, along with the transaction number corresponding to it, is stored separately and is the metadata.
So durability considerations are two-fold: one is the pages of the B-tree itself, the second is this metadata page. Each of these syncs does fdatasync(fd). The MDBX sync modes play around with when and whether the data/metadata is synchronized.
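For example (a sketch; the flag combination is mine, semantics as I understand them from the docs):

// MDBX_SAFE_NOSYNC: defer syncing data pages (risk: lose recent txs, but no corruption);
// MDBX_NOMETASYNC: skip the separate sync of the meta page on commit
int rc = mdbx_env_open(env, "./testdb",
                       MDBX_NOSUBDIR | MDBX_SAFE_NOSYNC | MDBX_NOMETASYNC, 0664);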

env_flags:

ENV_CHANGEABLE_FLAGS = MDBX_SAFE_NOSYNC | MDBX_NOMETASYNC | DEPRECATED_MAPASYNC | MDBX_NOMEMINIT |
                       DEPRECATED_COALESCE | MDBX_PAGEPERTURB | MDBX_ACCEDE | MDBX_VALIDATION

ENV_CHANGELESS_FLAGS = MDBX_NOSUBDIR | MDBX_RDONLY | MDBX_WRITEMAP | MDBX_NOSTICKYTHREADS | MDBX_NORDAHEAD |
                       MDBX_LIFORECLAIM | MDBX_EXCLUSIVE

ENV_USABLE_FLAGS = ENV_CHANGEABLE_FLAGS | ENV_CHANGELESS_FLAGS

apart from this, configured vscode with C++ TestMate to make it detect and run GoogleTest. VSCode's "testing" functionality is quite good. By default vscode's cmake used ninja; I changed it to work with makefiles (the "Unix Makefiles" generator), since that's what I've been using when running things from the cli. I had some trouble getting recompilation working on testing, but after the above configuration it was okay.

day 10

Today let's look at what mechanisms are behind mdbx_env_open and mdbx_env_create, debugging through a test.

mdbx_env_open()

pathname is represented interestingly:

struct { /* path to the DB files */
 pathchar_t *lck, *dxb, *specified;
 void *buffer;
} pathname;

buffer contains the combined (back-to-back) values of the dxb file, lck file and folder (called "specified"). lck, dxb and specified then point to the appropriate sections in buffer. This is a weird way to do it; I think it might just be to get one larger contiguous section of memory for storing these 3 values, instead of 3 sections in different places. It feels like an over-optimization, but perhaps for a project that cares deeply about memory, where OOM errors are a possibility, it makes sense to reduce fragmentation.
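My mental model of that allocation, as a sketch (the real construction in the source surely differs):

// one allocation, three pointers into it:
// buffer -> [ "./testdb" '\0' | "./testdb-lck" '\0' | "<path as the caller gave it>" '\0' ]
//             ^ pathname.dxb    ^ pathname.lck        ^ pathname.specified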