100 Days of MDBX - part 1
2025 Feb 27
I want to read and explore more open source projects and learn from them. I picked MDBX for this exercise, as it's something we use at work in Erigon, wherein we've been building a database geared towards storing and serving archive blockchain data.
This is pretty much an unedited dump of notes taken. My blog is my weird corner of the internet, and I don't want to have a dependency on editing before publishing these notes. Each part contains 10 days of study. Here's Part 1.
day 1
got repo from erthink/libmdbx
building:
cmake - build generator, targeting build systems like make, ninja, visual studio; it generates build files from a single config
ninja - alternative to make; a small, fast build system designed for speed...parallel builds, automatic dependency handling, often used as a backend for cmake; used in large projects like Chrome and Android
amalgamated sources – combining multiple files into 1 big file...
make check gives an error – probably because you can't test on amalgamated sources (it says this). git fetch --tags --force --prune should get the complete source, with which I can do make check.
abandoning trying to run make check – it's meant for library developers + it's usually long-running tests. Even the weird "git fetch" didn't fix it. I'm running make test instead, which is quicker...and I want to use it to add or see simple tests around creating a table or playing around with values etc. Actually I'm mistaken: make test seems to run stochastic tests...which is long. stochastic.sh has a loops param, which is 2 for test or 42 for long-test.
so make test completed and took 10 minutes or so...there's also make smoke, which uses mdbx_test and is shorter.
looking at the files involved in making the mdbx_test binary...starting with test/main.c++, the main file for the binary. First thing I saw was MDBX_NORETURN.
- looking at the main() function...it parses the flags and sets up "actors" – test processes that will participate in the test concurrently.
osal-unix.c++#osal_setup(): osal stands for operating system abstraction layer...this configures the actors with synchronization mechanisms based on the OS and version:
- for System V – sets up IPC channels; otherwise use shared memory
- if it's the posix2001 or posix2008 standard, set up a shared mutex
- if posix1988, set up a semaphore – this is simpler than higher-level APIs like barriers
- windows has its own file (I'm not interested in windows at all)
day 2
looking at mdbx-go...might provide smaller context. Difficult seeing which tests to run...
PNL - page number list, a sorted array of IDs (mdbx_chk.c)
pgno_t – page number type. 32-bit page numbers, which for 4k pages gives a db of size up to 2^44 bytes...
first step: create environment mdbx_env_create(MDBX_env**)
MDBX checks if lock-free atomics are provided for 32/64-bit integers. TIL about lock-free atomics. See this
basically, locks are used to synchronize between threads – can we do better than that? Yes: lock-free atomics. They use Compare and Swap (CAS), and can be used for complicated things like structs as well. You optimistically (and atomically) load the value (which creates a local copy), update the copy, and then attempt to swap it in.
C = atomic_load(A)
_Bool atomic_compare_exchange_weak(volatile A *object, C *expected, D desired);
in C (expected) you put the initially loaded original, while D (desired) is the updated value you want to replace it with; A is the object shared across multiple threads. If atomic_compare_exchange_weak fails, you retry (so there's a while loop). CAS needs to deal with the ABA problem: while one thread is interrupted, other threads can change the value and then change it back to the original, so the interrupted thread's exchange succeeds even though the memory was modified in between. An additional tag (say, one that keeps an "update count") makes the exchange (correctly) fail in that case. Deferred reclamation – an update creates a new version, while existing readers continue reading the old value (which is not GC'ed). It's a technique that can avoid the ABA problem, but it only works in automatic-GC environments where the GC is lock-free (else the lock-free data structure depends on a locking GC, thereby making it locking too!). The CAS itself happens atomically, with hardware support – e.g. x86 implements it as the CMPXCHG instruction.
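To make the retry loop concrete, here's a minimal sketch with C++ std::atomic (same pattern as the C11 call above; "triple the value" is just an arbitrary update that no single fetch-op can express):
#include <atomic>

std::atomic<long> shared{1}; // value shared across threads

// Atomically triple `shared` using the CAS retry loop.
void triple() {
    long expected = shared.load(); // optimistic local copy
    // On failure, compare_exchange_weak refreshes `expected` with the
    // current value, so each retry computes against fresh data.
    while (!shared.compare_exchange_weak(expected, expected * 3)) {
        // retry: `expected` was updated in place by the failed CAS
    }
}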
Environment flags (mdbx.h) – mdbx_env_open sets these flags.
there's exclusive mode and cooperative mode. cooperative mode requires:
- the data file is local (not on a network share)
- the process opening the env is also local
- the OS kernel (i.e. fs and memory mapping impl) and all processes opening the given environment must be on the same physical RAM – TODO: how is this enforced?
day 3
environment - supports multiple kv tables (aka kv maps/spaces/subdbs), all residing in the same shared-memory map.
TIL: c++ unions and their use cases. It's basically a special class type where all members share the same memory location.
union MyUnion {
int i;
float f;
char str[20];
}; // All members share the same memory
// Usage:
MyUnion u;
u.i = 42; // Uses memory as int
u.f = 3.14; // Now the same memory is used as float
// (and i's value is no longer valid)
use cases: reinterpreting memory; memory optimization; tagged unions (with enums)
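A quick sketch of the tagged-union case (my own example, not from MDBX): an enum records which member is live, so readers know how to interpret the shared bytes.
struct Value {
    enum class Tag { Int, Float } tag; // which union member is live
    union {                            // anonymous union: i and f share bytes
        int   i;
        float f;
    };
};

float as_float(const Value &v) {
    // always dispatch on the tag before touching a member
    return v.tag == Value::Tag::Float ? v.f : static_cast<float>(v.i);
}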
tracks – MDBX_env structure; geometry (mdbx_env_set_geometry()); page number lists
MDBX_env structure
quick browse –
- two mmap'ed files: osal_mmap_t dxb_mmap; /* The main data file */ and osal_mmap_t lck_mmap; /* The lock file */
- subpage related config (what is subpage?)
- merge thresholds
- max readers (size of reader table)
- dbi sequence numbers
- WAF
shadow_reserve – list of malloc'ed blocks for re-use
pages – not exactly defined anywhere. Seems like a logical page, which can be of different types depending on what part of the B-tree it stores. The types are branch/leaf/large (overflow page)/meta/dupfix (MDBX_DUPFIXED records)/subp (MDBX_DUPSORT subpages) etc. (page_type_t)
readers table (ref reader_slot in layout-lck.h)
- has some slots in the readers table (default 61)
- readers don't need any lock for their data access. They simply record their txId in the reader table. The reader mutex is only needed to find an empty slot in the reader table. The slot's address is saved in thread-local storage, so subsequent read txs from the same thread need no further locking to proceed.
- with MDBX_NOSTICKYTHREADS, the slot address is not saved in thread-local storage. No reader table is used if the database is on a read-only filesystem.
- used to implement parts of MVCC – readers don't need any locking. This table helps keep track of which reader is accessing data of which old transaction. So when an old tx is no longer in use, its pages can be reclaimed by a writer.
- A write thread scans the slots of the table to determine the oldest outstanding reader tx. Any freed pages older than this will be reclaimed by the writer.
so finding the slot is a per-thread operation (and not a per-tx operation). Once the slot in the reader table is found, things like (txnid, thread_id, owner_process_id etc.) are written to this slot (reader_slot), and the address is recorded in thread-local storage (and can be accessed by other txs started in the same thread). Plus, the reader slots are aligned with the processor's cache line, to ensure there's no cache thrashing or contention when readers update their own slots.
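A conceptual sketch of the slot idea (my own toy layout, not MDBX's actual reader_slot): each reader publishes its snapshot txn id in its own cache-line-aligned slot, and the writer only ever scans.
#include <atomic>
#include <cstddef>
#include <cstdint>

struct alignas(64) ReaderSlot {          // 64 = typical cache line size
    std::atomic<uint64_t> txn_id{0};     // snapshot this reader pins (0 = free)
    std::atomic<uint64_t> thread_id{0};
    std::atomic<uint32_t> pid{0};
};

// Writer side: find the oldest pinned txn; pages freed before it are recyclable.
uint64_t oldest_reader(const ReaderSlot *slots, size_t n) {
    uint64_t oldest = UINT64_MAX;
    for (size_t i = 0; i < n; ++i) {
        uint64_t t = slots[i].txn_id.load(std::memory_order_acquire);
        if (t != 0 && t < oldest) oldest = t;
    }
    return oldest;
}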
day 4
look a bit more into MVCC
transactional memory – it's a hardware/software solution to simplify the concurrent-programming requirement of accessing a region of memory in an atomic way (by multiple readers and writers). Transactional memory provides optimistic concurrency control (no use of locks), allowing threads to run in parallel. It enforces atomicity, consistency and isolation in the region of code marked as a transaction. Since no lock mechanisms are used, transactional memory doesn't suffer from deadlocks. Transactional memory might be limited in its use case, since it needs a shared-memory abstraction. It can be implemented in hardware, but usually this is limited. More often, transactional memory is implemented at the software level, using some synchronizing hardware primitives like atomic compare-and-swaps (see lock-free atomics on day 2). Why is a lock-free solution desired? Read-write locks are known to create contention, especially between long-standing read transactions and new write transactions.
MVCC is a solution which provides transactional memory. It's non-locking by maintaining versions of the data read, and a write to a particular "slot" produces a new version. The version that each tx sees depends on the isolation level implemented. For MVCC, the most common is snapshot isolation (as in MDBX), in which the tx sees the version of data as of when the tx started. In MVCC the data items are "tagged" with either a tx timestamp or transaction id, which is used to figure out what state of the data to read. Read and write txs are thus isolated from each other and need no locking. MVCC introduces the problem of purging older data versions no longer needed. It can do so by a "stop the world" process like postgres' VACUUM FREEZE, for example. Some dbs also use undo logs to get to previous versions, but this is inefficient for write-intensive workloads.
MVCC implementation:
- chapter 15.6.1 of the Silberschatz book – talks about multiversion timestamp ordering. This ensures readers can do lockless, conflict-free reads, and writes are serializable. Although conflicts between txs are resolved via rollbacks.
- chapter 15.6.2 – talks about multiversion two-phase locking, combining the benefits of MVCC and the two-phase locking protocol.
MDBX has a more pragmatic implementation of MVCC:
- readers just record their tx ID in the reader table and can immediately access their snapshot
- writes use a single lock (rather than two-phase locking) to serialize write txs.
- version cleanup is not a stop-the-world process; rather, writers scan the reader table to reclaim pages no longer needed.
DBI - database tables; DBI sequence numbers - the subdbs are assigned monotonically increasing numbers.
src: mdbx.h
mdbx_dbi_open – open or create a subdb and return the handle. Then there's mdbx_dbi_close() to close. A bunch of settings define the subdb, e.g. (a usage sketch follows the list):
- MDBX_DB_DEFAULTS: keys are arbitrary byte strings, and compared beginning to end.
- MDBX_REVERSEKEY: arbit byte strings, compared in reverse order.
- MDBX_DUPSORT: each key can have multiple data items, stored in sorted order.
- MDBX_CREATE: create if doesn't exist.
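Putting these together, a minimal sketch of opening a table (error handling elided; the table name is made up):
#include "mdbx.h"

MDBX_dbi open_values_table(MDBX_env *env) {
    MDBX_txn *txn = nullptr;
    mdbx_txn_begin(env, /*parent=*/nullptr, MDBX_TXN_READWRITE, &txn);
    MDBX_dbi dbi = 0;
    // create if missing; allow multiple sorted values per key
    mdbx_dbi_open(txn, "values", MDBX_CREATE | MDBX_DUPSORT, &dbi);
    mdbx_txn_commit(txn); // a freshly created dbi handle only becomes
                          // visible to other txs after this commit
    return dbi;
}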
day 5
- would be nice to start a survey of mdbx's apis...and conceptualise that.
environments
mdbx_env_create() // create an environment (collection of subdbs, residing in same shared memory map)
mdbx_env_open()
mdbx_env_close()
these create a lock file (LCK-file) and storage file (aka DXB-file).
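A minimal lifecycle sketch (path and file mode are arbitrary; error handling elided):
#include "mdbx.h"

int main() {
    MDBX_env *env = nullptr;
    mdbx_env_create(&env); // allocate the handle
    // first open creates the DXB- and LCK-files under ./testdb
    mdbx_env_open(env, "./testdb", MDBX_ENV_DEFAULTS, 0644);
    // ... run transactions here ...
    mdbx_env_close(env);
    return 0;
}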
transactions
- these provide a consistent (snapshot) view of the data, opened by mdbx_txn_begin().
- they can be read-only or read-write. Read-write txs can be nested.
- each tx must be used by one thread at a time.
mdbx_txn_begin(MDBX_env*, MDBX_txn *parent, MDBX_txn_flags_t, MDBX_txn **txn)
- parent is for nested tx. txs may nest to any level
- a parent tx and its cursors may not issue any operations other than commit or abort while it has active child txs.
txn – address where the new MDBX_txn handle will be stored.
- corresponding calls: mdbx_txn_begin() and mdbx_txn_abort()
mdbx_txn_set_userctx(MDBX_txn *txn, void *ctx)
- associates some application context with the tx
- can be retrieved by mdbx_txn_get_userctx()
- to start a tx with an associated context, use mdbx_txn_begin_ex()
mdbx_txn_break()
- "break" a tx – means don't allow any further operation
- a broken tx must be aborted explicitly later.
mdbx_txn_commit_ex()
- commit a tx, and return latency information
mdbx_txn_env() – returns the tx's env
mdbx_txn_park(MDBX_txn*, bool autounpark)
- long reads are a problem, since write txs cannot recycle the older snapshot version pages.
- "parking" a tx allows it to be evicted by a write tx if it interferes with garbage recycling (of the old MVCC data snapshot). If eviction doesn't occur, the recovery (transition to working state and execution continuation) is cheaper.
- A parked tx can be unparked by mdbx_txn_unpark(), or by using autounpark=true, which automatically unparks when the parked txn is used in apis that involve reading data.
mdbx_txn_reset()
- reset a read-only tx.
- It basically aborts the tx like mdbx_txn_abort() but keeps the transaction handle
- so when mdbx_txn_renew() is called, the same handle can be reused. This saves allocation overhead if the process will start a new read-only tx soon, and also locking overhead if MDBX_NOSTICKYTHREADS is in use.
mdbx_txn_renew()
- acquires a new reader lock for a txn handle that was released by mdbx_txn_reset().
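The reset/renew pattern in a sketch (env comes from the earlier snippets; error handling elided):
MDBX_txn *rtxn = nullptr;
mdbx_txn_begin(env, nullptr, MDBX_TXN_RDONLY, &rtxn);
// ... reads against the pinned snapshot ...
mdbx_txn_reset(rtxn);  // drop the snapshot, keep the handle

// later: reuse the same handle – no allocation, fresh snapshot
mdbx_txn_renew(rtxn);
// ... more reads ...
mdbx_txn_abort(rtxn);  // finally dispose of the handle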
MVCC
a question around MVCC - create a read-write tx, do some writes, and then create a nested read tx. Does the read tx see the new values? Or are the values only visible after a commit?
- if the read tx is started independently of the rwtx, it should not see the updated values if the rwtx didn't commit.
- but if the read tx is nested inside the rwtx, it should see the updated values.
It would be interesting to see later on how this is done.
Also, MDBX_NOSTICKYTHREADS is weird – it tells MDBX to decouple txns from threads as much as possible. So a txn can be called from multiple threads in this case? – yes, the API functions don't check whether the tx and the current thread match. Txs and cursors can be called from any thread. But this also means that errors arising from simultaneous use of txs/cursors in different threads are hard to detect.
- however "simultaneous use of API objects is still not allowed" – what does this mean?
- begin/abort/commit – must be done on the same thread.
- some ops must be on the same thread – a restriction arising from most OSes' synchronization primitives requiring that they complete strictly on the same thread where they started.
- read txs stop using thread-local storage (TLS) to store the reader table slot's address when MDBX_NOSTICKYTHREADS is on.
database/dbi/subdb
tables api
- this is a kv space inside the environment
- to create a database, create a tx, and then use mdbx_dbi_open(MDBX_txn *txn, const char *name, MDBX_db_flags_t flags, MDBX_dbi *dbi). The flags are mentioned here. dbi is where the newly created dbi handle will be stored.
- MDBX vs LMDB: in MDBX, handles opened for existing tables are immediately made available to other txs, regardless of whether this tx is aborted or reset.
- However, a newly created dbi handle will be unavailable till the tx commits.
mdbx_dbi_open2: open or create
mdbx_dbi_open_ex(): create or open a named table in the environment with custom comparison functions (for keys and values)
mdbx_dbi_rename(): renames a table given a dbi handle
day 6
CRUD
mdbx_canary:
- 4 integer markers associated with the env.
- x, y and z are set by the mdbx_canary_put args, while v is always set to the tx number
mdbx_cmp_func:
- custom comparison function for keys in a table.
- not recommended: mdbx_chk and mdbx_load don't work well with custom comparators
mdbx_put flags
- for mdbx_put(), mdbx_cursor_put(), mdbx_replace() functions
cursors
mdbx_cursor_count(): gives the count for the current key (only makes sense in a MDBX_DUPSORT table)
mdbx_cursor_count_ex(): equivalent of above with additional data for nested b-tree page statistics of current key
mdbx_cursor_delete(MDBX_cursor*, MDBX_put_flags_t)
- flags must be either:
- MDBX_CURRENT: delete single entry at current cursor pos
- MDBX_ALLDUPS or MDBX_NODUPDATA: delete all key-value entries for the current key (for DUPSORT tables)
- MDBX_ALLDUPS is a put flag, and so is used when inserting too. There it replaces all values (corresponding to current key) with the input value
mdbx_cursor_get(MDBX_cursor*, MDBX_val *key, MDBX_val *data, MDBX_cursor_op)
- gets the key's value and data into the provided params. Both are database-owned (so copy if you want to modify them; definitely don't dispose of them)
- it also has seek abilities via the cursor_op. So it can do MDBX_LAST or MDBX_GET_CURRENT etc.
mdbx_cursor_get_batch() – batch get for non-dupsort tables. For dupsort tables use the above api with a specific flag (MDBX_GET_MULTIPLE)
mdbx_cursor_put(cursor, key, value, MDBX_put_flags_t)
- different options for flags to control the nature of the put – e.g. should this fail if the current cursor pos doesn't match the provided key? MDBX_APPEND allows faster bulk loading; MDBX_RESERVE just returns reserved space (and doesn't put the value)...useful if the value to be put is generated later; this saves an extra memcpy.
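For example, the MDBX_RESERVE flow might look like this (cursor from an open table; the key and size are made up):
MDBX_val key{(void *)"k1", 2};
MDBX_val val{nullptr, 128}; // only the length matters on the way in
if (mdbx_cursor_put(cursor, &key, &val, MDBX_RESERVE) == MDBX_SUCCESS) {
    // val.iov_base now points at 128 reserved bytes inside the db;
    // generate the value directly in place, skipping one memcpy
    memset(val.iov_base, 0, val.iov_len);
}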
day 7
mdbx_cursor_scan(MDBX_cursor*, predicate_fn, *context, MDBX_cursor_op start_op, MDBX_cursor_op turn_op, void *arg)
- start_op – MDBX_cursor_op used to position the cursor at the start of the scan (MDBX_FIRST/MDBX_LAST/MDBX_GET_CURRENT etc.)
- turn_op – cursor positioning operation to goto the next element. e.g. MDBX_NEXT, MDBX_NEXT_DUP, MDBX_PREV etc.
- arg – external argument for predicate_fn entirely prepared by user.
This is more efficient than manually looping over the elements because it saves the accumulated costs of crossing the DSO (dynamic shared object) boundary (the manual loop is sketched below). mdbx-go doesn't provide an api for this...might be interesting to look into, as scanning is a common operation.
a similar function is mdbx_cursor_scan_from, where you can provide a starting key-value pair for cursor positioning.
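For reference, the manual loop that mdbx_cursor_scan replaces – one library call (one DSO crossing) per element:
MDBX_val key{}, val{};
int rc = mdbx_cursor_get(cursor, &key, &val, MDBX_FIRST);
while (rc == MDBX_SUCCESS) {
    // ... inspect key/val (db-owned memory – don't modify or free) ...
    rc = mdbx_cursor_get(cursor, &key, &val, MDBX_NEXT);
}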
sequences
mdbx_dbi_sequence(txn, dbi, uint64 *result, uint64 increment)
- sequence generation for tables. Each table has a sequence value. To read it, use increment=0; increment>0 increases the sequence. This can be used for PUT operations where monotonically increasing integers are keys, and/or when ranges of integers have to be "reserved" for bulk puts.
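A sketch of reserving a range of ids with it (txn/dbi from the earlier snippets):
uint64_t first = 0;
// returns the pre-increment sequence value in `first`, then adds 10;
// ids first .. first+9 are now reserved for this bulk insert
int rc = mdbx_dbi_sequence(txn, dbi, &first, 10);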
mdbx_dcmp – compare two data items according to a particular table
- MDBX has cmp functions for both keys and values
- mdbx_get_keycmp() and mdbx_get_datacmp(), for example
- the corresponding function for keys is mdbx_cmp
mdbx_del(key, data) – delete items from a table
- the data param is needed regardless of whether it's a dupsort table or not
- if data is non-NULL, only the matching data item will be deleted. If it's NULL, any/all value(s) for the specified key will be deleted.
mdbx_drop – empty/delete/close the table
mdbx_get(MDBX_txn*, MDBX_dbi, MDBX_val *key, MDBX_val *data)
- get items from a table
- if it's a dupsort table, it returns the first data item for the key; to get all values, mdbx_cursor_get() is required
- the value returned (in data) is owned by the table and must not be released or modified. Modifying it in an env opened without MDBX_WRITEMAP (which maps data into memory with write permissions) will cause SIGSEGV
- a similar function is mdbx_get_ex(), which optionally returns the number of data elements for the key
mdbx_get_equal_or_great()
- returns the key-value pair equal to or greater than the given key (per the comparison function), not necessarily exactly matching it
- for dupsort tables it will also match over multi-values/duplicates.
- updates both the key and data params, pointing them to the actual kv pair in the table
mdbx_put(txn, dbi, key, data, flags)
- bunch of flags like MDBX_NODUPDATA (only insert if the key-value pair is not already present; only for dupsort tables), MDBX_APPEND, MDBX_RESERVE
mdbx_replace(txn, dbi, key, *new_data, *old_data, flags)
- replace items in table, return the previous value
cursors
cursor: an opaque structure for navigating through a table. You bind the cursor to a txn and dbi (kv table), and then use it to traverse the table.
mdbx_cursor_create(void *context)
- creates the cursor, but txn and dbi need to be bound in a separate call (mdbx_cursor_bind)
- the user context can be retrieved by an mdbx_cursor_get_userctx call
- to create + bind in a single call, use mdbx_cursor_open
mdbx_cursor_copy(src, dest) – copy cursor position and state
bunch of basic queries about the cursor's position:
mdbx_cursor_eof() – does the cursor point to the end of the table?
mdbx_cursor_on_first() – pointed to the first kv pair?
mdbx_cursor_on_first_dup() – pointed to the first or sole multi-value of the corresponding key?
- similarly for last
mdbx_cursor_renew(txn, cursor)
- renew cursor for use by provided txn
- same as calling mdbx_cursor_bind with the same dbi and the new txn.
- the corresponding function is actually mdbx_cursor_unbind(), which dissociates the txn but keeps the dbi internally.
mdbx_cursor_reset(cursor)
- resets the cursor state. The previous position can't be used; the cursor must first be positioned (independently of the previous position) before using it again
mdbx_txn_release_all_cursors(txn, bool unbind)
- unbind or close all cursors of a given txn
day 8
I feel bored surveying the apis. I'm skipping surveying the settings section. I want to write some code.
I can't do this in golang, since I'm interested in internal implementation details. So I need to find a way to run/debug C code? Yes, C is better; C++ debugging is a nightmare. Maybe I don't have that option...C and C++ code exist together, and I don't know what factor decides which code is in C and which in C++. Maybe just the availability of STL libraries and other additional functionality like synchronization mechanisms.
cmake – build system generator (typically for c/c++ projects, but it can also work for other languages). I now want to write some unit tests based on libmdbx, and I'll use libmdbx as a submodule, as recommended by the documentation.
project is here
most of the time is spent on figuring out CMakeLists.txt...
- libmdbx imported as submodule
- googletest for testing c++ code
- link mdbx-static (defined in libmdbx's CMakeLists/makefile) using target_link_libraries. It links your executable with the libraries defined in it.
- target_include_directories – sort of like the -I flag in the clang compiler, telling it where to find the header files.
With this I can write some tests:
#include <cstdio>
#include <gtest/gtest.h>
#include "libmdbx/mdbx.h"
TEST(EnvTests, OpenEnv) {
  MDBX_env *env = NULL;
  int rc = mdbx_env_create(&env);
  ASSERT_EQ(rc, 0);
  ASSERT_NE(env, nullptr);
  mdbx_env_close(env);  // release the handle so the test doesn't leak it
}
I created a Makefile, which can build/test the code: make build and make test; can also make cmake to build off CMakeLists.txt. Exposed a debug build version: make cmake-dbg.
day 9
well, looking at sync modes today. Got to revise the section "Controlling Kernel Buffering for File I/O" from TLPI. This is lower level, but still relevant to the MDBX sync modes.
- synchronized i/o data integrity (SIDI) – concerned with subsequent read operations reading the correct data. So fdatasync(fd), ensuring synchronized i/o data integrity, syncs the data pages of the file plus only the associated metadata needed for subsequent reads (like file size)
- synchronized i/o file integrity (SIFI) – syncs the data as well as all metadata associated with the file.
The difference is that SIFI needs 2 sync operations, while SIDI might just need 1 – e.g. if the file size doesn't change, SIDI can get by with 1 call. This is crucial in db applications.
So that's what the synchronization mechanisms linux provides. In MDBX, there are various sync modes. The "data" is the MVCC snapshots of the B-tree; each snapshot has a separate root. The pointer to this root, along with the corresponding transaction number, is stored separately and is the metadata.
So durability considerations are two-fold: one is the pages of the B-tree itself; the second is the metadata page. Syncing each of these does an fdatasync(fd). The MDBX sync modes play with when and whether the data/metadata is synchronized.
- MDBX_SYNC_DURABLE: first data is synced to disk; then metadata is synced to disk. Most durable sync mode.
- MDBX_NOMETASYNC: data is synced after commit, but the metadata is not. Disk sync of the metadata is deferred until the system chooses to sync, or until explicit MDBX calls like mdbx_env_sync()
- MDBX_SAFE_NOSYNC/MDBX_UTTERLY_NOSYNC: even weaker durability guarantees.
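In code, these modes are just env flags – a sketch (path/mode as in the earlier snippets):
// at open time:
mdbx_env_open(env, "./testdb", MDBX_ENV_DEFAULTS | MDBX_NOMETASYNC, 0644);

// or toggled at runtime, since MDBX_NOMETASYNC is a changeable flag (see below):
mdbx_env_set_flags(env, MDBX_NOMETASYNC, /*onoff=*/true);

// force a durable sync point on demand:
mdbx_env_sync(env);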
env_flags:
- only a subset of mdbx_env flags can be changed at runtime. Others need closing the env and re-opening it with the changed flags.
ENV_CHANGEABLE_FLAGS = MDBX_SAFE_NOSYNC | MDBX_NOMETASYNC | DEPRECATED_MAPASYNC | MDBX_NOMEMINIT | DEPRECATED_COALESCE | MDBX_PAGEPERTURB | MDBX_ACCEDE | MDBX_VALIDATION
ENV_CHANGELESS_FLAGS = MDBX_NOSUBDIR | MDBX_RDONLY | MDBX_WRITEMAP | MDBX_NOSTICKYTHREADS | MDBX_NORDAHEAD | MDBX_LIFORECLAIM | MDBX_EXCLUSIVE
ENV_USABLE_FLAGS = ENV_CHANGEABLE_FLAGS | ENV_CHANGELESS_FLAGS
apart from this, I configured vscode with C++ TestMate to make it detect and run GoogleTest. VSCode's "testing" functionality is quite good. By default vscode's cmake used ninja; I changed it to work with makefiles (the "Unix Makefiles" generator), since that's what I've been using when running things from the cli. I had some trouble getting recompilation to happen on test runs, but after the above configuration it was okay.
day 10
Today let's look at what mechanisms are behind mdbx_env_open and mdbx_env_create, debugging through a test.
mdbx_env_open()
- mdbx_env_set_flags() might set some flags for the env; these are combined with the flags provided in mdbx_env_open()
- irrelevant flags (like MDBX_WRITEMAP/MDBX_NOMETASYNC) are ignored when the env is opened in read-only mode (MDBX_RDONLY)
- env_handle_pathname: basic checks like file existence/permissions, then sets entries in the env->pathname struct for the lck file and dxb file.
pathname is represented interestingly:
struct { /* path to the DB files */
pathchar_t *lck, *dxb, *specified;
void *buffer;
} pathname;
buffer contains the combined (back-to-back) values of the dxb file, lck file and folder (called "specified") paths. lck, dxb and specified then point to the appropriate sections in buffer. This is a weird way to do it; I think it might just be to get one contiguous section of memory for storing these 3 values, instead of 3 allocations in different places. It feels like an over-optimization, but for a project that cares deeply about memory, with the possibility of OOM errors, it makes sense to reduce fragmentation.