On-disk file format and data structure

Larch B-tree forests are stored on the disk as follows:

  • Each forest is stored in a separate directory tree structure.
  • The file metadata is in the INI format, and stores metadata about the forest, including the identifiers of the root nodes of the trees.
  • The nodes directory contains the actual nodes.
  • The refcounts directory contains reference counts for nodes.

The metadata file

The following values are stored in the metadata file:

  • format is 1/1: node store format version 1 and node codec version 1.
  • node_size is the maximum size of an encoded node, in bytes.
  • key_size is the size of the keys, in bytes.
  • last_id is the latest allocated node identifier in the forest.
  • root_ids lists the identifiers of the root nodes of the trees.

Node files

Leaf nodes are encoded as follows:

  • the first four bytes are ‘O’, ‘R’, ‘B’, ‘L’
  • 64-bit unsigned integer giving the node identifier
  • 32-bit unsigned integer giving the number of key/value pairs
  • all keys catenated
  • lengths of values as 32-bit unsigned integers
  • all values catenated

Index nodes are encoded as follows:

  • the first four bytes are ‘O’, ‘R’, ‘B’, ‘I’
  • 64-bit unsigned integer giving the node identifier
  • 32-bit integer giving the number of keys
  • all keys catenated
  • all child ids catenated, where each id is a 64-bit unsigned integer

All integers are in big-endian order.

All root nodes are index nodes, so decoding can start knowing that the first node is an index node.

Each node is stored in a file under the nodes subdirectory, using multiple levels of directories to avoid having too many files in the same directory. The basename of the file is the node identifier in hexadecimal.

The directory levels are user adjustable, see the IdPath class for details.

Reference counts

Each node has a reference count. When the count drops to zero, the node file gets removed. Reference counts are stored in files under the refcounts subdirectory. Each file there contains up to 32768 reference counts, as a 16-bit unsigned big-endian integer. Thus, the reference count for node i is file named refcount-N where N is i modulo 32768.

A refcount file that is full of zeroes is removed. When looking up refcounts, if the file does not exist, the count is assumed to be zero. This avoids having to store refcounts for deleted nodes indefinitely.

Table Of Contents

Previous topic

Least recently used object cache

This Page