.putty P1DocsOpen Source
Related
Understanding Diffusion Models for Video Generation: Key Questions AnsweredBreaking Free from the Forking Cycle: Meta’s Strategy for Continuous WebRTC UpgradesUsing eBPF to Break Circular Dependencies in Deployments: GitHub's ApproachPython 3.13.10 Released: A Maintenance Update with 300+ Fixes7 Things You Need to Know About Git 2.54Breaking Free from the Fork: How Meta Unified WebRTC Across 50+ ApplicationsWarp Terminal Goes Open Source with AI-Powered Community Contribution ModelBehind the Scenes: Documenting the Unsung Heroes of Open Source

How Prolly Trees Enable Version-Controlled Databases

Last updated: 2026-05-02 04:30:15 · Open Source

The Foundation: B-Trees

Modern databases and filesystems rely heavily on B-trees, a self-balancing tree data structure that maintains sorted data and allows efficient insertion, deletion, and search operations. B-trees are particularly well-suited for storage systems that interact with block devices (like hard drives or SSDs) because they minimize disk reads by keeping a high branching factor. This design ensures that even large datasets can be accessed with a small number of I/O operations.

How Prolly Trees Enable Version-Controlled Databases

What Makes B-Trees Efficient

The key to a B-tree's efficiency lies in its ability to hold many keys per node, reducing tree height. For example, a typical B-tree with a branching factor of 100 can store tens of millions of entries in just three or four levels. Each node corresponds to a disk block, and reading a node requires only one disk seek. This makes B-trees a cornerstone of relational databases like MySQL, PostgreSQL, and many file systems.

Introducing Prolly Trees

While B-trees excel at current-state queries, they struggle with version control—tracking changes over time while preserving historical states. Enter Prolly trees (short for probabilistic B-trees), a variant that introduces a layer of indirection to enable efficient branching and merging of data structures. The core innovation is that nodes are identified not by fixed block addresses but by content hashes, making them immutable. When a change occurs, only the affected path from leaf to root is rewritten, and unchanged nodes are reused via hash pointers.

How Prolly Trees Differ

In a traditional B-tree, a node update modifies the block in place, destroying the previous version. A Prolly tree, by contrast, creates new nodes for every modified piece of the tree; old nodes remain untouched and accessible through stored hash pointers. This structure allows efficient diffing, merging, and branching—exactly what version control systems need. Moreover, because node boundaries are determined probabilistically (using rolling hashes of key ranges), Prolly trees naturally produce balanced splits without explicit rebalancing algorithms.

Dolt: A Version-Controlled Database

Dolt is an open-source, Apache 2.0-licensed project that marries the concepts of Git with a full SQL database. It treats an entire database like a Git repository, allowing users to commit, push, pull, and merge data just as they would code. The secret sauce behind Dolt's version control capabilities is its use of Prolly trees.

How Dolt Leverages Prolly Trees

Dolt stores the entire dataset as a single Prolly tree (or a set of trees for different tables). Each row is a key-value pair, and the tree structure maps primary keys to their corresponding values. When a user executes a UPDATE, INSERT, or DELETE statement, Dolt constructs a new Prolly tree representing the new state, reusing unchanged nodes from the prior state. This process is both space-efficient (unchanged data is shared) and time-efficient (only changed paths are recalculated). The same mechanism enables branch creation: each branch points to the root hash of its latest tree, and merging two branches involves comparing hash pointers to find common ancestors and apply changes.

Because Prolly trees support snapshot isolation, Dolt can provide transactional consistency across versions. A query at a specific commit always sees a consistent state, even if concurrent writes are occurring. This makes Dolt ideal for applications like collaborative data editing, auditing, and reproducible data pipelines.

Potential Applications Beyond Dolt

The data structure used by Dolt—Prolly trees—could inspire other projects seeking efficient version control. For instance:

  • File systems that support versioning (e.g., git-like file storage)
  • Collaborative editing platforms where multiple users modify large datasets concurrently
  • Backup and archiving systems that need to store incremental snapshots with low overhead
  • Blockchain and distributed ledger technologies that require tamper-evident historical records

The probabilistic nature of node boundaries also simplifies distributed setups: since nodes are content-addressed, they can be cached or replicated without central coordination. Any project that benefits from immutable, merkle-like trees could adopt Prolly trees as a foundation.

In summary, Prolly trees extend the proven efficiency of B-trees with the version-control capabilities of content-addressed storage. By doing so, they unlock new possibilities for data management—and Dolt shows just how powerful this combination can be.