Repository: spacejam/sled
Branch: main
Commit: 05b42c17ea14
Files: 63
Total size: 521.3 KB

Directory structure:
gitextract_hgnzatql/

├── .github/
│   ├── FUNDING.yml
│   ├── ISSUE_TEMPLATE/
│   │   ├── blank_issue.md
│   │   ├── bugs.md
│   │   ├── config.yml
│   │   └── feature_request.md
│   ├── dependabot.yml
│   └── workflows/
│       └── test.yml
├── .gitignore
├── .rustfmt.toml
├── ARCHITECTURE.md
├── CHANGELOG.md
├── CONTRIBUTING.md
├── Cargo.toml
├── LICENSE-APACHE
├── LICENSE-MIT
├── README.md
├── RELEASE_CHECKLIST.md
├── SAFETY.md
├── SECURITY.md
├── art/
│   └── CREDITS
├── code-of-conduct.md
├── examples/
│   └── bench.rs
├── fuzz/
│   ├── .gitignore
│   ├── Cargo.toml
│   └── fuzz_targets/
│       └── fuzz_model.rs
├── scripts/
│   ├── cgtest.sh
│   ├── cross_compile.sh
│   ├── execution_explorer.py
│   ├── instructions
│   ├── sanitizers.sh
│   ├── shufnice.sh
│   └── ubuntu_bench
├── src/
│   ├── alloc.rs
│   ├── block_checker.rs
│   ├── config.rs
│   ├── db.rs
│   ├── event_verifier.rs
│   ├── flush_epoch.rs
│   ├── heap.rs
│   ├── id_allocator.rs
│   ├── leaf.rs
│   ├── lib.rs
│   ├── metadata_store.rs
│   ├── object_cache.rs
│   ├── object_location_mapper.rs
│   └── tree.rs
└── tests/
    ├── 00_regression.rs
    ├── common/
    │   └── mod.rs
    ├── concurrent_batch_atomicity.rs
    ├── crash_tests/
    │   ├── crash_batches.rs
    │   ├── crash_heap.rs
    │   ├── crash_iter.rs
    │   ├── crash_metadata_store.rs
    │   ├── crash_object_cache.rs
    │   ├── crash_sequential_writes.rs
    │   ├── crash_tx.rs
    │   └── mod.rs
    ├── test_crash_recovery.rs
    ├── test_quiescent.rs
    ├── test_space_leaks.rs
    ├── test_tree.rs
    ├── test_tree_failpoints.rs
    └── tree/
        └── mod.rs

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/FUNDING.yml
================================================
# These are supported funding model platforms

github: spacejam # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
patreon: # Replace with a single Patreon username
open_collective: # Replace with a single Open Collective username
ko_fi: # Replace with a single Ko-fi username
tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
liberapay: # Replace with a single Liberapay username
issuehunt: # Replace with a single IssueHunt username
otechie: # Replace with a single Otechie username
custom: # Replace with up to 4 custom sponsorship URLs e.g., ['link1', 'link2']


================================================
FILE: .github/ISSUE_TEMPLATE/blank_issue.md
================================================
---
name: Blank Issue (do not use this for bug reports or feature requests)
about: Create an issue with a blank template.
---


================================================
FILE: .github/ISSUE_TEMPLATE/bugs.md
================================================
---
name: Bug Report
about: Report a correctness issue or violated expectation
labels: bug
---

Bug reports must include all of the following items:

1. expected result
1. actual result
1. sled version
1. rustc version
1. operating system
1. minimal code sample that helps to reproduce the issue
1. logs, panic messages, stack traces

Incomplete bug reports will be closed.

Do not open bug reports for documentation issues. Please just open a PR with the proposed documentation change.

Thank you for understanding :)


================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
blank_issues_enabled: true
contact_links:
  - name: sled discord
    url: https://discord.gg/Z6VsXds
    about: Please ask questions in the discord server here.


================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.md
================================================
---
name: Feature Request
about: Request a feature for sled
labels: feature
---

#### Use Case:

#### Proposed Change:

#### Who Benefits From The Change(s)?

#### Alternative Approaches


================================================
FILE: .github/dependabot.yml
================================================
version: 2
updates:
- package-ecosystem: cargo
  directory: "/"
  schedule:
    interval: daily
    time: "10:00"
  open-pull-requests-limit: 10
  ignore:
  - dependency-name: crdts
    versions:
    - ">= 2.a, < 3"
  - dependency-name: zerocopy
    versions:
    - 0.4.0


================================================
FILE: .github/workflows/test.yml
================================================
name: Rust

on:
  pull_request:
    branches:
    - main

jobs:
  clippy_check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - uses: actions-rs/toolchain@v1
        with:
            toolchain: nightly
            components: clippy
            override: true
      - run: rustup component add clippy
      - uses: actions-rs/clippy-check@v1
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          args: --all-features
  default:
    name: Cargo Test on ${{ matrix.os }}
    env:
      RUST_BACKTRACE: 1
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
    steps:
    - uses: actions/checkout@v1
    - name: Cache target
      uses: actions/cache@v2
      env:
        cache-name: cache-default-target-and-lockfile
      with:
        path: |
          target
          Cargo.lock
          ~/.rustup
        key: ${{ runner.os }}-${{ env.cache-name }}-${{ hashFiles('**/Cargo.toml') }}
    - name: linux coredump setup
      if: ${{ runner.os == 'linux' }}
      run: |
        ulimit -c unlimited
        echo "$PWD/core-dumps/corefile-%e-%p-%t" | sudo tee /proc/sys/kernel/core_pattern
        mkdir core-dumps
    - name: cargo test
      run: |
        rustup update --no-self-update
        cargo test --release --no-default-features --features=for-internal-testing-only -- --nocapture
    - uses: actions/upload-artifact@v4
      if: ${{ failure() && runner.os == 'linux' }}
      with:
        name: linux-core-dumps
        path: |
          ./core-dumps/*
          ./target/release/deps/test_*
  examples:
    name: Example Tests
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v1
    - name: Cache target
      uses: actions/cache@v2
      env:
        cache-name: cache-examples-target-and-lockfile
      with:
        path: |
          target
          Cargo.lock
          ~/.rustup
        key: ${{ runner.os }}-${{ env.cache-name }}-${{ hashFiles('**/Cargo.toml') }}
    - name: example tests
      run: |
        rustup update --no-self-update
        cargo run --example playground
        cargo run --example structured
  cross-compile:
    name: Cross Compile
    runs-on: macos-latest
    steps:
    - uses: actions/checkout@v1
    - name: cross compile
      run: |
        set -eo pipefail
        echo "cross build"
        scripts/cross_compile.sh
  burn-in:
    name: Burn In
    env:
      RUST_BACKTRACE: 1
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v1
    - name: Cache target
      uses: actions/cache@v2
      env:
        cache-name: cache-stress2-asan-target-and-lockfile
      with:
        path: |
          benchmarks/stress2/target
          benchmarks/stress2/Cargo.lock
          ~/.rustup
        key: ${{ runner.os }}-${{ env.cache-name }}-${{ hashFiles('**/Cargo.toml') }}
    - name: burn in
      run: |
        set -eo pipefail
        pushd benchmarks/stress2
        ulimit -c unlimited
        echo "$PWD/core-dumps/corefile-%e-%p-%t" | sudo tee /proc/sys/kernel/core_pattern
        mkdir core-dumps
        rustup toolchain install nightly
        rustup toolchain install nightly --component rust-src
        rustup update
        rm -rf default.sled || true
        export RUSTFLAGS="-Z sanitizer=address"
        export ASAN_OPTIONS="detect_odr_violation=0"
        cargo +nightly build --release --target x86_64-unknown-linux-gnu
        target/x86_64-unknown-linux-gnu/release/stress2 --duration=240
        rm -rf default.sled
    - name: print backtraces with gdb
      if: ${{ failure() }}
      run: |
        sudo apt-get update
        sudo apt-get install gdb
        pushd benchmarks/stress2
        echo "first backtrace:"
        gdb target/release/stress2 core-dumps/* -batch -ex 'bt -frame-info source-and-location'
        echo ""
        echo ""
        echo ""
        echo "all backtraces:"
        gdb target/release/stress2 core-dumps/* -batch -ex 't a a bt -frame-info source-and-location'
    - uses: actions/upload-artifact@v4
      if: ${{ failure() }}
      with:
        name: linux-core-dumps
        path: |
          ./benchmarks/stress2/core-dumps/*
          ./benchmarks/stress2/target/release/stress2
  sanitizers:
    name: Sanitizers
    env:
      RUST_BACKTRACE: 1
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v1
    - name: Cache rustup
      uses: actions/cache@v2
      env:
        cache-name: cache-sanitizers-target-and-lockfile
      with:
        path: |
          ~/.rustup
          benchmarks/stress2/target
          benchmarks/stress2/Cargo.lock
        key: ${{ runner.os }}-${{ env.cache-name }}-${{ hashFiles('**/Cargo.toml') }}
    - name: sanitizers
      run: |
        set -eo pipefail
        scripts/sanitizers.sh


================================================
FILE: .gitignore
================================================
CLAUDE.md
fuzz-*.log
default.sled
timing_test*
*db
crash_test_files
*conf
*snap.*
*grind.out*
vgcore*
*.bk
*orig
tags
perf*
*folded
*out
*perf
*svg
*txt
experiments
target
Cargo.lock
*swp
*swo
*.proptest-regressions
corpus
artifacts
.idea
cargo-timing*


================================================
FILE: .rustfmt.toml
================================================
version = "Two"
use_small_heuristics = "Max"
reorder_imports = true
max_width = 80
wrap_comments = true
combine_control_expr = true
report_todo = "Always"


================================================
FILE: ARCHITECTURE.md
================================================
<table style="width:100%">
<tr>
  <td>
    <table style="width:100%">
      <tr>
        <td> key </td>
        <td> value </td>
      </tr>
      <tr>
        <td><a href="https://github.com/sponsors/spacejam">buy a coffee for us to convert into databases</a></td>
        <td><a href="https://github.com/sponsors/spacejam"><img src="https://img.shields.io/github/sponsors/spacejam"></a></td>
      </tr>
      <tr>
        <td><a href="https://docs.rs/sled">documentation</a></td>
        <td><a href="https://docs.rs/sled"><img src="https://docs.rs/sled/badge.svg"></a></td>
      </tr>
      <tr>
        <td><a href="https://discord.gg/Z6VsXds">chat about databases with us</a></td>
        <td><a href="https://discord.gg/Z6VsXds"><img src="https://img.shields.io/discord/509773073294295082.svg?logo=discord"></a></td>
      </tr>
     </table>
  </td>
  <td>
<p align="center">
  <img src="https://raw.githubusercontent.com/spacejam/sled/main/art/tree_face_anti-transphobia.png" width="40%" height="auto" />
  </p>
  </td>
 </tr>
</table>

# sled 1.0 architecture

## in-memory

* Lock-free B+ tree index, extracted into the [`concurrent-map`](https://github.com/komora-io/concurrent-map) crate.
* The lowest key from each leaf is stored in this in-memory index.
* To read any leaf that is not already cached in memory, at most one disk read will be required.
* RwLock-backed leaves, using the ArcRwLock from the [`parking_lot`](https://github.com/Amanieu/parking_lot) crate. As a `Db` grows, leaf contention tends to go down in most use cases. But this may be revisited over time if many users have issues with RwLock-related contention. Avoiding full RCU for updates on the leaves results in many of the performance benefits over sled 0.34, with significantly lower memory pressure.
* A simple but very high performance epoch-based reclamation technique is used for safely deferring frees of in-memory index data and reuse of on-disk heap slots, extracted into the [`ebr`](https://github.com/komora-io/ebr) crate.
* A scan-resistant LRU is used for handling eviction. By default, 20% of the cache is reserved for leaves that are accessed at most once. This is configurable via `Config.entry_cache_percent`. This is handled by the extracted [`cache-advisor`](https://github.com/komora-io/cache-advisor) crate. The overall cache size is set by the `Config.cache_size` configurable.
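
The low-key index described above can be illustrated with a minimal sketch. It uses a standard-library `BTreeMap` as a stand-in for the lock-free `concurrent-map` index; the `LeafId` type and the `leaf_for_key` function are hypothetical names for illustration and are not part of sled's API.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Hypothetical identifier for a leaf; not a sled type.
type LeafId = u64;

/// Finds the leaf responsible for `key`: the entry with the greatest
/// low key that is less than or equal to `key`.
fn leaf_for_key(index: &BTreeMap<Vec<u8>, LeafId>, key: &[u8]) -> Option<LeafId> {
    // Explicit bounds over unsized `[u8]` keys: everything up to and including `key`.
    let bounds: (Bound<&[u8]>, Bound<&[u8]>) = (Bound::Unbounded, Bound::Included(key));
    index
        .range::<[u8], _>(bounds)
        .next_back()
        .map(|(_low_key, id)| *id)
}
```

If the index always contains a leaf whose low key is the empty byte string, every lookup lands on exactly one leaf, which is consistent with needing at most one disk read for an uncached leaf.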

## write path

* This is where things get interesting. There is no traditional WAL. There is no LSM. Only metadata is logged atomically after objects are written in parallel.
* The important guarantees are:
  * all previous writes are durable after a call to `Db::flush` (This is also called periodically in the background by a flusher thread)
  * all write batches written using `Db::apply_batch` are either 100% visible or 0% visible after crash recovery. If a batch was followed by a flush that returned `Ok(())`, it is guaranteed to be present.
* Atomic ([linearizable](https://jepsen.io/consistency/models/linearizable)) durability is provided by marking dirty leaves as participants in "flush epochs" and performing atomic batch writes of the full epoch at a time, in order. Each call to `Db::flush` advances the current flush epoch by 1.
* The atomic write consists of the following steps:
  1. User code or the background flusher thread calls `Db::flush`.
  1. In parallel (via [rayon](https://docs.rs/rayon)) serialize and compress each dirty leaf with zstd (configurable via `Config.zstd_compression_level`).
  1. Based on the size of the bytes for each object, choose the smallest heap file slot that can hold the full set of bytes. This is an on-disk slab allocator.
  1. Slab slots are not power-of-two sized, but tend to increase in size by around 20% from one to the next, resulting in far lower fragmentation than typical page-oriented heaps with either constant-size or power-of-two sized leaves.
  1. Write the object to the allocated slot from the rayon threadpool.
  1. After all writes, fsync the heap files that were written to.
  1. If any writes were written to the end of the heap file, causing it to grow, fsync the directory that stores all heap files.
  1. After the writes are stable, it is now safe to write an atomic metadata batch that records the location of each written leaf in the heap. This is a simple framed batch of `(low_key, slab_slot)` tuples that are initially written to a log, but eventually merged into a simple snapshot file for the metadata store once the log becomes larger than the snapshot file.
  1. Fsync the metadata log file.
  1. Fsync the metadata log directory.
  1. After the atomic metadata batch write, the previously occupied slab slots are marked for future reuse with the epoch-based reclamation system. After all threads that may have witnessed the previous location have finished their work, the slab slot is added to the free `BinaryHeap` of the slot that it belongs to so that it may be reused in future atomic write batches.
  1. Return `Ok(())` to the caller of `Db::flush`.
* Writing objects before the metadata write is random, but modern SSDs handle this well. Even though the SSD's FTL will be working harder to defragment things periodically than if we wrote a few megabytes sequentially with each write, the data that the FTL will be copying will be mostly live due to the eager leaf write-backs.
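
The slab slot selection step can be sketched as follows. This is a rough illustration only: the starting size, the exact 20% growth rule, and the names `slot_sizes` and `smallest_fitting_slot` are assumptions made for the example, not sled's actual constants or functions.

```rust
/// Builds a hypothetical table of slot sizes in which each size is about
/// 20% larger than the previous one, stopping once `max` is covered.
fn slot_sizes(min: u64, max: u64) -> Vec<u64> {
    let mut sizes = vec![min];
    let mut last = min;
    while last < max {
        // Grow by roughly 20%, always advancing by at least one byte.
        let next = (last + last / 5).max(last + 1);
        sizes.push(next);
        last = next;
    }
    sizes
}

/// Picks the smallest slot size that can hold `len` bytes, if any.
fn smallest_fitting_slot(sizes: &[u64], len: u64) -> Option<u64> {
    // `sizes` is sorted ascending, so the first size >= `len` is the smallest fit.
    let idx = sizes.partition_point(|&size| size < len);
    sizes.get(idx).copied()
}
```

With this growth rule an object wastes at most about a fifth of its slot, whereas power-of-two size classes can waste close to half.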

## recovery

* Recovery involves simply reading the atomic metadata store that records the low key for each written leaf as well as its location and mapping it into the in-memory index. Any gaps in the slabs are then used as free slots.
* Any write that failed to complete its entire atomic writebatch is treated as if it never happened, because no user-visible flush ever returned successfully.
* Rayon is also used here for parallelizing reads of this metadata. In general, this is extremely fast compared to the previous sled recovery process.
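
How gaps in a slab become free slots can be sketched as below. The sketch assumes each slab is addressed by a dense range of slot indexes starting at 0; the function name `free_slots` is hypothetical and the real recovery code may track this differently.

```rust
use std::collections::BTreeSet;

/// Given the slot indexes that recovered metadata reports as occupied in one
/// slab, returns every unoccupied index below the highest occupied one.
fn free_slots(occupied: &BTreeSet<u64>) -> Vec<u64> {
    // An empty slab has no gaps to reclaim.
    let Some(&max) = occupied.iter().next_back() else {
        return Vec::new();
    };
    (0..max).filter(|slot| !occupied.contains(slot)).collect()
}
```

Slots past the highest occupied index are not listed because, under this assumption, they correspond to space beyond the data that recovery found.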

## tuning

* The larger the `LEAF_FANOUT` const generic on the high-level `Db` struct (default `1024`), the smaller the in-memory leaf index and the better the compression ratio of the on-disk file, but the more expensive it will be to read the entire leaf off of disk and decompress it.
* You can choose to turn the `LEAF_FANOUT` relatively low to make the system behave more like an Index+Log architecture, but overall disk size will grow and write performance will decrease.
* NB: changing `LEAF_FANOUT` after writing data is not supported.
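
The effect of `LEAF_FANOUT` on index size can be estimated with simple arithmetic. The sketch below assumes each leaf holds at most `LEAF_FANOUT` items and that the index stores one entry per leaf; `min_index_entries` is a hypothetical helper, and the result is only a lower bound because real leaves are often partially full.

```rust
/// Lower bound on the number of in-memory index entries (one per leaf)
/// needed for `n_keys` items when a leaf holds at most `LEAF_FANOUT` items.
const fn min_index_entries<const LEAF_FANOUT: usize>(n_keys: usize) -> usize {
    // Round up: a partially filled final leaf still needs its own entry.
    n_keys.div_ceil(LEAF_FANOUT)
}
```

For one million keys this gives at least 977 index entries at a fanout of 1024 and at least 15625 at a fanout of 64.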


================================================
FILE: CHANGELOG.md
================================================
# Unreleased

## New Features

* #1178 batches and transactions are now unified for subscribers.
* #1231 `Tree::get_zero_copy` allows for reading a value directly
  in-place without making an `IVec` first.
* #1250 the global `print_profile` function has been added
  which is enabled when compiling with the `metrics` feature.
* #1254 `IVec` data will now always have an alignment of 8,
  which may enable interesting architecture-specific use cases.
* #1307 & #1315 `Db::contains_tree` can be used to see if a
  `Tree` with a given name already exists.

## Improvements

* #1214 a new slab-style storage engine has been added which
  replaces the previous file-per-blob technique for storing
  large pages.
* #1231 tree nodes now get merged into a single-allocation
  representation that is able to dynamically avoid various
  overheads, resulting in significant efficiency improvements.

## Breaking Changes

* #1400 Bump MSRV to 1.57.
* #1399 Thread support is now required on all platforms.
* #1135 The "no_metrics" anti-feature has been replaced with
  the "metrics" positive feature.
* #1178 the `Event` enum has become a unified struct that allows
  subscribers to iterate over each (Tree, key, optional value)
  involved in single key operations, batches, or transactions in
  a unified way.
* #1178 the `Event::key` method has been removed in favor of the
  new more comprehensive `iter` method.
* #1214 The deprecated `Config::build` method has been removed.
* #1248 The deprecated `Tree::set` method has been removed.
* #1248 The deprecated `Tree::del` method has been removed.
* #1250 The `Config::print_profile_on_drop` method has been
  removed in favor of the global `print_profile` function.
* #1252 The deprecated `Db::open` method has been removed.
* #1252 The deprecated `Config::segment_cleanup_skew` method
  has been removed.
* #1252 The deprecated `Config::segment_cleanup_threshold`
  method has been removed.
* #1252 The deprecated `Config::snapshot_path` method has
  been removed.
* #1253 The `IVec::subslice` method has been removed.
* #1275 Keys and values are now limited to 128gb on 64-bit
  platforms and 512mb on 32-bit platforms.
* #1281 `Config`'s `cache_capacity` is now a usize, as u64
  doesn't make sense for things that must fit in memory anyway.
* #1314 `Subscriber::next_timeout` now requires a mutable self
  reference.
* #1349 The "measure_allocs" feature has been removed.
* #1354 `Error` has been modified to be Copy, removing all
  heap-allocated variants.

## Bug Fixes

* #1202 Fix a space leak where blobs were not
  removed when replaced by another blob.
* #1229 the powerful ALICE crash consistency tool has been
  used to discover several crash vulnerabilities, now fixed.

# 0.34.7

## Bug Fixes

* #1314 Fix a bug in Subscriber's Future impl.

# 0.34.6

## Improvements

* documentation improved

# 0.34.5

## Improvements

* #1164 widen some trait bounds on trees and batches

# 0.34.4

## New Features

* #1151 `Send` is implemented for `Iter`
* #1167 added `Tree::first` and `Tree::last` functions
  to retrieve the first or last items in a `Tree`, unless
  the `Tree` is empty.

## Bug Fixes

* #1159 dropping a `Db` instance will no longer
  prematurely shut down the background flusher
  thread.
* #1168 fixed an issue that was causing panics during
  recovery in 32-bit code.
* #1170 when encountering corrupted storage data,
  the recovery process will panic less often.

# 0.34.3

## New Features

* #1146 added `TransactionalTree::generate_id`

# 0.34.2

## Improvements

* #1133 transaction and writebatch performance has been
  significantly improved by removing a bottleneck in
  the atomic batch stability tracking code.

# 0.34.1

## New Features

* #1136 Added the `TransactionalTree::flush` method to
  flush the underlying database after the transaction
  commits and before the transaction returns.

# 0.34

## Improvements

* #1132 implemented From<sled::Error> for io::Error to
  reduce friction in some situations.

## Breaking Changes

* #1131 transactions performed on `Tree`s from different
  `Db`s will now safely fail.
* #1131 transactions may now only be performed on tuples
  of up to 14 elements. For higher numbers, please use
  slices.

# 0.33

## Breaking Changes

* #1125 the backtrace crate has been made optional, which
  cuts several seconds off compilation time, but may cause
  breakage if you interacted with the backtrace field
  of corruption-related errors.

## Bug Fixes

* #1128 `Tree::pop_min` and `Tree::pop_max` had a bug where
  they were not atomic.

# 0.32.1

## New Features

* #1116 `IVec::subslice` has been added to facilitate
  creating zero-copy subsliced `IVec`s that are backed
  by the same data.

## Bug Fixes

* #1120 Fixed a use-after-free caused by missing `ref` keyword
  on a `Copy` type in a pattern match in `IVec::as_mut`.
* #1108 conversions from `Box<[u8]>` to `IVec` are fixed.

# 0.32

## New Features

* #1079 `Transactional` is now implemented for
  `[&Tree]` and `[Tree]` so you can avoid the
  friction of using tuples, as was previously
  necessary.
* #1058 The minimum supported Rust version (MSRV)
  is now 1.39.0.
* #1037 `Subscriber` now implements `Future` (non-fused)
  so prefix watching may now be iterated over via
  `while let Some(event) = (&mut subscriber).await {}`

## Improvements

* #965 concurrency control is now dynamically enabled
  for atomic point operations, so that it may be
  avoided unless transactional functionality is
  being used in the system. This significantly
  increases performance for workloads that do not
  use transactions.
* A number of memory optimizations have been implemented.
* Disk usage has been significantly reduced for many
  workloads.
* #1016 On 64-bit systems, we can now store 1-2 trillion items.
* #993 Added DerefMut and AsMut<[u8]> for `IVec` where it
  works similarly to a `Cow`, making a private copy
  if the backing `Arc`'s strong count is not 1.
* #1020 The sled wiki has been moved into the documentation
  itself, and is accessible through the `doc` module
  exported in lib.

## Breaking Changes

* #975 Changed the default `segment_size` from 8m to 512k.
  This will result in far smaller database files due
  to better file garbage collection granularity.
* #975 deprecated several `Config` options that will be
  removed over time.
* #1000 rearranged some transaction-related imports, and
  moved them to the `transaction` module away from
  the library root to keep the top level docs clean.
* #1015 `TransactionalTree::apply_batch` now accepts
  its argument by reference instead of by value.
* `Event` has been changed to make the inner fields
  named instead of anonymous.
* #1057 read-only mode has been removed due to not having
  the resources to properly keep it tested while
  making progress on high priority issues. This may
  be correctly implemented in the future if resources
  permit.
* The conversion between `Box<[u8]>` and `IVec` has
  been temporarily removed. This is re-added in 0.32.1.

# 0.31

## Improvements

* #947 dramatic read and recovery optimizations
* #921 reduced the reliance on locks while
  performing multithreaded IO on windows.
* #928 use `sync_file_range` on linux instead
  of a full fsync for most writes.
* #946 io_uring support changed to the `rio` crate
* #939 reduced memory consumption during
  zstd decompression

## Breaking Changes

* #927 use SQLite-style varints for serializing
  `u64`. This dramatically reduces the written
  bytes for databases that store small keys and
  values.
* #943 use varints for most of the fields in
  message headers, causing an additional large
  space reduction. combined with #927, these
  changes reduce bytes written by 68% for workloads
  writing small items.

# 0.30.3

* Documentation-only release

# 0.30.2

## New Features

* Added the `open` function for quickly
  opening a database at a path with default
  configuration.

# 0.30.1

## Bugfixes

* Fixed an issue where an idle threadpool worker
  would spin in a hot loop until work arrived

# 0.30

## Breaking Changes

* Migrated to a new storage format

## Bugfixes

* Fixed a bug where cache was not being evicted.
* Fixed a bug with using transactions with
  compression.

# 0.29.2

## New Features

* The `create_new` option has been added
  to `Config`, allowing the user to specify
  that a database should only be freshly
  created, rather than re-opened.

# 0.29.1

## Bugfixes

* Fixed a bug where prefix encoding could be
  incorrectly handled when merging nodes together.

# 0.29

## New Features

* The `Config::open` method has been added to give
  `Config` a similar feel to std's `fs::OpenOptions`.
  The `Config::build` and `Db::start` methods are
  now deprecated in favor of calling `Config::open`
  directly.
* A `checksum` method has been added to Tree and Db
  for use in verifying backups and migrations.
* Transactions may now involve up to 69 different
  tables. Nice.
* The `TransactionError::Abort` variant has had
  a generic member added that can be returned
  as a way to return information from a
  manually-aborted transaction. An `abort` helper
  function has been added to reduce the boilerplate
  required to return aborted results.

## Breaking Changes

* The `ConfigBuilder` structure has been removed
  in favor of a simplified `Config` structure
  with the same functionality.
* The way that sled versions are detected at
  initialization time is now independent of serde.
* The `cas` method is deprecated in favor of the new
  `compare_and_swap` method which now returns the
  proposed value that failed to be applied.
* Tree nodes now have constant prefix encoding
  lengths.
* The `io_buf_size` configurable renamed to
  `segment_size`.
* The `io_buf_size` configurable method has been
  removed from ConfigBuilder. This can be manually
  set by setting the attribute directly on the
  ConfigBuilder, but this is discouraged.
  Additionally, this must now be a power of 2.
* The `page_consolidation_threshold` method has been
  removed from ConfigBuilder, and this is now
  a constant of 10.

# 0.28

## Breaking Changes

* `Iter` no longer has a lifetime parameter.
* `Db::open_tree` now returns a `Tree` instead of
  an `Arc<Tree>`. `Tree` now has an inner type that
  uses an `Arc`, so you don't need to think about it.

## Bug Fixes

* A bug with prefix encoding has been fixed that
  led to nodes with keys longer than 256 bytes
  being stored incorrectly, which made them
  inaccessible and also led to infinite
  loops during iteration.
* Several cases of incorrect unsafe code were removed
  from the sled crate. No bugs are known to have been
  encountered, but they may have resulted in
  incorrect optimizations in future refactors.

# 0.27

## Breaking Changes

* `Event::Set` has been renamed to `Event::Insert` and
  `Event::Del` has been renamed to `Event::Remove`. These
  names better align with the methods of BTreeMap from
  the standard library.

## Bug Fixes

* A deadlock was possible in very high write volume
  situations when the segment accountant lock was
  taken by all IO threads while a task was blocked
  trying to submit a file truncation request to the
  threadpool while holding the segment accountant lock.

## New Features

* `flush_async` has been added to perform time-intensive
  flushing in an asynchronous manner, returning a Future.

# 0.26.1

## Improvements

* std::thread is no longer used on platforms other than
  linux, macos, and windows, which increases portability.

# 0.26

## New Features

* Transactions! You may now call `Tree::transaction` and
  perform reads, writes, and deletes within a provided
  closure with a `TransactionalTree` argument. This
  closure may be called multiple times if the transaction
  encounters a concurrent update in the process of its
  execution. Transactions may also be used on tuples of
  `Tree` objects, where the closure will then be
  parameterized on `TransactionalTree` instances providing
  access to each of the provided `Tree` instances. This
  allows you to atomically read and modify multiple
  `Tree` instances in a single atomic operation.
  These transactions are serializable, fully ACID,
  and optimistic.
* `Tree::apply_batch` allows you to apply a `Batch`
* `TransactionalTree::apply_batch` allows you to
  apply a `Batch` from within a transaction.

## Breaking Changes

* `Tree::batch` has been removed. Now you can directly
  create a `Batch` with `Batch::default()` and then apply
  it to a `Tree` with `Tree::apply_batch` or during a
  transaction using `TransactionalTree::apply_batch`.
  This facilitates multi-`Tree` batches via transactions.
* `Event::Merge` has been removed, and `Tree::merge` will
  now send a complete `Event::Set` item to be distributed
  to all listening subscribers.


================================================
FILE: CONTRIBUTING.md
================================================
# Welcome to the Project :)

* Don't be a jerk - here's our [code of conduct](./code-of-conduct.md).
  We have a track record of defending our community from harm.

There are at least three great ways to contribute to sled:

* [financial contribution](https://github.com/sponsors/spacejam)
* coding
* conversation

#### Coding Considerations:

Please don't waste your time or ours by implementing things that
we do not want to introduce and maintain. Please discuss in an
issue or on chat before submitting a PR with:

* public API changes
* new functionality of any sort
* additional unsafe code
* significant refactoring

The above changes are unlikely to be merged or receive
timely attention without prior discussion.

PRs that generally require less coordination beforehand:

* Anything addressing a correctness issue.
* Better docs: whatever you find confusing!
* Small code changes with big performance implications, substantiated with [responsibly-gathered metrics](https://sled.rs/perf#experiment-checklist).
* FFI submodule changes: these are generally less well maintained than the Rust core, and benefit more from public assistance.
* Generally any new kind of test that avoids biases inherent in the others.

#### All PRs block on failing tests!

sled has intense testing, including crash tests, multi-threaded tests with
delay injection, a variety of mechanically-generated tests that combine fault
injection with concurrency in interesting ways, cross-compilation and minimum
supported Rust version checks, LLVM sanitizers, and more. It can sometimes be
challenging to understand why something is failing these intense tests.

To better understand test failures, please:

1. read the failing test name and output log for clues
1. try to reproduce the failed test locally by running its associated command from the [test script](https://github.com/spacejam/sled/blob/main/.github/workflows/test.yml)
1. If it is not clear why your test is failing, feel free to request help with understanding it either on discord or requesting help on the PR, and we will do our best to help.

Want to help sled but don't have time for individual contributions? Contribute via [GitHub Sponsors](https://github.com/sponsors/spacejam) to support the people pushing the project forward!


================================================
FILE: Cargo.toml
================================================
[package]
name = "sled"
version = "1.0.0-alpha.124"
edition = "2024"
authors = ["Tyler Neely <tylerneely@gmail.com>"]
documentation = "https://docs.rs/sled/"
description = "Lightweight high-performance pure-rust transactional embedded database."
license = "MIT OR Apache-2.0"
homepage = "https://github.com/spacejam/sled"
repository = "https://github.com/spacejam/sled"
keywords = ["redis", "mongo", "sqlite", "lmdb", "rocksdb"]
categories = ["database-implementations", "concurrency", "data-structures", "algorithms", "caching"]
readme = "README.md"
exclude = ["benchmarks", "examples", "bindings", "scripts", "experiments"]

[features]
# initializes allocated memory to 0xa1, writes 0xde to deallocated memory before freeing it
testing-shred-allocator = []
# use a counting global allocator that provides the sled::alloc::{allocated, freed, resident, reset} functions
testing-count-allocator = []
for-internal-testing-only = []
# turn off re-use of object IDs and heap slots, disable tree leaf merges, disable heap file truncation.
monotonic-behavior = []

[profile.release]
debug = true
opt-level = 3
overflow-checks = true
panic = "abort"

[profile.test]
debug = true
overflow-checks = true
panic = "abort"

[dependencies]
bincode = "1.3.3"
cache-advisor = "1.0.16"
concurrent-map = { version = "5.0.31", features = ["serde"] }
crc32fast = "1.3.2"
ebr = "0.2.13"
inline-array = { version = "0.1.13", features = ["serde", "concurrent_map_minimum"] }
fs2 = "0.4.3"
log = "0.4.19"
pagetable = "0.4.5"
parking_lot = { version = "0.12.1", features = ["arc_lock"] }
rayon = "1.7.0"
serde = { version = "1.0", features = ["derive"] }
stack-map = { version = "1.0.5", features = ["serde"] }
zstd = "0.12.4"
fnv = "1.0.7"
fault-injection = "1.0.10"
crossbeam-queue = "0.3.8"
crossbeam-channel = "0.5.8"
tempdir = "0.3.7"

[dev-dependencies]
env_logger = "0.10.0"
num-format = "0.4.4"
# heed = "0.11.0"
# rocksdb = "0.21.0"
# rusqlite = "0.29.0"
# old_sled = { version = "0.34", package = "sled" }
rand = "0.9"
quickcheck = "1.0.3"
rand_distr = "0.5"
libc = "0.2.147"

[[test]]
name = "test_crash_recovery"
path = "tests/test_crash_recovery.rs"
harness = false



================================================
FILE: LICENSE-APACHE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "{}"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright 2015 Tyler Neely
   Copyright 2016 Tyler Neely
   Copyright 2017 Tyler Neely
   Copyright 2018 Tyler Neely
   Copyright 2019 Tyler Neely
   Copyright 2020 Tyler Neely
   Copyright 2021 Tyler Neely
   Copyright 2022 Tyler Neely
   Copyright 2023 Tyler Neely

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: LICENSE-MIT
================================================
Copyright (c) 2015 Tyler Neely
Copyright (c) 2016 Tyler Neely
Copyright (c) 2017 Tyler Neely
Copyright (c) 2018 Tyler Neely
Copyright (c) 2019 Tyler Neely
Copyright (c) 2020 Tyler Neely
Copyright (c) 2021 Tyler Neely
Copyright (c) 2022 Tyler Neely
Copyright (c) 2023 Tyler Neely

Permission is hereby granted, free of charge, to any
person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the
Software without restriction, including without
limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software
is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice
shall be included in all copies or substantial portions
of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.


================================================
FILE: README.md
================================================

<table style="width:100%">
<tr>
  <td>
    <table style="width:100%">
      <tr>
        <td> key </td>
        <td> value </td>
      </tr>
      <tr>
        <td><a href="https://docs.rs/sled">documentation</a></td>
        <td><a href="https://docs.rs/sled"><img src="https://docs.rs/sled/badge.svg"></a></td>
      </tr>
      <tr>
        <td><a href="https://discord.gg/Z6VsXds">chat about databases with us</a></td>
        <td><a href="https://discord.gg/Z6VsXds"><img src="https://img.shields.io/discord/509773073294295082.svg?logo=discord"></a></td>
      </tr>
     </table>
  </td>
  <td>
<p align="center">
  <img src="https://raw.githubusercontent.com/spacejam/sled/main/art/tree_face_anti-transphobia.png" width="40%" height="auto" />
  </p>
  </td>
 </tr>
</table>


# sled - ~~it's all downhill from here!!!~~

An embedded database.

```rust
let tree = sled::open("/tmp/welcome-to-sled")?;

// insert and get, similar to std's BTreeMap
let old_value = tree.insert("key", "value")?;

assert_eq!(
  tree.get(&"key")?,
  Some(sled::IVec::from("value")),
);

// range queries
for kv_result in tree.range("key_1".."key_9") {}

// deletion
let old_value = tree.remove(&"key")?;

// atomic compare and swap
tree.compare_and_swap(
  "key",
  Some("current_value"),
  Some("new_value"),
)?;

// block until all operations are stable on disk
// (flush_async also available to get a Future)
tree.flush()?;
```

$${\color{red}This \space README \space is \space out \space of \space sync \space with \space the \space main \space branch \space which \space contains \space a \space large \space in-progress \space rewrite }$$

If you would like to work with structured data without paying expensive deserialization costs, check out the [structured](examples/structured.rs) example!

# features

* [API](https://docs.rs/sled) similar to a threadsafe `BTreeMap<[u8], [u8]>`
* serializable (ACID) [transactions](https://docs.rs/sled/latest/sled/struct.Tree.html#method.transaction)
  for atomically reading and writing to multiple keys in multiple keyspaces.
* fully atomic single-key operations, including [compare and swap](https://docs.rs/sled/latest/sled/struct.Tree.html#method.compare_and_swap)
* zero-copy reads
* [write batches](https://docs.rs/sled/latest/sled/struct.Tree.html#method.apply_batch)
* [subscribe to changes on key
  prefixes](https://docs.rs/sled/latest/sled/struct.Tree.html#method.watch_prefix)
* [multiple keyspaces](https://docs.rs/sled/latest/sled/struct.Db.html#method.open_tree)
* [merge operators](https://docs.rs/sled/latest/sled/doc/merge_operators/index.html)
* forward and reverse iterators over ranges of items
* a crash-safe monotonic [ID generator](https://docs.rs/sled/latest/sled/struct.Db.html#method.generate_id)
  capable of generating 75-125 million unique IDs per second
* [zstd](https://github.com/facebook/zstd) compression (use the
  `compression` build feature, disabled by default)
* cpu-scalable lock-free implementation
* flash-optimized log-structured storage
* uses modern b-tree techniques such as prefix encoding and suffix
  truncation for reducing the storage costs of long keys with shared
  prefixes. If keys are the same length and sequential then the
  system can avoid storing 99%+ of the key data in most cases,
  essentially acting like a learned index
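
As a rough illustration of the prefix encoding mentioned above, here is a std-only sketch. It is not sled's actual node layout: `PrefixEncoded`, `encode`, and `decode` are hypothetical names chosen for this example, and suffix truncation is not shown.

```rust
/// Hypothetical illustration of prefix encoding; not sled's node format.
#[derive(Debug, PartialEq)]
pub struct PrefixEncoded {
    /// Bytes shared by every key in the node, stored once.
    pub prefix: Vec<u8>,
    /// The remainder of each key after the shared prefix.
    pub suffixes: Vec<Vec<u8>>,
}

/// Length of the longest prefix common to all keys (0 for an empty set).
fn common_prefix_len(keys: &[&[u8]]) -> usize {
    let Some(first) = keys.first() else { return 0 };
    keys.iter().skip(1).fold(first.len(), |len, key| {
        first
            .iter()
            .zip(key.iter())
            .take(len)
            .take_while(|(a, b)| a == b)
            .count()
    })
}

/// Store the shared prefix once and keep only the differing suffixes.
pub fn encode(keys: &[&[u8]]) -> PrefixEncoded {
    let n = common_prefix_len(keys);
    PrefixEncoded {
        prefix: keys.first().map(|k| k[..n].to_vec()).unwrap_or_default(),
        suffixes: keys.iter().map(|k| k[n..].to_vec()).collect(),
    }
}

/// Rebuild the full keys by re-attaching the shared prefix.
pub fn decode(node: &PrefixEncoded) -> Vec<Vec<u8>> {
    node.suffixes
        .iter()
        .map(|s| [node.prefix.as_slice(), s.as_slice()].concat())
        .collect()
}
```

With keys such as `user:0001`, `user:0002`, `user:0003`, the sketch stores `user:000` once and keeps one byte per key, which is the effect the bullet above describes for same-length sequential keys.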

# expectations, gotchas, advice

* Maybe one of the first things that seems weird is the `IVec` type.
  This is an inlinable `Arc`ed slice that makes some things more efficient.
* Durability: **sled automatically fsyncs every 500ms by default**,
  which can be configured with the `flush_every_ms` configurable, or you may
  call `flush` / `flush_async` manually after operations.
* **Transactions are optimistic** - do not interact with external state
  or perform IO from within a transaction closure unless it is
  [idempotent](https://en.wikipedia.org/wiki/Idempotent).
* Internal tree node optimizations: sled performs prefix encoding
  on long keys with similar prefixes that are grouped together in a range,
  as well as suffix truncation to further reduce the indexing costs of
  long keys. Nodes will skip potentially expensive length and offset pointers
  if keys or values are all the same length (tracked separately, don't worry
  about making keys the same length as values), so it may improve space usage
  slightly if you use fixed-length keys or values. This also makes it easier
  to use [structured access](examples/structured.rs) as well.
* sled does not support multiple open instances for the time being. Please
  keep sled open for the duration of your process's lifespan. It's totally
  safe and often quite convenient to use a global lazy_static sled instance,
  modulo the normal global variable trade-offs. Every operation is threadsafe,
  and most are implemented under the hood with lock-free algorithms that avoid
  blocking in hot paths.
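
To see why side effects inside an optimistic transaction closure need to be idempotent, consider this std-only sketch of an optimistic retry loop. It is not sled's transaction machinery: `optimistic_update` is a hypothetical helper over a single `AtomicU64`, used only to show that the closure may run more than once when a conflicting write lands before commit.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical optimistic update loop; not sled's transaction implementation.
/// Runs `f` on a snapshot of the value and retries if the value changed
/// before the result could be committed. Returns how many times `f` ran.
pub fn optimistic_update(cell: &AtomicU64, mut f: impl FnMut(u64) -> u64) -> usize {
    let mut attempts = 0;
    loop {
        let snapshot = cell.load(Ordering::Acquire);
        attempts += 1;
        // Any side effect performed inside `f` happens on every attempt.
        let proposed = f(snapshot);
        // Commit only if nobody changed the value since the snapshot.
        if cell
            .compare_exchange(snapshot, proposed, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return attempts;
        }
    }
}
```

If the closure sent a network request or appended to a file, each retry would repeat that action, which is why only idempotent external effects are safe there.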

# performance

* [LSM tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree)-like write performance
  with [traditional B+ tree](https://en.wikipedia.org/wiki/B%2B_tree)-like read performance
* over a billion operations in under a minute at 95% read 5% writes on 16 cores on a small dataset
* measure your own workloads rather than relying on some marketing for contrived workloads

# a note on lexicographic ordering and endianness

If you want to store numerical keys in a way that will play nicely with sled's iterators and ordered operations, please remember to store your numerical items in big-endian form. Little endian (the default of many things) will often appear to be doing the right thing until you start working with more than 256 items (more than 1 byte), causing lexicographic ordering of the serialized bytes to diverge from the lexicographic ordering of their deserialized numerical form.

* Rust integral types have built-in `to_be_bytes` and `from_be_bytes` [methods](https://doc.rust-lang.org/std/primitive.u64.html#method.from_be_bytes).
* bincode [can be configured](https://docs.rs/bincode/1.2.0/bincode/struct.Config.html#method.big_endian) to store integral types in big-endian form.
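
The divergence can be checked with the standard library alone. In this small sketch, `encode_key`, `decode_key`, and `sorted_by_encoding` are hypothetical helper names, not part of sled's API:

```rust
/// Encode a numeric key so that byte-wise (lexicographic) order
/// matches numeric order.
pub fn encode_key(n: u64) -> [u8; 8] {
    n.to_be_bytes()
}

/// Recover the number from its big-endian key bytes.
pub fn decode_key(bytes: [u8; 8]) -> u64 {
    u64::from_be_bytes(bytes)
}

/// Sort numbers by the lexicographic order of their serialized bytes,
/// which is the order a byte-keyed ordered map would iterate them in.
pub fn sorted_by_encoding(mut nums: Vec<u64>, encode: fn(u64) -> [u8; 8]) -> Vec<u64> {
    nums.sort_by_key(|n| encode(*n));
    nums
}
```

Calling `sorted_by_encoding(vec![256, 1, 255], u64::to_be_bytes)` yields `[1, 255, 256]`, while the same call with `u64::to_le_bytes` yields `[256, 1, 255]`, because the little-endian encoding of 256 starts with a zero byte.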

# interaction with async

If your dataset resides entirely in cache (achievable at startup by setting the cache
to a large enough value and performing a full iteration) then all reads and writes are
non-blocking and async-friendly, without needing to use Futures or an async runtime.

To asynchronously suspend your async task on the durability of writes, we support the
[`flush_async` method](https://docs.rs/sled/latest/sled/struct.Tree.html#method.flush_async),
which returns a Future that your async tasks can await the completion of if they require
high durability guarantees and you are willing to pay the latency costs of fsync.
Note that sled automatically tries to sync all data to disk several times per second
in the background without blocking user threads.

We support async subscription to events that happen on key prefixes, because the
`Subscriber` struct implements `Future<Output=Option<Event>>`:

```rust
let sled = sled::open("my_db").unwrap();

let mut sub = sled.watch_prefix("");

sled.insert(b"a", b"a").unwrap();

extreme::run(async move {
    while let Some(event) = (&mut sub).await {
        println!("got event {:?}", event);
    }
});
```

# minimum supported Rust version (MSRV)

We support Rust 1.62 and up.

# architecture

lock-free tree on a lock-free pagecache on a lock-free log. the pagecache scatters
partial page fragments across the log, rather than rewriting entire pages at a time
as B+ trees for spinning disks historically have. on page reads, we concurrently
scatter-gather reads across the log to materialize the page from its fragments.
check out the [architectural outlook](https://github.com/spacejam/sled/wiki/sled-architectural-outlook)
for a more detailed overview of where we're at and where we see things going!

# philosophy

1. don't make the user think. the interface should be obvious.
1. don't surprise users with performance traps.
1. don't wake up operators. bring reliability techniques from academia into real-world practice.
1. don't use so much electricity. our data structures should play to modern hardware's strengths.

# known issues, warnings

* if reliability is your primary constraint, use SQLite. sled is beta.
* if storage price performance is your primary constraint, use RocksDB. sled uses too much space sometimes.
* if you have a multi-process workload that rarely writes, use LMDB. sled is architected for use with long-running, highly-concurrent workloads such as stateful services or higher-level databases.
* quite young, should be considered unstable for the time being.
* the on-disk format is going to change in ways that require [manual migrations](https://docs.rs/sled/latest/sled/struct.Db.html#method.export) before the `1.0.0` release!

# priorities

1. A full rewrite of sled's storage subsystem is happening on a modular basis as part of the [komora project](https://github.com/komora-io), in particular the marble storage engine. This will dramatically lower both the disk space usage (space amplification) and garbage collection overhead (write amplification) of sled.
2. The memory layout of tree nodes is being completely rewritten to reduce fragmentation and eliminate serialization costs.
3. The merge operator feature will change into a trigger feature that resembles traditional database triggers, allowing state to be modified as part of the same atomic writebatch that triggered it for retaining serializability with reactive semantics.

# fund feature development

Like what we're doing? Help us out via [GitHub Sponsors](https://github.com/sponsors/spacejam)!


================================================
FILE: RELEASE_CHECKLIST.md
================================================
# Release Checklist

This checklist must be completed before publishing a release of any kind.

Over time, anything in this list that can be turned into an automated test should be, but
there are still some big blind spots.

## API stability

- [ ] rust-flavored semver respected

## Performance

- [ ] micro-benchmark regressions should not happen unless newly discovered correctness criteria demand them
- [ ] mixed point operation latency distribution should narrow over time
- [ ] sequential operation average throughput should increase over time
- [ ] workloads should pass TSAN and ASAN on macOS. Linux should additionally pass LSAN & MSAN.
- [ ] workload write and space amplification thresholds should see no regressions

## Concurrency Audit

- [ ] any new `Guard` objects are dropped inside the rayon threadpool
- [ ] no new EBR `Collector`s, as they destroy causality. These will be optimized in-bulk in the future.
- [ ] no code assumes a recently read page pointer will remain unchanged (transactions may change this if reads are inline)
- [ ] no calls to `rand::thread_rng` from a droppable function (anything in the SegmentAccountant)

## Burn-In

- [ ] fuzz tests should run at least 24 hours each with zero crashes
- [ ] sequential and point workloads run at least 24 hours in constrained docker container without OOM / out of disk


================================================
FILE: SAFETY.md
================================================
# sled safety model

This document applies
[STPA](http://psas.scripts.mit.edu/home/get_file.php?name=STPA_handbook.pdf)-style
hazard analysis to the sled embedded database for the purpose of guiding
design and testing efforts to prevent unacceptable losses.

Outline

* [purpose of analysis](#purpose-of-analysis)
  * [losses](#losses)
  * [system boundary](#system-boundary)
  * [hazards](#hazards)
  * [leading indicators](#leading-indicators)
  * [constraints](#constraints)
* [model of control structure](#model-of-control-structure)
* [identify unsafe control actions](#identify-unsafe-control-actions)
* [identify loss scenarios](#identify-loss-scenarios)
* [resources for learning more about STAMP, STPA, and CAST](#resources)

# Purpose of Analysis

## Losses

We wish to prevent the following undesirable situations:

* data loss
* inconsistent (non-linearizable) data access
* process crash
* resource exhaustion

## System Boundary

We draw the line between system and environment where we can reasonably
invest our efforts to prevent losses.

Inside the boundary:

* codebase
  * put safe control actions into place that prevent losses
* documentation
  * show users how to use sled safely
  * recommend hardware, kernels, user code

Outside the boundary:

* Direct changes to hardware, kernels, user code

## Hazards

These hazards can result in the above losses:

* data may be lost if
  * bugs in the logging system
    * `Db::flush` fails to make previous writes durable
  * bugs in the GC system
    * the old location is overwritten before the defragmented location becomes durable
  * bugs in the recovery system
  * hardware failures
* consistency violations may be caused by
  * transaction concurrency control failure to enforce linearizability (strict serializability)
  * non-linearizable lock-free single-key operations
* panic
  * of user threads
  * IO threads
  * flusher & GC thread
  * indexing
  * unwraps/expects
  * failed TryInto/TryFrom + unwrap
* persistent storage exceeding (2 + N concurrent writers) * logical data size
* in-memory cache exceeding the configured cache size
  * caused by incorrect calculation of cache
* use-after-free
* data race
* memory leak
* integer overflow
* buffer overrun
* uninitialized memory access

## Constraints

# Models of Control Structures

for each control action we have, consider:

1. what hazards happen when we fail to apply it / it does not exist?
2. what hazards happen when we do apply it?
3. what hazards happen when we apply it too early or too late?
4. what hazards happen if we apply it for too long or not long enough?

durability model

  * recovery
    * LogIter::max_lsn
      * return None if last_lsn_in_batch >= self.max_lsn
    * batch requirement set to last reservation base + inline len - 1
      * reserve bumps
        * bump_atomic_lsn(&self.iobufs.max_reserved_lsn, reservation_lsn + inline_buf_len as Lsn - 1);
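
A minimal sketch of the arithmetic in the durability notes above, under two assumptions that this document does not confirm: that `Lsn` is a signed 64-bit integer, and that `bump_atomic_lsn` only ever raises the stored value. `last_lsn_of_reservation` is a hypothetical helper name.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Assumed representation of a log sequence number.
pub type Lsn = i64;

/// Raise `atomic_lsn` to at least `to`, never lowering it
/// (assumed semantics of the `bump_atomic_lsn` call above).
pub fn bump_atomic_lsn(atomic_lsn: &AtomicI64, to: Lsn) {
    atomic_lsn.fetch_max(to, Ordering::SeqCst);
}

/// LSN of the last byte covered by a reservation:
/// reservation base + inline length - 1 (an inclusive upper bound).
pub fn last_lsn_of_reservation(reservation_lsn: Lsn, inline_buf_len: usize) -> Lsn {
    reservation_lsn + inline_buf_len as Lsn - 1
}
```

For a reservation starting at LSN 100 with 16 inline bytes, the last covered LSN is 115; a later bump to a smaller value such as 50 leaves the maximum at 115.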

lock-free linearizability model

transactional linearizability (strict serializability) model

panic model

memory usage model

storage usage model



================================================
FILE: SECURITY.md
================================================
# Security Policy

## Reporting a Vulnerability

sled uses some unsafe functionality in the core lock-free algorithms, and in a few places to more efficiently copy data.

Please contact [Tyler Neely](mailto:tylerneely@gmail.com?subject=sled%20security%20issue) immediately if you find any vulnerability, and I will work with you to fix the issue rapidly and coordinate public disclosure with an expedited release including the fix.

If you are a bug hunter or a person with a security interest, here is my mental model of memory corruption risk in the sled codebase:

1. memory issues relating to the lock-free data structures in their colder failure paths. these have been tested a bit by injecting delays into random places, but this is still an area with elevated risk
1. anywhere the `unsafe` keyword is used


================================================
FILE: art/CREDITS
================================================
original tree logo with face:
  https://twitter.com/daiyitastic

anti-transphobia additions:
  spacejam


================================================
FILE: code-of-conduct.md
================================================
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at tylerneely@gmail.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org



================================================
FILE: examples/bench.rs
================================================
use std::path::Path;
use std::sync::Barrier;
use std::thread::scope;
use std::time::{Duration, Instant};
use std::{fs, io};

use num_format::{Locale, ToFormattedString};

use sled::{Config, Db as SledDb};

type Db = SledDb<1024>;

const N_WRITES_PER_THREAD: u32 = 4 * 1024 * 1024;
const MAX_CONCURRENCY: u32 = 4;
const CONCURRENCY: &[usize] = &[/*1, 2, 4,*/ MAX_CONCURRENCY as _];
const BYTES_PER_ITEM: u32 = 8;

trait Databench: Clone + Send {
    type READ: AsRef<[u8]>;
    const NAME: &'static str;
    const PATH: &'static str;
    fn open() -> Self;
    fn remove_generic(&self, key: &[u8]);
    fn insert_generic(&self, key: &[u8], value: &[u8]);
    fn get_generic(&self, key: &[u8]) -> Option<Self::READ>;
    fn flush_generic(&self);
    fn print_stats(&self);
}

impl Databench for Db {
    type READ = sled::InlineArray;

    const NAME: &'static str = "sled 1.0.0-alpha";
    const PATH: &'static str = "timing_test.sled-new";

    fn open() -> Self {
        sled::Config {
            path: Self::PATH.into(),
            zstd_compression_level: 3,
            cache_capacity_bytes: 1024 * 1024 * 1024,
            entry_cache_percent: 20,
            flush_every_ms: Some(200),
            ..Config::default()
        }
        .open()
        .unwrap()
    }

    fn insert_generic(&self, key: &[u8], value: &[u8]) {
        self.insert(key, value).unwrap();
    }
    fn remove_generic(&self, key: &[u8]) {
        self.remove(key).unwrap();
    }
    fn get_generic(&self, key: &[u8]) -> Option<Self::READ> {
        self.get(key).unwrap()
    }
    fn flush_generic(&self) {
        self.flush().unwrap();
    }
    fn print_stats(&self) {
        dbg!(self.stats());
    }
}

/*
impl Databench for old_sled::Db {
    type READ = old_sled::IVec;

    const NAME: &'static str = "sled 0.34.7";
    const PATH: &'static str = "timing_test.sled-old";

    fn open() -> Self {
        old_sled::open(Self::PATH).unwrap()
    }
    fn insert_generic(&self, key: &[u8], value: &[u8]) {
        self.insert(key, value).unwrap();
    }
    fn get_generic(&self, key: &[u8]) -> Option<Self::READ> {
        self.get(key).unwrap()
    }
    fn flush_generic(&self) {
        self.flush().unwrap();
    }
}
*/

/*
impl Databench for Arc<rocksdb::DB> {
    type READ = Vec<u8>;

    const NAME: &'static str = "rocksdb 0.21.0";
    const PATH: &'static str = "timing_test.rocksdb";

    fn open() -> Self {
        Arc::new(rocksdb::DB::open_default(Self::PATH).unwrap())
    }
    fn insert_generic(&self, key: &[u8], value: &[u8]) {
        self.put(key, value).unwrap();
    }
    fn get_generic(&self, key: &[u8]) -> Option<Self::READ> {
        self.get(key).unwrap()
    }
    fn flush_generic(&self) {
        self.flush().unwrap();
    }
}
*/

/*
struct Lmdb {
    env: heed::Env,
    db: heed::Database<
        heed::types::UnalignedSlice<u8>,
        heed::types::UnalignedSlice<u8>,
    >,
}

impl Clone for Lmdb {
    fn clone(&self) -> Lmdb {
        Lmdb { env: self.env.clone(), db: self.db.clone() }
    }
}

impl Databench for Lmdb {
    type READ = Vec<u8>;

    const NAME: &'static str = "lmdb";
    const PATH: &'static str = "timing_test.lmdb";

    fn open() -> Self {
        let _ = std::fs::create_dir_all(Self::PATH);
        let env = heed::EnvOpenOptions::new()
            .map_size(1024 * 1024 * 1024)
            .open(Self::PATH)
            .unwrap();
        let db = env.create_database(None).unwrap();
        Lmdb { env, db }
    }
    fn insert_generic(&self, key: &[u8], value: &[u8]) {
        let mut wtxn = self.env.write_txn().unwrap();
        self.db.put(&mut wtxn, key, value).unwrap();
        wtxn.commit().unwrap();
    }
    fn get_generic(&self, key: &[u8]) -> Option<Self::READ> {
        let rtxn = self.env.read_txn().unwrap();
        let ret = self.db.get(&rtxn, key).unwrap().map(Vec::from);
        rtxn.commit().unwrap();
        ret
    }
    fn flush_generic(&self) {
        // NOOP
    }
}
*/

/*
struct Sqlite {
    connection: rusqlite::Connection,
}

impl Clone for Sqlite {
    fn clone(&self) -> Sqlite {
        Sqlite { connection: rusqlite::Connection::open(Self::PATH).unwrap() }
    }
}

impl Databench for Sqlite {
    type READ = Vec<u8>;

    const NAME: &'static str = "sqlite";
    const PATH: &'static str = "timing_test.sqlite";

    fn open() -> Self {
        let connection = rusqlite::Connection::open(Self::PATH).unwrap();
        connection
            .execute(
                "create table if not exists bench (
                     key integer primary key,
                     val integer not null
                 )",
                [],
            )
            .unwrap();
        Sqlite { connection }
    }
    fn insert_generic(&self, key: &[u8], value: &[u8]) {
        loop {
            let res = self.connection.execute(
                "insert or ignore into bench (key, val) values (?1, ?2)",
                [
                    format!("{}", u32::from_be_bytes(key.try_into().unwrap())),
                    format!(
                        "{}",
                        u32::from_be_bytes(value.try_into().unwrap())
                    ),
                ],
            );
            if res.is_ok() {
                break;
            }
        }
    }
    fn get_generic(&self, key: &[u8]) -> Option<Self::READ> {
        let mut stmt = self
            .connection
            .prepare("SELECT b.val from bench b WHERE key = ?1")
            .unwrap();
        let mut rows =
            stmt.query([u32::from_be_bytes(key.try_into().unwrap())]).unwrap();

        let value = rows.next().unwrap()?;
        value.get(0).ok()
    }
    fn flush_generic(&self) {
        // NOOP
    }
}
*/

fn allocated() -> usize {
    #[cfg(feature = "testing-count-allocator")]
    {
        return sled::alloc::allocated();
    }
    0
}

fn freed() -> usize {
    #[cfg(feature = "testing-count-allocator")]
    {
        return sled::alloc::freed();
    }
    0
}

fn resident() -> usize {
    #[cfg(feature = "testing-count-allocator")]
    {
        return sled::alloc::resident();
    }
    0
}

fn inserts<D: Databench>(store: &D) -> Vec<InsertStats> {
    println!("{} inserts", D::NAME);
    let mut i = 0_u32;

    let factory = move || {
        i += 1;
        (store.clone(), i - 1)
    };

    let f = |state: (D, u32)| {
        let (store, offset) = state;
        let start = N_WRITES_PER_THREAD * offset;
        let end = N_WRITES_PER_THREAD * (offset + 1);
        for i in start..end {
            let k: &[u8] = &i.to_be_bytes();
            store.insert_generic(k, k);
        }
    };

    let mut ret = vec![];

    for concurrency in CONCURRENCY {
        let insert_elapsed =
            execute_lockstep_concurrent(factory, f, *concurrency);

        let flush_timer = Instant::now();
        store.flush_generic();

        let wps = (N_WRITES_PER_THREAD * *concurrency as u32) as u64
            * 1_000_000_u64
            / u64::try_from(insert_elapsed.as_micros().max(1))
                .unwrap_or(u64::MAX);

        ret.push(InsertStats {
            thread_count: *concurrency,
            inserts_per_second: wps,
        });

        println!(
            "{} inserts/s with {concurrency} threads over {:?}, then {:?} to flush {}",
            wps.to_formatted_string(&Locale::en),
            insert_elapsed,
            flush_timer.elapsed(),
            D::NAME,
        );
    }

    ret
}

fn removes<D: Databench>(store: &D) -> Vec<RemoveStats> {
    println!("{} removals", D::NAME);
    let mut i = 0_u32;

    let factory = move || {
        i += 1;
        (store.clone(), i - 1)
    };

    let f = |state: (D, u32)| {
        let (store, offset) = state;
        let start = N_WRITES_PER_THREAD * offset;
        let end = N_WRITES_PER_THREAD * (offset + 1);
        for i in start..end {
            let k: &[u8] = &i.to_be_bytes();
            store.remove_generic(k);
        }
    };

    let mut ret = vec![];

    for concurrency in CONCURRENCY {
        let remove_elapsed =
            execute_lockstep_concurrent(factory, f, *concurrency);

        let flush_timer = Instant::now();
        store.flush_generic();

        let wps = (N_WRITES_PER_THREAD * *concurrency as u32) as u64
            * 1_000_000_u64
            / u64::try_from(remove_elapsed.as_micros().max(1))
                .unwrap_or(u64::MAX);

        ret.push(RemoveStats {
            thread_count: *concurrency,
            removes_per_second: wps,
        });

        println!(
            "{} removes/s with {concurrency} threads over {:?}, then {:?} to flush {}",
            wps.to_formatted_string(&Locale::en),
            remove_elapsed,
            flush_timer.elapsed(),
            D::NAME,
        );
    }

    ret
}

fn gets<D: Databench>(store: &D) -> Vec<GetStats> {
    println!("{} reads", D::NAME);

    let factory = || store.clone();

    let f = |store: D| {
        let start = 0;
        let end = N_WRITES_PER_THREAD * MAX_CONCURRENCY;
        for i in start..end {
            let k: &[u8] = &i.to_be_bytes();
            store.get_generic(k);
        }
    };

    let mut ret = vec![];

    for concurrency in CONCURRENCY {
        let get_stone_elapsed =
            execute_lockstep_concurrent(factory, f, *concurrency);

        let rps = (N_WRITES_PER_THREAD * MAX_CONCURRENCY * *concurrency as u32)
            as u64
            * 1_000_000_u64
            / u64::try_from(get_stone_elapsed.as_micros().max(1))
                .unwrap_or(u64::MAX);

        ret.push(GetStats { thread_count: *concurrency, gets_per_second: rps });

        println!(
            "{} gets/s with concurrency of {concurrency}, {:?} total reads {}",
            rps.to_formatted_string(&Locale::en),
            get_stone_elapsed,
            D::NAME
        );
    }
    ret
}

fn execute_lockstep_concurrent<
    State: Send,
    Factory: FnMut() -> State,
    F: Sync + Fn(State),
>(
    mut factory: Factory,
    f: F,
    concurrency: usize,
) -> Duration {
    // One barrier slot per worker, plus one for this coordinating thread.
    let barrier = &Barrier::new(concurrency + 1);
    let f = &f;

    scope(|s| {
        let mut threads = vec![];

        for _ in 0..concurrency {
            let state = factory();

            let thread = s.spawn(move || {
                barrier.wait();
                f(state);
            });

            threads.push(thread);
        }

        // Release all workers at once, then start timing.
        barrier.wait();
        let get_stone = Instant::now();

        for thread in threads.into_iter() {
            thread.join().unwrap();
        }

        get_stone.elapsed()
    })
}

#[derive(Debug, Clone, Copy)]
struct InsertStats {
    thread_count: usize,
    inserts_per_second: u64,
}

#[derive(Debug, Clone, Copy)]
struct GetStats {
    thread_count: usize,
    gets_per_second: u64,
}

#[derive(Debug, Clone, Copy)]
struct RemoveStats {
    thread_count: usize,
    removes_per_second: u64,
}

#[allow(unused)]
#[derive(Debug, Clone)]
struct Stats {
    post_insert_disk_space: u64,
    post_remove_disk_space: u64,
    allocated_memory: usize,
    freed_memory: usize,
    resident_memory: usize,
    insert_stats: Vec<InsertStats>,
    get_stats: Vec<GetStats>,
    remove_stats: Vec<RemoveStats>,
}

impl Stats {
    fn print_report(&self) {
        println!(
            "bytes on disk after inserts: {}",
            self.post_insert_disk_space.to_formatted_string(&Locale::en)
        );
        println!(
            "bytes on disk after removes: {}",
            self.post_remove_disk_space.to_formatted_string(&Locale::en)
        );
        println!(
            "bytes in memory: {}",
            self.resident_memory.to_formatted_string(&Locale::en)
        );
        for stats in &self.insert_stats {
            println!(
                "{} threads {} inserts per second",
                stats.thread_count,
                stats.inserts_per_second.to_formatted_string(&Locale::en)
            );
        }
        for stats in &self.get_stats {
            println!(
                "{} threads {} gets per second",
                stats.thread_count,
                stats.gets_per_second.to_formatted_string(&Locale::en)
            );
        }
        for stats in &self.remove_stats {
            println!(
                "{} threads {} removes per second",
                stats.thread_count,
                stats.removes_per_second.to_formatted_string(&Locale::en)
            );
        }
    }
}

fn bench<D: Databench>() -> Stats {
    let store = D::open();

    let insert_stats = inserts(&store);

    let before_flush = Instant::now();
    store.flush_generic();
    println!("final flush took {:?} for {}", before_flush.elapsed(), D::NAME);

    let post_insert_disk_space = du(D::PATH.as_ref()).unwrap();

    let get_stats = gets(&store);

    let remove_stats = removes(&store);

    store.print_stats();

    Stats {
        post_insert_disk_space,
        post_remove_disk_space: du(D::PATH.as_ref()).unwrap(),
        allocated_memory: allocated(),
        freed_memory: freed(),
        resident_memory: resident(),
        insert_stats,
        get_stats,
        remove_stats,
    }
}

fn du(path: &Path) -> io::Result<u64> {
    fn recurse(mut dir: fs::ReadDir) -> io::Result<u64> {
        dir.try_fold(0, |acc, file| {
            let file = file?;
            let size = match file.metadata()? {
                data if data.is_dir() => recurse(fs::read_dir(file.path())?)?,
                data => data.len(),
            };
            Ok(acc + size)
        })
    }

    recurse(fs::read_dir(path)?)
}

fn main() {
    let _ = env_logger::try_init();

    let new_stats = bench::<Db>();

    println!(
        "raw data size: {}",
        (MAX_CONCURRENCY * N_WRITES_PER_THREAD * BYTES_PER_ITEM)
            .to_formatted_string(&Locale::en)
    );
    println!("sled 1.0 space stats:");
    new_stats.print_report();

    /*
    let old_stats = bench::<old_sled::Db>();
    dbg!(old_stats);

    let new_sled_vs_old_sled_storage_ratio =
        new_stats.disk_space as f64 / old_stats.disk_space as f64;
    let new_sled_vs_old_sled_allocated_memory_ratio =
        new_stats.allocated_memory as f64 / old_stats.allocated_memory as f64;
    let new_sled_vs_old_sled_freed_memory_ratio =
        new_stats.freed_memory as f64 / old_stats.freed_memory as f64;
    let new_sled_vs_old_sled_resident_memory_ratio =
        new_stats.resident_memory as f64 / old_stats.resident_memory as f64;

    dbg!(new_sled_vs_old_sled_storage_ratio);
    dbg!(new_sled_vs_old_sled_allocated_memory_ratio);
    dbg!(new_sled_vs_old_sled_freed_memory_ratio);
    dbg!(new_sled_vs_old_sled_resident_memory_ratio);

    let rocksdb_stats = bench::<Arc<rocksdb::DB>>();

    bench::<Lmdb>();

    bench::<Sqlite>();
    */

    /*
    let new_sled_vs_rocksdb_storage_ratio =
        new_stats.disk_space as f64 / rocksdb_stats.disk_space as f64;
    let new_sled_vs_rocksdb_allocated_memory_ratio =
        new_stats.allocated_memory as f64 / rocksdb_stats.allocated_memory as f64;
    let new_sled_vs_rocksdb_freed_memory_ratio =
        new_stats.freed_memory as f64 / rocksdb_stats.freed_memory as f64;
    let new_sled_vs_rocksdb_resident_memory_ratio =
        new_stats.resident_memory as f64 / rocksdb_stats.resident_memory as f64;

    dbg!(new_sled_vs_rocksdb_storage_ratio);
    dbg!(new_sled_vs_rocksdb_allocated_memory_ratio);
    dbg!(new_sled_vs_rocksdb_freed_memory_ratio);
    dbg!(new_sled_vs_rocksdb_resident_memory_ratio);
    */

    /*
    let scan = Instant::now();
    let count = stone.iter().count();
    assert_eq!(count as u64, N_WRITES_PER_THREAD);
    let scan_elapsed = scan.elapsed();
    println!(
        "{} scanned items/s, total {:?}",
        (N_WRITES_PER_THREAD * 1_000_000) / u64::try_from(scan_elapsed.as_micros().max(1)).unwrap_or(u64::MAX),
        scan_elapsed
    );
    */

    /*
    let scan_rev = Instant::now();
    let count = stone.range(..).rev().count();
    assert_eq!(count as u64, N_WRITES_PER_THREAD);
    let scan_rev_elapsed = scan_rev.elapsed();
    println!(
        "{} reverse-scanned items/s, total {:?}",
        (N_WRITES_PER_THREAD * 1_000_000) / u64::try_from(scan_rev_elapsed.as_micros().max(1)).unwrap_or(u64::MAX),
        scan_rev_elapsed
    );
    */
}


================================================
FILE: fuzz/.gitignore
================================================
target
corpus
artifacts


================================================
FILE: fuzz/Cargo.toml
================================================
[package]
name = "bloodstone-fuzz"
version = "0.0.0"
authors = ["Automatically generated"]
publish = false
edition = "2018"

[package.metadata]
cargo-fuzz = true

[dependencies.libfuzzer-sys]
version = "0.4.0"
features = ["arbitrary-derive"]

[dependencies]
arbitrary = { version = "1.0.3", features = ["derive"] }
tempfile = "3.5.0"

[dependencies.sled]
path = ".."
features = []

# Prevent this from interfering with workspaces
[workspace]
members = ["."]

[[bin]]
name = "fuzz_model"
path = "fuzz_targets/fuzz_model.rs"
test = false
doc = false


================================================
FILE: fuzz/fuzz_targets/fuzz_model.rs
================================================
#![no_main]
#[macro_use]
extern crate libfuzzer_sys;
extern crate arbitrary;
extern crate sled;

use arbitrary::Arbitrary;

use sled::{Config, Db as SledDb, InlineArray};

type Db = SledDb<3>;

const KEYSPACE: u64 = 128;

#[derive(Debug)]
enum Op {
    Get { key: InlineArray },
    Insert { key: InlineArray, value: InlineArray },
    Reboot,
    Remove { key: InlineArray },
    Cas { key: InlineArray, old: Option<InlineArray>, new: Option<InlineArray> },
    Range { start: InlineArray, end: InlineArray },
}

fn keygen(
    u: &mut arbitrary::Unstructured<'_>,
) -> arbitrary::Result<InlineArray> {
    let key_i: u64 = u.int_in_range(0..=KEYSPACE)?;
    Ok(key_i.to_be_bytes().as_ref().into())
}

impl<'a> Arbitrary<'a> for Op {
    fn arbitrary(
        u: &mut arbitrary::Unstructured<'a>,
    ) -> arbitrary::Result<Self> {
        Ok(if u.ratio(1, 2)? {
            Op::Insert { key: keygen(u)?, value: keygen(u)? }
        } else if u.ratio(1, 2)? {
            Op::Get { key: keygen(u)? }
        } else if u.ratio(1, 2)? {
            Op::Reboot
        } else if u.ratio(1, 2)? {
            Op::Remove { key: keygen(u)? }
        } else if u.ratio(1, 2)? {
            Op::Cas {
                key: keygen(u)?,
                old: if u.ratio(1, 2)? { Some(keygen(u)?) } else { None },
                new: if u.ratio(1, 2)? { Some(keygen(u)?) } else { None },
            }
        } else {
            let start = u.int_in_range(0..=KEYSPACE)?;
            let end = (start + 1).max(u.int_in_range(0..=KEYSPACE)?);

            Op::Range {
                start: start.to_be_bytes().as_ref().into(),
                end: end.to_be_bytes().as_ref().into(),
            }
        })
    }
}

fuzz_target!(|ops: Vec<Op>| {
    let tmp_dir = tempfile::TempDir::new().unwrap();
    let tmp_path = tmp_dir.path().to_owned();
    let config = Config::new().path(tmp_path);

    let mut tree: Db = config.open().unwrap();
    let mut model = std::collections::BTreeMap::new();

    for op in ops {
        match op {
            Op::Insert { key, value } => {
                assert_eq!(
                    tree.insert(key.clone(), value.clone()).unwrap(),
                    model.insert(key, value)
                );
            }
            Op::Get { key } => {
                assert_eq!(tree.get(&key).unwrap(), model.get(&key).cloned());
            }
            Op::Reboot => {
                drop(tree);
                tree = config.open().unwrap();
            }
            Op::Remove { key } => {
                assert_eq!(tree.remove(&key).unwrap(), model.remove(&key));
            }
            Op::Range { start, end } => {
                let mut model_iter =
                    model.range::<InlineArray, _>(&start..&end);
                let mut tree_iter = tree.range(start..end);

                for (k1, v1) in &mut model_iter {
                    let (k2, v2) = tree_iter
                        .next()
                        .expect("None returned from iter when Some expected")
                        .expect("IO issue encountered");
                    assert_eq!((k1, v1), (&k2, &v2));
                }

                assert!(tree_iter.next().is_none());
            }
            Op::Cas { key, old, new } => {
                let succ = if old == model.get(&key).cloned() {
                    if let Some(n) = &new {
                        model.insert(key.clone(), n.clone());
                    } else {
                        model.remove(&key);
                    }
                    true
                } else {
                    false
                };

                let res = tree
                    .compare_and_swap(key, old.as_ref(), new)
                    .expect("hit IO error");

                if succ {
                    assert!(res.is_ok());
                } else {
                    assert!(res.is_err());
                }
            }
        };

        for (key, value) in &model {
            assert_eq!(tree.get(key).unwrap().unwrap(), value);
        }

        for kv_res in &tree {
            let (key, value) = kv_res.unwrap();
            assert_eq!(model.get(&key), Some(&value));
        }
    }

    let mut model_iter = model.iter();
    let mut tree_iter = tree.iter();

    for (k1, v1) in &mut model_iter {
        let (k2, v2) = tree_iter.next().unwrap().unwrap();
        assert_eq!((k1, v1), (&k2, &v2));
    }

    assert!(tree_iter.next().is_none());
});


================================================
FILE: scripts/cgtest.sh
================================================
#!/bin/sh
set -e

cgdelete memory:sledTest || true
cgcreate -g memory:sledTest
echo 100M > /sys/fs/cgroup/memory/sledTest/memory.limit_in_bytes

su $SUDO_USER -c 'cargo build --release --features=testing'

for test in target/release/deps/test*; do
  if [ -x "$test" ]
  then
    echo "running test: $test"
    cgexec -g memory:sledTest "$test" --test-threads=1
    rm "$test"
  fi
done


================================================
FILE: scripts/cross_compile.sh
================================================
#!/bin/sh
set -e

# checks sled's compatibility using several targets

targets="wasm32-wasi wasm32-unknown-unknown aarch64-fuchsia aarch64-linux-android \
         i686-linux-android i686-unknown-linux-gnu \
         x86_64-linux-android x86_64-fuchsia \
         mips-unknown-linux-musl aarch64-apple-ios"

rustup update --no-self-update

RUSTFLAGS="--cfg miri" cargo check

rustup toolchain install 1.62 --no-self-update
cargo clean
rm -f Cargo.lock
cargo +1.62 check

for target in $targets; do
  echo "setting up $target..."
  rustup target add $target
  echo "checking $target..."
  cargo check --target $target
done



================================================
FILE: scripts/execution_explorer.py
================================================
#!/usr/bin/gdb --command

"""
a simple python GDB script for running multithreaded
programs in a way that is "deterministic enough"
to tease out and replay interesting bugs.

Tyler Neely 25 Sept 2017
t@jujit.su

references:
    https://sourceware.org/gdb/onlinedocs/gdb/All_002dStop-Mode.html
    https://sourceware.org/gdb/onlinedocs/gdb/Non_002dStop-Mode.html
    https://sourceware.org/gdb/onlinedocs/gdb/Threads-In-Python.html
    https://sourceware.org/gdb/onlinedocs/gdb/Events-In-Python.html
    https://blog.0x972.info/index.php?tag=gdb.py
"""

import gdb
import random

###############################################################################
#                                   config                                    #
###############################################################################
# set this to a number for reproducing results or None to explore randomly
seed = 156112673742  # None  # 951931004895

# set this to the number of valid threads in the program
# {2, 3} assumes a main thread that waits on 2 workers.
# {1, ... N} assumes all of the first N threads are to be explored
threads_whitelist = {2, 3}

# set this to the file of the binary to explore
filename = "target/debug/binary"

# set this to the place the threads should rendezvous before exploring
entrypoint = "src/main.rs:8"

# set this to after the threads are done
exitpoint = "src/main.rs:12"

# invariant unreachable points that should never be accessed
unreachable = [
        "panic_unwind::imp::panic"
        ]

# set this to the locations you want to test interleavings for
interesting = [
        "src/main.rs:8",
        "src/main.rs:9"
        ]

# comment this out to stop echoing the specific commands issued to gdb
gdb.execute("set trace-commands on")

###############################################################################
###############################################################################


class UnreachableBreakpoint(gdb.Breakpoint):
    pass


class DoneBreakpoint(gdb.Breakpoint):
    pass


class InterestingBreakpoint(gdb.Breakpoint):
    pass


class DeterministicExecutor:
    def __init__(self, seed=None):
        if seed is not None:
            print("seeding with", seed)
            self.seed = seed
            random.seed(seed)
        else:
            # pick a random new seed if not provided with one
            self.reseed()

        gdb.execute("file " + filename)

        # non-stop is necessary to provide thread-specific
        # information when breakpoints are hit.
        gdb.execute("set non-stop on")
        gdb.execute("set confirm off")

        self.ready = set()
        self.finished = set()

    def reseed(self):
        random.seed()
        self.seed = random.randrange(10**12)
        print("reseeding with", self.seed)
        random.seed(self.seed)

    def restart(self):
        # reset inner state
        self.ready = set()
        self.finished = set()

        # disconnect callbacks
        gdb.events.stop.disconnect(self.scheduler_callback)
        gdb.events.exited.disconnect(self.exit_callback)

        # nuke all breakpoints
        gdb.execute("d")

        # end execution
        gdb.execute("k")

        # pick new seed
        self.reseed()

        self.run()

    def rendezvous_callback(self, event):
        try:
            self.ready.add(event.inferior_thread.num)
            if len(self.ready) == len(threads_whitelist):
                self.run_schedule()
        except Exception as e:
            # this will be thrown if breakpoint is not a part of event,
            # like when the event was stopped for another reason.
            print(e)

    def run(self):
        gdb.execute("b " + entrypoint)

        gdb.events.stop.connect(self.rendezvous_callback)
        gdb.events.exited.connect(self.exit_callback)

        gdb.execute("r")

    def run_schedule(self):
        print("running schedule")
        gdb.execute("d")
        gdb.events.stop.disconnect(self.rendezvous_callback)
        gdb.events.stop.connect(self.scheduler_callback)

        for bp in interesting:
            InterestingBreakpoint(bp)

        for bp in unreachable:
            UnreachableBreakpoint(bp)

        DoneBreakpoint(exitpoint)

        self.pick()

    def pick(self):
        threads = self.runnable_threads()
        if not threads:
            print("restarting execution after running out of valid threads")
            self.restart()
            return

        thread = random.choice(threads)

        gdb.execute("t " + str(thread.num))
        gdb.execute("c")

    def scheduler_callback(self, event):
        if not isinstance(event, gdb.BreakpointEvent):
            print("WTF sched callback got", event.__dict__)
            return

        if isinstance(event.breakpoint, DoneBreakpoint):
            self.finished.add(event.inferior_thread.num)
        elif isinstance(event.breakpoint, UnreachableBreakpoint):
            print("!" * 80)
            print("unreachable breakpoint triggered with seed", self.seed)
            print("!" * 80)
            gdb.events.exited.disconnect(self.exit_callback)
            gdb.execute("q")
        else:
            print("thread", event.inferior_thread.num,
                  "hit breakpoint at", event.breakpoint.location)

        self.pick()

    def runnable_threads(self):
        threads = gdb.selected_inferior().threads()

        def f(it):
            return (it.is_valid() and not
                    it.is_exited() and
                    it.num in threads_whitelist and
                    it.num not in self.finished)

        good_threads = [it for it in threads if f(it)]
        good_threads.sort(key=lambda it: it.num)

        return good_threads

    def exit_callback(self, event):
        try:
            if event.exit_code != 0:
                print("!" * 80)
                print("interesting exit with seed", self.seed)
                print("!" * 80)
            else:
                print("happy exit")
                self.restart()

            gdb.execute("q")
        except Exception as e:
            pass

de = DeterministicExecutor(seed)
de.run()


================================================
FILE: scripts/instructions
================================================
#!/bin/sh
# counts instructions for a standard workload
set -e

OUTFILE="cachegrind.stress2.`git describe --always --dirty`-`date +%s`"

rm -rf default.sled || true

cargo build \
  --bin=stress2 \
  --release


# --tool=callgrind --dump-instr=yes --collect-jumps=yes --simulate-cache=yes \
# --callgrind-out-file="$OUTFILE" \

valgrind \
  --tool=cachegrind \
  --cachegrind-out-file="$OUTFILE" \
  ./target/release/stress2 --total-ops=50000 --set-prop=1000000000000 --threads=1

LAST=`ls -t cachegrind.stress2.* | sed -n 2p`

echo "comparing $LAST with new $OUTFILE"

echo "--------------------------------------------------------------------------------"
echo "change since last run:"
echo "         Ir   I1mr  ILmr          Dr    D1mr    DLmr          Dw    D1mw    DLmw"
echo "--------------------------------------------------------------------------------"
cg_diff $LAST $OUTFILE | tail -1


================================================
FILE: scripts/sanitizers.sh
================================================
#!/bin/bash
set -eo pipefail

pushd benchmarks/stress2

rustup toolchain install nightly
rustup toolchain install nightly --component rust-src
rustup update

export SLED_LOCK_FREE_DELAY_INTENSITY=2000

echo "msan"
cargo clean
export RUSTFLAGS="-Zsanitizer=memory -Zsanitizer-memory-track-origins"
cargo +nightly build -Zbuild-std --target x86_64-unknown-linux-gnu
sudo rm -rf default.sled
sudo target/x86_64-unknown-linux-gnu/debug/stress2 --duration=30 --set-prop=100000000 --val-len=1000 --entries=100 --threads=100
sudo target/x86_64-unknown-linux-gnu/debug/stress2 --duration=30 --entries=100
sudo target/x86_64-unknown-linux-gnu/debug/stress2 --duration=30
unset MSAN_OPTIONS

echo "asan"
cargo clean
export RUSTFLAGS="-Z sanitizer=address"
export ASAN_OPTIONS="detect_odr_violation=0"
cargo +nightly build --features=lock_free_delays --target x86_64-unknown-linux-gnu
sudo rm -rf default.sled
sudo target/x86_64-unknown-linux-gnu/debug/stress2 --duration=60
sudo target/x86_64-unknown-linux-gnu/debug/stress2 --duration=6
unset ASAN_OPTIONS

echo "lsan"
cargo clean
export RUSTFLAGS="-Z sanitizer=leak"
cargo +nightly build --features=lock_free_delays --target x86_64-unknown-linux-gnu
sudo rm -rf default.sled
sudo target/x86_64-unknown-linux-gnu/debug/stress2 --duration=60
sudo target/x86_64-unknown-linux-gnu/debug/stress2 --duration=6

echo "tsan"
cargo clean
export RUSTFLAGS="-Z sanitizer=thread"
export TSAN_OPTIONS=suppressions=../../tsan_suppressions.txt
sudo rm -rf default.sled
cargo +nightly run --features=lock_free_delays --target x86_64-unknown-linux-gnu -- --duration=60
cargo +nightly run --features=lock_free_delays --target x86_64-unknown-linux-gnu -- --duration=6
unset RUSTFLAGS
unset TSAN_OPTIONS


================================================
FILE: scripts/shufnice.sh
================================================
#!/bin/sh

while true; do
  PID=`pgrep $1`
  TIDS=`ls /proc/$PID/task`
  TID=`echo $TIDS |  tr " " "\n" | shuf -n1`
  NICE=$((`shuf -i 0-39 -n 1` - 20))
  echo "renicing $TID to $NICE"
  renice -n $NICE -p $TID
done


================================================
FILE: scripts/ubuntu_bench
================================================
#!/bin/bash

sudo apt-get update
sudo apt-get install htop dstat build-essential linux-tools-common linux-tools-generic linux-tools-`uname -r`
curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env

cargo install flamegraph

git clone https://github.com/spacejam/sled.git
cd sled

cores=$(grep -c ^processor /proc/cpuinfo)
writers=$(( $cores / 5 + 1 ))
readers=$(( ($cores / 5 + 1) * 4 ))

cargo build --release --bin=stress2 --features=stress

# we use sudo here to get access to symbols
pushd benchmarks/stress2
cargo flamegraph --release -- --get=$readers --set=$writers


================================================
FILE: src/alloc.rs
================================================
#[cfg(any(
    feature = "testing-shred-allocator",
    feature = "testing-count-allocator"
))]
pub use alloc::*;

// the `testing-shred-allocator` feature overwrites memory with
// a specific non-zero byte, 0xa1 for freshly allocated
// (uninitialized) memory and 0xde for deallocated memory,
// in the hope that it will cause memory errors to surface
// more quickly.

#[cfg(feature = "testing-shred-allocator")]
mod alloc {
    use std::alloc::{Layout, System};

    #[global_allocator]
    static ALLOCATOR: ShredAllocator = ShredAllocator;

    #[derive(Default, Debug, Clone, Copy)]
    struct ShredAllocator;

    unsafe impl std::alloc::GlobalAlloc for ShredAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            let ret = System.alloc(layout);
            assert_ne!(ret, std::ptr::null_mut());
            std::ptr::write_bytes(ret, 0xa1, layout.size());
            ret
        }

        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            std::ptr::write_bytes(ptr, 0xde, layout.size());
            System.dealloc(ptr, layout)
        }
    }
}

#[cfg(feature = "testing-count-allocator")]
mod alloc {
    use std::alloc::{Layout, System};
    use std::sync::atomic::{AtomicUsize, Ordering};

    #[global_allocator]
    static ALLOCATOR: CountingAllocator = CountingAllocator;

    static ALLOCATED: AtomicUsize = AtomicUsize::new(0);
    static FREED: AtomicUsize = AtomicUsize::new(0);
    static RESIDENT: AtomicUsize = AtomicUsize::new(0);

    fn allocated() -> usize {
        ALLOCATED.swap(0, Ordering::Relaxed)
    }

    fn freed() -> usize {
        FREED.swap(0, Ordering::Relaxed)
    }

    fn resident() -> usize {
        RESIDENT.load(Ordering::Relaxed)
    }

    #[derive(Default, Debug, Clone, Copy)]
    struct CountingAllocator;

    unsafe impl std::alloc::GlobalAlloc for CountingAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            let ret = System.alloc(layout);
            assert_ne!(ret, std::ptr::null_mut());
            ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
            RESIDENT.fetch_add(layout.size(), Ordering::Relaxed);
            std::ptr::write_bytes(ret, 0xa1, layout.size());
            ret
        }

        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            std::ptr::write_bytes(ptr, 0xde, layout.size());
            FREED.fetch_add(layout.size(), Ordering::Relaxed);
            RESIDENT.fetch_sub(layout.size(), Ordering::Relaxed);
            System.dealloc(ptr, layout)
        }
    }
}


================================================
FILE: src/block_checker.rs
================================================
use std::collections::BTreeMap;
use std::panic::Location;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{LazyLock, Mutex};

static COUNTER: AtomicU64 = AtomicU64::new(0);
static CHECK_INS: LazyLock<BlockChecker> = LazyLock::new(|| {
    std::thread::spawn(move || {
        let mut last_top_10 = Default::default();
        loop {
            std::thread::sleep(std::time::Duration::from_secs(5));
            last_top_10 = CHECK_INS.report(last_top_10);
        }
    });

    BlockChecker::default()
});

type LocationMap = BTreeMap<u64, &'static Location<'static>>;

#[derive(Default)]
pub(crate) struct BlockChecker {
    state: Mutex<LocationMap>,
}

impl BlockChecker {
    fn report(&self, last_top_10: LocationMap) -> LocationMap {
        let state = self.state.lock().unwrap();
        println!("top 10 longest blocking sections:");

        let top_10: LocationMap =
            state.iter().take(10).map(|(k, v)| (*k, *v)).collect();

        for (id, location) in &top_10 {
            if last_top_10.contains_key(id) {
                println!("id: {}, location: {:?}", id, location);
            }
        }

        top_10
    }

    fn check_in(&self, location: &'static Location) -> BlockGuard {
        let next_id = COUNTER.fetch_add(1, Ordering::Relaxed);
        let mut state = self.state.lock().unwrap();
        state.insert(next_id, location);
        BlockGuard { id: next_id }
    }

    fn check_out(&self, id: u64) {
        let mut state = self.state.lock().unwrap();
        state.remove(&id);
    }
}

pub(crate) struct BlockGuard {
    id: u64,
}

impl Drop for BlockGuard {
    fn drop(&mut self) {
        CHECK_INS.check_out(self.id)
    }
}

#[track_caller]
pub(crate) fn track_blocks() -> BlockGuard {
    let caller = Location::caller();
    CHECK_INS.check_in(caller)
}


================================================
FILE: src/config.rs
================================================
use std::io;
use std::path::{Path, PathBuf};
use std::sync::Arc;

use fault_injection::{annotate, fallible};
use tempdir::TempDir;

use crate::Db;

macro_rules! builder {
    ($(($name:ident, $t:ty, $desc:expr)),*) => {
        $(
            #[doc=$desc]
            pub fn $name(mut self, to: $t) -> Self {
                self.$name = to;
                self
            }
        )*
    }
}

#[derive(Debug, Clone)]
pub struct Config {
    /// The base directory for storing the database.
    pub path: PathBuf,
    /// Cache size in **bytes**. Default is 512mb.
    pub cache_capacity_bytes: usize,
    /// The percentage of the cache that is dedicated to the
    /// scan-resistant entry cache.
    pub entry_cache_percent: u8,
    /// Start a background thread that flushes data to disk
    /// every few milliseconds. Defaults to every 200ms.
    pub flush_every_ms: Option<usize>,
    /// The zstd compression level to use when writing data to disk. Defaults to 3.
    pub zstd_compression_level: i32,
    /// This is only set to `Some` for objects created via
    /// `Config::tmp`, and will remove the storage directory
    /// when the final Arc drops.
    pub tempdir_deleter: Option<Arc<TempDir>>,
    /// A float between 0.0 and 1.0 that controls how much fragmentation can
    /// exist in a file before GC attempts to recompact it.
    pub target_heap_file_fill_ratio: f32,
    /// Values larger than this threshold will be stored as a separate blob.
    pub max_inline_value_threshold: usize,
}

impl Default for Config {
    fn default() -> Config {
        Config {
            path: "bloodstone.default".into(),
            flush_every_ms: Some(200),
            cache_capacity_bytes: 512 * 1024 * 1024,
            entry_cache_percent: 20,
            zstd_compression_level: 3,
            tempdir_deleter: None,
            target_heap_file_fill_ratio: 0.9,
            max_inline_value_threshold: 4096,
        }
    }
}

impl Config {
    /// Returns a default `Config`
    pub fn new() -> Config {
        Config::default()
    }

    /// Returns a config with the `path` initialized to a system
    /// temporary directory that will be deleted when the last clone
    /// of this `Config` is dropped.
    pub fn tmp() -> io::Result<Config> {
        let tempdir = fallible!(tempdir::TempDir::new("sled_tmp"));

        Ok(Config {
            path: tempdir.path().into(),
            tempdir_deleter: Some(Arc::new(tempdir)),
            ..Config::default()
        })
    }

    /// Set the path of the database (builder).
    pub fn path<P: AsRef<Path>>(mut self, path: P) -> Config {
        self.path = path.as_ref().to_path_buf();
        self
    }

    builder!(
        (flush_every_ms, Option<usize>, "Start a background thread that flushes data to disk every few milliseconds. Defaults to every 200ms."),
        (cache_capacity_bytes, usize, "Cache size in **bytes**. Default is 512mb."),
        (entry_cache_percent, u8, "The percentage of the cache that is dedicated to the scan-resistant entry cache."),
        (zstd_compression_level, i32, "The zstd compression level to use when writing data to disk. Defaults to 3."),
        (target_heap_file_fill_ratio, f32, "A float between 0.0 and 1.0 that controls how much fragmentation can exist in a file before GC attempts to recompact it."),
        (max_inline_value_threshold, usize, "Values larger than this threshold will be stored as a separate blob.")
    );

    pub fn open<const LEAF_FANOUT: usize>(
        &self,
    ) -> io::Result<Db<LEAF_FANOUT>> {
        if LEAF_FANOUT < 3 {
            return Err(annotate!(io::Error::new(
                io::ErrorKind::Unsupported,
                "Db's LEAF_FANOUT const generic must be 3 or greater."
            )));
        }
        Db::open_with_config(self)
    }
}


================================================
FILE: src/db.rs
================================================
use std::collections::HashMap;
use std::fmt;
use std::io;
use std::sync::{Arc, mpsc};
use std::time::{Duration, Instant};

use parking_lot::Mutex;

use crate::*;

/// sled 1.0 alpha :)
///
/// One of the main differences between this and sled 0.34 is that
/// `Db` and `Tree` now have a `LEAF_FANOUT` const generic parameter.
/// This parameter is an interesting single-knob performance tunable
/// that allows users to traverse the performance-vs-efficiency
/// trade-off spectrum. The default value of `1024` causes keys and
/// values to be more efficiently compressed when stored on disk,
/// but for larger-than-memory random workloads it may be advantageous
/// to lower `LEAF_FANOUT` to between `16` and `256`, depending on your
/// efficiency requirements. A lower value will also cause contention
/// to be reduced for frequently accessed data. This value cannot be
/// changed after creating the database.
///
/// As an alpha release, please do not expect this to be safe for
/// business-critical use cases. However, if you would like this to
/// serve your business-critical use cases over time, please give it
/// a shot in a low-risk non-production environment and report any
/// issues you encounter in a github issue.
///
/// Note that `Db` implements `Deref` for the default `Tree` (sled's
/// version of namespaces / keyspaces / buckets), but you can create
/// and use others using `Db::open_tree`.
#[derive(Clone)]
pub struct Db<const LEAF_FANOUT: usize = 1024> {
    config: Config,
    _shutdown_dropper: Arc<ShutdownDropper<LEAF_FANOUT>>,
    cache: ObjectCache<LEAF_FANOUT>,
    trees: Arc<Mutex<HashMap<CollectionId, Tree<LEAF_FANOUT>>>>,
    collection_id_allocator: Arc<Allocator>,
    collection_name_mapping: Tree<LEAF_FANOUT>,
    default_tree: Tree<LEAF_FANOUT>,
    was_recovered: bool,
}

impl<const LEAF_FANOUT: usize> std::ops::Deref for Db<LEAF_FANOUT> {
    type Target = Tree<LEAF_FANOUT>;
    fn deref(&self) -> &Tree<LEAF_FANOUT> {
        &self.default_tree
    }
}

impl<const LEAF_FANOUT: usize> IntoIterator for &Db<LEAF_FANOUT> {
    type Item = io::Result<(InlineArray, InlineArray)>;
    type IntoIter = crate::Iter<LEAF_FANOUT>;

    fn into_iter(self) -> Self::IntoIter {
        self.iter()
    }
}

impl<const LEAF_FANOUT: usize> fmt::Debug for Db<LEAF_FANOUT> {
    fn fmt(&self, w: &mut fmt::Formatter<'_>) -> fmt::Result {
        let alternate = w.alternate();

        let mut debug_struct = w.debug_struct(&format!("Db<{}>", LEAF_FANOUT));

        if alternate {
            debug_struct
                .field("global_error", &self.check_error())
                .field(
                    "data",
                    &format!("{:?}", self.iter().collect::<Vec<_>>()),
                )
                .finish()
        } else {
            debug_struct.field("global_error", &self.check_error()).finish()
        }
    }
}

fn flusher<const LEAF_FANOUT: usize>(
    cache: ObjectCache<LEAF_FANOUT>,
    shutdown_signal: mpsc::Receiver<mpsc::Sender<()>>,
    flush_every_ms: usize,
) {
    let interval = Duration::from_millis(flush_every_ms as _);
    let mut last_flush_duration = Duration::default();

    let flush = || {
        let flush_res_res = std::panic::catch_unwind(|| cache.flush());
        match flush_res_res {
            Ok(Ok(_)) => {
                // don't abort.
                return;
            }
            Ok(Err(flush_failure)) => {
                log::error!(
                    "Db flusher encountered error while flushing: {:?}",
                    flush_failure
                );
                cache.set_error(&flush_failure);
            }
            Err(panicked) => {
                log::error!(
                    "Db flusher panicked while flushing: {:?}",
                    panicked
                );
                cache.set_error(&io::Error::other(
                    "Db flusher panicked while flushing".to_string(),
                ));
            }
        }
        std::process::abort();
    };

    loop {
        let recv_timeout = interval
            .saturating_sub(last_flush_duration)
            .max(Duration::from_millis(1));
        if let Ok(shutdown_sender) = shutdown_signal.recv_timeout(recv_timeout)
        {
            flush();

            // this is probably unnecessary but it will avoid issues
            // if egregious bugs get introduced that trigger it
            cache.set_error(&io::Error::other(
                "system has been shut down".to_string(),
            ));

            assert!(cache.is_clean());

            drop(cache);

            if let Err(e) = shutdown_sender.send(()) {
                log::error!(
                    "Db flusher could not ack shutdown to requestor: {e:?}"
                );
            }
            log::debug!(
                "flush thread terminating after signalling to requestor"
            );
            return;
        }

        let before_flush = Instant::now();

        flush();

        last_flush_duration = before_flush.elapsed();
    }
}
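
// A minimal sketch of the wait computation used by the `flusher` loop
// above, pulled out so it can be read in isolation. The name
// `flusher_wait_sketch` is hypothetical and exists only for
// illustration; `flusher` does not call it, it only mirrors the
// `saturating_sub` + `max` expression.
#[allow(dead_code)]
fn flusher_wait_sketch(
    interval: std::time::Duration,
    last_flush_duration: std::time::Duration,
) -> std::time::Duration {
    // Subtract the time the previous flush took, so that flushes start
    // roughly once per `interval`, but never wait less than 1ms, so that
    // a flush slower than `interval` cannot turn the loop into a busy spin.
    interval
        .saturating_sub(last_flush_duration)
        .max(std::time::Duration::from_millis(1))
}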

impl<const LEAF_FANOUT: usize> Drop for Db<LEAF_FANOUT> {
    fn drop(&mut self) {
        if self.config.flush_every_ms.is_none() {
            if let Err(e) = self.flush() {
                log::error!("failed to flush Db on Drop: {e:?}");
            }
        } else {
            // otherwise, it is expected that the flusher thread will
            // flush while shutting down the final Db/Tree instance
        }
    }
}

impl<const LEAF_FANOUT: usize> Db<LEAF_FANOUT> {
    #[cfg(feature = "for-internal-testing-only")]
    fn validate(&self) -> io::Result<()> {
        // for each tree, iterate over index, read node and assert low key matches
        // and assert first time we've ever seen node ID

        let mut ever_seen = std::collections::HashSet::new();
        let before = std::time::Instant::now();

        #[cfg(feature = "for-internal-testing-only")]
        let _b0 = crate::block_checker::track_blocks();

        for (_cid, tree) in self.trees.lock().iter() {
            let mut hi_none_count = 0;
            let mut last_hi = None;
            for (low, node) in tree.index.iter() {
                // ensure we haven't reused the object_id across Trees
                assert!(ever_seen.insert(node.object_id));

                let (read_low, node_mu, read_node) =
                    tree.page_in(&low, self.cache.current_flush_epoch())?;

                assert_eq!(read_node.object_id, node.object_id);
                assert_eq!(node_mu.leaf.as_ref().unwrap().lo, low);
                assert_eq!(read_low, low);

                if let Some(hi) = &last_hi {
                    assert_eq!(hi, &node_mu.leaf.as_ref().unwrap().lo);
                }

                if let Some(hi) = &node_mu.leaf.as_ref().unwrap().hi {
                    last_hi = Some(hi.clone());
                } else {
                    assert_eq!(hi_none_count, 0);
                    hi_none_count += 1;
                }
            }
            // each tree should have exactly one leaf with no max hi key
            assert_eq!(hi_none_count, 1);
        }

        log::debug!(
            "{} leaves looking good after {} micros",
            ever_seen.len(),
            before.elapsed().as_micros()
        );

        Ok(())
    }

    pub fn stats(&self) -> Stats {
        Stats { cache: self.cache.stats() }
    }

    pub fn size_on_disk(&self) -> io::Result<u64> {
        use std::fs::read_dir;

        fn recurse(mut dir: std::fs::ReadDir) -> io::Result<u64> {
            dir.try_fold(0, |acc, file| {
                let file = file?;
                let size = match file.metadata()? {
                    data if data.is_dir() => recurse(read_dir(file.path())?)?,
                    data => data.len(),
                };
                Ok(acc + size)
            })
        }

        recurse(read_dir(&self.cache.config.path)?)
    }

    /// Returns `true` if the database was
    /// recovered from a previous process.
    /// Note that database state is only
    /// guaranteed to be present up to the
    /// last call to `flush`! Otherwise state
    /// is synced to disk periodically if the
    /// `Config.flush_every_ms` configuration option
    /// is set to `Some(number_of_ms_between_flushes)`
    /// or if the IO buffer gets filled to
    /// capacity before being rotated.
    pub fn was_recovered(&self) -> bool {
        self.was_recovered
    }

    pub fn open_with_config(config: &Config) -> io::Result<Db<LEAF_FANOUT>> {
        let (shutdown_tx, shutdown_rx) = mpsc::channel();

        let (cache, indices, was_recovered) = ObjectCache::recover(config)?;

        let _shutdown_dropper = Arc::new(ShutdownDropper {
            shutdown_sender: Mutex::new(shutdown_tx),
            cache: Mutex::new(cache.clone()),
        });

        let mut allocated_collection_ids = fnv::FnvHashSet::default();

        let mut trees: HashMap<CollectionId, Tree<LEAF_FANOUT>> = indices
            .into_iter()
            .map(|(collection_id, index)| {
                assert!(
                    allocated_collection_ids.insert(collection_id.0),
                    "allocated_collection_ids already contained {:?}",
                    collection_id
                );
                (
                    collection_id,
                    Tree::new(
                        collection_id,
                        cache.clone(),
                        index,
                        _shutdown_dropper.clone(),
                    ),
                )
            })
            .collect();

        let collection_name_mapping =
            trees.get(&NAME_MAPPING_COLLECTION_ID).unwrap().clone();

        let default_tree = trees.get(&DEFAULT_COLLECTION_ID).unwrap().clone();

        for kv_res in collection_name_mapping.iter() {
            let (_collection_name, collection_id_buf) = kv_res.unwrap();
            let collection_id = CollectionId(u64::from_le_bytes(
                collection_id_buf.as_ref().try_into().unwrap(),
            ));

            if trees.contains_key(&collection_id) {
                continue;
            }

            // need to initialize tree leaf for empty collection

            assert!(
                allocated_collection_ids.insert(collection_id.0),
                "allocated_collection_ids already contained {:?}",
                collection_id
            );

            let initial_low_key = InlineArray::default();

            let empty_node = cache.allocate_default_node(collection_id);

            let index = Index::default();

            assert!(index.insert(initial_low_key, empty_node).is_none());

            let tree = Tree::new(
                collection_id,
                cache.clone(),
                index,
                _shutdown_dropper.clone(),
            );

            trees.insert(collection_id, tree);
        }

        let collection_id_allocator =
            Arc::new(Allocator::from_allocated(&allocated_collection_ids));

        assert_eq!(collection_name_mapping.len()? + 2, trees.len());

        let ret = Db {
            config: config.clone(),
            cache: cache.clone(),
            default_tree,
            collection_name_mapping,
            collection_id_allocator,
            trees: Arc::new(Mutex::new(trees)),
            _shutdown_dropper,
            was_recovered,
        };

        #[cfg(feature = "for-internal-testing-only")]
        ret.validate()?;

        if let Some(flush_every_ms) = ret.cache.config.flush_every_ms {
            let spawn_res = std::thread::Builder::new()
                .name("sled_flusher".into())
                .spawn(move || flusher(cache, shutdown_rx, flush_every_ms));

            if let Err(e) = spawn_res {
                return Err(io::Error::other(format!(
                    "unable to spawn flusher thread for sled database: {:?}",
                    e
                )));
            }
        }
        Ok(ret)
    }

    /// A database export method for all collections in the `Db`,
    /// for use in sled version upgrades. Can be used in combination
    /// with the `import` method below on a database running a later
    /// version.
    ///
    /// # Panics
    ///
    /// Panics if any IO problems occur while trying
    /// to perform the export.
    ///
    /// # Examples
    ///
    /// If you want to migrate from one version of sled
    /// to another, you need to pull in both versions
    /// by using version renaming:
    ///
    /// `Cargo.toml`:
    ///
    /// ```toml
    /// [dependencies]
    /// sled = "0.32"
    /// old_sled = { version = "0.31", package = "sled" }
    /// ```
    ///
    /// and in your code, remember that old versions of
    /// sled might have a different way to open them
    /// than the current `sled::open` method:
    ///
    /// ```
    /// # use sled as old_sled;
    /// # fn main() -> Result<(), Box<dyn std::error::Error>> {
    /// let old = old_sled::open("my_old_db_export")?;
    ///
    /// // may be a different version of sled,
    /// // the export type is version agnostic.
    /// let new = sled::open("my_new_db_export")?;
    ///
    /// let export = old.export();
    /// new.import(export);
    ///
    /// assert_eq!(old.checksum()?, new.checksum()?);
    /// # drop(old);
    /// # drop(new);
    /// # let _ = std::fs::remove_dir_all("my_old_db_export");
    /// # let _ = std::fs::remove_dir_all("my_new_db_export");
    /// # Ok(()) }
    /// ```
    pub fn export(
        &self,
    ) -> Vec<(
        CollectionType,
        CollectionName,
        impl Iterator<Item = Vec<Vec<u8>>> + '_,
    )> {
        let trees = self.trees.lock();

        let mut ret = vec![];

        for kv_res in self.collection_name_mapping.iter() {
            let (collection_name, collection_id_buf) = kv_res.unwrap();
            let collection_id = CollectionId(u64::from_le_bytes(
                collection_id_buf.as_ref().try_into().unwrap(),
            ));
            let tree = trees.get(&collection_id).unwrap().clone();

            ret.push((
                b"tree".to_vec(),
                collection_name.to_vec(),
                tree.iter().map(|kv_opt| {
                    let kv = kv_opt.unwrap();
                    vec![kv.0.to_vec(), kv.1.to_vec()]
                }),
            ));
        }

        ret
    }

    /// Imports the collections from a previous database.
    ///
    /// # Panics
    ///
    /// Panics if any IO problems occur while trying
    /// to perform the import.
    ///
    /// # Examples
    ///
    /// If you want to migrate from one version of sled
    /// to another, you need to pull in both versions
    /// by using version renaming:
    ///
    /// `Cargo.toml`:
    ///
    /// ```toml
    /// [dependencies]
    /// sled = "0.32"
    /// old_sled = { version = "0.31", package = "sled" }
    /// ```
    ///
    /// and in your code, remember that old versions of
    /// sled might have a different way to open them
    /// than the current `sled::open` method:
    ///
    /// ```
    /// # use sled as old_sled;
    /// # fn main() -> Result<(), Box<dyn std::error::Error>> {
    /// let old = old_sled::open("my_old_db_import")?;
    ///
    /// // may be a different version of sled,
    /// // the export type is version agnostic.
    /// let new = sled::open("my_new_db_import")?;
    ///
    /// let export = old.export();
    /// new.import(export);
    ///
    /// assert_eq!(old.checksum()?, new.checksum()?);
    /// # drop(old);
    /// # drop(new);
    /// # let _ = std::fs::remove_dir_all("my_old_db_import");
    /// # let _ = std::fs::remove_dir_all("my_new_db_import");
    /// # Ok(()) }
    /// ```
    pub fn import(
        &self,
        export: Vec<(
            CollectionType,
            CollectionName,
            impl Iterator<Item = Vec<Vec<u8>>>,
        )>,
    ) {
        for (collection_type, collection_name, collection_iter) in export {
            match collection_type {
                ref t if t == b"tree" => {
                    let tree = self
                        .open_tree(collection_name)
                        .expect("failed to open new tree during import");
                    for mut kv in collection_iter {
                        let v = kv
                            .pop()
                            .expect("failed to get value from tree export");
                        let k = kv
                            .pop()
                            .expect("failed to get key from tree export");
                        let old = tree.insert(k, v).expect(
                            "failed to insert value during tree import",
                        );
                        assert!(
                            old.is_none(),
                            "import is overwriting existing data"
                        );
                    }
                }
                other => panic!("unknown collection type {:?}", other),
            }
        }
    }

    pub fn contains_tree<V: AsRef<[u8]>>(&self, name: V) -> io::Result<bool> {
        Ok(self.collection_name_mapping.get(name.as_ref())?.is_some())
    }

    pub fn drop_tree<V: AsRef<[u8]>>(&self, name: V) -> io::Result<bool> {
        let name_ref = name.as_ref();
        let trees = self.trees.lock();

        let tree = if let Some(collection_id_buf) =
            self.collection_name_mapping.get(name_ref)?
        {
            let collection_id = CollectionId(u64::from_le_bytes(
                collection_id_buf.as_ref().try_into().unwrap(),
            ));

            trees.get(&collection_id).unwrap()
        } else {
            return Ok(false);
        };

        tree.clear()?;

        self.collection_name_mapping.remove(name_ref)?;

        Ok(true)
    }

    /// Open or create a new disk-backed [`Tree`] with its own keyspace,
    /// accessible from the `Db` via the provided identifier.
    pub fn open_tree<V: AsRef<[u8]>>(
        &self,
        name: V,
    ) -> io::Result<Tree<LEAF_FANOUT>> {
        let name_ref = name.as_ref();
        let mut trees = self.trees.lock();

        if let Some(collection_id_buf) =
            self.collection_name_mapping.get(name_ref)?
        {
            let collection_id = CollectionId(u64::from_le_bytes(
                collection_id_buf.as_ref().try_into().unwrap(),
            ));

            let tree = trees.get(&collection_id).unwrap();

            return Ok(tree.clone());
        }

        let collection_id =
            CollectionId(self.collection_id_allocator.allocate());

        let initial_low_key = InlineArray::default();

        let empty_node = self.cache.allocate_default_node(collection_id);

        let index = Index::default();

        assert!(index.insert(initial_low_key, empty_node).is_none());

        let tree = Tree::new(
            collection_id,
            self.cache.clone(),
            index,
            self._shutdown_dropper.clone(),
        );

        self.collection_name_mapping
            .insert(name_ref, &collection_id.0.to_le_bytes())?;

        trees.insert(collection_id, tree.clone());

        Ok(tree)
    }
}
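
// A minimal sketch of how a collection ID is stored in the name-mapping
// tree above: the `u64` is written as 8 little-endian bytes and decoded
// with `u64::from_le_bytes` on the way back. Both function names are
// hypothetical and exist only for illustration; nothing above calls them.
#[allow(dead_code)]
fn encode_collection_id_sketch(id: u64) -> [u8; 8] {
    id.to_le_bytes()
}

#[allow(dead_code)]
fn decode_collection_id_sketch(buf: &[u8]) -> Option<u64> {
    // The code above unwraps the slice-to-array conversion; this sketch
    // returns `None` for a buffer that is not exactly 8 bytes long.
    if buf.len() != 8 {
        return None;
    }
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(buf);
    Some(u64::from_le_bytes(bytes))
}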

/// These types provide the information that allows an entire
/// system to be exported and imported to facilitate
/// major upgrades. They are composed entirely
/// of standard library types to be forward compatible.
/// NB: these definitions are expensive to change, because
/// they impact the migration path.
type CollectionType = Vec<u8>;
type CollectionName = Vec<u8>;


================================================
FILE: src/event_verifier.rs
================================================
use std::collections::BTreeMap;
use std::sync::Mutex;

use crate::{FlushEpoch, ObjectId};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub(crate) enum State {
    Unallocated,
    Dirty,
    CooperativelySerialized,
    AddedToWriteBatch,
    Flushed,
    CleanPagedIn,
    PagedOut,
}

impl State {
    fn can_transition_within_epoch_to(&self, next: State) -> bool {
        match (self, next) {
            (State::Flushed, State::PagedOut) => true,
            (State::Flushed, _) => false,
            (State::AddedToWriteBatch, State::Flushed) => true,
            (State::AddedToWriteBatch, _) => false,
            (State::CleanPagedIn, State::AddedToWriteBatch) => false,
            (State::CleanPagedIn, State::Flushed) => false,
            (State::Dirty, State::AddedToWriteBatch) => true,
            (State::CooperativelySerialized, State::AddedToWriteBatch) => true,
            (State::CooperativelySerialized, _) => false,
            (State::Unallocated, State::AddedToWriteBatch) => true,
            (State::Unallocated, _) => false,
            (State::Dirty, State::Dirty) => true,
            (State::Dirty, State::CooperativelySerialized) => true,
            (State::Dirty, State::Unallocated) => true,
            (State::Dirty, _) => false,
            (State::CleanPagedIn, State::Dirty) => true,
            (State::CleanPagedIn, State::PagedOut) => true,
            (State::CleanPagedIn, State::CleanPagedIn) => true,
            (State::CleanPagedIn, State::Unallocated) => true,
            (State::CleanPagedIn, State::CooperativelySerialized) => true,
            (State::PagedOut, State::CleanPagedIn) => true,
            (State::PagedOut, _) => false,
        }
    }

    fn needs_flush(&self) -> bool {
        match self {
            State::CleanPagedIn => false,
            State::Flushed => false,
            State::PagedOut => false,
            _ => true,
        }
    }
}

#[derive(Debug, Default)]
pub(crate) struct EventVerifier {
    flush_model:
        Mutex<BTreeMap<(ObjectId, FlushEpoch), Vec<(State, &'static str)>>>,
}

impl Drop for EventVerifier {
    fn drop(&mut self) {
        // assert that nothing is currently Dirty
        let flush_model = self.flush_model.lock().unwrap();
        for ((oid, _epoch), history) in flush_model.iter() {
            if let Some((last_state, _at)) = history.last() {
                assert_ne!(
                    *last_state,
                    State::Dirty,
                    "{oid:?} is Dirty when system shutting down"
                );
            }
        }
    }
}

impl EventVerifier {
    pub(crate) fn mark(
        &self,
        object_id: ObjectId,
        epoch: FlushEpoch,
        state: State,
        at: &'static str,
    ) {
        if matches!(state, State::PagedOut) {
            let dirty_epochs = self.dirty_epochs(object_id);
            if !dirty_epochs.is_empty() {
                println!("{object_id:?} was paged out while having dirty epochs {dirty_epochs:?}");
                self.print_debug_history_for_object(object_id);
                println!("{state:?} {epoch:?} {at}");
                println!("invalid object state transition");
                std::process::abort();
            }
        }

        let mut flush_model = self.flush_model.lock().unwrap();
        let history = flush_model.entry((object_id, epoch)).or_default();

        if let Some((last_state, _at)) = history.last() {
            if !last_state.can_transition_within_epoch_to(state) {
                println!(
                    "object_id {object_id:?} performed \
                    illegal state transition from {last_state:?} \
                    to {state:?} at {at} in epoch {epoch:?}."
                );

                println!("history:");
                history.push((state, at));

                let active_epochs = flush_model.range(
                    (object_id, FlushEpoch::MIN)..=(object_id, FlushEpoch::MAX),
                );
                for ((_oid, epoch), history) in active_epochs {
                    for (last_state, at) in history {
                        println!("{last_state:?} {epoch:?} {at}");
                    }
                }

                println!("invalid object state transition");

                std::process::abort();
            }
        }
        history.push((state, at));
    }

    /// Returns the FlushEpochs for which this ObjectId has unflushed
    /// dirty data.
    fn dirty_epochs(&self, object_id: ObjectId) -> Vec<FlushEpoch> {
        let mut dirty_epochs = vec![];
        let flush_model = self.flush_model.lock().unwrap();

        let active_epochs = flush_model
            .range((object_id, FlushEpoch::MIN)..=(object_id, FlushEpoch::MAX));

        for ((_oid, epoch), history) in active_epochs {
            let (last_state, _at) = history.last().unwrap();
            if last_state.needs_flush() {
                dirty_epochs.push(*epoch);
            }
        }

        dirty_epochs
    }

    pub(crate) fn print_debug_history_for_object(&self, object_id: ObjectId) {
        let flush_model = self.flush_model.lock().unwrap();
        println!("history for object {:?}:", object_id);
        let active_epochs = flush_model
            .range((object_id, FlushEpoch::MIN)..=(object_id, FlushEpoch::MAX));
        for ((_oid, epoch), history) in active_epochs {
            for (last_state, at) in history {
                println!("{last_state:?} {epoch:?} {at}");
            }
        }
    }
}


================================================
FILE: src/flush_epoch.rs
================================================
use std::num::NonZeroU64;
use std::sync::atomic::{AtomicPtr, AtomicU64, Ordering};
use std::sync::{Arc, Condvar, Mutex};

const SEAL_BIT: u64 = 1 << 63;
const SEAL_MASK: u64 = u64::MAX - SEAL_BIT;
const MIN_EPOCH: u64 = 2;

#[derive(
    Debug,
    Clone,
    Copy,
    serde::Serialize,
    serde::Deserialize,
    PartialOrd,
    Ord,
    PartialEq,
    Eq,
    Hash,
)]
pub struct FlushEpoch(NonZeroU64);

impl FlushEpoch {
    pub const MIN: FlushEpoch = FlushEpoch(NonZeroU64::MIN);
    #[allow(unused)]
    pub const MAX: FlushEpoch = FlushEpoch(NonZeroU64::MAX);

    pub fn increment(&self) -> FlushEpoch {
        FlushEpoch(NonZeroU64::new(self.0.get() + 1).unwrap())
    }

    pub fn get(&self) -> u64 {
        self.0.get()
    }
}

impl concurrent_map::Minimum for FlushEpoch {
    const MIN: FlushEpoch = FlushEpoch::MIN;
}

#[derive(Debug)]
pub(crate) struct FlushInvariants {
    max_flushed_epoch: AtomicU64,
    max_flushing_epoch: AtomicU64,
}

impl Default for FlushInvariants {
    fn default() -> FlushInvariants {
        FlushInvariants {
            max_flushed_epoch: (MIN_EPOCH - 1).into(),
            max_flushing_epoch: (MIN_EPOCH - 1).into(),
        }
    }
}

impl FlushInvariants {
    pub(crate) fn mark_flushed_epoch(&self, epoch: FlushEpoch) {
        let last = self.max_flushed_epoch.swap(epoch.get(), Ordering::SeqCst);

        assert_eq!(last + 1, epoch.get());
    }

    pub(crate) fn mark_flushing_epoch(&self, epoch: FlushEpoch) {
        let last = self.max_flushing_epoch.swap(epoch.get(), Ordering::SeqCst);

        assert_eq!(last + 1, epoch.get());
    }
}

#[derive(Clone, Debug)]
pub(crate) struct Completion {
    mu: Arc<Mutex<bool>>,
    cv: Arc<Condvar>,
    epoch: FlushEpoch,
}

impl Completion {
    pub fn epoch(&self) -> FlushEpoch {
        self.epoch
    }

    pub fn new(epoch: FlushEpoch) -> Completion {
        Completion { mu: Default::default(), cv: Default::default(), epoch }
    }

    pub fn wait_for_complete(self) -> FlushEpoch {
        let mut mu = self.mu.lock().unwrap();
        while !*mu {
            mu = self.cv.wait(mu).unwrap();
        }

        self.epoch
    }

    pub fn mark_complete(self) {
        self.mark_complete_inner(false);
    }

    fn mark_complete_inner(&self, previously_sealed: bool) {
        let mut mu = self.mu.lock().unwrap();
        if !previously_sealed {
            // TODO reevaluate - assert!(!*mu);
        }
        log::trace!("marking epoch {:?} as complete", self.epoch);
        // it's possible for *mu to already be true due to this being
        // immediately dropped in the check_in method when we see that
        // the checked-in epoch has already been marked as sealed.
        *mu = true;
        drop(mu);
        self.cv.notify_all();
    }

    #[cfg(test)]
    pub fn is_complete(&self) -> bool {
        *self.mu.lock().unwrap()
    }
}

pub struct FlushEpochGuard<'a> {
    tracker: &'a EpochTracker,
    previously_sealed: bool,
}

impl Drop for FlushEpochGuard<'_> {
    fn drop(&mut self) {
        let rc = self.tracker.rc.fetch_sub(1, Ordering::SeqCst) - 1;
        if (rc & SEAL_MASK) == 0 && (rc & SEAL_BIT) == SEAL_BIT {
            crate::debug_delay();
            self.tracker
                .vacancy_notifier
                .mark_complete_inner(self.previously_sealed);
        }
    }
}

impl FlushEpochGuard<'_> {
    pub fn epoch(&self) -> FlushEpoch {
        self.tracker.epoch
    }
}

#[derive(Debug)]
pub(crate) struct EpochTracker {
    epoch: FlushEpoch,
    rc: AtomicU64,
    vacancy_notifier: Completion,
    previous_flush_complete: Completion,
}

#[derive(Clone, Debug)]
pub(crate) struct FlushEpochTracker {
    active_ebr: ebr::Ebr<Box<EpochTracker>, 16, 16>,
    inner: Arc<FlushEpochInner>,
}

#[derive(Debug)]
pub(crate) struct FlushEpochInner {
    counter: AtomicU64,
    roll_mu: Mutex<()>,
    current_active: AtomicPtr<EpochTracker>,
}

impl Drop for FlushEpochInner {
    fn drop(&mut self) {
        let vacancy_mu = self.roll_mu.lock().unwrap();
        let old_ptr =
            self.current_active.swap(std::ptr::null_mut(), Ordering::SeqCst);
        if !old_ptr.is_null() {
            //let old: &EpochTracker = &*old_ptr;
            unsafe { drop(Box::from_raw(old_ptr)) }
        }
        drop(vacancy_mu);
    }
}

impl Default for FlushEpochTracker {
    fn default() -> FlushEpochTracker {
        let last = Completion::new(FlushEpoch(NonZeroU64::new(1).unwrap()));
        let current_active_ptr = Box::into_raw(Box::new(EpochTracker {
            epoch: FlushEpoch(NonZeroU64::new(MIN_EPOCH).unwrap()),
            rc: AtomicU64::new(0),
            vacancy_notifier: Completion::new(FlushEpoch(
                NonZeroU64::new(MIN_EPOCH).unwrap(),
            )),
            previous_flush_complete: last.clone(),
        }));

        last.mark_complete();

        let current_active = AtomicPtr::new(current_active_ptr);

        FlushEpochTracker {
            inner: Arc::new(FlushEpochInner {
                counter: AtomicU64::new(MIN_EPOCH),
                roll_mu: Mutex::new(()),
                current_active,
            }),
            active_ebr: ebr::Ebr::default(),
        }
    }
}

impl FlushEpochTracker {
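    /// Seals the currently active epoch and installs a new one.
    ///
    /// Returns three notifiers, in order: the completion for the flush
    /// that precedes the sealed epoch, the vacancy notifier for the
    /// sealed epoch (completed once all of its guards have been dropped),
    /// and the completion for the flush of the sealed epoch itself.
    /// The last one is intended to be passed to a flusher that can
    /// eventually notify the flush-requesting thread.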
    pub fn roll_epoch_forward(&self) -> (Completion, Completion, Completion) {
        let mut tracker_guard = self.active_ebr.pin();

        let vacancy_mu = self.inner.roll_mu.lock().unwrap();

        let flush_through = self.inner.counter.fetch_add(1, Ordering::SeqCst);

        let flush_through_epoch =
            FlushEpoch(NonZeroU64::new(flush_through).unwrap());

        let new_epoch = flush_through_epoch.increment();

        let forward_flush_notifier = Completion::new(flush_through_epoch);

        let new_active = Box::into_raw(Box::new(EpochTracker {
            epoch: new_epoch,
            rc: AtomicU64::new(0),
            vacancy_notifier: Completion::new(new_epoch),
            previous_flush_complete: forward_flush_notifier.clone(),
        }));

        let old_ptr =
            self.inner.current_active.swap(new_active, Ordering::SeqCst);

        assert!(!old_ptr.is_null());

        let (last_flush_complete_notifier, vacancy_notifier) = unsafe {
            let old: &EpochTracker = &*old_ptr;
            let last = old.rc.fetch_add(SEAL_BIT + 1, Ordering::SeqCst);

            assert_eq!(
                last & SEAL_BIT,
                0,
                "epoch {} double-sealed",
                flush_through
            );

            // mark_complete_inner called via drop in a uniform way
            //println!("dropping flush epoch guard for epoch {flush_through}");
            drop(FlushEpochGuard { tracker: old, previously_sealed: true });

            (old.previous_flush_complete.clone(), old.vacancy_notifier.clone())
        };
        tracker_guard.defer_drop(unsafe { Box::from_raw(old_ptr) });
        drop(vacancy_mu);
        (last_flush_complete_notifier, vacancy_notifier, forward_flush_notifier)
    }

    pub fn check_in<'a>(&self) -> FlushEpochGuard<'a> {
        let _tracker_guard = self.active_ebr.pin();
        loop {
            let tracker: &'a EpochTracker =
                unsafe { &*self.inner.current_active.load(Ordering::SeqCst) };

            let rc = tracker.rc.fetch_add(1, Ordering::SeqCst);

            let previously_sealed = rc & SEAL_BIT == SEAL_BIT;

            let guard = FlushEpochGuard { tracker, previously_sealed };

            if previously_sealed {
                // the epoch is already closed, so we must drop the rc
                // and possibly notify, which is handled in the guard's
                // Drop impl.
                drop(guard);
            } else {
                return guard;
            }
        }
    }

    pub fn manually_advance_epoch(&self) {
        self.active_ebr.manually_advance_epoch();
    }

    pub fn current_flush_epoch(&self) -> FlushEpoch {
        let current = self.inner.counter.load(Ordering::SeqCst);

        FlushEpoch(NonZeroU64::new(current).unwrap())
    }
}

#[test]
fn flush_epoch_basic_functionality() {
    let epoch_tracker = FlushEpochTracker::default();

    for expected in MIN_EPOCH..1_000_000 {
        let g1 = epoch_tracker.check_in();
        let g2 = epoch_tracker.check_in();

        assert_eq!(g1.tracker.epoch.0.get(), expected);
        assert_eq!(g2.tracker.epoch.0.get(), expected);

        let previous_notifier = epoch_tracker.roll_epoch_forward().1;
        assert!(!previous_notifier.is_complete());

        drop(g1);
        assert!(!previous_notifier.is_complete());
        drop(g2);
        assert_eq!(previous_notifier.wait_for_complete().0.get(), expected);
    }
}

#[cfg(test)]
fn concurrent_flush_epoch_burn_in_inner() {
    const N_THREADS: usize = 10;
    const N_OPS_PER_THREAD: usize = 3000;

    let fa = FlushEpochTracker::default();

    // one roller and one checker per N_THREADS iteration, plus the main thread
    let barrier =
        std::sync::Arc::new(std::sync::Barrier::new(2 * N_THREADS + 1));

    let pt = pagetable::PageTable::<AtomicU64>::default();

    let rolls = || {
        let fa = fa.clone();
        let barrier = barrier.clone();
        let pt = &pt;
        move || {
            barrier.wait();
            for _ in 0..N_OPS_PER_THREAD {
                let (previous, this, next) = fa.roll_epoch_forward();
                let last_epoch = previous.wait_for_complete().0.get();
                assert_eq!(0, pt.get(last_epoch).load(Ordering::Acquire));
                let flush_through_epoch = this.wait_for_complete().0.get();
                assert_eq!(
                    0,
                    pt.get(flush_through_epoch).load(Ordering::Acquire)
                );

                next.mark_complete();
            }
        }
    };

    let check_ins = || {
        let fa = fa.clone();
        let barrier = barrier.clone();
        let pt = &pt;
        move || {
            barrier.wait();
            for _ in 0..N_OPS_PER_THREAD {
                let guard = fa.check_in();
                let epoch = guard.epoch().0.get();
                pt.get(epoch).fetch_add(1, Ordering::SeqCst);
                std::thread::yield_now();
                pt.get(epoch).fetch_sub(1, Ordering::SeqCst);
                drop(guard);
            }
        }
    };

    std::thread::scope(|s| {
        let mut threads = vec![];

        for _ in 0..N_THREADS {
            threads.push(s.spawn(rolls()));
            threads.push(s.spawn(check_ins()));
        }

        barrier.wait();

        for thread in threads.into_iter() {
            thread.join().expect("a test thread crashed unexpectedly");
        }
    });

    for i in 0..N_OPS_PER_THREAD * N_THREADS {
        assert_eq!(0, pt.get(i as u64).load(Ordering::Acquire));
    }
}

#[test]
fn concurrent_flush_epoch_burn_in() {
    for _ in 0..128 {
        concurrent_flush_epoch_burn_in_inner();
    }
}


================================================
FILE: src/heap.rs
================================================
use std::fmt;
use std::fs;
use std::io::{self, Read};
use std::num::NonZeroU64;
use std::path::{Path, PathBuf};
use std::sync::Arc;
use std::sync::atomic::{AtomicPtr, AtomicU64, Ordering, fence};
use std::time::{Duration, Instant};

use ebr::{Ebr, Guard};
use fault_injection::{annotate, fallible, maybe};
use fnv::FnvHashSet;
use fs2::FileExt as _;
use parking_lot::{Mutex, RwLock};
use rayon::prelude::*;

use crate::object_location_mapper::{AllocatorStats, ObjectLocationMapper};
use crate::{CollectionId, Config, DeferredFree, MetadataStore, ObjectId};

const WARN: &str = "DO_NOT_PUT_YOUR_FILES_HERE";
pub(crate) const N_SLABS: usize = 78;
const FILE_TARGET_FILL_RATIO: u64 = 80;
const FILE_RESIZE_MARGIN: u64 = 115;

const SLAB_SIZES: [usize; N_SLABS] = [
    64,     // 0x40
    80,     // 0x50
    96,     // 0x60
    112,    // 0x70
    128,    // 0x80
    160,    // 0xa0
    192,    // 0xc0
    224,    // 0xe0
    256,    // 0x100
    320,    // 0x140
    384,    // 0x180
    448,    // 0x1c0
    512,    // 0x200
    640,    // 0x280
    768,    // 0x300
    896,    // 0x380
    1024,   // 0x400
    1280,   // 0x500
    1536,   // 0x600
    1792,   // 0x700
    2048,   // 0x800
    2560,   // 0xa00
    3072,   // 0xc00
    3584,   // 0xe00
    4096,   // 0x1000
    5120,   // 0x1400
    6144,   // 0x1800
    7168,   // 0x1c00
    8192,   // 0x2000
    10240,  // 0x2800
    12288,  // 0x3000
    14336,  // 0x3800
    16384,  // 0x4000
    20480,  // 0x5000
    24576,  // 0x6000
    28672,  // 0x7000
    32768,  // 0x8000
    40960,  // 0xa000
    49152,  // 0xc000
    57344,  // 0xe000
    65536,  // 0x10000
    98304,  // 0x18000
    131072, // 0x20000
    163840, // 0x28000
    196608,
    262144,
    393216,
    524288,
    786432,
    1048576,
    1572864,
    2097152,
    3145728,
    4194304,
    6291456,
    8388608,
    12582912,
    16777216,
    25165824,
    33554432,
    50331648,
    67108864,
    100663296,
    134217728,
    201326592,
    268435456,
    402653184,
    536870912,
    805306368,
    1073741824,
    1610612736,
    2147483648,
    3221225472,
    4294967296,
    6442450944,
    8589934592,
    12884901888,
    17_179_869_184, // 16 GiB is the max page size as of now
];

#[derive(Default, Debug, Copy, Clone)]
pub struct WriteBatchStats {
    pub heap_bytes_written: u64,
    pub heap_files_written_to: u64,
    /// Latency inclusive of fsync
    pub heap_write_latency: Duration,
    /// Latency for fsyncing files
    pub heap_sync_latency: Duration,
    pub metadata_bytes_written: u64,
    pub metadata_write_latency: Duration,
    pub truncated_files: u64,
    pub truncated_bytes: u64,
    pub truncate_latency: Duration,
}

#[derive(Default, Debug, Clone, Copy)]
pub struct HeapStats {
    pub allocator: AllocatorStats,
    pub write_batch_max: WriteBatchStats,
    pub write_batch_sum: WriteBatchStats,
    pub truncated_file_bytes: u64,
}

impl WriteBatchStats {
    pub(crate) fn max(&self, other: &WriteBatchStats) -> WriteBatchStats {
        WriteBatchStats {
            heap_bytes_written: self
                .heap_bytes_written
                .max(other.heap_bytes_written),
            heap_files_written_to: self
                .heap_files_written_to
                .max(other.heap_files_written_to),
            heap_write_latency: self
                .heap_write_latency
                .max(other.heap_write_latency),
            heap_sync_latency: self
                .heap_sync_latency
                .max(other.heap_sync_latency),
            metadata_bytes_written: self
                .metadata_bytes_written
                .max(other.metadata_bytes_written),
            metadata_write_latency: self
                .metadata_write_latency
                .max(other.metadata_write_latency),
            truncated_files: self.truncated_files.max(other.truncated_files),
            truncated_bytes: self.truncated_bytes.max(other.truncated_bytes),
            truncate_latency: self.truncate_latency.max(other.truncate_latency),
        }
    }

    pub(crate) fn sum(&self, other: &WriteBatchStats) -> WriteBatchStats {
        use std::ops::Add;
        WriteBatchStats {
            heap_bytes_written: self
                .heap_bytes_written
                .add(other.heap_bytes_written),
            heap_files_written_to: self
                .heap_files_written_to
                .add(other.heap_files_written_to),
            heap_write_latency: self
                .heap_write_latency
                .add(other.heap_write_latency),
            heap_sync_latency: self
                .heap_sync_latency
                .add(other.heap_sync_latency),
            metadata_bytes_written: self
                .metadata_bytes_written
                .add(other.metadata_bytes_written),
            metadata_write_latency: self
                .metadata_write_latency
                .add(other.metadata_write_latency),
            truncated_files: self.truncated_files.add(other.truncated_files),
            truncated_bytes: self.truncated_bytes.add(other.truncated_bytes),
            truncate_latency: self.truncate_latency.add(other.truncate_latency),
        }
    }
}

const fn overhead_for_size(size: usize) -> usize {
    if size + 5 <= u8::MAX as usize {
        // crc32 + 1 byte frame
        5
    } else if size + 6 <= u16::MAX as usize {
        // crc32 + 2 byte frame
        6
    } else if size + 8 <= u32::MAX as usize {
        // crc32 + 4 byte frame
        8
    } else {
        // crc32 + 8 byte frame
        12
    }
}

fn slab_for_size(size: usize) -> u8 {
    let total_size = size + overhead_for_size(size);
    for (idx, slab_size) in SLAB_SIZES.iter().enumerate() {
        if *slab_size >= total_size {
            return u8::try_from(idx).unwrap();
        }
    }
    u8::MAX
}

pub use inline_array::InlineArray;

#[derive(Debug)]
pub struct ObjectRecovery {
    pub object_id: ObjectId,
    pub collection_id: CollectionId,
    pub low_key: InlineArray,
}

pub struct HeapRecovery {
    pub heap: Heap,
    pub recovered_nodes: Vec<ObjectRecovery>,
    pub was_recovered: bool,
}

enum PersistentSettings {
    V1 { leaf_fanout: u64 },
}

impl PersistentSettings {
    // NB: should only be called with a directory lock already exclusively acquired
    fn verify_or_store<P: AsRef<Path>>(
        &self,
        path: P,
        _directory_lock: &std::fs::File,
    ) -> io::Result<()> {
        let settings_path = path.as_ref().join("durability_cookie");

        match std::fs::read(&settings_path) {
            Ok(previous_bytes) => {
                let previous =
                    PersistentSettings::deserialize(&previous_bytes)?;
                self.check_compatibility(&previous)
            }
            Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
                std::fs::write(settings_path, self.serialize())
            }
            Err(e) => Err(e),
        }
    }

    fn deserialize(buf: &[u8]) -> io::Result<PersistentSettings> {
        let mut cursor = buf;
        let mut buf = [0_u8; 64];
        cursor.read_exact(&mut buf)?;

        let version = u16::from_le_bytes([buf[0], buf[1]]);

        let crc_actual = (crc32fast::hash(&buf[0..60]) ^ 0xAF).to_le_bytes();
        let crc_expected = &buf[60..];

        if crc_actual != crc_expected {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "encountered corrupted settings cookie with mismatched CRC.",
            ));
        }

        match version {
            1 => {
                let leaf_fanout =
                    u64::from_le_bytes(buf[2..10].try_into().unwrap());
                Ok(PersistentSettings::V1 { leaf_fanout })
            }
            _ => Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "encountered unknown version number when reading settings cookie",
            )),
        }
    }

    fn check_compatibility(
        &self,
        other: &PersistentSettings,
    ) -> io::Result<()> {
        use PersistentSettings::*;

        match (self, other) {
            (V1 { leaf_fanout: lf1 }, V1 { leaf_fanout: lf2 }) => {
                if lf1 != lf2 {
                    Err(io::Error::new(
                        io::ErrorKind::Unsupported,
                        format!(
                            "sled was already opened with a LEAF_FANOUT const generic of {}, \
                                and this may not be changed after initial creation. Please use \
                                Db::import / Db::export to migrate, if you wish to change the \
                                system's format.",
                            lf2
                        ),
                    ))
                } else {
                    Ok(())
                }
            }
        }
    }

    fn serialize(&self) -> Vec<u8> {
        // format: 64 bytes in total, with the last 4 being a LE crc32
        // first 2 are LE version number
        let mut buf = vec![];

        match self {
            PersistentSettings::V1 { leaf_fanout } => {
                // LEAF_FANOUT: 8 bytes LE
                let version: [u8; 2] = 1_u16.to_le_bytes();
                buf.extend_from_slice(&version);

                buf.extend_from_slice(&leaf_fanout.to_le_bytes());
            }
        }

        // zero-pad the buffer
        assert!(buf.len() < 60);
        buf.resize(60, 0);

        let hash: u32 = crc32fast::hash(&buf) ^ 0xAF;
        let hash_bytes: [u8; 4] = hash.to_le_bytes();
        buf.extend_from_slice(&hash_bytes);

        // keep the buffer to 64 bytes for easy parsing over time.
        assert_eq!(buf.len(), 64);

        buf
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub(crate) struct SlabAddress {
    slab_id: u8,
    slab_slot: [u8; 7],
}

impl SlabAddress {
    pub(crate) fn from_slab_slot(slab: u8, slot: u64) -> SlabAddress {
        let slot_bytes = slot.to_be_bytes();

        assert_eq!(slot_bytes[0], 0);

        SlabAddress {
            slab_id: slab,
            slab_slot: slot_bytes[1..].try_into().unwrap(),
        }
    }

    #[inline]
    pub const fn slab(&self) -> u8 {
        self.slab_id
    }

    #[inline]
    pub const fn slot(&self) -> u64 {
        u64::from_be_bytes([
            0,
            self.slab_slot[0],
            self.slab_slot[1],
            self.slab_slot[2],
            self.slab_slot[3],
            self.slab_slot[4],
            self.slab_slot[5],
            self.slab_slot[6],
        ])
    }
}

impl From<NonZeroU64> for SlabAddress {
    fn from(i: NonZeroU64) -> SlabAddress {
        let i = i.get();
        let bytes = i.to_be_bytes();
        SlabAddress {
            slab_id: bytes[0] - 1,
            slab_slot: bytes[1..].try_into().unwrap(),
        }
    }
}

impl From<SlabAddress> for NonZeroU64 {
    fn from(sa: SlabAddress) -> NonZeroU64 {
        NonZeroU64::new(u64::from_be_bytes([
            sa.slab_id + 1,
            sa.slab_slot[0],
            sa.slab_slot[1],
            sa.slab_slot[2],
            sa.slab_slot[3],
            sa.slab_slot[4],
            sa.slab_slot[5],
            sa.slab_slot[6],
        ]))
        .unwrap()
    }
}

#[cfg(unix)]
mod sys_io {
    use std::io;
    use std::os::unix::fs::FileExt;

    use super::*;

    pub(super) fn read_exact_at(
        file: &fs::File,
        buf: &mut [u8],
        offset: u64,
    ) -> io::Result<()> {
        match maybe!(file.read_exact_at(buf, offset)) {
            Ok(r) => Ok(r),
            Err(e) => {
                // FIXME BUG 3: failed to read 64 bytes at offset 192 from file with len 192
                println!(
                    "failed to read {} bytes at offset {} from file with len {}",
                    buf.len(),
                    offset,
                    file.metadata().unwrap().len(),
                );
                let _ = dbg!(std::backtrace::Backtrace::force_capture());
                Err(e)
            }
        }
    }

    pub(super) fn write_all_at(
        file: &fs::File,
        buf: &[u8],
        offset: u64,
    ) -> io::Result<()> {
        maybe!(file.write_all_at(buf, offset))
    }
}

#[cfg(windows)]
mod sys_io {
    use std::os::windows::fs::FileExt;

    use super::*;

    pub(super) fn read_exact_at(
        file: &fs::File,
        mut buf: &mut [u8],
        mut offset: u64,
    ) -> io::Result<()> {
        while !buf.is_empty() {
            match maybe!(file.seek_read(buf, offset)) {
                Ok(0) => break,
                Ok(n) => {
                    let tmp = buf;
                    buf = &mut tmp[n..];
                    offset += n as u64;
                }
                Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {}
                Err(e) => return Err(annotate!(e)),
            }
        }
        if !buf.is_empty() {
            Err(annotate!(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "failed to fill whole buffer"
            )))
        } else {
            Ok(())
        }
    }

    pub(super) fn write_all_at(
        file: &fs::File,
        mut buf: &[u8],
        mut offset: u64,
    ) -> io::Result<()> {
        while !buf.is_empty() {
            match maybe!(file.seek_write(buf, offset)) {
                Ok(0) => {
                    return Err(annotate!(io::Error::new(
                        io::ErrorKind::WriteZero,
                        "failed to write whole buffer",
                    )));
                }
                Ok(n) => {
                    buf = &buf[n..];
                    offset += n as u64;
                }
                Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {}
                Err(e) => return Err(annotate!(e)),
            }
        }
        Ok(())
    }
}

#[derive(Debug)]
struct Slab {
    file: fs::File,
    slot_size: usize,
    max_live_slot_since_last_truncation: AtomicU64,
}

impl Slab {
    fn sync(&self) -> io::Result<()> {
        self.file.sync_all()
    }

    fn read(
        &self,
        slot: u64,
        _guard: &mut Guard<'_, DeferredFree, 16, 16>,
    ) -> io::Result<Vec<u8>> {
        log::trace!("reading from slot {} in slab {}", slot, self.slot_size);

        // zero-initialize the buffer rather than exposing uninitialized
        // memory to the read call below
        let mut data = vec![0_u8; self.slot_size];

        let whence = self.slot_size as u64 * slot;

        maybe!(sys_io::read_exact_at(&self.file, &mut data, whence))?;

        let hash_actual: [u8; 4] =
            (crc32fast::hash(&data[..self.slot_size - 4]) ^ 0xAF).to_le_bytes();
        let hash_expected = &data[self.slot_size - 4..];

        if hash_expected != hash_actual {
            return Err(annotate!(io::Error::new(
                io::ErrorKind::InvalidData,
                "crc mismatch - data corruption detected"
            )));
        }

        let len: usize = if self.slot_size <= u8::MAX as usize {
            // crc32 + 1 byte frame
            usize::from(data[self.slot_size - 5])
        } else if self.slot_size <= u16::MAX as usize {
            // crc32 + 2 byte frame
            let mut size_bytes: [u8; 2] = [0; 2];
            size_bytes
                .copy_from_slice(&data[self.slot_size - 6..self.slot_size - 4]);
            usize::from(u16::from_le_bytes(size_bytes))
        } else if self.slot_size <= u32::MAX as usize {
            // crc32 + 4 byte frame
            let mut size_bytes: [u8; 4] = [0; 4];
            size_bytes
                .copy_from_slice(&data[self.slot_size - 8..self.slot_size - 4]);
            usize::try_from(u32::from_le_bytes(size_bytes)).unwrap()
        } else {
            // crc32 + 8 byte frame
            let mut size_bytes: [u8; 8] = [0; 8];
            size_bytes.copy_from_slice(
                &data[self.slot_size - 12..self.slot_size - 4],
            );
            usize::try_from(u64::from_le_bytes(size_bytes)).unwrap()
        };

        data.truncate(len);

        Ok(data)
    }

    fn write(&self, slot: u64, mut data: Vec<u8>) -> io::Result<()> {
        let len = data.len();

        assert!(len + overhead_for_size(data.len()) <= self.slot_size);

        data.resize(self.slot_size, 0);

        if self.slot_size <= u8::MAX as usize {
            // crc32 + 1 byte frame
            data[self.slot_size - 5] = u8::try_from(len).unwrap();
        } else if self.slot_size <= u16::MAX as usize {
            // crc32 + 2 byte frame
            let size_bytes: [u8; 2] = u16::try_from(len).unwrap().to_le_bytes();
            data[self.slot_size - 6..self.slot_size - 4]
                .copy_from_slice(&size_bytes);
        } else if self.slot_size <= u32::MAX as usize {
            // crc32 + 4 byte frame
            let size_bytes: [u8; 4] = u32::try_from(len).unwrap().to_le_bytes();
            data[self.slot_size - 8..self.slot_size - 4]
                .copy_from_slice(&size_bytes);
        } else {
            // crc32 + 8 byte frame
            let size_bytes: [u8; 8] = u64::try_from(len).unwrap().to_le_bytes();
            data[self.slot_size - 12..self.slot_size - 4]
                .copy_from_slice(&size_bytes);
        }

        let hash: [u8; 4] =
            (crc32fast::hash(&data[..self.slot_size - 4]) ^ 0xAF).to_le_bytes();
        data[self.slot_size - 4..].copy_from_slice(&hash);

        let whence = self.slot_size as u64 * slot;

        log::trace!("writing to slot {} in slab {}", slot, self.slot_size);
        sys_io::write_all_at(&self.file, &data, whence)
    }
}

fn set_error(
    global_error: &AtomicPtr<(io::ErrorKind, String)>,
    error: &io::Error,
) {
    let kind = error.kind();
    let reason = error.to_string();

    let boxed = Box::new((kind, reason));
    let ptr = Box::into_raw(boxed);

    if global_error
        .compare_exchange(
            std::ptr::null_mut(),
            ptr,
            Ordering::SeqCst,
            Ordering::SeqCst,
        )
        .is_err()
    {
        // global fatal error already installed, drop this one
        unsafe {
            drop(Box::from_raw(ptr));
        }
    }
}

#[derive(Debug)]
pub enum Update {
    Store {
        object_id: ObjectId,
        collection_id: CollectionId,
        low_key: InlineArray,
        data: Vec<u8>,
    },
    Free {
        object_id: ObjectId,
        collection_id: CollectionId,
    },
}

impl Update {
    #[allow(unused)]
    pub(crate) fn object_id(&self) -> ObjectId {
        match self {
            Update::Store { object_id, .. }
            | Update::Free { object_id, .. } => *object_id,
        }
    }
}

#[derive(Debug, PartialOrd, Ord, PartialEq, Eq)]
pub enum UpdateMetadata {
    Store {
        object_id: ObjectId,
        collection_id: CollectionId,
        low_key: InlineArray,
        location: NonZeroU64,
    },
    Free {
        object_id: ObjectId,
        collection_id: CollectionId,
    },
}

impl UpdateMetadata {
    pub fn object_id(&self) -> ObjectId {
        match self {
            UpdateMetadata::Store { object_id, .. }
            | UpdateMetadata::Free { object_id, .. } => *object_id,
        }
    }
}

#[derive(Debug, Default, Clone, Copy)]
struct WriteBatchStatTracker {
    sum: WriteBatchStats,
    max: WriteBatchStats,
}

#[derive(Clone)]
pub struct Heap {
    path: PathBuf,
    slabs: Arc<[Slab; N_SLABS]>,
    table: ObjectLocationMapper,
    metadata_store: Arc<Mutex<MetadataStore>>,
    free_ebr: Ebr<DeferredFree, 16, 16>,
    global_error: Arc<AtomicPtr<(io::ErrorKind, String)>>,
    #[allow(unused)]
    directory_lock: Arc<fs::File>,
    stats: Arc<RwLock<WriteBatchStatTracker>>,
    truncated_file_bytes: Arc<AtomicU64>,
}

impl fmt::Debug for Heap {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("Heap")
            .field("path", &self.path)
            .field("stats", &self.stats())
            .finish()
    }
}

impl Heap {
    pub fn recover(
        leaf_fanout: usize,
        config: &Config,
    ) -> io::Result<HeapRecovery> {
        let path = &config.path;
        log::trace!("recovering Heap at {:?}", path);
        let slabs_dir = path.join("slabs");

        // TODO NOCOMMIT
        let sync_status = std::process::Command::new("sync")
            .status()
            .map(|status| status.success());

        if !matches!(sync_status, Ok(true)) {
            log::warn!(
                "sync command before recovery failed: {:?}",
                sync_status
            );
        }

        // initialize directories if not present
        let mut was_recovered = true;
        for p in [path, &slabs_dir] {
            if let Err(e) = fs::read_dir(p) {
                if e.kind() == io::ErrorKind::NotFound {
                    fallible!(fs::create_dir_all(p));
                    was_recovered = false;
                    continue;
                }
            }
            maybe!(fs::File::open(p).and_then(|f| f.sync_all()))?;
        }

        let _ = fs::File::create(path.join(WARN));

        let mut file_lock_opts = fs::OpenOptions::new();
        file_lock_opts.create(false).read(false).write(false);
        let directory_lock = fallible!(fs::File::open(path));
        fallible!(directory_lock.try_lock_exclusive());

        maybe!(fs::File::open(&slabs_dir).and_then(|f| f.sync_all()))?;
        maybe!(directory_lock.sync_all())?;

        let persistent_settings =
            PersistentSettings::V1 { leaf_fanout: leaf_fanout as u64 };

        persistent_settings.verify_or_store(path, &directory_lock)?;

        let (metadata_store, recovered_metadata) =
            MetadataStore::recover(path.join("metadata"))?;

        let table = ObjectLocationMapper::new(
            &recovered_metadata,
            config.target_heap_file_fill_ratio,
        );

        let mut recovered_nodes =
            Vec::<ObjectRecovery>::with_capacity(recovered_metadata.len());

        for update_metadata in recovered_metadata {
            match update_metadata {
                UpdateMetadata::Store {
                    object_id,
                    collection_id,
                    location: _,
                    low_key,
                } => {
                    recovered_nodes.push(ObjectRecovery {
                        object_id,
                        collection_id,
                        low_key,
                    });
                }
                UpdateMetadata::Free { .. } => {
                    unreachable!()
                }
            }
        }

        let mut slabs = vec![];
        let mut slab_opts = fs::OpenOptions::new();
        slab_opts.create(true).read(true).write(true);
        for slot_size in &SLAB_SIZES {
            let slab_path = slabs_dir.join(format!("{}", slot_size));

            let file = fallible!(slab_opts.open(slab_path));

            slabs.push(Slab {
                slot_size: *slot_size,
                file,
                max_live_slot_since_last_truncation: AtomicU64::new(0),
            })
        }

        maybe!(fs::File::open(&slabs_dir).and_then(|f| f.sync_all()))?;

        log::debug!("recovery of Heap at {:?} complete", path);

        Ok(HeapRecovery {
            heap: Heap {
                slabs: Arc::new(slabs.try_into().unwrap()),
                path: path.into(),
                table,
                global_error: metadata_store.get_global_error_arc(),
                metadata_store: Arc::new(Mutex::new(metadata_store)),
                directory_lock: Arc::new(directory_lock),
                free_ebr: Ebr::default(),
                truncated_file_bytes: Arc::default(),
                stats: Arc::default(),
            },
            recovered_nodes,
            was_recovered,
        })
    }

    pub fn get_global_error_arc(
        &self,
    ) -> Arc<AtomicPtr<(io::ErrorKind, String)>> {
        self.global_error.clone()
    }

    fn check_error(&self) -> io::Result<()> {
        let err_ptr: *const (io::ErrorKind, String) =
            self.global_error.load(Ordering::Acquire);

        if err_ptr.is_null() {
            Ok(())
        } else {
            let deref: &(io::ErrorKind, String) = unsafe { &*err_ptr };
            Err(io::Error::new(deref.0, deref.1.clone()))
        }
    }

    fn set_error(&self, error: &io::Error) {
        set_error(&self.global_error, error);
    }

    pub fn manually_advance_epoch(&self) {
        self.free_ebr.manually_advance_epoch();
    }

    pub fn stats(&self) -> HeapStats {
        let truncated_file_bytes =
            self.truncated_file_bytes.load(Ordering::Acquire);

        let stats = self.stats.read();

        HeapStats {
            truncated_file_bytes,
            allocator: self.table.stats(),
            write_batch_max: stats.max,
            write_batch_sum: stats.sum,
        }
    }

    pub fn read(&self, object_id: ObjectId) -> Option<io::Result<Vec<u8>>> {
        if let Err(e) = self.check_error() {
            return Some(Err(e));
        }

        let mut guard = self.free_ebr.pin();
        let slab_address = self.table.get_location_for_object(object_id)?;

        let slab = &self.slabs[usize::from(slab_address.slab_id)];

        match slab.read(slab_address.slot(), &mut guard) {
            Ok(bytes) => Some(Ok(bytes)),
            Err(e) => {
                let annotated = annotate!(e);
                self.set_error(&annotated);
                Some(Err(annotated))
            }
        }
    }

    pub fn write_batch(
        &self,
        batch: Vec<Update>,
    ) -> io::Result<WriteBatchStats> {
        self.check_error()?;
        let metadata_store = self.metadata_store.try_lock()
            .expect("write_batch called concurrently! major correctness assumption violated");
        let mut guard = self.free_ebr.pin();

        let slabs = &self.slabs;
        let table = &self.table;

        let heap_bytes_written = AtomicU64::new(0);
        let heap_files_used_0_to_63 = AtomicU64::new(0);
        let heap_files_used_64_to_127 = AtomicU64::new(0);

        let map_closure = |update: Update| match update {
            Update::Store { object_id, collection_id, low_key, data } => {
                let data_len = data.len();
                let slab_id = slab_for_size(data_len);
                let slab = &slabs[usize::from(slab_id)];
                let new_location = table.allocate_slab_slot(slab_id);
                let new_location_nzu: NonZeroU64 = new_location.into();

                let complete_durability_pipeline =
                    maybe!(slab.write(new_location.slot(), data));

                if let Err(e) = complete_durability_pipeline {
                    // can immediately free the slot, as the write failed
                    // and no metadata refers to it yet
                    table.free_slab_slot(new_location);
                    return Err(e);
                }

                // record stats
                heap_bytes_written
                    .fetch_add(data_len as u64, Ordering::Release);

                if slab_id < 64 {
                    let slab_bit = 0b1 << slab_id;
                    heap_files_used_0_to_63
                        .fetch_or(slab_bit, Ordering::Release);
                } else {
                    assert!(slab_id < 128);
                    let slab_bit = 0b1 << (slab_id - 64);
                    heap_files_used_64_to_127
                        .fetch_or(slab_bit, Ordering::Release);
                }

                Ok(UpdateMetadata::Store {
                    object_id,
                    collection_id,
                    low_key,
                    location: new_location_nzu,
                })
            }
            Update::Free { object_id, collection_id } => {
                Ok(UpdateMetadata::Free { object_id, collection_id })
            }
        };

        let before_heap_write = Instant::now();

        let metadata_batch_res: io::Result<Vec<UpdateMetadata>> =
            batch.into_par_iter().map(map_closure).collect();

        let before_heap_sync = Instant::now();

        fence(Ordering::SeqCst);

        for slab_id in 0..N_SLABS {
            let dirty = if slab_id < 64 {
                let slab_bit = 0b1 << slab_id;

                heap_files_used_0_to_63.load(Ordering::Acquire) & slab_bit
                    == slab_bit
            } else {
                let slab_bit = 0b1 << (slab_id - 64);

                heap_files_used_64_to_127.load(Ordering::Acquire) & slab_bit
                    == slab_bit
            };

            if dirty {
                self.slabs[slab_id].sync()?;
            }
        }

        let heap_sync_latency = before_heap_sync.elapsed();

        let heap_write_latency = before_heap_write.elapsed();

        let metadata_batch = match metadata_batch_res {
            Ok(mut mb) => {
                // TODO evaluate the impact-to-cost ratio of this sort
                mb.par_sort_unstable();
                mb
            }
            Err(e) => {
                self.set_error(&e);
                return Err(e);
            }
        };

        // make metadata durable
        let before_metadata_write = Instant::now();
        let metadata_bytes_written =
            match metadata_store.write_batch(&metadata_batch) {
                Ok(metadata_bytes_written) => metadata_bytes_written,
                Err(e) => {
                    self.set_error(&e);
                    return Err(e);
                }
            };
        let metadata_write_latency = before_metadata_write.elapsed();

        // reclaim previous disk locations for future writes
        for update_metadata in metadata_batch {
            let last_address_opt = match update_metadata {
                UpdateMetadata::Store { object_id, location, .. } => {
                    self.table.insert(object_id, SlabAddress::from(location))
                }
                UpdateMetadata::Free { object_id, .. } => {
                    guard.defer_drop(DeferredFree {
                        allocator: self.table.clone_object_id_allocator_arc(),
                        freed_slot: object_id.0.get(),
                    });
                    self.table.remove(object_id)
                }
            };

            if let Some(last_address) = last_address_opt {
                guard.defer_drop(DeferredFree {
                    allocator: self
                        .table
                        .clone_slab_allocator_arc(last_address.slab_id),
                    freed_slot: last_address.slot(),
                });
            }
        }

        // truncate files that are now too fragmented
        let before_truncate = Instant::now();
        let mut truncated_files = 0;
        let mut truncated_bytes = 0;
        for (i, max_live_slot) in self.table.get_max_allocated_per_slab() {
            let slab = &self.slabs[i];

            let last_max = slab
                .max_live_slot_since_last_truncation
                .fetch_max(max_live_slot, Ordering::SeqCst);

            let max_since_last_truncation = last_max.max(max_live_slot);

            let currently_occupied_bytes =
                (max_live_slot + 1) * slab.slot_size as u64;

            let max_occupied_bytes =
                (max_since_last_truncation + 1) * slab.slot_size as u64;

            let ratio = currently_occupied_bytes * 100 / max_occupied_bytes;

            if ratio < FILE_TARGET_FILL_RATIO {
                let target_len = if max_live_slot < 16 {
                    currently_occupied_bytes
                } else {
                    currently_occupied_bytes * FILE_RESIZE_MARGIN / 100
                };

                assert!(target_len < max_occupied_bytes);
                assert!(
                    target_len >= currently_occupied_bytes,
                    "target_len of {} is below actual occupied len of {}",
                    target_len,
                    currently_occupied_bytes
                );

                if cfg!(not(feature = "monotonic-behavior")) {
                    if slab.file.set_len(target_len).is_ok() {
                        slab.max_live_slot_since_last_truncation
                            .store(max_live_slot, Ordering::SeqCst);

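                        // target_len is never below currently_occupied_bytes,
                        // so measure the reclaimed space against the
                        // previous high-water mark instead
                        let file_truncated_bytes =
                            max_occupied_bytes.saturating_sub(target_len);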
                        self.truncated_file_bytes
                            .fetch_add(file_truncated_bytes, Ordering::Release);

                        truncated_files += 1;
                        truncated_bytes += file_truncated_bytes;
                    } else {
                        // TODO surface stats
                    }
                }
            }
        }

        let truncate_latency = before_truncate.elapsed();

        let heap_files_written_to = u64::from(
            heap_files_used_0_to_63.load(Ordering::Acquire).count_ones()
                + heap_files_used_64_to_127
                    .load(Ordering::Acquire)
                    .count_ones(),
        );

        let stats = WriteBatchStats {
            heap_bytes_written: heap_bytes_written.load(Ordering::Acquire),
            heap_files_written_to,
            heap_write_latency,
            heap_sync_latency,
            metadata_bytes_written,
            metadata_write_latency,
            truncated_files,
            truncated_bytes,
            truncate_latency,
        };

        {
            let mut stats_tracker = self.stats.write();
            stats_tracker.max = stats_tracker.max.max(&stats);
            stats_tracker.sum = stats_tracker.sum.sum(&stats);
        }

        Ok(stats)
    }

    pub fn heap_object_id_pin(&self) -> ebr::Guard<'_, DeferredFree, 16, 16> {
        self.free_ebr.pin()
    }

    pub fn allocate_object_id(&self) -> ObjectId {
        self.table.allocate_object_id()
    }

    pub(crate) fn objects_to_defrag(&self) -> FnvHashSet<ObjectId> {
        self.table.objects_to_defrag()
    }
}


================================================
FILE: src/id_allocator.rs
================================================
use std::collections::BTreeSet;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

use crossbeam_queue::SegQueue;
use fnv::FnvHashSet;
use parking_lot::Mutex;

#[derive(Default, Debug)]
struct FreeSetAndTip {
    free_set: BTreeSet<u64>,
    next_to_allocate: u64,
}

#[derive(Default, Debug)]
pub struct Allocator {
    free_and_pending: Mutex<FreeSetAndTip>,
    /// Flat combining.
    ///
    /// A lock-free queue of recently freed ids, which is used when there is
    /// contention on `free_and_pending`.
    free_queue: SegQueue<u64>,
    allocation_counter: AtomicU64,
    free_counter: AtomicU64,
}

impl Allocator {
    /// Intended primarily for heap slab slot allocators when performing GC.
    ///
    /// If the slab is fragmented beyond the desired fill ratio, this returns
    /// the range of slot offsets (min inclusive, max exclusive) whose
    /// occupied entries may be copied into earlier free slots in order to
    /// achieve the desired fill ratio.
    pub fn fragmentation_cutoff(
        &self,
        desired_ratio: f32,
    ) -> Option<(u64, u64)> {
        let mut free_and_tip = self.free_and_pending.lock();

        let next_to_allocate = free_and_tip.next_to_allocate;

        if next_to_allocate == 0 {
            return None;
        }

        while let Some(free_id) = self.free_queue.pop() {
            free_and_tip.free_set.insert(free_id);
        }

        let live_objects =
            next_to_allocate - free_and_tip.free_set.len() as u64;
        let actual_ratio = live_objects as f32 / next_to_allocate as f32;

        log::trace!(
            "fragmented_slots actual ratio: {actual_ratio}, free len: {}",
            free_and_tip.free_set.len()
        );

        if desired_ratio <= actual_ratio {
            return None;
        }

        // calculate theoretical cut-off point, return everything past that
        let min = (live_objects as f32 / desired_ratio) as u64;
        let max = next_to_allocate;
        assert!(min < max);
        Some((min, max))
    }

    pub fn from_allocated(allocated: &FnvHashSet<u64>) -> Allocator {
        let mut heap = BTreeSet::<u64>::default();
        let max = allocated.iter().copied().max();

        for i in 0..max.unwrap_or(0) {
            if !allocated.contains(&i) {
                heap.insert(i);
            }
        }

        let free_and_pending = Mutex::new(FreeSetAndTip {
            free_set: heap,
            next_to_allocate: max.map(|m| m + 1).unwrap_or(0),
        });

        Allocator {
            free_and_pending,
            free_queue: SegQueue::default(),
            allocation_counter: 0.into(),
            free_counter: 0.into(),
        }
    }

    pub fn max_allocated(&self) -> Option<u64> {
        let next = self.free_and_pending.lock().next_to_allocate;

        if next == 0 {
            None
        } else {
            Some(next - 1)
        }
    }

    pub fn allocate(&self) -> u64 {
        self.allocation_counter.fetch_add(1, Ordering::Relaxed);
        let mut free_and_tip = self.free_and_pending.lock();
        while let Some(free_id) = self.free_queue.pop() {
            free_and_tip.free_set.insert(free_id);
        }

        compact(&mut free_and_tip);

        let pop_attempt = free_and_tip.free_set.pop_first();

        if let Some(id) = pop_attempt {
            id
        } else {
            let ret = free_and_tip.next_to_allocate;
            free_and_tip.next_to_allocate += 1;
            ret
        }
    }

    pub fn free(&self, id: u64) {
        if cfg!(not(feature = "monotonic-behavior")) {
            self.free_counter.fetch_add(1, Ordering::Relaxed);
            if let Some(mut free) = self.free_and_pending.try_lock() {
                while let Some(free_id) = self.free_queue.pop() {
                    free.free_set.insert(free_id);
                }
                free.free_set.insert(id);

                compact(&mut free);
            } else {
                self.free_queue.push(id);
            }
        }
    }

    /// Returns the (allocation, free) counters, in that order.
    pub fn counters(&self) -> (u64, u64) {
        (
            self.allocation_counter.load(Ordering::Acquire),
            self.free_counter.load(Ordering::Acquire),
        )
    }
}

fn compact(free: &mut FreeSetAndTip) {
    let next = &mut free.next_to_allocate;

    while *next > 1 && free.free_set.contains(&(*next - 1)) {
        free.free_set.remove(&(*next - 1));
        *next -= 1;
    }
}

pub struct DeferredFree {
    pub allocator: Arc<Allocator>,
    pub freed_slot: u64,
}

impl Drop for DeferredFree {
    fn drop(&mut self) {
        self.allocator.free(self.freed_slot)
    }
}


================================================
FILE: src/leaf.rs
================================================
use crate::*;

#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
pub(crate) struct Leaf<const LEAF_FANOUT: usize> {
    pub lo: InlineArray,
    pub hi: Option<InlineArray>,
    pub prefix_length: usize,
    data: stack_map::StackMap<InlineArray, InlineArray, LEAF_FANOUT>,
    pub in_memory_size: usize,
    pub mutation_count: u64,
    #[serde(skip)]
    pub dirty_flush_epoch: Option<FlushEpoch>,
    #[serde(skip)]
    pub page_out_on_flush: Option<FlushEpoch>,
    #[serde(skip)]
    pub deleted: Option<FlushEpoch>,
    #[serde(skip)]
    pub max_unflushed_epoch: Option<FlushEpoch>,
}

impl<const LEAF_FANOUT: usize> Leaf<LEAF_FANOUT> {
    pub(crate) fn empty() -> Leaf<LEAF_FANOUT> {
        Leaf {
            lo: InlineArray::default(),
            hi: None,
            prefix_length: 0,
            data: stack_map::StackMap::default(),
            // this does not need to be marked as dirty until it actually
            // receives inserted data
            dirty_flush_epoch: None,
            in_memory_size: std::mem::size_of::<Leaf<LEAF_FANOUT>>(),
            mutation_count: 0,
            page_out_on_flush: None,
            deleted: None,
            max_unflushed_epoch: None,
        }
    }

    pub(crate) const fn is_empty(&self) -> bool {
        self.data.is_empty()
    }

    pub(crate) fn set_dirty_epoch(&mut self, epoch: FlushEpoch) {
        assert!(self.deleted.is_none());
        if let Some(current_epoch) = self.dirty_flush_epoch {
            assert!(current_epoch <= epoch);
        }
        if self.page_out_on_flush < Some(epoch) {
            self.page_out_on_flush = None;
        }
        self.dirty_flush_epoch = Some(epoch);
    }

    fn prefix(&self) -> &[u8] {
        assert!(self.deleted.is_none());
        &self.lo[..self.prefix_length]
    }

    pub(crate) fn get(&self, key: &[u8]) -> Option<&InlineArray> {
        assert!(self.deleted.is_none());
        assert!(key.starts_with(self.prefix()));
        let prefixed_key = &key[self.prefix_length..];
        self.data.get(prefixed_key)
    }

    pub(crate) fn insert(
        &mut self,
        key: InlineArray,
        value: InlineArray,
    ) -> Option<InlineArray> {
        assert!(self.deleted.is_none());
        assert!(key.starts_with(self.prefix()));
        let prefixed_key = key[self.prefix_length..].into();
        self.data.insert(prefixed_key, value)
    }

    pub(crate) fn remove(&mut self, key: &[u8]) -> Option<InlineArray> {
        assert!(self.deleted.is_none());
        let prefix = self.prefix();
        assert!(key.starts_with(prefix));
        let partial_key = &key[self.prefix_length..];
        self.data.remove(partial_key)
    }

    pub(crate) fn merge_from(&mut self, other: &mut Self) {
        assert!(self.is_empty());

        self.hi = other.hi.clone();

        let new_prefix_len = if let Some(hi) = &self.hi {
            self.lo.iter().zip(hi.iter()).take_while(|(l, r)| l == r).count()
        } else {
            0
        };

        assert_eq!(self.lo[..new_prefix_len], other.lo[..new_prefix_len]);

        // self.prefix_length is not read because it's expected to be
        // initialized here.
        self.prefix_length = new_prefix_len;

        if self.prefix() == other.prefix() {
            self.data = std::mem::take(&mut other.data);
            return;
        }

        assert!(
            self.prefix_length < other.prefix_length,
            "self: {:?} other: {:?}",
            self,
            other
        );

        let unshifted_key_amount = other.prefix_length - self.prefix_length;
        let unshifted_prefix = &other.lo
            [other.prefix_length - unshifted_key_amount..other.prefix_length];

        for (k, v) in other.data.iter() {
            let mut unshifted_key =
                Vec::with_capacity(unshifted_prefix.len() + k.len());
            unshifted_key.extend_from_slice(unshifted_prefix);
            unshifted_key.extend_from_slice(k);
            self.data.insert(unshifted_key.into(), v.clone());
        }

        assert_eq!(other.data.len(), self.data.len());

        #[cfg(feature = "for-internal-testing-only")]
        assert_eq!(
            self.iter().collect::<Vec<_>>(),
            other.iter().collect::<Vec<_>>(),
            "self: {:#?} \n other: {:#?}\n",
            self,
            other
        );
    }

    pub(crate) fn iter(
        &self,
    ) -> impl Iterator<Item = (InlineArray, InlineArray)> {
        let prefix = self.prefix();
        self.data.iter().map(|(k, v)| {
            let mut unshifted_key = Vec::with_capacity(prefix.len() + k.len());
            unshifted_key.extend_from_slice(prefix);
            unshifted_key.extend_from_slice(k);
            (unshifted_key.into(), v.clone())
        })
    }

    pub(crate) fn serialize(&self, zstd_compression_level: i32) -> Vec<u8> {
        let mut ret = vec![];

        let mut zstd_enc =
            zstd::stream::Encoder::new(&mut ret, zstd_compression_level)
                .unwrap();

        bincode::serialize_into(&mut zstd_enc, self).unwrap();

        zstd_enc.finish().unwrap();

        ret
    }

    pub(crate) fn deserialize(
        buf: &[u8],
    ) -> std::io::Result<Box<Leaf<LEAF_FANOUT>>> {
        let zstd_decoded = zstd::stream::decode_all(buf).unwrap();
        let mut leaf: Box<Leaf<LEAF_FANOUT>> =
            bincode::deserialize(&zstd_decoded).unwrap();

        // use decompressed buffer length as a cheap proxy for in-memory size for now
        leaf.in_memory_size = zstd_decoded.len();

        Ok(leaf)
    }

    fn set_in_memory_size(&mut self) {
        self.in_memory_size = std::mem::size_of::<Leaf<LEAF_FANOUT>>()
            + self.hi.as_ref().map(|h| h.len()).unwrap_or(0)
            + self.lo.len()
            + self.data.iter().map(|(k, v)| k.len() + v.len()).sum::<usize>();
    }

    pub(crate) fn split_if_full(
        &mut self,
        new_epoch: FlushEpoch,
        allocator: &ObjectCache<LEAF_FANOUT>,
        collection_id: CollectionId,
    ) -> Option<(InlineArray, Object<LEAF_FANOUT>)> {
        if self.data.is_full() {
            let original_len = self.data.len();

            let old_prefix_len = self.prefix_length;
            // split
            let split_offset = if self.lo.is_empty() {
                // split left-most shard almost at the beginning for
                // optimizing downward-growing workloads
                1
            } else if self.hi.is_none() {
                // split right-most shard almost at the end for
                // optimizing upward-growing workloads
                self.data.len() - 2
            } else {
                self.data.len() / 2
            };

            let data = self.data.split_off(split_offset);

            let left_max = &self.data.last().unwrap().0;
            let right_min = &data.first().unwrap().0;

            // suffix truncation attempts to shrink the split key
            // so that shorter keys bubble up into the index
            let splitpoint_length = right_min
                .iter()
                .zip(left_max.iter())
                .take_while(|(a, b)| a == b)
                .count()
                + 1;

            let mut split_vec =
                Vec::with_capacity(self.prefix_length + splitpoint_length);
            split_vec.extend_from_slice(self.prefix());
            split_vec.extend_from_slice(&right_min[..splitpoint_length]);
            let split_key = InlineArray::from(split_vec);

            let rhs_id = allocator.allocate_object_id(new_epoch);

            log::trace!(
                "split leaf {:?} at split key: {:?} into new {:?} at {:?}",
                self.lo,
                split_key,
                rhs_id,
                new_epoch,
            );

            let mut rhs = Leaf {
                dirty_flush_epoch: Some(new_epoch),
                hi: self.hi.clone(),
                lo: split_key.clone(),
                prefix_length: 0,
                in_memory_size: 0,
                data,
                mutation_count: 0,
                page_out_on_flush: None,
                deleted: None,
                max_unflushed_epoch: None,
            };

            rhs.shorten_keys_after_split(old_prefix_len);

            rhs.set_in_memory_size();

            self.hi = Some(split_key.clone());

            self.shorten_keys_after_split(old_prefix_len);

            self.set_in_memory_size();

            assert_eq!(self.hi.as_ref().unwrap(), &split_key);
            assert_eq!(rhs.lo, &split_key);
            assert_eq!(rhs.data.len() + self.data.len(), original_len);

            let rhs_node = Object {
                object_id: rhs_id,
                collection_id,
                low_key: split_key.clone(),
                inner: Arc::new(RwLock::new(CacheBox {
                    leaf: Some(Box::new(rhs)),
                    logged_index: BTreeMap::default(),
                })),
            };

            return Some((split_key, rhs_node));
        }

        None
    }

    pub(crate) fn shorten_keys_after_split(&mut self, old_prefix_len: usize) {
        let Some(hi) = self.hi.as_ref() else { return };

        let new_prefix_len =
            self.lo.iter().zip(hi.iter()).take_while(|(l, r)| l == r).count();

        assert_eq!(self.lo[..new_prefix_len], hi[..new_prefix_len]);

        // self.prefix_length is not read because it's expected to be
        // initialized here.
        self.prefix_length = new_prefix_len;

        if new_prefix_len == old_prefix_len {
            return;
        }

        assert!(
            new_prefix_len > old_prefix_len,
            "expected new prefix length of {} to be greater than the pre-split prefix length of {} for node {:?}",
            new_prefix_len,
            old_prefix_len,
            self
        );

        let key_shift = new_prefix_len - old_prefix_len;

        for (k, v) in std::mem::take(&mut self.data).iter() {
            self.data.insert(k[key_shift..].into(), v.clone());
        }
    }
}
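
The split logic above can be illustrated on plain byte slices. This is a minimal sketch, not sled's implementation: the helper names `split_offset`, `common_prefix_len`, `truncated_split_key`, and `shorten_keys` are hypothetical, and `&[u8]` / `Vec<u8>` stand in for `InlineArray` and the leaf's internal map. It assumes a full leaf holds at least two entries and that `left_max < right_min`, as is the case for adjacent keys in a sorted leaf.

```rust
/// Picks where to cut a full leaf, mirroring the three branches above:
/// near the start for the left-most leaf, near the end for the right-most
/// leaf, and the midpoint otherwise. Assumes `len >= 2`.
fn split_offset(len: usize, is_leftmost: bool, is_rightmost: bool) -> usize {
    if is_leftmost {
        1
    } else if is_rightmost {
        len - 2
    } else {
        len / 2
    }
}

/// Number of leading bytes shared by `a` and `b`.
fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

/// Suffix truncation: the shortest prefix of `right_min` that still sorts
/// strictly after `left_max`. Assumes `left_max < right_min`, so `right_min`
/// has at least one byte beyond the shared prefix.
fn truncated_split_key<'a>(left_max: &[u8], right_min: &'a [u8]) -> &'a [u8] {
    let len = common_prefix_len(left_max, right_min) + 1;
    &right_min[..len]
}

/// Drops `key_shift` leading bytes from every key, as happens when a node's
/// shared prefix grows after a split and keys are re-stored relative to it.
fn shorten_keys(keys: &[Vec<u8>], key_shift: usize) -> Vec<Vec<u8>> {
    keys.iter().map(|k| k[key_shift..].to_vec()).collect()
}
```

For the adjacent keys `apple_pie` and `apricot_jam`, the shared prefix is `ap`, so the truncated split key is `apr`: it sorts after `apple_pie` and no later than `apricot_jam`, while being much shorter than either. For a node bounded by `user:1000` and `user:2000`, the shared prefix is the five bytes `user:`, so a key stored as `user:1500` under a prefix length of 0 would be re-stored as `1500`.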


================================================
FILE: src/lib.rs
================================================
// 1.0 blockers
//
// bugs
// * page-out needs to be deferred until after any flush of the dirty epoch
//   * need to remove max_unflushed_epoch after flushing it
//   * can't send reliable page-out request backwards from 7->6
//   * re-locking every mutex in a writebatch feels bad
//   * need to signal stability status forward
//     * maybe we already are
//   * can make dirty_flush_epoch atomic and CAS it to 0 after flush
//   * can change dirty_flush_epoch to unflushed_epoch
//   * can always set mutation_count to max dirty flush epoch
//     * this feels nice, we can lazily update a global stable flushed counter
//     * can get rid of dirty_flush_epoch and page_out_on_flush?
//     * or at least dirty_flush_epoch
//   * dirty_flush_epoch really means "hasn't yet been cooperatively serialized @ F.E."
//   * interesting metrics:
//     * whether dirty for some epoch
//     * whether cooperatively serialized for some epoch
//     * whether fully flushed for some epoch
//     * clean -> dirty -> {maybe coop} -> flushed
//   * for page-out, we only care if it's stable or if we need to add it to
//     a page-out priority queue
// * page-out doesn't seem to happen as expected
//
// reliability
// TODO make all writes wrapped in a Tearable wrapper that splits writes
//      and can possibly crash based on a counter.
// TODO test concurrent drop_tree when other threads are still using it
// TODO list trees test for recovering empty collections
// TODO set explicit max key and value sizes w/ corresponding heap
// TODO add failpoints to writepath
//
// performance
// TODO handle prefix encoding
// TODO (minor) remove cache access for removed node in merge function
// TODO index+log hybrid - tinylsm key -> object location
//
// features
// TODO multi-collection batch
//
// misc
// TODO skim inlining output of RUSTFLAGS="-Cremark=all -Cdebuginfo=1"
//
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1.0 cutoff ~~~~~~~~~~~~~~~~~~~~~~~~~~~
//
// post-1.0 improvements
//
// reliability
// TODO bug hiding: if the crash_iter test panics, the test doesn't fail as expected
// TODO event log assertion for testing heap location bidirectional referential integrity,
//      particularly in the object location mapper.
// TODO ensure nothing "from the future" gets copied into earlier epochs during GC
// TODO collection_id on page_in checks - it needs to be pinned w/ heap's EBR?
// TODO put aborts behind feature flags for hard crashes
// TODO re-enable transaction tests in test_tree.rs
//
// performance
// TODO force writers to flush when some number of dirty epochs have built up
// TODO serialize flush batch in parallel
// TODO concurrent serialization of NotYetSerialized dirty objects
// TODO make the Arc<Option<Box<Leaf just a single pointer chase w/ custom container
// TODO allow waiting flusher to start collecting dirty pages as soon
//      as it is evacuated - just wait until last flush is done before
//      we persist the batch
// TODO measure space savings vs cost of zstd in metadata store
// TODO make EBR and index fanout consts as small as possible to reduce memory usage
// TODO make leaf fanout as small as possible while retaining perf
// TODO dynamically sized fanouts for reducing fragmentation
//
// features
// TODO transactions
// TODO implement create exclusive
// TODO temporary trees for transactional in-memory coordination
// TODO corrupted data extraction binary
//

//! `sled` is a high-performance embedded database with
//! an API that is similar to a `BTreeMap<[u8], [u8]>`,
//! but with several additional capabilities for
//! assisting creators of stateful systems.
//!
//! It is fully thread-safe, and all operations are
//! atomic. Multiple `Tree`s with isolated keyspaces
//! are supported with the
//! [`Db::open_tree`](struct.Db.html#method.open_tree) method.
//!
//! `sled` is built by experienced database engineers
//! who think users should spend less time tuning and
//! working against high-friction APIs. Expect
//! significant ergonomic and performance improvements
//! over time. Most surprises are bugs, so please
//! [let us know](mailto:tylerneely@gmail.com?subject=sled%20sucks!!!)
//! if something is high friction.
//!
//! # Examples
//!
//! ```
//! # let _ = std::fs::remove_dir_all("my_db");
//! let db: sled::Db = sled::open("my_db").unwrap();
//!
//! // insert and get
//! db.insert(b"yo!", b"v1").unwrap();
//! assert_eq!(&db.get(b"yo!").unwrap().unwrap(), b"v1");
//!
//! // Atomic compare-and-swap.
//! db.compare_and_swap(
//!     b"yo!",      // key
//!     Some(b"v1"), // old value, None for not present
//!     Some(b"v2"), // new value, None for delete
//! )
//! .unwrap();
//!
//! // Iterates over key-value pairs, starting at the given key.
//! let scan_key: &[u8] = b"a non-present key before yo!";
//! let mut iter = db.range(scan_key..);
//! assert_eq!(&iter.next().unwrap().unwrap().0, b"yo!");
//! assert!(iter.next().is_none());
//!
//! db.remove(b"yo!").unwrap();
//! assert!(db.get(b"yo!").unwrap().is_none());
//!
//! let other_tree: sled::Tree = db.open_tree(b"cool db facts").unwrap();
//! other_tree.insert(
//!     b"k1",
//!     &b"a Db acts like a Tree due to implementing Deref<Target = Tree>"[..]
//! ).unwrap();
//! # let _ = std::fs::remove_dir_all("my_db");
//! ```
#[cfg(feature = "for-internal-testing-only")]
mod block_checker;
mod config;
mod db;
mod flush_epoch;
mod heap;
mod id_allocator;
mod leaf;
mod metadata_store;
mod object_cache;
mod object_location_mapper;
mod tree;

#[cfg(any(
    feature = "testing-shred-allocator",
    feature = "testing-count-allocator"
))]
pub mod alloc;

#[cfg(feature = "for-internal-testing-only")]
mod event_verifier;

#[inline]
fn debug_delay() {
    #[cfg(debug_assertions)]
    {
        let rand =
            std::time::SystemTime::UNIX_EPOCH.elapsed().unwrap().as_nanos();

        if rand % 128 > 100 {
            for _ in 0..rand % 16 {
                std::thread::yield_now();
            }
        }
    }
}

pub use crate::config::Config;
pub use crate::db::Db;
pub use crate::tree::{Batch, Iter, Tree};
pub use inline_array::InlineArray;

const NAME_MAPPING_COLLECTION_ID: CollectionId = CollectionId(0);
const DEFAULT_COLLECTION_ID: CollectionId = CollectionId(1);
const INDEX_FANOUT: usize = 64;
const EBR_LOCAL_GC_BUFFER_SIZE: usize = 128;

use std::collections::BTreeMap;
use std::num::NonZeroU64;
use std::ops::Bound;
use std::sync::Arc;

use parking_lot::RwLock;

use crate::flush_epoch::{
    FlushEpoch, FlushEpochGuard, FlushEpochTracker, FlushInvariants,
};
use crate::heap::{
    HeapStats, ObjectRecovery, SlabAddress, Update, WriteBatchStats,
};
use crate::id_allocator::{Allocator, DeferredFree};
use crate::leaf::Leaf;

// These are public so that they can be easily crash tested in external
// binaries. They are hidden because there are zero guarantees around their
// API stability or functionality.
#[doc(hidden)]
pub use crate::heap::{Heap, HeapRecovery};
#[doc(hidden)]
pub use crate::metadata_store::MetadataStore;
#[doc(hidden)]
pub use crate::object_cache::{CacheStats, Dirty, FlushStats, ObjectCache};

/// Opens a `Db` with a default configuration at the
/// specified path. This will create a new storage
/// directory at the specified path if it does
/// not already exist. You can use the `Db::was_recovered`
/// method to determine if your database was recovered
/// from a previous instance.
pub fn open<P: AsRef<std::path::Path>>(path: P) -> std::io::Result<Db> {
    Config::new().path(path).open()
}

#[derive(Debug, Copy, Clone)]
pub struct Stats {
    pub cache: CacheStats,
}

/// Compare and swap result.
///
/// It returns `Ok(Ok(()))` if the operation finishes successfully and
///     - `Ok(Err(CompareAndSwapError(current, proposed)))` if the operation
///       failed to set up a new value. `CompareAndSwapError` contains current and
///       proposed

SYMBOL INDEX (566 symbols across 34 files)

FILE: examples/bench.rs
  type Db (line 11) | type Db = SledDb<1024>;
  constant N_WRITES_PER_THREAD (line 13) | const N_WRITES_PER_THREAD: u32 = 4 * 1024 * 1024;
  constant MAX_CONCURRENCY (line 14) | const MAX_CONCURRENCY: u32 = 4;
  constant CONCURRENCY (line 15) | const CONCURRENCY: &[usize] = &[/*1, 2, 4,*/ MAX_CONCURRENCY as _];
  constant BYTES_PER_ITEM (line 16) | const BYTES_PER_ITEM: u32 = 8;
  type Databench (line 18) | trait Databench: Clone + Send {
    constant NAME (line 20) | const NAME: &'static str;
    constant PATH (line 21) | const PATH: &'static str;
    method open (line 22) | fn open() -> Self;
    method remove_generic (line 23) | fn remove_generic(&self, key: &[u8]);
    method insert_generic (line 24) | fn insert_generic(&self, key: &[u8], value: &[u8]);
    method get_generic (line 25) | fn get_generic(&self, key: &[u8]) -> Option<Self::READ>;
    method flush_generic (line 26) | fn flush_generic(&self);
    method print_stats (line 27) | fn print_stats(&self);
    type READ (line 31) | type READ = sled::InlineArray;
    constant NAME (line 33) | const NAME: &'static str = "sled 1.0.0-alpha";
    constant PATH (line 34) | const PATH: &'static str = "timing_test.sled-new";
    method open (line 36) | fn open() -> Self {
    method insert_generic (line 49) | fn insert_generic(&self, key: &[u8], value: &[u8]) {
    method remove_generic (line 52) | fn remove_generic(&self, key: &[u8]) {
    method get_generic (line 55) | fn get_generic(&self, key: &[u8]) -> Option<Self::READ> {
    method flush_generic (line 58) | fn flush_generic(&self) {
    method print_stats (line 61) | fn print_stats(&self) {
  function allocated (line 221) | fn allocated() -> usize {
  function freed (line 229) | fn freed() -> usize {
  function resident (line 237) | fn resident() -> usize {
  function inserts (line 245) | fn inserts<D: Databench>(store: &D) -> Vec<InsertStats> {
  function removes (line 295) | fn removes<D: Databench>(store: &D) -> Vec<RemoveStats> {
  function gets (line 345) | fn gets<D: Databench>(store: &D) -> Vec<GetStats> {
  function execute_lockstep_concurrent (line 383) | fn execute_lockstep_concurrent<
  type InsertStats (line 421) | struct InsertStats {
  type GetStats (line 427) | struct GetStats {
  type RemoveStats (line 433) | struct RemoveStats {
  type Stats (line 440) | struct Stats {
    method print_report (line 452) | fn print_report(&self) {
  function bench (line 489) | fn bench<D: Databench>() -> Stats {
  function du (line 518) | fn du(path: &Path) -> io::Result<u64> {
  function main (line 533) | fn main() {

FILE: fuzz/fuzz_targets/fuzz_model.rs
  type Db (line 11) | type Db = SledDb<3>;
  constant KEYSPACE (line 13) | const KEYSPACE: u64 = 128;
  type Op (line 16) | enum Op {
    method arbitrary (line 33) | fn arbitrary(
  function keygen (line 25) | fn keygen(

FILE: scripts/execution_explorer.py
  class UnreachableBreakpoint (line 60) | class UnreachableBreakpoint(gdb.Breakpoint):
  class DoneBreakpoint (line 64) | class DoneBreakpoint(gdb.Breakpoint):
  class InterestingBreakpoint (line 68) | class InterestingBreakpoint(gdb.Breakpoint):
  class DeterministicExecutor (line 72) | class DeterministicExecutor:
    method __init__ (line 73) | def __init__(self, seed=None):
    method reseed (line 92) | def reseed(self):
    method restart (line 98) | def restart(self):
    method rendezvous_callback (line 118) | def rendezvous_callback(self, event):
    method run (line 128) | def run(self):
    method run_schedule (line 136) | def run_schedule(self):
    method pick (line 152) | def pick(self):
    method scheduler_callback (line 164) | def scheduler_callback(self, event):
    method runnable_threads (line 183) | def runnable_threads(self):
    method exit_callback (line 197) | def exit_callback(self, event):

FILE: src/alloc.rs
  type ShredAllocator (line 21) | struct ShredAllocator;
    method alloc (line 24) | unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    method dealloc (line 31) | unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
  function allocated (line 49) | fn allocated() -> usize {
  function freed (line 53) | fn freed() -> usize {
  function resident (line 57) | fn resident() -> usize {
  type CountingAllocator (line 62) | struct CountingAllocator;
    method alloc (line 65) | unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    method dealloc (line 74) | unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {

FILE: src/block_checker.rs
  type LocationMap (line 19) | type LocationMap = BTreeMap<u64, &'static Location<'static>>;
  type BlockChecker (line 22) | pub(crate) struct BlockChecker {
    method report (line 27) | fn report(&self, last_top_10: LocationMap) -> LocationMap {
    method check_in (line 43) | fn check_in(&self, location: &'static Location) -> BlockGuard {
    method check_out (line 50) | fn check_out(&self, id: u64) {
  type BlockGuard (line 56) | pub(crate) struct BlockGuard {
  method drop (line 61) | fn drop(&mut self) {
  function track_blocks (line 67) | pub(crate) fn track_blocks() -> BlockGuard {

FILE: src/config.rs
  type Config (line 23) | pub struct Config {
    method new (line 64) | pub fn new() -> Config {
    method tmp (line 71) | pub fn tmp() -> io::Result<Config> {
    method path (line 82) | pub fn path<P: AsRef<Path>>(mut self, path: P) -> Config {
    method open (line 96) | pub fn open<const LEAF_FANOUT: usize>(
  method default (line 48) | fn default() -> Config {

FILE: src/db.rs
  type Db (line 35) | pub struct Db<const LEAF_FANOUT: usize = 1024> {
  type Target (line 47) | type Target = Tree<LEAF_FANOUT>;
  function deref (line 48) | fn deref(&self) -> &Tree<LEAF_FANOUT> {
  type Item (line 54) | type Item = io::Result<(InlineArray, InlineArray)>;
  type IntoIter (line 55) | type IntoIter = crate::Iter<LEAF_FANOUT>;
  method into_iter (line 57) | fn into_iter(self) -> Self::IntoIter {
  function fmt (line 63) | fn fmt(&self, w: &mut fmt::Formatter<'_>) -> fmt::Result {
  function flusher (line 82) | fn flusher<const LEAF_FANOUT: usize>(
  method drop (line 155) | fn drop(&mut self) {
  function validate (line 169) | fn validate(&self) -> io::Result<()> {
  function stats (line 217) | pub fn stats(&self) -> Stats {
  function size_on_disk (line 221) | pub fn size_on_disk(&self) -> io::Result<u64> {
  function was_recovered (line 248) | pub fn was_recovered(&self) -> bool {
  function open_with_config (line 252) | pub fn open_with_config(config: &Config) -> io::Result<Db<LEAF_FANOUT>> {
  function export (line 406) | pub fn export(
  function import (line 481) | pub fn import(
  function contains_tree (line 516) | pub fn contains_tree<V: AsRef<[u8]>>(&self, name: V) -> io::Result<bool> {
  function drop_tree (line 520) | pub fn drop_tree<V: AsRef<[u8]>>(&self, name: V) -> io::Result<bool> {
  function open_tree (line 545) | pub fn open_tree<V: AsRef<[u8]>>(
  type CollectionType (line 597) | type CollectionType = Vec<u8>;
  type CollectionName (line 598) | type CollectionName = Vec<u8>;

FILE: src/event_verifier.rs
  type State (line 7) | pub(crate) enum State {
    method can_transition_within_epoch_to (line 18) | fn can_transition_within_epoch_to(&self, next: State) -> bool {
    method needs_flush (line 45) | fn needs_flush(&self) -> bool {
  type EventVerifier (line 56) | pub(crate) struct EventVerifier {
    method mark (line 78) | pub(crate) fn mark(
    method dirty_epochs (line 129) | fn dirty_epochs(&self, object_id: ObjectId) -> Vec<FlushEpoch> {
    method print_debug_history_for_object (line 146) | pub(crate) fn print_debug_history_for_object(&self, object_id: ObjectI...
  method drop (line 62) | fn drop(&mut self) {

FILE: src/flush_epoch.rs
  constant SEAL_BIT (line 5) | const SEAL_BIT: u64 = 1 << 63;
  constant SEAL_MASK (line 6) | const SEAL_MASK: u64 = u64::MAX - SEAL_BIT;
  constant MIN_EPOCH (line 7) | const MIN_EPOCH: u64 = 2;
  type FlushEpoch (line 21) | pub struct FlushEpoch(NonZeroU64);
    constant MIN (line 24) | pub const MIN: FlushEpoch = FlushEpoch(NonZeroU64::MIN);
    constant MAX (line 26) | pub const MAX: FlushEpoch = FlushEpoch(NonZeroU64::MAX);
    method increment (line 28) | pub fn increment(&self) -> FlushEpoch {
    method get (line 32) | pub fn get(&self) -> u64 {
    constant MIN (line 38) | const MIN: FlushEpoch = FlushEpoch::MIN;
  type FlushInvariants (line 42) | pub(crate) struct FlushInvariants {
    method mark_flushed_epoch (line 57) | pub(crate) fn mark_flushed_epoch(&self, epoch: FlushEpoch) {
    method mark_flushing_epoch (line 63) | pub(crate) fn mark_flushing_epoch(&self, epoch: FlushEpoch) {
  method default (line 48) | fn default() -> FlushInvariants {
  type Completion (line 71) | pub(crate) struct Completion {
    method epoch (line 78) | pub fn epoch(&self) -> FlushEpoch {
    method new (line 82) | pub fn new(epoch: FlushEpoch) -> Completion {
    method wait_for_complete (line 86) | pub fn wait_for_complete(self) -> FlushEpoch {
    method mark_complete (line 95) | pub fn mark_complete(self) {
    method mark_complete_inner (line 99) | fn mark_complete_inner(&self, previously_sealed: bool) {
    method is_complete (line 114) | pub fn is_complete(&self) -> bool {
  type FlushEpochGuard (line 119) | pub struct FlushEpochGuard<'a> {
  method drop (line 125) | fn drop(&mut self) {
  function epoch (line 137) | pub fn epoch(&self) -> FlushEpoch {
  type EpochTracker (line 143) | pub(crate) struct EpochTracker {
  type FlushEpochTracker (line 151) | pub(crate) struct FlushEpochTracker {
    method roll_epoch_forward (line 207) | pub fn roll_epoch_forward(&self) -> (Completion, Completion, Completio...
    method check_in (line 255) | pub fn check_in<'a>(&self) -> FlushEpochGuard<'a> {
    method manually_advance_epoch (line 278) | pub fn manually_advance_epoch(&self) {
    method current_flush_epoch (line 282) | pub fn current_flush_epoch(&self) -> FlushEpoch {
  type FlushEpochInner (line 157) | pub(crate) struct FlushEpochInner {
  method drop (line 164) | fn drop(&mut self) {
  method default (line 177) | fn default() -> FlushEpochTracker {
  function flush_epoch_basic_functionality (line 290) | fn flush_epoch_basic_functionality() {
  function concurrent_flush_epoch_burn_in_inner (line 311) | fn concurrent_flush_epoch_burn_in_inner() {
  function concurrent_flush_epoch_burn_in (line 380) | fn concurrent_flush_epoch_burn_in() {

FILE: src/heap.rs
  constant WARN (line 20) | const WARN: &str = "DO_NOT_PUT_YOUR_FILES_HERE";
  constant N_SLABS (line 21) | pub(crate) const N_SLABS: usize = 78;
  constant FILE_TARGET_FILL_RATIO (line 22) | const FILE_TARGET_FILL_RATIO: u64 = 80;
  constant FILE_RESIZE_MARGIN (line 23) | const FILE_RESIZE_MARGIN: u64 = 115;
  constant SLAB_SIZES (line 25) | const SLAB_SIZES: [usize; N_SLABS] = [
  type WriteBatchStats (line 107) | pub struct WriteBatchStats {
    method max (line 130) | pub(crate) fn max(&self, other: &WriteBatchStats) -> WriteBatchStats {
    method sum (line 156) | pub(crate) fn sum(&self, other: &WriteBatchStats) -> WriteBatchStats {
  type HeapStats (line 122) | pub struct HeapStats {
  function overhead_for_size (line 184) | const fn overhead_for_size(size: usize) -> usize {
  function slab_for_size (line 200) | fn slab_for_size(size: usize) -> u8 {
  type ObjectRecovery (line 213) | pub struct ObjectRecovery {
  type HeapRecovery (line 219) | pub struct HeapRecovery {
  type PersistentSettings (line 225) | enum PersistentSettings {
    method verify_or_store (line 231) | fn verify_or_store<P: AsRef<Path>>(
    method deserialize (line 251) | fn deserialize(buf: &[u8]) -> io::Result<PersistentSettings> {
    method check_compatibility (line 281) | fn check_compatibility(
    method serialize (line 307) | fn serialize(&self) -> Vec<u8> {
  type SlabAddress (line 338) | pub(crate) struct SlabAddress {
    method from_slab_slot (line 344) | pub(crate) fn from_slab_slot(slab: u8, slot: u64) -> SlabAddress {
    method slab (line 356) | pub const fn slab(&self) -> u8 {
    method slot (line 361) | pub const fn slot(&self) -> u64 {
    method from (line 376) | fn from(i: NonZeroU64) -> SlabAddress {
  method from (line 387) | fn from(sa: SlabAddress) -> NonZeroU64 {
  function read_exact_at (line 409) | pub(super) fn read_exact_at(
  function write_all_at (line 430) | pub(super) fn write_all_at(
  function read_exact_at (line 445) | pub(super) fn read_exact_at(
  function write_all_at (line 472) | pub(super) fn write_all_at(
  type Slab (line 498) | struct Slab {
    method sync (line 505) | fn sync(&self) -> io::Result<()> {
    method read (line 509) | fn read(
    method write (line 565) | fn write(&self, slot: u64, mut data: Vec<u8>) -> io::Result<()> {
  function set_error (line 603) | fn set_error(
  type Update (line 630) | pub enum Update {
    method object_id (line 645) | pub(crate) fn object_id(&self) -> ObjectId {
  type UpdateMetadata (line 654) | pub enum UpdateMetadata {
    method object_id (line 668) | pub fn object_id(&self) -> ObjectId {
  type WriteBatchStatTracker (line 677) | struct WriteBatchStatTracker {
  type Heap (line 683) | pub struct Heap {
    method fmt (line 697) | fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
    method recover (line 706) | pub fn recover(
    method get_global_error_arc (line 821) | pub fn get_global_error_arc(
    method check_error (line 827) | fn check_error(&self) -> io::Result<()> {
    method set_error (line 839) | fn set_error(&self, error: &io::Error) {
    method manually_advance_epoch (line 843) | pub fn manually_advance_epoch(&self) {
    method stats (line 847) | pub fn stats(&self) -> HeapStats {
    method read (line 861) | pub fn read(&self, object_id: ObjectId) -> Option<io::Result<Vec<u8>>> {
    method write_batch (line 881) | pub fn write_batch(
    method heap_object_id_pin (line 1106) | pub fn heap_object_id_pin(&self) -> ebr::Guard<'_, DeferredFree, 16, 1...
    method allocate_object_id (line 1110) | pub fn allocate_object_id(&self) -> ObjectId {
    method objects_to_defrag (line 1114) | pub(crate) fn objects_to_defrag(&self) -> FnvHashSet<ObjectId> {

FILE: src/id_allocator.rs
  type FreeSetAndTip (line 10) | struct FreeSetAndTip {
  type Allocator (line 16) | pub struct Allocator {
    method fragmentation_cutoff (line 33) | pub fn fragmentation_cutoff(
    method from_allocated (line 69) | pub fn from_allocated(allocated: &FnvHashSet<u64>) -> Allocator {
    method max_allocated (line 92) | pub fn max_allocated(&self) -> Option<u64> {
    method allocate (line 102) | pub fn allocate(&self) -> u64 {
    method free (line 122) | pub fn free(&self, id: u64) {
    method counters (line 139) | pub fn counters(&self) -> (u64, u64) {
  function compact (line 147) | fn compact(free: &mut FreeSetAndTip) {
  type DeferredFree (line 156) | pub struct DeferredFree {
  method drop (line 162) | fn drop(&mut self) {

FILE: src/leaf.rs
  type Leaf (line 4) | pub(crate) struct Leaf<const LEAF_FANOUT: usize> {
  function empty (line 22) | pub(crate) fn empty() -> Leaf<LEAF_FANOUT> {
  function is_empty (line 39) | pub(crate) const fn is_empty(&self) -> bool {
  function set_dirty_epoch (line 43) | pub(crate) fn set_dirty_epoch(&mut self, epoch: FlushEpoch) {
  function prefix (line 54) | fn prefix(&self) -> &[u8] {
  function get (line 59) | pub(crate) fn get(&self, key: &[u8]) -> Option<&InlineArray> {
  function insert (line 66) | pub(crate) fn insert(
  function remove (line 77) | pub(crate) fn remove(&mut self, key: &[u8]) -> Option<InlineArray> {
  function merge_from (line 85) | pub(crate) fn merge_from(&mut self, other: &mut Self) {
  function iter (line 138) | pub(crate) fn iter(
  function serialize (line 150) | pub(crate) fn serialize(&self, zstd_compression_level: i32) -> Vec<u8> {
  function deserialize (line 164) | pub(crate) fn deserialize(
  function set_in_memory_size (line 177) | fn set_in_memory_size(&mut self) {
  function split_if_full (line 184) | pub(crate) fn split_if_full(
  function shorten_keys_after_split (line 280) | pub(crate) fn shorten_keys_after_split(&mut self, old_prefix_len: usize) {

FILE: src/lib.rs
  function debug_delay (line 153) | fn debug_delay() {
  constant NAME_MAPPING_COLLECTION_ID (line 172) | const NAME_MAPPING_COLLECTION_ID: CollectionId = CollectionId(0);
  constant DEFAULT_COLLECTION_ID (line 173) | const DEFAULT_COLLECTION_ID: CollectionId = CollectionId(1);
  constant INDEX_FANOUT (line 174) | const INDEX_FANOUT: usize = 64;
  constant EBR_LOCAL_GC_BUFFER_SIZE (line 175) | const EBR_LOCAL_GC_BUFFER_SIZE: usize = 128;
  function open (line 209) | pub fn open<P: AsRef<std::path::Path>>(path: P) -> std::io::Result<Db> {
  type Stats (line 214) | pub struct Stats {
  type CompareAndSwapResult (line 226) | pub type CompareAndSwapResult = std::io::Result<
  type Index (line 230) | type Index<const LEAF_FANOUT: usize> = concurrent_map::ConcurrentMap<
  type CompareAndSwapError (line 239) | pub struct CompareAndSwapError {
    method fmt (line 256) | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
  type CompareAndSwapSuccess (line 248) | pub struct CompareAndSwapSuccess {
  type ObjectId (line 275) | pub struct ObjectId(NonZeroU64);
    method new (line 278) | fn new(from: u64) -> Option<ObjectId> {
    type Target (line 284) | type Target = u64;
    method deref (line 286) | fn deref(&self) -> &u64 {
    constant MIN (line 301) | const MIN: ObjectId = ObjectId(NonZeroU64::MIN);
  type CollectionId (line 316) | pub struct CollectionId(u64);
    constant MIN (line 319) | const MIN: CollectionId = CollectionId(u64::MIN);
  type CacheBox (line 323) | struct CacheBox<const LEAF_FANOUT: usize> {
  type LogValue (line 331) | struct LogValue {
  type Object (line 337) | pub struct Object<const LEAF_FANOUT: usize> {
  method eq (line 345) | fn eq(&self, other: &Self) -> bool {
  type ShutdownDropper (line 353) | struct ShutdownDropper<const LEAF_FANOUT: usize> {
  method drop (line 361) | fn drop(&mut self) {
  function map_bound (line 386) | fn map_bound<T, U, F: FnOnce(T) -> U>(bound: Bound<T>, f: F) -> Bound<U> {
  function _assert_public_types_send_sync (line 394) | const fn _assert_public_types_send_sync() {

FILE: src/metadata_store.rs
  constant WARN (line 22) | const WARN: &str = "DO_NOT_PUT_YOUR_FILES_HERE";
  constant TMP_SUFFIX (line 23) | const TMP_SUFFIX: &str = ".tmp";
  constant LOG_PREFIX (line 24) | const LOG_PREFIX: &str = "log";
  constant SNAPSHOT_PREFIX (line 25) | const SNAPSHOT_PREFIX: &str = "snapshot";
  constant ZSTD_LEVEL (line 27) | const ZSTD_LEVEL: i32 = 3;
  type MetadataStore (line 35) | pub struct MetadataStore {
    method get_global_error_arc (line 233) | pub fn get_global_error_arc(
    method shutdown_inner (line 239) | fn shutdown_inner(&mut self) {
    method check_error (line 252) | fn check_error(&self) -> io::Result<()> {
    method set_error (line 256) | fn set_error(&self, error: &io::Error) {
    method recover (line 262) | pub fn recover<P: AsRef<Path>>(
    method recover_inner (line 345) | fn recover_inner<P: AsRef<Path>>(
    method write_batch (line 365) | pub fn write_batch(&self, batch: &[UpdateMetadata]) -> io::Result<u64> {
  method drop (line 41) | fn drop(&mut self) {
  type MetadataRecovery (line 51) | struct MetadataRecovery {
  type LogAndStats (line 57) | struct LogAndStats {
  type WorkerMessage (line 63) | enum WorkerMessage {
  function get_compactions (line 68) | fn get_compactions(
  function worker (line 103) | fn worker(
  function set_error (line 164) | fn set_error(
  function check_error (line 190) | fn check_error(
  type Inner (line 205) | struct Inner {
  method drop (line 215) | fn drop(&mut self) {
  function serialize_batch (line 430) | fn serialize_batch(batch: &[UpdateMetadata]) -> Vec<u8> {
  function read_frame (line 497) | fn read_frame(
  function read_log (line 621) | fn read_log(
  function read_snapshot (line 644) | fn read_snapshot(
  function log_path (line 665) | fn log_path(directory_path: &Path, id: u64) -> PathBuf {
  function snapshot_path (line 669) | fn snapshot_path(directory_path: &Path, id: u64, temporary: bool) -> Pat...
  function enumerate_logs_and_snapshot (line 678) | fn enumerate_logs_and_snapshot(
  function read_snapshot_and_apply_logs (line 753) | fn read_snapshot_and_apply_logs(

FILE: src/object_cache.rs
  type CacheStats (line 17) | pub struct CacheStats {
  type FlushStats (line 33) | pub struct FlushStats {
    method sum (line 45) | pub fn sum(&self, other: &FlushStats) -> FlushStats {
    method max (line 69) | pub fn max(&self, other: &FlushStats) -> FlushStats {
  type Dirty (line 94) | pub enum Dirty<const LEAF_FANOUT: usize> {
  function is_final_state (line 114) | pub fn is_final_state(&self) -> bool {
  type FlushStatTracker (line 124) | struct FlushStatTracker {
  type ReadStatTracker (line 131) | pub(crate) struct ReadStatTracker {
  type ObjectCache (line 141) | pub struct ObjectCache<const LEAF_FANOUT: usize> {
  function recover (line 171) | pub fn recover(
  function is_clean (line 229) | pub fn is_clean(&self) -> bool {
  function read (line 233) | pub fn read(&self, object_id: ObjectId) -> Option<io::Result<Vec<u8>>> {
  function stats (line 241) | pub fn stats(&self) -> CacheStats {
  function check_error (line 278) | pub fn check_error(&self) -> io::Result<()> {
  function set_error (line 290) | pub fn set_error(&self, error: &io::Error) {
  function allocate_default_node (line 314) | pub fn allocate_default_node(
  function allocate_object_id (line 335) | pub fn allocate_object_id(
  function current_flush_epoch (line 354) | pub fn current_flush_epoch(&self) -> FlushEpoch {
  function check_into_flush_epoch (line 358) | pub fn check_into_flush_epoch(&self) -> FlushEpochGuard {
  function install_dirty (line 362) | pub fn install_dirty(
  function mark_access_and_evict (line 397) | pub fn mark_access_and_evict(
  function heap_object_id_pin (line 469) | pub fn heap_object_id_pin(&self) -> ebr::Guard<'_, DeferredFree, 16, 16> {
  function flush (line 473) | pub fn flush(&self) -> io::Result<FlushStats> {
  function initialize (line 911) | fn initialize<const LEAF_FANOUT: usize>(

FILE: src/object_location_mapper.rs
  type AllocatorStats (line 14) | pub struct AllocatorStats {
  type SlabTenancy (line 22) | struct SlabTenancy {
    method objects_to_defrag (line 29) | fn objects_to_defrag(
  type ObjectLocationMapper (line 59) | pub(crate) struct ObjectLocationMapper {
    method new (line 67) | pub(crate) fn new(
    method get_max_allocated_per_slab (line 118) | pub(crate) fn get_max_allocated_per_slab(&self) -> Vec<(usize, u64)> {
    method stats (line 130) | pub(crate) fn stats(&self) -> AllocatorStats {
    method clone_object_id_allocator_arc (line 152) | pub(crate) fn clone_object_id_allocator_arc(&self) -> Arc<Allocator> {
    method allocate_object_id (line 156) | pub(crate) fn allocate_object_id(&self) -> ObjectId {
    method clone_slab_allocator_arc (line 167) | pub(crate) fn clone_slab_allocator_arc(
    method allocate_slab_slot (line 174) | pub(crate) fn allocate_slab_slot(&self, slab_id: u8) -> SlabAddress {
    method free_slab_slot (line 180) | pub(crate) fn free_slab_slot(&self, slab_address: SlabAddress) {
    method get_location_for_object (line 186) | pub(crate) fn get_location_for_object(
    method insert (line 204) | pub(crate) fn insert(
    method remove (line 244) | pub(crate) fn remove(&self, object_id: ObjectId) -> Option<SlabAddress> {
    method objects_to_defrag (line 269) | pub(crate) fn objects_to_defrag(&self) -> FnvHashSet<ObjectId> {

FILE: src/tree.rs
  type Tree (line 27) | pub struct Tree<const LEAF_FANOUT: usize = 1024> {
  method drop (line 35) | fn drop(&mut self) {
  function fmt (line 48) | fn fmt(&self, w: &mut fmt::Formatter<'_>) -> fmt::Result {
  type LeafReadGuard (line 68) | struct LeafReadGuard<'a, const LEAF_FANOUT: usize = 1024> {
  method drop (line 78) | fn drop(&mut self) {
  type LeafWriteGuard (line 106) | struct LeafWriteGuard<'a, const LEAF_FANOUT: usize = 1024> {
  function epoch (line 117) | fn epoch(&self) -> FlushEpoch {
  function handle_cache_access_and_eviction_externally (line 124) | fn handle_cache_access_and_eviction_externally(
  method drop (line 136) | fn drop(&mut self) {
  function new (line 159) | pub(crate) fn new(
  function check_error (line 170) | pub fn check_error(&self) -> io::Result<()> {
  function set_error (line 174) | fn set_error(&self, error: &io::Error) {
  function storage_stats (line 178) | pub fn storage_stats(&self) -> Stats {
  function flush (line 195) | pub fn flush(&self) -> io::Result<FlushStats> {
  function page_in (line 199) | pub(crate) fn page_in(
  function merge_leaf_into_right_sibling (line 399) | fn merge_leaf_into_right_sibling<'a>(
  function successor_leaf_mut (line 517) | fn successor_leaf_mut<'a>(
  function cooperatively_serialize_leaf (line 560) | fn cooperatively_serialize_leaf(
  function leaf_for_key (line 612) | fn leaf_for_key<'a>(
  function leaf_for_key_mut (line 689) | fn leaf_for_key_mut<'a>(
  function get (line 760) | pub fn get<K: AsRef<[u8]>>(
  function insert (line 794) | pub fn insert<K, V>(
  function remove (line 923) | pub fn remove<K: AsRef<[u8]>>(
  function compare_and_swap (line 1036) | pub fn compare_and_swap<K, OV, NV>(
  function update_and_fetch (line 1210) | pub fn update_and_fetch<K, V, F>(
  function fetch_and_update (line 1282) | pub fn fetch_and_update<K, V, F>(
  function iter (line 1307) | pub fn iter(&self) -> Iter<LEAF_FANOUT> {
  function range (line 1320) | pub fn range<K, R>(&self, range: R) -> Iter<LEAF_FANOUT>
  function apply_batch (line 1376) | pub fn apply_batch(&self, batch: Batch) -> io::Result<()> {
  function contains_key (line 1558) | pub fn contains_key<K: AsRef<[u8]>>(&self, key: K) -> io::Result<bool> {
  function get_lt (line 1604) | pub fn get_lt<K>(
  function get_gt (line 1660) | pub fn get_gt<K>(
  function scan_prefix (line 1710) | pub fn scan_prefix<P>(&self, prefix: P) -> Iter<LEAF_FANOUT>
  function first (line 1729) | pub fn first(&self) -> io::Result<Option<(InlineArray, InlineArray)>> {
  function last (line 1735) | pub fn last(&self) -> io::Result<Option<(InlineArray, InlineArray)>> {
  function pop_last (line 1763) | pub fn pop_last(&self) -> io::Result<Option<(InlineArray, InlineArray)>> {
  function pop_last_in_range (line 1829) | pub fn pop_last_in_range<K, R>(
  function pop_first (line 1877) | pub fn pop_first(&self) -> io::Result<Option<(InlineArray, InlineArray)>> {
  function pop_first_in_range (line 1938) | pub fn pop_first_in_range<K, R>(
  function len (line 1977) | pub fn len(&self) -> io::Result<usize> {
  function is_empty (line 1990) | pub fn is_empty(&self) -> io::Result<bool> {
  function clear (line 2004) | pub fn clear(&self) -> io::Result<()> {
  function checksum (line 2017) | pub fn checksum(&self) -> io::Result<u32> {
  type Iter (line 2029) | pub struct Iter<const LEAF_FANOUT: usize> {
  type Item (line 2041) | type Item = io::Result<(InlineArray, InlineArray)>;
  method next (line 2043) | fn next(&mut self) -> Option<Self::Item> {
  method next_back (line 2087) | fn next_back(&mut self) -> Option<Self::Item> {
  function keys (line 2170) | pub fn keys(
  function values (line 2176) | pub fn values(
  type Item (line 2184) | type Item = io::Result<(InlineArray, InlineArray)>;
  type IntoIter (line 2185) | type IntoIter = Iter<LEAF_FANOUT>;
  method into_iter (line 2187) | fn into_iter(self) -> Self::IntoIter {
  type Batch (line 2219) | pub struct Batch {
    method insert (line 2226) | pub fn insert<K, V>(&mut self, key: K, value: V)
    method remove (line 2235) | pub fn remove<K>(&mut self, key: K)
    method get (line 2244) | pub fn get<K: AsRef<[u8]>>(&self, k: K) -> Option<Option<&InlineArray>> {

FILE: tests/00_regression.rs
  type ShredAllocator (line 12) | struct ShredAllocator;
    method alloc (line 15) | unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    method dealloc (line 25) | unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
  constant INTENSITY (line 34) | const INTENSITY: usize = 10;
  function tree_bug_00 (line 38) | fn tree_bug_00() {
  function tree_bug_01 (line 45) | fn tree_bug_01() {
  function tree_bug_02 (line 70) | fn tree_bug_02() {
  function tree_bug_03 (line 98) | fn tree_bug_03() {
  function tree_bug_04 (line 122) | fn tree_bug_04() {
  function tree_bug_05 (line 148) | fn tree_bug_05() {
  function tree_bug_06 (line 170) | fn tree_bug_06() {
  function tree_bug_07 (line 194) | fn tree_bug_07() {
  function tree_bug_08 (line 218) | fn tree_bug_08() {
  function tree_bug_09 (line 241) | fn tree_bug_09() {
  function tree_bug_10 (line 271) | fn tree_bug_10() {
  function tree_bug_11 (line 310) | fn tree_bug_11() {
  function tree_bug_12 (line 336) | fn tree_bug_12() {
  function tree_bug_13 (line 387) | fn tree_bug_13() {
  function tree_bug_14 (line 418) | fn tree_bug_14() {
  function tree_bug_15 (line 439) | fn tree_bug_15() {
  function tree_bug_17 (line 472) | fn tree_bug_17() {
  function tree_bug_18 (line 489) | fn tree_bug_18() {
  function tree_bug_19 (line 509) | fn tree_bug_19() {
  function tree_bug_20 (line 529) | fn tree_bug_20() {
  function tree_bug_21 (line 550) | fn tree_bug_21() {
  function tree_bug_23 (line 592) | fn tree_bug_23() {
  function tree_bug_31 (line 946) | fn tree_bug_31() {
  function tree_bug_32 (line 970) | fn tree_bug_32() {
  function tree_bug_33 (line 984) | fn tree_bug_33() {
  function tree_bug_35 (line 1026) | fn tree_bug_35() {
  function tree_bug_37 (line 1086) | fn tree_bug_37() {
  function tree_bug_40 (line 1396) | fn tree_bug_40() {
  function tree_bug_41 (line 1409) | fn tree_bug_41() {
  function tree_bug_43 (line 1452) | fn tree_bug_43() {
  function tree_bug_46 (line 1530) | fn tree_bug_46() {
  function tree_bug_47 (line 1542) | fn tree_bug_47() {
  function tree_bug_48 (line 1554) | fn tree_bug_48() {
  function tree_bug_50 (line 1613) | fn tree_bug_50() {
  function tree_bug_51 (line 1636) | fn tree_bug_51() {
  function tree_bug_52 (line 1648) | fn tree_bug_52() {

FILE: tests/common/mod.rs
  type Alloc (line 14) | struct Alloc;
    method alloc (line 17) | unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    method dealloc (line 24) | unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
  function setup_logger (line 31) | pub fn setup_logger() {
  function cleanup (line 60) | pub fn cleanup(dir: &str) {

FILE: tests/concurrent_batch_atomicity.rs
  constant CONCURRENCY (line 6) | const CONCURRENCY: usize = 32;
  constant N_KEYS (line 7) | const N_KEYS: usize = 1024;
  type Db (line 9) | type Db = SledDb<8>;
  function batch_writer (line 11) | fn batch_writer(db: Db, barrier: Arc<Barrier>, thread_number: usize) {
  function concurrent_batch_atomicity (line 23) | fn concurrent_batch_atomicity() {

FILE: tests/crash_tests/crash_batches.rs
  constant CACHE_SIZE (line 7) | const CACHE_SIZE: usize = 1024 * 1024;
  constant BATCH_SIZE (line 8) | const BATCH_SIZE: u32 = 8;
  constant SEGMENT_SIZE (line 9) | const SEGMENT_SIZE: usize = 1024;
  function verify_batches (line 13) | fn verify_batches(tree: &Db) -> u32 {
  function run_batches_inner (line 63) | fn run_batches_inner(db: Db) {
  function run_crash_batches (line 95) | pub fn run_crash_batches() {

FILE: tests/crash_tests/crash_heap.rs
  constant FANOUT (line 3) | const FANOUT: usize = 3;
  function run_crash_heap (line 5) | pub fn run_crash_heap() {

FILE: tests/crash_tests/crash_iter.rs
  constant CACHE_SIZE (line 6) | const CACHE_SIZE: usize = 256;
  function run_crash_iter (line 8) | pub fn run_crash_iter() {

FILE: tests/crash_tests/crash_metadata_store.rs
  function run_crash_metadata_store (line 3) | pub fn run_crash_metadata_store() {

FILE: tests/crash_tests/crash_object_cache.rs
  constant FANOUT (line 3) | const FANOUT: usize = 3;
  function run_crash_object_cache (line 5) | pub fn run_crash_object_cache() {

FILE: tests/crash_tests/crash_sequential_writes.rs
  constant CACHE_SIZE (line 5) | const CACHE_SIZE: usize = 1024 * 1024;
  constant CYCLE (line 6) | const CYCLE: usize = 256;
  constant SEGMENT_SIZE (line 7) | const SEGMENT_SIZE: usize = 1024;
  function verify (line 12) | fn verify(tree: &Db) -> (u32, u32) {
  function run_inner (line 79) | fn run_inner(config: Config) {
  function run_crash_sequential_writes (line 117) | pub fn run_crash_sequential_writes() {

FILE: tests/crash_tests/crash_tx.rs
  constant CACHE_SIZE (line 3) | const CACHE_SIZE: usize = 1024 * 1024;
  function run_crash_tx (line 5) | pub fn run_crash_tx() {

FILE: tests/crash_tests/mod.rs
  type Db (line 28) | type Db = SledDb<8>;
  constant SEQUENTIAL_WRITES_DIR (line 31) | pub const SEQUENTIAL_WRITES_DIR: &str = "crash_sequential_writes";
  constant BATCHES_DIR (line 32) | pub const BATCHES_DIR: &str = "crash_batches";
  constant ITER_DIR (line 33) | pub const ITER_DIR: &str = "crash_iter";
  constant TX_DIR (line 34) | pub const TX_DIR: &str = "crash_tx";
  constant METADATA_STORE_DIR (line 35) | pub const METADATA_STORE_DIR: &str = "crash_metadata_store";
  constant HEAP_DIR (line 36) | pub const HEAP_DIR: &str = "crash_heap";
  constant OBJECT_CACHE_DIR (line 37) | pub const OBJECT_CACHE_DIR: &str = "crash_object_cache";
  constant CRASH_DIR (line 39) | const CRASH_DIR: &str = "crash_test_files";
  function spawn_killah (line 41) | fn spawn_killah() {
  function u32_to_vec (line 49) | fn u32_to_vec(u: u32) -> Vec<u8> {
  function slice_to_u32 (line 54) | fn slice_to_u32(b: &[u8]) -> u32 {
  function tree_to_string (line 61) | fn tree_to_string(tree: &Db) -> String {

FILE: tests/test_crash_recovery.rs
  constant TEST_ENV_VAR (line 11) | const TEST_ENV_VAR: &str = "SLED_CRASH_TEST";
  constant N_TESTS (line 12) | const N_TESTS: usize = 100;
  constant TESTS (line 14) | const TESTS: [&str; 7] = [
  constant CRASH_CHANCE (line 24) | const CRASH_CHANCE: u32 = 250;
  type ShredAllocator (line 30) | struct ShredAllocator;
    method alloc (line 33) | unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    method dealloc (line 43) | unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
  function main (line 51) | fn main() {
  function run_child_process (line 134) | fn run_child_process(dir: &str) {
  function supervisor (line 166) | fn supervisor(dir: &str) {

FILE: tests/test_quiescent.rs
  function quiescent_cpu_time (line 10) | fn quiescent_cpu_time() {

FILE: tests/test_space_leaks.rs
  function size_leak (line 7) | fn size_leak() -> io::Result<()> {

FILE: tests/test_tree.rs
  type Db (line 21) | type Db = SledDb<3>;
  constant N_THREADS (line 28) | const N_THREADS: usize = 32;
  constant N_PER_THREAD (line 29) | const N_PER_THREAD: usize = 10_000;
  constant N (line 30) | const N: usize = N_THREADS * N_PER_THREAD;
  constant SPACE (line 31) | const SPACE: usize = N;
  constant INTENSITY (line 34) | const INTENSITY: usize = 1;
  function kv (line 36) | fn kv(i: usize) -> InlineArray {
  function monotonic_inserts (line 44) | fn monotonic_inserts() {
  function fixed_stride_inserts (line 88) | fn fixed_stride_inserts() {
  function sequential_inserts (line 138) | fn sequential_inserts() {
  function reverse_inserts (line 158) | fn reverse_inserts() {
  function very_large_reverse_tree_iterator (line 179) | fn very_large_reverse_tree_iterator() {
  function varied_compression_ratios (line 195) | fn varied_compression_ratios() {
  function test_pop_first (line 224) | fn test_pop_first() -> io::Result<()> {
  function test_pop_last_in_range (line 247) | fn test_pop_last_in_range() -> io::Result<()> {
  function test_interleaved_gets_sets (line 282) | fn test_interleaved_gets_sets() {
  function concurrent_tree_pops (line 318) | fn concurrent_tree_pops() -> std::io::Result<()> {
  function concurrent_tree_ops (line 356) | fn concurrent_tree_ops() {
  function concurrent_tree_iter (line 499) | fn concurrent_tree_iter() -> io::Result<()> {
  function tree_subdir (line 884) | fn tree_subdir() {
  function tree_small_keys_iterator (line 916) | fn tree_small_keys_iterator() {
  function tree_big_keys_iterator (line 956) | fn tree_big_keys_iterator() {
  function tree_range (line 1086) | fn tree_range() {
  function recover_tree (line 1136) | fn recover_tree() {
  function tree_gc (line 1173) | fn tree_gc() {
  function contains_tree (line 1278) | fn contains_tree() {
  function tree_import_export (line 1296) | fn tree_import_export() -> io::Result<()> {
  function quickcheck_tree_matches_btreemap (line 1365) | fn quickcheck_tree_matches_btreemap() {

FILE: tests/test_tree_failpoints.rs
  constant SEGMENT_SIZE (line 13) | const SEGMENT_SIZE: usize = 256;
  constant BATCH_COUNTER_KEY (line 14) | const BATCH_COUNTER_KEY: &[u8] = b"batch_counter";
  type Op (line 17) | enum Op {
  type BatchOp (line 28) | enum BatchOp {
  method arbitrary (line 34) | fn arbitrary<G: Gen>(g: &mut G) -> BatchOp {
  method arbitrary (line 46) | fn arbitrary<G: Gen>(g: &mut G) -> Op {
  method shrink (line 97) | fn shrink(&self) -> Box<dyn Iterator<Item = Op>> {
  function v (line 133) | fn v(b: &[u8]) -> u16 {
  function value_factory (line 140) | fn value_factory(set_counter: u16) -> Vec<u8> {
  function tear_down_failpoints (line 155) | fn tear_down_failpoints() {
  type ReferenceVersion (line 160) | struct ReferenceVersion {
  type ReferenceEntry (line 166) | struct ReferenceEntry {
  function prop_tree_crashes_nicely (line 171) | fn prop_tree_crashes_nicely(ops: Vec<Op>, flusher: bool) -> bool {
  function run_tree_crashes_nicely (line 203) | fn run_tree_crashes_nicely(ops: Vec<Op>, flusher: bool) -> bool {
  function quickcheck_tree_with_failpoints (line 510) | fn quickcheck_tree_with_failpoints() {
  function failpoints_bug_01 (line 528) | fn failpoints_bug_01() {
  function failpoints_bug_02 (line 538) | fn failpoints_bug_02() {
  function failpoints_bug_03 (line 553) | fn failpoints_bug_03() {
  function failpoints_bug_04 (line 567) | fn failpoints_bug_04() {
  function failpoints_bug_05 (line 584) | fn failpoints_bug_05() {
  function failpoints_bug_06 (line 608) | fn failpoints_bug_06() {
  function failpoints_bug_07 (line 630) | fn failpoints_bug_07() {
  function failpoints_bug_08 (line 666) | fn failpoints_bug_08() {
  function failpoints_bug_09 (line 699) | fn failpoints_bug_09() {
  function failpoints_bug_10 (line 754) | fn failpoints_bug_10() {
  function failpoints_bug_11 (line 966) | fn failpoints_bug_11() {
  function failpoints_bug_12 (line 1007) | fn failpoints_bug_12() {
  function failpoints_bug_13 (line 1032) | fn failpoints_bug_13() {
  function failpoints_bug_14 (line 1076) | fn failpoints_bug_14() {
  function failpoints_bug_15 (line 1108) | fn failpoints_bug_15() {
  function failpoints_bug_16 (line 1118) | fn failpoints_bug_16() {
  function failpoints_bug_17 (line 1128) | fn failpoints_bug_17() {
  function failpoints_bug_18 (line 1165) | fn failpoints_bug_18() {
  function failpoints_bug_19 (line 1175) | fn failpoints_bug_19() {
  function failpoints_bug_20 (line 1219) | fn failpoints_bug_20() {
  function failpoints_bug_21 (line 1231) | fn failpoints_bug_21() {
  function failpoints_bug_22 (line 1306) | fn failpoints_bug_22() {
  function failpoints_bug_23 (line 1316) | fn failpoints_bug_23() {
  function failpoints_bug_24 (line 1332) | fn failpoints_bug_24() {
  function failpoints_bug_25 (line 1343) | fn failpoints_bug_25() {
  function failpoints_bug_26 (line 1406) | fn failpoints_bug_26() {
  function failpoints_bug_27 (line 1479) | fn failpoints_bug_27() {
  function failpoints_bug_28 (line 1520) | fn failpoints_bug_28() {
  function failpoints_bug_29 (line 1557) | fn failpoints_bug_29() {
  function failpoints_bug_30 (line 1586) | fn failpoints_bug_30() {
  function failpoints_bug_31 (line 1602) | fn failpoints_bug_31() {
  function failpoints_bug_32 (line 1621) | fn failpoints_bug_32() {
  function failpoints_bug_33 (line 1633) | fn failpoints_bug_33() {
  function failpoints_bug_34 (line 1703) | fn failpoints_bug_34() {
  function failpoints_bug_35 (line 1741) | fn failpoints_bug_35() {
  function failpoints_bug_36 (line 1767) | fn failpoints_bug_36() {
  function failpoints_bug_37 (line 1803) | fn failpoints_bug_37() {
  function failpoints_bug_38 (line 1820) | fn failpoints_bug_38() {
  function failpoints_bug_39 (line 1839) | fn failpoints_bug_39() {

FILE: tests/tree/mod.rs
  type Db (line 8) | type Db = SledDb<3>;
  type Key (line 11) | pub struct Key(pub Vec<u8>);
    method fmt (line 14) | fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
  function range (line 28) | fn range(g: &mut Gen, min_inclusive: usize, max_exclusive: usize) -> usi...
  method arbitrary (line 40) | fn arbitrary(g: &mut Gen) -> Self {
  method shrink (line 64) | fn shrink(&self) -> Box<dyn Iterator<Item = Self>> {
  type Op (line 77) | pub enum Op {
  method arbitrary (line 92) | fn arbitrary(g: &mut Gen) -> Self {
  method shrink (line 112) | fn shrink(&self) -> Box<dyn Iterator<Item = Self>> {
  function bytes_to_u16 (line 134) | fn bytes_to_u16(v: &[u8]) -> u16 {
  function u16_to_bytes (line 139) | fn u16_to_bytes(u: u16) -> Vec<u8> {
  function prop_tree_matches_btreemap (line 158) | pub fn prop_tree_matches_btreemap(
  function prop_tree_matches_btreemap_inner (line 177) | fn prop_tree_matches_btreemap_inner(

Condensed preview of 63 files, each showing path, character count, and a content snippet (full structured content: 556K chars).

[
  {
    "path": ".github/FUNDING.yml",
    "chars": 720,
    "preview": "# These are supported funding model platforms\n\ngithub: spacejam # Replace with up to 4 GitHub Sponsors-enabled usernames"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/blank_issue.md",
    "chars": 126,
    "preview": "---\nname: Blank Issue (do not use this for bug reports or feature requests)\nabout: Create an issue with a blank template"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bugs.md",
    "chars": 512,
    "preview": "---\nname: Bug Report\nabout: Report a correctness issue or violated expectation\nlabels: bug\n---\n\nBug reports must include"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/config.yml",
    "chars": 161,
    "preview": "blank_issues_enabled: true\ncontact_links:\n  - name: sled discord\n    url: https://discord.gg/Z6VsXds\n    about: Please a"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "chars": 187,
    "preview": "---\nname: Feature Request\nabout: Request a feature for sled\nlabels: feature\n---\n\n#### Use Case:\n\n#### Proposed Change:\n\n"
  },
  {
    "path": ".github/dependabot.yml",
    "chars": 272,
    "preview": "version: 2\nupdates:\n- package-ecosystem: cargo\n  directory: \"/\"\n  schedule:\n    interval: daily\n    time: \"10:00\"\n  open"
  },
  {
    "path": ".github/workflows/test.yml",
    "chars": 4828,
    "preview": "name: Rust\n\non:\n  pull_request:\n    branches:\n    - main\n\njobs:\n  clippy_check:\n    runs-on: ubuntu-latest\n    steps:\n  "
  },
  {
    "path": ".gitignore",
    "chars": 253,
    "preview": "CLAUDE.md\nfuzz-*.log\ndefault.sled\ntiming_test*\n*db\ncrash_test_files\n*conf\n*snap.*\n*grind.out*\nvgcore*\n*.bk\n*orig\ntags\npe"
  },
  {
    "path": ".rustfmt.toml",
    "chars": 155,
    "preview": "version = \"Two\"\nuse_small_heuristics = \"Max\"\nreorder_imports = true\nmax_width = 80\nwrap_comments = true\ncombine_control_"
  },
  {
    "path": "ARCHITECTURE.md",
    "chars": 6457,
    "preview": "<table style=\"width:100%\">\n<tr>\n  <td>\n    <table style=\"width:100%\">\n      <tr>\n        <td> key </td>\n        <td> val"
  },
  {
    "path": "CHANGELOG.md",
    "chars": 12820,
    "preview": "# Unreleased\n\n## New Features\n\n* #1178 batches and transactions are now unified for subscribers.\n* #1231 `Tree::get_zero"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 2286,
    "preview": "# Welcome to the Project :)\n\n* Don't be a jerk - here's our [code of conduct](./code-of-conduct.md).\n  We have a track r"
  },
  {
    "path": "Cargo.toml",
    "chars": 2156,
    "preview": "[package]\nname = \"sled\"\nversion = \"1.0.0-alpha.124\"\nedition = \"2024\"\nauthors = [\"Tyler Neely <tylerneely@gmail.com>\"]\ndo"
  },
  {
    "path": "LICENSE-APACHE",
    "chars": 11581,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "LICENSE-MIT",
    "chars": 1303,
    "preview": "Copyright (c) 2015 Tyler Neely\nCopyright (c) 2016 Tyler Neely\nCopyright (c) 2017 Tyler Neely\nCopyright (c) 2018 Tyler Ne"
  },
  {
    "path": "README.md",
    "chars": 9678,
    "preview": "\n<table style=\"width:100%\">\n<tr>\n  <td>\n    <table style=\"width:100%\">\n      <tr>\n        <td> key </td>\n        <td> va"
  },
  {
    "path": "RELEASE_CHECKLIST.md",
    "chars": 1350,
    "preview": "# Release Checklist\n\nThis checklist must be completed before publishing a release of any kind.\n\nOver time, anything in t"
  },
  {
    "path": "SAFETY.md",
    "chars": 3062,
    "preview": "# sled safety model\n\nThis document applies\n[STPA](http://psas.scripts.mit.edu/home/get_file.php?name=STPA_handbook.pdf)-"
  },
  {
    "path": "SECURITY.md",
    "chars": 813,
    "preview": "# Security Policy\n\n## Reporting a Vulnerability\n\nsled uses some unsafe functionality in the core lock-free algorithms, a"
  },
  {
    "path": "art/CREDITS",
    "chars": 104,
    "preview": "original tree logo with face:\n  https://twitter.com/daiyitastic\n\nanti-transphobia additions:\n  spacejam\n"
  },
  {
    "path": "code-of-conduct.md",
    "chars": 3250,
    "preview": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nIn the interest of fostering an open and welcoming environment, w"
  },
  {
    "path": "examples/bench.rs",
    "chars": 16246,
    "preview": "use std::path::Path;\nuse std::sync::Barrier;\nuse std::thread::scope;\nuse std::time::{Duration, Instant};\nuse std::{fs, i"
  },
  {
    "path": "fuzz/.gitignore",
    "chars": 24,
    "preview": "target\ncorpus\nartifacts\n"
  },
  {
    "path": "fuzz/Cargo.toml",
    "chars": 548,
    "preview": "[package]\nname = \"bloodstone-fuzz\"\nversion = \"0.0.0\"\nauthors = [\"Automatically generated\"]\npublish = false\nedition = \"20"
  },
  {
    "path": "fuzz/fuzz_targets/fuzz_model.rs",
    "chars": 4505,
    "preview": "#![no_main]\n#[macro_use]\nextern crate libfuzzer_sys;\nextern crate arbitrary;\nextern crate sled;\n\nuse arbitrary::Arbitrar"
  },
  {
    "path": "scripts/cgtest.sh",
    "chars": 380,
    "preview": "#!/bin/sh\nset -e\n\ncgdelete memory:sledTest || true\ncgcreate -g memory:sledTest\necho 100M > /sys/fs/cgroup/memory/sledTes"
  },
  {
    "path": "scripts/cross_compile.sh",
    "chars": 620,
    "preview": "#!/bin/sh\nset -e\n\n# checks sled's compatibility using several targets\n\ntargets=\"wasm32-wasi wasm32-unknown-unknown aarch"
  },
  {
    "path": "scripts/execution_explorer.py",
    "chars": 6138,
    "preview": "#!/usr/bin/gdb --command\n\n\"\"\"\na simple python GDB script for running multithreaded\nprograms in a way that is \"determinis"
  },
  {
    "path": "scripts/instructions",
    "chars": 897,
    "preview": "#!/bin/sh\n# counts instructions for a standard workload\nset -e\n\nOUTFILE=\"cachegrind.stress2.`git describe --always --dir"
  },
  {
    "path": "scripts/sanitizers.sh",
    "chars": 1726,
    "preview": "#!/bin/bash\nset -eo pipefail\n\npushd benchmarks/stress2\n\nrustup toolchain install nightly\nrustup toolchain install nightl"
  },
  {
    "path": "scripts/shufnice.sh",
    "chars": 216,
    "preview": "#!/bin/sh\n\nwhile true; do\n  PID=`pgrep $1`\n  TIDS=`ls /proc/$PID/task`\n  TID=`echo $TIDS |  tr \" \" \"\\n\" | shuf -n1`\n  NI"
  },
  {
    "path": "scripts/ubuntu_bench",
    "chars": 575,
    "preview": "#!/bin/sh\n\nsudo apt-get update\nsudo apt-get install htop dstat build-essential linux-tools-common linux-tools-generic li"
  },
  {
    "path": "src/alloc.rs",
    "chars": 2519,
    "preview": "#[cfg(any(\n    feature = \"testing-shred-allocator\",\n    feature = \"testing-count-allocator\"\n))]\npub use alloc::*;\n\n// th"
  },
  {
    "path": "src/block_checker.rs",
    "chars": 1826,
    "preview": "use std::collections::BTreeMap;\nuse std::panic::Location;\nuse std::sync::atomic::{AtomicU64, Ordering};\nuse std::sync::{"
  },
  {
    "path": "src/config.rs",
    "chars": 3800,
    "preview": "use std::io;\nuse std::path::{Path, PathBuf};\nuse std::sync::Arc;\n\nuse fault_injection::{annotate, fallible};\nuse tempdir"
  },
  {
    "path": "src/db.rs",
    "chars": 19568,
    "preview": "use std::collections::HashMap;\nuse std::fmt;\nuse std::io;\nuse std::sync::{Arc, mpsc};\nuse std::time::{Duration, Instant}"
  },
  {
    "path": "src/event_verifier.rs",
    "chars": 5521,
    "preview": "use std::collections::BTreeMap;\nuse std::sync::Mutex;\n\nuse crate::{FlushEpoch, ObjectId};\n\n#[derive(Debug, Clone, Copy, "
  },
  {
    "path": "src/flush_epoch.rs",
    "chars": 10995,
    "preview": "use std::num::NonZeroU64;\nuse std::sync::atomic::{AtomicPtr, AtomicU64, Ordering};\nuse std::sync::{Arc, Condvar, Mutex};"
  },
  {
    "path": "src/heap.rs",
    "chars": 33983,
    "preview": "use std::fmt;\nuse std::fs;\nuse std::io::{self, Read};\nuse std::num::NonZeroU64;\nuse std::path::{Path, PathBuf};\nuse std:"
  },
  {
    "path": "src/id_allocator.rs",
    "chars": 4699,
    "preview": "use std::collections::BTreeSet;\nuse std::sync::atomic::{AtomicU64, Ordering};\nuse std::sync::Arc;\n\nuse crossbeam_queue::"
  },
  {
    "path": "src/leaf.rs",
    "chars": 10059,
    "preview": "use crate::*;\n\n#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]\npub(crate) struct Leaf<const LEAF_FANOUT: u"
  },
  {
    "path": "src/lib.rs",
    "chars": 13106,
    "preview": "// 1.0 blockers\n//\n// bugs\n// * page-out needs to be deferred until after any flush of the dirty epoch\n//   * need to re"
  },
  {
    "path": "src/metadata_store.rs",
    "chars": 26877,
    "preview": "use std::collections::BTreeSet;\nuse std::fs;\nuse std::io::{self, Read, Write};\nuse std::num::NonZeroU64;\nuse std::path::"
  },
  {
    "path": "src/object_cache.rs",
    "chars": 35480,
    "preview": "use std::cell::RefCell;\nuse std::collections::HashMap;\nuse std::io;\nuse std::sync::Arc;\nuse std::sync::atomic::{AtomicPt"
  },
  {
    "path": "src/object_location_mapper.rs",
    "chars": 9335,
    "preview": "use std::num::NonZeroU64;\nuse std::sync::Arc;\nuse std::sync::atomic::{AtomicU64, Ordering};\n\nuse fnv::FnvHashSet;\nuse pa"
  },
  {
    "path": "src/tree.rs",
    "chars": 74642,
    "preview": "use std::collections::{BTreeMap, VecDeque};\nuse std::fmt;\nuse std::hint;\nuse std::io;\nuse std::mem::ManuallyDrop;\nuse st"
  },
  {
    "path": "tests/00_regression.rs",
    "chars": 59253,
    "preview": "mod common;\nmod tree;\n\nuse std::alloc::{Layout, System};\n\nuse tree::{Key, Op::*, prop_tree_matches_btreemap};\n\n#[global_"
  },
  {
    "path": "tests/common/mod.rs",
    "chars": 1813,
    "preview": "// the memshred feature causes all allocated and deallocated\n// memory to be set to a specific non-zero value of 0xa1 fo"
  },
  {
    "path": "tests/concurrent_batch_atomicity.rs",
    "chars": 2071,
    "preview": "use std::sync::{Arc, Barrier};\nuse std::thread;\n\nuse sled::{Config, Db as SledDb};\n\nconst CONCURRENCY: usize = 32;\nconst"
  },
  {
    "path": "tests/crash_tests/crash_batches.rs",
    "chars": 3936,
    "preview": "use std::thread;\n\nuse rand::Rng;\n\nuse super::*;\n\nconst CACHE_SIZE: usize = 1024 * 1024;\nconst BATCH_SIZE: u32 = 8;\nconst"
  },
  {
    "path": "tests/crash_tests/crash_heap.rs",
    "chars": 340,
    "preview": "use super::*;\n\nconst FANOUT: usize = 3;\n\npub fn run_crash_heap() {\n    let path = std::path::Path::new(CRASH_DIR).join(H"
  },
  {
    "path": "tests/crash_tests/crash_iter.rs",
    "chars": 5493,
    "preview": "use std::sync::{Arc, Barrier};\nuse std::thread;\n\nuse super::*;\n\nconst CACHE_SIZE: usize = 256;\n\npub fn run_crash_iter() "
  },
  {
    "path": "tests/crash_tests/crash_metadata_store.rs",
    "chars": 194,
    "preview": "use super::*;\n\npub fn run_crash_metadata_store() {\n    let (metadata_store, recovered) =\n        MetadataStore::recover("
  },
  {
    "path": "tests/crash_tests/crash_object_cache.rs",
    "chars": 390,
    "preview": "use super::*;\n\nconst FANOUT: usize = 3;\n\npub fn run_crash_object_cache() {\n    let path = std::path::Path::new(CRASH_DIR"
  },
  {
    "path": "tests/crash_tests/crash_sequential_writes.rs",
    "chars": 3679,
    "preview": "use std::thread;\n\nuse super::*;\n\nconst CACHE_SIZE: usize = 1024 * 1024;\nconst CYCLE: usize = 256;\nconst SEGMENT_SIZE: us"
  },
  {
    "path": "tests/crash_tests/crash_tx.rs",
    "chars": 3053,
    "preview": "use super::*;\n\nconst CACHE_SIZE: usize = 1024 * 1024;\n\npub fn run_crash_tx() {\n    let config = Config::new()\n        .c"
  },
  {
    "path": "tests/crash_tests/mod.rs",
    "chars": 1881,
    "preview": "use std::mem::size_of;\nuse std::process::exit;\nuse std::thread;\nuse std::time::Duration;\n\nuse rand::Rng;\n\nuse sled::{\n  "
  },
  {
    "path": "tests/test_crash_recovery.rs",
    "chars": 4933,
    "preview": "mod common;\nmod crash_tests;\n\nuse std::alloc::{Layout, System};\nuse std::env::{self, VarError};\nuse std::process::Comman"
  },
  {
    "path": "tests/test_quiescent.rs",
    "chars": 2105,
    "preview": "#![cfg(all(target_os = \"linux\", not(miri)))]\n\nmod common;\n\nuse std::time::{Duration, Instant};\n\nuse common::cleanup;\n\n#["
  },
  {
    "path": "tests/test_space_leaks.rs",
    "chars": 505,
    "preview": "use std::io;\n\nmod common;\n\n#[test]\n#[cfg_attr(miri, ignore)]\nfn size_leak() -> io::Result<()> {\n    common::setup_logger"
  },
  {
    "path": "tests/test_tree.rs",
    "chars": 37768,
    "preview": "mod common;\nmod tree;\n\nuse std::{\n    io,\n    sync::{\n        Arc, Barrier,\n        atomic::{AtomicBool, AtomicUsize, Or"
  },
  {
    "path": "tests/test_tree_failpoints.rs",
    "chars": 49123,
    "preview": "#![cfg(feature = \"failpoints\")]\nmod common;\n\nuse std::collections::BTreeMap;\nuse std::convert::TryInto;\nuse std::sync::M"
  },
  {
    "path": "tests/tree/mod.rs",
    "chars": 10900,
    "preview": "use std::{collections::BTreeMap, convert::TryInto, fmt, panic};\n\nuse quickcheck::{Arbitrary, Gen};\nuse rand_distr::{Dist"
  }
]

About this extraction

This page contains the full source code of the spacejam/sled GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 63 files (521.3 KB), approximately 134.6k tokens, and a symbol index with 566 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract, a free GitHub repo-to-text converter for AI. Built by Nikandr Surkov.
