[
  {
    "path": ".github/CODEOWNERS",
    "content": "# CODEOWNERS info: https://help.github.com/en/articles/about-code-owners\n# Owners are automatically requested for review for PRs that changes code\n# that they own.\n* @dgraph-io/maintainers\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "content": "---\nname: Bug report\nabout: Create a report to help us improve\ntitle: \"\"\nlabels: bug\nassignees: \"\"\n---\n\n## Describe the bug\n\nA clear and concise description of what the bug is.\n\n## To Reproduce\n\nSteps to reproduce the behavior:\n\n1. Go to '...'\n2. Click on '....'\n3. Scroll down to '....'\n4. See error\n\n## Expected behavior\n\nA clear and concise description of what you expected to happen.\n\n## Screenshots\n\nIf applicable, add screenshots to help explain your problem.\n\n## Environment\n\n- OS: [e.g. macOS, Windows, Ubuntu]\n- Language [e.g. AssemblyScript, Go]\n- Version [e.g. v0.xx]\n\n## Additional context\n\nAdd any other context about the problem here.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/config.yml",
    "content": "blank_issues_enabled: false\ncontact_links:\n  - name: Badger Community Support\n    url: https://github.com/orgs/dgraph-io/discussions\n    about: Please ask and answer questions here\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "content": "---\nname: Feature request\nabout: Suggest an idea for this project\ntitle: \"\"\nlabels: \"\"\nassignees: \"\"\n---\n\n## Is your feature request related to a problem? Please describe\n\nA clear and concise description of what the problem is. Ex. I'm always frustrated when [...]\n\n## Describe the solution you'd like\n\nA clear and concise description of what you want to happen.\n\n## Describe alternatives you've considered\n\nA clear and concise description of any alternative solutions or features you've considered.\n\n## Additional context\n\nAdd any other context or screenshots about the feature request here.\n"
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE.md",
    "content": "**Description**\n\nPlease explain the changes you made here.\n\n**Checklist**\n\n- [ ] Code compiles correctly and linting passes locally\n- [ ] Tests added for new functionality, or regression tests for bug fixes added as applicable\n\n**Instructions**\n\n- The PR title should follow the [Conventional Commits](https://www.conventionalcommits.org/)\n  syntax, leading with `fix:`, `feat:`, `chore:`, `ci:`, etc.\n- The description should briefly explain what the PR is about. In the case of a bugfix, describe or\n  link to the bug.\n- In the checklist section, check the boxes in that are applicable, using `[x]` syntax.\n  - If not applicable, remove the entire line. Only leave the box unchecked if you intend to come\n    back and check the box later.\n- Delete the `Instructions` line and everything below it, to indicate you have read and are\n  following these instructions. 🙂\n\nThank you for your contribution to Badger!\n"
  },
  {
    "path": ".github/renovate.json",
    "content": "{\n  \"$schema\": \"https://docs.renovatebot.com/renovate-schema.json\",\n  \"extends\": [\"local>dgraph-io/renovate-config\"],\n  \"rangeStrategy\": \"widen\",\n  \"packageRules\": [\n    { \"matchManagers\": [\"gomod\"], \"matchPackageNames\": [\"go\"], \"enabled\": false },\n    { \"matchManagers\": [\"gomod\"], \"matchDepNames\": [\"go\"], \"enabled\": false }\n  ],\n  \"ignoreDeps\": [\"go\"]\n}\n"
  },
  {
    "path": ".github/workflows/cd-badger.yml",
    "content": "name: cd-badger\n\non:\n  workflow_dispatch:\n    inputs:\n      releasetag:\n        description: releasetag\n        required: true\n        type: string\n\npermissions:\n  contents: read\n\njobs:\n  badger-build-amd64:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v5\n        with:\n          ref: \"${{ github.event.inputs.releasetag }}\"\n      - name: Set up Go\n        uses: actions/setup-go@v6\n        with:\n          go-version-file: go.mod\n      - name: Set Badger Release Version\n        run: |\n          #!/bin/bash\n          GIT_TAG_NAME='${{ github.event.inputs.releasetag }}'\n          if [[ \"$GIT_TAG_NAME\" == \"v\"* ]];\n          then\n            echo \"this is a release tag\"\n          else\n            echo \"this is NOT a release tag\"\n            exit 1\n          fi\n          BADGER_RELEASE_VERSION='${{ github.event.inputs.releasetag }}'\n          echo \"making a new release for \"$BADGER_RELEASE_VERSION\n          echo \"BADGER_RELEASE_VERSION=$BADGER_RELEASE_VERSION\" >> $GITHUB_ENV\n      - name: Fetch dependencies\n        run: sudo apt-get update && sudo apt-get -y install build-essential\n      - name: Build badger linux/amd64\n        run: make badger\n      - name: Generate SHA for Linux Build\n        run:\n          cd badger && sha256sum badger-linux-amd64 | cut -c-64 > badger-checksum-linux-amd64.sha256\n      - name: Tar Archive for Linux Build\n        run: cd badger && tar -zcvf badger-linux-amd64.tar.gz badger-linux-amd64\n      - name: Upload Badger Binary Build Artifacts\n        uses: actions/upload-artifact@v4\n        with:\n          name: badger-linux-amd64-${{ github.run_id }}-${{ github.job }}\n          path: |\n            badger/badger-checksum-linux-amd64.sha256\n            badger/badger-linux-amd64.tar.gz\n\n  badger-build-arm64:\n    runs-on: ubuntu-24.04-arm\n    steps:\n      - uses: actions/checkout@v5\n        with:\n          ref: \"${{ 
github.event.inputs.releasetag }}\"\n      - name: Set up Go\n        uses: actions/setup-go@v6\n        with:\n          go-version-file: go.mod\n      - name: Set Badger Release Version\n        run: |\n          #!/bin/bash\n          GIT_TAG_NAME='${{ github.event.inputs.releasetag }}'\n          if [[ \"$GIT_TAG_NAME\" == \"v\"* ]];\n          then\n            echo \"this is a release tag\"\n          else\n            echo \"this is NOT a release tag\"\n            exit 1\n          fi\n          BADGER_RELEASE_VERSION='${{ github.event.inputs.releasetag }}'\n          echo \"making a new release for \"$BADGER_RELEASE_VERSION\n          echo \"BADGER_RELEASE_VERSION=$BADGER_RELEASE_VERSION\" >> $GITHUB_ENV\n      - name: Fetch dependencies\n        run: sudo apt-get -y install build-essential\n      - name: Build badger linux/arm64\n        run: make badger\n      - name: Generate SHA for Linux Build\n        run:\n          cd badger && sha256sum badger-linux-arm64 | cut -c-64 > badger-checksum-linux-arm64.sha256\n      - name: Tar Archive for Linux Build\n        run: cd badger && tar -zcvf badger-linux-arm64.tar.gz badger-linux-arm64\n      - name: List Artifacts\n        run: ls -al badger/\n      - name: Upload Badger Binary Build Artifacts\n        uses: actions/upload-artifact@v4\n        with:\n          name: badger-linux-arm64-${{ github.run_id }}-${{ github.job }}\n          path: |\n            badger/badger-checksum-linux-arm64.sha256\n            badger/badger-linux-arm64.tar.gz\n"
  },
  {
    "path": ".github/workflows/ci-badger-bank-tests-nightly.yml",
    "content": "name: ci-badger-bank-tests-nightly\n\non:\n  push:\n    paths-ignore:\n      - \"**.md\"\n      - docs/**\n      - images/**\n    branches:\n      - main\n      - release/v*\n  schedule:\n    - cron: 1 3 * * *\n\npermissions:\n  contents: read\n\njobs:\n  badger-bank:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v5\n      - name: Setup Go\n        uses: actions/setup-go@v6\n        with:\n          go-version-file: go.mod\n      - name: Install Dependencies\n        run: make dependency\n      - name: Install jemalloc\n        run: make jemalloc\n      - name: Install Badger\n        run: cd badger && go install --race --tags=jemalloc .\n      - name: Run Badger Bank Test\n        run: |\n          #!/bin/bash -x\n          set -o pipefail\n          # get 16 random bytes from /dev/urandom\n          hexdump -vn16 -e'4/4 \"%08X\" 1 \"\\n\"' /dev/urandom > badgerkey16bytes\n          badger bank test --dir=. --encryption-key \"badgerkey16bytes\" -d=4h 2>&1 | tee badgerbanktest.log | grep -v 'Moved $5'\n          if [ $? -ne 0 ]; then\n            if grep -qi 'data race' badgerbanktest.log; then\n              echo \"Detected data race via grep...\"\n              cat badgerbanktest.log | grep -v 'Moved $5'\n            else\n              echo \"No data race detected via grep. Assuming txn violation...\"\n              tail -1000 badgerbanktest.log\n              badger bank disect --dir=. --decryption-key \"badgerkey16bytes\"\n            fi\n            exit 1\n          fi\n          echo 'Bank test finished with no issues.'\n"
  },
  {
    "path": ".github/workflows/ci-badger-bank-tests.yml",
    "content": "name: ci-badger-bank-tests\n\non:\n  workflow_dispatch: # allows manual trigger from GitHub\n  pull_request:\n    paths-ignore:\n      - \"**.md\"\n      - docs/**\n      - images/**\n    branches:\n      - main\n      - release/v*\n\npermissions:\n  contents: read\n\njobs:\n  badger-bank:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v5\n      - name: Setup Go\n        uses: actions/setup-go@v6\n        with:\n          go-version-file: go.mod\n      - name: Install Dependencies\n        run: make dependency\n      - name: Install jemalloc\n        run: make jemalloc\n      - name: Install Badger\n        run: cd badger && go install --race --tags=jemalloc .\n      - name: Run Badger Bank Test\n        run: |\n          #!/bin/bash\n          mkdir bank && cd bank\n          badger bank test -v --dir=. -d=20m\n"
  },
  {
    "path": ".github/workflows/ci-badger-tests.yml",
    "content": "name: ci-badger-tests\n\non:\n  workflow_dispatch: # allows manual trigger from GitHub\n  pull_request:\n    paths-ignore:\n      - \"**.md\"\n      - docs/**\n      - images/**\n      - contrib/**\n    branches:\n      - main\n      - release/v*\n\npermissions:\n  contents: read\n\njobs:\n  cross-compile:\n    runs-on: ubuntu-latest\n    strategy:\n      matrix:\n        include:\n          - goos: linux\n            goarch: amd64\n          - goos: linux\n            goarch: arm64\n          - goos: darwin\n            goarch: amd64\n          - goos: darwin\n            goarch: arm64\n          - goos: windows\n            goarch: amd64\n          - goos: aix\n            goarch: ppc64\n          - goos: plan9\n            goarch: amd64\n    steps:\n      - uses: actions/checkout@v5\n      - name: Setup Go\n        uses: actions/setup-go@v6\n        with:\n          go-version-file: go.mod\n      - name: Cross-compile for ${{ matrix.goos }}/${{ matrix.goarch }}\n        env:\n          GOOS: ${{ matrix.goos }}\n          GOARCH: ${{ matrix.goarch }}\n        run: go build ./...\n\n  badger-tests:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v5\n      - name: Setup Go\n        uses: actions/setup-go@v6\n        with:\n          go-version-file: go.mod\n      - name: Install Dependencies\n        run: make dependency\n      - name: Run Badger Tests\n        run: make test\n"
  },
  {
    "path": ".github/workflows/ci-dgraph-tests.yml",
    "content": "name: ci-dgraph-tests\n\non:\n  push:\n    paths-ignore:\n      - \"**.md\"\n      - docs/**\n      - images/**\n    branches:\n      - main\n\npermissions:\n  contents: read\n\njobs:\n  dgraph-tests:\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout Dgraph repo\n        uses: actions/checkout@v5\n        with:\n          repository: dgraph-io/dgraph\n          ref: main\n      - name: Set up Go\n        uses: actions/setup-go@v6\n        with:\n          go-version-file: go.mod\n      - name: Install gotestsum\n        run: go install gotest.tools/gotestsum@latest\n      - name: Fetch latest Badger version\n        run: |\n          go get github.com/dgraph-io/badger/v4@main\n      - name: Set up Node\n        uses: actions/setup-node@v5\n        with:\n          node-version: 16 || 22\n      - name: Install protobuf-compiler\n        run: sudo apt update && sudo apt install -y protobuf-compiler\n      - name: Check protobuf\n        run: |\n          cd ./protos\n          go mod tidy\n          make regenerate\n          git diff --exit-code -- .\n      - name: Make Linux Build and Docker Image\n        run: make docker-image\n      - name: Build Test Binary\n        run: |\n          #!/bin/bash\n          # build the test binary\n          cd t; go build .\n      - name: Clean Up Environment\n        run: |\n          #!/bin/bash\n          # clean cache\n          go clean -testcache\n          # clean up docker containers before test execution\n          cd t; ./t -r\n      - name: Run Unit Tests\n        run: |\n          #!/bin/bash\n          # go env settings\n          export GOPATH=~/go\n          # move the binary\n          cp dgraph/dgraph ~/go/bin/dgraph\n          # run the tests\n          cd t; ./t --pkg=edgraph,posting,worker,query\n          # clean up docker containers after test execution\n          ./t -r\n          # sleep\n          sleep 5\n"
  },
  {
    "path": ".github/workflows/trunk.yml",
    "content": "name: Trunk Code Quality\non:\n  pull_request:\n    branches: main\n\npermissions:\n  contents: read\n  actions: write\n  checks: write\n\njobs:\n  trunk-code-quality:\n    name: Trunk Code Quality\n    uses: dgraph-io/.github/.github/workflows/trunk.yml@main\n"
  },
  {
    "path": ".gitignore",
    "content": "# Binaries for programs and plugins\n*.exe\n*.exe~\n*.dll\n*.so\n*.dylib\nbadger/badger-*\n\n# Test binary, build with `go test -c`\n*.test\nbadger-test*/\n\n# Output of the go coverage tool\n*.out\n\n#darwin\n.DS_Store\n\n"
  },
  {
    "path": ".trunk/.gitignore",
    "content": "*out\n*logs\n*actions\n*notifications\n*tools\nplugins\nuser_trunk.yaml\nuser.yaml\ntmp\n"
  },
  {
    "path": ".trunk/configs/.checkov.yaml",
    "content": "skip-check:\n  - CKV_GHA_7\n"
  },
  {
    "path": ".trunk/configs/.markdownlint.json",
    "content": "{\n  \"line-length\": { \"line_length\": 150, \"tables\": false },\n  \"no-inline-html\": false,\n  \"no-bare-urls\": false,\n  \"no-space-in-emphasis\": false,\n  \"no-emphasis-as-heading\": false,\n  \"first-line-heading\": false\n}\n"
  },
  {
    "path": ".trunk/configs/.prettierrc",
    "content": "{\n  \"semi\": false,\n  \"proseWrap\": \"always\",\n  \"printWidth\": 100\n}\n"
  },
  {
    "path": ".trunk/configs/.shellcheckrc",
    "content": "enable=all\nsource-path=SCRIPTDIR\ndisable=SC2154\n\n# If you're having issues with shellcheck following source, disable the errors via:\n# disable=SC1090\n# disable=SC1091\n"
  },
  {
    "path": ".trunk/configs/.yamllint.yaml",
    "content": "rules:\n  quoted-strings:\n    required: only-when-needed\n    extra-allowed: [\"{|}\"]\n  key-duplicates: {}\n  octal-values:\n    forbid-implicit-octal: true\n"
  },
  {
    "path": ".trunk/configs/svgo.config.mjs",
    "content": "export default {\n  plugins: [\n    {\n      name: \"preset-default\",\n      params: {\n        overrides: {\n          removeViewBox: false, // https://github.com/svg/svgo/issues/1128\n          sortAttrs: true,\n          removeOffCanvasPaths: true,\n        },\n      },\n    },\n  ],\n}\n"
  },
  {
    "path": ".trunk/trunk.yaml",
    "content": "# This file controls the behavior of Trunk: https://docs.trunk.io/cli\n# To learn more about the format of this file, see https://docs.trunk.io/reference/trunk-yaml\nversion: 0.1\n\ncli:\n  version: 1.25.0\n\nplugins:\n  sources:\n    - id: trunk\n      ref: v1.7.4\n      uri: https://github.com/trunk-io/plugins\n\nruntimes:\n  enabled:\n    - go@1.25.5\n\nlint:\n  ignore:\n    - linters: [ALL]\n      paths:\n        - pb/*.pb.go\n  enabled:\n    - golangci-lint2@2.4.0\n    - trivy@0.64.1\n    - renovate@41.76.0\n    - actionlint@1.7.7\n    - checkov@3.2.461\n    - git-diff-check\n    - gofmt@1.20.4\n    - golangci-lint@1.64.8\n    - markdownlint@0.45.0\n    - osv-scanner@2.0.3\n    - oxipng@9.1.5\n    - prettier@3.6.2\n    - shellcheck@0.10.0\n    - shfmt@3.6.0\n    - svgo@4.0.0\n    - taplo@0.9.3\n    - trufflehog@3.90.5\n    - yamllint@1.37.1\nactions:\n  enabled:\n    - trunk-announce\n    - trunk-check-pre-push\n    - trunk-fmt-pre-commit\n    - trunk-upgrade-available\n"
  },
  {
    "path": ".vscode/extensions.json",
    "content": "{\n  \"recommendations\": [\"trunk.io\"]\n}\n"
  },
  {
    "path": ".vscode/settings.json",
    "content": "{\n  \"editor.formatOnSave\": true,\n  \"editor.defaultFormatter\": \"trunk.io\",\n  \"editor.trimAutoWhitespace\": true,\n  \"trunk.autoInit\": false\n}\n"
  },
  {
    "path": "CHANGELOG.md",
    "content": "# Changelog\n\nAll notable changes to this project will be documented in this file.\n\nThe format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).\n\n## [4.9.1] - 2026-02-04\n\n**Fixed**\n\n- fix(aix): add aix directory synchronization support (#2115)\n- fix: correct the comment on value size in skl.node (#2250)\n\n**Tests**\n\n- test: add checksum tests for package y (#2246)\n\n**Chores**\n\n- chore(ci): update arm runner label (#2248)\n\n**Full Changelog**: https://github.com/dgraph-io/badger/compare/v4.9.0...v4.9.1\n\n## [4.9.0] - 2025-12-15\n\n**Fixed**\n\n- fix(y): y.SafeCopy shall always return empty slice rather than nil (#2245)\n  > **WARNING** SafeCopy now returns an empty slice rather than nil. For those using our `y` utility\n  > package, this could be a breaking change. This has implications for empty slices stored in\n  > badger, specifically, upon retrieval the value stored with the key will be equal to what was set\n  > (an empty []byte). 
See #2067 for more details.\n- fix: test.sh error (#2225)\n- fix: typo of abandoned (#2222)\n\n**Docs**\n\n- add doc for encryption at rest (#2240)\n- move docs pages in the repo (#2232)\n\n**Chores**\n\n- chore(ci): restrict Dgraph test to core packages only (#2242)\n- chore: update README.md with correct links and badges (#2239)\n- chore: change renovate to maintain backwards compatible go version (#2236)\n- chore: configure renovate to leave go version as declared (#2235)\n- chore(deps): Update actions (major) (#2229)\n- chore(deps): Update actions/checkout action to v5 (#2221)\n- chore(deps): Update go minor and patch (#2218)\n- chore: update the trunk conf file (#2217)\n- chore(deps): Update dependency node to v22 (#2219)\n- chore(deps): Update go minor and patch (#2212)\n\n**CI**\n\n- move to GitHub Actions runners\n\n**Full Changelog**: https://github.com/dgraph-io/badger/compare/v4.8.0...v4.9.0\n\n## [4.8.0] - 2025-07-15\n\n**Features**\n\n- feat(stream): Update stream framework with new alternate keyToList function (#2211)\n\n**Fixed**\n\n- fix: crash loop on missing manifest tables (#2198)\n\n**Chores**\n\n- chore(deps): Update module golang.org/x/sys to v0.34.0 (#2210)\n- chore(deps): Update go minor and patch (#2208)\n- chore(deps): Update go minor and patch (#2204)\n- chore(deps): Update go minor and patch (#2202)\n- chore(deps): Update go minor and patch (#2200)\n- chore(deps): Update module golang.org/x/sys to v0.33.0 (#2195)\n- chore(deps): Update go minor and patch (#2189)\n- Compile with jemalloc v5.3.0 (#2191)\n\n**CI**\n\n- Update trunk.yml\n- move Trunk to action\n\n**Docs**\n\n- docs: add new badge (#2194)\n\n**Full Changelog**: https://github.com/dgraph-io/badger/compare/v4.7.0...v4.8.0\n\n## [4.7.0] - 2025-04-08\n\n**Chores**\n\n- chore(deps): remove dependency on github.com/pkg/errors (#2184)\n- chore(deps): Update go minor and patch (#2187)\n- chore(deps): Update go minor and patch (#2181)\n- chore(deps): Update module golang.org/x/sys to 
v0.31.0 (#2179)\n\n**Fixed**\n\n- fix broken badge (#2186)\n\n**Docs**\n\n- Update README.md\n- doc: add Blink Labs projects to the using Badger list (#2183)\n- doc: add FlowG to \"Projects Using Badger\" section of the README (#2180)\n\n**Full Changelog**: https://github.com/dgraph-io/badger/compare/v4.6.0...v4.7.0\n\n## [4.6.0] - 2025-02-26\n\n**Chores**\n\n- chore(deps): Migrate from OpenCensus to OpenTelemetry (#2169)\n- chore(deps): Update go minor and patch (#2177)\n- chore(deps): Update module github.com/spf13/cobra to v1.9.0 (#2174)\n- chore: add editor config\n- update .gitignore (#2176)\n\n**Fixed**\n\n- fix: remove accidentally uploaded binary `badger-darwin-arm64` (#2175)\n\n**Full Changelog**: https://github.com/dgraph-io/badger/compare/v4.5.2...v4.6.0\n\n## [4.5.2] - 2025-02-14\n\n**Chores**\n\n- chore(deps): Update go minor and patch (#2168)\n- chore(deps): bump minimum Go support to 1.22 (#2171)\n- chore: migrate docs to centralized docs repo (#2166)\n- chore: align repo conventions (#2158)\n- chore(deps): bump the patch group with 2 updates (#2156)\n- chore(deps): bump github.com/google/flatbuffers from 24.12.23+incompatible to 25.1.21+incompatible\n  (#2153)\n- chore(deps): bump golangci/golangci-lint-action from 6.1.1 to 6.2.0 in the actions group (#2154)\n- Update renovate.json\n- Update trunk.yaml\n- enable Trivy\n\n**Fixed**\n\n- update docs link in error message (#2170)\n- Revert \"Update badgerpb4.pb.go\" (#2172)\n\n**Docs**\n\n- Update README.md\n- Added my project that uses Badger database (#2157)\n- Create SECURITY.md\n\n**Full Changelog**: https://github.com/dgraph-io/badger/compare/v4.5.1...v4.5.2\n\n## [4.5.1] - 2025-01-21\n\n- chore(deps): bump google.golang.org/protobuf from 1.36.2 to 1.36.3 in the patch group (#2150)\n- bump github.com/dgraph-io/ristretto/v2 from 2.0.1 to 2.1.0 in the minor group (#2151)\n- feat(info): print total size of listed keys (#2149)\n- chore(deps): bump google.golang.org/protobuf from 1.36.1 to 1.36.2 in 
the patch group (#2146)\n- chore(deps): bump the minor group with 2 updates (#2147)\n- fix(info): print Total BloomFilter Size with totalBloomFilter instead of totalIndex (#2145)\n- chore(deps): bump the minor group with 2 updates (#2141)\n- chore(deps): bump google.golang.org/protobuf from 1.36.0 to 1.36.1 in the patch group (#2140)\n- chore(deps): bump google.golang.org/protobuf from 1.35.2 to 1.36.0 in the minor group (#2139)\n- chore(deps): bump github.com/dgraph-io/ristretto/v2 from 2.0.0 to 2.0.1 in the patch group (#2136)\n- chore(deps): bump golang.org/x/net from 0.31.0 to 0.32.0 in the minor group (#2137)\n- chore(deps): bump the minor group with 2 updates (#2135)\n- docs: Add pagination explanation to docs (#2134)\n- Fix build for GOARCH=wasm with GOOS=js or GOOS=wasip1 (#2048)\n\n**Full Changelog**: https://github.com/dgraph-io/badger/compare/v4.5.0...v4.5.1\n\n## [4.5.0] - 2024-11-29\n\n- fix the cd pipeline by @mangalaman93 in https://github.com/dgraph-io/badger/pull/2127\n- chore(deps): bump the minor group with 2 updates by @dependabot in\n  https://github.com/dgraph-io/badger/pull/2128\n- chore(deps): bump github.com/stretchr/testify from 1.9.0 to 1.10.0 in the minor group by\n  @dependabot in https://github.com/dgraph-io/badger/pull/2130\n- upgrade protobuf library by @shivaji-kharse in https://github.com/dgraph-io/badger/pull/2131\n\n**Full Changelog**: https://github.com/dgraph-io/badger/compare/v4.4.0...v4.5.0\n\n## [4.4.0] - 2024-10-26\n\n- retract v4.3.0 due to #2121 and #2113, upgrade to Go v1.23, use ristretto v2 in\n  https://github.com/dgraph-io/badger/pull/2122\n- Allow stream custom maxsize per batch in https://github.com/dgraph-io/badger/pull/2063\n- chore(deps): bump github.com/klauspost/compress from 1.17.10 to 1.17.11 in the patch group in\n  https://github.com/dgraph-io/badger/pull/2120\n- fix: sentinel errors should not have stack traces in https://github.com/dgraph-io/badger/pull/2042\n- chore(deps): bump the minor group with 2 
updates in https://github.com/dgraph-io/badger/pull/2119\n\n**Full Changelog**: https://github.com/dgraph-io/badger/compare/v4.3.1...v4.4.0\n\n## [4.3.1] - 2024-10-06\n\n- chore: update docs links by @ryanfoxtyler in https://github.com/dgraph-io/badger/pull/2097\n- chore(deps): bump golang.org/x/sys from 0.24.0 to 0.25.0 in the minor group by @dependabot in\n  https://github.com/dgraph-io/badger/pull/2100\n- chore(deps): bump golang.org/x/net from 0.28.0 to 0.29.0 in the minor group by @dependabot in\n  https://github.com/dgraph-io/badger/pull/2106\n- fix: fix reverse iterator broken by seek by @harshil-goel in\n  https://github.com/dgraph-io/badger/pull/2109\n- chore(deps): bump github.com/klauspost/compress from 1.17.9 to 1.17.10 in the patch group by\n  @dependabot in https://github.com/dgraph-io/badger/pull/2114\n- chore(deps): bump github.com/dgraph-io/ristretto from 0.1.2-0.20240116140435-c67e07994f91 to 1.0.0\n  by @dependabot in https://github.com/dgraph-io/badger/pull/2112\n\n**Full Changelog**: https://github.com/dgraph-io/badger/compare/v4.3.0...v4.3.1\n\n## [4.3.0] - 2024-08-29\n\n> **Warning** The tag v4.3.0 has been retracted due to an issue in go.sum. Use v4.3.1 (see #2121 and\n> #2113)\n\n**Fixes**\n\n- chore(changelog): add a missed entry in CHANGELOG for v4.2.0 by @mangalaman93 in\n  https://github.com/dgraph-io/badger/pull/1988\n- update README with project KVS using badger by @tauraamui in\n  https://github.com/dgraph-io/badger/pull/1989\n- fix edge case for watermark when index is zero by @mangalaman93 in\n  https://github.com/dgraph-io/badger/pull/1999\n- upgrade spf13/cobra to version v1.7.0 by @mangalaman93 in\n  https://github.com/dgraph-io/badger/pull/2001\n- chore: update readme by @joshua-goldstein in https://github.com/dgraph-io/badger/pull/2011\n- perf: upgrade compress package test and benchmark. 
by @siddhant2001 in\n  https://github.com/dgraph-io/badger/pull/2009\n- fix(Transactions): Fix resource consumption on empty write transaction by @Zach-Johnson in\n  https://github.com/dgraph-io/badger/pull/2018\n- chore(deps): bump golang.org/x/net from 0.7.0 to 0.17.0 by @dependabot in\n  https://github.com/dgraph-io/badger/pull/2017\n- perf(compactor): optimize allocations: use buffer for priorities by @deff7 in\n  https://github.com/dgraph-io/badger/pull/2006\n- fix(Transaction): discard empty transactions on CommitWith by @Wondertan in\n  https://github.com/dgraph-io/badger/pull/2031\n- fix(levelHandler): use lock for levelHandler sort tables instead of rlock by @xgzlucario in\n  https://github.com/dgraph-io/badger/pull/2034\n- Docs: update README with project LLS using badger by @Boc-chi-no in\n  https://github.com/dgraph-io/badger/pull/2032\n- chore: MaxTableSize has been renamed to BaseTableSize by @mitar in\n  https://github.com/dgraph-io/badger/pull/2038\n- Update CODEOWNERS by @ryanfoxtyler in https://github.com/dgraph-io/badger/pull/2043\n- Chore(): add Stale Action by @ryanfoxtyler in https://github.com/dgraph-io/badger/pull/2070\n- Update ristretto and refactor for use of generics by @paralin in\n  https://github.com/dgraph-io/badger/pull/2047\n- chore: Remove obsolete comment by @mitar in https://github.com/dgraph-io/badger/pull/2039\n- chore(Docs): Update jQuery 3.2.1 to 3.7.1 by @kokizzu in\n  https://github.com/dgraph-io/badger/pull/2023\n- chore(deps): bump the go_modules group with 3 updates by @dependabot in\n  https://github.com/dgraph-io/badger/pull/2074\n- docs(): update docs path by @ryanfoxtyler in https://github.com/dgraph-io/badger/pull/2076\n- perf: fix operation in seek by @harshil-goel in https://github.com/dgraph-io/badger/pull/2077\n- Add lakeFS to README.md by @N-o-Z in https://github.com/dgraph-io/badger/pull/2078\n- chore(): add Dependabot by @ryanfoxtyler in https://github.com/dgraph-io/badger/pull/2080\n- chore(deps): bump 
golangci/golangci-lint-action from 4 to 6 by @dependabot in\n  https://github.com/dgraph-io/badger/pull/2083\n- chore(deps): bump actions/upload-artifact from 3 to 4 by @dependabot in\n  https://github.com/dgraph-io/badger/pull/2081\n- chore(deps): bump github/codeql-action from 2 to 3 by @dependabot in\n  https://github.com/dgraph-io/badger/pull/2082\n- chore(deps): bump the minor group with 7 updates by @dependabot in\n  https://github.com/dgraph-io/badger/pull/2089\n- Action Manager by @madhu72 in https://github.com/dgraph-io/badger/pull/2050\n- chore(deps): bump golang.org/x/sys from 0.23.0 to 0.24.0 in the minor group by @dependabot in\n  https://github.com/dgraph-io/badger/pull/2091\n- chore(deps): bump github.com/golang/protobuf from 1.5.3 to 1.5.4 in the patch group by @dependabot\n  in https://github.com/dgraph-io/badger/pull/2090\n- chore: fix some comments by @dufucun in https://github.com/dgraph-io/badger/pull/2092\n- chore(deps): bump github.com/google/flatbuffers from 1.12.1 to 24.3.25+incompatible by @dependabot\n  in https://github.com/dgraph-io/badger/pull/2084\n\n**CI**\n\n- ci: change cron frequency to fix ghost jobs by @joshua-goldstein in\n  https://github.com/dgraph-io/badger/pull/2010\n- fix(CI): Update to pull_request trigger by @ryanfoxtyler in\n  https://github.com/dgraph-io/badger/pull/2056\n- ci/cd optimization by @ryanfoxtyler in https://github.com/dgraph-io/badger/pull/2051\n- fix(cd): fixed cd pipeline by @harshil-goel in https://github.com/dgraph-io/badger/pull/2093\n- fix(cd): change name by @harshil-goel in https://github.com/dgraph-io/badger/pull/2094\n- fix(cd): added more debug things to cd by @harshil-goel in\n  https://github.com/dgraph-io/badger/pull/2095\n- fix(cd): removing some debug items by @harshil-goel in\n  https://github.com/dgraph-io/badger/pull/2096\n\n**Full Changelog**: https://github.com/dgraph-io/badger/compare/v4.2.0...v4.3.0\n\n## [4.2.0] - 2023-08-03\n\n**Breaking**\n\n- feat(metrics): fix and update metrics 
in badger (#1948)\n- fix(metrics): remove badger version in the metrics name (#1982)\n\n**Fixed**\n\n- fix(db): avoid panic in parallel reads after closing DB (#1987)\n- fix(logging): fix direct access to logger (#1980)\n- fix(sec): bump google.golang.org/grpc from 1.20.1 to 1.53.0 (#1977)\n- fix(sec): update gopkg.in/yaml.v2 package (#1969)\n- fix(test): fix flakiness of TestPersistLFDiscardStats (#1963)\n- fix(stream): setup oracle correctly in stream writer (#1968) (#1904)\n- fix(stream): improve incremental stream writer (#1901)\n- fix(test): improve the params in BenchmarkDbGrowth (#1967)\n- fix(sync): sync active memtable and value log on Db.Sync (#1847) (#1953)\n- fix(test): handle draining of closed channel, speed up test. (#1957)\n- fix(test): fix table checksum test. Test on uncompressed. (#1952)\n- fix(level): change split key range right key to use ts=0 (#1932)\n- fix(test): the new test case PagebufferReader5 introduced an error. (#1936)\n- fix(test): add missing unlock in TestPersistLFDiscardStats (#1924)\n- fix(PageBufferReader): should conform to io.Reader interface (#1935)\n- fix(publisher): publish updates after persistence in WAL (#1917)\n\n**CI**\n\n- chore(ci): split off coverage workflow (#1944)\n- chore(ci): adding trivy scanning workflow (#1925)\n\n## [4.1.0] - 2023-03-30\n\nThis release adds support for incremental stream writer. We also do some cleanup in the docs and\nresolve some CI issues for community PR's. 
We resolve high and medium CVE's and fix\n[#1833](https://github.com/dgraph-io/badger/issues/1833).\n\n**Features**\n\n- feat(stream): add support for incremental stream writer (#1722) (#1874)\n\n**Fixes**\n\n- chore: upgrade xxhash from v1.1.0 to v2.1.2 (#1910) (fixes\n  [#1833](https://github.com/dgraph-io/badger/issues/1833))\n\n**Security**\n\n- chore(deps): bump golang.org/x/net from 0.0.0-20201021035429-f5854403a974 to 0.7.0 (#1885)\n\n**CVEs**\n\n- [CVE-2021-31525](https://github.com/dgraph-io/badger/security/dependabot/7)\n- [CVE-2022-41723](https://github.com/dgraph-io/badger/security/dependabot/4)\n- [CVE-2022-27664](https://github.com/dgraph-io/badger/security/dependabot/5)\n- [CVE-2021-33194](https://github.com/dgraph-io/badger/security/dependabot/9)\n- [CVE-2022-41723](https://github.com/dgraph-io/badger/security/dependabot/13)\n- [CVE-2021-33194](https://github.com/dgraph-io/badger/security/dependabot/16)\n- [CVE-2021-38561](https://github.com/dgraph-io/badger/security/dependabot/8)\n\n**Chores**\n\n- fix(docs): update README (#1915)\n- cleanup sstable file after tests (#1912)\n- chore(ci): add dgraph regression tests (#1908)\n- docs: fix the default value in docs (#1909)\n- chore: update URL for unsupported manifest version error (#1905)\n- docs(README): add raft-badger to projects using badger (#1902)\n- sync the docs with README with projects using badger (#1903)\n- fix: update code comments for WithNumCompactors (#1900)\n- docs: add loggie to projects using badger (#1882)\n- chore(memtable): refactor code for memtable flush (#1866)\n- resolve coveralls issue for community PR's (#1892, #1894, #1896)\n\n## [4.0.1] - 2023-02-28\n\nWe issue a follow up release in order to resolve a bug in subscriber. 
We also generate updated\nprotobufs for Badger v4.\n\n**Fixed**\n\n- fix(pb): fix generated protos #1888\n- fix(publisher): initialize the atomic variable #1889\n\n**Chores**\n\n- chore(cd): tag based deployments #1887\n- chore(ci): fail fast when testing #1890\n\n## [4.0.0] - 2023-02-27\n\n> **Warning** The tag v4.0.0 has been retracted due to a bug in publisher. Use v4.0.1 (see #1889)\n\nThis release fixes a bug in the maxHeaderSize parameter that could lead to panics. We introduce an\nexternal magic number to keep track of external dependencies. We bump up the minimum required Go\nversion to 1.19. No changes were made to the format of data on disk. This is a major release because\nwe are making a switch to SemVer in order to make it easier for the community to understand when\nbreaking API and data format changes are made.\n\n**Fixed**\n\n- fix: update maxHeaderSize #1877\n- feat(externalMagic): Introduce external magic number (#1745) #1852\n- fix(bench): bring in benchmark fixes from main #1863\n\n**Chores**\n\n- upgrade go to 1.19 #1868\n- enable linters (gosimple, govet, lll, unused, staticcheck, errcheck, ineffassign, gofmt) #1871\n  #1870 #1876\n- remove dependency on io/ioutil #1879\n- various doc and comment fixes #1857\n- moving from CalVer to SemVer\n\n## [3.2103.5] - 2022-12-15\n\nWe release Badger CLI tool binaries for amd64 and now arm64. This release does not involve any core\ncode changes to Badger. We add a CD job for building Badger for arm64.\n\n## [3.2103.4] - 2022-11-04\n\n**Fixed**\n\n- fix(manifest): fix manifest corruption due to race condition in concurrent compactions (#1756)\n\n**Chores**\n\n- We bring the release branch to parity with main by updating the CI/CD jobs, Readme, Codeowners, PR\n  and issue templates, etc.\n\n## [3.2103.3] - 2022-10-14\n\n**Remarks**\n\n- This is a minor patch release that fixes arm64 related issues. 
The issues in the `z` package in\n  Ristretto were resolved in Ristretto v0.1.1.\n\n**Fixed**\n\n- fix(arm64): bump ristretto v0.1.0 --> v0.1.1 (#1806)\n\n## [3.2103.2] - 2021-10-07\n\n**Fixed**\n\n- fix(compact): close vlog after the compaction at L0 has completed (#1752)\n- fix(builder): put the upper limit on reallocation (#1748)\n- deps: Bump github.com/google/flatbuffers to v1.12.1 (#1746)\n- fix(levels): Avoid a deadlock when acquiring read locks in levels (#1744)\n- fix(pubsub): avoid deadlock in publisher and subscriber (#1749) (#1751)\n\n## [3.2103.1] - 2021-07-08\n\n**Fixed**\n\n- fix(compaction): copy over the file ID when building tables #1713\n- fix: Fix conflict detection for managed DB (#1716)\n- fix(pendingWrites): don't skip the pending entries with version=0 (#1721)\n\n**Features**\n\n- feat(zstd): replace datadog's zstd with Klauspost's zstd (#1709)\n\n## [3.2103.0] - 2021-06-02\n\n**Breaking**\n\n- Subscribe: Add option to subscribe with holes in prefixes. (#1658)\n\n**Fixed**\n\n- fix(compaction): Remove compaction backoff mechanism (#1686)\n- Add a name to mutexes to make them unexported (#1678)\n- fix(merge-operator): don't read the deleted keys (#1675)\n- fix(discard): close the discard stats file on db close (#1672)\n- fix(iterator): fix iterator when data does not exist in read only mode (#1670)\n- fix(badger): Do not reuse variable across badger commands (#1624)\n- fix(dropPrefix): check properly if the key is present in a table (#1623)\n\n**Performance**\n\n- Opt(Stream): Optimize how we deduce key ranges for iteration (#1687)\n- Increase value threshold from 1 KB to 1 MB (#1664)\n- opt(DropPrefix): check if there exist some data to drop before dropping prefixes (#1621)\n\n**Features**\n\n- feat(options): allow special handling and checking when creating options from superflag (#1688)\n- overwrite default Options from SuperFlag string (#1663)\n- Support SinceTs in iterators (#1653)\n- feat(info): Add a flag to parse and print DISCARD 
file (#1662)\n- feat(vlog): making vlog threshold dynamic 6ce3b7c (#1635)\n- feat(options): add NumGoroutines option for default Stream.numGo (#1656)\n- feat(Trie): Working prefix match with holes (#1654)\n- feat: add functionality to ban a prefix (#1638)\n- feat(compaction): Support Lmax to Lmax compaction (#1615)\n\n**New APIs**\n\n- Badger.DB\n  - BanNamespace\n  - BannedNamespaces\n  - Ranges\n- Badger.Options\n  - FromSuperFlag\n  - WithNumGoRoutines\n  - WithNamespaceOffset\n  - WithVLogPercentile\n- Badger.Trie\n  - AddMatch\n  - DeleteMatch\n- Badger.Table\n  - StaleDataSize\n- Badger.Table.Builder\n  - AddStaleKey\n- Badger.InitDiscardStats\n\n**Removed APIs**\n\n- Badger.DB\n  - KeySplits\n- Badger.Options\n  - SkipVlog\n\n### Changed APIs\n\n- Badger.DB\n  - Subscribe\n- Badger.Options\n  - WithValueThreshold\n\n## [3.2011.1] - 2021-01-22\n\n**Fixed**\n\n- Fix(compaction): Set base level correctly after stream (#1651)\n- Fix: update ristretto and use filepath (#1652)\n- Fix(badger): Do not reuse variable across badger commands (#1650)\n- Fix(build): fix 32-bit build (#1646)\n- Fix(table): always sync SST to disk (#1645)\n\n## [3.2011.0] - 2021-01-15\n\nThis release is not backward compatible with Badger v2.x.x\n\n**Breaking**:\n\n- opt(compactions): Improve compaction performance (#1574)\n- Change how Badger handles WAL (#1555)\n- feat(index): Use flatbuffers instead of protobuf (#1546)\n\n**Fixed**:\n\n- Fix(GC): Set bits correctly for moved keys (#1619)\n- Fix(tableBuilding): reduce scope of valuePointer (#1617)\n- Fix(compaction): fix table size estimation on compaction (#1613)\n- Fix(OOM): Reuse pb.KVs in Stream (#1609)\n- Fix race condition in L0StallMs variable (#1605)\n- Fix(stream): Stop produceKVs on error (#1604)\n- Fix(skiplist): Remove z.Buffer from skiplist (#1600)\n- Fix(readonly): fix the file opening mode (#1592)\n- Fix: Disable CompactL0OnClose by default (#1586)\n- Fix(compaction): Don't drop data when split overlaps with top tables 
(#1587)\n- Fix(subcompaction): Close builder before throttle.Done (#1582)\n- Fix(table): Add onDisk size (#1569)\n- Fix(Stream): Only send done markers if told to do so\n- Fix(value log GC): Fix a bug which caused value log files to not be GCed.\n- Fix segmentation fault when cache sizes are small. (#1552)\n- Fix(builder): Too many small tables when compression is enabled (#1549)\n- Fix integer overflow error when building for 386 (#1541)\n- Fix(writeBatch): Avoid deadlock in commit callback (#1529)\n- Fix(db): Handle nil logger (#1534)\n- Fix(maxVersion): Use choosekey instead of KeyToList (#1532)\n- Fix(Backup/Restore): Keep all versions (#1462)\n- Fix(build): Fix nocgo builds. (#1493)\n- Fix(cleanup): Avoid truncating in value.Open on error (#1465)\n- Fix(compaction): Don't use cache for table compaction (#1467)\n- Fix(compaction): Use separate compactors for L0, L1 (#1466)\n- Fix(options): Do not implicitly enable cache (#1458)\n- Fix(cleanup): Do not close cache before compaction (#1464)\n- Fix(replay): Update head for LSM entires also (#1456)\n- fix(levels): Cleanup builder resources on building an empty table (#1414)\n\n**Performance**\n\n- perf(GC): Remove move keys (#1539)\n- Keep the cheaper parts of the index within table struct. (#1608)\n- Opt(stream): Use z.Buffer to stream data (#1606)\n- opt(builder): Use z.Allocator for building tables (#1576)\n- opt(memory): Use z.Calloc for allocating KVList (#1563)\n- opt: Small memory usage optimizations (#1562)\n- KeySplits checks tables and memtables when number of splits is small. (#1544)\n- perf: Reduce memory usage by better struct packing (#1528)\n- perf(tableIterator): Don't do next on NewIterator (#1512)\n- Improvements: Manual Memory allocation via Calloc (#1459)\n- Various bug fixes: Break up list and run DropAll func (#1439)\n- Add a limit to the size of the batches sent over a stream. 
(#1412)\n- Commit does not panic after Finish, instead returns an error (#1396)\n- levels: Compaction incorrectly drops some delete markers (#1422)\n- Remove vlog file if bootstrap, syncDir or mmap fails (#1434)\n\n**Features**:\n\n- Use opencensus for tracing (#1566)\n- Export functions from Key Registry (#1561)\n- Allow sizes of block and index caches to be updated. (#1551)\n- Add metric for number of tables being compacted (#1554)\n- feat(info): Show index and bloom filter size (#1543)\n- feat(db): Add db.MaxVersion API (#1526)\n- Expose DB options in Badger. (#1521)\n- Feature: Add a Calloc based Buffer (#1471)\n- Add command to stream contents of DB into another DB. (#1463)\n- Expose NumAlloc metrics via expvar (#1470)\n- Support fully disabling the bloom filter (#1319)\n- Add --enc-key flag in badger info tool (#1441)\n\n**New APIs**\n\n- Badger.DB\n  - CacheMaxCost (#1551)\n  - Levels (#1574)\n  - LevelsToString (#1574)\n  - Opts (#1521)\n- Badger.Options\n  - WithBaseLevelSize (#1574)\n  - WithBaseTableSize (#1574)\n  - WithMemTableSize (#1574)\n- Badger.KeyRegistry\n  - DataKey (#1561)\n  - LatestDataKey (#1561)\n\n**Removed APIs**\n\n- Badger.Options\n  - WithKeepL0InMemory (#1555)\n  - WithLevelOneSize (#1574)\n  - WithLoadBloomsOnOpen (#1555)\n  - WithLogRotatesToFlush (#1574)\n  - WithMaxTableSize (#1574)\n  - WithTableLoadingMode (#1555)\n  - WithTruncate (#1555)\n  - WithValueLogLoadingMode (#1555)\n\n## [2.2007.4] - 2021-08-25\n\n**Fixed**\n\n- Fix build on Plan 9 (#1451) (#1508) (#1738)\n\n**Features**\n\n- feat(zstd): backport replacement of DataDog's zstd with Klauspost's zstd (#1736)\n\n## [2.2007.3] - 2021-07-21\n\n**Fixed**\n\n- fix(maxVersion): Use choosekey instead of KeyToList (#1532) #1533\n- fix(flatten): Add --num_versions flag (#1518) #1520\n- fix(build): Fix integer overflow on 32-bit architectures #1558\n- fix(pb): avoid protobuf warning due to common filename (#1519)\n\n**Features**\n\n- Add command to stream contents of DB into 
another DB. (#1486)\n\n**New APIs**\n\n- DB.StreamDB\n- DB.MaxVersion\n\n## [2.2007.2] - 2020-08-31\n\n**Fixed**\n\n- Compaction: Use separate compactors for L0, L1 (#1466)\n- Rework Block and Index cache (#1473)\n- Add IsClosed method (#1478)\n- Cleanup: Avoid truncating in vlog.Open on error (#1465)\n- Cleanup: Do not close cache before compactions (#1464)\n\n**New APIs**\n\n- Badger.DB\n  - BlockCacheMetrics (#1473)\n  - IndexCacheMetrics (#1473)\n- Badger.Option\n  - WithBlockCacheSize (#1473)\n  - WithIndexCacheSize (#1473)\n\n**Removed APIs** [Breaking Changes]\n\n- Badger.DB\n  - DataCacheMetrics (#1473)\n  - BfCacheMetrics (#1473)\n- Badger.Option\n  - WithMaxCacheSize (#1473)\n  - WithMaxBfCacheSize (#1473)\n  - WithKeepBlockIndicesInCache (#1473)\n  - WithKeepBlocksInCache (#1473)\n\n## [2.2007.1] - 2020-08-19\n\n**Fixed**\n\n- Remove vlog file if bootstrap, syncDir or mmap fails (#1434)\n- levels: Compaction incorrectly drops some delete markers (#1422)\n- Replay: Update head for LSM entires also (#1456)\n\n## [2.2007.0] - 2020-08-10\n\n**Fixed**\n\n- Add a limit to the size of the batches sent over a stream. 
(#1412)\n- Fix Sequence generates duplicate values (#1281)\n- Fix race condition in DoesNotHave (#1287)\n- Fail fast if cgo is disabled and compression is ZSTD (#1284)\n- Proto: make badger/v2 compatible with v1 (#1293)\n- Proto: Rename dgraph.badger.v2.pb to badgerpb2 (#1314)\n- Handle duplicates in ManagedWriteBatch (#1315)\n- Ensure `bitValuePointer` flag is cleared for LSM entry values written to LSM (#1313)\n- DropPrefix: Return error on blocked writes (#1329)\n- Confirm `badgerMove` entry required before rewrite (#1302)\n- Drop move keys when its key prefix is dropped (#1331)\n- Iterator: Always add key to txn.reads (#1328)\n- Restore: Account for value size as well (#1358)\n- Compaction: Expired keys and delete markers are never purged (#1354)\n- GC: Consider size of value while rewriting (#1357)\n- Force KeepL0InMemory to be true when InMemory is true (#1375)\n- Rework DB.DropPrefix (#1381)\n- Update head while replaying value log (#1372)\n- Avoid panic on multiple closer.Signal calls (#1401)\n- Return error if the vlog writes exceeds more than 4GB (#1400)\n\n**Performance**\n\n- Clean up transaction oracle as we go (#1275)\n- Use cache for storing block offsets (#1336)\n\n**Features**\n\n- Support disabling conflict detection (#1344)\n- Add leveled logging (#1249)\n- Support entry version in Write batch (#1310)\n- Add Write method to batch write (#1321)\n- Support multiple iterators in read-write transactions (#1286)\n\n**New APIs**\n\n- Badger.DB\n  - NewManagedWriteBatch (#1310)\n  - DropPrefix (#1381)\n- Badger.Option\n  - WithDetectConflicts (#1344)\n  - WithKeepBlockIndicesInCache (#1336)\n  - WithKeepBlocksInCache (#1336)\n- Badger.WriteBatch\n  - DeleteAt (#1310)\n  - SetEntryAt (#1310)\n  - Write (#1321)\n\n### Changes to Default Options\n\n- DefaultOptions: Set KeepL0InMemory to false (#1345)\n- Increase default valueThreshold from 32B to 1KB (#1346)\n\n### Deprecated\n\n- Badger.Option\n  - WithEventLogging (#1203)\n\n### Reverts\n\nThis sections 
lists the changes which were reverted because of non-reproducible crashes.\n\n- Compress/Encrypt Blocks in the background (#1227)\n\n## [2.0.3] - 2020-03-24\n\n**Fixed**\n\n- Add support for watching nil prefix in subscribe API (#1246)\n\n**Performance**\n\n- Compress/Encrypt Blocks in the background (#1227)\n- Disable cache by default (#1257)\n\n**Features**\n\n- Add BypassDirLock option (#1243)\n- Add separate cache for bloomfilters (#1260)\n\n**New APIs**\n\n- badger.DB\n  - BfCacheMetrics (#1260)\n  - DataCacheMetrics (#1260)\n- badger.Options\n  - WithBypassLockGuard (#1243)\n  - WithLoadBloomsOnOpen (#1260)\n  - WithMaxBfCacheSize (#1260)\n\n## [2.0.2] - 2020-03-02\n\n**Fixed**\n\n- Cast sz to uint32 to fix compilation on 32 bit. (#1175)\n- Fix checkOverlap in compaction. (#1166)\n- Avoid sync in inmemory mode. (#1190)\n- Support disabling the cache completely. (#1185)\n- Add support for caching bloomfilters. (#1204)\n- Fix int overflow for 32bit. (#1216)\n- Remove the 'this entry should've caught' log from value.go. (#1170)\n- Rework concurrency semantics of valueLog.maxFid. (#1187)\n\n**Performance**\n\n- Use fastRand instead of locked-rand in skiplist. (#1173)\n- Improve write stalling on level 0 and 1. (#1186)\n- Disable compression and set ZSTD Compression Level to 1. (#1191)\n\n## [2.0.1] - 2020-01-02\n\n**New APIs**\n\n- badger.Options\n  - WithInMemory (f5b6321)\n  - WithZSTDCompressionLevel (3eb4e72)\n\n- Badger.TableInfo\n  - EstimatedSz (f46f8ea)\n\n**Features**\n\n- Introduce in-memory mode in badger. (#1113)\n\n**Fixed**\n\n- Limit manifest's change set size. (#1119)\n- Cast idx to uint32 to fix compilation on i386. (#1118)\n- Fix request increment ref bug. (#1121)\n- Fix windows dataloss issue. (#1134)\n- Fix VerifyValueChecksum checks. (#1138)\n- Fix encryption in stream writer. (#1146)\n- Fix segmentation fault in vlog.Read. (header.Decode) (#1150)\n- Fix merge iterator duplicates issue. 
(#1157)\n\n**Performance**\n\n- Set level 15 as default compression level in Zstd. (#1111)\n- Optimize createTable in stream_writer.go. (#1132)\n\n## [2.0.0] - 2019-11-12\n\n**New APIs**\n\n- badger.DB\n  - NewWriteBatchAt (7f43769)\n  - CacheMetrics (b9056f1)\n\n- badger.Options\n  - WithMaxCacheSize (b9056f1)\n  - WithEventLogging (75c6a44)\n  - WithBlockSize (1439463)\n  - WithBloomFalsePositive (1439463)\n  - WithKeepL0InMemory (ee70ff2)\n  - WithVerifyValueChecksum (ee70ff2)\n  - WithCompression (5f3b061)\n  - WithEncryptionKey (a425b0e)\n  - WithEncryptionKeyRotationDuration (a425b0e)\n  - WithChecksumVerificationMode (7b4083d)\n\n**Features**\n\n- Data cache to speed up lookups and iterations. (#1066)\n- Data compression. (#1013)\n- Data encryption-at-rest. (#1042)\n\n**Fixed**\n\n- Fix deadlock when flushing discard stats. (#976)\n- Set move key's expiresAt for keys with TTL. (#1006)\n- Fix unsafe usage in Decode. (#1097)\n- Fix race condition on db.orc.nextTxnTs. (#1101)\n- Fix level 0 GC dataloss bug. (#1090)\n- Fix deadlock in discard stats. (#1070)\n- Support checksum verification for values read from vlog. (#1052)\n- Store entire L0 in memory. (#963)\n- Fix table.Smallest/Biggest and iterator Prefix bug. (#997)\n- Use standard proto functions for Marshal/Unmarshal and Size. (#994)\n- Fix boundaries on GC batch size. (#987)\n- VlogSize to store correct directory name to expvar.Map. (#956)\n- Fix transaction too big issue in restore. (#957)\n- Fix race condition in updateDiscardStats. (#973)\n- Cast results of len to uint32 to fix compilation in i386 arch. (#961)\n- Making the stream writer APIs goroutine-safe. (#959)\n- Fix prefix bug in key iterator and allow all versions. (#950)\n- Drop discard stats if we can't unmarshal it. (#936)\n- Fix race condition in flushDiscardStats function. (#921)\n- Ensure rewrite in vlog is within transactional limits. (#911)\n- Fix discard stats moved by GC bug. (#929)\n- Fix busy-wait loop in Watermark. 
(#920)\n\n**Performance**\n\n- Introduce fast merge iterator. (#1080)\n- Binary search based table picker. (#983)\n- Flush vlog buffer if it grows beyond threshold. (#1067)\n- Introduce StreamDone in Stream Writer. (#1061)\n- Performance Improvements to block iterator. (#977)\n- Prevent unnecessary safecopy in iterator parseKV. (#971)\n- Use pointers instead of binary encoding. (#965)\n- Reuse block iterator inside table iterator. (#972)\n- [breaking/format] Remove vlen from entry header. (#945)\n- Replace FarmHash with AESHash for Oracle conflicts. (#952)\n- [breaking/format] Optimize Bloom filters. (#940)\n- [breaking/format] Use varint for header encoding (without header length). (#935)\n- Change file picking strategy in compaction. (#894)\n- [breaking/format] Block level changes. (#880)\n- [breaking/format] Add key-offset index to the end of SST table. (#881)\n\n## [1.6.0] - 2019-07-01\n\nThis is a release including almost 200 commits, so expect many changes - some of them not backward\ncompatible.\n\nRegarding backward compatibility in Badger versions, you might be interested on reading\n[VERSIONING.md](VERSIONING.md).\n\n_Note_: The hashes in parentheses correspond to the commits that impacted the given feature.\n\n**New APIs**\n\n- badger.DB\n  - DropPrefix (291295e)\n  - Flatten (7e41bba)\n  - KeySplits (4751ef1)\n  - MaxBatchCount (b65e2a3)\n  - MaxBatchSize (b65e2a3)\n  - PrintKeyValueHistogram (fd59907)\n  - Subscribe (26128a7)\n  - Sync (851e462)\n\n- badger.DefaultOptions() and badger.LSMOnlyOptions() (91ce687)\n  - badger.Options.WithX methods\n\n- badger.Entry (e9447c9)\n  - NewEntry\n  - WithMeta\n  - WithDiscard\n  - WithTTL\n\n- badger.Item\n  - KeySize (fd59907)\n  - ValueSize (5242a99)\n\n- badger.IteratorOptions\n  - PickTable (7d46029, 49a49e3)\n  - Prefix (7d46029)\n\n- badger.Logger (fbb2778)\n\n- badger.Options\n  - CompactL0OnClose (7e41bba)\n  - Logger (3f66663)\n  - LogRotatesToFlush (2237832)\n\n- badger.Stream (14cbd89, 3258067)\n- 
badger.StreamWriter (7116e16)\n- badger.TableInfo.KeyCount (fd59907)\n- badger.TableManifest (2017987)\n- badger.Tx.NewKeyIterator (49a49e3)\n- badger.WriteBatch (6daccf9, 7e78e80)\n\n**Modified APIs**\n\n**Breaking**\n\n- badger.DefaultOptions and badger.LSMOnlyOptions are now functions rather than variables (91ce687)\n- badger.Item.Value now receives a function that returns an error (439fd46)\n- badger.Txn.Commit doesn't receive any params now (6daccf9)\n- badger.DB.Tables now receives a boolean (76b5341)\n\n**Features**\n\n- badger.LSMOptions changed values (799c33f)\n- badger.DB.NewIterator now allows multiple iterators per RO txn (41d9656)\n- badger.Options.TableLoadingMode's new default is options.MemoryMap (6b97bac)\n\n**Removed APIs**\n\n- badger.ManagedDB (d22c0e8)\n- badger.Options.DoNotCompact (7e41bba)\n- badger.Txn.SetWithX (e9447c9)\n\n**Tools**\n\n- badger bank disect (13db058)\n- badger bank test (13db058) --mmap (03870e3)\n- badger fill (7e41bba)\n- badger flatten (7e41bba)\n- badger info --histogram (fd59907) --history --lookup --show-keys --show-meta --with-prefix\n  (09e9b63) --show-internal (fb2eed9)\n- badger benchmark read (239041e)\n- badger benchmark write (6d3b67d)\n\n## [1.5.5] - 2019-06-20\n\n- Introduce support for Go Modules\n\n## [1.5.3] - 2018-07-11\n\nBug Fixes:\n\n- Fix a panic caused due to item.vptr not copying over vs.Value, when looking for a move key.\n\n## [1.5.2] - 2018-06-19\n\nBug Fixes:\n\n- Fix the way move key gets generated.\n- If a transaction has unclosed, or multiple iterators running simultaneously, throw a panic. Every\n  iterator must be properly closed. At any point in time, only one iterator per transaction can be\n  running. This is to avoid bugs in a transaction data structure which is thread unsafe.\n\n- _Warning: This change might cause panics in user code. 
Fix is to properly close your iterators,\n  and only have one running at a time per transaction._\n\n## [1.5.1] - 2018-06-04\n\nBug Fixes:\n\n- Fix for infinite yieldItemValue recursion. #503\n- Fix recursive addition of `badgerMove` prefix.\n  https://github.com/dgraph-io/badger/commit/2e3a32f0ccac3066fb4206b28deb39c210c5266f\n- Use file size based window size for sampling, instead of fixing it to 10MB. #501\n\nCleanup:\n\n- Clarify comments and documentation.\n- Move badger tool one directory level up.\n\n## [1.5.0] - 2018-05-08\n\n- Introduce `NumVersionsToKeep` option. This option is used to discard many versions of the same\n  key, which saves space.\n- Add a new `SetWithDiscard` method, which would indicate that all the older versions of the key are\n  now invalid. Those versions would be discarded during compactions.\n- Value log GC moves are now bound to another keyspace to ensure latest versions of data are always\n  at the top in LSM tree.\n- Introduce `ValueLogMaxEntries` to restrict the number of key-value pairs per value log file. 
This\n  helps bound the time it takes to garbage collect one file.\n\n## [1.4.0] - 2018-05-04\n\n- Make mmap-ing of value log optional.\n- Run GC multiple times, based on recorded discard statistics.\n- Add MergeOperator.\n- Force compact L0 on close (#439).\n- Add truncate option to warn about data loss (#452).\n- Discard key versions during compaction (#464).\n- Introduce new `LSMOnlyOptions`, to make Badger act like a typical LSM based DB.\n\nBug fix:\n\n- (Temporary) Check max version across all tables in Get (removed in next release).\n- Update commit and read ts while loading from backup.\n- Ensure all transaction entries are part of the same value log file.\n- On commit, run unlock callbacks before doing writes (#413).\n- Wait for goroutines to finish before closing iterators (#421).\n\n## [1.3.0] - 2017-12-12\n\n- Add `DB.NextSequence()` method to generate monotonically increasing integer sequences.\n- Add `DB.Size()` method to return the size of LSM and value log files.\n- Tweaked mmap code to make Windows 32-bit builds work.\n- Tweaked build tags on some files to make iOS builds work.\n- Fix `DB.PurgeOlderVersions()` to not violate some constraints.\n\n## [1.2.0] - 2017-11-30\n\n- Expose a `Txn.SetEntry()` method to allow setting the key-value pair and all the metadata at the\n  same time.\n\n## [1.1.1] - 2017-11-28\n\n- Fix bug where txn.Get was returning a key deleted in the same transaction.\n- Fix race condition while decrementing reference in oracle.\n- Update doneCommit in the callback for CommitAsync.\n- Iterator sees writes of the current txn.\n\n## [1.1.0] - 2017-11-13\n\n- Create Badger directory if it does not exist when `badger.Open` is called.\n- Added `Item.ValueCopy()` to avoid deadlocks in long-running iterations\n- Fixed 64-bit alignment issues to make Badger run on Arm v7\n\n## [1.0.1] - 2017-11-06\n\n- Fix a uint16 overflow when resizing key slice\n\n[4.9.0]: https://github.com/dgraph-io/badger/compare/v4.8.0...v4.9.0\n[4.8.0]: 
https://github.com/dgraph-io/badger/compare/v4.7.0...v4.8.0\n[4.7.0]: https://github.com/dgraph-io/badger/compare/v4.6.0...v4.7.0\n[4.6.0]: https://github.com/dgraph-io/badger/compare/v4.5.2...v4.6.0\n[4.5.2]: https://github.com/dgraph-io/badger/compare/v4.5.1...v4.5.2\n[4.5.1]: https://github.com/dgraph-io/badger/compare/v4.5.0...v4.5.1\n[4.5.0]: https://github.com/dgraph-io/badger/compare/v4.4.0...v4.5.0\n[4.4.0]: https://github.com/dgraph-io/badger/compare/v4.3.1...v4.4.0\n[4.3.1]: https://github.com/dgraph-io/badger/compare/v4.3.0...v4.3.1\n[4.3.0]: https://github.com/dgraph-io/badger/compare/v4.2.0...v4.3.0\n[4.2.0]: https://github.com/dgraph-io/badger/compare/v4.1.0...v4.2.0\n[4.1.0]: https://github.com/dgraph-io/badger/compare/v4.0.1...v4.1.0\n[4.0.1]: https://github.com/dgraph-io/badger/compare/v4.0.0...v4.0.1\n[4.0.0]: https://github.com/dgraph-io/badger/compare/v3.2103.5...v4.0.0\n[3.2103.5]: https://github.com/dgraph-io/badger/compare/v3.2103.4...v3.2103.5\n[3.2103.4]: https://github.com/dgraph-io/badger/compare/v3.2103.3...v3.2103.4\n[3.2103.3]: https://github.com/dgraph-io/badger/compare/v3.2103.2...v3.2103.3\n[3.2103.2]: https://github.com/dgraph-io/badger/compare/v3.2103.1...v3.2103.2\n[3.2103.1]: https://github.com/dgraph-io/badger/compare/v3.2103.0...v3.2103.1\n[3.2103.0]: https://github.com/dgraph-io/badger/compare/v3.2011.1...v3.2103.0\n[3.2011.1]: https://github.com/dgraph-io/badger/compare/v3.2011.0...v3.2011.1\n[3.2011.0]: https://github.com/dgraph-io/badger/compare/v2.2007.4...v3.2011.0\n[2.2007.4]: https://github.com/dgraph-io/badger/compare/v2.2007.3...v2.2007.4\n[2.2007.3]: https://github.com/dgraph-io/badger/compare/v2.2007.2...v2.2007.3\n[2.2007.2]: https://github.com/dgraph-io/badger/compare/v2.2007.1...v2.2007.2\n[2.2007.1]: https://github.com/dgraph-io/badger/compare/v2.2007.0...v2.2007.1\n[2.2007.0]: https://github.com/dgraph-io/badger/compare/v2.0.3...v2.2007.0\n[2.0.3]: 
https://github.com/dgraph-io/badger/compare/v2.0.2...v2.0.3\n[2.0.2]: https://github.com/dgraph-io/badger/compare/v2.0.1...v2.0.2\n[2.0.1]: https://github.com/dgraph-io/badger/compare/v2.0.0...v2.0.1\n[2.0.0]: https://github.com/dgraph-io/badger/compare/v1.6.0...v2.0.0\n[1.6.0]: https://github.com/dgraph-io/badger/compare/v1.5.5...v1.6.0\n[1.5.5]: https://github.com/dgraph-io/badger/compare/v1.5.3...v1.5.5\n[1.5.3]: https://github.com/dgraph-io/badger/compare/v1.5.2...v1.5.3\n[1.5.2]: https://github.com/dgraph-io/badger/compare/v1.5.1...v1.5.2\n[1.5.1]: https://github.com/dgraph-io/badger/compare/v1.5.0...v1.5.1\n[1.5.0]: https://github.com/dgraph-io/badger/compare/v1.4.0...v1.5.0\n[1.4.0]: https://github.com/dgraph-io/badger/compare/v1.3.0...v1.4.0\n[1.3.0]: https://github.com/dgraph-io/badger/compare/v1.2.0...v1.3.0\n[1.2.0]: https://github.com/dgraph-io/badger/compare/v1.1.1...v1.2.0\n[1.1.1]: https://github.com/dgraph-io/badger/compare/v1.1.0...v1.1.1\n[1.1.0]: https://github.com/dgraph-io/badger/compare/v1.0.1...v1.1.0\n[1.0.1]: https://github.com/dgraph-io/badger/compare/v1.0.0...v1.0.1\n"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "content": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make participation in our community a\nharassment-free experience for everyone, regardless of age, body size, visible or invisible\ndisability, ethnicity, sex characteristics, gender identity and expression, level of experience,\neducation, socio-economic status, nationality, personal appearance, race, religion, or sexual\nidentity and orientation.\n\nWe pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and\nhealthy community.\n\n## Our Standards\n\nExamples of behavior that contributes to a positive environment for our community include:\n\n- Demonstrating empathy and kindness toward other people\n- Being respectful of differing opinions, viewpoints, and experiences\n- Giving and gracefully accepting constructive feedback\n- Accepting responsibility and apologizing to those affected by our mistakes, and learning from the\n  experience\n- Focusing on what is best not just for us as individuals, but for the overall community\n\nExamples of unacceptable behavior include:\n\n- The use of sexualized language or imagery, and sexual attention or advances of any kind\n- Trolling, insulting or derogatory comments, and personal or political attacks\n- Public or private harassment\n- Publishing others' private information, such as a physical or email address, without their\n  explicit permission\n- Other conduct which could reasonably be considered inappropriate in a professional setting\n\n## Enforcement Responsibilities\n\nCommunity leaders are responsible for clarifying and enforcing our standards of acceptable behavior\nand will take appropriate and fair corrective action in response to any behavior that they deem\ninappropriate, threatening, offensive, or harmful.\n\nCommunity leaders have the right and responsibility to remove, edit, or reject comments, commits,\ncode, wiki edits, issues, and other 
contributions that are not aligned to this Code of Conduct, and\nwill communicate reasons for moderation decisions when appropriate.\n\n## Scope\n\nThis Code of Conduct applies within all community spaces, and also applies when an individual is\nofficially representing the community in public spaces. Examples of representing our community\ninclude using an official e-mail address, posting via an official social media account, or acting as\nan appointed representative at an online or offline event.\n\n## Enforcement\n\nInstances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community\nleaders responsible for enforcement at dgraph-admin@istaridigital.com. All complaints will be\nreviewed and investigated promptly and fairly.\n\nAll community leaders are obligated to respect the privacy and security of the reporter of any\nincident.\n\n## Enforcement Guidelines\n\nCommunity leaders will follow these Community Impact Guidelines in determining the consequences for\nany action they deem in violation of this Code of Conduct:\n\n### 1. Correction\n\n**Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or\nunwelcome in the community.\n\n**Consequence**: A private, written warning from community leaders, providing clarity around the\nnature of the violation and an explanation of why the behavior was inappropriate. A public apology\nmay be requested.\n\n### 2. Warning\n\n**Community Impact**: A violation through a single incident or series of actions.\n\n**Consequence**: A warning with consequences for continued behavior. No interaction with the people\ninvolved, including unsolicited interaction with those enforcing the Code of Conduct, for a\nspecified period of time. This includes avoiding interactions in community spaces as well as\nexternal channels like social media. Violating these terms may lead to a temporary or permanent ban.\n\n### 3. 
Temporary Ban\n\n**Community Impact**: A serious violation of community standards, including sustained inappropriate\nbehavior.\n\n**Consequence**: A temporary ban from any sort of interaction or public communication with the\ncommunity for a specified period of time. No public or private interaction with the people involved,\nincluding unsolicited interaction with those enforcing the Code of Conduct, is allowed during this\nperiod. Violating these terms may lead to a permanent ban.\n\n### 4. Permanent Ban\n\n**Community Impact**: Demonstrating a pattern of violation of community standards, including\nsustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement\nof classes of individuals.\n\n**Consequence**: A permanent ban from any sort of public interaction within the community.\n\n## Attribution\n\nThis Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.0, available at\nhttps://www.contributor-covenant.org/version/2/0/code_of_conduct.html.\n\nCommunity Impact Guidelines were inspired by\n[Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/diversity).\n\n[homepage]: https://www.contributor-covenant.org\n\nFor answers to common questions about this code of conduct, see the FAQ at\nhttps://www.contributor-covenant.org/faq. Translations are available at\nhttps://www.contributor-covenant.org/translations.\n"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "# Contribution Guide\n\n- [Before you get started](#before-you-get-started)\n  - [Code of Conduct](#code-of-conduct)\n- [Your First Contribution](#your-first-contribution)\n  - [Find a good first topic](#find-a-good-first-topic)\n- [Setting up your development environment](#setting-up-your-development-environment)\n  - [Fork the project](#fork-the-project)\n  - [Clone the project](#clone-the-project)\n  - [New branch for a new code](#new-branch-for-a-new-code)\n  - [Test](#test)\n  - [Commit and push](#commit-and-push)\n  - [Create a Pull Request](#create-a-pull-request)\n  - [Sign the CLA](#sign-the-cla)\n  - [Get a code review](#get-a-code-review)\n\n## Before you get started\n\n### Code of Conduct\n\nPlease make sure to read and observe our [Code of Conduct](./CODE_OF_CONDUCT.md).\n\n## Your First Contribution\n\n### Find a good first topic\n\nYou can start by finding an existing issue with the\n[good first issue](https://github.com/dgraph-io/badger/labels/good%20first%20issue) or\n[help wanted](https://github.com/dgraph-io/badger/labels/help%20wanted) labels. 
These issues are\nwell suited for new contributors.\n\n## Setting up your development environment\n\n- [Install Go 1.25.0 or above](https://golang.org/doc/install).\n- Install\n  [trunk](https://docs.trunk.io/code-quality/overview/getting-started/install#install-the-launcher).\n  Our CI uses trunk to lint and check code; having it installed locally will save you time.\n\n### Fork the project\n\n- Visit https://github.com/dgraph-io/badger\n- Click the `Fork` button (top right) to create a fork of the repository\n\n### Clone the project\n\n```sh\ngit clone https://github.com/$GITHUB_USER/badger\ncd badger\ngit remote add upstream git@github.com:dgraph-io/badger.git\n\n# Never push to the upstream master\ngit remote set-url --push upstream no_push\n```\n\n### New branch for a new code\n\nGet your local master up to date:\n\n```sh\ngit fetch upstream\ngit checkout master\ngit rebase upstream/master\n```\n\nCreate a new branch from master:\n\n```sh\ngit checkout -b my_new_feature\n```\n\nNow you can add your changes to the project.\n\n### Test\n\nBuild and run all tests:\n\n```sh\n./test.sh\n```\n\n### Commit and push\n\nCommit your changes:\n\n```sh\ngit commit\n```\n\nWhen the changes are ready for review:\n\n```sh\ngit push origin my_new_feature\n```\n\n### Create a Pull Request\n\nOpen `https://github.com/$GITHUB_USER/badger/pull/new/my_new_feature` and fill in the PR\ndescription.\n\n### Sign the CLA\n\nClick the **Sign in with GitHub to agree** button to sign the CLA.\n[An example](https://cla-assistant.io/dgraph-io/badger?pullRequest=1377).\n\n### Get a code review\n\nOnce your pull request (PR) is opened, it will be assigned to one or more reviewers. Those reviewers\nwill do a code review.\n\nTo address review comments, commit the changes to the same branch of the PR on your fork.\n"
  },
  {
    "path": "LICENSE",
    "content": "                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      
form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. 
Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. 
You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. 
You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n"
  },
  {
    "path": "Makefile",
    "content": "#\n# SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n# SPDX-License-Identifier: Apache-2.0\n#\n\nUSER_ID      = $(shell id -u)\nHAS_JEMALLOC = $(shell test -f /usr/local/lib/libjemalloc.a && echo \"jemalloc\")\nJEMALLOC_URL = \"https://github.com/jemalloc/jemalloc/releases/download/5.3.0/jemalloc-5.3.0.tar.bz2\"\n\n\n.PHONY: all badger test jemalloc dependency\n\nbadger: jemalloc\n\t@echo \"Compiling Badger binary...\"\n\t@$(MAKE) -C badger badger\n\t@echo \"Badger binary located in badger directory.\"\n\ntest: jemalloc\n\t@echo \"Running Badger tests...\"\n\t@./test.sh\n\njemalloc:\n\t@if [ -z \"$(HAS_JEMALLOC)\" ] ; then \\\n\t\tmkdir -p /tmp/jemalloc-temp && cd /tmp/jemalloc-temp ; \\\n\t\techo \"Downloading jemalloc...\" ; \\\n\t\tcurl -s -L ${JEMALLOC_URL} -o jemalloc.tar.bz2 ; \\\n\t\ttar xjf ./jemalloc.tar.bz2 ; \\\n\t\tcd jemalloc-5.3.0 ; \\\n\t\t./configure --with-jemalloc-prefix='je_' --with-malloc-conf='background_thread:true,metadata_thp:auto'; \\\n\t\tmake ; \\\n\t\tif [ \"$(USER_ID)\" -eq \"0\" ]; then \\\n\t\t\tmake install ; \\\n\t\telse \\\n\t\t\techo \"==== Need sudo access to install jemalloc\" ; \\\n\t\t\tsudo make install ; \\\n\t\tfi \\\n\tfi\n\ndependency:\n\t@echo \"Installing dependencies...\"\n\t@sudo apt-get update\n\t@sudo apt-get -y install \\\n    \tca-certificates \\\n    \tcurl \\\n    \tgnupg \\\n    \tlsb-release \\\n    \tbuild-essential \\\n    \tprotobuf-compiler\n"
  },
  {
    "path": "README.md",
    "content": "# BadgerDB\n\n[![Go Reference](https://pkg.go.dev/badge/github.com/dgraph-io/badger/v4.svg)](https://pkg.go.dev/github.com/dgraph-io/badger/v4)\n[![Go Report Card](https://goreportcard.com/badge/github.com/dgraph-io/badger/v4)](https://goreportcard.com/report/github.com/dgraph-io/badger/v4)\n[![Sourcegraph](https://sourcegraph.com/github.com/dgraph-io/badger/-/badge.svg)](https://sourcegraph.com/github.com/dgraph-io/badger?badge)\n[![ci-badger-tests](https://github.com/dgraph-io/badger/actions/workflows/ci-badger-tests.yml/badge.svg)](https://github.com/dgraph-io/badger/actions/workflows/ci-badger-tests.yml)\n[![ci-badger-bank-tests](https://github.com/dgraph-io/badger/actions/workflows/ci-badger-bank-tests.yml/badge.svg)](https://github.com/dgraph-io/badger/actions/workflows/ci-badger-bank-tests.yml)\n[![ci-badger-bank-tests-nightly](https://github.com/dgraph-io/badger/actions/workflows/ci-badger-bank-tests-nightly.yml/badge.svg)](https://github.com/dgraph-io/badger/actions/workflows/ci-badger-bank-tests-nightly.yml)\n\n![Badger mascot](images/diggy-shadow.png)\n\nBadgerDB is an embeddable, persistent and fast key-value (KV) database written in pure Go. It is the\nunderlying database for [Dgraph](https://github.com/dgraph-io/dgraph), a fast, distributed graph\ndatabase. It's meant to be a performant alternative to non-Go-based key-value stores like RocksDB.\n\n## Project Status\n\nBadger is stable and is being used to serve data sets of hundreds of terabytes. Badger supports\nconcurrent ACID transactions with serializable snapshot isolation (SSI) guarantees. A Jepsen-style\nbank test runs nightly for 8h with the `--race` flag to verify that transactional guarantees are\nmaintained. Badger has also been tested against filesystem-level anomalies to ensure\npersistence and consistency. 
Badger is being used by a number of projects, including Dgraph,\nJaeger Tracing, UsenetExpress, and many more.\n\nThe list of projects using Badger can be found [here](#projects-using-badger).\n\nPlease consult the [Changelog] for more detailed information on releases.\n\nNote: Badger is built with Go 1.23 and we refrain from bumping this version to minimize downstream\neffects on those using Badger in applications built with older versions of Go.\n\n[Changelog]: https://github.com/dgraph-io/badger/blob/main/CHANGELOG.md\n\n## Table of Contents\n\n- [BadgerDB](#badgerdb)\n  - [Project Status](#project-status)\n  - [Table of Contents](#table-of-contents)\n  - [Getting Started](#getting-started)\n    - [Installing](#installing)\n      - [Installing Badger Command Line Tool](#installing-badger-command-line-tool)\n  - [Badger Documentation](#badger-documentation)\n  - [Resources](#resources)\n    - [Blog Posts](#blog-posts)\n  - [Design](#design)\n    - [Comparisons](#comparisons)\n    - [Benchmarks](#benchmarks)\n  - [Projects Using Badger](#projects-using-badger)\n    - [Platform Compatibility](#platform-compatibility)\n  - [Contributing](#contributing)\n  - [Contact](#contact)\n\n## Getting Started\n\n### Installing\n\nTo start using Badger, install Go 1.23 or above. Badger v3 and above need Go modules. From your\nproject, run the following command:\n\n```sh\ngo get github.com/dgraph-io/badger/v4\n```\n\nThis will retrieve the library.\n\n#### Installing Badger Command Line Tool\n\nBadger provides a CLI tool which can perform certain operations like offline backup/restore. To\ninstall the Badger CLI, retrieve the repository and check out the desired version. Then run:\n\n```sh\ncd badger\ngo install .\n```\n\nThis will install the badger command line utility into your $GOBIN path.\n\n## Badger Documentation\n\nBadger Documentation is available at [https://badger.dgraph.io](https://badger.dgraph.io)\n\n## Resources\n\n### Blog Posts\n\n1. 
[Introducing Badger: A fast key-value store written natively in Go](https://hypermode.com/blog/badger/)\n2. [Make Badger crash resilient with ALICE](https://hypermode.com/blog/alice/)\n3. [Badger vs LMDB vs BoltDB: Benchmarking key-value databases in Go](https://hypermode.com/blog/badger-lmdb-boltdb/)\n4. [Concurrent ACID Transactions in Badger](https://hypermode.com/blog/badger-txn/)\n\n## Design\n\nBadger was written with these design goals in mind:\n\n- Write a key-value database in pure Go.\n- Use latest research to build the fastest KV database for data sets spanning terabytes.\n- Optimize for SSDs.\n\nBadger’s design is based on a paper titled _[WiscKey: Separating Keys from Values in SSD-conscious\nStorage][wisckey]_.\n\n[wisckey]: https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf\n\n### Comparisons\n\n| Feature                       | Badger                                     | RocksDB                       | BoltDB    |\n| ----------------------------- | ------------------------------------------ | ----------------------------- | --------- |\n| Design                        | LSM tree with value log                    | LSM tree only                 | B+ tree   |\n| High Read throughput          | Yes                                        | No                            | Yes       |\n| High Write throughput         | Yes                                        | Yes                           | No        |\n| Designed for SSDs             | Yes (with latest research <sup>1</sup>)    | Not specifically <sup>2</sup> | No        |\n| Embeddable                    | Yes                                        | Yes                           | Yes       |\n| Sorted KV access              | Yes                                        | Yes                           | Yes       |\n| Pure Go (no Cgo)              | Yes                                        | No                            | Yes       |\n| Transactions                  | 
Yes, ACID, concurrent with SSI<sup>3</sup> | Yes (but non-ACID)            | Yes, ACID |\n| Snapshots                     | Yes                                        | Yes                           | Yes       |\n| TTL support                   | Yes                                        | Yes                           | No        |\n| 3D access (key-value-version) | Yes<sup>4</sup>                            | No                            | No        |\n\n<sup>1</sup> The [WISCKEY paper][wisckey] (on which Badger is based) saw big wins with separating\nvalues from keys, significantly reducing the write amplification compared to a typical LSM tree.\n\n<sup>2</sup> RocksDB is an SSD optimized version of LevelDB, which was designed specifically for\nrotating disks. As such RocksDB's design isn't aimed at SSDs.\n\n<sup>3</sup> SSI: Serializable Snapshot Isolation. For more details, see the blog post\n[Concurrent ACID Transactions in Badger](https://hypermode.com/blog/badger-txn/)\n\n<sup>4</sup> Badger provides direct access to value versions via its Iterator API. Users can also\nspecify how many versions to keep per key via Options.\n\n### Benchmarks\n\nWe have run comprehensive benchmarks against RocksDB, Bolt and LMDB. The benchmarking code, and the\ndetailed logs for the benchmarks can be found in the [badger-bench] repo. 
More explanation,\nincluding graphs, can be found in the blog posts (linked above).\n\n[badger-bench]: https://github.com/dgraph-io/badger-bench\n\n## Projects Using Badger\n\nBelow is a list of known projects that use Badger:\n\n- [Dgraph](https://github.com/dgraph-io/dgraph) - Distributed graph database.\n- [Jaeger](https://github.com/jaegertracing/jaeger) - Distributed tracing platform.\n- [go-ipfs](https://github.com/ipfs/go-ipfs) - Go client for the InterPlanetary File System (IPFS),\n  a new hypermedia distribution protocol.\n- [Riot](https://github.com/go-ego/riot) - An open-source, distributed search engine.\n- [emitter](https://github.com/emitter-io/emitter) - Scalable, low latency, distributed pub/sub\n  broker with message storage, uses MQTT, gossip and badger.\n- [OctoSQL](https://github.com/cube2222/octosql) - Query tool that allows you to join, analyse and\n  transform data from multiple databases using SQL.\n- [Dkron](https://dkron.io/) - Distributed, fault tolerant job scheduling system.\n- [smallstep/certificates](https://github.com/smallstep/certificates) - Step-ca is an online\n  certificate authority for secure, automated certificate management.\n- [Sandglass](https://github.com/celrenheit/sandglass) - distributed, horizontally scalable,\n  persistent, time sorted message queue.\n- [TalariaDB](https://github.com/grab/talaria) - Grab's Distributed, low latency time-series\n  database.\n- [Sloop](https://github.com/salesforce/sloop) - Salesforce's Kubernetes History Visualization\n  Project.\n- [Usenet Express](https://usenetexpress.com/) - Serving over 300TB of data with Badger.\n- [gorush](https://github.com/appleboy/gorush) - A push notification server written in Go.\n- [0-stor](https://github.com/zero-os/0-stor) - Single device object store.\n- [Dispatch Protocol](https://github.com/dispatchlabs/disgo) - Blockchain protocol for distributed\n  application data analytics.\n- [GarageMQ](https://github.com/valinurovam/garagemq) - AMQP server written 
in Go.\n- [RedixDB](https://alash3al.github.io/redix/) - A real-time persistent key-value store with the\n  same redis protocol.\n- [BBVA](https://github.com/BBVA/raft-badger) - Raft backend implementation using BadgerDB for\n  Hashicorp raft.\n- [Fantom](https://github.com/Fantom-foundation/go-lachesis) - aBFT Consensus platform for\n  distributed applications.\n- [decred](https://github.com/decred/dcrdata) - An open, progressive, and self-funding\n  cryptocurrency with a system of community-based governance integrated into its blockchain.\n- [OpenNetSys](https://github.com/opennetsys/c3-go) - Create useful dApps in any software language.\n- [HoneyTrap](https://github.com/honeytrap/honeytrap) - An extensible and opensource system for\n  running, monitoring and managing honeypots.\n- [Insolar](https://github.com/insolar/insolar) - Enterprise-ready blockchain platform.\n- [IoTeX](https://github.com/iotexproject/iotex-core) - The next generation of the decentralized\n  network for IoT powered by scalability- and privacy-centric blockchains.\n- [go-sessions](https://github.com/kataras/go-sessions) - The sessions manager for Go net/http and\n  fasthttp.\n- [Babble](https://github.com/mosaicnetworks/babble) - BFT Consensus platform for distributed\n  applications.\n- [Tormenta](https://github.com/jpincas/tormenta) - Embedded object-persistence layer / simple JSON\n  database for Go projects.\n- [BadgerHold](https://github.com/timshannon/badgerhold) - An embeddable NoSQL store for querying Go\n  types built on Badger\n- [Goblero](https://github.com/didil/goblero) - Pure Go embedded persistent job queue backed by\n  BadgerDB\n- [Surfline](https://www.surfline.com) - Serving global wave and weather forecast data with Badger.\n- [Cete](https://github.com/mosuka/cete) - Simple and highly available distributed key-value store\n  built on Badger. 
Makes it easy bringing up a cluster of Badger with Raft consensus algorithm by\n  hashicorp/raft.\n- [Volument](https://volument.com/) - A new take on website analytics backed by Badger.\n- [KVdb](https://kvdb.io/) - Hosted key-value store and serverless platform built on top of Badger.\n- [Terminotes](https://gitlab.com/asad-awadia/terminotes) - Self hosted notes storage and search\n  server - storage powered by BadgerDB\n- [Pyroscope](https://github.com/pyroscope-io/pyroscope) - Open source continuous profiling platform\n  built with BadgerDB\n- [Veri](https://github.com/bgokden/veri) - A distributed feature store optimized for Search and\n  Recommendation tasks.\n- [bIter](https://github.com/MikkelHJuul/bIter) - A library and Iterator interface for working with\n  the `badger.Iterator`, simplifying from-to, and prefix mechanics.\n- [ld](https://github.com/MikkelHJuul/ld) - (Lean Database) A very simple gRPC-only key-value\n  database, exposing BadgerDB with key-range scanning semantics.\n- [Souin](https://github.com/darkweak/Souin) - A RFC compliant HTTP cache with lot of other features\n  based on Badger for the storage. 
Compatible with all existing reverse-proxies.\n- [Xuperchain](https://github.com/xuperchain/xupercore) - A highly flexible blockchain architecture\n  with great transaction performance.\n- [m2](https://github.com/qichengzx/m2) - A simple http key/value store based on the raft protocol.\n- [chaindb](https://github.com/ChainSafe/chaindb) - A blockchain storage layer used by\n  [Gossamer](https://chainsafe.github.io/gossamer/), a Go client for the\n  [Polkadot Network](https://polkadot.network/).\n- [vxdb](https://github.com/vitalvas/vxdb) - Simple schema-less Key-Value NoSQL database with\n  simplest API interface.\n- [Opacity](https://github.com/opacity/storage-node) - Backend implementation for the Opacity\n  storage project\n- [Vephar](https://github.com/vaccovecrana/vephar) - A minimal key/value store using hashicorp-raft\n  for cluster coordination and Badger for data storage.\n- [gowarcserver](https://github.com/nlnwa/gowarcserver) - Open-source server for warc files. Can be\n  used in conjunction with pywb\n- [flow-go](https://github.com/onflow/flow-go) - A fast, secure, and developer-friendly blockchain\n  built to support the next generation of games, apps and the digital assets that power them.\n- [Wrgl](https://www.wrgl.co) - A data version control system that works like Git but specialized to\n  store and diff CSV.\n- [Loggie](https://github.com/loggie-io/loggie) - A lightweight, cloud-native data transfer agent\n  and aggregator.\n- [raft-badger](https://github.com/rfyiamcool/raft-badger) - raft-badger implements the LogStore and\n  StableStore interfaces of hashicorp/raft. It is used to store the raft log and metadata of\n  hashicorp/raft.\n- [DVID](https://github.com/janelia-flyem/dvid) - A dataservice for branched versioning of a variety\n  of data types. 
Originally created for large-scale brain reconstructions in Connectomics.\n- [KVS](https://github.com/tauraamui/kvs) - A library for making it easy to persist, load and query\n  full structs into BadgerDB, using an ownership hierarchy model.\n- [LLS](https://github.com/Boc-chi-no/LLS) - LLS is an efficient URL Shortener that can be used to\n  shorten links and track link usage. Support for BadgerDB and MongoDB. Improved performance by more\n  than 30% when using BadgerDB\n- [lakeFS](https://github.com/treeverse/lakeFS) - lakeFS is an open-source data version control that\n  transforms your object storage to Git-like repositories. lakeFS uses BadgerDB for its underlying\n  local metadata KV store implementation\n- [Goptivum](https://github.com/smegg99/Goptivum) - Goptivum is a better frontend and API for the\n  Vulcan Optivum schedule program\n- [ActionManager](https://mftlabs.io/actionmanager) - A dynamic entity manager based on rjsf schema\n  and badger db\n- [MightyMap](https://github.com/thisisdevelopment/mightymap) - Mightymap: Conveys both robustness\n  and high capability, fitting for a powerful concurrent map.\n- [FlowG](https://github.com/link-society/flowg) - A low-code log processing facility\n- [Bluefin](https://github.com/blinklabs-io/bluefin) - Bluefin is a TUNA Proof of Work miner for the\n  Fortuna smart contract on the Cardano blockchain\n- [cDNSd](https://github.com/blinklabs-io/cdnsd) - A Cardano blockchain backed DNS server daemon\n- [Dingo](https://github.com/blinklabs-io/dingo) - A Cardano blockchain data node\n\nIf you are using Badger in a project please send a pull request to add it to the list.\n\n### Platform Compatibility\n\nBadger uses OS-specific implementations for directory locking and `fsync` operations. 
On\n**POSIX-compliant systems** (Linux, macOS, BSD), these work as expected.\n\nFor **non-POSIX platforms**, be aware of potential limitations:\n\n| Platform | File             | Notes                                                                |\n| -------- | ---------------- | -------------------------------------------------------------------- |\n| AIX      | `dir_aix.go`     | Directory `fsync` not supported; durability on crash may be affected |\n| Windows  | `dir_windows.go` | Uses different locking mechanism                                     |\n| Plan9    | `dir_plan9.go`   | No file locking support                                              |\n| WASM/JS  | `dir_other.go`   | No file locking support                                              |\n\nIf you encounter issues on these platforms, review the corresponding `dir_*.go` source file for\nimplementation details.\n\n## Contributing\n\nIf you're interested in contributing to Badger, see [CONTRIBUTING](./CONTRIBUTING.md).\n\n## Contact\n\n- Please use [GitHub issues](https://github.com/dgraph-io/badger/issues) for filing bugs.\n- Please use [Discussions](https://github.com/orgs/dgraph-io/discussions) for questions,\n  discussions, and feature requests.\n"
  },
  {
    "path": "SECURITY.md",
    "content": "# Reporting Security Concerns\n\nWe take the security of Badger very seriously. If you believe you have found a security\nvulnerability in Badger, we encourage you to let us know right away.\n\nWe will investigate all legitimate reports and do our best to quickly fix the problem. Please report\nany issues or vulnerabilities via GitHub Security Advisories instead of posting a public issue in\nGitHub. You can also send security communications to dgraph-admin@istaridigital.com.\n\nPlease include the version identifier and details on how the vulnerability can be exploited.\n"
  },
  {
    "path": "VERSIONING.md",
    "content": "# Serialization Versioning: Semantic Versioning for databases\n\nSemantic Versioning, commonly known as SemVer, is a great idea that has been very widely adopted as\na way to decide how to name software versions. The whole concept is very well summarized on\nsemver.org with the following lines:\n\n> Given a version number MAJOR.MINOR.PATCH, increment the:\n>\n> 1. MAJOR version when you make incompatible API changes,\n> 2. MINOR version when you add functionality in a backwards-compatible manner, and\n> 3. PATCH version when you make backwards-compatible bug fixes.\n>\n> Additional labels for pre-release and build metadata are available as extensions to the\n> MAJOR.MINOR.PATCH format.\n\nUnfortunately, API changes are not the most important changes for libraries that serialize data for\nlater consumption. For these libraries, such as BadgerDB, changes to the API are much easier to\nhandle than change to the data format used to store data on disk.\n\n## Serialization Version specification\n\nSerialization Versioning, like Semantic Versioning, uses 3 numbers and also calls them\nMAJOR.MINOR.PATCH, but the semantics of the numbers are slightly modified:\n\nGiven a version number MAJOR.MINOR.PATCH, increment the:\n\n- MAJOR version when you make changes that require a transformation of the dataset before it can be\n  used again.\n- MINOR version when old datasets are still readable but the API might have changed in\n  backwards-compatible or incompatible ways.\n- PATCH version when you make backwards-compatible bug fixes.\n\nAdditional labels for pre-release and build metadata are available as extensions to the\nMAJOR.MINOR.PATCH format.\n\nFollowing this naming strategy, migration from v1.x to v2.x requires a migration strategy for your\nexisting dataset, and as such has to be carefully planned. Migrations in between different minor\nversions (e.g. 
v1.5.x and v1.6.x) might break your build, as the API _might_ have changed, but once\nyour code compiles there's no need for any data migration. Lastly, changes in between two different\npatch versions should never break your build or dataset.\n\nFor more background on our decision to adopt Serialization Versioning, read the blog post [Semantic\nVersioning, Go Modules, and Databases][blog] and the original proposal on [this comment on Dgraph's\nDiscuss forum][discuss].\n\n[blog]: https://open.dgraph.io/post/serialization-versioning/\n[discuss]: https://discuss.dgraph.io/t/go-modules-on-badger-and-dgraph/4662/7\n"
  },
  {
    "path": "backup.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bufio\"\n\t\"bytes\"\n\t\"context\"\n\t\"encoding/binary\"\n\t\"fmt\"\n\t\"io\"\n\n\t\"google.golang.org/protobuf/proto\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\n// flushThreshold determines when a buffer will be flushed. When performing a\n// backup/restore, the entries will be batched up until the total size of batch\n// is more than flushThreshold or entry size (without the value size) is more\n// than the maxBatchSize.\nconst flushThreshold = 100 << 20\n\n// Backup dumps a protobuf-encoded list of all entries in the database into the\n// given writer, that are newer than or equal to the specified version. It\n// returns a timestamp (version) indicating the version of last entry that is\n// dumped, which after incrementing by 1 can be passed into later invocation to\n// generate incremental backup of entries that have been added/modified since\n// the last invocation of DB.Backup().\n// DB.Backup is a wrapper function over Stream.Backup to generate full and\n// incremental backups of the DB. For more control over how many goroutines are\n// used to generate the backup, or if you wish to backup only a certain range\n// of keys, use Stream.Backup directly.\nfunc (db *DB) Backup(w io.Writer, since uint64) (uint64, error) {\n\tstream := db.NewStream()\n\tstream.LogPrefix = \"DB.Backup\"\n\tstream.SinceTs = since\n\treturn stream.Backup(w, since)\n}\n\n// Backup dumps a protobuf-encoded list of all entries in the database into the\n// given writer, that are newer than or equal to the specified version. 
It returns a\n// timestamp(version) indicating the version of last entry that was dumped, which\n// after incrementing by 1 can be passed into a later invocation to generate an\n// incremental dump of entries that have been added/modified since the last\n// invocation of Stream.Backup().\n//\n// This can be used to backup the data in a database at a given point in time.\nfunc (stream *Stream) Backup(w io.Writer, since uint64) (uint64, error) {\n\tstream.KeyToList = func(key []byte, itr *Iterator) (*pb.KVList, error) {\n\t\tlist := &pb.KVList{}\n\t\ta := itr.Alloc\n\t\tfor ; itr.Valid(); itr.Next() {\n\t\t\titem := itr.Item()\n\t\t\tif !bytes.Equal(item.Key(), key) {\n\t\t\t\treturn list, nil\n\t\t\t}\n\t\t\tif item.Version() < since {\n\t\t\t\treturn nil, fmt.Errorf(\"Backup: Item Version: %d less than sinceTs: %d\",\n\t\t\t\t\titem.Version(), since)\n\t\t\t}\n\n\t\t\tvar valCopy []byte\n\t\t\tif !item.IsDeletedOrExpired() {\n\t\t\t\t// No need to copy value, if item is deleted or expired.\n\t\t\t\terr := item.Value(func(val []byte) error {\n\t\t\t\t\tvalCopy = a.Copy(val)\n\t\t\t\t\treturn nil\n\t\t\t\t})\n\t\t\t\tif err != nil {\n\t\t\t\t\tstream.db.opt.Errorf(\"Key [%x, %d]. 
Error while fetching value [%v]\\n\",\n\t\t\t\t\t\titem.Key(), item.Version(), err)\n\t\t\t\t\treturn nil, err\n\t\t\t\t}\n\t\t\t}\n\n\t\t\t// clear txn bits\n\t\t\tmeta := item.meta &^ (bitTxn | bitFinTxn)\n\t\t\tkv := y.NewKV(a)\n\t\t\t*kv = pb.KV{\n\t\t\t\tKey:       a.Copy(item.Key()),\n\t\t\t\tValue:     valCopy,\n\t\t\t\tUserMeta:  a.Copy([]byte{item.UserMeta()}),\n\t\t\t\tVersion:   item.Version(),\n\t\t\t\tExpiresAt: item.ExpiresAt(),\n\t\t\t\tMeta:      a.Copy([]byte{meta}),\n\t\t\t}\n\t\t\tlist.Kv = append(list.Kv, kv)\n\n\t\t\tswitch {\n\t\t\tcase item.DiscardEarlierVersions():\n\t\t\t\t// If we need to discard earlier versions of this item, add a delete\n\t\t\t\t// marker just below the current version.\n\t\t\t\tlist.Kv = append(list.Kv, &pb.KV{\n\t\t\t\t\tKey:     item.KeyCopy(nil),\n\t\t\t\t\tVersion: item.Version() - 1,\n\t\t\t\t\tMeta:    []byte{bitDelete},\n\t\t\t\t})\n\t\t\t\treturn list, nil\n\n\t\t\tcase item.IsDeletedOrExpired():\n\t\t\t\treturn list, nil\n\t\t\t}\n\t\t}\n\t\treturn list, nil\n\t}\n\n\tvar maxVersion uint64\n\tstream.Send = func(buf *z.Buffer) error {\n\t\tlist, err := BufferToKVList(buf)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tout := list.Kv[:0]\n\t\tfor _, kv := range list.Kv {\n\t\t\tif maxVersion < kv.Version {\n\t\t\t\tmaxVersion = kv.Version\n\t\t\t}\n\t\t\tif !kv.StreamDone {\n\t\t\t\t// Don't pick stream done changes.\n\t\t\t\tout = append(out, kv)\n\t\t\t}\n\t\t}\n\t\tlist.Kv = out\n\t\treturn writeTo(list, w)\n\t}\n\n\tif err := stream.Orchestrate(context.Background()); err != nil {\n\t\treturn 0, err\n\t}\n\treturn maxVersion, nil\n}\n\nfunc writeTo(list *pb.KVList, w io.Writer) error {\n\tif err := binary.Write(w, binary.LittleEndian, uint64(proto.Size(list))); err != nil {\n\t\treturn err\n\t}\n\tbuf, err := proto.Marshal(list)\n\tif err != nil {\n\t\treturn err\n\t}\n\t_, err = w.Write(buf)\n\treturn err\n}\n\n// KVLoader is used to write KVList objects in to badger. 
It can be used to restore a backup.\ntype KVLoader struct {\n\tdb          *DB\n\tthrottle    *y.Throttle\n\tentries     []*Entry\n\tentriesSize int64\n\ttotalSize   int64\n}\n\n// NewKVLoader returns a new instance of KVLoader.\nfunc (db *DB) NewKVLoader(maxPendingWrites int) *KVLoader {\n\treturn &KVLoader{\n\t\tdb:       db,\n\t\tthrottle: y.NewThrottle(maxPendingWrites),\n\t\tentries:  make([]*Entry, 0, db.opt.maxBatchCount),\n\t}\n}\n\n// Set writes the key-value pair to the database.\nfunc (l *KVLoader) Set(kv *pb.KV) error {\n\tvar userMeta, meta byte\n\tif len(kv.UserMeta) > 0 {\n\t\tuserMeta = kv.UserMeta[0]\n\t}\n\tif len(kv.Meta) > 0 {\n\t\tmeta = kv.Meta[0]\n\t}\n\te := &Entry{\n\t\tKey:       y.KeyWithTs(kv.Key, kv.Version),\n\t\tValue:     kv.Value,\n\t\tUserMeta:  userMeta,\n\t\tExpiresAt: kv.ExpiresAt,\n\t\tmeta:      meta,\n\t}\n\testimatedSize := e.estimateSizeAndSetThreshold(l.db.valueThreshold())\n\t// Flush entries if inserting the next entry would overflow the transactional limits.\n\tif int64(len(l.entries))+1 >= l.db.opt.maxBatchCount ||\n\t\tl.entriesSize+estimatedSize >= l.db.opt.maxBatchSize ||\n\t\tl.totalSize >= flushThreshold {\n\t\tif err := l.send(); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\tl.entries = append(l.entries, e)\n\tl.entriesSize += estimatedSize\n\tl.totalSize += estimatedSize + int64(len(e.Value))\n\treturn nil\n}\n\nfunc (l *KVLoader) send() error {\n\tif err := l.throttle.Do(); err != nil {\n\t\treturn err\n\t}\n\tif err := l.db.batchSetAsync(l.entries, func(err error) {\n\t\tl.throttle.Done(err)\n\t}); err != nil {\n\t\treturn err\n\t}\n\n\tl.entries = make([]*Entry, 0, l.db.opt.maxBatchCount)\n\tl.entriesSize = 0\n\tl.totalSize = 0\n\treturn nil\n}\n\n// Finish is meant to be called after all the key-value pairs have been loaded.\nfunc (l *KVLoader) Finish() error {\n\tif len(l.entries) > 0 {\n\t\tif err := l.send(); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\treturn l.throttle.Finish()\n}\n\n// Load reads a 
protobuf-encoded list of all entries from a reader and writes\n// them to the database. This can be used to restore the database from a backup\n// made by calling DB.Backup(). If more complex logic is needed to restore a badger\n// backup, KVLoader should be used directly instead.\n//\n// DB.Load() should be called on a database that is not running any\n// concurrent transactions while Load is in progress.\nfunc (db *DB) Load(r io.Reader, maxPendingWrites int) error {\n\tbr := bufio.NewReaderSize(r, 16<<10)\n\tunmarshalBuf := make([]byte, 1<<10)\n\n\tldr := db.NewKVLoader(maxPendingWrites)\n\tfor {\n\t\tvar sz uint64\n\t\terr := binary.Read(br, binary.LittleEndian, &sz)\n\t\tif err == io.EOF {\n\t\t\tbreak\n\t\t} else if err != nil {\n\t\t\treturn err\n\t\t}\n\n\t\tif cap(unmarshalBuf) < int(sz) {\n\t\t\tunmarshalBuf = make([]byte, sz)\n\t\t}\n\n\t\tif _, err = io.ReadFull(br, unmarshalBuf[:sz]); err != nil {\n\t\t\treturn err\n\t\t}\n\n\t\tlist := &pb.KVList{}\n\t\tif err := proto.Unmarshal(unmarshalBuf[:sz], list); err != nil {\n\t\t\treturn err\n\t\t}\n\n\t\tfor _, kv := range list.Kv {\n\t\t\tif err := ldr.Set(kv); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\n\t\t\t// Update nextTxnTs, memtable stores this\n\t\t\t// timestamp in badger head when flushed.\n\t\t\tif kv.Version >= db.orc.nextTxnTs {\n\t\t\t\tdb.orc.nextTxnTs = kv.Version + 1\n\t\t\t}\n\t\t}\n\t}\n\n\tif err := ldr.Finish(); err != nil {\n\t\treturn err\n\t}\n\tdb.orc.txnMark.Done(db.orc.nextTxnTs - 1)\n\treturn nil\n}\n"
  },
  {
    "path": "backup_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n\t\"math/rand\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"reflect\"\n\t\"strconv\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n)\n\nfunc TestBackupRestore1(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\tdb, err := Open(getTestOptions(dir))\n\trequire.NoError(t, err)\n\n\t// Write some stuff\n\tentries := []struct {\n\t\tkey      []byte\n\t\tval      []byte\n\t\tuserMeta byte\n\t\tversion  uint64\n\t}{\n\t\t{key: []byte(\"answer1\"), val: []byte(\"42\"), version: 1},\n\t\t{key: []byte(\"answer2\"), val: []byte(\"43\"), userMeta: 1, version: 2},\n\t}\n\n\terr = db.Update(func(txn *Txn) error {\n\t\te := entries[0]\n\t\terr := txn.SetEntry(NewEntry(e.key, e.val).WithMeta(e.userMeta))\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\treturn nil\n\t})\n\trequire.NoError(t, err)\n\n\terr = db.Update(func(txn *Txn) error {\n\t\te := entries[1]\n\t\terr := txn.SetEntry(NewEntry(e.key, e.val).WithMeta(e.userMeta))\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\treturn nil\n\t})\n\trequire.NoError(t, err)\n\n\t// Use different directory.\n\tdir, err = os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\tbak, err := os.CreateTemp(dir, \"badgerbak\")\n\trequire.NoError(t, err)\n\t_, err = db.Backup(bak, 0)\n\trequire.NoError(t, err)\n\trequire.NoError(t, bak.Close())\n\trequire.NoError(t, db.Close())\n\n\tdb, err = Open(getTestOptions(dir))\n\trequire.NoError(t, err)\n\tdefer db.Close()\n\tbak, err = os.Open(bak.Name())\n\trequire.NoError(t, err)\n\tdefer bak.Close()\n\n\trequire.NoError(t, db.Load(bak, 16))\n\n\terr = db.View(func(txn *Txn) error {\n\t\topts := DefaultIteratorOptions\n\t\topts.AllVersions = true\n\t\tit 
:= txn.NewIterator(opts)\n\t\tdefer it.Close()\n\t\tvar count int\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\titem := it.Item()\n\t\t\tval, err := item.ValueCopy(nil)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tt.Logf(\"Got entry: %v\\n\", item.Version())\n\t\t\trequire.Equal(t, entries[count].key, item.Key())\n\t\t\trequire.Equal(t, entries[count].val, val)\n\t\t\trequire.Equal(t, entries[count].version, item.Version())\n\t\t\trequire.Equal(t, entries[count].userMeta, item.UserMeta())\n\t\t\tcount++\n\t\t}\n\t\trequire.Equal(t, count, 2)\n\t\treturn nil\n\t})\n\trequire.NoError(t, err)\n\trequire.Equal(t, 3, int(db.orc.nextTs()))\n}\n\nfunc TestBackupRestore2(t *testing.T) {\n\ttmpdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\n\tdefer removeDir(tmpdir)\n\n\ts1Path := filepath.Join(tmpdir, \"test1\")\n\ts2Path := filepath.Join(tmpdir, \"test2\")\n\ts3Path := filepath.Join(tmpdir, \"test3\")\n\n\tdb1, err := Open(getTestOptions(s1Path))\n\trequire.NoError(t, err)\n\n\tdefer db1.Close()\n\tkey1 := []byte(\"key1\")\n\tkey2 := []byte(\"key2\")\n\trawValue := []byte(\"NotLongValue\")\n\tN := byte(251)\n\terr = db1.Update(func(tx *Txn) error {\n\t\tif err := tx.SetEntry(NewEntry(key1, rawValue)); err != nil {\n\t\t\treturn err\n\t\t}\n\t\treturn tx.SetEntry(NewEntry(key2, rawValue))\n\t})\n\trequire.NoError(t, err)\n\n\tfor i := byte(1); i < N; i++ {\n\t\terr = db1.Update(func(tx *Txn) error {\n\t\t\tif err := tx.SetEntry(NewEntry(append(key1, i), rawValue)); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\treturn tx.SetEntry(NewEntry(append(key2, i), rawValue))\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t}\n\tvar backup bytes.Buffer\n\t_, err = db1.Backup(&backup, 0)\n\trequire.NoError(t, err)\n\n\tfmt.Println(\"backup1 length:\", backup.Len())\n\n\tdb2, err := Open(getTestOptions(s2Path))\n\trequire.NoError(t, err)\n\n\tdefer db2.Close()\n\terr = db2.Load(&backup, 16)\n\trequire.NoError(t, err)\n\n\t// Check nextTs 
is correctly set.\n\trequire.Equal(t, db1.orc.nextTs(), db2.orc.nextTs())\n\n\tfor i := byte(1); i < N; i++ {\n\t\terr = db2.View(func(tx *Txn) error {\n\t\t\tk := append(key1, i)\n\t\t\titem, err := tx.Get(k)\n\t\t\tif err != nil {\n\t\t\t\tif err == ErrKeyNotFound {\n\t\t\t\t\treturn fmt.Errorf(\"Key %q has been not found, but was set\\n\", k)\n\t\t\t\t}\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tv, err := item.ValueCopy(nil)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tif !reflect.DeepEqual(v, rawValue) {\n\t\t\t\treturn fmt.Errorf(\"Values not match, got %v, expected %v\", v, rawValue)\n\t\t\t}\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t}\n\n\tfor i := byte(1); i < N; i++ {\n\t\terr = db2.Update(func(tx *Txn) error {\n\t\t\tif err := tx.SetEntry(NewEntry(append(key1, i), rawValue)); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\treturn tx.SetEntry(NewEntry(append(key2, i), rawValue))\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t}\n\n\tbackup.Reset()\n\t_, err = db2.Backup(&backup, 0)\n\trequire.NoError(t, err)\n\n\tfmt.Println(\"backup2 length:\", backup.Len())\n\tdb3, err := Open(getTestOptions(s3Path))\n\trequire.NoError(t, err)\n\n\tdefer db3.Close()\n\n\terr = db3.Load(&backup, 16)\n\trequire.NoError(t, err)\n\n\t// Check nextTs is correctly set.\n\trequire.Equal(t, db2.orc.nextTs(), db3.orc.nextTs())\n\n\tfor i := byte(1); i < N; i++ {\n\t\terr = db3.View(func(tx *Txn) error {\n\t\t\tk := append(key1, i)\n\t\t\titem, err := tx.Get(k)\n\t\t\tif err != nil {\n\t\t\t\tif err == ErrKeyNotFound {\n\t\t\t\t\treturn fmt.Errorf(\"Key %q has been not found, but was set\\n\", k)\n\t\t\t\t}\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tv, err := item.ValueCopy(nil)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tif !reflect.DeepEqual(v, rawValue) {\n\t\t\t\treturn fmt.Errorf(\"Values not match, got %v, expected %v\", v, rawValue)\n\t\t\t}\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t}\n\n}\n\nvar randSrc = 
rand.NewSource(time.Now().UnixNano())\n\nfunc createEntries(n int) []*pb.KV {\n\tentries := make([]*pb.KV, n)\n\tfor i := 0; i < n; i++ {\n\t\tentries[i] = &pb.KV{\n\t\t\tKey:      []byte(fmt.Sprint(\"key\", i)),\n\t\t\tValue:    []byte{1},\n\t\t\tUserMeta: []byte{0},\n\t\t\tMeta:     []byte{0},\n\t\t}\n\t}\n\treturn entries\n}\n\nfunc populateEntries(db *DB, entries []*pb.KV) error {\n\treturn db.Update(func(txn *Txn) error {\n\t\tvar err error\n\t\tfor i, e := range entries {\n\t\t\tif err = txn.SetEntry(NewEntry(e.Key, e.Value)); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tentries[i].Version = 1\n\t\t}\n\t\treturn nil\n\t})\n}\n\nfunc TestBackup(t *testing.T) {\n\ttest := func(t *testing.T, db *DB) {\n\t\tvar bb bytes.Buffer\n\t\tN := 1000\n\t\tentries := createEntries(N)\n\t\trequire.NoError(t, populateEntries(db, entries))\n\n\t\t_, err := db.Backup(&bb, 0)\n\t\trequire.NoError(t, err)\n\n\t\terr = db.View(func(txn *Txn) error {\n\t\t\topts := DefaultIteratorOptions\n\t\t\tit := txn.NewIterator(opts)\n\t\t\tdefer it.Close()\n\t\t\tvar count int\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\titem := it.Item()\n\t\t\t\tidx, err := strconv.Atoi(string(item.Key())[3:])\n\t\t\t\tif err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t\tif idx > N || !bytes.Equal(entries[idx].Key, item.Key()) {\n\t\t\t\t\treturn fmt.Errorf(\"%s: %s\", string(item.Key()), ErrKeyNotFound)\n\t\t\t\t}\n\t\t\t\tcount++\n\t\t\t}\n\t\t\tif N != count {\n\t\t\t\treturn fmt.Errorf(\"wrong number of items: %d expected, %d actual\", N, count)\n\t\t\t}\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\ttmpdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\n\t\tdefer removeDir(tmpdir)\n\t\topt := DefaultOptions(filepath.Join(tmpdir, \"backup0\"))\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\ttest(t, db)\n\t\t})\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\topt := 
DefaultOptions(\"\")\n\t\topt.InMemory = true\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\ttest(t, db)\n\t\t})\n\t})\n}\n\nfunc TestBackupRestore3(t *testing.T) {\n\tvar bb bytes.Buffer\n\ttmpdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\n\tdefer removeDir(tmpdir)\n\n\tN := 1000\n\tentries := createEntries(N)\n\n\tvar db1NextTs uint64\n\t// backup\n\t{\n\t\tdb1, err := Open(DefaultOptions(filepath.Join(tmpdir, \"backup1\")))\n\t\trequire.NoError(t, err)\n\n\t\tdefer db1.Close()\n\t\trequire.NoError(t, populateEntries(db1, entries))\n\n\t\t_, err = db1.Backup(&bb, 0)\n\t\trequire.NoError(t, err)\n\n\t\tdb1NextTs = db1.orc.nextTs()\n\t\trequire.NoError(t, db1.Close())\n\t}\n\trequire.True(t, len(entries) == N)\n\trequire.True(t, bb.Len() > 0)\n\n\t// restore\n\tdb2, err := Open(DefaultOptions(filepath.Join(tmpdir, \"restore1\")))\n\trequire.NoError(t, err)\n\n\tdefer db2.Close()\n\trequire.NotEqual(t, db1NextTs, db2.orc.nextTs())\n\trequire.NoError(t, db2.Load(&bb, 16))\n\trequire.Equal(t, db1NextTs, db2.orc.nextTs())\n\n\t// verify\n\terr = db2.View(func(txn *Txn) error {\n\t\topts := DefaultIteratorOptions\n\t\tit := txn.NewIterator(opts)\n\t\tdefer it.Close()\n\t\tvar count int\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\titem := it.Item()\n\t\t\tidx, err := strconv.Atoi(string(item.Key())[3:])\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tif idx > N || !bytes.Equal(entries[idx].Key, item.Key()) {\n\t\t\t\treturn fmt.Errorf(\"%s: %s\", string(item.Key()), ErrKeyNotFound)\n\t\t\t}\n\t\t\tcount++\n\t\t}\n\t\tif N != count {\n\t\t\treturn fmt.Errorf(\"wrong number of items: %d expected, %d actual\", N, count)\n\t\t}\n\t\treturn nil\n\t})\n\trequire.NoError(t, err)\n}\n\nfunc TestBackupLoadIncremental(t *testing.T) {\n\ttmpdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\n\tdefer removeDir(tmpdir)\n\n\tN := 100\n\tentries := createEntries(N)\n\tupdates := 
make(map[int]byte)\n\tvar bb bytes.Buffer\n\n\tvar db1NextTs uint64\n\t// backup\n\t{\n\t\tdb1, err := Open(DefaultOptions(filepath.Join(tmpdir, \"backup2\")))\n\t\trequire.NoError(t, err)\n\n\t\tdefer db1.Close()\n\n\t\trequire.NoError(t, populateEntries(db1, entries))\n\t\tsince, err := db1.Backup(&bb, 0)\n\t\trequire.NoError(t, err)\n\n\t\tints := rand.New(randSrc).Perm(N)\n\n\t\t// pick 10 items to mark as deleted.\n\t\terr = db1.Update(func(txn *Txn) error {\n\t\t\tfor _, i := range ints[:10] {\n\t\t\t\tif err := txn.Delete(entries[i].Key); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t\tupdates[i] = bitDelete\n\t\t\t}\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t\tsince, err = db1.Backup(&bb, since)\n\t\trequire.NoError(t, err)\n\n\t\t// pick 5 items to mark as expired.\n\t\terr = db1.Update(func(txn *Txn) error {\n\t\t\tfor _, i := range (ints)[10:15] {\n\t\t\t\tentry := NewEntry(entries[i].Key, entries[i].Value).WithTTL(-time.Hour)\n\t\t\t\tif err := txn.SetEntry(entry); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t\tupdates[i] = bitDelete // expired\n\t\t\t}\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t\tsince, err = db1.Backup(&bb, since)\n\t\trequire.NoError(t, err)\n\n\t\t// pick 5 items to mark as discard.\n\t\terr = db1.Update(func(txn *Txn) error {\n\t\t\tfor _, i := range ints[15:20] {\n\t\t\t\tentry := NewEntry(entries[i].Key, entries[i].Value).WithDiscard()\n\t\t\t\tif err := txn.SetEntry(entry); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t\tupdates[i] = bitDiscardEarlierVersions\n\t\t\t}\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t\t_, err = db1.Backup(&bb, since)\n\t\trequire.NoError(t, err)\n\n\t\tdb1NextTs = db1.orc.nextTs()\n\n\t\trequire.NoError(t, db1.Close())\n\t}\n\trequire.True(t, len(entries) == N)\n\trequire.True(t, bb.Len() > 0)\n\n\t// restore\n\tdb2, err := Open(getTestOptions(filepath.Join(tmpdir, \"restore2\")))\n\trequire.NoError(t, err)\n\n\tdefer 
db2.Close()\n\n\trequire.NotEqual(t, db1NextTs, db2.orc.nextTs())\n\trequire.NoError(t, db2.Load(&bb, 16))\n\trequire.Equal(t, db1NextTs, db2.orc.nextTs())\n\n\t// verify\n\tactual := make(map[int]byte)\n\terr = db2.View(func(txn *Txn) error {\n\t\topts := DefaultIteratorOptions\n\t\topts.AllVersions = true\n\t\tit := txn.NewIterator(opts)\n\t\tdefer it.Close()\n\t\tvar count int\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\titem := it.Item()\n\t\t\tidx, err := strconv.Atoi(string(item.Key())[3:])\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tif item.IsDeletedOrExpired() {\n\t\t\t\t_, ok := updates[idx]\n\t\t\t\tif !ok {\n\t\t\t\t\treturn fmt.Errorf(\"%s: not expected to be updated but it is\",\n\t\t\t\t\t\tstring(item.Key()))\n\t\t\t\t}\n\t\t\t\tactual[idx] = item.meta\n\t\t\t\tcount++\n\t\t\t\tcontinue\n\t\t\t}\n\t\t}\n\t\tif len(updates) != count {\n\t\t\treturn fmt.Errorf(\"mismatched updated items: %d expected, %d actual\",\n\t\t\t\tlen(updates), count)\n\t\t}\n\t\treturn nil\n\t})\n\trequire.NoError(t, err, \"%v %v\", updates, actual)\n}\n\nfunc TestBackupBitClear(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topt := getTestOptions(dir)\n\topt.ValueThreshold = 10 // This is important\n\tdb, err := Open(opt)\n\trequire.NoError(t, err)\n\n\tkey := []byte(\"foo\")\n\tval := []byte(fmt.Sprintf(\"%0100d\", 1))\n\trequire.Greater(t, int64(len(val)), db.valueThreshold())\n\n\terr = db.Update(func(txn *Txn) error {\n\t\te := NewEntry(key, val)\n\t\t// Value > valueTheshold so bitValuePointer will be set.\n\t\treturn txn.SetEntry(e)\n\t})\n\trequire.NoError(t, err)\n\n\t// Use different directory.\n\tdir, err = os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tbak, err := os.CreateTemp(dir, \"badgerbak\")\n\trequire.NoError(t, err)\n\t_, err = db.Backup(bak, 0)\n\trequire.NoError(t, err)\n\trequire.NoError(t, 
bak.Close())\n\n\toldValue := db.orc.nextTs()\n\trequire.NoError(t, db.Close())\n\n\topt = getTestOptions(dir)\n\topt.ValueThreshold = 200 // This is important.\n\tdb, err = Open(opt)\n\trequire.NoError(t, err)\n\tdefer db.Close()\n\n\tbak, err = os.Open(bak.Name())\n\trequire.NoError(t, err)\n\tdefer bak.Close()\n\n\trequire.NoError(t, db.Load(bak, 16))\n\t// Ensure nextTs is still the same.\n\trequire.Equal(t, oldValue, db.orc.nextTs())\n\n\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\te, err := txn.Get(key)\n\t\trequire.NoError(t, err)\n\t\tv, err := e.ValueCopy(nil)\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, val, v)\n\t\treturn nil\n\t}))\n}\n"
  },
  {
    "path": "batch.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"errors\"\n\t\"fmt\"\n\t\"sync\"\n\t\"sync/atomic\"\n\n\t\"google.golang.org/protobuf/proto\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\n// WriteBatch holds the necessary info to perform batched writes.\ntype WriteBatch struct {\n\tsync.Mutex\n\ttxn      *Txn\n\tdb       *DB\n\tthrottle *y.Throttle\n\terr      atomic.Value\n\n\tisManaged bool\n\tcommitTs  uint64\n\tfinished  bool\n}\n\n// NewWriteBatch creates a new WriteBatch. This provides a way to conveniently do a lot of writes,\n// batching them up as tightly as possible in a single transaction and using callbacks to avoid\n// waiting for them to commit, thus achieving good performance. This API hides away the logic of\n// creating and committing transactions. Due to the nature of SSI guaratees provided by Badger,\n// blind writes can never encounter transaction conflicts (ErrConflict).\nfunc (db *DB) NewWriteBatch() *WriteBatch {\n\tif db.opt.managedTxns {\n\t\tpanic(\"cannot use NewWriteBatch in managed mode. Use NewWriteBatchAt instead\")\n\t}\n\treturn db.newWriteBatch(false)\n}\n\nfunc (db *DB) newWriteBatch(isManaged bool) *WriteBatch {\n\treturn &WriteBatch{\n\t\tdb:        db,\n\t\tisManaged: isManaged,\n\t\ttxn:       db.newTransaction(true, isManaged),\n\t\tthrottle:  y.NewThrottle(16),\n\t}\n}\n\n// SetMaxPendingTxns sets a limit on maximum number of pending transactions while writing batches.\n// This function should be called before using WriteBatch. Default value of MaxPendingTxns is\n// 16 to minimise memory usage.\nfunc (wb *WriteBatch) SetMaxPendingTxns(max int) {\n\twb.throttle = y.NewThrottle(max)\n}\n\n// Cancel function must be called if there's a chance that Flush might not get\n// called. 
If neither Flush nor Cancel is called, the transaction oracle would\n// never get a chance to clear out the row commit timestamp map, thus causing\n// unbounded memory consumption. Typically, you can call Cancel as a defer\n// statement right after NewWriteBatch is called.\n//\n// Note that any committed writes would still go through despite calling Cancel.\nfunc (wb *WriteBatch) Cancel() {\n\twb.Lock()\n\tdefer wb.Unlock()\n\twb.finished = true\n\tif err := wb.throttle.Finish(); err != nil {\n\t\twb.db.opt.Errorf(\"WriteBatch.Cancel error while finishing: %v\", err)\n\t}\n\twb.txn.Discard()\n}\n\nfunc (wb *WriteBatch) callback(err error) {\n\t// sync.WaitGroup is thread-safe, so it doesn't need to be run inside wb.Lock.\n\tdefer wb.throttle.Done(err)\n\tif err == nil {\n\t\treturn\n\t}\n\tif err := wb.Error(); err != nil {\n\t\treturn\n\t}\n\twb.err.Store(err)\n}\n\nfunc (wb *WriteBatch) writeKV(kv *pb.KV) error {\n\te := Entry{Key: kv.Key, Value: kv.Value}\n\tif len(kv.UserMeta) > 0 {\n\t\te.UserMeta = kv.UserMeta[0]\n\t}\n\ty.AssertTrue(kv.Version != 0)\n\te.version = kv.Version\n\treturn wb.handleEntry(&e)\n}\n\n// Write unmarshals each slice in buf as a pb.KV and adds it to the batch.\nfunc (wb *WriteBatch) Write(buf *z.Buffer) error {\n\twb.Lock()\n\tdefer wb.Unlock()\n\n\terr := buf.SliceIterate(func(s []byte) error {\n\t\tkv := &pb.KV{}\n\t\tif err := proto.Unmarshal(s, kv); err != nil {\n\t\t\treturn err\n\t\t}\n\t\treturn wb.writeKV(kv)\n\t})\n\treturn err\n}\n\n// WriteList adds every KV in kvList to the batch.\nfunc (wb *WriteBatch) WriteList(kvList *pb.KVList) error {\n\twb.Lock()\n\tdefer wb.Unlock()\n\tfor _, kv := range kvList.Kv {\n\t\tif err := wb.writeKV(kv); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\treturn nil\n}\n\n// SetEntryAt is the equivalent of Txn.SetEntry but it also allows setting version for the entry.\n// SetEntryAt can be used only in managed mode.\nfunc (wb *WriteBatch) SetEntryAt(e *Entry, ts uint64) error {\n\tif !wb.db.opt.managedTxns {\n\t\treturn errors.New(\"SetEntryAt can only be used in managed mode. 
Use SetEntry instead\")\n\t}\n\te.version = ts\n\treturn wb.SetEntry(e)\n}\n\n// handleEntry should be called with the lock acquired.\nfunc (wb *WriteBatch) handleEntry(e *Entry) error {\n\tif err := wb.txn.SetEntry(e); err != ErrTxnTooBig {\n\t\treturn err\n\t}\n\t// Txn has reached its size limit. Commit now.\n\tif cerr := wb.commit(); cerr != nil {\n\t\treturn cerr\n\t}\n\t// This time the error must not be ErrTxnTooBig, otherwise, we make the\n\t// error permanent.\n\tif err := wb.txn.SetEntry(e); err != nil {\n\t\twb.err.Store(err)\n\t\treturn err\n\t}\n\treturn nil\n}\n\n// SetEntry is the equivalent of Txn.SetEntry.\nfunc (wb *WriteBatch) SetEntry(e *Entry) error {\n\twb.Lock()\n\tdefer wb.Unlock()\n\treturn wb.handleEntry(e)\n}\n\n// Set is the equivalent of Txn.Set().\nfunc (wb *WriteBatch) Set(k, v []byte) error {\n\te := &Entry{Key: k, Value: v}\n\treturn wb.SetEntry(e)\n}\n\n// DeleteAt is the equivalent of Txn.Delete but accepts a delete timestamp.\nfunc (wb *WriteBatch) DeleteAt(k []byte, ts uint64) error {\n\te := Entry{Key: k, meta: bitDelete, version: ts}\n\treturn wb.SetEntry(&e)\n}\n\n// Delete is the equivalent of Txn.Delete.\nfunc (wb *WriteBatch) Delete(k []byte) error {\n\twb.Lock()\n\tdefer wb.Unlock()\n\n\tif err := wb.txn.Delete(k); err != ErrTxnTooBig {\n\t\treturn err\n\t}\n\tif err := wb.commit(); err != nil {\n\t\treturn err\n\t}\n\tif err := wb.txn.Delete(k); err != nil {\n\t\twb.err.Store(err)\n\t\treturn err\n\t}\n\treturn nil\n}\n\n// The caller of commit must hold the write lock.\nfunc (wb *WriteBatch) commit() error {\n\tif err := wb.Error(); err != nil {\n\t\treturn err\n\t}\n\tif wb.finished {\n\t\treturn y.ErrCommitAfterFinish\n\t}\n\tif err := wb.throttle.Do(); err != nil {\n\t\twb.err.Store(err)\n\t\treturn err\n\t}\n\twb.txn.CommitWith(wb.callback)\n\twb.txn = wb.db.newTransaction(true, wb.isManaged)\n\twb.txn.commitTs = wb.commitTs\n\treturn wb.Error()\n}\n\n// Flush must be called at the end to ensure that any pending writes get committed to Badger. 
Flush\n// returns any error stored by WriteBatch.\nfunc (wb *WriteBatch) Flush() error {\n\twb.Lock()\n\terr := wb.commit()\n\tif err != nil {\n\t\twb.Unlock()\n\t\treturn err\n\t}\n\twb.finished = true\n\twb.txn.Discard()\n\twb.Unlock()\n\n\tif err := wb.throttle.Finish(); err != nil {\n\t\tif wb.Error() != nil {\n\t\t\treturn fmt.Errorf(\"wb.err: %w err: %w\", wb.Error(), err)\n\t\t}\n\t\treturn err\n\t}\n\n\treturn wb.Error()\n}\n\n// Error returns any errors encountered so far. No commits would be run once an error is detected.\nfunc (wb *WriteBatch) Error() error {\n\t// If the interface conversion fails, the err will be nil.\n\terr, _ := wb.err.Load().(error)\n\treturn err\n}\n"
  },
  {
    "path": "batch_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"os\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\nfunc TestWriteBatch(t *testing.T) {\n\tkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%10d\", i))\n\t}\n\tval := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%128d\", i))\n\t}\n\n\ttest := func(t *testing.T, db *DB) {\n\t\twb := db.NewWriteBatch()\n\t\tdefer wb.Cancel()\n\n\t\t// Sanity check for SetEntryAt.\n\t\trequire.Error(t, wb.SetEntryAt(&Entry{}, 12))\n\n\t\tN, M := 50000, 1000\n\t\tstart := time.Now()\n\n\t\tfor i := 0; i < N; i++ {\n\t\t\trequire.NoError(t, wb.Set(key(i), val(i)))\n\t\t}\n\t\tfor i := 0; i < M; i++ {\n\t\t\trequire.NoError(t, wb.Delete(key(i)))\n\t\t}\n\t\trequire.NoError(t, wb.Flush())\n\t\tt.Logf(\"Time taken for %d writes (w/ test options): %s\\n\", N+M, time.Since(start))\n\n\t\terr := db.View(func(txn *Txn) error {\n\t\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\t\tdefer itr.Close()\n\n\t\t\ti := M\n\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\titem := itr.Item()\n\t\t\t\trequire.Equal(t, string(key(i)), string(item.Key()))\n\t\t\t\tvalcopy, err := item.ValueCopy(nil)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\trequire.Equal(t, val(i), valcopy)\n\t\t\t\ti++\n\t\t\t}\n\t\t\trequire.Equal(t, N, i)\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\topt := getTestOptions(\"\")\n\t\t// Set value threshold to 32 bytes otherwise write batch will generate\n\t\t// too many files and we will crash with too many files open error.\n\t\topt.ValueThreshold = 32\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\ttest(t, db)\n\t\t})\n\t\tt.Logf(\"Disk mode done\\n\")\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) 
{\n\t\tt.Skipf(\"TODO(ibrahim): Please fix this\")\n\t\topt := getTestOptions(\"\")\n\t\topt.InMemory = true\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\ttest(t, db)\n\t\tt.Logf(\"InMemory mode done\\n\")\n\t\trequire.NoError(t, db.Close())\n\t})\n}\n\n// This test ensures we don't end up in deadlock in case of empty writebatch.\nfunc TestEmptyWriteBatch(t *testing.T) {\n\tt.Run(\"normal mode\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\twb := db.NewWriteBatch()\n\t\t\trequire.NoError(t, wb.Flush())\n\t\t\twb = db.NewWriteBatch()\n\t\t\trequire.NoError(t, wb.Flush())\n\t\t\twb = db.NewWriteBatch()\n\t\t\t// Flush commits inner txn and sets a new one instead.\n\t\t\t// Thus we need to save it to check if it was discarded.\n\t\t\ttxn := wb.txn\n\t\t\trequire.NoError(t, wb.Flush())\n\t\t\t// check that flushed txn was discarded and marked as read.\n\t\t\trequire.True(t, txn.discarded)\n\t\t})\n\t})\n\tt.Run(\"managed mode\", func(t *testing.T) {\n\t\topt := getTestOptions(\"\")\n\t\topt.managedTxns = true\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\tt.Run(\"WriteBatchAt\", func(t *testing.T) {\n\t\t\t\twb := db.NewWriteBatchAt(2)\n\t\t\t\trequire.NoError(t, wb.Flush())\n\t\t\t\twb = db.NewWriteBatchAt(208)\n\t\t\t\trequire.NoError(t, wb.Flush())\n\t\t\t\twb = db.NewWriteBatchAt(31)\n\t\t\t\trequire.NoError(t, wb.Flush())\n\t\t\t})\n\t\t\tt.Run(\"ManagedWriteBatch\", func(t *testing.T) {\n\t\t\t\twb := db.NewManagedWriteBatch()\n\t\t\t\trequire.NoError(t, wb.Flush())\n\t\t\t\twb = db.NewManagedWriteBatch()\n\t\t\t\trequire.NoError(t, wb.Flush())\n\t\t\t\twb = db.NewManagedWriteBatch()\n\t\t\t\trequire.NoError(t, wb.Flush())\n\t\t\t})\n\t\t})\n\t})\n}\n\n// This test ensures we don't panic during flush.\n// See issue: https://github.com/dgraph-io/badger/issues/1394\nfunc TestFlushPanic(t *testing.T) {\n\tt.Run(\"flush after flush\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, 
db *DB) {\n\t\t\twb := db.NewWriteBatch()\n\t\t\trequire.NoError(t, wb.Flush())\n\t\t\t// A second Flush must fail with ErrCommitAfterFinish instead of panicking.\n\t\t\trequire.ErrorIs(t, wb.Flush(), y.ErrCommitAfterFinish)\n\t\t})\n\t})\n\tt.Run(\"flush after cancel\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\twb := db.NewWriteBatch()\n\t\t\twb.Cancel()\n\t\t\t// Flush after Cancel must fail with ErrCommitAfterFinish instead of panicking.\n\t\t\trequire.ErrorIs(t, wb.Flush(), y.ErrCommitAfterFinish)\n\t\t})\n\t})\n}\n\nfunc TestBatchErrDeadlock(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topt := DefaultOptions(dir)\n\tdb, err := OpenManaged(opt)\n\trequire.NoError(t, err)\n\n\twb := db.NewManagedWriteBatch()\n\trequire.NoError(t, wb.SetEntryAt(&Entry{Key: []byte(\"foo\")}, 0))\n\trequire.Error(t, wb.Flush())\n\trequire.NoError(t, db.Close())\n}\n"
  },
  {
    "path": "changes.sh",
    "content": "#!/bin/bash\n\nset -e\nGHORG=${GHORG:-dgraph-io}\nGHREPO=${GHREPO:-badger}\ncat <<EOF\nThis description was generated using this script:\n\\`\\`\\`sh\n$(cat \"$0\")\n\\`\\`\\`\nInvoked as:\n\n    $(echo GHORG=\"${GHORG}\" GHREPO=\"${GHREPO}\" $(basename \"$0\") ${@:1})\n\nEOF\ngit log --oneline --reverse ${@:1} |\n\tsed -E \"s/^(\\S{7}\\s)//g\" |\n\tsed -E \"s/([\\s|\\(| ])#([0-9]+)/\\1${GHORG}\\/${GHREPO}#\\2/g\"\n"
  },
  {
    "path": "compaction.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n\t\"log\"\n\t\"math\"\n\t\"sync\"\n\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\ntype keyRange struct {\n\tleft  []byte\n\tright []byte\n\tinf   bool\n\tsize  int64 // size is used for Key splits.\n}\n\nfunc (r keyRange) isEmpty() bool {\n\treturn len(r.left) == 0 && len(r.right) == 0 && !r.inf\n}\n\nvar infRange = keyRange{inf: true}\n\nfunc (r keyRange) String() string {\n\treturn fmt.Sprintf(\"[left=%x, right=%x, inf=%v]\", r.left, r.right, r.inf)\n}\n\nfunc (r keyRange) equals(dst keyRange) bool {\n\treturn bytes.Equal(r.left, dst.left) &&\n\t\tbytes.Equal(r.right, dst.right) &&\n\t\tr.inf == dst.inf\n}\n\nfunc (r *keyRange) extend(kr keyRange) {\n\t// TODO(ibrahim): Is this needed?\n\tif kr.isEmpty() {\n\t\treturn\n\t}\n\tif r.isEmpty() {\n\t\t*r = kr\n\t}\n\tif len(r.left) == 0 || y.CompareKeys(kr.left, r.left) < 0 {\n\t\tr.left = kr.left\n\t}\n\tif len(r.right) == 0 || y.CompareKeys(kr.right, r.right) > 0 {\n\t\tr.right = kr.right\n\t}\n\tif kr.inf {\n\t\tr.inf = true\n\t}\n}\n\nfunc (r keyRange) overlapsWith(dst keyRange) bool {\n\t// Empty keyRange always overlaps.\n\tif r.isEmpty() {\n\t\treturn true\n\t}\n\t// TODO(ibrahim): Do you need this?\n\t// Empty dst doesn't overlap with anything.\n\tif dst.isEmpty() {\n\t\treturn false\n\t}\n\tif r.inf || dst.inf {\n\t\treturn true\n\t}\n\n\t// [dst.left, dst.right] ... [r.left, r.right]\n\t// If my left is greater than dst right, we have no overlap.\n\tif y.CompareKeys(r.left, dst.right) > 0 {\n\t\treturn false\n\t}\n\t// [r.left, r.right] ... 
[dst.left, dst.right]\n\t// If my right is less than dst left, we have no overlap.\n\tif y.CompareKeys(r.right, dst.left) < 0 {\n\t\treturn false\n\t}\n\t// We have overlap.\n\treturn true\n}\n\n// getKeyRange returns the smallest and the biggest in the list of tables.\n// TODO(naman): Write a test for this. The smallest and the biggest should\n// be the smallest of the leftmost table and the biggest of the right most table.\nfunc getKeyRange(tables ...*table.Table) keyRange {\n\tif len(tables) == 0 {\n\t\treturn keyRange{}\n\t}\n\tsmallest := tables[0].Smallest()\n\tbiggest := tables[0].Biggest()\n\tfor i := 1; i < len(tables); i++ {\n\t\tif y.CompareKeys(tables[i].Smallest(), smallest) < 0 {\n\t\t\tsmallest = tables[i].Smallest()\n\t\t}\n\t\tif y.CompareKeys(tables[i].Biggest(), biggest) > 0 {\n\t\t\tbiggest = tables[i].Biggest()\n\t\t}\n\t}\n\n\t// We pick all the versions of the smallest and the biggest key. Note that version zero would\n\t// be the rightmost key, considering versions are default sorted in descending order.\n\treturn keyRange{\n\t\tleft:  y.KeyWithTs(y.ParseKey(smallest), math.MaxUint64),\n\t\tright: y.KeyWithTs(y.ParseKey(biggest), 0),\n\t}\n}\n\ntype levelCompactStatus struct {\n\tranges  []keyRange\n\tdelSize int64\n}\n\nfunc (lcs *levelCompactStatus) debug() string {\n\tvar b bytes.Buffer\n\tfor _, r := range lcs.ranges {\n\t\tb.WriteString(r.String())\n\t}\n\treturn b.String()\n}\n\nfunc (lcs *levelCompactStatus) overlapsWith(dst keyRange) bool {\n\tfor _, r := range lcs.ranges {\n\t\tif r.overlapsWith(dst) {\n\t\t\treturn true\n\t\t}\n\t}\n\treturn false\n}\n\nfunc (lcs *levelCompactStatus) remove(dst keyRange) bool {\n\tfinal := lcs.ranges[:0]\n\tvar found bool\n\tfor _, r := range lcs.ranges {\n\t\tif !r.equals(dst) {\n\t\t\tfinal = append(final, r)\n\t\t} else {\n\t\t\tfound = true\n\t\t}\n\t}\n\tlcs.ranges = final\n\treturn found\n}\n\ntype compactStatus struct {\n\tsync.RWMutex\n\tlevels []*levelCompactStatus\n\ttables 
map[uint64]struct{}\n}\n\nfunc (cs *compactStatus) overlapsWith(level int, this keyRange) bool {\n\tcs.RLock()\n\tdefer cs.RUnlock()\n\n\tthisLevel := cs.levels[level]\n\treturn thisLevel.overlapsWith(this)\n}\n\nfunc (cs *compactStatus) delSize(l int) int64 {\n\tcs.RLock()\n\tdefer cs.RUnlock()\n\treturn cs.levels[l].delSize\n}\n\ntype thisAndNextLevelRLocked struct{}\n\n// compareAndAdd will check whether we can run this compactDef. That it doesn't overlap with any\n// other running compaction. If it can be run, it would store this run in the compactStatus state.\nfunc (cs *compactStatus) compareAndAdd(_ thisAndNextLevelRLocked, cd compactDef) bool {\n\tcs.Lock()\n\tdefer cs.Unlock()\n\n\ttl := cd.thisLevel.level\n\ty.AssertTruef(tl < len(cs.levels), \"Got level %d. Max levels: %d\", tl, len(cs.levels))\n\tthisLevel := cs.levels[cd.thisLevel.level]\n\tnextLevel := cs.levels[cd.nextLevel.level]\n\n\tif thisLevel.overlapsWith(cd.thisRange) {\n\t\treturn false\n\t}\n\tif nextLevel.overlapsWith(cd.nextRange) {\n\t\treturn false\n\t}\n\t// Check whether this level really needs compaction or not. Otherwise, we'll end up\n\t// running parallel compactions for the same level.\n\t// Update: We should not be checking size here. Compaction priority already did the size checks.\n\t// Here we should just be executing the wish of others.\n\n\tthisLevel.ranges = append(thisLevel.ranges, cd.thisRange)\n\tnextLevel.ranges = append(nextLevel.ranges, cd.nextRange)\n\tthisLevel.delSize += cd.thisSize\n\tfor _, t := range append(cd.top, cd.bot...) {\n\t\tcs.tables[t.ID()] = struct{}{}\n\t}\n\treturn true\n}\n\nfunc (cs *compactStatus) delete(cd compactDef) {\n\tcs.Lock()\n\tdefer cs.Unlock()\n\n\ttl := cd.thisLevel.level\n\ty.AssertTruef(tl < len(cs.levels), \"Got level %d. 
Max levels: %d\", tl, len(cs.levels))\n\n\tthisLevel := cs.levels[cd.thisLevel.level]\n\tnextLevel := cs.levels[cd.nextLevel.level]\n\n\tthisLevel.delSize -= cd.thisSize\n\tfound := thisLevel.remove(cd.thisRange)\n\t// The following check makes sense only if we're compacting more than one\n\t// table. In case of the max level, we might rewrite a single table to\n\t// remove stale data.\n\tif cd.thisLevel != cd.nextLevel && !cd.nextRange.isEmpty() {\n\t\tfound = nextLevel.remove(cd.nextRange) && found\n\t}\n\n\tif !found {\n\t\tthis := cd.thisRange\n\t\tnext := cd.nextRange\n\t\tfmt.Printf(\"Looking for: %s in this level %d.\\n\", this, tl)\n\t\tfmt.Printf(\"This Level:\\n%s\\n\", thisLevel.debug())\n\t\tfmt.Println()\n\t\tfmt.Printf(\"Looking for: %s in next level %d.\\n\", next, cd.nextLevel.level)\n\t\tfmt.Printf(\"Next Level:\\n%s\\n\", nextLevel.debug())\n\t\tlog.Fatal(\"keyRange not found\")\n\t}\n\tfor _, t := range append(cd.top, cd.bot...) {\n\t\t_, ok := cs.tables[t.ID()]\n\t\ty.AssertTrue(ok)\n\t\tdelete(cs.tables, t.ID())\n\t}\n}\n"
  },
  {
    "path": "contrib/RELEASE.md",
    "content": "# Badger Release Process\n\nThis document outlines the steps needed to build and push a new release of Badger.\n\n1. Have a team member \"at-the-ready\" with github `writer` access (you'll need them to approve PRs).\n1. Create a new branch (prepare-for-release-vXX.X.X, for instance).\n1. Update dependencies in `go.mod` for Ristretto, if required.\n1. Update the CHANGELOG.md. Opus 4.5 does a great job of doing this. Example prompt:\n   `I'm releasing vXX.X.X off the main branch, add a new entry for this release. Conform to the`\n   `\"Keep a Changelog\" format, use past entries as a formatting guide. Run the trunk linter.`\n1. Validate the version does not have storage incompatibilities with the previous version. If so,\n   add a warning to the CHANGELOG.md that export/import of data will need to be run as part of the\n   upgrade process.\n1. Commit and push your changes. Create a PR and have a team member approve it.\n1. Once your \"prepare for release branch\" is merged into main, on the github\n   [releases page](https://github.com/dgraph-io/badger/releases), create a new draft release.\n1. Start the deployment workflow from the\n   [CD workflow page](https://github.com/dgraph-io/badger/actions/workflows/cd-badger.yml).\n\n   The CD workflow handles the building and copying of release artifacts to the releases area.\n\n1. For all major and minor releases (non-patches), create a release branch. In order to easily\n   backport fixes to the release branch, create a release branch from the tag head. For instance, if\n   we're releasing v4.9.0, create a branch called `release/v4.9` from the tag head (ensure you're on\n   the main branch from which you created the tag):\n\n   ```sh\n   git checkout main\n   git pull origin main\n   git checkout -b release/v4.9\n   git push origin release/v4.9\n   ```\n\n1. 
Refresh the \"go index\" cache to ensure that the latest release is available to the public with:\n\n   ```sh\n   go list -m github.com/dgraph-io/badger/v4@vX.X.X\n   ```\n\n1. If needed, create a new announcement thread in the\n   [Discussions](https://github.com/orgs/dgraph-io/discussions) forum for the release.\n"
  },
  {
    "path": "db.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"context\"\n\t\"encoding/binary\"\n\t\"errors\"\n\t\"expvar\"\n\t\"fmt\"\n\t\"math\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"sort\"\n\t\"strings\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"time\"\n\n\thumanize \"github.com/dustin/go-humanize\"\n\n\t\"github.com/dgraph-io/badger/v4/fb\"\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/skl\"\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nvar (\n\tbadgerPrefix = []byte(\"!badger!\")       // Prefix for internal keys used by badger.\n\ttxnKey       = []byte(\"!badger!txn\")    // For indicating end of entries in txn.\n\tbannedNsKey  = []byte(\"!badger!banned\") // For storing the banned namespaces.\n)\n\ntype closers struct {\n\tupdateSize  *z.Closer\n\tcompactors  *z.Closer\n\tmemtable    *z.Closer\n\twrites      *z.Closer\n\tvalueGC     *z.Closer\n\tpub         *z.Closer\n\tcacheHealth *z.Closer\n}\n\ntype lockedKeys struct {\n\tsync.RWMutex\n\tkeys map[uint64]struct{}\n}\n\nfunc (lk *lockedKeys) add(key uint64) {\n\tlk.Lock()\n\tdefer lk.Unlock()\n\tlk.keys[key] = struct{}{}\n}\n\nfunc (lk *lockedKeys) has(key uint64) bool {\n\tlk.RLock()\n\tdefer lk.RUnlock()\n\t_, ok := lk.keys[key]\n\treturn ok\n}\n\nfunc (lk *lockedKeys) all() []uint64 {\n\tlk.RLock()\n\tdefer lk.RUnlock()\n\tkeys := make([]uint64, 0, len(lk.keys))\n\tfor key := range lk.keys {\n\t\tkeys = append(keys, key)\n\t}\n\treturn keys\n}\n\n// DB provides the various functions required to interact with Badger.\n// DB is thread-safe.\ntype DB struct {\n\ttestOnlyDBExtensions\n\n\tlock sync.RWMutex // Guards list of inmemory tables, not individual reads and writes.\n\n\tdirLockGuard 
*directoryLockGuard\n\t// nil if Dir and ValueDir are the same\n\tvalueDirGuard *directoryLockGuard\n\n\tclosers closers\n\n\tmt  *memTable   // Our latest (actively written) in-memory table\n\timm []*memTable // Add here only AFTER pushing to flushChan.\n\n\t// Initialized via openMemTables.\n\tnextMemFid int\n\n\topt       Options\n\tmanifest  *manifestFile\n\tlc        *levelsController\n\tvlog      valueLog\n\twriteCh   chan *request\n\tflushChan chan *memTable // For flushing memtables.\n\tcloseOnce sync.Once      // For closing DB only once.\n\n\tblockWrites atomic.Int32\n\tisClosed    atomic.Uint32\n\n\torc              *oracle\n\tbannedNamespaces *lockedKeys\n\tthreshold        *vlogThreshold\n\n\tpub        *publisher\n\tregistry   *KeyRegistry\n\tblockCache *ristretto.Cache[[]byte, *table.Block]\n\tindexCache *ristretto.Cache[uint64, *fb.TableIndex]\n\tallocPool  *z.AllocatorPool\n}\n\nconst (\n\tkvWriteChCapacity = 1000\n)\n\nfunc checkAndSetOptions(opt *Options) error {\n\t// It's okay to have zero compactors which will disable all compactions but\n\t// we cannot have just one compactor otherwise we will end up with all data\n\t// on level 2.\n\tif opt.NumCompactors == 1 {\n\t\treturn errors.New(\"Cannot have 1 compactor. 
Need at least 2\")\n\t}\n\n\tif opt.InMemory && (opt.Dir != \"\" || opt.ValueDir != \"\") {\n\t\treturn errors.New(\"Cannot use badger in Disk-less mode with Dir or ValueDir set\")\n\t}\n\topt.maxBatchSize = (15 * opt.MemTableSize) / 100\n\topt.maxBatchCount = opt.maxBatchSize / int64(skl.MaxNodeSize)\n\n\t// This is the maximum value, vlogThreshold can have if dynamic thresholding is enabled.\n\topt.maxValueThreshold = math.Min(maxValueThreshold, float64(opt.maxBatchSize))\n\tif opt.VLogPercentile < 0.0 || opt.VLogPercentile > 1.0 {\n\t\treturn errors.New(\"vlogPercentile must be within range of 0.0-1.0\")\n\t}\n\n\t// We are limiting opt.ValueThreshold to maxValueThreshold for now.\n\tif opt.ValueThreshold > maxValueThreshold {\n\t\treturn fmt.Errorf(\"Invalid ValueThreshold, must be less or equal to %d\",\n\t\t\tmaxValueThreshold)\n\t}\n\n\t// If ValueThreshold is greater than opt.maxBatchSize, we won't be able to push any data using\n\t// the transaction APIs. Transaction batches entries into batches of size opt.maxBatchSize.\n\tif opt.ValueThreshold > opt.maxBatchSize {\n\t\treturn fmt.Errorf(\"Valuethreshold %d greater than max batch size of %d. 
Either \"+\n\t\t\t\"reduce opt.ValueThreshold or increase opt.BaseTableSize.\",\n\t\t\topt.ValueThreshold, opt.maxBatchSize)\n\t}\n\t// ValueLogFileSize should be strictly LESS than 2<<30 otherwise we will\n\t// overflow the uint32 when we mmap it in OpenMemtable.\n\tif !(opt.ValueLogFileSize < 2<<30 && opt.ValueLogFileSize >= 1<<20) {\n\t\treturn ErrValueLogSize\n\t}\n\n\tif opt.ReadOnly {\n\t\t// Do not perform compaction in read only mode.\n\t\topt.CompactL0OnClose = false\n\t}\n\n\tneedCache := (opt.Compression != options.None) || (len(opt.EncryptionKey) > 0)\n\tif needCache && opt.BlockCacheSize == 0 {\n\t\tpanic(\"BlockCacheSize should be set since compression/encryption are enabled\")\n\t}\n\treturn nil\n}\n\n// Open returns a new DB object.\nfunc Open(opt Options) (*DB, error) {\n\tif err := checkAndSetOptions(&opt); err != nil {\n\t\treturn nil, err\n\t}\n\tvar dirLockGuard, valueDirLockGuard *directoryLockGuard\n\n\t// Create directories and acquire lock on it only if badger is not running in InMemory mode.\n\t// We don't have any directories/files in InMemory mode so we don't need to acquire\n\t// any locks on them.\n\tif !opt.InMemory {\n\t\tif err := createDirs(opt); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tvar err error\n\t\tif !opt.BypassLockGuard {\n\t\t\tdirLockGuard, err = acquireDirectoryLock(opt.Dir, lockFile, opt.ReadOnly)\n\t\t\tif err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\tdefer func() {\n\t\t\t\tif dirLockGuard != nil {\n\t\t\t\t\t_ = dirLockGuard.release()\n\t\t\t\t}\n\t\t\t}()\n\t\t\tabsDir, err := filepath.Abs(opt.Dir)\n\t\t\tif err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\tabsValueDir, err := filepath.Abs(opt.ValueDir)\n\t\t\tif err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\tif absValueDir != absDir {\n\t\t\t\tvalueDirLockGuard, err = acquireDirectoryLock(opt.ValueDir, lockFile, opt.ReadOnly)\n\t\t\t\tif err != nil {\n\t\t\t\t\treturn nil, err\n\t\t\t\t}\n\t\t\t\tdefer func() {\n\t\t\t\t\tif 
valueDirLockGuard != nil {\n\t\t\t\t\t\t_ = valueDirLockGuard.release()\n\t\t\t\t\t}\n\t\t\t\t}()\n\t\t\t}\n\t\t}\n\t}\n\n\tmanifestFile, manifest, err := openOrCreateManifestFile(opt)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tdefer func() {\n\t\tif manifestFile != nil {\n\t\t\t_ = manifestFile.close()\n\t\t}\n\t}()\n\n\tdb := &DB{\n\t\timm:              make([]*memTable, 0, opt.NumMemtables),\n\t\tflushChan:        make(chan *memTable, opt.NumMemtables),\n\t\twriteCh:          make(chan *request, kvWriteChCapacity),\n\t\topt:              opt,\n\t\tmanifest:         manifestFile,\n\t\tdirLockGuard:     dirLockGuard,\n\t\tvalueDirGuard:    valueDirLockGuard,\n\t\torc:              newOracle(opt),\n\t\tpub:              newPublisher(),\n\t\tallocPool:        z.NewAllocatorPool(8),\n\t\tbannedNamespaces: &lockedKeys{keys: make(map[uint64]struct{})},\n\t\tthreshold:        initVlogThreshold(&opt),\n\t}\n\n\tdb.syncChan = opt.syncChan\n\n\t// Cleanup all the goroutines started by badger in case of an error.\n\tdefer func() {\n\t\tif err != nil {\n\t\t\topt.Errorf(\"Received err: %v. 
Cleaning up...\", err)\n\t\t\tdb.cleanup()\n\t\t\tdb = nil\n\t\t}\n\t}()\n\n\tif opt.BlockCacheSize > 0 {\n\t\tnumInCache := opt.BlockCacheSize / int64(opt.BlockSize)\n\t\tif numInCache == 0 {\n\t\t\t// Make the value of this variable at least one since the cache requires\n\t\t\t// the number of counters to be greater than zero.\n\t\t\tnumInCache = 1\n\t\t}\n\n\t\tconfig := ristretto.Config[[]byte, *table.Block]{\n\t\t\tNumCounters: numInCache * 8,\n\t\t\tMaxCost:     opt.BlockCacheSize,\n\t\t\tBufferItems: 64,\n\t\t\tMetrics:     true,\n\t\t\tOnExit:      table.BlockEvictHandler,\n\t\t}\n\t\tdb.blockCache, err = ristretto.NewCache[[]byte, *table.Block](&config)\n\t\tif err != nil {\n\t\t\treturn nil, y.Wrap(err, \"failed to create data cache\")\n\t\t}\n\t}\n\n\tif opt.IndexCacheSize > 0 {\n\t\t// Index size is around 5% of the table size.\n\t\tindexSz := int64(float64(opt.MemTableSize) * 0.05)\n\t\tnumInCache := opt.IndexCacheSize / indexSz\n\t\tif numInCache == 0 {\n\t\t\t// Make the value of this variable at least one since the cache requires\n\t\t\t// the number of counters to be greater than zero.\n\t\t\tnumInCache = 1\n\t\t}\n\n\t\tconfig := ristretto.Config[uint64, *fb.TableIndex]{\n\t\t\tNumCounters: numInCache * 8,\n\t\t\tMaxCost:     opt.IndexCacheSize,\n\t\t\tBufferItems: 64,\n\t\t\tMetrics:     true,\n\t\t}\n\t\tdb.indexCache, err = ristretto.NewCache(&config)\n\t\tif err != nil {\n\t\t\treturn nil, y.Wrap(err, \"failed to create bf cache\")\n\t\t}\n\t}\n\n\tdb.closers.cacheHealth = z.NewCloser(1)\n\tgo db.monitorCache(db.closers.cacheHealth)\n\n\tif db.opt.InMemory {\n\t\tdb.opt.SyncWrites = false\n\t\t// If badger is running in memory mode, push everything into the LSM Tree.\n\t\tdb.opt.ValueThreshold = math.MaxInt32\n\t}\n\tkrOpt := KeyRegistryOptions{\n\t\tReadOnly:                      opt.ReadOnly,\n\t\tDir:                           opt.Dir,\n\t\tEncryptionKey:                 opt.EncryptionKey,\n\t\tEncryptionKeyRotationDuration: 
opt.EncryptionKeyRotationDuration,\n\t\tInMemory:                      opt.InMemory,\n\t}\n\n\tif db.registry, err = OpenKeyRegistry(krOpt); err != nil {\n\t\treturn db, err\n\t}\n\tdb.calculateSize()\n\tdb.closers.updateSize = z.NewCloser(1)\n\tgo db.updateSize(db.closers.updateSize)\n\n\tif err := db.openMemTables(db.opt); err != nil {\n\t\treturn nil, y.Wrapf(err, \"while opening memtables\")\n\t}\n\n\tif !db.opt.ReadOnly {\n\t\tif db.mt, err = db.newMemTable(); err != nil {\n\t\t\treturn nil, y.Wrapf(err, \"cannot create memtable\")\n\t\t}\n\t}\n\n\t// newLevelsController potentially loads files in directory.\n\tif db.lc, err = newLevelsController(db, &manifest); err != nil {\n\t\treturn db, err\n\t}\n\n\t// Initialize vlog struct.\n\tdb.vlog.init(db)\n\n\tif !opt.ReadOnly {\n\t\tdb.closers.compactors = z.NewCloser(1)\n\t\tdb.lc.startCompact(db.closers.compactors)\n\n\t\tdb.closers.memtable = z.NewCloser(1)\n\t\tgo func() {\n\t\t\tdb.flushMemtable(db.closers.memtable) // Need levels controller to be up.\n\t\t}()\n\t\t// Flush them to disk asap.\n\t\tfor _, mt := range db.imm {\n\t\t\tdb.flushChan <- mt\n\t\t}\n\t}\n\t// We do increment nextTxnTs below. 
So, no need to do it here.\n\tdb.orc.nextTxnTs = db.MaxVersion()\n\tdb.opt.Infof(\"Set nextTxnTs to %d\", db.orc.nextTxnTs)\n\n\tif err = db.vlog.open(db); err != nil {\n\t\treturn db, y.Wrapf(err, \"During db.vlog.open\")\n\t}\n\n\t// Let's advance nextTxnTs to one more than whatever we observed via\n\t// replaying the logs.\n\tdb.orc.txnMark.Done(db.orc.nextTxnTs)\n\t// In normal mode, we must update readMark so older versions of keys can be removed during\n\t// compaction when run in offline mode via the flatten tool.\n\tdb.orc.readMark.Done(db.orc.nextTxnTs)\n\tdb.orc.incrementNextTs()\n\n\tgo db.threshold.listenForValueThresholdUpdate()\n\n\tif err := db.initBannedNamespaces(); err != nil {\n\t\treturn db, fmt.Errorf(\"While setting banned keys: %w\", err)\n\t}\n\n\tdb.closers.writes = z.NewCloser(1)\n\tgo db.doWrites(db.closers.writes)\n\n\tif !db.opt.InMemory {\n\t\tdb.closers.valueGC = z.NewCloser(1)\n\t\tgo db.vlog.waitOnGC(db.closers.valueGC)\n\t}\n\n\tdb.closers.pub = z.NewCloser(1)\n\tgo db.pub.listenForUpdates(db.closers.pub)\n\n\tvalueDirLockGuard = nil\n\tdirLockGuard = nil\n\tmanifestFile = nil\n\treturn db, nil\n}\n\n// initBannedNamespaces retrieves the banned namespaces from the DB and updates in-memory structure.\nfunc (db *DB) initBannedNamespaces() error {\n\tif db.opt.NamespaceOffset < 0 {\n\t\treturn nil\n\t}\n\treturn db.View(func(txn *Txn) error {\n\t\tiopts := DefaultIteratorOptions\n\t\tiopts.Prefix = bannedNsKey\n\t\tiopts.PrefetchValues = false\n\t\tiopts.InternalAccess = true\n\t\titr := txn.NewIterator(iopts)\n\t\tdefer itr.Close()\n\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\tkey := y.BytesToU64(itr.Item().Key()[len(bannedNsKey):])\n\t\t\tdb.bannedNamespaces.add(key)\n\t\t}\n\t\treturn nil\n\t})\n}\n\nfunc (db *DB) MaxVersion() uint64 {\n\tvar maxVersion uint64\n\tupdate := func(a uint64) {\n\t\tif a > maxVersion {\n\t\t\tmaxVersion = a\n\t\t}\n\t}\n\tdb.lock.Lock()\n\t// In read only mode, we do not create new mem 
table.\n\tif !db.opt.ReadOnly {\n\t\tupdate(db.mt.maxVersion)\n\t}\n\tfor _, mt := range db.imm {\n\t\tupdate(mt.maxVersion)\n\t}\n\tdb.lock.Unlock()\n\tfor _, ti := range db.Tables() {\n\t\tupdate(ti.MaxVersion)\n\t}\n\treturn maxVersion\n}\n\nfunc (db *DB) monitorCache(c *z.Closer) {\n\tdefer c.Done()\n\tcount := 0\n\tanalyze := func(name string, metrics *ristretto.Metrics) {\n\t\t// If the mean life expectancy is less than 10 seconds, the cache\n\t\t// might be too small.\n\t\tle := metrics.LifeExpectancySeconds()\n\t\tif le == nil {\n\t\t\treturn\n\t\t}\n\t\tlifeTooShort := le.Count > 0 && float64(le.Sum)/float64(le.Count) < 10\n\t\thitRatioTooLow := metrics.Ratio() > 0 && metrics.Ratio() < 0.4\n\t\tif lifeTooShort && hitRatioTooLow {\n\t\t\tdb.opt.Warningf(\"%s might be too small. Metrics: %s\\n\", name, metrics)\n\t\t\tdb.opt.Warningf(\"Cache life expectancy (in seconds): %+v\\n\", le)\n\n\t\t} else if le.Count > 1000 && count%5 == 0 {\n\t\t\tdb.opt.Infof(\"%s metrics: %s\\n\", name, metrics)\n\t\t}\n\t}\n\n\tticker := time.NewTicker(1 * time.Minute)\n\tdefer ticker.Stop()\n\tfor {\n\t\tselect {\n\t\tcase <-c.HasBeenClosed():\n\t\t\treturn\n\t\tcase <-ticker.C:\n\t\t}\n\n\t\tanalyze(\"Block cache\", db.BlockCacheMetrics())\n\t\tanalyze(\"Index cache\", db.IndexCacheMetrics())\n\t\tcount++\n\t}\n}\n\n// cleanup stops all the goroutines started by badger. This is used in open to\n// cleanup goroutines in case of an error.\nfunc (db *DB) cleanup() {\n\tdb.stopMemoryFlush()\n\tdb.stopCompactions()\n\n\tdb.blockCache.Close()\n\tdb.indexCache.Close()\n\tif db.closers.updateSize != nil {\n\t\tdb.closers.updateSize.Signal()\n\t}\n\tif db.closers.valueGC != nil {\n\t\tdb.closers.valueGC.Signal()\n\t}\n\tif db.closers.writes != nil {\n\t\tdb.closers.writes.Signal()\n\t}\n\tif db.closers.pub != nil {\n\t\tdb.closers.pub.Signal()\n\t}\n\n\tdb.orc.Stop()\n\n\t// Do not use vlog.Close() here. vlog.Close truncates the files. 
We don't\n\t// want to truncate files unless the user has specified the truncate flag.\n}\n\n// BlockCacheMetrics returns the metrics for the underlying block cache.\nfunc (db *DB) BlockCacheMetrics() *ristretto.Metrics {\n\tif db.blockCache != nil {\n\t\treturn db.blockCache.Metrics\n\t}\n\treturn nil\n}\n\n// IndexCacheMetrics returns the metrics for the underlying index cache.\nfunc (db *DB) IndexCacheMetrics() *ristretto.Metrics {\n\tif db.indexCache != nil {\n\t\treturn db.indexCache.Metrics\n\t}\n\treturn nil\n}\n\n// Close closes a DB. It's crucial to call it to ensure all the pending updates make their way to\n// disk. Calling DB.Close() multiple times would still only close the DB once.\nfunc (db *DB) Close() error {\n\tvar err error\n\tdb.closeOnce.Do(func() {\n\t\terr = db.close()\n\t})\n\treturn err\n}\n\n// IsClosed denotes if the badger DB is closed or not. A DB instance should not\n// be used after closing it.\nfunc (db *DB) IsClosed() bool {\n\treturn db.isClosed.Load() == 1\n}\n\nfunc (db *DB) close() (err error) {\n\tdefer db.allocPool.Release()\n\n\tdb.opt.Debugf(\"Closing database\")\n\tdb.opt.Infof(\"Lifetime L0 stalled for: %s\\n\", time.Duration(db.lc.l0stallsMs.Load()))\n\n\tdb.blockWrites.Store(1)\n\tdb.isClosed.Store(1)\n\n\tif !db.opt.InMemory {\n\t\t// Stop value GC first.\n\t\tdb.closers.valueGC.SignalAndWait()\n\t}\n\n\t// Stop writes next.\n\tdb.closers.writes.SignalAndWait()\n\n\t// Don't accept any more write.\n\tclose(db.writeCh)\n\n\tdb.closers.pub.SignalAndWait()\n\tdb.closers.cacheHealth.Signal()\n\n\t// Make sure that block writer is done pushing stuff into memtable!\n\t// Otherwise, you will have a race condition: we are trying to flush memtables\n\t// and remove them completely, while the block / memtable writer is still\n\t// trying to push stuff into the memtable. 
This will also resolve the value\n\t// offset problem: as we push into memtable, we update value offsets there.\n\tif db.mt != nil {\n\t\tif db.mt.sl.Empty() {\n\t\t\t// Remove the memtable if empty.\n\t\t\tdb.mt.DecrRef()\n\t\t} else {\n\t\t\tdb.opt.Debugf(\"Flushing memtable\")\n\t\t\tfor {\n\t\t\t\tpushedMemTable := func() bool {\n\t\t\t\t\tdb.lock.Lock()\n\t\t\t\t\tdefer db.lock.Unlock()\n\t\t\t\t\ty.AssertTrue(db.mt != nil)\n\t\t\t\t\tselect {\n\t\t\t\t\tcase db.flushChan <- db.mt:\n\t\t\t\t\t\tdb.imm = append(db.imm, db.mt) // Flusher will attempt to remove this from s.imm.\n\t\t\t\t\t\tdb.mt = nil                    // Will segfault if we try writing!\n\t\t\t\t\t\tdb.opt.Debugf(\"pushed to flush chan\\n\")\n\t\t\t\t\t\treturn true\n\t\t\t\t\tdefault:\n\t\t\t\t\t\t// If we fail to push, we need to unlock and wait for a short while.\n\t\t\t\t\t\t// The flushing operation needs to update s.imm. Otherwise, we have a\n\t\t\t\t\t\t// deadlock.\n\t\t\t\t\t\t// TODO: Think about how to do this more cleanly, maybe without any locks.\n\t\t\t\t\t}\n\t\t\t\t\treturn false\n\t\t\t\t}()\n\t\t\t\tif pushedMemTable {\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t\ttime.Sleep(10 * time.Millisecond)\n\t\t\t}\n\t\t}\n\t}\n\tdb.stopMemoryFlush()\n\tdb.stopCompactions()\n\n\t// Force Compact L0\n\t// We don't need to care about cstatus since no parallel compaction is running.\n\tif db.opt.CompactL0OnClose {\n\t\terr := db.lc.doCompact(173, compactionPriority{level: 0, score: 1.73})\n\t\tswitch err {\n\t\tcase errFillTables:\n\t\t\t// This error only means that there might be enough tables to do a compaction. 
So, we\n\t\t\t// should not report it to the end user to avoid confusing them.\n\t\tcase nil:\n\t\t\tdb.opt.Debugf(\"Force compaction on level 0 done\")\n\t\tdefault:\n\t\t\tdb.opt.Warningf(\"While forcing compaction on level 0: %v\", err)\n\t\t}\n\t}\n\n\t// Now close the value log.\n\tif vlogErr := db.vlog.Close(); vlogErr != nil {\n\t\terr = y.Wrap(vlogErr, \"DB.Close\")\n\t}\n\n\tdb.opt.Infof(db.LevelsToString())\n\tif lcErr := db.lc.close(); err == nil {\n\t\terr = y.Wrap(lcErr, \"DB.Close\")\n\t}\n\tdb.opt.Debugf(\"Waiting for closer\")\n\tdb.closers.updateSize.SignalAndWait()\n\tdb.orc.Stop()\n\tdb.blockCache.Close()\n\tdb.indexCache.Close()\n\n\tdb.threshold.close()\n\n\tif db.opt.InMemory {\n\t\treturn\n\t}\n\n\tif db.dirLockGuard != nil {\n\t\tif guardErr := db.dirLockGuard.release(); err == nil {\n\t\t\terr = y.Wrap(guardErr, \"DB.Close\")\n\t\t}\n\t}\n\tif db.valueDirGuard != nil {\n\t\tif guardErr := db.valueDirGuard.release(); err == nil {\n\t\t\terr = y.Wrap(guardErr, \"DB.Close\")\n\t\t}\n\t}\n\tif manifestErr := db.manifest.close(); err == nil {\n\t\terr = y.Wrap(manifestErr, \"DB.Close\")\n\t}\n\tif registryErr := db.registry.Close(); err == nil {\n\t\terr = y.Wrap(registryErr, \"DB.Close\")\n\t}\n\n\t// Fsync directories to ensure that lock file, and any other removed files whose directory\n\t// we haven't specifically fsynced, are guaranteed to have their directory entry removal\n\t// persisted to disk.\n\tif syncErr := db.syncDir(db.opt.Dir); err == nil {\n\t\terr = y.Wrap(syncErr, \"DB.Close\")\n\t}\n\tif syncErr := db.syncDir(db.opt.ValueDir); err == nil {\n\t\terr = y.Wrap(syncErr, \"DB.Close\")\n\t}\n\n\treturn err\n}\n\n// VerifyChecksum verifies checksum for all tables on all levels.\n// This method can be used to verify checksum, if opt.ChecksumVerificationMode is NoVerification.\nfunc (db *DB) VerifyChecksum() error {\n\treturn db.lc.verifyChecksum()\n}\n\nconst (\n\tlockFile = \"LOCK\"\n)\n\n// Sync syncs database content to disk. 
This function provides\n// more control to user to sync data whenever required.\nfunc (db *DB) Sync() error {\n\t/**\n\tMake an attempt to sync both the logs, the active memtable's WAL and the vLog (1847).\n\tCases:\n\t- All_ok\t\t\t:: If both the logs sync successfully.\n\n\t- Entry_Lost\t\t:: If an entry with a value pointer was present in the active memtable's WAL,\n\t\t\t\t\t\t:: and the WAL was synced but there was an error in syncing the vLog.\n\t\t\t\t\t\t:: The entry will be considered lost and this case will need to be handled during recovery.\n\n\t- Entries_Lost\t\t:: If there were errors in syncing both the logs, multiple entries would be lost.\n\n\t- Entries_Lost      :: If the active memtable's WAL is not synced but the vLog is synced, it will\n\t\t\t\t\t\t:: result in entries being lost because recovery of the active memtable is done from its WAL.\n\t\t\t\t\t\t:: Check `UpdateSkipList` in memtable.go.\n\n\t- Nothing_lost\t\t:: If an entry with its value was present in the active memtable's WAL, and the WAL was synced,\n\t\t\t\t\t\t:: but there was an error in syncing the vLog.\n\t\t\t\t\t\t:: Nothing is lost for this very specific entry because the entry is completely present in the memtable's WAL.\n\n\t- Partially_lost    :: If entries were written partially in either of the logs,\n\t\t\t\t\t\t:: the logs will be truncated during recovery.\n\t\t\t\t\t\t:: As a result of truncation, some entries might be lost.\n\t\t\t\t\t    :: Assume that 4KB of data is to be synced and invoking `Sync` results only in syncing 3KB\n\t                    :: of data and then the machine shuts down or the disk failure happens,\n\t\t\t\t\t\t:: this will result in partial writes. 
[[This case needs verification]]\n\t*/\n\tdb.lock.RLock()\n\tmemtableSyncError := db.mt.SyncWAL()\n\tdb.lock.RUnlock()\n\n\tvLogSyncError := db.vlog.sync()\n\treturn y.CombineErrors(memtableSyncError, vLogSyncError)\n}\n\n// getMemtables returns the current memtables and get references.\nfunc (db *DB) getMemTables() ([]*memTable, func()) {\n\tdb.lock.RLock()\n\tdefer db.lock.RUnlock()\n\n\tvar tables []*memTable\n\n\t// Mutable memtable does not exist in read-only mode.\n\tif !db.opt.ReadOnly {\n\t\t// Get mutable memtable.\n\t\ttables = append(tables, db.mt)\n\t\tdb.mt.IncrRef()\n\t}\n\n\t// Get immutable memtables.\n\tlast := len(db.imm) - 1\n\tfor i := range db.imm {\n\t\ttables = append(tables, db.imm[last-i])\n\t\tdb.imm[last-i].IncrRef()\n\t}\n\treturn tables, func() {\n\t\tfor _, tbl := range tables {\n\t\t\ttbl.DecrRef()\n\t\t}\n\t}\n}\n\n// get returns the value in memtable or disk for given key.\n// Note that value will include meta byte.\n//\n// IMPORTANT: We should never write an entry with an older timestamp for the same key, We need to\n// maintain this invariant to search for the latest value of a key, or else we need to search in all\n// tables and find the max version among them.  To maintain this invariant, we also need to ensure\n// that all versions of a key are always present in the same table from level 1, because compaction\n// can push any table down.\n//\n// Update(23/09/2020) - We have dropped the move key implementation. Earlier we\n// were inserting move keys to fix the invalid value pointers but we no longer\n// do that. For every get(\"fooX\") call where X is the version, we will search\n// for \"fooX\" in all the levels of the LSM tree. 
This is expensive but it\n// removes the overhead of handling move keys completely.\nfunc (db *DB) get(key []byte) (y.ValueStruct, error) {\n\tif db.IsClosed() {\n\t\treturn y.ValueStruct{}, ErrDBClosed\n\t}\n\ttables, decr := db.getMemTables() // Lock should be released.\n\tdefer decr()\n\n\tvar maxVs y.ValueStruct\n\tversion := y.ParseTs(key)\n\n\ty.NumGetsAdd(db.opt.MetricsEnabled, 1)\n\tfor i := 0; i < len(tables); i++ {\n\t\tvs := tables[i].sl.Get(key)\n\t\ty.NumMemtableGetsAdd(db.opt.MetricsEnabled, 1)\n\t\tif vs.Meta == 0 && vs.Value == nil {\n\t\t\tcontinue\n\t\t}\n\t\t// Found the required version of the key, return immediately.\n\t\tif vs.Version == version {\n\t\t\ty.NumGetsWithResultsAdd(db.opt.MetricsEnabled, 1)\n\t\t\treturn vs, nil\n\t\t}\n\t\tif maxVs.Version < vs.Version {\n\t\t\tmaxVs = vs\n\t\t}\n\t}\n\treturn db.lc.get(key, maxVs, 0)\n}\n\nvar requestPool = sync.Pool{\n\tNew: func() interface{} {\n\t\treturn new(request)\n\t},\n}\n\nfunc (db *DB) writeToLSM(b *request) error {\n\t// We should check the length of b.Prts and b.Entries only when badger is not\n\t// running in InMemory mode. In InMemory mode, we don't write anything to the\n\t// value log and that's why the length of b.Ptrs will always be zero.\n\tif !db.opt.InMemory && len(b.Ptrs) != len(b.Entries) {\n\t\treturn fmt.Errorf(\"Ptrs and Entries don't match: %+v\", b)\n\t}\n\n\tfor i, entry := range b.Entries {\n\t\tvar err error\n\t\tif entry.skipVlogAndSetThreshold(db.valueThreshold()) {\n\t\t\t// Will include deletion / tombstone case.\n\t\t\terr = db.mt.Put(entry.Key,\n\t\t\t\ty.ValueStruct{\n\t\t\t\t\tValue: entry.Value,\n\t\t\t\t\t// Ensure value pointer flag is removed. Otherwise, the value will fail\n\t\t\t\t\t// to be retrieved during iterator prefetch. 
`bitValuePointer` is only\n\t\t\t\t\t// known to be set in write to LSM when the entry is loaded from a backup\n\t\t\t\t\t// with lower ValueThreshold and its value was stored in the value log.\n\t\t\t\t\tMeta:      entry.meta &^ bitValuePointer,\n\t\t\t\t\tUserMeta:  entry.UserMeta,\n\t\t\t\t\tExpiresAt: entry.ExpiresAt,\n\t\t\t\t})\n\t\t} else {\n\t\t\t// Write pointer to Memtable.\n\t\t\terr = db.mt.Put(entry.Key,\n\t\t\t\ty.ValueStruct{\n\t\t\t\t\tValue:     b.Ptrs[i].Encode(),\n\t\t\t\t\tMeta:      entry.meta | bitValuePointer,\n\t\t\t\t\tUserMeta:  entry.UserMeta,\n\t\t\t\t\tExpiresAt: entry.ExpiresAt,\n\t\t\t\t})\n\t\t}\n\t\tif err != nil {\n\t\t\treturn y.Wrapf(err, \"while writing to memTable\")\n\t\t}\n\t}\n\tif db.opt.SyncWrites {\n\t\treturn db.mt.SyncWAL()\n\t}\n\treturn nil\n}\n\n// writeRequests is called serially by only one goroutine.\nfunc (db *DB) writeRequests(reqs []*request) error {\n\tif len(reqs) == 0 {\n\t\treturn nil\n\t}\n\n\tdone := func(err error) {\n\t\tfor _, r := range reqs {\n\t\t\tr.Err = err\n\t\t\tr.Wg.Done()\n\t\t}\n\t}\n\tdb.opt.Debugf(\"writeRequests called. 
Writing to value log\")\n\terr := db.vlog.write(reqs)\n\tif err != nil {\n\t\tdone(err)\n\t\treturn err\n\t}\n\n\tdb.opt.Debugf(\"Writing to memtable\")\n\tvar count int\n\tfor _, b := range reqs {\n\t\tif len(b.Entries) == 0 {\n\t\t\tcontinue\n\t\t}\n\t\tcount += len(b.Entries)\n\t\tvar i uint64\n\t\tvar err error\n\t\tfor err = db.ensureRoomForWrite(); err == errNoRoom; err = db.ensureRoomForWrite() {\n\t\t\ti++\n\t\t\tif i%100 == 0 {\n\t\t\t\tdb.opt.Debugf(\"Making room for writes\")\n\t\t\t}\n\t\t\t// We need to poll a bit because both hasRoomForWrite and the flusher need access to s.imm.\n\t\t\t// When flushChan is full and you are blocked there, and the flusher is trying to update s.imm,\n\t\t\t// you will get a deadlock.\n\t\t\ttime.Sleep(10 * time.Millisecond)\n\t\t}\n\t\tif err != nil {\n\t\t\tdone(err)\n\t\t\treturn y.Wrap(err, \"writeRequests\")\n\t\t}\n\t\tif err := db.writeToLSM(b); err != nil {\n\t\t\tdone(err)\n\t\t\treturn y.Wrap(err, \"writeRequests\")\n\t\t}\n\t}\n\n\tdb.opt.Debugf(\"Sending updates to subscribers\")\n\tdb.pub.sendUpdates(reqs)\n\n\tdone(nil)\n\tdb.opt.Debugf(\"%d entries written\", count)\n\treturn nil\n}\n\nfunc (db *DB) sendToWriteCh(entries []*Entry) (*request, error) {\n\tif db.blockWrites.Load() == 1 {\n\t\treturn nil, ErrBlockedWrites\n\t}\n\tvar count, size int64\n\tfor _, e := range entries {\n\t\tsize += e.estimateSizeAndSetThreshold(db.valueThreshold())\n\t\tcount++\n\t}\n\ty.NumBytesWrittenUserAdd(db.opt.MetricsEnabled, size)\n\tif count >= db.opt.maxBatchCount || size >= db.opt.maxBatchSize {\n\t\treturn nil, ErrTxnTooBig\n\t}\n\n\t// We can only service one request because we need each txn to be stored in a contiguous section.\n\t// Txns should not interleave among other txns or rewrites.\n\treq := requestPool.Get().(*request)\n\treq.reset()\n\treq.Entries = entries\n\treq.Wg.Add(1)\n\treq.IncrRef()     // for db write\n\tdb.writeCh <- req // Handled in doWrites.\n\ty.NumPutsAdd(db.opt.MetricsEnabled, 
int64(len(entries)))\n\n\treturn req, nil\n}\n\nfunc (db *DB) doWrites(lc *z.Closer) {\n\tdefer lc.Done()\n\tpendingCh := make(chan struct{}, 1)\n\n\twriteRequests := func(reqs []*request) {\n\t\tif err := db.writeRequests(reqs); err != nil {\n\t\t\tdb.opt.Errorf(\"writeRequests: %v\", err)\n\t\t}\n\t\t<-pendingCh\n\t}\n\n\t// This variable tracks the number of pending writes.\n\treqLen := new(expvar.Int)\n\ty.PendingWritesSet(db.opt.MetricsEnabled, db.opt.Dir, reqLen)\n\n\treqs := make([]*request, 0, 10)\n\tfor {\n\t\tvar r *request\n\t\tselect {\n\t\tcase r = <-db.writeCh:\n\t\tcase <-lc.HasBeenClosed():\n\t\t\tgoto closedCase\n\t\t}\n\n\t\tfor {\n\t\t\treqs = append(reqs, r)\n\t\t\treqLen.Set(int64(len(reqs)))\n\n\t\t\tif len(reqs) >= 3*kvWriteChCapacity {\n\t\t\t\tpendingCh <- struct{}{} // blocking.\n\t\t\t\tgoto writeCase\n\t\t\t}\n\n\t\t\tselect {\n\t\t\t// Either push to pending, or continue to pick from writeCh.\n\t\t\tcase r = <-db.writeCh:\n\t\t\tcase pendingCh <- struct{}{}:\n\t\t\t\tgoto writeCase\n\t\t\tcase <-lc.HasBeenClosed():\n\t\t\t\tgoto closedCase\n\t\t\t}\n\t\t}\n\n\tclosedCase:\n\t\t// Drain all the pending requests.\n\t\t// Don't close the writeCh, because it is used in several places.\n\t\tfor {\n\t\t\tselect {\n\t\t\tcase r = <-db.writeCh:\n\t\t\t\treqs = append(reqs, r)\n\t\t\tdefault:\n\t\t\t\tpendingCh <- struct{}{} // Push to pending before doing a write.\n\t\t\t\twriteRequests(reqs)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\n\twriteCase:\n\t\tgo writeRequests(reqs)\n\t\treqs = make([]*request, 0, 10)\n\t\treqLen.Set(0)\n\t}\n}\n\n// batchSet applies a list of badger.Entry. If a request level error occurs it\n// will be returned.\n//\n//\tCheck(kv.BatchSet(entries))\nfunc (db *DB) batchSet(entries []*Entry) error {\n\treq, err := db.sendToWriteCh(entries)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\treturn req.Wait()\n}\n\n// batchSetAsync is the asynchronous version of batchSet. 
It accepts a callback\n// function which is called when all the sets are complete. If a request level\n// error occurs, it will be passed back via the callback.\n//\n//\terr := kv.BatchSetAsync(entries, func(err error) {\n//\t   Check(err)\n//\t})\nfunc (db *DB) batchSetAsync(entries []*Entry, f func(error)) error {\n\treq, err := db.sendToWriteCh(entries)\n\tif err != nil {\n\t\treturn err\n\t}\n\tgo func() {\n\t\terr := req.Wait()\n\t\t// Write is complete. Let's call the callback function now.\n\t\tf(err)\n\t}()\n\treturn nil\n}\n\nvar errNoRoom = errors.New(\"No room for write\")\n\n// ensureRoomForWrite is always called serially.\nfunc (db *DB) ensureRoomForWrite() error {\n\tvar err error\n\tdb.lock.Lock()\n\tdefer db.lock.Unlock()\n\n\ty.AssertTrue(db.mt != nil) // A nil mt indicates that DB is being closed.\n\tif !db.mt.isFull() {\n\t\treturn nil\n\t}\n\n\tselect {\n\tcase db.flushChan <- db.mt:\n\t\tdb.opt.Debugf(\"Flushing memtable, mt.size=%d size of flushChan: %d\\n\",\n\t\t\tdb.mt.sl.MemSize(), len(db.flushChan))\n\t\t// We managed to push this task. Let's modify imm.\n\t\tdb.imm = append(db.imm, db.mt)\n\t\tdb.mt, err = db.newMemTable()\n\t\tif err != nil {\n\t\t\treturn y.Wrapf(err, \"cannot create new mem table\")\n\t\t}\n\t\t// New memtable is empty. 
We certainly have room.\n\t\treturn nil\n\tdefault:\n\t\t// We need to do this to unlock and allow the flusher to modify imm.\n\t\treturn errNoRoom\n\t}\n}\n\nfunc arenaSize(opt Options) int64 {\n\treturn opt.MemTableSize + opt.maxBatchSize + opt.maxBatchCount*int64(skl.MaxNodeSize)\n}\n\n// buildL0Table builds a new table from the memtable.\nfunc buildL0Table(iter y.Iterator, dropPrefixes [][]byte, bopts table.Options) *table.Builder {\n\tdefer iter.Close()\n\n\tb := table.NewTableBuilder(bopts)\n\tfor iter.Rewind(); iter.Valid(); iter.Next() {\n\t\tif len(dropPrefixes) > 0 && hasAnyPrefixes(iter.Key(), dropPrefixes) {\n\t\t\tcontinue\n\t\t}\n\t\tvs := iter.Value()\n\t\tvar vp valuePointer\n\t\tif vs.Meta&bitValuePointer > 0 {\n\t\t\tvp.Decode(vs.Value)\n\t\t}\n\t\tb.Add(iter.Key(), iter.Value(), vp.Len)\n\t}\n\n\treturn b\n}\n\n// handleMemTableFlush must be run serially.\nfunc (db *DB) handleMemTableFlush(mt *memTable, dropPrefixes [][]byte) error {\n\tbopts := buildTableOptions(db)\n\titr := mt.sl.NewUniIterator(false)\n\tbuilder := buildL0Table(itr, nil, bopts)\n\tdefer builder.Close()\n\n\t// buildL0Table can return nil if the none of the items in the skiplist are\n\t// added to the builder. This can happen when drop prefix is set and all\n\t// the items are skipped.\n\tif builder.Empty() {\n\t\tbuilder.Finish()\n\t\treturn nil\n\t}\n\n\tfileID := db.lc.reserveFileID()\n\tvar tbl *table.Table\n\tvar err error\n\tif db.opt.InMemory {\n\t\tdata := builder.Finish()\n\t\ttbl, err = table.OpenInMemoryTable(data, fileID, &bopts)\n\t} else {\n\t\ttbl, err = table.CreateTable(table.NewFilename(fileID, db.opt.Dir), builder)\n\t}\n\tif err != nil {\n\t\treturn y.Wrap(err, \"error while creating table\")\n\t}\n\t// We own a ref on tbl.\n\terr = db.lc.addLevel0Table(tbl) // This will incrRef\n\t_ = tbl.DecrRef()               // Releases our ref.\n\treturn err\n}\n\n// flushMemtable must keep running until we send it an empty memtable. 
If there\n// are errors during handling the memtable flush, we'll retry indefinitely.\nfunc (db *DB) flushMemtable(lc *z.Closer) {\n\tdefer lc.Done()\n\n\tfor mt := range db.flushChan {\n\t\tif mt == nil {\n\t\t\tcontinue\n\t\t}\n\n\t\tfor {\n\t\t\tif err := db.handleMemTableFlush(mt, nil); err != nil {\n\t\t\t\t// Encountered error. Retry indefinitely.\n\t\t\t\tdb.opt.Errorf(\"error flushing memtable to disk: %v, retrying\", err)\n\t\t\t\ttime.Sleep(time.Second)\n\t\t\t\tcontinue\n\t\t\t}\n\n\t\t\t// Update s.imm. Need a lock.\n\t\t\tdb.lock.Lock()\n\t\t\t// This is a single-threaded operation. mt corresponds to the head of\n\t\t\t// db.imm list. Once we flush it, we advance db.imm. The next mt\n\t\t\t// which would arrive here would match db.imm[0], because we acquire a\n\t\t\t// lock over DB when pushing to flushChan.\n\t\t\t// TODO: This logic is dirty AF. Any change and this could easily break.\n\t\t\ty.AssertTrue(mt == db.imm[0])\n\t\t\tdb.imm = db.imm[1:]\n\t\t\tmt.DecrRef() // Return memory.\n\t\t\t// unlock\n\t\t\tdb.lock.Unlock()\n\t\t\tbreak\n\t\t}\n\t}\n}\n\nfunc exists(path string) (bool, error) {\n\t_, err := os.Stat(path)\n\tif err == nil {\n\t\treturn true, nil\n\t}\n\tif os.IsNotExist(err) {\n\t\treturn false, nil\n\t}\n\treturn true, err\n}\n\n// This function does a filewalk, calculates the size of vlog and sst files and stores it in\n// y.LSMSize and y.VlogSize.\nfunc (db *DB) calculateSize() {\n\tif db.opt.InMemory {\n\t\treturn\n\t}\n\tnewInt := func(val int64) *expvar.Int {\n\t\tv := new(expvar.Int)\n\t\tv.Add(val)\n\t\treturn v\n\t}\n\n\ttotalSize := func(dir string) (int64, int64) {\n\t\tvar lsmSize, vlogSize int64\n\t\terr := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\text := filepath.Ext(path)\n\t\t\tswitch ext {\n\t\t\tcase \".sst\":\n\t\t\t\tlsmSize += info.Size()\n\t\t\tcase \".vlog\":\n\t\t\t\tvlogSize += info.Size()\n\t\t\t}\n\t\t\treturn 
nil\n\t\t})\n\t\tif err != nil {\n\t\t\tdb.opt.Debugf(\"Got error while calculating total size of directory: %s\", dir)\n\t\t}\n\t\treturn lsmSize, vlogSize\n\t}\n\n\tlsmSize, vlogSize := totalSize(db.opt.Dir)\n\ty.LSMSizeSet(db.opt.MetricsEnabled, db.opt.Dir, newInt(lsmSize))\n\t// If valueDir is different from dir, we'd have to do another walk.\n\tif db.opt.ValueDir != db.opt.Dir {\n\t\t_, vlogSize = totalSize(db.opt.ValueDir)\n\t}\n\ty.VlogSizeSet(db.opt.MetricsEnabled, db.opt.ValueDir, newInt(vlogSize))\n}\n\nfunc (db *DB) updateSize(lc *z.Closer) {\n\tdefer lc.Done()\n\tif db.opt.InMemory {\n\t\treturn\n\t}\n\n\tmetricsTicker := time.NewTicker(time.Minute)\n\tdefer metricsTicker.Stop()\n\n\tfor {\n\t\tselect {\n\t\tcase <-metricsTicker.C:\n\t\t\tdb.calculateSize()\n\t\tcase <-lc.HasBeenClosed():\n\t\t\treturn\n\t\t}\n\t}\n}\n\n// RunValueLogGC triggers a value log garbage collection.\n//\n// It picks value log files to perform GC based on statistics that are collected\n// during compactions.  If no such statistics are available, then log files are\n// picked in random order. The process stops as soon as the first log file is\n// encountered which does not result in garbage collection.\n//\n// When a log file is picked, it is first sampled. If the sample shows that we\n// can discard at least discardRatio space of that file, it would be rewritten.\n//\n// If a call to RunValueLogGC results in no rewrites, then an ErrNoRewrite is\n// thrown indicating that the call resulted in no file rewrites.\n//\n// We recommend setting discardRatio to 0.5, thus indicating that a file be\n// rewritten if half the space can be discarded.  This results in a lifetime\n// value log write amplification of 2 (1 from original write + 0.5 rewrite +\n// 0.25 + 0.125 + ... = 2). Setting it to higher value would result in fewer\n// space reclaims, while setting it to a lower value would result in more space\n// reclaims at the cost of increased activity on the LSM tree. 
discardRatio\n// must be in the range (0.0, 1.0), both endpoints excluded, otherwise an\n// ErrInvalidRequest is returned.\n//\n// Only one GC is allowed at a time. If another value log GC is running, or DB\n// has been closed, this would return an ErrRejected.\n//\n// Note: Every time GC is run, it would produce a spike of activity on the LSM\n// tree.\nfunc (db *DB) RunValueLogGC(discardRatio float64) error {\n\tif db.opt.InMemory {\n\t\treturn ErrGCInMemoryMode\n\t}\n\tif discardRatio >= 1.0 || discardRatio <= 0.0 {\n\t\treturn ErrInvalidRequest\n\t}\n\n\t// Pick a log file and run GC\n\treturn db.vlog.runGC(discardRatio)\n}\n\n// Size returns the size of lsm and value log files in bytes. It can be used to decide how often to\n// call RunValueLogGC.\nfunc (db *DB) Size() (lsm, vlog int64) {\n\tif y.LSMSizeGet(db.opt.MetricsEnabled, db.opt.Dir) == nil {\n\t\tlsm, vlog = 0, 0\n\t\treturn\n\t}\n\tlsm = y.LSMSizeGet(db.opt.MetricsEnabled, db.opt.Dir).(*expvar.Int).Value()\n\tvlog = y.VlogSizeGet(db.opt.MetricsEnabled, db.opt.ValueDir).(*expvar.Int).Value()\n\treturn\n}\n\n// Sequence represents a Badger sequence.\ntype Sequence struct {\n\tlock      sync.Mutex\n\tdb        *DB\n\tkey       []byte\n\tnext      uint64\n\tleased    uint64\n\tbandwidth uint64\n}\n\n// Next would return the next integer in the sequence, updating the lease by running a transaction\n// if needed.\nfunc (seq *Sequence) Next() (uint64, error) {\n\tseq.lock.Lock()\n\tdefer seq.lock.Unlock()\n\tif seq.next >= seq.leased {\n\t\tif err := seq.updateLease(); err != nil {\n\t\t\treturn 0, err\n\t\t}\n\t}\n\tval := seq.next\n\tseq.next++\n\treturn val, nil\n}\n\n// Release the leased sequence to avoid wasted integers. This should be done right\n// before closing the associated DB. 
However it is valid to use the sequence after\n// it was released, causing a new lease with full bandwidth.\nfunc (seq *Sequence) Release() error {\n\tseq.lock.Lock()\n\tdefer seq.lock.Unlock()\n\terr := seq.db.Update(func(txn *Txn) error {\n\t\titem, err := txn.Get(seq.key)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\n\t\tvar num uint64\n\t\tif err := item.Value(func(v []byte) error {\n\t\t\tnum = binary.BigEndian.Uint64(v)\n\t\t\treturn nil\n\t\t}); err != nil {\n\t\t\treturn err\n\t\t}\n\n\t\tif num == seq.leased {\n\t\t\tvar buf [8]byte\n\t\t\tbinary.BigEndian.PutUint64(buf[:], seq.next)\n\t\t\treturn txn.SetEntry(NewEntry(seq.key, buf[:]))\n\t\t}\n\n\t\treturn nil\n\t})\n\tif err != nil {\n\t\treturn err\n\t}\n\tseq.leased = seq.next\n\treturn nil\n}\n\nfunc (seq *Sequence) updateLease() error {\n\treturn seq.db.Update(func(txn *Txn) error {\n\t\titem, err := txn.Get(seq.key)\n\t\tswitch {\n\t\tcase err == ErrKeyNotFound:\n\t\t\tseq.next = 0\n\t\tcase err != nil:\n\t\t\treturn err\n\t\tdefault:\n\t\t\tvar num uint64\n\t\t\tif err := item.Value(func(v []byte) error {\n\t\t\t\tnum = binary.BigEndian.Uint64(v)\n\t\t\t\treturn nil\n\t\t\t}); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tseq.next = num\n\t\t}\n\n\t\tlease := seq.next + seq.bandwidth\n\t\tvar buf [8]byte\n\t\tbinary.BigEndian.PutUint64(buf[:], lease)\n\t\tif err = txn.SetEntry(NewEntry(seq.key, buf[:])); err != nil {\n\t\t\treturn err\n\t\t}\n\t\tseq.leased = lease\n\t\treturn nil\n\t})\n}\n\n// GetSequence would initiate a new sequence object, generating it from the stored lease, if\n// available, in the database. Sequence can be used to get a list of monotonically increasing\n// integers. Multiple sequences can be created by providing different keys. Bandwidth sets the\n// size of the lease, determining how many Next() requests can be served from memory.\n//\n// GetSequence is not supported on ManagedDB. 
Calling this would result in a panic.\nfunc (db *DB) GetSequence(key []byte, bandwidth uint64) (*Sequence, error) {\n\tif db.opt.managedTxns {\n\t\tpanic(\"Cannot use GetSequence with managedDB=true.\")\n\t}\n\n\tswitch {\n\tcase len(key) == 0:\n\t\treturn nil, ErrEmptyKey\n\tcase bandwidth == 0:\n\t\treturn nil, ErrZeroBandwidth\n\t}\n\tseq := &Sequence{\n\t\tdb:        db,\n\t\tkey:       key,\n\t\tnext:      0,\n\t\tleased:    0,\n\t\tbandwidth: bandwidth,\n\t}\n\terr := seq.updateLease()\n\treturn seq, err\n}\n\n// Tables gets the TableInfo objects from the level controller. If withKeysCount\n// is true, TableInfo objects also contain counts of keys for the tables.\nfunc (db *DB) Tables() []TableInfo {\n\treturn db.lc.getTableInfo()\n}\n\n// Levels gets the LevelInfo.\nfunc (db *DB) Levels() []LevelInfo {\n\treturn db.lc.getLevelInfo()\n}\n\n// EstimateSize can be used to get rough estimate of data size for a given prefix.\nfunc (db *DB) EstimateSize(prefix []byte) (uint64, uint64) {\n\tvar onDiskSize, uncompressedSize uint64\n\ttables := db.Tables()\n\tfor _, ti := range tables {\n\t\tif bytes.HasPrefix(ti.Left, prefix) && bytes.HasPrefix(ti.Right, prefix) {\n\t\t\tonDiskSize += uint64(ti.OnDiskSize)\n\t\t\tuncompressedSize += uint64(ti.UncompressedSize)\n\t\t}\n\t}\n\treturn onDiskSize, uncompressedSize\n}\n\n// Ranges can be used to get rough key ranges to divide up iteration over the DB. The ranges here\n// would consider the prefix, but would not necessarily start or end with the prefix. In fact, the\n// first range would have nil as left key, and the last range would have nil as the right key.\nfunc (db *DB) Ranges(prefix []byte, numRanges int) []*keyRange {\n\tvar splits []string\n\ttables := db.Tables()\n\n\t// We just want table ranges here and not keys count.\n\tfor _, ti := range tables {\n\t\t// We don't use ti.Left, because that has a tendency to store !badger keys. Skip over tables\n\t\t// at upper levels. 
Only choose tables from the last level.\n\t\tif ti.Level != db.opt.MaxLevels-1 {\n\t\t\tcontinue\n\t\t}\n\t\tif bytes.HasPrefix(ti.Right, prefix) {\n\t\t\tsplits = append(splits, string(ti.Right))\n\t\t}\n\t}\n\n\t// If the number of splits is low, look at the offsets inside the\n\t// tables to generate more splits.\n\tif len(splits) < 32 {\n\t\tnumTables := len(tables)\n\t\tif numTables == 0 {\n\t\t\tnumTables = 1\n\t\t}\n\t\tnumPerTable := 32 / numTables\n\t\tif numPerTable == 0 {\n\t\t\tnumPerTable = 1\n\t\t}\n\t\tsplits = db.lc.keySplits(numPerTable, prefix)\n\t}\n\n\t// If the number of splits is still < 32, then look at the memtables.\n\tif len(splits) < 32 {\n\t\tmaxPerSplit := 10000\n\t\tmtSplits := func(mt *memTable) {\n\t\t\tif mt == nil {\n\t\t\t\treturn\n\t\t\t}\n\t\t\tcount := 0\n\t\t\titer := mt.sl.NewIterator()\n\t\t\tfor iter.SeekToFirst(); iter.Valid(); iter.Next() {\n\t\t\t\tif count%maxPerSplit == 0 {\n\t\t\t\t\t// Add a split every maxPerSplit keys.\n\t\t\t\t\tif bytes.HasPrefix(iter.Key(), prefix) {\n\t\t\t\t\t\tsplits = append(splits, string(iter.Key()))\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tcount += 1\n\t\t\t}\n\t\t\t_ = iter.Close()\n\t\t}\n\n\t\tdb.lock.Lock()\n\t\tdefer db.lock.Unlock()\n\t\tvar memTables []*memTable\n\t\tmemTables = append(memTables, db.imm...)\n\t\tfor _, mt := range memTables {\n\t\t\tmtSplits(mt)\n\t\t}\n\t\tmtSplits(db.mt)\n\t}\n\n\t// We have our splits now. 
Let's convert them to ranges.\n\tsort.Strings(splits)\n\tvar ranges []*keyRange\n\tvar start []byte\n\tfor _, key := range splits {\n\t\tranges = append(ranges, &keyRange{left: start, right: y.SafeCopy(nil, []byte(key))})\n\t\tstart = y.SafeCopy(nil, []byte(key))\n\t}\n\tranges = append(ranges, &keyRange{left: start})\n\n\t// Figure out the approximate table size this range has to deal with.\n\tfor _, t := range tables {\n\t\ttr := keyRange{left: t.Left, right: t.Right}\n\t\tfor _, r := range ranges {\n\t\t\tif len(r.left) == 0 || len(r.right) == 0 {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tif r.overlapsWith(tr) {\n\t\t\t\tr.size += int64(t.UncompressedSize)\n\t\t\t}\n\t\t}\n\t}\n\n\tvar total int64\n\tfor _, r := range ranges {\n\t\ttotal += r.size\n\t}\n\tif total == 0 {\n\t\treturn ranges\n\t}\n\t// Figure out the average size, so we know how to bin the ranges together.\n\tavg := total / int64(numRanges)\n\n\tvar out []*keyRange\n\tvar i int\n\tfor i < len(ranges) {\n\t\tr := ranges[i]\n\t\tcur := &keyRange{left: r.left, size: r.size, right: r.right}\n\t\ti++\n\t\tfor ; i < len(ranges); i++ {\n\t\t\tnext := ranges[i]\n\t\t\tif cur.size+next.size > avg {\n\t\t\t\tbreak\n\t\t\t}\n\t\t\tcur.right = next.right\n\t\t\tcur.size += next.size\n\t\t}\n\t\tout = append(out, cur)\n\t}\n\treturn out\n}\n\n// MaxBatchCount returns max possible entries in batch\nfunc (db *DB) MaxBatchCount() int64 {\n\treturn db.opt.maxBatchCount\n}\n\n// MaxBatchSize returns max possible batch size\nfunc (db *DB) MaxBatchSize() int64 {\n\treturn db.opt.maxBatchSize\n}\n\nfunc (db *DB) stopMemoryFlush() {\n\t// Stop memtable flushes.\n\tif db.closers.memtable != nil {\n\t\tclose(db.flushChan)\n\t\tdb.closers.memtable.SignalAndWait()\n\t}\n}\n\nfunc (db *DB) stopCompactions() {\n\t// Stop compactions.\n\tif db.closers.compactors != nil {\n\t\tdb.closers.compactors.SignalAndWait()\n\t}\n}\n\nfunc (db *DB) startCompactions() {\n\t// Resume compactions.\n\tif db.closers.compactors != nil 
{\n\t\tdb.closers.compactors = z.NewCloser(1)\n\t\tdb.lc.startCompact(db.closers.compactors)\n\t}\n}\n\nfunc (db *DB) startMemoryFlush() {\n\t// Start memory flusher.\n\tif db.closers.memtable != nil {\n\t\tdb.flushChan = make(chan *memTable, db.opt.NumMemtables)\n\t\tdb.closers.memtable = z.NewCloser(1)\n\t\tgo func() {\n\t\t\tdb.flushMemtable(db.closers.memtable)\n\t\t}()\n\t}\n}\n\n// Flatten can be used to force compactions on the LSM tree so all the tables fall on the same\n// level. This ensures that all the versions of keys are colocated and not split across multiple\n// levels, which is necessary after a restore from backup. During Flatten, live compactions are\n// stopped. Ideally, no writes are going on during Flatten. Otherwise, it would create competition\n// between flattening the tree and new tables being created at level zero.\nfunc (db *DB) Flatten(workers int) error {\n\n\tdb.stopCompactions()\n\tdefer db.startCompactions()\n\n\tcompactAway := func(cp compactionPriority) error {\n\t\tdb.opt.Infof(\"Attempting to compact with %+v\\n\", cp)\n\t\terrCh := make(chan error, 1)\n\t\tfor i := 0; i < workers; i++ {\n\t\t\tgo func() {\n\t\t\t\terrCh <- db.lc.doCompact(175, cp)\n\t\t\t}()\n\t\t}\n\t\tvar success int\n\t\tvar rerr error\n\t\tfor i := 0; i < workers; i++ {\n\t\t\terr := <-errCh\n\t\t\tif err != nil {\n\t\t\t\trerr = err\n\t\t\t\tdb.opt.Warningf(\"While running doCompact with %+v. Error: %v\\n\", cp, err)\n\t\t\t} else {\n\t\t\t\tsuccess++\n\t\t\t}\n\t\t}\n\t\tif success == 0 {\n\t\t\treturn rerr\n\t\t}\n\t\t// We could do at least one successful compaction. So, we'll consider this a success.\n\t\tdb.opt.Infof(\"%d compactor(s) succeeded. 
One or more tables from level %d compacted.\\n\",\n\t\t\tsuccess, cp.level)\n\t\treturn nil\n\t}\n\n\thbytes := func(sz int64) string {\n\t\treturn humanize.IBytes(uint64(sz))\n\t}\n\n\tt := db.lc.levelTargets()\n\tfor {\n\t\tdb.opt.Infof(\"\\n\")\n\t\tvar levels []int\n\t\tfor i, l := range db.lc.levels {\n\t\t\tsz := l.getTotalSize()\n\t\t\tdb.opt.Infof(\"Level: %d. %8s Size. %8s Max.\\n\",\n\t\t\t\ti, hbytes(l.getTotalSize()), hbytes(t.targetSz[i]))\n\t\t\tif sz > 0 {\n\t\t\t\tlevels = append(levels, i)\n\t\t\t}\n\t\t}\n\t\tif len(levels) <= 1 {\n\t\t\tprios := db.lc.pickCompactLevels(nil)\n\t\t\tif len(prios) == 0 || prios[0].score <= 1.0 {\n\t\t\t\tdb.opt.Infof(\"All tables consolidated into one level. Flattening done.\\n\")\n\t\t\t\treturn nil\n\t\t\t}\n\t\t\tif err := compactAway(prios[0]); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tcontinue\n\t\t}\n\t\t// Create an artificial compaction priority, to ensure that we compact the level.\n\t\tcp := compactionPriority{level: levels[0], score: 1.71}\n\t\tif err := compactAway(cp); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n}\n\nfunc (db *DB) blockWrite() error {\n\t// Stop accepting new writes.\n\tif !db.blockWrites.CompareAndSwap(0, 1) {\n\t\treturn ErrBlockedWrites\n\t}\n\n\t// Make all pending writes finish. The following will also close writeCh.\n\tdb.closers.writes.SignalAndWait()\n\tdb.opt.Infof(\"Writes flushed. Stopping compactions now...\")\n\treturn nil\n}\n\nfunc (db *DB) unblockWrite() {\n\tdb.closers.writes = z.NewCloser(1)\n\tgo db.doWrites(db.closers.writes)\n\n\t// Resume writes.\n\tdb.blockWrites.Store(0)\n}\n\nfunc (db *DB) prepareToDrop() (func(), error) {\n\tif db.opt.ReadOnly {\n\t\tpanic(\"Attempting to drop data in read-only mode.\")\n\t}\n\t// In order prepare for drop, we need to block the incoming writes and\n\t// write it to db. Then, flush all the pending memtable. 
So that, we\n\t// don't miss any entries.\n\tif err := db.blockWrite(); err != nil {\n\t\treturn func() {}, err\n\t}\n\treqs := make([]*request, 0, 10)\n\tfor {\n\t\tselect {\n\t\tcase r := <-db.writeCh:\n\t\t\treqs = append(reqs, r)\n\t\tdefault:\n\t\t\tif err := db.writeRequests(reqs); err != nil {\n\t\t\t\tdb.opt.Errorf(\"writeRequests: %v\", err)\n\t\t\t}\n\t\t\tdb.stopMemoryFlush()\n\t\t\treturn func() {\n\t\t\t\tdb.opt.Infof(\"Resuming writes\")\n\t\t\t\tdb.startMemoryFlush()\n\t\t\t\tdb.unblockWrite()\n\t\t\t}, nil\n\t\t}\n\t}\n}\n\n// DropAll would drop all the data stored in Badger. It does this in the following way.\n// - Stop accepting new writes.\n// - Pause memtable flushes and compactions.\n// - Pick all tables from all levels, create a changeset to delete all these\n// tables and apply it to manifest.\n// - Pick all log files from value log, and delete all of them. Restart value log files from zero.\n// - Resume memtable flushes and compactions.\n//\n// NOTE: DropAll is resilient to concurrent writes, but not to reads. It is up to the user to not do\n// any reads while DropAll is going on, otherwise they may result in panics. Ideally, both reads and\n// writes are paused before running DropAll, and resumed after it is finished.\nfunc (db *DB) DropAll() error {\n\tf, err := db.dropAll()\n\tif f != nil {\n\t\tf()\n\t}\n\treturn err\n}\n\nfunc (db *DB) dropAll() (func(), error) {\n\tdb.opt.Infof(\"DropAll called. Blocking writes...\")\n\tf, err := db.prepareToDrop()\n\tif err != nil {\n\t\treturn f, err\n\t}\n\t// prepareToDrop will stop all the incoming write and flushes any pending memtables.\n\t// Before we drop, we'll stop the compaction because anyways all the data are going to\n\t// be deleted.\n\tdb.stopCompactions()\n\tresume := func() {\n\t\tdb.startCompactions()\n\t\tf()\n\t}\n\t// Block all foreign interactions with memory tables.\n\tdb.lock.Lock()\n\tdefer db.lock.Unlock()\n\n\t// Remove inmemory tables. Calling DecrRef for safety. 
Not sure if they're absolutely needed.\n\tdb.mt.DecrRef()\n\tfor _, mt := range db.imm {\n\t\tmt.DecrRef()\n\t}\n\tdb.imm = db.imm[:0]\n\tdb.mt, err = db.newMemTable() // Set it up for future writes.\n\tif err != nil {\n\t\treturn resume, y.Wrapf(err, \"cannot open new memtable\")\n\t}\n\n\tnum, err := db.lc.dropTree()\n\tif err != nil {\n\t\treturn resume, err\n\t}\n\tdb.opt.Infof(\"Deleted %d SSTables. Now deleting value logs...\\n\", num)\n\n\tnum, err = db.vlog.dropAll()\n\tif err != nil {\n\t\treturn resume, err\n\t}\n\tdb.lc.nextFileID.Store(1)\n\tdb.opt.Infof(\"Deleted %d value log files. DropAll done.\\n\", num)\n\tdb.blockCache.Clear()\n\tdb.indexCache.Clear()\n\tdb.threshold.Clear(db.opt)\n\treturn resume, nil\n}\n\n// DropPrefix would drop all the keys with the provided prefix. It does this in the following way:\n//   - Stop accepting new writes.\n//   - Stop memtable flushes before acquiring lock. Because we're acquiring lock here\n//     and memtable flush stalls for lock, which leads to deadlock\n//   - Flush out all memtables, skipping over keys with the given prefix, Kp.\n//   - Write out the value log header to memtables when flushing, so we don't accidentally bring Kp\n//     back after a restart.\n//   - Stop compaction.\n//   - Compact L0->L1, skipping over Kp.\n//   - Compact rest of the levels, Li->Li, picking tables which have Kp.\n//   - Resume memtable flushes, compactions and writes.\nfunc (db *DB) DropPrefix(prefixes ...[]byte) error {\n\tif len(prefixes) == 0 {\n\t\treturn nil\n\t}\n\tdb.opt.Infof(\"DropPrefix called for %s\", prefixes)\n\tf, err := db.prepareToDrop()\n\tif err != nil {\n\t\treturn err\n\t}\n\tdefer f()\n\n\tvar filtered [][]byte\n\tif filtered, err = db.filterPrefixesToDrop(prefixes); err != nil {\n\t\treturn err\n\t}\n\t// If there is no prefix for which the data already exist, do not do anything.\n\tif len(filtered) == 0 {\n\t\tdb.opt.Infof(\"No prefixes to drop\")\n\t\treturn nil\n\t}\n\t// Block all foreign 
interactions with memory tables.\n\tdb.lock.Lock()\n\tdefer db.lock.Unlock()\n\n\tdb.imm = append(db.imm, db.mt)\n\tfor _, memtable := range db.imm {\n\t\tif memtable.sl.Empty() {\n\t\t\tmemtable.DecrRef()\n\t\t\tcontinue\n\t\t}\n\t\tdb.opt.Debugf(\"Flushing memtable\")\n\t\tif err := db.handleMemTableFlush(memtable, filtered); err != nil {\n\t\t\tdb.opt.Errorf(\"While trying to flush memtable: %v\", err)\n\t\t\treturn err\n\t\t}\n\t\tmemtable.DecrRef()\n\t}\n\tdb.stopCompactions()\n\tdefer db.startCompactions()\n\tdb.imm = db.imm[:0]\n\tdb.mt, err = db.newMemTable()\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"cannot create new mem table\")\n\t}\n\n\t// Drop prefixes from the levels.\n\tif err := db.lc.dropPrefixes(filtered); err != nil {\n\t\treturn err\n\t}\n\tdb.opt.Infof(\"DropPrefix done\")\n\treturn nil\n}\n\nfunc (db *DB) filterPrefixesToDrop(prefixes [][]byte) ([][]byte, error) {\n\tvar filtered [][]byte\n\tfor _, prefix := range prefixes {\n\t\terr := db.View(func(txn *Txn) error {\n\t\t\tiopts := DefaultIteratorOptions\n\t\t\tiopts.Prefix = prefix\n\t\t\tiopts.PrefetchValues = false\n\t\t\titr := txn.NewIterator(iopts)\n\t\t\tdefer itr.Close()\n\t\t\titr.Rewind()\n\t\t\tif itr.ValidForPrefix(prefix) {\n\t\t\t\tfiltered = append(filtered, prefix)\n\t\t\t}\n\t\t\treturn nil\n\t\t})\n\t\tif err != nil {\n\t\t\treturn filtered, err\n\t\t}\n\t}\n\treturn filtered, nil\n}\n\n// Checks if the key is banned. Returns the respective error if the key belongs to any of the banned\n// namespaces. Else it returns nil.\nfunc (db *DB) isBanned(key []byte) error {\n\tif db.opt.NamespaceOffset < 0 {\n\t\treturn nil\n\t}\n\tif len(key) <= db.opt.NamespaceOffset+8 {\n\t\treturn nil\n\t}\n\tif db.bannedNamespaces.has(y.BytesToU64(key[db.opt.NamespaceOffset:])) {\n\t\treturn ErrBannedKey\n\t}\n\treturn nil\n}\n\n// BanNamespace bans a namespace. 
Read/write to keys belonging to any of such namespace is denied.\nfunc (db *DB) BanNamespace(ns uint64) error {\n\tif db.opt.NamespaceOffset < 0 {\n\t\treturn ErrNamespaceMode\n\t}\n\tdb.opt.Infof(\"Banning namespace: %d\", ns)\n\t// First set the banned namespaces in DB and then update the in-memory structure.\n\tkey := y.KeyWithTs(append(bannedNsKey, y.U64ToBytes(ns)...), 1)\n\tentry := []*Entry{{\n\t\tKey:   key,\n\t\tValue: nil,\n\t}}\n\treq, err := db.sendToWriteCh(entry)\n\tif err != nil {\n\t\treturn err\n\t}\n\tif err := req.Wait(); err != nil {\n\t\treturn err\n\t}\n\tdb.bannedNamespaces.add(ns)\n\treturn nil\n}\n\n// BannedNamespaces returns the list of prefixes banned for DB.\nfunc (db *DB) BannedNamespaces() []uint64 {\n\treturn db.bannedNamespaces.all()\n}\n\n// KVList contains a list of key-value pairs.\ntype KVList = pb.KVList\n\n// Subscribe can be used to watch key changes for the given key prefixes and the ignore string.\n// At least one prefix should be passed, or an error will be returned.\n// You can use an empty prefix to monitor all changes to the DB.\n// Ignore string is the byte ranges for which prefix matching will be ignored.\n// For example: ignore = \"2-3\", and prefix = \"abc\" will match for keys \"abxxc\", \"abdfc\" etc.\n// This function blocks until the given context is done or an error occurs.\n// The given function will be called with a new KVList containing the modified keys and the\n// corresponding values.\nfunc (db *DB) Subscribe(ctx context.Context, cb func(kv *KVList) error, matches []pb.Match) error {\n\tif cb == nil {\n\t\treturn ErrNilCallback\n\t}\n\n\tc := z.NewCloser(1)\n\ts, err := db.pub.newSubscriber(c, matches)\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"while creating a new subscriber\")\n\t}\n\tslurp := func(batch *pb.KVList) error {\n\t\tfor {\n\t\t\tselect {\n\t\t\tcase kvs := <-s.sendCh:\n\t\t\t\tbatch.Kv = append(batch.Kv, kvs.Kv...)\n\t\t\tdefault:\n\t\t\t\tif len(batch.GetKv()) > 0 {\n\t\t\t\t\treturn 
cb(batch)\n\t\t\t\t}\n\t\t\t\treturn nil\n\t\t\t}\n\t\t}\n\t}\n\n\tdrain := func() {\n\t\tfor {\n\t\t\tselect {\n\t\t\tcase _, ok := <-s.sendCh:\n\t\t\t\tif !ok {\n\t\t\t\t\t// Channel is closed.\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\tdefault:\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t}\n\tfor {\n\t\tselect {\n\t\tcase <-c.HasBeenClosed():\n\t\t\t// No need to delete here. Closer will be called only while\n\t\t\t// closing DB. Subscriber will be deleted by cleanSubscribers.\n\t\t\terr := slurp(new(pb.KVList))\n\t\t\t// Drain if any pending updates.\n\t\t\tc.Done()\n\t\t\treturn err\n\t\tcase <-ctx.Done():\n\t\t\tc.Done()\n\t\t\ts.active.Store(0)\n\t\t\tdrain()\n\t\t\tdb.pub.deleteSubscriber(s.id)\n\t\t\t// Delete the subscriber to avoid further updates.\n\t\t\treturn ctx.Err()\n\t\tcase batch := <-s.sendCh:\n\t\t\terr := slurp(batch)\n\t\t\tif err != nil {\n\t\t\t\tc.Done()\n\t\t\t\ts.active.Store(0)\n\t\t\t\tdrain()\n\t\t\t\t// Delete the subscriber if there is an error by the callback.\n\t\t\t\tdb.pub.deleteSubscriber(s.id)\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\t}\n}\n\nfunc (db *DB) syncDir(dir string) error {\n\tif db.opt.InMemory {\n\t\treturn nil\n\t}\n\treturn syncDir(dir)\n}\n\nfunc createDirs(opt Options) error {\n\tfor _, path := range []string{opt.Dir, opt.ValueDir} {\n\t\tdirExists, err := exists(path)\n\t\tif err != nil {\n\t\t\treturn y.Wrapf(err, \"Invalid Dir: %q\", path)\n\t\t}\n\t\tif !dirExists {\n\t\t\tif opt.ReadOnly {\n\t\t\t\treturn fmt.Errorf(\"Cannot find directory %q for read-only open\", path)\n\t\t\t}\n\t\t\t// Try to create the directory\n\t\t\terr = os.MkdirAll(path, 0700)\n\t\t\tif err != nil {\n\t\t\t\treturn y.Wrapf(err, \"Error Creating Dir: %q\", path)\n\t\t\t}\n\t\t}\n\t}\n\treturn nil\n}\n\n// Stream the contents of this DB to a new DB with options outOptions that will be\n// created in outDir.\nfunc (db *DB) StreamDB(outOptions Options) error {\n\toutDir := outOptions.Dir\n\n\t// Open output DB.\n\toutDB, err := 
OpenManaged(outOptions)\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"cannot open out DB at %s\", outDir)\n\t}\n\tdefer outDB.Close()\n\twriter := outDB.NewStreamWriter()\n\tif err := writer.Prepare(); err != nil {\n\t\treturn y.Wrapf(err, \"cannot create stream writer in out DB at %s\", outDir)\n\t}\n\n\t// Stream contents of DB to the output DB.\n\tstream := db.NewStreamAt(math.MaxUint64)\n\tstream.LogPrefix = fmt.Sprintf(\"Streaming DB to new DB at %s\", outDir)\n\n\tstream.Send = func(buf *z.Buffer) error {\n\t\treturn writer.Write(buf)\n\t}\n\tif err := stream.Orchestrate(context.Background()); err != nil {\n\t\treturn y.Wrapf(err, \"cannot stream DB to out DB at %s\", outDir)\n\t}\n\tif err := writer.Flush(); err != nil {\n\t\treturn y.Wrapf(err, \"cannot flush writer\")\n\t}\n\treturn nil\n}\n\n// Opts returns a copy of the DB options.\nfunc (db *DB) Opts() Options {\n\treturn db.opt\n}\n\ntype CacheType int\n\nconst (\n\tBlockCache CacheType = iota\n\tIndexCache\n)\n\n// CacheMaxCost updates the max cost of the given cache (either block or index cache).\n// The call will have an effect only if the DB was created with the cache. Otherwise it is\n// a no-op. 
If you pass a negative value, the function will return the current value\n// without updating it.\nfunc (db *DB) CacheMaxCost(cache CacheType, maxCost int64) (int64, error) {\n\tif db == nil {\n\t\treturn 0, nil\n\t}\n\n\tif maxCost < 0 {\n\t\tswitch cache {\n\t\tcase BlockCache:\n\t\t\treturn db.blockCache.MaxCost(), nil\n\t\tcase IndexCache:\n\t\t\treturn db.indexCache.MaxCost(), nil\n\t\tdefault:\n\t\t\treturn 0, errors.New(\"invalid cache type\")\n\t\t}\n\t}\n\n\tswitch cache {\n\tcase BlockCache:\n\t\tdb.blockCache.UpdateMaxCost(maxCost)\n\t\treturn maxCost, nil\n\tcase IndexCache:\n\t\tdb.indexCache.UpdateMaxCost(maxCost)\n\t\treturn maxCost, nil\n\tdefault:\n\t\treturn 0, errors.New(\"invalid cache type\")\n\t}\n}\n\nfunc (db *DB) LevelsToString() string {\n\tlevels := db.Levels()\n\th := func(sz int64) string {\n\t\treturn humanize.IBytes(uint64(sz))\n\t}\n\tbase := func(b bool) string {\n\t\tif b {\n\t\t\treturn \"B\"\n\t\t}\n\t\treturn \" \"\n\t}\n\n\tvar b strings.Builder\n\tb.WriteRune('\\n')\n\tfor _, li := range levels {\n\t\tb.WriteString(fmt.Sprintf(\n\t\t\t\"Level %d [%s]: NumTables: %02d. Size: %s of %s. Score: %.2f->%.2f\"+\n\t\t\t\t\" StaleData: %s Target FileSize: %s\\n\",\n\t\t\tli.Level, base(li.IsBaseLevel), li.NumTables,\n\t\t\th(li.Size), h(li.TargetSize), li.Score, li.Adjusted, h(li.StaleDatSize),\n\t\t\th(li.TargetFileSize)))\n\t}\n\tb.WriteString(\"Level Done\\n\")\n\treturn b.String()\n}\n"
  },
  {
    "path": "db2_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"context\"\n\t\"encoding/binary\"\n\t\"flag\"\n\t\"fmt\"\n\t\"log\"\n\t\"math\"\n\t\"math/rand\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"regexp\"\n\t\"runtime\"\n\t\"sync\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nfunc TestTruncateVlogWithClose(t *testing.T) {\n\tkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%d%10d\", i, i))\n\t}\n\tdata := func(l int) []byte {\n\t\tm := make([]byte, l)\n\t\t_, err := rand.Read(m)\n\t\trequire.NoError(t, err)\n\t\treturn m\n\t}\n\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topt := getTestOptions(dir)\n\topt.SyncWrites = true\n\topt.ValueThreshold = 1 // Force all reads from value log.\n\n\tdb, err := Open(opt)\n\trequire.NoError(t, err)\n\n\terr = db.Update(func(txn *Txn) error {\n\t\treturn txn.SetEntry(NewEntry(key(0), data(4055)))\n\t})\n\trequire.NoError(t, err)\n\n\t// Close the DB.\n\trequire.NoError(t, db.Close())\n\t// We start value logs at 1.\n\trequire.NoError(t, os.Truncate(filepath.Join(dir, \"000001.vlog\"), 4090))\n\n\t// Reopen and write some new data.\n\tdb, err = Open(opt)\n\trequire.NoError(t, err)\n\tfor i := 0; i < 32; i++ {\n\t\terr := db.Update(func(txn *Txn) error {\n\t\t\treturn txn.SetEntry(NewEntry(key(i), data(10)))\n\t\t})\n\t\trequire.NoError(t, err)\n\t}\n\n\t// Read it back to ensure that we can read it now.\n\tfor i := 0; i < 32; i++ {\n\t\terr := db.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get(key(i))\n\t\t\trequire.NoError(t, err)\n\t\t\tval := getItemValue(t, item)\n\t\t\trequire.Equal(t, 10, 
len(val))\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t}\n\trequire.NoError(t, db.Close())\n\n\t// Reopen and read the data again.\n\tdb, err = Open(opt)\n\trequire.NoError(t, err)\n\tfor i := 0; i < 32; i++ {\n\t\terr := db.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get(key(i))\n\t\t\trequire.NoError(t, err, \"key: %s\", key(i))\n\t\t\tval := getItemValue(t, item)\n\t\t\trequire.Equal(t, 10, len(val))\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t}\n\trequire.NoError(t, db.Close())\n}\n\nvar manual = flag.Bool(\"manual\", false, \"Set when manually running some tests.\")\n\n// Badger dir to be used for performing db.Open benchmark.\nvar benchDir = flag.String(\"benchdir\", \"\", \"Set when running db.Open benchmark\")\n\n// The following 3 TruncateVlogNoClose tests should be run one after another.\n// None of these close the DB, simulating a crash. They should be run with a\n// script, which truncates the value log to 4090, lining up with the end of the\n// first entry in the txn. 
At <4090, it would cause the entry to be truncated\n// immediately, at >4090, same thing.\nfunc TestTruncateVlogNoClose(t *testing.T) {\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\tdir := \"p\"\n\topts := getTestOptions(dir)\n\topts.SyncWrites = true\n\n\tkv, err := Open(opts)\n\trequire.NoError(t, err)\n\tkey := func(i int) string {\n\t\treturn fmt.Sprintf(\"%d%10d\", i, i)\n\t}\n\tdata := fmt.Sprintf(\"%4055d\", 1)\n\terr = kv.Update(func(txn *Txn) error {\n\t\treturn txn.SetEntry(NewEntry([]byte(key(0)), []byte(data)))\n\t})\n\trequire.NoError(t, err)\n}\nfunc TestTruncateVlogNoClose2(t *testing.T) {\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\tdir := \"p\"\n\topts := getTestOptions(dir)\n\topts.SyncWrites = true\n\n\tkv, err := Open(opts)\n\trequire.NoError(t, err)\n\tkey := func(i int) string {\n\t\treturn fmt.Sprintf(\"%d%10d\", i, i)\n\t}\n\tdata := fmt.Sprintf(\"%10d\", 1)\n\tfor i := 32; i < 64; i++ {\n\t\terr := kv.Update(func(txn *Txn) error {\n\t\t\treturn txn.SetEntry(NewEntry([]byte(key(i)), []byte(data)))\n\t\t})\n\t\trequire.NoError(t, err)\n\t}\n\tfor i := 32; i < 64; i++ {\n\t\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get([]byte(key(i)))\n\t\t\trequire.NoError(t, err)\n\t\t\tval := getItemValue(t, item)\n\t\t\trequire.NotNil(t, val)\n\t\t\trequire.True(t, len(val) > 0)\n\t\t\treturn nil\n\t\t}))\n\t}\n}\nfunc TestTruncateVlogNoClose3(t *testing.T) {\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\tfmt.Print(\"Running\")\n\tdir := \"p\"\n\topts := getTestOptions(dir)\n\topts.SyncWrites = true\n\n\tkv, err := Open(opts)\n\trequire.NoError(t, err)\n\tkey := func(i int) string {\n\t\treturn fmt.Sprintf(\"%d%10d\", i, i)\n\t}\n\tfor i := 32; i < 64; i++ {\n\t\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get([]byte(key(i)))\n\t\t\trequire.NoError(t, 
err)\n\t\t\tval := getItemValue(t, item)\n\t\t\trequire.NotNil(t, val)\n\t\t\trequire.True(t, len(val) > 0)\n\t\t\treturn nil\n\t\t}))\n\t}\n}\n\nfunc TestBigKeyValuePairs(t *testing.T) {\n\t// This test takes too much memory. So, run separately.\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\n\t// Passing an empty directory since it will be filled by runBadgerTest.\n\topts := DefaultOptions(\"\").\n\t\tWithBaseTableSize(1 << 20).\n\t\tWithValueLogMaxEntries(64)\n\trunBadgerTest(t, &opts, func(t *testing.T, db *DB) {\n\t\tbigK := make([]byte, 65001)\n\t\tbigV := make([]byte, db.opt.ValueLogFileSize+1)\n\t\tsmall := make([]byte, 65000)\n\n\t\ttxn := db.NewTransaction(true)\n\t\trequire.Regexp(t, regexp.MustCompile(\"Key.*exceeded\"), txn.SetEntry(NewEntry(bigK, small)))\n\t\trequire.Regexp(t, regexp.MustCompile(\"Value.*exceeded\"),\n\t\t\ttxn.SetEntry(NewEntry(small, bigV)))\n\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(small, small)))\n\t\trequire.Regexp(t, regexp.MustCompile(\"Key.*exceeded\"), txn.SetEntry(NewEntry(bigK, bigV)))\n\n\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\t_, err := txn.Get(small)\n\t\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\t\treturn nil\n\t\t}))\n\n\t\t// Now run a longer test, which involves value log GC.\n\t\tdata := fmt.Sprintf(\"%100d\", 1)\n\t\tkey := func(i int) string {\n\t\t\treturn fmt.Sprintf(\"%65000d\", i)\n\t\t}\n\n\t\tsaveByKey := func(key string, value []byte) error {\n\t\t\treturn db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.SetEntry(NewEntry([]byte(key), value))\n\t\t\t})\n\t\t}\n\n\t\tgetByKey := func(key string) error {\n\t\t\treturn db.View(func(txn *Txn) error {\n\t\t\t\titem, err := txn.Get([]byte(key))\n\t\t\t\tif err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t\treturn item.Value(func(val []byte) error {\n\t\t\t\t\tif len(val) == 0 {\n\t\t\t\t\t\tlog.Fatalf(\"key not found %q\", len(key))\n\t\t\t\t\t}\n\t\t\t\t\treturn 
nil\n\t\t\t\t})\n\t\t\t})\n\t\t}\n\n\t\tfor i := 0; i < 32; i++ {\n\t\t\tif i < 30 {\n\t\t\t\trequire.NoError(t, saveByKey(key(i), []byte(data)))\n\t\t\t} else {\n\t\t\t\trequire.NoError(t, saveByKey(key(i), []byte(fmt.Sprintf(\"%100d\", i))))\n\t\t\t}\n\t\t}\n\n\t\tfor j := 0; j < 5; j++ {\n\t\t\tfor i := 0; i < 32; i++ {\n\t\t\t\tif i < 30 {\n\t\t\t\t\trequire.NoError(t, saveByKey(key(i), []byte(data)))\n\t\t\t\t} else {\n\t\t\t\t\trequire.NoError(t, saveByKey(key(i), []byte(fmt.Sprintf(\"%100d\", i))))\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\n\t\tfor i := 0; i < 32; i++ {\n\t\t\trequire.NoError(t, getByKey(key(i)))\n\t\t}\n\n\t\tvar loops int\n\t\tvar err error\n\t\tfor err == nil {\n\t\t\terr = db.RunValueLogGC(0.5)\n\t\t\trequire.NotRegexp(t, regexp.MustCompile(\"truncate\"), err)\n\t\t\tloops++\n\t\t}\n\t\tt.Logf(\"Ran value log GC %d times. Last error: %v\\n\", loops, err)\n\t})\n}\n\n// The following test checks for issue #585.\nfunc TestPushValueLogLimit(t *testing.T) {\n\t// This test takes too much memory. 
So, run separately.\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\n\t// Passing an empty directory since it will be filled by runBadgerTest.\n\topt := DefaultOptions(\"\").\n\t\tWithValueLogMaxEntries(64).\n\t\tWithValueLogFileSize(2<<30 - 1)\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tdata := []byte(fmt.Sprintf(\"%30d\", 1))\n\t\tkey := func(i int) string {\n\t\t\treturn fmt.Sprintf(\"%100d\", i)\n\t\t}\n\n\t\tfor i := 0; i < 32; i++ {\n\t\t\tif i == 4 {\n\t\t\t\tv := make([]byte, math.MaxInt32)\n\t\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\t\treturn txn.SetEntry(NewEntry([]byte(key(i)), v))\n\t\t\t\t})\n\t\t\t\trequire.NoError(t, err)\n\t\t\t} else {\n\t\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\t\treturn txn.SetEntry(NewEntry([]byte(key(i)), data))\n\t\t\t\t})\n\t\t\t\trequire.NoError(t, err)\n\t\t\t}\n\t\t}\n\n\t\tfor i := 0; i < 32; i++ {\n\t\t\terr := db.View(func(txn *Txn) error {\n\t\t\t\titem, err := txn.Get([]byte(key(i)))\n\t\t\t\trequire.NoError(t, err, \"Getting key: %s\", key(i))\n\t\t\t\terr = item.Value(func(v []byte) error {\n\t\t\t\t\t_ = v\n\t\t\t\t\treturn nil\n\t\t\t\t})\n\t\t\t\trequire.NoError(t, err, \"Getting value: %s\", key(i))\n\t\t\t\treturn nil\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}\n\t})\n}\n\n// The following benchmark test is supposed to be run against a badger directory with some data.\n// Use badger fill to create data if it doesn't exist.\nfunc BenchmarkDBOpen(b *testing.B) {\n\tif *benchDir == \"\" {\n\t\tb.Skip(\"Please set -benchdir to badger directory\")\n\t}\n\tdir := *benchDir\n\t// Passing an empty directory since it will be filled by runBadgerTest.\n\topt := DefaultOptions(dir).\n\t\tWithReadOnly(true)\n\tfor i := 0; i < b.N; i++ {\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(b, err)\n\t\trequire.NoError(b, db.Close())\n\t}\n}\n\n// Test for values of size uint32.\nfunc TestBigValues(t *testing.T) {\n\tif !*manual 
{\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\topts := DefaultOptions(\"\").\n\t\tWithValueThreshold(1 << 20).\n\t\tWithValueLogMaxEntries(100)\n\ttest := func(t *testing.T, db *DB) {\n\t\tkeyCount := 1000\n\n\t\tdata := bytes.Repeat([]byte(\"a\"), (1 << 20)) // Valuesize 1 MB.\n\t\tkey := func(i int) string {\n\t\t\treturn fmt.Sprintf(\"%65000d\", i)\n\t\t}\n\n\t\tsaveByKey := func(key string, value []byte) error {\n\t\t\treturn db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.SetEntry(NewEntry([]byte(key), value))\n\t\t\t})\n\t\t}\n\n\t\tgetByKey := func(key string) error {\n\t\t\treturn db.View(func(txn *Txn) error {\n\t\t\t\titem, err := txn.Get([]byte(key))\n\t\t\t\tif err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t\treturn item.Value(func(val []byte) error {\n\t\t\t\t\tif len(val) == 0 || len(val) != len(data) || !bytes.Equal(val, data) {\n\t\t\t\t\t\tlog.Fatalf(\"key not found %q\", len(key))\n\t\t\t\t\t}\n\t\t\t\t\treturn nil\n\t\t\t\t})\n\t\t\t})\n\t\t}\n\n\t\tfor i := 0; i < keyCount; i++ {\n\t\t\trequire.NoError(t, saveByKey(key(i), data))\n\t\t}\n\n\t\tfor i := 0; i < keyCount; i++ {\n\t\t\trequire.NoError(t, getByKey(key(i)))\n\t\t}\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\trunBadgerTest(t, &opts, func(t *testing.T, db *DB) {\n\t\t\ttest(t, db)\n\t\t})\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\topts.InMemory = true\n\t\topts.Dir = \"\"\n\t\topts.ValueDir = \"\"\n\t\tdb, err := Open(opts)\n\t\trequire.NoError(t, err)\n\t\ttest(t, db)\n\t\trequire.NoError(t, db.Close())\n\t})\n}\n\n// This test is for compaction file picking testing. We are creating db with two levels. We have 10\n// tables on level 3 and 3 tables on level 2. 
Tables on level 2 have overlap with 2, 4, 3 tables on\n// level 3.\nfunc TestCompactionFilePicking(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := Open(DefaultOptions(dir))\n\trequire.NoError(t, err, \"error while opening db\")\n\tdefer func() {\n\t\trequire.NoError(t, db.Close())\n\t}()\n\n\tl3 := db.lc.levels[3]\n\tfor i := 1; i <= 10; i++ {\n\t\t// Each table has difference of 1 between smallest and largest key.\n\t\ttab := createTableWithRange(t, db, 2*i-1, 2*i)\n\t\taddToManifest(t, db, tab, 3, db.opt)\n\t\trequire.NoError(t, l3.replaceTables([]*table.Table{}, []*table.Table{tab}))\n\t}\n\n\tl2 := db.lc.levels[2]\n\t// First table has keys 1 and 4.\n\ttab := createTableWithRange(t, db, 1, 4)\n\taddToManifest(t, db, tab, 2, db.opt)\n\trequire.NoError(t, l2.replaceTables([]*table.Table{}, []*table.Table{tab}))\n\n\t// Second table has keys 5 and 12.\n\ttab = createTableWithRange(t, db, 5, 12)\n\taddToManifest(t, db, tab, 2, db.opt)\n\trequire.NoError(t, l2.replaceTables([]*table.Table{}, []*table.Table{tab}))\n\n\t// Third table has keys 13 and 18.\n\ttab = createTableWithRange(t, db, 13, 18)\n\taddToManifest(t, db, tab, 2, db.opt)\n\trequire.NoError(t, l2.replaceTables([]*table.Table{}, []*table.Table{tab}))\n\n\tcdef := &compactDef{\n\t\tthisLevel: db.lc.levels[2],\n\t\tnextLevel: db.lc.levels[3],\n\t}\n\n\ttables := db.lc.levels[2].tables\n\tdb.lc.sortByHeuristic(tables, cdef)\n\n\tvar expKey [8]byte\n\t// First table should be with smallest and biggest keys as 1 and 4 which\n\t// has the lowest version.\n\tbinary.BigEndian.PutUint64(expKey[:], uint64(1))\n\trequire.Equal(t, expKey[:], y.ParseKey(tables[0].Smallest()))\n\tbinary.BigEndian.PutUint64(expKey[:], uint64(4))\n\trequire.Equal(t, expKey[:], y.ParseKey(tables[0].Biggest()))\n\n\t// Second table should be with smallest and biggest keys as 13 and 18\n\t// which has the second lowest 
version.\n\tbinary.BigEndian.PutUint64(expKey[:], uint64(13))\n\trequire.Equal(t, expKey[:], y.ParseKey(tables[2].Smallest()))\n\tbinary.BigEndian.PutUint64(expKey[:], uint64(18))\n\trequire.Equal(t, expKey[:], y.ParseKey(tables[2].Biggest()))\n\n\t// Third table should be with smallest and biggest keys as 5 and 12 which\n\t// has the maximum version.\n\tbinary.BigEndian.PutUint64(expKey[:], uint64(5))\n\trequire.Equal(t, expKey[:], y.ParseKey(tables[1].Smallest()))\n\tbinary.BigEndian.PutUint64(expKey[:], uint64(12))\n\trequire.Equal(t, expKey[:], y.ParseKey(tables[1].Biggest()))\n}\n\n// addToManifest function is used in TestCompactionFilePicking. It adds table to db manifest.\nfunc addToManifest(t *testing.T, db *DB, tab *table.Table, level uint32, opt Options) {\n\tchange := &pb.ManifestChange{\n\t\tId:          tab.ID(),\n\t\tOp:          pb.ManifestChange_CREATE,\n\t\tLevel:       level,\n\t\tCompression: uint32(tab.CompressionType()),\n\t}\n\trequire.NoError(t, db.manifest.addChanges([]*pb.ManifestChange{change}, opt),\n\t\t\"unable to add to manifest\")\n}\n\n// createTableWithRange function is used in TestCompactionFilePicking. 
It creates\n// a table with key starting from start and ending with end.\nfunc createTableWithRange(t *testing.T, db *DB, start, end int) *table.Table {\n\tbopts := buildTableOptions(db)\n\tb := table.NewTableBuilder(bopts)\n\tdefer b.Close()\n\tnums := []int{start, end}\n\tfor _, i := range nums {\n\t\tkey := make([]byte, 8)\n\t\tbinary.BigEndian.PutUint64(key[:], uint64(i))\n\t\tkey = y.KeyWithTs(key, uint64(0))\n\t\tval := y.ValueStruct{Value: []byte(fmt.Sprintf(\"%d\", i))}\n\t\tb.Add(key, val, 0)\n\t}\n\n\tfileID := db.lc.reserveFileID()\n\ttab, err := table.CreateTable(table.NewFilename(fileID, db.opt.Dir), b)\n\trequire.NoError(t, err)\n\treturn tab\n}\n\nfunc TestReadSameVlog(t *testing.T) {\n\tkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%d%10d\", i, i))\n\t}\n\ttestReadingSameKey := func(t *testing.T, db *DB) {\n\t\t// Forcing to read all values from vlog.\n\t\tfor i := 0; i < 50; i++ {\n\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.Set(key(i), key(i))\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}\n\t\t// reading it again several times\n\t\tfor i := 0; i < 50; i++ {\n\t\t\tfor j := 0; j < 10; j++ {\n\t\t\t\terr := db.View(func(txn *Txn) error {\n\t\t\t\t\titem, err := txn.Get(key(i))\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\trequire.Equal(t, key(i), getItemValue(t, item))\n\t\t\t\t\treturn nil\n\t\t\t\t})\n\t\t\t\trequire.NoError(t, err)\n\t\t\t}\n\t\t}\n\t}\n\n\tt.Run(\"Test Read Again Plain Text\", func(t *testing.T) {\n\t\topt := getTestOptions(\"\")\n\t\t// Forcing to read from vlog\n\t\topt.ValueThreshold = 1\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\ttestReadingSameKey(t, db)\n\t\t})\n\n\t})\n\n\tt.Run(\"Test Read Again Encryption\", func(t *testing.T) {\n\t\topt := getTestOptions(\"\")\n\t\topt.ValueThreshold = 1\n\t\t// Generate encryption key.\n\t\teKey := make([]byte, 32)\n\t\t_, err := rand.Read(eKey)\n\t\trequire.NoError(t, err)\n\t\topt.EncryptionKey = eKey\n\t\trunBadgerTest(t, 
nil, func(t *testing.T, db *DB) {\n\t\t\ttestReadingSameKey(t, db)\n\t\t})\n\t})\n}\n\n// The test ensures we don't lose data when badger is opened with KeepL0InMemory and GC is being\n// done.\nfunc TestL0GCBug(t *testing.T) {\n\tt.Skipf(\"TestL0GCBug is DISABLED. TODO(ibrahim): Do we need this?\")\n\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\t// Do not change any of the options below unless it's necessary.\n\topts := getTestOptions(dir)\n\topts.NumLevelZeroTables = 50\n\topts.NumLevelZeroTablesStall = 51\n\topts.ValueLogMaxEntries = 2\n\topts.ValueThreshold = 2\n\t// Setting LoadingMode to mmap seems to cause segmentation fault while closing DB.\n\n\tdb1, err := Open(opts)\n\trequire.NoError(t, err)\n\tkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%10d\", i))\n\t}\n\tval := []byte{1, 1, 1, 1, 1, 1, 1, 1}\n\t// Insert 100 entries. This will create about 50*3 vlog files and 6 SST files.\n\tfor i := 0; i < 3; i++ {\n\t\tfor j := 0; j < 100; j++ {\n\t\t\terr = db1.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.SetEntry(NewEntry(key(j), val))\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}\n\t}\n\t// Run value log GC multiple times. 
This would ensure at least\n\t// one value log file is garbage collected.\n\tsuccess := 0\n\tfor i := 0; i < 10; i++ {\n\t\terr := db1.RunValueLogGC(0.01)\n\t\tif err == nil {\n\t\t\tsuccess++\n\t\t}\n\t\tif err != nil && err != ErrNoRewrite {\n\t\t\tt.Fatalf(err.Error())\n\t\t}\n\t}\n\t// Ensure at least one GC call was successful.\n\trequire.NotZero(t, success)\n\t// CheckKeys reads all the keys previously stored.\n\tcheckKeys := func(db *DB) {\n\t\tfor i := 0; i < 100; i++ {\n\t\t\terr := db.View(func(txn *Txn) error {\n\t\t\t\titem, err := txn.Get(key(i))\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\tval1 := getItemValue(t, item)\n\t\t\t\trequire.Equal(t, val, val1)\n\t\t\t\treturn nil\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}\n\t}\n\n\tcheckKeys(db1)\n\t// Simulate a crash by not closing db1 but releasing the locks.\n\tif db1.dirLockGuard != nil {\n\t\trequire.NoError(t, db1.dirLockGuard.release())\n\t\tdb1.dirLockGuard = nil\n\t}\n\tif db1.valueDirGuard != nil {\n\t\trequire.NoError(t, db1.valueDirGuard.release())\n\t\tdb1.valueDirGuard = nil\n\t}\n\trequire.NoError(t, db1.Close())\n\n\tdb2, err := Open(opts)\n\trequire.NoError(t, err)\n\n\t// Ensure we still have all the keys.\n\tcheckKeys(db2)\n\trequire.NoError(t, db2.Close())\n}\n\n// Regression test for https://github.com/dgraph-io/badger/issues/1126\n//\n// The test has 3 steps\n// Step 1 - Create badger data. It is necessary that the value size is\n//\n//\tgreater than valuethreshold. The value log file size after\n//\tthis step is around 170 bytes.\n//\n// Step 2 - Re-open the same badger and simulate a crash. The value log file\n//\n//\tsize after this crash is around 2 GB (we increase the file size to mmap it).\n//\n// Step 3 - Re-open the same badger. 
We should be able to read all the data\n//\n//\tinserted in the first step.\nfunc TestWindowsDataLoss(t *testing.T) {\n\tif runtime.GOOS != \"windows\" {\n\t\tt.Skip(\"The test is only for Windows.\")\n\t}\n\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topt := DefaultOptions(dir).WithSyncWrites(true)\n\topt.ValueThreshold = 32\n\n\tdb, err := Open(opt)\n\trequire.NoError(t, err)\n\tkeyCount := 20\n\tvar keyList [][]byte // Stores all the keys generated.\n\tfor i := 0; i < keyCount; i++ {\n\t\t// It is important that we create different transactions for each request.\n\t\terr := db.Update(func(txn *Txn) error {\n\t\t\tkey := []byte(fmt.Sprintf(\"%d\", i))\n\t\t\tv := []byte(\"barValuebarValuebarValuebarValuebarValue\")\n\t\t\trequire.Greater(t, len(v), db.valueThreshold())\n\n\t\t\t// 32 bytes length and now it's not working\n\t\t\terr := txn.Set(key, v)\n\t\t\trequire.NoError(t, err)\n\t\t\tkeyList = append(keyList, key)\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t}\n\trequire.NoError(t, db.Close())\n\n\tdb, err = Open(opt)\n\trequire.NoError(t, err)\n\t// Return after reading one entry. We're simulating a crash.\n\t// Simulate a crash by not closing db but releasing the locks.\n\tif db.dirLockGuard != nil {\n\t\trequire.NoError(t, db.dirLockGuard.release())\n\t}\n\tif db.valueDirGuard != nil {\n\t\trequire.NoError(t, db.valueDirGuard.release())\n\t}\n\t// Don't use vlog.Close here. We don't want to fix the file size. 
Only un-mmap\n\t// the data so that we can truncate the file during the next vlog.Open.\n\trequire.NoError(t, z.Munmap(db.vlog.filesMap[db.vlog.maxFid].Data))\n\tfor _, f := range db.vlog.filesMap {\n\t\trequire.NoError(t, f.Fd.Close())\n\t}\n\trequire.NoError(t, db.registry.Close())\n\trequire.NoError(t, db.manifest.close())\n\trequire.NoError(t, db.lc.close())\n\n\tdb, err = Open(opt)\n\trequire.NoError(t, err)\n\tdefer db.Close()\n\n\ttxn := db.NewTransaction(false)\n\tdefer txn.Discard()\n\tit := txn.NewIterator(DefaultIteratorOptions)\n\tdefer it.Close()\n\n\tvar result [][]byte // stores all the keys read from the db.\n\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\titem := it.Item()\n\t\tk := item.Key()\n\t\terr := item.Value(func(v []byte) error {\n\t\t\t_ = v\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t\tresult = append(result, k)\n\t}\n\trequire.ElementsMatch(t, keyList, result)\n}\n\nfunc TestDropPrefixWithNoData(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tval := []byte(\"value\")\n\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\trequire.NoError(t, txn.Set([]byte(\"aaa\"), val))\n\t\t\trequire.NoError(t, txn.Set([]byte(\"aab\"), val))\n\t\t\trequire.NoError(t, txn.Set([]byte(\"aba\"), val))\n\t\t\trequire.NoError(t, txn.Set([]byte(\"aca\"), val))\n\t\t\treturn nil\n\t\t}))\n\n\t\t// If we drop prefix, we flush the memtables and create a new mutable memtable. Hence, the\n\t\t// nextMemFid increases by 1. 
But if there does not exist any data for the prefixes, we\n\t\t// don't do that.\n\t\tmemFid := db.nextMemFid\n\t\tprefixes := [][]byte{[]byte(\"bbb\")}\n\t\trequire.NoError(t, db.DropPrefix(prefixes...))\n\t\trequire.Equal(t, memFid, db.nextMemFid)\n\t\tprefixes = [][]byte{[]byte(\"aba\"), []byte(\"bbb\")}\n\t\trequire.NoError(t, db.DropPrefix(prefixes...))\n\t\trequire.Equal(t, memFid+1, db.nextMemFid)\n\t})\n}\n\nfunc TestDropAllDropPrefix(t *testing.T) {\n\tkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%10d\", i))\n\t}\n\tval := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%128d\", i))\n\t}\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\twb := db.NewWriteBatch()\n\t\tdefer wb.Cancel()\n\n\t\tN := 50000\n\n\t\tfor i := 0; i < N; i++ {\n\t\t\trequire.NoError(t, wb.Set(key(i), val(i)))\n\t\t}\n\t\trequire.NoError(t, wb.Flush())\n\n\t\tvar wg sync.WaitGroup\n\t\twg.Add(3)\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\t\t\terr := db.DropPrefix([]byte(\"000\"))\n\t\t\tfor err == ErrBlockedWrites {\n\t\t\t\terr = db.DropPrefix([]byte(\"000\"))\n\t\t\t\ttime.Sleep(time.Millisecond * 500)\n\t\t\t}\n\t\t\trequire.NoError(t, err)\n\t\t}()\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\t\t\terr := db.DropPrefix([]byte(\"111\"))\n\t\t\tfor err == ErrBlockedWrites {\n\t\t\t\terr = db.DropPrefix([]byte(\"111\"))\n\t\t\t\ttime.Sleep(time.Millisecond * 500)\n\t\t\t}\n\t\t\trequire.NoError(t, err)\n\t\t}()\n\t\tgo func() {\n\t\t\ttime.Sleep(time.Millisecond) // Let drop prefix run first.\n\t\t\tdefer wg.Done()\n\t\t\terr := db.DropAll()\n\t\t\tfor err == ErrBlockedWrites {\n\t\t\t\terr = db.DropAll()\n\t\t\t\ttime.Sleep(time.Millisecond * 300)\n\t\t\t}\n\t\t\trequire.NoError(t, err)\n\t\t}()\n\t\twg.Wait()\n\t})\n}\n\nfunc TestIsClosed(t *testing.T) {\n\ttest := func(inMemory bool) {\n\t\topt := DefaultOptions(\"\")\n\t\tif inMemory {\n\t\t\topt.InMemory = true\n\t\t} else {\n\t\t\tdir, err := os.MkdirTemp(\"\", 
\"badger-test\")\n\t\t\trequire.NoError(t, err)\n\t\t\tdefer removeDir(dir)\n\n\t\t\topt.Dir = dir\n\t\t\topt.ValueDir = dir\n\t\t}\n\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\trequire.False(t, db.IsClosed())\n\t\trequire.NoError(t, db.Close())\n\t\trequire.True(t, db.IsClosed())\n\t}\n\n\tt.Run(\"normal\", func(t *testing.T) {\n\t\ttest(false)\n\t})\n\tt.Run(\"in-memory\", func(t *testing.T) {\n\t\ttest(true)\n\t})\n\n}\n\n// This test is failing currently because we're returning version+1 from MaxVersion()\nfunc TestMaxVersion(t *testing.T) {\n\tN := 10000\n\tkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%d%10d\", i, i))\n\t}\n\tt.Run(\"normal\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\t// This will create commits from 1 to N.\n\t\t\tfor i := 0; i < N; i++ {\n\t\t\t\ttxnSet(t, db, key(i), nil, 0)\n\t\t\t}\n\t\t\tver := db.MaxVersion()\n\t\t\trequire.Equal(t, N, int(ver))\n\t\t})\n\t})\n\tt.Run(\"multiple versions\", func(t *testing.T) {\n\t\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\t\tdefer removeDir(dir)\n\n\t\topt := getTestOptions(dir)\n\t\topt.NumVersionsToKeep = 100\n\t\tdb, err := OpenManaged(opt)\n\t\trequire.NoError(t, err)\n\n\t\twb := db.NewManagedWriteBatch()\n\t\tdefer wb.Cancel()\n\n\t\tk := make([]byte, 100)\n\t\trand.Read(k)\n\t\t// Create multiple version of the same key.\n\t\tfor i := 1; i <= N; i++ {\n\t\t\trequire.NoError(t, wb.SetEntryAt(&Entry{Key: k}, uint64(i)))\n\t\t}\n\t\trequire.NoError(t, wb.Flush())\n\n\t\tver := db.MaxVersion()\n\t\trequire.Equal(t, N, int(ver))\n\n\t\trequire.NoError(t, db.Close())\n\t})\n\tt.Run(\"Managed mode\", func(t *testing.T) {\n\t\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\t\tdefer removeDir(dir)\n\n\t\topt := getTestOptions(dir)\n\t\tdb, err := OpenManaged(opt)\n\t\trequire.NoError(t, err)\n\n\t\twb := db.NewManagedWriteBatch()\n\t\tdefer wb.Cancel()\n\n\t\t// 
This will create commits from 1 to N.\n\t\tfor i := 1; i <= N; i++ {\n\t\t\trequire.NoError(t, wb.SetEntryAt(&Entry{Key: []byte(fmt.Sprintf(\"%d\", i))}, uint64(i)))\n\t\t}\n\t\trequire.NoError(t, wb.Flush())\n\n\t\tver := db.MaxVersion()\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, N, int(ver))\n\n\t\trequire.NoError(t, db.Close())\n\t})\n}\n\nfunc TestTxnReadTs(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topt := DefaultOptions(dir)\n\tdb, err := Open(opt)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 0, int(db.orc.readTs()))\n\n\ttxnSet(t, db, []byte(\"foo\"), nil, 0)\n\trequire.Equal(t, 1, int(db.orc.readTs()))\n\trequire.NoError(t, db.Close())\n\trequire.Equal(t, 1, int(db.orc.readTs()))\n\n\tdb, err = Open(opt)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 1, int(db.orc.readTs()))\n}\n\n// This test failed for stream writer with jemalloc and compression enabled.\nfunc TestKeyCount(t *testing.T) {\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\n\twriteSorted := func(db *DB, num uint64) {\n\t\tvalSz := 128\n\t\tvalue := make([]byte, valSz)\n\t\ty.Check2(rand.Read(value))\n\t\tes := 8 + valSz // key size is 8 bytes and value size is valSz\n\n\t\twriter := db.NewStreamWriter()\n\t\trequire.NoError(t, writer.Prepare())\n\n\t\twg := &sync.WaitGroup{}\n\t\twriteCh := make(chan *pb.KVList, 3)\n\t\twriteRange := func(start, end uint64, streamId uint32) {\n\t\t\t// end is not included.\n\t\t\tdefer wg.Done()\n\t\t\tkvs := &pb.KVList{}\n\t\t\tvar sz int\n\t\t\tfor i := start; i < end; i++ {\n\t\t\t\tkey := make([]byte, 8)\n\t\t\t\tbinary.BigEndian.PutUint64(key, i)\n\t\t\t\tkvs.Kv = append(kvs.Kv, &pb.KV{\n\t\t\t\t\tKey:      key,\n\t\t\t\t\tValue:    value,\n\t\t\t\t\tVersion:  1,\n\t\t\t\t\tStreamId: streamId,\n\t\t\t\t})\n\n\t\t\t\tsz += es\n\n\t\t\t\tif sz >= 4<<20 { // 4 MB\n\t\t\t\t\twriteCh <- kvs\n\t\t\t\t\tkvs = 
&pb.KVList{}\n\t\t\t\t\tsz = 0\n\t\t\t\t}\n\t\t\t}\n\t\t\twriteCh <- kvs\n\t\t}\n\n\t\t// Let's create some streams.\n\t\twidth := num / 16\n\t\tstreamID := uint32(0)\n\t\tfor start := uint64(0); start < num; start += width {\n\t\t\tend := start + width\n\t\t\tif end > num {\n\t\t\t\tend = num\n\t\t\t}\n\t\t\tstreamID++\n\t\t\twg.Add(1)\n\t\t\tgo writeRange(start, end, streamID)\n\t\t}\n\t\tgo func() {\n\t\t\twg.Wait()\n\t\t\tclose(writeCh)\n\t\t}()\n\n\t\twrite := func(kvs *pb.KVList) error {\n\t\t\tbuf := z.NewBuffer(1<<20, \"test\")\n\t\t\tdefer func() { require.NoError(t, buf.Release()) }()\n\n\t\t\tfor _, kv := range kvs.Kv {\n\t\t\t\tKVToBuffer(kv, buf)\n\t\t\t}\n\t\t\trequire.NoError(t, writer.Write(buf))\n\t\t\treturn nil\n\t\t}\n\n\t\tfor kvs := range writeCh {\n\t\t\trequire.NoError(t, write(kvs))\n\t\t}\n\t\trequire.NoError(t, writer.Flush())\n\t}\n\n\tN := uint64(10 * 1e6) // 10 million entries\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topt := DefaultOptions(dir).\n\t\tWithBlockCacheSize(100 << 20).\n\t\tWithCompression(options.ZSTD)\n\n\tdb, err := Open(opt)\n\ty.Check(err)\n\tdefer db.Close()\n\twriteSorted(db, N)\n\trequire.NoError(t, db.Close())\n\tt.Logf(\"Writing DONE\\n\")\n\n\t// Read the db\n\tdb2, err := Open(DefaultOptions(dir))\n\ty.Check(err)\n\tdefer db2.Close()\n\tlastKey := -1\n\tcount := 0\n\n\tstreams := make(map[uint32]int)\n\tstream := db2.NewStream()\n\tstream.Send = func(buf *z.Buffer) error {\n\t\tlist, err := BufferToKVList(buf)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tfor _, kv := range list.Kv {\n\t\t\tlast := streams[kv.StreamId]\n\t\t\tkey := binary.BigEndian.Uint64(kv.Key)\n\t\t\t// The following should happen as we're writing sorted data.\n\t\t\tif last > 0 {\n\t\t\t\trequire.Equalf(t, last+1, int(key), \"Expected key: %d, Found Key: %d\", lastKey+1, int(key))\n\t\t\t}\n\t\t\tstreams[kv.StreamId] = int(key)\n\t\t}\n\t\tcount += len(list.Kv)\n\t\treturn 
nil\n\t}\n\trequire.NoError(t, stream.Orchestrate(context.Background()))\n\trequire.Equal(t, N, uint64(count))\n}\n\nfunc TestAssertValueLogIsNotWrittenToOnStartup(t *testing.T) {\n\topt := DefaultOptions(\"\").WithValueLogFileSize(1 << 20).WithValueThreshold(1 << 4)\n\n\tdir, err := os.MkdirTemp(\".\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topenDb := func(readonly bool) *DB {\n\t\topts := &opt\n\t\topts.Dir = dir\n\t\topts.ValueDir = dir\n\t\tif readonly {\n\t\t\topts.ReadOnly = true\n\t\t}\n\n\t\tif opts.InMemory {\n\t\t\topts.Dir = \"\"\n\t\t\topts.ValueDir = \"\"\n\t\t}\n\t\tdb, err := Open(*opts)\n\t\trequire.NoError(t, err)\n\n\t\treturn db\n\t}\n\n\tkey := func(i int) string {\n\t\treturn fmt.Sprintf(\"key%100d\", i)\n\t}\n\n\tassertOnLoadDb := func(db *DB) uint32 {\n\t\tdata := []byte(fmt.Sprintf(\"value%100d\", 1))\n\t\tfor i := 0; i < 20; i++ {\n\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.SetEntry(NewEntry([]byte(key(i)), data))\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}\n\t\treturn db.vlog.maxFid\n\t}\n\n\tlatestVLogFileSize := func(db *DB, vLogId uint32) uint32 {\n\t\treturn db.vlog.filesMap[vLogId].size.Load()\n\t}\n\n\tassertOnReadDb := func(db *DB) {\n\t\tfor i := 0; i < 20; i++ {\n\t\t\terr := db.View(func(txn *Txn) error {\n\t\t\t\titem, err := txn.Get([]byte(key(i)))\n\t\t\t\trequire.NoError(t, err, \"Getting key: %s\", key(i))\n\t\t\t\terr = item.Value(func(v []byte) error {\n\t\t\t\t\t_ = v\n\t\t\t\t\treturn nil\n\t\t\t\t})\n\t\t\t\trequire.NoError(t, err, \"Getting value for the key: %s\", key(i))\n\t\t\t\treturn nil\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}\n\t}\n\n\tdb := openDb(false)\n\tvLogFileSize := latestVLogFileSize(db, assertOnLoadDb(db))\n\tassertOnReadDb(db)\n\n\trequire.NoError(t, db.Sync())\n\trequire.NoError(t, db.Close())\n\n\tdb = openDb(true)\n\tdefer func() {\n\t\trequire.NoError(t, db.Close())\n\t}()\n\n\tassertOnReadDb(db)\n\trequire.Equal(t, 
latestVLogFileSize(db, db.vlog.maxFid), vLogFileSize)\n}\n"
  },
  {
    "path": "db_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"context\"\n\t\"encoding/binary\"\n\t\"flag\"\n\t\"fmt\"\n\t\"math\"\n\t\"math/rand\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"runtime\"\n\t\"sort\"\n\t\"sync\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\n// waitForMessage(ch, expected, count, timeout, t) will block until either\n// `timeout` seconds have occurred or `count` instances of the string `expected`\n// have occurred on the channel `ch`. We log messages or generate errors using `t`.\nfunc waitForMessage(ch chan string, expected string, count int, timeout int, t *testing.T) {\n\tif count <= 0 {\n\t\tt.Logf(\"Will skip waiting for %s since expected count <= 0.\",\n\t\t\texpected)\n\t\treturn\n\t}\n\ttout := time.NewTimer(time.Duration(timeout) * time.Second)\n\tremaining := count\n\tfor {\n\t\tselect {\n\t\tcase curMsg, ok := <-ch:\n\t\t\tif !ok {\n\t\t\t\tt.Errorf(\"Test channel closed while waiting for \"+\n\t\t\t\t\t\"message %s with %d remaining instances expected\",\n\t\t\t\t\texpected, remaining)\n\t\t\t\treturn\n\t\t\t}\n\t\t\tt.Logf(\"Found message: %s\", curMsg)\n\t\t\tif curMsg == expected {\n\t\t\t\tremaining--\n\t\t\t\tif remaining == 0 {\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t}\n\t\tcase <-tout.C:\n\t\t\tt.Errorf(\"Timed out after %d seconds while waiting on test chan \"+\n\t\t\t\t\"for message '%s' with %d remaining instances expected\",\n\t\t\t\ttimeout, expected, remaining)\n\t\t\treturn\n\t\t}\n\t}\n}\n\n// summary is produced when DB is closed. 
Currently it is used only for testing.\ntype summary struct {\n\tfileIDs map[uint64]bool\n}\n\nfunc (s *levelsController) getSummary() *summary {\n\tout := &summary{\n\t\tfileIDs: make(map[uint64]bool),\n\t}\n\tfor _, l := range s.levels {\n\t\tl.getSummary(out)\n\t}\n\treturn out\n}\n\nfunc (s *levelHandler) getSummary(sum *summary) {\n\ts.RLock()\n\tdefer s.RUnlock()\n\tfor _, t := range s.tables {\n\t\tsum.fileIDs[t.ID()] = true\n\t}\n}\n\nfunc (s *DB) validate() error { return s.lc.validate() }\n\nfunc getTestOptions(dir string) Options {\n\topt := DefaultOptions(dir).\n\t\tWithSyncWrites(false).\n\t\tWithLoggingLevel(WARNING)\n\treturn opt\n}\n\nfunc getItemValue(t *testing.T, item *Item) (val []byte) {\n\tt.Helper()\n\tvar v []byte\n\terr := item.Value(func(val []byte) error {\n\t\tv = append(v, val...)\n\t\treturn nil\n\t})\n\tif err != nil {\n\t\tt.Error(err)\n\t}\n\tif v == nil {\n\t\treturn nil\n\t}\n\tanother, err := item.ValueCopy(nil)\n\trequire.NoError(t, err)\n\trequire.Equal(t, v, another)\n\treturn v\n}\n\nfunc txnSet(t *testing.T, kv *DB, key []byte, val []byte, meta byte) {\n\ttxn := kv.NewTransaction(true)\n\trequire.NoError(t, txn.SetEntry(NewEntry(key, val).WithMeta(meta)))\n\trequire.NoError(t, txn.Commit())\n}\n\nfunc txnDelete(t *testing.T, kv *DB, key []byte) {\n\ttxn := kv.NewTransaction(true)\n\trequire.NoError(t, txn.Delete(key))\n\trequire.NoError(t, txn.Commit())\n}\n\n// Opens a badger db and runs a test on it.\nfunc runBadgerTest(t *testing.T, opts *Options, test func(t *testing.T, db *DB)) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\tif opts == nil {\n\t\topts = new(Options)\n\t\t*opts = getTestOptions(dir)\n\t} else {\n\t\topts.Dir = dir\n\t\topts.ValueDir = dir\n\t}\n\n\tif opts.InMemory {\n\t\topts.Dir = \"\"\n\t\topts.ValueDir = \"\"\n\t}\n\tdb, err := Open(*opts)\n\trequire.NoError(t, err)\n\tdefer func() {\n\t\trequire.NoError(t, db.Close())\n\t}()\n\ttest(t, 
db)\n}\n\nfunc TestReverseIterator(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tkey := make([]byte, 6)\n\t\terr := db.Update(func(txn *Txn) error {\n\t\t\tbinary.BigEndian.PutUint16(key, 5)\n\t\t\tbinary.BigEndian.PutUint32(key[2:], 1)\n\t\t\terr1 := txn.Set(key, []byte(\"value1\"))\n\t\t\trequire.NoError(t, err1)\n\n\t\t\tbinary.BigEndian.PutUint32(key[2:], 2)\n\t\t\terr1 = txn.Set(key, []byte(\"value2\"))\n\t\t\trequire.NoError(t, err1)\n\t\t\treturn nil\n\t\t})\n\n\t\trequire.NoError(t, err)\n\n\t\terr = db.View(func(txn *Txn) error {\n\t\t\tsearchBuffer := make([]byte, 3)\n\t\t\tbinary.BigEndian.PutUint16(searchBuffer, 5)\n\t\t\tsearchBuffer[2] = 0xFF\n\n\t\t\titeratorOptions := DefaultIteratorOptions\n\t\t\titeratorOptions.Reverse = true\n\t\t\titeratorOptions.PrefetchValues = false\n\t\t\titeratorOptions.Prefix = searchBuffer\n\t\t\tit := txn.NewIterator(iteratorOptions)\n\t\t\tdefer it.Close()\n\n\t\t\tit.Rewind()\n\t\t\trequire.Equal(t, it.Item().Key(), key)\n\t\t\treturn nil\n\t\t})\n\n\t\trequire.NoError(t, err)\n\t})\n}\n\nfunc TestWrite(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tfor i := 0; i < 100; i++ {\n\t\t\ttxnSet(t, db, []byte(fmt.Sprintf(\"key%d\", i)), []byte(fmt.Sprintf(\"val%d\", i)), 0x00)\n\t\t}\n\t})\n}\n\nfunc TestUpdateAndView(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\terr := db.Update(func(txn *Txn) error {\n\t\t\tfor i := 0; i < 10; i++ {\n\t\t\t\tentry := NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), []byte(fmt.Sprintf(\"val%d\", i)))\n\t\t\t\tif err := txn.SetEntry(entry); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t}\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t\terr = db.View(func(txn *Txn) error {\n\t\t\tfor i := 0; i < 10; i++ {\n\t\t\t\titem, err := txn.Get([]byte(fmt.Sprintf(\"key%d\", i)))\n\t\t\t\tif err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\n\t\t\t\texpected := []byte(fmt.Sprintf(\"val%d\", i))\n\t\t\t\tif 
err := item.Value(func(val []byte) error {\n\t\t\t\t\trequire.Equal(t, expected, val,\n\t\t\t\t\t\t\"Invalid value for key %q. expected: %q, actual: %q\",\n\t\t\t\t\t\titem.Key(), expected, val)\n\t\t\t\t\treturn nil\n\t\t\t\t}); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t}\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t})\n}\n\nfunc TestConcurrentWrite(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t// Not a benchmark. Just a simple test for concurrent writes.\n\t\tn := 20\n\t\tm := 500\n\t\tvar wg sync.WaitGroup\n\t\tfor i := 0; i < n; i++ {\n\t\t\twg.Add(1)\n\t\t\tgo func(i int) {\n\t\t\t\tdefer wg.Done()\n\t\t\t\tfor j := 0; j < m; j++ {\n\t\t\t\t\ttxnSet(t, db, []byte(fmt.Sprintf(\"k%05d_%08d\", i, j)),\n\t\t\t\t\t\t[]byte(fmt.Sprintf(\"v%05d_%08d\", i, j)), byte(j%127))\n\t\t\t\t}\n\t\t\t}(i)\n\t\t}\n\t\twg.Wait()\n\n\t\tt.Log(\"Starting iteration\")\n\n\t\topt := IteratorOptions{}\n\t\topt.Reverse = false\n\t\topt.PrefetchSize = 10\n\t\topt.PrefetchValues = true\n\n\t\ttxn := db.NewTransaction(true)\n\t\tit := txn.NewIterator(opt)\n\t\tdefer it.Close()\n\t\tvar i, j int\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\titem := it.Item()\n\t\t\tk := item.Key()\n\t\t\tif k == nil {\n\t\t\t\tbreak // end of iteration.\n\t\t\t}\n\n\t\t\trequire.EqualValues(t, fmt.Sprintf(\"k%05d_%08d\", i, j), string(k))\n\t\t\tv := getItemValue(t, item)\n\t\t\trequire.EqualValues(t, fmt.Sprintf(\"v%05d_%08d\", i, j), string(v))\n\t\t\trequire.Equal(t, item.UserMeta(), byte(j%127))\n\t\t\tj++\n\t\t\tif j == m {\n\t\t\t\ti++\n\t\t\t\tj = 0\n\t\t\t}\n\t\t}\n\t\trequire.EqualValues(t, n, i)\n\t\trequire.EqualValues(t, 0, j)\n\t})\n}\n\nfunc TestGet(t *testing.T) {\n\ttest := func(t *testing.T, db *DB) {\n\t\ttxnSet(t, db, []byte(\"key1\"), []byte(\"val1\"), 0x08)\n\n\t\ttxn := db.NewTransaction(false)\n\t\titem, err := txn.Get([]byte(\"key1\"))\n\t\trequire.NoError(t, err)\n\t\trequire.EqualValues(t, \"val1\", getItemValue(t, 
item))\n\t\trequire.Equal(t, byte(0x08), item.UserMeta())\n\t\ttxn.Discard()\n\n\t\ttxnSet(t, db, []byte(\"key1\"), []byte(\"val2\"), 0x09)\n\n\t\ttxn = db.NewTransaction(false)\n\t\titem, err = txn.Get([]byte(\"key1\"))\n\t\trequire.NoError(t, err)\n\t\trequire.EqualValues(t, \"val2\", getItemValue(t, item))\n\t\trequire.Equal(t, byte(0x09), item.UserMeta())\n\t\ttxn.Discard()\n\n\t\ttxnDelete(t, db, []byte(\"key1\"))\n\n\t\ttxn = db.NewTransaction(false)\n\t\t_, err = txn.Get([]byte(\"key1\"))\n\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\ttxn.Discard()\n\n\t\ttxnSet(t, db, []byte(\"key1\"), []byte(\"val3\"), 0x01)\n\n\t\ttxn = db.NewTransaction(false)\n\t\titem, err = txn.Get([]byte(\"key1\"))\n\t\trequire.NoError(t, err)\n\t\trequire.EqualValues(t, \"val3\", getItemValue(t, item))\n\t\trequire.Equal(t, byte(0x01), item.UserMeta())\n\n\t\tlongVal := make([]byte, 1000)\n\t\ttxnSet(t, db, []byte(\"key1\"), longVal, 0x00)\n\n\t\ttxn = db.NewTransaction(false)\n\t\titem, err = txn.Get([]byte(\"key1\"))\n\t\trequire.NoError(t, err)\n\t\trequire.EqualValues(t, longVal, getItemValue(t, item))\n\t\ttxn.Discard()\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\ttest(t, db)\n\t\t})\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\topts := DefaultOptions(\"\").WithInMemory(true)\n\t\tdb, err := Open(opts)\n\t\trequire.NoError(t, err)\n\t\ttest(t, db)\n\t\trequire.NoError(t, db.Close())\n\t})\n\tt.Run(\"cache enabled\", func(t *testing.T) {\n\t\topts := DefaultOptions(\"\").WithBlockCacheSize(10 << 20)\n\t\trunBadgerTest(t, &opts, func(t *testing.T, db *DB) {\n\t\t\ttest(t, db)\n\t\t})\n\t})\n}\n\nfunc TestGetAfterDelete(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t// populate with one entry\n\t\tkey := []byte(\"key\")\n\t\ttxnSet(t, db, key, []byte(\"val1\"), 0x00)\n\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\terr := 
txn.Delete(key)\n\t\t\trequire.NoError(t, err)\n\n\t\t\t_, err = txn.Get(key)\n\t\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\t\treturn nil\n\t\t}))\n\t})\n}\n\nfunc TestTxnTooBig(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tdata := func(i int) []byte {\n\t\t\treturn []byte(fmt.Sprintf(\"%b\", i))\n\t\t}\n\t\t//\tn := 500000\n\t\tn := 1000\n\t\ttxn := db.NewTransaction(true)\n\t\tfor i := 0; i < n; {\n\t\t\tif err := txn.SetEntry(NewEntry(data(i), data(i))); err != nil {\n\t\t\t\trequire.NoError(t, txn.Commit())\n\t\t\t\ttxn = db.NewTransaction(true)\n\t\t\t} else {\n\t\t\t\ti++\n\t\t\t}\n\t\t}\n\t\trequire.NoError(t, txn.Commit())\n\n\t\ttxn = db.NewTransaction(true)\n\t\tfor i := 0; i < n; {\n\t\t\tif err := txn.Delete(data(i)); err != nil {\n\t\t\t\trequire.NoError(t, txn.Commit())\n\t\t\t\ttxn = db.NewTransaction(true)\n\t\t\t} else {\n\t\t\t\ti++\n\t\t\t}\n\t\t}\n\t\trequire.NoError(t, txn.Commit())\n\t})\n}\n\nfunc TestForceCompactL0(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\t// This test relies on CompactL0OnClose\n\topts := getTestOptions(dir).WithCompactL0OnClose(true)\n\topts.ValueLogFileSize = 15 << 20\n\topts.managedTxns = true\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\n\tdata := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%b\", i))\n\t}\n\tn := 80\n\tm := 45 // Increasing would cause ErrTxnTooBig\n\tsz := 32 << 10\n\tv := make([]byte, sz)\n\tfor i := 0; i < n; i += 2 {\n\t\tversion := uint64(i)\n\t\ttxn := db.NewTransactionAt(version, true)\n\t\tfor j := 0; j < m; j++ {\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(data(j), v)))\n\t\t}\n\t\trequire.NoError(t, txn.CommitAt(version+1, nil))\n\t}\n\tdb.Close()\n\n\topts.managedTxns = true\n\tdb, err = Open(opts)\n\trequire.NoError(t, err)\n\trequire.Equal(t, len(db.lc.levels[0].tables), 0)\n\trequire.NoError(t, db.Close())\n}\n\nfunc TestStreamDB(t *testing.T) {\n\tcheck 
:= func(db *DB) {\n\t\tfor i := 0; i < 100; i++ {\n\t\t\tkey := []byte(fmt.Sprintf(\"key%d\", i))\n\t\t\tval := []byte(fmt.Sprintf(\"val%d\", i))\n\t\t\ttxn := db.NewTransactionAt(1, false)\n\t\t\titem, err := txn.Get(key)\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.EqualValues(t, val, getItemValue(t, item))\n\t\t\trequire.Equal(t, byte(0x00), item.UserMeta())\n\t\t\ttxn.Discard()\n\t\t}\n\t}\n\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir).\n\t\tWithCompression(options.ZSTD).\n\t\tWithBlockCacheSize(100 << 20)\n\n\tdb, err := OpenManaged(opts)\n\trequire.NoError(t, err)\n\tdefer func() {\n\t\trequire.NoError(t, db.Close())\n\t}()\n\n\twriter := db.NewManagedWriteBatch()\n\tfor i := 0; i < 100; i++ {\n\t\tkey := []byte(fmt.Sprintf(\"key%d\", i))\n\t\tval := []byte(fmt.Sprintf(\"val%d\", i))\n\t\trequire.NoError(t, writer.SetEntryAt(NewEntry(key, val).WithMeta(0x00), 1))\n\t}\n\trequire.NoError(t, writer.Flush())\n\tcheck(db)\n\n\toutDir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\toutOpt := getTestOptions(outDir)\n\trequire.NoError(t, db.StreamDB(outOpt))\n\n\toutDB, err := OpenManaged(outOpt)\n\trequire.NoError(t, err)\n\tdefer func() {\n\t\trequire.NoError(t, outDB.Close())\n\t}()\n\tcheck(outDB)\n}\n\nfunc dirSize(path string) (int64, error) {\n\tvar size int64\n\terr := filepath.Walk(path, func(_ string, info os.FileInfo, err error) error {\n\t\tif err != nil {\n\t\t\tif os.IsNotExist(err) {\n\t\t\t\treturn nil\n\t\t\t}\n\t\t\treturn err\n\t\t}\n\t\tif !info.IsDir() {\n\t\t\tsize += info.Size()\n\t\t}\n\t\treturn err\n\t})\n\treturn (size >> 20), err\n}\n\n// BenchmarkDbGrowth ensures DB does not grow with repeated adds and deletes.\n//\n// New keys are created with each for-loop iteration. 
During each\n// iteration, the previous for-loop iteration's keys are deleted.\n//\n// To reproduce continuous growth problem due to `badgerMove` keys,\n// update `value.go` `discardEntry` line 1628 to return false\n//\n// Also with PR #1303, the delete keys are properly cleaned which\n// further reduces disk size.\nfunc BenchmarkDbGrowth(b *testing.B) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(b, err)\n\tdefer removeDir(dir)\n\n\tstart := 0\n\tlastStart := 0\n\tnumKeys := 2000\n\tvalueSize := 1024\n\tvalue := make([]byte, valueSize)\n\n\tdiscardRatio := 0.001\n\tmaxWrites := 200\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 64 << 15\n\topts.BaseTableSize = 4 << 15\n\topts.BaseLevelSize = 16 << 15\n\topts.NumVersionsToKeep = 1\n\topts.NumLevelZeroTables = 1\n\topts.NumLevelZeroTablesStall = 2\n\topts.ValueThreshold = 1024\n\topts.MemTableSize = 1 << 20\n\tdb, err := Open(opts)\n\trequire.NoError(b, err)\n\tfor numWrites := 0; numWrites < maxWrites; numWrites++ {\n\t\ttxn := db.NewTransaction(true)\n\t\tif start > 0 {\n\t\t\tfor i := lastStart; i < start; i++ {\n\t\t\t\tkey := make([]byte, 8)\n\t\t\t\tbinary.BigEndian.PutUint64(key[:], uint64(i))\n\t\t\t\terr := txn.Delete(key)\n\t\t\t\tif err == ErrTxnTooBig {\n\t\t\t\t\trequire.NoError(b, txn.Commit())\n\t\t\t\t\ttxn = db.NewTransaction(true)\n\t\t\t\t} else {\n\t\t\t\t\trequire.NoError(b, err)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\n\t\tfor i := start; i < numKeys+start; i++ {\n\t\t\tkey := make([]byte, 8)\n\t\t\tbinary.BigEndian.PutUint64(key[:], uint64(i))\n\t\t\terr := txn.SetEntry(NewEntry(key, value))\n\t\t\tif err == ErrTxnTooBig {\n\t\t\t\trequire.NoError(b, txn.Commit())\n\t\t\t\ttxn = db.NewTransaction(true)\n\t\t\t} else {\n\t\t\t\trequire.NoError(b, err)\n\t\t\t}\n\t\t}\n\t\trequire.NoError(b, txn.Commit())\n\t\trequire.NoError(b, db.Flatten(1))\n\t\tfor {\n\t\t\terr = db.RunValueLogGC(discardRatio)\n\t\t\tif err == ErrNoRewrite {\n\t\t\t\tbreak\n\t\t\t} else 
{\n\t\t\t\trequire.NoError(b, err)\n\t\t\t}\n\t\t}\n\t\tsize, err := dirSize(dir)\n\t\trequire.NoError(b, err)\n\t\tfmt.Printf(\"Badger DB Size = %dMB\\n\", size)\n\t\tlastStart = start\n\t\tstart += numKeys\n\t}\n\n\tdb.Close()\n\tsize, err := dirSize(dir)\n\trequire.NoError(b, err)\n\trequire.LessOrEqual(b, size, int64(16))\n\tfmt.Printf(\"Badger DB Size = %dMB\\n\", size)\n}\n\n// Put a lot of data to move some data to disk.\n// WARNING: This test might take a while but it should pass!\nfunc TestGetMore(t *testing.T) {\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tdata := func(i int) []byte {\n\t\t\treturn []byte(fmt.Sprintf(\"%b\", i))\n\t\t}\n\t\tn := 200000\n\t\tm := 45 // Increasing would cause ErrTxnTooBig\n\t\tfor i := 0; i < n; i += m {\n\t\t\tif (i % 10000) == 0 {\n\t\t\t\tfmt.Printf(\"Inserting i=%d\\n\", i)\n\t\t\t}\n\t\t\ttxn := db.NewTransaction(true)\n\t\t\tfor j := i; j < i+m && j < n; j++ {\n\t\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(data(j), data(j))))\n\t\t\t}\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t}\n\t\trequire.NoError(t, db.validate())\n\n\t\tfor i := 0; i < n; i++ {\n\t\t\ttxn := db.NewTransaction(false)\n\t\t\titem, err := txn.Get(data(i))\n\t\t\tif err != nil {\n\t\t\t\tt.Error(err)\n\t\t\t}\n\t\t\trequire.EqualValues(t, string(data(i)), string(getItemValue(t, item)))\n\t\t\ttxn.Discard()\n\t\t}\n\n\t\t// Overwrite\n\t\tfor i := 0; i < n; i += m {\n\t\t\ttxn := db.NewTransaction(true)\n\t\t\tfor j := i; j < i+m && j < n; j++ {\n\t\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(data(j),\n\t\t\t\t\t// Use a long value that will certainly exceed value threshold.\n\t\t\t\t\t[]byte(fmt.Sprintf(\"zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz%9d\", j)))))\n\t\t\t}\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t}\n\t\trequire.NoError(t, db.validate())\n\n\t\tfor i := 0; i < n; i++ {\n\t\t\texpectedValue := 
fmt.Sprintf(\"zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz%9d\", i)\n\t\t\tk := data(i)\n\t\t\ttxn := db.NewTransaction(false)\n\t\t\titem, err := txn.Get(k)\n\t\t\tif err != nil {\n\t\t\t\tt.Error(err)\n\t\t\t}\n\t\t\tgot := string(getItemValue(t, item))\n\t\t\tif expectedValue != got {\n\n\t\t\t\tvs, err := db.get(y.KeyWithTs(k, math.MaxUint64))\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\tfmt.Printf(\"wanted=%q Item: %s\\n\", k, item)\n\t\t\t\tfmt.Printf(\"on re-run, got version: %+v\\n\", vs)\n\n\t\t\t\ttxn := db.NewTransaction(false)\n\t\t\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\t\t\tfor itr.Seek(k); itr.Valid(); itr.Next() {\n\t\t\t\t\titem := itr.Item()\n\t\t\t\t\tfmt.Printf(\"item=%s\\n\", item)\n\t\t\t\t\tif !bytes.Equal(item.Key(), k) {\n\t\t\t\t\t\tbreak\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\titr.Close()\n\t\t\t\ttxn.Discard()\n\t\t\t}\n\t\t\trequire.EqualValues(t, expectedValue, string(getItemValue(t, item)), \"wanted=%q Item: %s\\n\", k, item)\n\t\t\ttxn.Discard()\n\t\t}\n\n\t\t// \"Delete\" key.\n\t\tfor i := 0; i < n; i += m {\n\t\t\tif (i % 10000) == 0 {\n\t\t\t\tfmt.Printf(\"Deleting i=%d\\n\", i)\n\t\t\t}\n\t\t\ttxn := db.NewTransaction(true)\n\t\t\tfor j := i; j < i+m && j < n; j++ {\n\t\t\t\trequire.NoError(t, txn.Delete(data(j)))\n\t\t\t}\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t}\n\t\trequire.NoError(t, db.validate())\n\t\tfor i := 0; i < n; i++ {\n\t\t\tif (i % 10000) == 0 {\n\t\t\t\t// Display some progress. 
Right now, it's not very fast with no caching.\n\t\t\t\tfmt.Printf(\"Testing i=%d\\n\", i)\n\t\t\t}\n\t\t\tk := data(i)\n\t\t\ttxn := db.NewTransaction(false)\n\t\t\t_, err := txn.Get(k)\n\t\t\trequire.Equal(t, ErrKeyNotFound, err, \"should not have found k: %q\", k)\n\t\t\ttxn.Discard()\n\t\t}\n\t})\n}\n\n// Put a lot of data to move some data to disk.\n// WARNING: This test might take a while but it should pass!\nfunc TestExistsMore(t *testing.T) {\n\ttest := func(t *testing.T, db *DB) {\n\t\t//\tn := 500000\n\t\tn := 10000\n\t\tm := 45\n\t\tfor i := 0; i < n; i += m {\n\t\t\tif (i % 1000) == 0 {\n\t\t\t\tt.Logf(\"Putting i=%d\\n\", i)\n\t\t\t}\n\t\t\ttxn := db.NewTransaction(true)\n\t\t\tfor j := i; j < i+m && j < n; j++ {\n\t\t\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(fmt.Sprintf(\"%09d\", j)),\n\t\t\t\t\t[]byte(fmt.Sprintf(\"%09d\", j)))))\n\t\t\t}\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t}\n\t\trequire.NoError(t, db.validate())\n\n\t\tfor i := 0; i < n; i++ {\n\t\t\tif (i % 1000) == 0 {\n\t\t\t\tfmt.Printf(\"Testing i=%d\\n\", i)\n\t\t\t}\n\t\t\tk := fmt.Sprintf(\"%09d\", i)\n\t\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\t\t_, err := txn.Get([]byte(k))\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\treturn nil\n\t\t\t}))\n\t\t}\n\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\t_, err := txn.Get([]byte(\"non-exists\"))\n\t\t\trequire.Error(t, err)\n\t\t\treturn nil\n\t\t}))\n\n\t\t// \"Delete\" key.\n\t\tfor i := 0; i < n; i += m {\n\t\t\tif (i % 1000) == 0 {\n\t\t\t\tfmt.Printf(\"Deleting i=%d\\n\", i)\n\t\t\t}\n\t\t\ttxn := db.NewTransaction(true)\n\t\t\tfor j := i; j < i+m && j < n; j++ {\n\t\t\t\trequire.NoError(t, txn.Delete([]byte(fmt.Sprintf(\"%09d\", j))))\n\t\t\t}\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t}\n\t\trequire.NoError(t, db.validate())\n\t\tfor i := 0; i < n; i++ {\n\t\t\tif (i % 10000) == 0 {\n\t\t\t\t// Display some progress. 
Right now, it's not very fast with no caching.\n\t\t\t\tfmt.Printf(\"Testing i=%d\\n\", i)\n\t\t\t}\n\t\t\tk := fmt.Sprintf(\"%09d\", i)\n\n\t\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\t\t_, err := txn.Get([]byte(k))\n\t\t\t\trequire.Error(t, err)\n\t\t\t\treturn nil\n\t\t\t}))\n\t\t}\n\t\tfmt.Println(\"Done and closing\")\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\ttest(t, db)\n\t\t})\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\topt := DefaultOptions(\"\").WithInMemory(true)\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\ttest(t, db)\n\t\trequire.NoError(t, db.Close())\n\t})\n}\n\nfunc TestIterate2Basic(t *testing.T) {\n\ttest := func(t *testing.T, db *DB) {\n\t\tbkey := func(i int) []byte {\n\t\t\treturn []byte(fmt.Sprintf(\"%09d\", i))\n\t\t}\n\t\tbval := func(i int) []byte {\n\t\t\treturn []byte(fmt.Sprintf(\"%025d\", i))\n\t\t}\n\n\t\t// n := 500000\n\t\tn := 10000\n\t\tfor i := 0; i < n; i++ {\n\t\t\tif (i % 1000) == 0 {\n\t\t\t\tt.Logf(\"Put i=%d\\n\", i)\n\t\t\t}\n\t\t\ttxnSet(t, db, bkey(i), bval(i), byte(i%127))\n\t\t}\n\n\t\topt := IteratorOptions{}\n\t\topt.PrefetchValues = true\n\t\topt.PrefetchSize = 10\n\n\t\ttxn := db.NewTransaction(false)\n\t\tit := txn.NewIterator(opt)\n\t\t{\n\t\t\tvar count int\n\t\t\trewind := true\n\t\t\tt.Log(\"Starting first basic iteration\")\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\titem := it.Item()\n\t\t\t\tkey := item.Key()\n\t\t\t\tif rewind && count == 5000 {\n\t\t\t\t\t// Rewind would skip /head/ key, and it.Next() would skip 0.\n\t\t\t\t\tcount = 1\n\t\t\t\t\tit.Rewind()\n\t\t\t\t\tt.Log(\"Rewinding from 5000 to zero.\")\n\t\t\t\t\trewind = false\n\t\t\t\t\tcontinue\n\t\t\t\t}\n\t\t\t\trequire.EqualValues(t, bkey(count), string(key))\n\t\t\t\tval := getItemValue(t, item)\n\t\t\t\trequire.EqualValues(t, bval(count), string(val))\n\t\t\t\trequire.Equal(t, byte(count%127), 
item.UserMeta())\n\t\t\t\tcount++\n\t\t\t}\n\t\t\trequire.EqualValues(t, n, count)\n\t\t}\n\n\t\t{\n\t\t\tt.Log(\"Starting second basic iteration\")\n\t\t\tidx := 5030\n\t\t\tfor it.Seek(bkey(idx)); it.Valid(); it.Next() {\n\t\t\t\titem := it.Item()\n\t\t\t\trequire.EqualValues(t, bkey(idx), string(item.Key()))\n\t\t\t\trequire.EqualValues(t, bval(idx), string(getItemValue(t, item)))\n\t\t\t\tidx++\n\t\t\t}\n\t\t}\n\t\tit.Close()\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\ttest(t, db)\n\t\t})\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\topt := DefaultOptions(\"\").WithInMemory(true)\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\ttest(t, db)\n\t\trequire.NoError(t, db.Close())\n\t})\n\n}\n\nfunc TestLoad(t *testing.T) {\n\ttestLoad := func(t *testing.T, opt Options) {\n\t\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\t\tdefer removeDir(dir)\n\t\topt.Dir = dir\n\t\topt.ValueDir = dir\n\t\tn := 10000\n\t\t{\n\t\t\tkv, err := Open(opt)\n\t\t\trequire.NoError(t, err)\n\t\t\tfor i := 0; i < n; i++ {\n\t\t\t\tif (i % 10000) == 0 {\n\t\t\t\t\tfmt.Printf(\"Putting i=%d\\n\", i)\n\t\t\t\t}\n\t\t\t\tk := []byte(fmt.Sprintf(\"%09d\", i))\n\t\t\t\ttxnSet(t, kv, k, k, 0x00)\n\t\t\t}\n\t\t\trequire.Equal(t, 10000, int(kv.orc.readTs()))\n\t\t\tkv.Close()\n\t\t}\n\t\tkv, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, 10000, int(kv.orc.readTs()))\n\n\t\tfor i := 0; i < n; i++ {\n\t\t\tif (i % 10000) == 0 {\n\t\t\t\tfmt.Printf(\"Testing i=%d\\n\", i)\n\t\t\t}\n\t\t\tk := fmt.Sprintf(\"%09d\", i)\n\t\t\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\t\t\titem, err := txn.Get([]byte(k))\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\trequire.EqualValues(t, k, string(getItemValue(t, item)))\n\t\t\t\treturn nil\n\t\t\t}))\n\t\t}\n\t\tkv.Close()\n\t\tsummary := kv.lc.getSummary()\n\n\t\t// Check that files are garbage 
collected.\n\t\tidMap := getIDMap(dir)\n\t\tfor fileID := range idMap {\n\t\t\t// Check that name is in summary.filenames.\n\t\t\trequire.True(t, summary.fileIDs[fileID], \"%d\", fileID)\n\t\t}\n\t\trequire.EqualValues(t, len(idMap), len(summary.fileIDs))\n\n\t\tvar fileIDs []uint64\n\t\tfor k := range summary.fileIDs { // Map to array.\n\t\t\tfileIDs = append(fileIDs, k)\n\t\t}\n\t\tsort.Slice(fileIDs, func(i, j int) bool { return fileIDs[i] < fileIDs[j] })\n\t\tfmt.Printf(\"FileIDs: %v\\n\", fileIDs)\n\t}\n\tt.Run(\"TestLoad Without Encryption/Compression\", func(t *testing.T) {\n\t\topt := getTestOptions(\"\")\n\t\topt.Compression = options.None\n\t\ttestLoad(t, opt)\n\t})\n\tt.Run(\"TestLoad With Encryption and no compression\", func(t *testing.T) {\n\t\tkey := make([]byte, 32)\n\t\t_, err := rand.Read(key)\n\t\trequire.NoError(t, err)\n\t\topt := getTestOptions(\"\")\n\t\topt.EncryptionKey = key\n\t\topt.BlockCacheSize = 100 << 20\n\t\topt.IndexCacheSize = 100 << 20\n\t\topt.Compression = options.None\n\t\ttestLoad(t, opt)\n\t})\n\tt.Run(\"TestLoad With Encryption and compression\", func(t *testing.T) {\n\t\tkey := make([]byte, 32)\n\t\t_, err := rand.Read(key)\n\t\trequire.NoError(t, err)\n\t\topt := getTestOptions(\"\")\n\t\topt.EncryptionKey = key\n\t\topt.Compression = options.ZSTD\n\t\topt.BlockCacheSize = 100 << 20\n\t\topt.IndexCacheSize = 100 << 20\n\t\ttestLoad(t, opt)\n\t})\n\tt.Run(\"TestLoad without Encryption and with compression\", func(t *testing.T) {\n\t\topt := getTestOptions(\"\")\n\t\topt.Compression = options.ZSTD\n\t\topt.BlockCacheSize = 100 << 20\n\t\ttestLoad(t, opt)\n\t})\n}\n\nfunc TestIterateDeleted(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\ttxnSet(t, db, []byte(\"Key1\"), []byte(\"Value1\"), 0x00)\n\t\ttxnSet(t, db, []byte(\"Key2\"), []byte(\"Value2\"), 0x00)\n\n\t\titerOpt := DefaultIteratorOptions\n\t\titerOpt.PrefetchValues = false\n\t\ttxn := db.NewTransaction(false)\n\t\tidxIt := 
txn.NewIterator(iterOpt)\n\t\tdefer idxIt.Close()\n\n\t\tcount := 0\n\t\ttxn2 := db.NewTransaction(true)\n\t\tprefix := []byte(\"Key\")\n\t\tfor idxIt.Seek(prefix); idxIt.ValidForPrefix(prefix); idxIt.Next() {\n\t\t\tkey := idxIt.Item().Key()\n\t\t\tcount++\n\t\t\tnewKey := make([]byte, len(key))\n\t\t\tcopy(newKey, key)\n\t\t\trequire.NoError(t, txn2.Delete(newKey))\n\t\t}\n\t\trequire.Equal(t, 2, count)\n\t\trequire.NoError(t, txn2.Commit())\n\n\t\tfor _, prefetch := range [...]bool{true, false} {\n\t\t\tt.Run(fmt.Sprintf(\"Prefetch=%t\", prefetch), func(t *testing.T) {\n\t\t\t\ttxn := db.NewTransaction(false)\n\t\t\t\titerOpt = DefaultIteratorOptions\n\t\t\t\titerOpt.PrefetchValues = prefetch\n\t\t\t\tidxIt = txn.NewIterator(iterOpt)\n\n\t\t\t\tvar estSize int64\n\t\t\t\tvar idxKeys []string\n\t\t\t\tfor idxIt.Seek(prefix); idxIt.Valid(); idxIt.Next() {\n\t\t\t\t\titem := idxIt.Item()\n\t\t\t\t\tkey := item.Key()\n\t\t\t\t\testSize += item.EstimatedSize()\n\t\t\t\t\tif !bytes.HasPrefix(key, prefix) {\n\t\t\t\t\t\tbreak\n\t\t\t\t\t}\n\t\t\t\t\tidxKeys = append(idxKeys, string(key))\n\t\t\t\t\tt.Logf(\"%+v\\n\", idxIt.Item())\n\t\t\t\t}\n\t\t\t\trequire.Equal(t, 0, len(idxKeys))\n\t\t\t\trequire.Equal(t, int64(0), estSize)\n\t\t\t})\n\t\t}\n\t})\n}\n\nfunc TestIterateParallel(t *testing.T) {\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\tkey := func(account int) []byte {\n\t\tvar b [4]byte\n\t\tbinary.BigEndian.PutUint32(b[:], uint32(account))\n\t\treturn append([]byte(\"account-\"), b[:]...)\n\t}\n\n\tN := 100000\n\titerate := func(txn *Txn, wg *sync.WaitGroup) {\n\t\tdefer wg.Done()\n\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\tdefer itr.Close()\n\n\t\tvar count int\n\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\tcount++\n\t\t\titem := itr.Item()\n\t\t\trequire.Equal(t, \"account-\", string(item.Key()[0:8]))\n\t\t\terr := item.Value(func(val []byte) error {\n\t\t\t\trequire.Equal(t, \"1000\", 
string(val))\n\t\t\t\treturn nil\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}\n\t\trequire.Equal(t, N, count)\n\t\titr.Close() // Double close.\n\t}\n\n\topt := DefaultOptions(\"\")\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tvar wg sync.WaitGroup\n\t\tvar txns []*Txn\n\t\tfor i := 0; i < N; i++ {\n\t\t\twg.Add(1)\n\t\t\ttxn := db.NewTransaction(true)\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(key(i), []byte(\"1000\"))))\n\t\t\ttxns = append(txns, txn)\n\t\t}\n\t\tfor _, txn := range txns {\n\t\t\ttxn.CommitWith(func(err error) {\n\t\t\t\ty.Check(err)\n\t\t\t\twg.Done()\n\t\t\t})\n\t\t}\n\n\t\twg.Wait()\n\n\t\t// Check that a RW txn can run multiple iterators.\n\t\ttxn := db.NewTransaction(true)\n\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\trequire.NotPanics(t, func() {\n\t\t\t// Now that multiple iterators are supported in read-write\n\t\t\t// transactions, make sure this does not panic anymore. Then just\n\t\t\t// close the iterator.\n\t\t\ttxn.NewIterator(DefaultIteratorOptions).Close()\n\t\t})\n\t\t// The transaction should still panic since there is still one pending\n\t\t// iterator that is open.\n\t\trequire.Panics(t, txn.Discard)\n\t\titr.Close()\n\t\ttxn.Discard()\n\n\t\t// (Regression) Make sure that creating multiple concurrent iterators\n\t\t// within a read only transaction continues to work.\n\t\tt.Run(\"multiple read-only iterators\", func(t *testing.T) {\n\t\t\t// Run multiple iterators for a RO txn.\n\t\t\ttxn = db.NewTransaction(false)\n\t\t\tdefer txn.Discard()\n\t\t\twg.Add(3)\n\t\t\tgo iterate(txn, &wg)\n\t\t\tgo iterate(txn, &wg)\n\t\t\tgo iterate(txn, &wg)\n\t\t\twg.Wait()\n\t\t})\n\n\t\t// Make sure that when we create multiple concurrent iterators within a\n\t\t// read-write transaction that it actually iterates successfully.\n\t\tt.Run(\"multiple read-write iterators\", func(t *testing.T) {\n\t\t\t// Run multiple iterators for a RW txn.\n\t\t\ttxn = db.NewTransaction(true)\n\t\t\tdefer 
txn.Discard()\n\t\t\twg.Add(3)\n\t\t\tgo iterate(txn, &wg)\n\t\t\tgo iterate(txn, &wg)\n\t\t\tgo iterate(txn, &wg)\n\t\t\twg.Wait()\n\t\t})\n\t})\n}\n\nfunc TestDeleteWithoutSyncWrite(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\tkv, err := Open(DefaultOptions(dir))\n\tif err != nil {\n\t\tt.Error(err)\n\t\tt.Fail()\n\t}\n\n\tkey := []byte(\"k1\")\n\t// Set a value with size > value threshold so that it's written to the value log.\n\ttxnSet(t, kv, key, []byte(\"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789FOOBARZOGZOG\"), 0x00)\n\ttxnDelete(t, kv, key)\n\tkv.Close()\n\n\t// Reopen KV\n\tkv, err = Open(DefaultOptions(dir))\n\tif err != nil {\n\t\tt.Error(err)\n\t\tt.Fail()\n\t}\n\tdefer kv.Close()\n\n\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\t_, err := txn.Get(key)\n\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\treturn nil\n\t}))\n}\n\nfunc TestPidFile(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t// Reopen database\n\t\t_, err := Open(getTestOptions(db.opt.Dir))\n\t\trequire.Error(t, err)\n\t\trequire.Contains(t, err.Error(), \"Another process is using this Badger database\")\n\t})\n}\n\nfunc TestInvalidKey(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\terr := db.Update(func(txn *Txn) error {\n\t\t\terr := txn.SetEntry(NewEntry([]byte(\"!badger!head\"), nil))\n\t\t\trequire.Equal(t, ErrInvalidKey, err)\n\n\t\t\terr = txn.SetEntry(NewEntry([]byte(\"!badger!\"), nil))\n\t\t\trequire.Equal(t, ErrInvalidKey, err)\n\n\t\t\terr = txn.SetEntry(NewEntry([]byte(\"!badger\"), []byte(\"BadgerDB\")))\n\t\t\trequire.NoError(t, err)\n\t\t\treturn err\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get([]byte(\"!badger\"))\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\trequire.NoError(t, item.Value(func(val []byte) error {\n\t\t\t\trequire.Equal(t, []byte(\"BadgerDB\"), 
val)\n\t\t\t\treturn nil\n\t\t\t}))\n\t\t\treturn nil\n\t\t}))\n\t})\n}\n\nfunc TestIteratorPrefetchSize(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\n\t\tbkey := func(i int) []byte {\n\t\t\treturn []byte(fmt.Sprintf(\"%09d\", i))\n\t\t}\n\t\tbval := func(i int) []byte {\n\t\t\treturn []byte(fmt.Sprintf(\"%025d\", i))\n\t\t}\n\n\t\tn := 100\n\t\tfor i := 0; i < n; i++ {\n\t\t\t// if (i % 10) == 0 {\n\t\t\t// \tt.Logf(\"Put i=%d\\n\", i)\n\t\t\t// }\n\t\t\ttxnSet(t, db, bkey(i), bval(i), byte(i%127))\n\t\t}\n\n\t\tgetIteratorCount := func(prefetchSize int) int {\n\t\t\topt := IteratorOptions{}\n\t\t\topt.PrefetchValues = true\n\t\t\topt.PrefetchSize = prefetchSize\n\n\t\t\tvar count int\n\t\t\ttxn := db.NewTransaction(false)\n\t\t\tit := txn.NewIterator(opt)\n\t\t\t{\n\t\t\t\tt.Log(\"Starting first basic iteration\")\n\t\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\t\tcount++\n\t\t\t\t}\n\t\t\t\trequire.EqualValues(t, n, count)\n\t\t\t}\n\t\t\treturn count\n\t\t}\n\n\t\tvar sizes = []int{-10, 0, 1, 10}\n\t\tfor _, size := range sizes {\n\t\t\tc := getIteratorCount(size)\n\t\t\trequire.Equal(t, 100, c)\n\t\t}\n\t})\n}\n\nfunc TestSetIfAbsentAsync(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\tkv, _ := Open(getTestOptions(dir))\n\n\tbkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%09d\", i))\n\t}\n\n\tf := func(err error) {}\n\n\tn := 1000\n\tfor i := 0; i < n; i++ {\n\t\t// if (i % 10) == 0 {\n\t\t// \tt.Logf(\"Put i=%d\\n\", i)\n\t\t// }\n\t\ttxn := kv.NewTransaction(true)\n\t\t_, err = txn.Get(bkey(i))\n\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(bkey(i), nil).WithMeta(byte(i%127))))\n\t\ttxn.CommitWith(f)\n\t}\n\n\trequire.NoError(t, kv.Close())\n\tkv, err = Open(getTestOptions(dir))\n\trequire.NoError(t, err)\n\n\topt := DefaultIteratorOptions\n\ttxn := kv.NewTransaction(false)\n\tvar count 
int\n\tit := txn.NewIterator(opt)\n\t{\n\t\tt.Log(\"Starting first basic iteration\")\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\tcount++\n\t\t}\n\t\trequire.EqualValues(t, n, count)\n\t}\n\trequire.Equal(t, n, count)\n\trequire.NoError(t, kv.Close())\n}\n\nfunc TestGetSetRace(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\n\t\tdata := make([]byte, 4096)\n\t\t_, err := rand.Read(data)\n\t\trequire.NoError(t, err)\n\n\t\tvar (\n\t\t\tnumOp = 100\n\t\t\twg    sync.WaitGroup\n\t\t\tkeyCh = make(chan string)\n\t\t)\n\n\t\t// writer\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\tdefer func() {\n\t\t\t\twg.Done()\n\t\t\t\tclose(keyCh)\n\t\t\t}()\n\n\t\t\tfor i := 0; i < numOp; i++ {\n\t\t\t\tkey := fmt.Sprintf(\"%d\", i)\n\t\t\t\ttxnSet(t, db, []byte(key), data, 0x00)\n\t\t\t\tkeyCh <- key\n\t\t\t}\n\t\t}()\n\n\t\t// reader\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\n\t\t\tfor key := range keyCh {\n\t\t\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\t\t\titem, err := txn.Get([]byte(key))\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\terr = item.Value(nil)\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\treturn nil\n\t\t\t\t}))\n\t\t\t}\n\t\t}()\n\n\t\twg.Wait()\n\t})\n}\n\nfunc TestDiscardVersionsBelow(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t// Write 4 versions of the same key\n\t\tfor i := 0; i < 4; i++ {\n\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.SetEntry(NewEntry([]byte(\"answer\"), []byte(fmt.Sprintf(\"%d\", i))))\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}\n\n\t\topts := DefaultIteratorOptions\n\t\topts.AllVersions = true\n\t\topts.PrefetchValues = false\n\n\t\t// Verify that there are 4 versions, and record 3rd version (2nd from top in iteration)\n\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\tit := txn.NewIterator(opts)\n\t\t\tdefer it.Close()\n\t\t\tvar count int\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\tcount++\n\t\t\t\titem 
:= it.Item()\n\t\t\t\trequire.Equal(t, []byte(\"answer\"), item.Key())\n\t\t\t\tif item.DiscardEarlierVersions() {\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t}\n\t\t\trequire.Equal(t, 4, count)\n\t\t\treturn nil\n\t\t}))\n\n\t\t// Set new version and discard older ones.\n\t\terr := db.Update(func(txn *Txn) error {\n\t\t\treturn txn.SetEntry(NewEntry([]byte(\"answer\"), []byte(\"5\")).WithDiscard())\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t\t// Verify that there are only 2 versions left, and versions\n\t\t// below ts have been deleted.\n\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\tit := txn.NewIterator(opts)\n\t\t\tdefer it.Close()\n\t\t\tvar count int\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\tcount++\n\t\t\t\titem := it.Item()\n\t\t\t\trequire.Equal(t, []byte(\"answer\"), item.Key())\n\t\t\t\tif item.DiscardEarlierVersions() {\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t}\n\t\t\trequire.Equal(t, 1, count)\n\t\t\treturn nil\n\t\t}))\n\t})\n}\n\nfunc TestExpiry(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t// Write two keys, one with a TTL\n\t\terr := db.Update(func(txn *Txn) error {\n\t\t\treturn txn.SetEntry(NewEntry([]byte(\"answer1\"), []byte(\"42\")))\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t\terr = db.Update(func(txn *Txn) error {\n\t\t\treturn txn.SetEntry(NewEntry([]byte(\"answer2\"), []byte(\"43\")).WithTTL(1 * time.Second))\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t\ttime.Sleep(2 * time.Second)\n\n\t\t// Verify that only unexpired key is found during iteration\n\t\terr = db.View(func(txn *Txn) error {\n\t\t\t_, err := txn.Get([]byte(\"answer1\"))\n\t\t\trequire.NoError(t, err)\n\n\t\t\t_, err = txn.Get([]byte(\"answer2\"))\n\t\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t\t// Verify that only one key is found during iteration\n\t\topts := DefaultIteratorOptions\n\t\topts.PrefetchValues = false\n\t\terr = db.View(func(txn *Txn) error {\n\t\t\tit := 
txn.NewIterator(opts)\n\t\t\tdefer it.Close()\n\t\t\tvar count int\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\tcount++\n\t\t\t\titem := it.Item()\n\t\t\t\trequire.Equal(t, []byte(\"answer1\"), item.Key())\n\t\t\t}\n\t\t\trequire.Equal(t, 1, count)\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t})\n}\n\nfunc TestExpiryImproperDBClose(t *testing.T) {\n\ttestReplay := func(opt Options) {\n\t\t// L0 compaction doesn't affect the test in any way. It is set to allow\n\t\t// graceful shutdown of db0.\n\t\tdb0, err := Open(opt.WithCompactL0OnClose(false))\n\t\trequire.NoError(t, err)\n\n\t\tdur := 1 * time.Hour\n\t\texpiryTime := uint64(time.Now().Add(dur).Unix())\n\t\terr = db0.Update(func(txn *Txn) error {\n\t\t\terr = txn.SetEntry(NewEntry([]byte(\"test_key\"), []byte(\"test_value\")).WithTTL(dur))\n\t\t\trequire.NoError(t, err)\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t\t// Simulate a crash  by not closing db0, but releasing the locks.\n\t\tif db0.dirLockGuard != nil {\n\t\t\trequire.NoError(t, db0.dirLockGuard.release())\n\t\t\tdb0.dirLockGuard = nil\n\t\t}\n\t\tif db0.valueDirGuard != nil {\n\t\t\trequire.NoError(t, db0.valueDirGuard.release())\n\t\t\tdb0.valueDirGuard = nil\n\t\t}\n\t\trequire.NoError(t, db0.Close())\n\n\t\tdb1, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\terr = db1.View(func(txn *Txn) error {\n\t\t\titm, err := txn.Get([]byte(\"test_key\"))\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.True(t, expiryTime <= itm.ExpiresAt() && itm.ExpiresAt() <= uint64(time.Now().Add(dur).Unix()),\n\t\t\t\t\"expiry time of entry is invalid\")\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t\trequire.NoError(t, db1.Close())\n\t}\n\n\tt.Run(\"Test plain text\", func(t *testing.T) {\n\t\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\t\tdefer removeDir(dir)\n\t\topt := getTestOptions(dir)\n\t\ttestReplay(opt)\n\t})\n\n\tt.Run(\"Test encryption\", func(t *testing.T) {\n\t\tdir, err 
:= os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\t\tdefer removeDir(dir)\n\t\topt := getTestOptions(dir)\n\t\tkey := make([]byte, 32)\n\t\t_, err = rand.Read(key)\n\t\trequire.NoError(t, err)\n\t\topt.EncryptionKey = key\n\t\topt.BlockCacheSize = 10 << 20\n\t\topt.IndexCacheSize = 10 << 20\n\t\ttestReplay(opt)\n\t})\n\n}\n\nfunc randBytes(n int) []byte {\n\trecv := make([]byte, n)\n\tin, err := rand.Read(recv)\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\treturn recv[:in]\n}\n\nvar benchmarkData = []struct {\n\tkey, value []byte\n\tsuccess    bool // represent if KV should be inserted successfully or not\n}{\n\t{randBytes(100), nil, true},\n\t{randBytes(1000), []byte(\"foo\"), true},\n\t{[]byte(\"foo\"), randBytes(1000), true},\n\t{[]byte(\"\"), randBytes(1000), false},\n\t{nil, randBytes(1000000), false},\n\t{randBytes(100000), nil, false},\n\t{randBytes(1000000), nil, false},\n}\n\nfunc TestLargeKeys(t *testing.T) {\n\ttest := func(t *testing.T, opt Options) {\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\tfor i := 0; i < 1000; i++ {\n\t\t\ttx := db.NewTransaction(true)\n\t\t\tfor _, kv := range benchmarkData {\n\t\t\t\tk := make([]byte, len(kv.key))\n\t\t\t\tcopy(k, kv.key)\n\n\t\t\t\tv := make([]byte, len(kv.value))\n\t\t\t\tcopy(v, kv.value)\n\t\t\t\tif err := tx.SetEntry(NewEntry(k, v)); err != nil {\n\t\t\t\t\t// check if success should be true\n\t\t\t\t\tif kv.success {\n\t\t\t\t\t\tt.Fatalf(\"failed with: %s\", err)\n\t\t\t\t\t}\n\t\t\t\t} else if !kv.success {\n\t\t\t\t\tt.Fatal(\"insertion should fail\")\n\t\t\t\t}\n\t\t\t}\n\t\t\tif err := tx.Commit(); err != nil {\n\t\t\t\tt.Fatalf(\"#%d: batchSet err: %v\", i, err)\n\t\t\t}\n\t\t}\n\t\trequire.NoError(t, db.Close())\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\t\tdefer removeDir(dir)\n\t\topt := DefaultOptions(dir).WithValueLogFileSize(1024 * 1024 * 1024)\n\t\ttest(t, 
opt)\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\topt := DefaultOptions(\"\").WithValueLogFileSize(1024 * 1024 * 1024)\n\t\topt.InMemory = true\n\t\ttest(t, opt)\n\t})\n}\n\nfunc TestCreateDirs(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"parent\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := Open(DefaultOptions(filepath.Join(dir, \"badger\")))\n\trequire.NoError(t, err)\n\trequire.NoError(t, db.Close())\n\t_, err = os.Stat(dir)\n\trequire.NoError(t, err)\n}\n\nfunc TestGetSetDeadlock(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\tfmt.Println(dir)\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := Open(DefaultOptions(dir).WithValueLogFileSize(1 << 20))\n\trequire.NoError(t, err)\n\tdefer func() { require.NoError(t, db.Close()) }()\n\n\tval := make([]byte, 1<<19)\n\tkey := []byte(\"key1\")\n\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\trand.Read(val)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(key, val)))\n\t\treturn nil\n\t}))\n\n\ttimeout, done := time.After(10*time.Second), make(chan bool)\n\n\tgo func() {\n\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\titem, err := txn.Get(key)\n\t\t\trequire.NoError(t, err)\n\t\t\terr = item.Value(nil) // This takes an RLock on the file\n\t\t\trequire.NoError(t, err)\n\n\t\t\trand.Read(val)\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(key, val)))\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(\"key2\"), val)))\n\t\t\treturn nil\n\t\t}))\n\t\tdone <- true\n\t}()\n\n\tselect {\n\tcase <-timeout:\n\t\tt.Fatal(\"db.Update did not finish within 10s, assuming deadlock.\")\n\tcase <-done:\n\t\tt.Log(\"db.Update finished.\")\n\t}\n}\n\nfunc TestWriteDeadlock(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := Open(DefaultOptions(dir).WithValueLogFileSize(10 << 20))\n\trequire.NoError(t, err)\n\tdefer func() { require.NoError(t, 
db.Close()) }()\n\tprint := func(count *int) {\n\t\t*count++\n\t\tif *count%100 == 0 {\n\t\t\tfmt.Printf(\"%05d\\r\", *count)\n\t\t}\n\t}\n\n\tvar count int\n\tval := make([]byte, 10000)\n\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\tfor i := 0; i < 1000; i++ {\n\t\t\tkey := fmt.Sprintf(\"%d\", i)\n\t\t\trand.Read(val)\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(key), val)))\n\t\t\tprint(&count)\n\t\t}\n\t\treturn nil\n\t}))\n\n\tcount = 0\n\tfmt.Println(\"\\nWrites done. Iteration and updates starting...\")\n\terr = db.Update(func(txn *Txn) error {\n\t\topt := DefaultIteratorOptions\n\t\topt.PrefetchValues = false\n\t\tit := txn.NewIterator(opt)\n\t\tdefer it.Close()\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\titem := it.Item()\n\n\t\t\t// Using Value() would cause deadlock.\n\t\t\t// item.Value()\n\t\t\tout, err := item.ValueCopy(nil)\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, len(val), len(out))\n\n\t\t\tkey := y.Copy(item.Key())\n\t\t\trand.Read(val)\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(key, val)))\n\t\t\tprint(&count)\n\t\t}\n\t\treturn nil\n\t})\n\trequire.NoError(t, err)\n}\n\nfunc TestSequence(t *testing.T) {\n\tkey0 := []byte(\"seq0\")\n\tkey1 := []byte(\"seq1\")\n\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tseq0, err := db.GetSequence(key0, 10)\n\t\trequire.NoError(t, err)\n\t\tseq1, err := db.GetSequence(key1, 100)\n\t\trequire.NoError(t, err)\n\n\t\tfor i := uint64(0); i < uint64(105); i++ {\n\t\t\tnum, err := seq0.Next()\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, i, num)\n\n\t\t\tnum, err = seq1.Next()\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, i, num)\n\t\t}\n\t\terr = db.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get(key0)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tvar num0 uint64\n\t\t\tif err := item.Value(func(val []byte) error {\n\t\t\t\tnum0 = binary.BigEndian.Uint64(val)\n\t\t\t\treturn nil\n\t\t\t}); err != nil 
{\n\t\t\t\treturn err\n\t\t\t}\n\t\t\trequire.Equal(t, uint64(110), num0)\n\n\t\t\titem, err = txn.Get(key1)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tvar num1 uint64\n\t\t\tif err := item.Value(func(val []byte) error {\n\t\t\t\tnum1 = binary.BigEndian.Uint64(val)\n\t\t\t\treturn nil\n\t\t\t}); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\trequire.Equal(t, uint64(200), num1)\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t})\n}\n\nfunc TestSequence_Release(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t// get sequence, use once and release\n\t\tkey := []byte(\"key\")\n\t\tseq, err := db.GetSequence(key, 1000)\n\t\trequire.NoError(t, err)\n\t\tnum, err := seq.Next()\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, uint64(0), num)\n\t\trequire.NoError(t, seq.Release())\n\n\t\t// we used up 0 and 1 should be stored now\n\t\terr = db.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get(key)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tval, err := item.ValueCopy(nil)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\trequire.Equal(t, num+1, binary.BigEndian.Uint64(val))\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t\t// using it again will lease 1+1000\n\t\tnum, err = seq.Next()\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, uint64(1), num)\n\t\terr = db.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get(key)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tval, err := item.ValueCopy(nil)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\trequire.Equal(t, uint64(1001), binary.BigEndian.Uint64(val))\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t})\n}\n\nfunc TestTestSequence2(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tkey := []byte(\"key\")\n\t\tseq1, err := db.GetSequence(key, 2)\n\t\trequire.NoError(t, err)\n\n\t\tseq2, err := db.GetSequence(key, 2)\n\t\trequire.NoError(t, err)\n\t\tnum, err := 
seq2.Next()\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, uint64(2), num)\n\n\t\trequire.NoError(t, seq2.Release())\n\t\trequire.NoError(t, seq1.Release())\n\n\t\tseq3, err := db.GetSequence(key, 2)\n\t\trequire.NoError(t, err)\n\t\tfor i := 0; i < 5; i++ {\n\t\t\tnum2, err := seq3.Next()\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, uint64(i)+3, num2)\n\t\t}\n\n\t\trequire.NoError(t, seq3.Release())\n\t})\n}\n\nfunc TestReadOnly(t *testing.T) {\n\tt.Skipf(\"TODO: ReadOnly needs truncation, so this fails\")\n\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\n\t// Create the DB\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\tfor i := 0; i < 10000; i++ {\n\t\ttxnSet(t, db, []byte(fmt.Sprintf(\"key%d\", i)), []byte(fmt.Sprintf(\"value%d\", i)), 0x00)\n\t}\n\n\t// Attempt a read-only open while it's open read-write.\n\topts.ReadOnly = true\n\t_, err = Open(opts)\n\trequire.Error(t, err)\n\tif err == ErrWindowsNotSupported {\n\t\trequire.NoError(t, db.Close())\n\t\treturn\n\t}\n\trequire.Contains(t, err.Error(), \"Another process is using this Badger database\")\n\tdb.Close()\n\n\t// Open one read-only\n\topts.ReadOnly = true\n\tkv1, err := Open(opts)\n\trequire.NoError(t, err)\n\tdefer kv1.Close()\n\n\t// Open another read-only\n\tkv2, err := Open(opts)\n\trequire.NoError(t, err)\n\tdefer kv2.Close()\n\n\t// Attempt a read-write open while it's open for read-only\n\topts.ReadOnly = false\n\t_, err = Open(opts)\n\trequire.Error(t, err)\n\trequire.Contains(t, err.Error(), \"Another process is using this Badger database\")\n\n\t// Get a thing from the DB\n\ttxn1 := kv1.NewTransaction(true)\n\tv1, err := txn1.Get([]byte(\"key1\"))\n\trequire.NoError(t, err)\n\tb1, err := v1.ValueCopy(nil)\n\trequire.NoError(t, err)\n\trequire.Equal(t, b1, []byte(\"value1\"))\n\terr = txn1.Commit()\n\trequire.NoError(t, err)\n\n\t// Get a thing from the DB via the other 
connection\n\ttxn2 := kv2.NewTransaction(true)\n\tv2, err := txn2.Get([]byte(\"key2000\"))\n\trequire.NoError(t, err)\n\tb2, err := v2.ValueCopy(nil)\n\trequire.NoError(t, err)\n\trequire.Equal(t, b2, []byte(\"value2000\"))\n\terr = txn2.Commit()\n\trequire.NoError(t, err)\n\n\t// Attempt to set a value on a read-only connection\n\ttxn := kv1.NewTransaction(true)\n\terr = txn.SetEntry(NewEntry([]byte(\"key\"), []byte(\"value\")))\n\trequire.Error(t, err)\n\trequire.Contains(t, err.Error(), \"No sets or deletes are allowed in a read-only transaction\")\n\terr = txn.Commit()\n\trequire.NoError(t, err)\n}\n\nfunc TestLSMOnly(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topts := LSMOnlyOptions(dir)\n\tdopts := DefaultOptions(dir)\n\n\tdopts.ValueThreshold = 1 << 21\n\t_, err = Open(dopts)\n\trequire.Contains(t, err.Error(), \"Invalid ValueThreshold\")\n\n\t// Also test for the error when ValueThreshold is greater than maxBatchSize.\n\tdopts.ValueThreshold = LSMOnlyOptions(dir).ValueThreshold\n\t// maxBatchSize is calculated from MemTableSize.\n\tdopts.MemTableSize = LSMOnlyOptions(dir).ValueThreshold\n\t_, err = Open(dopts)\n\trequire.Error(t, err, \"db creation should have failed\")\n\trequire.Contains(t, err.Error(),\n\t\tfmt.Sprintf(\"Valuethreshold %d greater than max batch size\", dopts.ValueThreshold))\n\n\topts.ValueLogMaxEntries = 100\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\n\tvalue := make([]byte, 128)\n\t_, err = rand.Read(value)\n\tfor i := 0; i < 500; i++ {\n\t\trequire.NoError(t, err)\n\t\ttxnSet(t, db, []byte(fmt.Sprintf(\"key%d\", i)), value, 0x00)\n\t}\n\trequire.NoError(t, db.Close())\n\n}\n\n// This test function is doing some intricate sorcery.\nfunc TestMinReadTs(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tfor i := 0; i < 10; i++ {\n\t\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\t\treturn 
txn.SetEntry(NewEntry([]byte(\"x\"), []byte(\"y\")))\n\t\t\t}))\n\t\t}\n\t\ttime.Sleep(time.Millisecond)\n\n\t\treadTxn0 := db.NewTransaction(false)\n\t\trequire.Equal(t, uint64(10), readTxn0.readTs)\n\n\t\tmin := db.orc.readMark.DoneUntil()\n\t\trequire.Equal(t, uint64(9), min)\n\n\t\treadTxn := db.NewTransaction(false)\n\t\tfor i := 0; i < 10; i++ {\n\t\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.SetEntry(NewEntry([]byte(\"x\"), []byte(\"y\")))\n\t\t\t}))\n\t\t}\n\t\trequire.Equal(t, uint64(20), db.orc.readTs())\n\n\t\ttime.Sleep(time.Millisecond)\n\t\trequire.Equal(t, min, db.orc.readMark.DoneUntil())\n\n\t\treadTxn0.Discard()\n\t\treadTxn.Discard()\n\t\ttime.Sleep(time.Millisecond)\n\t\trequire.Equal(t, uint64(19), db.orc.readMark.DoneUntil())\n\t\tdb.orc.readMark.Done(uint64(20)) // Because we called readTs.\n\n\t\tfor i := 0; i < 10; i++ {\n\t\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\t\treturn nil\n\t\t\t}))\n\t\t}\n\t\ttime.Sleep(time.Millisecond)\n\t\trequire.Equal(t, uint64(20), db.orc.readMark.DoneUntil())\n\t})\n}\n\nfunc TestGoroutineLeak(t *testing.T) {\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\ttest := func(t *testing.T, opt *Options) {\n\t\ttime.Sleep(1 * time.Second)\n\t\tbefore := runtime.NumGoroutine()\n\t\tt.Logf(\"Num go: %d\", before)\n\t\tfor i := 0; i < 12; i++ {\n\t\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\t\tupdated := false\n\t\t\t\tctx, cancel := context.WithCancel(context.Background())\n\t\t\t\tvar wg sync.WaitGroup\n\t\t\t\twg.Add(1)\n\t\t\t\tgo func() {\n\t\t\t\t\tmatch := pb.Match{Prefix: []byte(\"key\"), IgnoreBytes: \"\"}\n\t\t\t\t\terr := db.Subscribe(ctx, func(kvs *pb.KVList) error {\n\t\t\t\t\t\trequire.Equal(t, []byte(\"value\"), kvs.Kv[0].GetValue())\n\t\t\t\t\t\tupdated = true\n\t\t\t\t\t\twg.Done()\n\t\t\t\t\t\treturn nil\n\t\t\t\t\t}, []pb.Match{match})\n\t\t\t\t\tif err != nil 
{\n\t\t\t\t\t\trequire.Equal(t, err.Error(), context.Canceled.Error())\n\t\t\t\t\t}\n\t\t\t\t}()\n\t\t\t\t// Wait for the go routine to be scheduled.\n\t\t\t\ttime.Sleep(time.Second)\n\t\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\t\treturn txn.SetEntry(NewEntry([]byte(\"key\"), []byte(\"value\")))\n\t\t\t\t})\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\twg.Wait()\n\t\t\t\tcancel()\n\t\t\t\trequire.Equal(t, true, updated)\n\t\t\t})\n\t\t}\n\t\ttime.Sleep(2 * time.Second)\n\t\trequire.Equal(t, before, runtime.NumGoroutine())\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\ttest(t, nil)\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\topt := DefaultOptions(\"\").WithInMemory(true)\n\t\ttest(t, &opt)\n\t})\n}\n\nfunc ExampleOpen() {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\tdefer removeDir(dir)\n\tdb, err := Open(DefaultOptions(dir))\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\tdefer func() { y.Check(db.Close()) }()\n\n\terr = db.View(func(txn *Txn) error {\n\t\t_, err := txn.Get([]byte(\"key\"))\n\t\t// We expect ErrKeyNotFound\n\t\tfmt.Println(err)\n\t\treturn nil\n\t})\n\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\n\ttxn := db.NewTransaction(true) // Read-write txn\n\terr = txn.SetEntry(NewEntry([]byte(\"key\"), []byte(\"value\")))\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\terr = txn.Commit()\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\n\terr = db.View(func(txn *Txn) error {\n\t\titem, err := txn.Get([]byte(\"key\"))\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tval, err := item.ValueCopy(nil)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tfmt.Printf(\"%s\\n\", string(val))\n\t\treturn nil\n\t})\n\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\n\t// Output:\n\t// Key not found\n\t// value\n}\n\nfunc ExampleTxn_NewIterator() {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\tdefer removeDir(dir)\n\n\tdb, err := Open(DefaultOptions(dir))\n\tif err != 
nil {\n\t\tpanic(err)\n\t}\n\tdefer func() { y.Check(db.Close()) }()\n\n\tbkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%09d\", i))\n\t}\n\tbval := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%025d\", i))\n\t}\n\n\ttxn := db.NewTransaction(true)\n\n\t// Fill in 1000 items\n\tn := 1000\n\tfor i := 0; i < n; i++ {\n\t\terr := txn.SetEntry(NewEntry(bkey(i), bval(i)))\n\t\tif err != nil {\n\t\t\tpanic(err)\n\t\t}\n\t}\n\n\tif err := txn.Commit(); err != nil {\n\t\tpanic(err)\n\t}\n\n\topt := DefaultIteratorOptions\n\topt.PrefetchSize = 10\n\n\t// Iterate over 1000 items\n\tvar count int\n\terr = db.View(func(txn *Txn) error {\n\t\tit := txn.NewIterator(opt)\n\t\tdefer it.Close()\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\tcount++\n\t\t}\n\t\treturn nil\n\t})\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\tfmt.Printf(\"Counted %d elements\", count)\n\t// Output:\n\t// Counted 1000 elements\n}\n\nfunc TestSyncForRace(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := Open(DefaultOptions(dir).WithSyncWrites(false))\n\trequire.NoError(t, err)\n\tdefer func() { require.NoError(t, db.Close()) }()\n\n\tcloseChan := make(chan struct{})\n\tdoneChan := make(chan struct{})\n\n\tgo func() {\n\t\tticker := time.NewTicker(100 * time.Microsecond)\n\t\tfor {\n\t\t\tselect {\n\t\t\tcase <-ticker.C:\n\t\t\t\tif err := db.Sync(); err != nil {\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t}\n\t\t\t\tdb.opt.Debugf(\"Sync Iteration completed\")\n\t\t\tcase <-closeChan:\n\t\t\t\tclose(doneChan)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t}()\n\n\tsz := 128 << 10 // 5 entries per value log file.\n\tv := make([]byte, sz)\n\trand.Read(v[:rand.Intn(sz)])\n\ttxn := db.NewTransaction(true)\n\tfor i := 0; i < 10000; i++ {\n\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), v)))\n\t\tif i%3 == 0 {\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t\ttxn = 
db.NewTransaction(true)\n\t\t}\n\t\tif i%100 == 0 {\n\t\t\tdb.opt.Debugf(\"next 100 entries added to DB\")\n\t\t}\n\t}\n\trequire.NoError(t, txn.Commit())\n\n\tclose(closeChan)\n\t<-doneChan\n}\n\nfunc TestSyncForNoErrors(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := Open(DefaultOptions(dir).WithSyncWrites(false))\n\trequire.NoError(t, err)\n\tdefer func() { require.NoError(t, db.Close()) }()\n\n\ttxn := db.NewTransaction(true)\n\tfor i := 0; i < 10; i++ {\n\t\trequire.NoError(\n\t\t\tt,\n\t\t\ttxn.SetEntry(NewEntry(\n\t\t\t\t[]byte(fmt.Sprintf(\"key%d\", i)),\n\t\t\t\t[]byte(fmt.Sprintf(\"value%d\", i)),\n\t\t\t)),\n\t\t)\n\t}\n\trequire.NoError(t, txn.Commit())\n\n\tif err := db.Sync(); err != nil {\n\t\trequire.NoError(t, err)\n\t}\n}\n\nfunc TestSyncForReadingTheEntriesThatWereSynced(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := Open(DefaultOptions(dir).WithSyncWrites(false))\n\trequire.NoError(t, err)\n\tdefer func() { require.NoError(t, db.Close()) }()\n\n\ttxn := db.NewTransaction(true)\n\tfor i := 0; i < 10; i++ {\n\t\trequire.NoError(\n\t\t\tt,\n\t\t\ttxn.SetEntry(NewEntry(\n\t\t\t\t[]byte(fmt.Sprintf(\"key%d\", i)),\n\t\t\t\t[]byte(fmt.Sprintf(\"value%d\", i)),\n\t\t\t)),\n\t\t)\n\t}\n\trequire.NoError(t, txn.Commit())\n\n\tif err := db.Sync(); err != nil {\n\t\trequire.NoError(t, err)\n\t}\n\n\treadOnlyTxn := db.NewTransaction(false)\n\tfor i := 0; i < 10; i++ {\n\t\titem, err := readOnlyTxn.Get([]byte(fmt.Sprintf(\"key%d\", i)))\n\t\trequire.NoError(t, err)\n\n\t\tvalue := getItemValue(t, item)\n\t\trequire.Equal(t, []byte(fmt.Sprintf(\"value%d\", i)), value)\n\t}\n}\n\nfunc TestForceFlushMemtable(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err, \"temp dir for badger could not be created\")\n\n\tops := 
getTestOptions(dir)\n\tops.ValueLogMaxEntries = 1\n\n\tdb, err := Open(ops)\n\trequire.NoError(t, err, \"error while opening db\")\n\tdefer func() { require.NoError(t, db.Close()) }()\n\n\tfor i := 0; i < 3; i++ {\n\t\terr = db.Update(func(txn *Txn) error {\n\t\t\treturn txn.SetEntry(NewEntry([]byte(fmt.Sprintf(\"key-%d\", i)),\n\t\t\t\t[]byte(fmt.Sprintf(\"value-%d\", i))))\n\t\t})\n\t\trequire.NoError(t, err, \"unable to set key and value\")\n\t}\n\ttime.Sleep(1 * time.Second)\n\n\t// We want to make sure that the memtable is flushed to disk. While flushing the memtable to\n\t// disk, the latest head is also stored in it. Hence we will try to read the head from disk.\n\t// To ensure this, we drop all in-memory memtables.\n\tdb.lock.Lock()\n\tdb.mt.DecrRef()\n\tfor _, mt := range db.imm {\n\t\tmt.DecrRef()\n\t}\n\tdb.imm = db.imm[:0]\n\tdb.mt, err = db.newMemTable()\n\trequire.NoError(t, err)\n\tdb.lock.Unlock()\n\n\t// Since we are inserting 3 entries and ValueLogMaxEntries is 1, there will be 3 rotations.\n\trequire.True(t, db.nextMemFid == 3,\n\t\tfmt.Sprintf(\"expected fid: %d, actual fid: %d\", 3, db.nextMemFid))\n}\n\nfunc TestVerifyChecksum(t *testing.T) {\n\ttestVerfiyCheckSum := func(t *testing.T, opt Options) {\n\t\tpath, err := os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\t\tdefer os.Remove(path)\n\t\topt.ValueDir = path\n\t\topt.Dir = path\n\t\t// Use the stream writer for writing.\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\tvalue := make([]byte, 32)\n\t\t\ty.Check2(rand.Read(value))\n\t\t\tst := 0\n\n\t\t\tbuf := z.NewBuffer(10<<20, \"test\")\n\t\t\tdefer func() { require.NoError(t, buf.Release()) }()\n\t\t\tfor i := 0; i < 1000; i++ {\n\t\t\t\tkey := make([]byte, 8)\n\t\t\t\tbinary.BigEndian.PutUint64(key, uint64(i))\n\t\t\t\tKVToBuffer(&pb.KV{\n\t\t\t\t\tKey:      key,\n\t\t\t\t\tValue:    value,\n\t\t\t\t\tStreamId: uint32(st),\n\t\t\t\t\tVersion:  1,\n\t\t\t\t}, buf)\n\t\t\t\tif i%100 == 0
{\n\t\t\t\t\tst++\n\t\t\t\t}\n\t\t\t}\n\n\t\t\tsw := db.NewStreamWriter()\n\t\t\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\t\t\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\t\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\n\t\t\trequire.NoError(t, db.VerifyChecksum(), \"checksum verification failed for DB\")\n\t\t})\n\t}\n\tt.Run(\"Testing Verify Checksum without encryption\", func(t *testing.T) {\n\t\ttestVerfiyCheckSum(t, getTestOptions(\"\"))\n\t})\n\tt.Run(\"Testing Verify Checksum with Encryption\", func(t *testing.T) {\n\t\tkey := make([]byte, 32)\n\t\t_, err := rand.Read(key)\n\t\trequire.NoError(t, err)\n\t\topt := getTestOptions(\"\")\n\t\topt.EncryptionKey = key\n\t\topt.BlockCacheSize = 1 << 20\n\t\topt.IndexCacheSize = 1 << 20\n\t\ttestVerfiyCheckSum(t, opt)\n\t})\n}\n\nfunc TestMain(m *testing.M) {\n\tflag.Parse()\n\tos.Exit(m.Run())\n}\n\nfunc removeDir(dir string) {\n\tif err := os.RemoveAll(dir); err != nil {\n\t\tpanic(err)\n\t}\n}\n\nfunc TestWriteInemory(t *testing.T) {\n\topt := DefaultOptions(\"\").WithInMemory(true)\n\tdb, err := Open(opt)\n\trequire.NoError(t, err)\n\tdefer func() {\n\t\trequire.NoError(t, db.Close())\n\t}()\n\tfor i := 0; i < 100; i++ {\n\t\ttxnSet(t, db, []byte(fmt.Sprintf(\"key%d\", i)), []byte(fmt.Sprintf(\"val%d\", i)), 0x00)\n\t}\n\terr = db.View(func(txn *Txn) error {\n\t\tfor j := 0; j < 100; j++ {\n\t\t\titem, err := txn.Get([]byte(fmt.Sprintf(\"key%d\", j)))\n\t\t\trequire.NoError(t, err)\n\t\t\texpected := []byte(fmt.Sprintf(\"val%d\", j))\n\t\t\trequire.NoError(t, item.Value(func(val []byte) error {\n\t\t\t\trequire.Equal(t, expected, val,\n\t\t\t\t\t\"Invalid value for key %q. 
expected: %q, actual: %q\",\n\t\t\t\t\titem.Key(), expected, val)\n\t\t\t\treturn nil\n\t\t\t}))\n\t\t}\n\t\treturn nil\n\t})\n\trequire.NoError(t, err)\n}\n\nfunc TestMinCacheSize(t *testing.T) {\n\topt := DefaultOptions(\"\").\n\t\tWithInMemory(true).\n\t\tWithIndexCacheSize(16).\n\t\tWithBlockCacheSize(16)\n\tdb, err := Open(opt)\n\trequire.NoError(t, err)\n\tdefer func() {\n\t\trequire.NoError(t, db.Close())\n\t}()\n\n}\n\nfunc TestUpdateMaxCost(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err, \"temp dir for badger could not be created\")\n\tdefer os.RemoveAll(dir)\n\n\tops := getTestOptions(dir).\n\t\tWithBlockCacheSize(1 << 20).\n\t\tWithIndexCacheSize(2 << 20)\n\tdb, err := Open(ops)\n\trequire.NoError(t, err)\n\n\tcost, err := db.CacheMaxCost(BlockCache, -1)\n\trequire.NoError(t, err)\n\trequire.Equal(t, int64(1<<20), cost)\n\tcost, err = db.CacheMaxCost(IndexCache, -1)\n\trequire.NoError(t, err)\n\trequire.Equal(t, int64(2<<20), cost)\n\n\t_, err = db.CacheMaxCost(BlockCache, 2<<20)\n\trequire.NoError(t, err)\n\tcost, err = db.CacheMaxCost(BlockCache, -1)\n\trequire.NoError(t, err)\n\trequire.Equal(t, int64(2<<20), cost)\n\t_, err = db.CacheMaxCost(IndexCache, 4<<20)\n\trequire.NoError(t, err)\n\tcost, err = db.CacheMaxCost(IndexCache, -1)\n\trequire.NoError(t, err)\n\trequire.Equal(t, int64(4<<20), cost)\n}\n\nfunc TestOpenDBReadOnly(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(dir)\n\n\tmp := make(map[string][]byte)\n\n\tops := getTestOptions(dir)\n\tops.ReadOnly = false\n\tdb, err := Open(ops)\n\trequire.NoError(t, err)\n\t// Add bunch of entries that don't go into value log.\n\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\tval := make([]byte, 10)\n\t\trand.Read(val)\n\t\tfor i := 0; i < 10; i++ {\n\t\t\tkey := fmt.Sprintf(\"key-%05d\", i)\n\t\t\trequire.NoError(t, txn.Set([]byte(key), val))\n\t\t\tmp[key] = 
val\n\t\t}\n\t\treturn nil\n\t}))\n\trequire.NoError(t, db.Close())\n\n\tops.ReadOnly = true\n\tdb, err = Open(ops)\n\trequire.NoError(t, err)\n\trequire.NoError(t, db.Close())\n\n\tdb, err = Open(ops)\n\trequire.NoError(t, err)\n\tvar count int\n\tread := func() {\n\t\tcount = 0\n\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\tit := txn.NewIterator(DefaultIteratorOptions)\n\t\t\tdefer it.Close()\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\titem := it.Item()\n\t\t\t\trequire.NoError(t, item.Value(func(val []byte) error {\n\t\t\t\t\trequire.Equal(t, mp[string(item.Key())], val)\n\t\t\t\t\treturn nil\n\t\t\t\t}))\n\t\t\t\tcount++\n\t\t\t}\n\t\t\treturn nil\n\t\t}))\n\t}\n\tread()\n\trequire.Equal(t, 10, count)\n\trequire.NoError(t, db.Close())\n\n\tops.ReadOnly = false\n\tdb, err = Open(ops)\n\trequire.NoError(t, err)\n\t// Add bunch of entries that go into value log.\n\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\trequire.Greater(t, db.valueThreshold(), int64(10))\n\t\tval := make([]byte, db.valueThreshold()+10)\n\t\trand.Read(val)\n\t\tfor i := 0; i < 10; i++ {\n\t\t\tkey := fmt.Sprintf(\"KEY-%05d\", i)\n\t\t\trequire.NoError(t, txn.Set([]byte(key), val))\n\t\t\tmp[key] = val\n\t\t}\n\t\treturn nil\n\t}))\n\trequire.NoError(t, db.Close())\n\n\tops.ReadOnly = true\n\tdb, err = Open(ops)\n\trequire.NoError(t, err)\n\tread()\n\trequire.Equal(t, 20, count)\n\trequire.NoError(t, db.Close())\n}\n\nfunc TestBannedPrefixes(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err, \"temp dir for badger could not be created\")\n\tdefer os.RemoveAll(dir)\n\n\topt := getTestOptions(dir).WithNamespaceOffset(3)\n\t// All values go into vlog files. 
This is for checking if banned keys are properly decoded on DB\n\t// restart.\n\topt.ValueThreshold = 0\n\topt.ValueLogMaxEntries = 2\n\t// We store the uint64 namespace at idx=3, so first 3 bytes are insignificant to us.\n\tinitialBytes := make([]byte, opt.NamespaceOffset)\n\tdb, err := Open(opt)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 1, len(db.vlog.filesMap))\n\n\tvar keys [][]byte\n\tvar allPrefixes []uint64 = []uint64{1234, 3456, 5678, 7890, 901234}\n\tfor _, p := range allPrefixes {\n\t\tprefix := y.U64ToBytes(p)\n\t\tfor i := 0; i < 10; i++ {\n\t\t\t// We store the uint64 namespace at idx=3, so first 3 bytes are insignificant to us.\n\t\t\tkey := []byte(fmt.Sprintf(\"%s%s-key%02d\", initialBytes, prefix, i))\n\t\t\tkeys = append(keys, key)\n\t\t}\n\t}\n\n\tbannedPrefixes := make(map[uint64]struct{})\n\tisBanned := func(key []byte) bool {\n\t\tprefix := y.BytesToU64(key[3:])\n\t\tif _, ok := bannedPrefixes[prefix]; ok {\n\t\t\treturn true\n\t\t}\n\t\treturn false\n\t}\n\n\tvalidate := func() {\n\t\t// Validate read/write.\n\t\tfor _, key := range keys {\n\t\t\ttxn := db.NewTransaction(true)\n\t\t\t_, rerr := txn.Get(key)\n\t\t\twerr := txn.Set(key, []byte(\"value\"))\n\t\t\ttxn.Discard()\n\t\t\tif isBanned(key) {\n\t\t\t\trequire.Equal(t, ErrBannedKey, rerr)\n\t\t\t\trequire.Equal(t, ErrBannedKey, werr)\n\t\t\t} else {\n\t\t\t\trequire.NoError(t, rerr)\n\t\t\t\trequire.NoError(t, werr)\n\t\t\t}\n\t\t}\n\t}\n\n\tfor _, key := range keys {\n\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\treturn txn.SetEntry(NewEntry(key, []byte(\"value\")))\n\t\t}))\n\t}\n\tvalidate()\n\n\t// Ban a couple of prefix and validate that we should not be able to read/write them.\n\trequire.NoError(t, db.BanNamespace(1234))\n\tbannedPrefixes[1234] = struct{}{}\n\tvalidate()\n\n\trequire.NoError(t, db.BanNamespace(5678))\n\tbannedPrefixes[5678] = struct{}{}\n\tvalidate()\n\n\trequire.Greater(t, len(db.vlog.filesMap), 1)\n\trequire.NoError(t, db.Close())\n\n\tdb, 
err = Open(opt)\n\trequire.NoError(t, err)\n\tvalidate()\n\trequire.NoError(t, db.Close())\n}\n\n// Tests that the iterator skips the banned prefixes. Sets keys with multiple versions in\n// different namespaces and maintains a sorted list of those keys in-memory.\n// Then bans a few prefixes, iterates over the DB, and matches the results against the\n// corresponding keys from the in-memory list. The skipping is simulated in the in-memory\n// list as well.\nfunc TestIterateWithBanned(t *testing.T) {\n\topt := DefaultOptions(\"\").WithNamespaceOffset(3)\n\topt.NumVersionsToKeep = math.MaxInt64\n\n\t// We store the uint64 namespace at idx=3, so first 3 bytes are insignificant to us.\n\tinitialBytes := make([]byte, opt.NamespaceOffset)\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tbkey := func(prefix uint64, i int) []byte {\n\t\t\treturn []byte(fmt.Sprintf(\"%s%s-%04d\", initialBytes, y.U64ToBytes(prefix), i))\n\t\t}\n\n\t\tN := 100\n\t\tV := 3\n\n\t\t// Generate 26*N keys, each with V versions (versions set by txnSet())\n\t\tvar keys [][]byte\n\t\tfor i := 'a'; i <= 'z'; i++ {\n\t\t\tfor j := 0; j < N; j++ {\n\t\t\t\tfor v := 0; v < V; v++ {\n\t\t\t\t\tkeys = append(keys, bkey(uint64(i*1000), j))\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tfor _, key := range keys {\n\t\t\ttxnSet(t, db, key, key, 0)\n\t\t}\n\n\t\t// Validate that we don't see the banned keys while iterating.\n\t\t// Pass it the iterator options, idx to start from in the in-memory list, and the number of\n\t\t// keys we expect to see through iteration.\n\t\tvalidate := func(iopts IteratorOptions, idx, expected int) {\n\t\t\ttxn := db.NewTransaction(false)\n\t\t\tdefer txn.Discard()\n\t\t\titr := txn.NewIterator(iopts)\n\t\t\tdefer itr.Close()\n\t\t\tincIdx := func() {\n\t\t\t\tn := 1\n\t\t\t\t// If AllVersions is not set, the iterator yields each key once, so skip all V\n\t\t\t\t// copies of it in the in-memory list.\n\t\t\t\tif !iopts.AllVersions {\n\t\t\t\t\tn *= V\n\t\t\t\t}\n\t\t\t\t// When iterating in reverse, decrement the index into the in-memory list.\n\t\t\t\tif iopts.Reverse {\n\t\t\t\t\tidx
-= n\n\t\t\t\t} else {\n\t\t\t\t\tidx += n\n\t\t\t\t}\n\t\t\t}\n\t\t\tcount := 0\n\t\t\tfor itr.Seek(itr.opt.Prefix); itr.Valid(); itr.Next() {\n\t\t\t\t// Iterator should skip the banned keys, so should we.\n\t\t\t\tfor db.isBanned(keys[idx]) != nil {\n\t\t\t\t\tincIdx()\n\t\t\t\t}\n\t\t\t\tcount++\n\t\t\t\trequire.Equalf(t, keys[idx], itr.Item().Key(), \"count:%d\", count)\n\t\t\t\tincIdx()\n\t\t\t}\n\t\t\trequire.Equal(t, expected, count)\n\t\t}\n\n\t\tgetIterablePrefix := func(i int) []byte {\n\t\t\treturn []byte(fmt.Sprintf(\"%s%s\", initialBytes, y.U64ToBytes(uint64(i*1000))))\n\t\t}\n\t\tvalidate(IteratorOptions{}, 0, 26*N)\n\t\tvalidate(IteratorOptions{Reverse: true, AllVersions: true}, 26*N*V-1, 26*N*V)\n\t\tvalidate(IteratorOptions{Prefix: getIterablePrefix('f')}, 5*N*V, N)\n\n\t\trequire.NoError(t, db.BanNamespace(uint64('a'*1000)))\n\t\tvalidate(IteratorOptions{}, 1*N*V, 25*N)\n\t\tvalidate(IteratorOptions{AllVersions: true}, 1*N*V, 25*N*V)\n\t\tvalidate(IteratorOptions{Reverse: true, AllVersions: true}, 26*N*V-1, 25*N*V)\n\n\t\trequire.NoError(t, db.BanNamespace(uint64('b'*1000)))\n\t\tvalidate(IteratorOptions{}, 2*N*V, 24*N)\n\t\tvalidate(IteratorOptions{AllVersions: true}, 2*N*V, 24*N*V)\n\n\t\trequire.NoError(t, db.BanNamespace(uint64('d'*1000)))\n\t\tvalidate(IteratorOptions{}, 2*N*V, 23*N)\n\t\tvalidate(IteratorOptions{AllVersions: true}, 2*N*V, 23*N*V)\n\t\tvalidate(IteratorOptions{Prefix: getIterablePrefix('f')}, 5*N*V, N)\n\t\tvalidate(IteratorOptions{Prefix: getIterablePrefix('f'), AllVersions: true}, 5*N*V, N*V)\n\n\t\trequire.NoError(t, db.BanNamespace(uint64('g')*1000))\n\t\tvalidate(IteratorOptions{AllVersions: true}, 2*N*V, 22*N*V)\n\n\t\trequire.NoError(t, db.BanNamespace(uint64('r'*1000)))\n\t\tvalidate(IteratorOptions{}, 2*N*V, 21*N)\n\t\tvalidate(IteratorOptions{Reverse: true, AllVersions: true}, 26*N*V-1, 21*N*V)\n\n\t\t// Iterate over the banned prefix. 
Passing -1 as we don't expect to access keys.\n\t\tvalidate(IteratorOptions{Prefix: getIterablePrefix('g')}, -1, 0)\n\t\tvalidate(IteratorOptions{Prefix: getIterablePrefix('g'), Reverse: true, AllVersions: true},\n\t\t\t-1, 0)\n\n\t\trequire.NoError(t, db.BanNamespace(uint64('z'*1000)))\n\t\tvalidate(IteratorOptions{}, 2*N*V, 20*N)\n\t\tvalidate(IteratorOptions{AllVersions: true}, 2*N*V, 20*N*V)\n\t\tvalidate(IteratorOptions{Reverse: true, AllVersions: true}, 25*N*V-1, 20*N*V)\n\t})\n}\n\n// A basic test that checks if the DB works even if user is not using the DefaultOptions.\nfunc TestBannedAtZeroOffset(t *testing.T) {\n\topt := getTestOptions(\"\")\n\t// When DefaultOptions is not used, NamespaceOffset will be set to 0.\n\topt.NamespaceOffset = 0\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\trequire.Equal(t, 0, db.opt.NamespaceOffset)\n\t\terr := db.Update(func(txn *Txn) error {\n\t\t\tfor i := 0; i < 10; i++ {\n\t\t\t\tentry := NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), []byte(fmt.Sprintf(\"val%d\", i)))\n\t\t\t\tif err := txn.SetEntry(entry); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t}\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t\terr = db.View(func(txn *Txn) error {\n\t\t\tfor i := 0; i < 10; i++ {\n\t\t\t\titem, err := txn.Get([]byte(fmt.Sprintf(\"key%d\", i)))\n\t\t\t\tif err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\n\t\t\t\texpected := []byte(fmt.Sprintf(\"val%d\", i))\n\t\t\t\tif err := item.Value(func(val []byte) error {\n\t\t\t\t\trequire.Equal(t, expected, val,\n\t\t\t\t\t\t\"Invalid value for key %q. 
expected: %q, actual: %q\",\n\t\t\t\t\t\titem.Key(), expected, val)\n\t\t\t\t\treturn nil\n\t\t\t\t}); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t}\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t})\n}\n\nfunc TestCompactL0OnClose(t *testing.T) {\n\topt := getTestOptions(\"\")\n\topt.CompactL0OnClose = true\n\topt.ValueThreshold = 1 // Every value goes to value log\n\topt.NumVersionsToKeep = 1\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tvar keys [][]byte\n\t\tval := make([]byte, 1<<12)\n\t\tfor i := 0; i < 10; i++ {\n\t\t\tkey := make([]byte, 40)\n\t\t\t_, err := rand.Read(key)\n\t\t\trequire.NoError(t, err)\n\t\t\tkeys = append(keys, key)\n\n\t\t\terr = db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.SetEntry(NewEntry(key, val))\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}\n\n\t\tfor _, key := range keys {\n\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.SetEntry(NewEntry(key, val))\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}\n\t})\n}\n\nfunc TestCloseDBWhileReading(t *testing.T) {\n\tdir := t.TempDir()\n\tdb, err := Open(DefaultOptions(dir))\n\trequire.NoError(t, err)\n\n\tkey := []byte(\"key\")\n\terr = db.Update(func(txn *Txn) error {\n\t\treturn txn.Set(key, []byte(\"value\"))\n\t})\n\trequire.NoError(t, err)\n\n\tvar wg sync.WaitGroup\n\tfor i := 0; i < 10; i++ {\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\t\t\tfor {\n\t\t\t\terr := db.View(func(txn *Txn) error {\n\t\t\t\t\t_, err := txn.Get(key)\n\t\t\t\t\treturn err\n\t\t\t\t})\n\t\t\t\tif err != nil {\n\t\t\t\t\trequire.Contains(t, err.Error(), \"DB Closed\")\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t}\n\t\t}()\n\t}\n\n\ttime.Sleep(time.Second)\n\trequire.NoError(t, db.Close())\n\twg.Wait()\n}\n"
  },
  {
    "path": "dir_aix.go",
    "content": "//go:build aix\n// +build aix\n\n/*\n * SPDX-FileCopyrightText: © 2017-2026 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"errors\"\n\t\"fmt\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"sync\"\n\n\t\"golang.org/x/sys/unix\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\n// AIX flock locks files, not descriptors. So, multiple descriptors cannot\n// be used in the same file. The first to close removals the lock on the\n// file.\ntype directoryLockGuard struct {\n\t// The absolute path to our pid file.\n\tpath string\n\t// Was this a shared lock for a read-only database?\n\treadOnly bool\n}\n\n// AIX flocking is file x process, not fd x file x process like linux. We can\n// only hold one descriptor with a lock open at any given time.\ntype aixFlock struct {\n\tfile     *os.File\n\tcount    int\n\treadOnly bool\n}\n\n// Keep a map of locks synchronized by a mutex.\nvar aixFlockMap = map[string]*aixFlock{}\nvar aixFlockMapLock sync.Mutex\n\n// acquireDirectoryLock gets a lock on the directory (using flock). If\n// this is not read-only, it will also write our pid to\n// dirPath/pidFileName for convenience.\nfunc acquireDirectoryLock(dirPath string, pidFileName string, readOnly bool) (\n\t*directoryLockGuard, error) {\n\n\t// Convert to absolute path so that Release still works even if we do an unbalanced\n\t// chdir in the meantime.\n\tabsPidFilePath, err := filepath.Abs(filepath.Join(dirPath, pidFileName))\n\tif err != nil {\n\t\treturn nil, y.Wrapf(err, \"cannot get absolute path for pid lock file\")\n\t}\n\n\taixFlockMapLock.Lock()\n\tdefer aixFlockMapLock.Unlock()\n\n\tlg := &directoryLockGuard{absPidFilePath, readOnly}\n\n\tif lock, fnd := aixFlockMap[absPidFilePath]; fnd {\n\t\tif !readOnly || lock.readOnly != readOnly {\n\t\t\treturn nil, fmt.Errorf(\n\t\t\t\t\"Cannot acquire directory lock on %q.  
Another process is using this Badger database.\", dirPath)\n\t\t}\n\t\tlock.count++\n\t} else {\n\t\t// This is the first acquirer, set up a lock file and register it.\n\t\tf, err := os.OpenFile(absPidFilePath, os.O_RDWR|os.O_CREATE, 0666)\n\t\tif err != nil {\n\t\t\treturn nil, y.Wrapf(err, \"cannot create/open pid file %q\", absPidFilePath)\n\t\t}\n\n\t\topts := unix.F_WRLCK\n\t\tif readOnly {\n\t\t\topts = unix.F_RDLCK\n\t\t}\n\n\t\tflckt := unix.Flock_t{int16(opts), 0, 0, 0, 0, 0, 0}\n\t\terr = unix.FcntlFlock(uintptr(f.Fd()), unix.F_SETLK, &flckt)\n\t\tif err != nil {\n\t\t\tf.Close()\n\t\t\treturn nil, y.Wrapf(err,\n\t\t\t\t\"Cannot acquire directory lock on %q.  Another process is using this Badger database.\", dirPath)\n\t\t}\n\n\t\tif !readOnly {\n\t\t\tf.Truncate(0)\n\t\t\t// Write our pid to the file.\n\t\t\t_, err = f.Write([]byte(fmt.Sprintf(\"%d\\n\", os.Getpid())))\n\t\t\tif err != nil {\n\t\t\t\tf.Close()\n\t\t\t\treturn nil, y.Wrapf(err,\n\t\t\t\t\t\"Cannot write pid file %q\", absPidFilePath)\n\t\t\t}\n\t\t}\n\t\taixFlockMap[absPidFilePath] = &aixFlock{f, 1, readOnly}\n\t}\n\treturn lg, nil\n}\n\n// Release deletes the pid file and releases our lock on the directory.\nfunc (guard *directoryLockGuard) release() error {\n\tvar err error\n\n\taixFlockMapLock.Lock()\n\tdefer aixFlockMapLock.Unlock()\n\n\tif lock, fnd := aixFlockMap[guard.path]; fnd {\n\t\tlock.count--\n\t\tif lock.count == 0 {\n\t\t\tif !lock.readOnly {\n\t\t\t\t// Try to clear the PID if we succeed.\n\t\t\t\tlock.file.Truncate(0)\n\t\t\t\tos.Remove(guard.path)\n\t\t\t}\n\n\t\t\tif closeErr := lock.file.Close(); err == nil {\n\t\t\t\terr = closeErr\n\t\t\t}\n\t\t\tdelete(aixFlockMap, guard.path)\n\t\t\tguard.path = \"\"\n\t\t}\n\t} else {\n\t\terr = errors.New(fmt.Sprintf(\"unknown lock %v\", guard.path))\n\t}\n\n\treturn err\n}\n\n// openDir opens a directory for syncing.\nfunc openDir(path string) (*os.File, error) { return os.Open(path) }\n\n// When you create or delete a file, you 
have to ensure the directory entry for the file is synced\n// in order to guarantee the file is visible (if the system crashes). (See the man page for fsync,\n// or see https://github.com/coreos/etcd/issues/6368 for an example.)\nfunc syncDir(dir string) error {\n\t// AIX does not support fsync on a directory, so this is a no-op.\n\t// Data durability on crash may be affected.\n\treturn nil\n}\n"
  },
  {
    "path": "dir_other.go",
    "content": "//go:build js || wasip1\n// +build js wasip1\n\n/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"os\"\n\t\"path/filepath\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\n// directoryLockGuard holds a lock on a directory and a pid file inside.  The pid file isn't part\n// of the locking mechanism, it's just advisory.\ntype directoryLockGuard struct {\n\t// File handle on the directory, which we've flocked.\n\tf *os.File\n\t// The absolute path to our pid file.\n\tpath string\n\t// Was this a shared lock for a read-only database?\n\treadOnly bool\n}\n\n// acquireDirectoryLock gets a lock on the directory (using flock). If\n// this is not read-only, it will also write our pid to\n// dirPath/pidFileName for convenience.\nfunc acquireDirectoryLock(dirPath string, pidFileName string, readOnly bool) (\n\t*directoryLockGuard, error) {\n\t// Convert to absolute path so that Release still works even if we do an unbalanced\n\t// chdir in the meantime.\n\tabsPidFilePath, err := filepath.Abs(filepath.Join(dirPath, pidFileName))\n\tif err != nil {\n\t\treturn nil, y.Wrapf(err, \"cannot get absolute path for pid lock file\")\n\t}\n\tf, err := os.Open(dirPath)\n\tif err != nil {\n\t\treturn nil, y.Wrapf(err, \"cannot open directory %q\", dirPath)\n\t}\n\n\t// NOTE: Here is where we would normally call flock.\n\t// This is not supported in js / wasm, so skip it.\n\n\tif !readOnly {\n\t\t// Yes, we happily overwrite a pre-existing pid file.  
We're the\n\t\t// only read-write badger process using this directory.\n\t\terr = os.WriteFile(absPidFilePath, []byte(fmt.Sprintf(\"%d\\n\", os.Getpid())), 0666)\n\t\tif err != nil {\n\t\t\tf.Close()\n\t\t\treturn nil, y.Wrapf(err,\n\t\t\t\t\"Cannot write pid file %q\", absPidFilePath)\n\t\t}\n\t}\n\n\treturn &directoryLockGuard{f, absPidFilePath, readOnly}, nil\n}\n\n// Release deletes the pid file and releases our lock on the directory.\nfunc (guard *directoryLockGuard) release() error {\n\tvar err error\n\tif !guard.readOnly {\n\t\t// It's important that we remove the pid file first.\n\t\terr = os.Remove(guard.path)\n\t}\n\n\tif closeErr := guard.f.Close(); err == nil {\n\t\terr = closeErr\n\t}\n\tguard.path = \"\"\n\tguard.f = nil\n\n\treturn err\n}\n\n// openDir opens a directory for syncing.\nfunc openDir(path string) (*os.File, error) { return os.Open(path) }\n\n// When you create or delete a file, you have to ensure the directory entry for the file is synced\n// in order to guarantee the file is visible (if the system crashes). (See the man page for fsync,\n// or see https://github.com/coreos/etcd/issues/6368 for an example.)\nfunc syncDir(dir string) error {\n\tf, err := openDir(dir)\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"While opening directory: %s.\", dir)\n\t}\n\n\terr = f.Sync()\n\tcloseErr := f.Close()\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"While syncing directory: %s.\", dir)\n\t}\n\treturn y.Wrapf(closeErr, \"While closing directory: %s.\", dir)\n}\n"
  },
  {
    "path": "dir_plan9.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"strings\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\n// directoryLockGuard holds a lock on a directory and a pid file inside.  The pid file isn't part\n// of the locking mechanism, it's just advisory.\ntype directoryLockGuard struct {\n\t// File handle on the directory, which we've locked.\n\tf *os.File\n\t// The absolute path to our pid file.\n\tpath string\n}\n\n// acquireDirectoryLock gets a lock on the directory.\n// It will also write our pid to dirPath/pidFileName for convenience.\n// readOnly is not supported on Plan 9.\nfunc acquireDirectoryLock(dirPath string, pidFileName string, readOnly bool) (\n\t*directoryLockGuard, error) {\n\tif readOnly {\n\t\treturn nil, ErrPlan9NotSupported\n\t}\n\n\t// Convert to absolute path so that Release still works even if we do an unbalanced\n\t// chdir in the meantime.\n\tabsPidFilePath, err := filepath.Abs(filepath.Join(dirPath, pidFileName))\n\tif err != nil {\n\t\treturn nil, y.Wrap(err, \"cannot get absolute path for pid lock file\")\n\t}\n\n\t// If the file was unpacked or created by some other program, it might not\n\t// have the ModeExclusive bit set. Set it before we call OpenFile, so that we\n\t// can be confident that a successful OpenFile implies exclusive use.\n\t//\n\t// OpenFile fails if the file ModeExclusive bit set *and* the file is already open.\n\t// So, if the file is closed when the DB crashed, we're fine. 
When the process\n\t// that was managing the DB crashes, the OS will close the file for us.\n\t//\n\t// This bit of code is copied from Go's lockedfile internal package:\n\t// https://github.com/golang/go/blob/go1.15rc1/src/cmd/go/internal/lockedfile/lockedfile_plan9.go#L58\n\tif fi, err := os.Stat(absPidFilePath); err == nil {\n\t\tif fi.Mode()&os.ModeExclusive == 0 {\n\t\t\tif err := os.Chmod(absPidFilePath, fi.Mode()|os.ModeExclusive); err != nil {\n\t\t\t\treturn nil, y.Wrapf(err, \"could not set exclusive mode bit\")\n\t\t\t}\n\t\t}\n\t} else if !os.IsNotExist(err) {\n\t\treturn nil, err\n\t}\n\tf, err := os.OpenFile(absPidFilePath, os.O_WRONLY|os.O_TRUNC|os.O_CREATE, 0666|os.ModeExclusive)\n\tif err != nil {\n\t\tif isLocked(err) {\n\t\t\treturn nil, y.Wrapf(err,\n\t\t\t\t\"Cannot open pid lock file %q.  Another process is using this Badger database\",\n\t\t\t\tabsPidFilePath)\n\t\t}\n\t\treturn nil, y.Wrapf(err, \"Cannot open pid lock file %q\", absPidFilePath)\n\t}\n\n\tif _, err = fmt.Fprintf(f, \"%d\\n\", os.Getpid()); err != nil {\n\t\tf.Close()\n\t\treturn nil, y.Wrapf(err, \"could not write pid\")\n\t}\n\treturn &directoryLockGuard{f, absPidFilePath}, nil\n}\n\n// Release deletes the pid file and releases our lock on the directory.\nfunc (guard *directoryLockGuard) release() error {\n\t// It's important that we remove the pid file first.\n\terr := os.Remove(guard.path)\n\n\tif closeErr := guard.f.Close(); err == nil {\n\t\terr = closeErr\n\t}\n\tguard.path = \"\"\n\tguard.f = nil\n\n\treturn err\n}\n\n// openDir opens a directory for syncing.\nfunc openDir(path string) (*os.File, error) { return os.Open(path) }\n\n// When you create or delete a file, you have to ensure the directory entry for the file is synced\n// in order to guarantee the file is visible (if the system crashes). 
(See the man page for fsync,\n// or see https://github.com/coreos/etcd/issues/6368 for an example.)\nfunc syncDir(dir string) error {\n\tf, err := openDir(dir)\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"While opening directory: %s.\", dir)\n\t}\n\n\terr = f.Sync()\n\tcloseErr := f.Close()\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"While syncing directory: %s.\", dir)\n\t}\n\treturn y.Wrapf(closeErr, \"While closing directory: %s.\", dir)\n}\n\n// Opening an exclusive-use file returns an error.\n// The expected error strings are:\n//\n//   - \"open/create -- file is locked\" (cwfs, kfs)\n//   - \"exclusive lock\" (fossil)\n//   - \"exclusive use file already open\" (ramfs)\n//\n// See https://github.com/golang/go/blob/go1.15rc1/src/cmd/go/internal/lockedfile/lockedfile_plan9.go#L16\nvar lockedErrStrings = [...]string{\n\t\"file is locked\",\n\t\"exclusive lock\",\n\t\"exclusive use file already open\",\n}\n\n// Even though plan9 doesn't support the Lock/RLock/Unlock functions to\n// manipulate already-open files, IsLocked is still meaningful: os.OpenFile\n// itself may return errors that indicate that a file with the ModeExclusive bit\n// set is already open.\nfunc isLocked(err error) bool {\n\ts := err.Error()\n\n\tfor _, frag := range lockedErrStrings {\n\t\tif strings.Contains(s, frag) {\n\t\t\treturn true\n\t\t}\n\t}\n\treturn false\n}\n"
  },
  {
    "path": "dir_unix.go",
    "content": "//go:build !windows && !plan9 && !js && !wasip1 && !aix\n// +build !windows,!plan9,!js,!wasip1,!aix\n\n/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"os\"\n\t\"path/filepath\"\n\n\t\"golang.org/x/sys/unix\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\n// directoryLockGuard holds a lock on a directory and a pid file inside.  The pid file isn't part\n// of the locking mechanism, it's just advisory.\ntype directoryLockGuard struct {\n\t// File handle on the directory, which we've flocked.\n\tf *os.File\n\t// The absolute path to our pid file.\n\tpath string\n\t// Was this a shared lock for a read-only database?\n\treadOnly bool\n}\n\n// acquireDirectoryLock gets a lock on the directory (using flock). If\n// this is not read-only, it will also write our pid to\n// dirPath/pidFileName for convenience.\nfunc acquireDirectoryLock(dirPath string, pidFileName string, readOnly bool) (\n\t*directoryLockGuard, error) {\n\t// Convert to absolute path so that Release still works even if we do an unbalanced\n\t// chdir in the meantime.\n\tabsPidFilePath, err := filepath.Abs(filepath.Join(dirPath, pidFileName))\n\tif err != nil {\n\t\treturn nil, y.Wrapf(err, \"cannot get absolute path for pid lock file\")\n\t}\n\tf, err := os.Open(dirPath)\n\tif err != nil {\n\t\treturn nil, y.Wrapf(err, \"cannot open directory %q\", dirPath)\n\t}\n\topts := unix.LOCK_EX | unix.LOCK_NB\n\tif readOnly {\n\t\topts = unix.LOCK_SH | unix.LOCK_NB\n\t}\n\n\terr = unix.Flock(int(f.Fd()), opts)\n\tif err != nil {\n\t\tf.Close()\n\t\treturn nil, y.Wrapf(err,\n\t\t\t\"Cannot acquire directory lock on %q.  Another process is using this Badger database.\",\n\t\t\tdirPath)\n\t}\n\n\tif !readOnly {\n\t\t// Yes, we happily overwrite a pre-existing pid file.  
We're the\n\t\t// only read-write badger process using this directory.\n\t\terr = os.WriteFile(absPidFilePath, []byte(fmt.Sprintf(\"%d\\n\", os.Getpid())), 0666)\n\t\tif err != nil {\n\t\t\tf.Close()\n\t\t\treturn nil, y.Wrapf(err,\n\t\t\t\t\"Cannot write pid file %q\", absPidFilePath)\n\t\t}\n\t}\n\treturn &directoryLockGuard{f, absPidFilePath, readOnly}, nil\n}\n\n// Release deletes the pid file and releases our lock on the directory.\nfunc (guard *directoryLockGuard) release() error {\n\tvar err error\n\tif !guard.readOnly {\n\t\t// It's important that we remove the pid file first.\n\t\terr = os.Remove(guard.path)\n\t}\n\n\tif closeErr := guard.f.Close(); err == nil {\n\t\terr = closeErr\n\t}\n\tguard.path = \"\"\n\tguard.f = nil\n\n\treturn err\n}\n\n// openDir opens a directory for syncing.\nfunc openDir(path string) (*os.File, error) { return os.Open(path) }\n\n// When you create or delete a file, you have to ensure the directory entry for the file is synced\n// in order to guarantee the file is visible (if the system crashes). (See the man page for fsync,\n// or see https://github.com/coreos/etcd/issues/6368 for an example.)\nfunc syncDir(dir string) error {\n\tf, err := openDir(dir)\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"While opening directory: %s.\", dir)\n\t}\n\n\terr = f.Sync()\n\tcloseErr := f.Close()\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"While syncing directory: %s.\", dir)\n\t}\n\treturn y.Wrapf(closeErr, \"While closing directory: %s.\", dir)\n}\n"
  },
  {
    "path": "dir_windows.go",
    "content": "//go:build windows\n// +build windows\n\n/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\n// OpenDir opens a directory in windows with write access for syncing.\nimport (\n\t\"os\"\n\t\"path/filepath\"\n\t\"syscall\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\n// FILE_ATTRIBUTE_TEMPORARY - A file that is being used for temporary storage.\n// FILE_FLAG_DELETE_ON_CLOSE - The file is to be deleted immediately after all of its handles are\n// closed, which includes the specified handle and any other open or duplicated handles.\n// See: https://docs.microsoft.com/en-us/windows/desktop/FileIO/file-attribute-constants\n// NOTE: Added here to avoid importing golang.org/x/sys/windows\nconst (\n\tFILE_ATTRIBUTE_TEMPORARY  = 0x00000100\n\tFILE_FLAG_DELETE_ON_CLOSE = 0x04000000\n)\n\nfunc openDir(path string) (*os.File, error) {\n\tfd, err := openDirWin(path)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\treturn os.NewFile(uintptr(fd), path), nil\n}\n\nfunc openDirWin(path string) (fd syscall.Handle, err error) {\n\tif len(path) == 0 {\n\t\treturn syscall.InvalidHandle, syscall.ERROR_FILE_NOT_FOUND\n\t}\n\tpathp, err := syscall.UTF16PtrFromString(path)\n\tif err != nil {\n\t\treturn syscall.InvalidHandle, err\n\t}\n\taccess := uint32(syscall.GENERIC_READ | syscall.GENERIC_WRITE)\n\tsharemode := uint32(syscall.FILE_SHARE_READ | syscall.FILE_SHARE_WRITE)\n\tcreatemode := uint32(syscall.OPEN_EXISTING)\n\tfl := uint32(syscall.FILE_FLAG_BACKUP_SEMANTICS)\n\treturn syscall.CreateFile(pathp, access, sharemode, nil, createmode, fl, 0)\n}\n\n// DirectoryLockGuard holds a lock on the directory.\ntype directoryLockGuard struct {\n\th    syscall.Handle\n\tpath string\n}\n\n// AcquireDirectoryLock acquires exclusive access to a directory.\nfunc acquireDirectoryLock(dirPath string, pidFileName string, readOnly bool) (*directoryLockGuard, error) {\n\tif readOnly {\n\t\treturn nil, 
ErrWindowsNotSupported\n\t}\n\n\t// Convert to absolute path so that Release still works even if we do an unbalanced\n\t// chdir in the meantime.\n\tabsLockFilePath, err := filepath.Abs(filepath.Join(dirPath, pidFileName))\n\tif err != nil {\n\t\treturn nil, y.Wrap(err, \"Cannot get absolute path for pid lock file\")\n\t}\n\n\t// This call creates a file handle in memory that only one process can use at a time. When\n\t// that process ends, the file is deleted by the system.\n\t// FILE_ATTRIBUTE_TEMPORARY is used to tell Windows to try to create the handle in memory.\n\t// FILE_FLAG_DELETE_ON_CLOSE is not specified in syscall_windows.go but tells Windows to delete\n\t// the file when all handles to it are closed.\n\t// XXX: this works but it's a bit clunky. I'd prefer to use LockFileEx but it needs unsafe pkg.\n\th, err := syscall.CreateFile(\n\t\tsyscall.StringToUTF16Ptr(absLockFilePath), 0, 0, nil,\n\t\tsyscall.OPEN_ALWAYS,\n\t\tuint32(FILE_ATTRIBUTE_TEMPORARY|FILE_FLAG_DELETE_ON_CLOSE),\n\t\t0)\n\tif err != nil {\n\t\treturn nil, y.Wrapf(err,\n\t\t\t\"Cannot create lock file %q.  Another process is using this Badger database\",\n\t\t\tabsLockFilePath)\n\t}\n\n\treturn &directoryLockGuard{h: h, path: absLockFilePath}, nil\n}\n\n// Release removes the directory lock.\nfunc (g *directoryLockGuard) release() error {\n\tg.path = \"\"\n\treturn syscall.CloseHandle(g.h)\n}\n\n// Windows doesn't support syncing directories to the file system. See\n// https://github.com/dgraph-io/badger/issues/699#issuecomment-504133587 for more details.\nfunc syncDir(dir string) error { return nil }\n"
  },
  {
    "path": "discard.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"encoding/binary\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"sort\"\n\t\"sync\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\n// discardStats keeps track of the amount of data that could be discarded for\n// a given logfile.\ntype discardStats struct {\n\tsync.Mutex\n\n\t*z.MmapFile\n\topt           Options\n\tnextEmptySlot int\n}\n\nconst discardFname string = \"DISCARD\"\n\nfunc InitDiscardStats(opt Options) (*discardStats, error) {\n\tfname := filepath.Join(opt.ValueDir, discardFname)\n\n\t// 1MB file can store 65.536 discard entries. Each entry is 16 bytes.\n\tmf, err := z.OpenMmapFile(fname, os.O_CREATE|os.O_RDWR, 1<<20)\n\tlf := &discardStats{\n\t\tMmapFile: mf,\n\t\topt:      opt,\n\t}\n\tif err == z.NewFile {\n\t\t// We don't need to zero out the entire 1MB.\n\t\tlf.zeroOut()\n\n\t} else if err != nil {\n\t\treturn nil, y.Wrapf(err, \"while opening file: %s\\n\", discardFname)\n\t}\n\n\tfor slot := 0; slot < lf.maxSlot(); slot++ {\n\t\tif lf.get(16*slot) == 0 {\n\t\t\tlf.nextEmptySlot = slot\n\t\t\tbreak\n\t\t}\n\t}\n\tsort.Sort(lf)\n\topt.Infof(\"Discard stats nextEmptySlot: %d\\n\", lf.nextEmptySlot)\n\treturn lf, nil\n}\n\nfunc (lf *discardStats) Len() int {\n\treturn lf.nextEmptySlot\n}\nfunc (lf *discardStats) Less(i, j int) bool {\n\treturn lf.get(16*i) < lf.get(16*j)\n}\nfunc (lf *discardStats) Swap(i, j int) {\n\tleft := lf.Data[16*i : 16*i+16]\n\tright := lf.Data[16*j : 16*j+16]\n\tvar tmp [16]byte\n\tcopy(tmp[:], left)\n\tcopy(left, right)\n\tcopy(right, tmp[:])\n}\n\n// offset is not slot.\nfunc (lf *discardStats) get(offset int) uint64 {\n\treturn binary.BigEndian.Uint64(lf.Data[offset : offset+8])\n}\nfunc (lf *discardStats) set(offset int, val uint64) {\n\tbinary.BigEndian.PutUint64(lf.Data[offset:offset+8], val)\n}\n\n// zeroOut would zero out the 
next slot.\nfunc (lf *discardStats) zeroOut() {\n\tlf.set(lf.nextEmptySlot*16, 0)\n\tlf.set(lf.nextEmptySlot*16+8, 0)\n}\n\nfunc (lf *discardStats) maxSlot() int {\n\treturn len(lf.Data) / 16\n}\n\n// Update would update the discard stats for the given file id. If discard is\n// 0, it would return the current value of discard for the file. If discard is\n// < 0, it would set the current value of discard to zero for the file.\nfunc (lf *discardStats) Update(fidu uint32, discard int64) int64 {\n\tfid := uint64(fidu)\n\tlf.Lock()\n\tdefer lf.Unlock()\n\n\tidx := sort.Search(lf.nextEmptySlot, func(slot int) bool {\n\t\treturn lf.get(slot*16) >= fid\n\t})\n\tif idx < lf.nextEmptySlot && lf.get(idx*16) == fid {\n\t\toff := idx*16 + 8\n\t\tcurDisc := lf.get(off)\n\t\tif discard == 0 {\n\t\t\treturn int64(curDisc)\n\t\t}\n\t\tif discard < 0 {\n\t\t\tlf.set(off, 0)\n\t\t\treturn 0\n\t\t}\n\t\tlf.set(off, curDisc+uint64(discard))\n\t\treturn int64(curDisc + uint64(discard))\n\t}\n\tif discard <= 0 {\n\t\t// No need to add a new entry.\n\t\treturn 0\n\t}\n\n\t// Could not find the fid. Add the entry.\n\tidx = lf.nextEmptySlot\n\tlf.set(idx*16, fid)\n\tlf.set(idx*16+8, uint64(discard))\n\n\t// Move to next slot.\n\tlf.nextEmptySlot++\n\tfor lf.nextEmptySlot >= lf.maxSlot() {\n\t\ty.Check(lf.Truncate(2 * int64(len(lf.Data))))\n\t}\n\tlf.zeroOut()\n\n\tsort.Sort(lf)\n\treturn discard\n}\n\nfunc (lf *discardStats) Iterate(f func(fid, stats uint64)) {\n\tfor slot := 0; slot < lf.nextEmptySlot; slot++ {\n\t\tidx := 16 * slot\n\t\tf(lf.get(idx), lf.get(idx+8))\n\t}\n}\n\n// MaxDiscard returns the file id with maximum discard bytes.\nfunc (lf *discardStats) MaxDiscard() (uint32, int64) {\n\tlf.Lock()\n\tdefer lf.Unlock()\n\n\tvar maxFid, maxVal uint64\n\tlf.Iterate(func(fid, val uint64) {\n\t\tif maxVal < val {\n\t\t\tmaxVal = val\n\t\t\tmaxFid = fid\n\t\t}\n\t})\n\treturn uint32(maxFid), int64(maxVal)\n}\n"
  },
  {
    "path": "discard_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"os\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n)\n\nfunc TestDiscardStats(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topt := DefaultOptions(dir)\n\tds, err := InitDiscardStats(opt)\n\trequire.NoError(t, err)\n\trequire.Zero(t, ds.nextEmptySlot)\n\tfid, _ := ds.MaxDiscard()\n\trequire.Zero(t, fid)\n\n\tfor i := uint32(0); i < 20; i++ {\n\t\trequire.Equal(t, int64(i*100), ds.Update(i, int64(i*100)))\n\t}\n\tds.Iterate(func(id, val uint64) {\n\t\trequire.Equal(t, id*100, val)\n\t})\n\tfor i := uint32(0); i < 10; i++ {\n\t\trequire.Equal(t, 0, int(ds.Update(i, -1)))\n\t}\n\tds.Iterate(func(id, val uint64) {\n\t\tif id < 10 {\n\t\t\trequire.Zero(t, val)\n\t\t\treturn\n\t\t}\n\t\trequire.Equal(t, int(id*100), int(val))\n\t})\n}\n\nfunc TestReloadDiscardStats(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topt := DefaultOptions(dir)\n\tdb, err := Open(opt)\n\trequire.NoError(t, err)\n\tds := db.vlog.discardStats\n\n\tds.Update(uint32(1), 1)\n\tds.Update(uint32(2), 1)\n\tds.Update(uint32(1), -1)\n\trequire.NoError(t, db.Close())\n\n\t// Reopen the DB, discard stats should be same.\n\tdb2, err := Open(opt)\n\trequire.NoError(t, err)\n\tds2 := db2.vlog.discardStats\n\trequire.Zero(t, ds2.Update(uint32(1), 0))\n\trequire.Equal(t, 1, int(ds2.Update(uint32(2), 0)))\n}\n"
  },
  {
    "path": "doc.go",
    "content": "/*\nPackage badger implements an embeddable, simple and fast key-value database,\nwritten in pure Go. It is designed to be highly performant for both reads and\nwrites simultaneously. Badger uses Multi-Version Concurrency Control (MVCC), and\nsupports transactions. It runs transactions concurrently, with serializable\nsnapshot isolation guarantees.\n\nBadger uses an LSM tree along with a value log to separate keys from values,\nhence reducing both write amplification and the size of the LSM tree. This\nallows LSM tree to be served entirely from RAM, while the values are served\nfrom SSD.\n\n# Usage\n\nBadger has the following main types: DB, Txn, Item and Iterator. DB contains\nkeys that are associated with values. It must be opened with the appropriate\noptions before it can be accessed.\n\nAll operations happen inside a Txn. Txn represents a transaction, which can\nbe read-only or read-write. Read-only transactions can read values for a\ngiven key (which are returned inside an Item), or iterate over a set of\nkey-value pairs using an Iterator (which are returned as Item type values as\nwell). Read-write transactions can also update and delete keys from the DB.\n\nSee the examples for more usage details.\n*/\npackage badger\n"
  },
  {
    "path": "docs/design.md",
    "content": "# Design\n\nWe wrote Badger with these design goals in mind:\n\n- Write a key-value database in pure Go\n- Use latest research to build the fastest KV database for data sets spanning terabytes\n- Optimize for modern storage devices\n\nBadger’s design is based on a paper titled\n[WiscKey: Separating Keys from Values in SSD-conscious Storage](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf).\n\n## References\n\nThe following blog posts are a great starting point for learning more about Badger and the\nunderlying design principles:\n\n- [Introducing Badger: A fast key-value store written natively in Go](https://hypermode.com/blog/badger/)\n- [Make Badger crash resilient with ALICE](https://hypermode.com/blog/alice/)\n- [Badger vs LMDB vs BoltDB: Benchmarking key-value databases in Go](https://hypermode.com/blog/badger-lmdb-boltdb/)\n- [Concurrent ACID Transactions in Badger](https://hypermode.com/blog/badger-txn/)\n\n## Comparisons\n\n| Feature                       | Badger                                 | RocksDB                      | BoltDB  |\n| ----------------------------- | -------------------------------------- | ---------------------------- | ------- |\n| Design                        | LSM tree with value log                | LSM tree only                | B+ tree |\n| High Read throughput          | Yes                                    | No                           | Yes     |\n| High Write throughput         | Yes                                    | Yes                          | No      |\n| Designed for SSDs             | Yes (with latest research<sup>1</sup>) | Not specifically<sup>2</sup> | No      |\n| Embeddable                    | Yes                                    | Yes                          | Yes     |\n| Sorted KV access              | Yes                                    | Yes                          | Yes     |\n| Pure Go (no Cgo)              | Yes                                    | 
No                           | Yes     |\n| Transactions                  | Yes                                    | Yes                          | Yes     |\n| ACID-compliant                | Yes, concurrent with SSI<sup>3</sup>   | No                           | Yes     |\n| Snapshots                     | Yes                                    | Yes                          | Yes     |\n| TTL support                   | Yes                                    | Yes                          | No      |\n| 3D access (key-value-version) | Yes<sup>4</sup>                        | No                           | No      |\n\n<sup>1</sup> The WiscKey paper (on which Badger is based) saw big wins with separating values from\nkeys, significantly reducing the write amplification compared to a typical LSM tree.\n\n<sup>2</sup> RocksDB is an SSD-optimized version of LevelDB, which was designed specifically for\nrotating disks. As such, RocksDB's design isn't aimed at SSDs.\n\n<sup>3</sup> SSI: Serializable Snapshot Isolation. For more details, see the blog post\n[Concurrent ACID Transactions in Badger](https://hypermode.com/blog/badger-txn/).\n\n<sup>4</sup> Badger provides direct access to value versions via its Iterator API. Users can also\nspecify how many versions to keep per key via Options.\n\n## Benchmarks\n\nWe've run comprehensive benchmarks against RocksDB, BoltDB, and LMDB. The benchmarking code and\ndetailed logs are in the [badger-bench](https://github.com/dgraph-io/badger-bench) repo.\n"
  },
  {
    "path": "docs/encryption-at-rest.md",
    "content": "# Encryption at Rest in Dgraph and Badger\n\nBadger provides encryption at rest using AES encryption, enabling compliance with security standards\nsuch as HIPAA and PCI DSS. This feature was introduced in Badger v2 and is available to all systems\nbuilt on Badger, including Dgraph.\n\n## Overview\n\nBadger implements encryption at the storage layer, allowing systems like Dgraph to inherit\nencryption capabilities without additional implementation. This separation of concerns means:\n\n- Badger manages data security and encryption at the disk level\n- Higher-level systems like Dgraph focus on distributed operations and graph semantics\n- All Badger-based systems benefit from encryption improvements\n\n## Encryption Algorithm\n\nBadger uses the\n[Advanced Encryption Standard (AES)](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard),\nstandardized by NIST and widely adopted across databases including MongoDB, SQLite, and RocksDB. AES\nis a symmetric encryption algorithm: the same key encrypts and decrypts data.\n\nAES key sizes: 128, 192, or 256 bits. All provide strong security; 128-bit keys are computationally\ninfeasible to brute force.\n\n## Key Management\n\nBadger uses a two-tier key system:\n\n### Master Key\n\nThe user-provided AES encryption key that encrypts data keys. Master key length determines AES\nvariant:\n\n- 16 bytes: AES-128\n- 24 bytes: AES-192\n- 32 bytes: AES-256\n\n**Important:** Use a cryptographically secure random key. Never use predictable strings. Generate\nkeys using a password manager or secure random generator.\n\n### Data Keys\n\nAuto-generated keys that encrypt actual data on disk. Each encrypted data key is stored alongside\nthe encrypted data. 
Master keys encrypt data keys, not data directly.\n\n**Benefits:**\n\n- Master key rotation only requires re-encrypting data keys (small, fast operation)\n- Data keys rotate automatically without re-encrypting all data\n- Minimal performance impact during key rotation\n\n## Key Rotation\n\n### Data Key Rotation\n\nBadger automatically rotates data keys every 10 days by default. Configure the rotation interval\nusing `Options.WithEncryptionKeyRotationDuration`.\n\nAll historical data keys are retained to decrypt older data. Each data key is 32 bytes; 1000 keys\nconsume 32KB. At 10-day intervals, this represents approximately 27 years of keys.\n\n### Master Key Rotation\n\nUsers must manually rotate master keys. Use the `rotate` command:\n\n```shell\nbadger rotate --dir=badger_dir --old-key-path=old/path --new-key-path=new/path\n```\n\n**Requirements:**\n\n- Database must be offline during master key rotation\n- Only data keys are re-encrypted (fast operation)\n- Future versions may support online rotation\n\n## Initialization Vectors\n\nTo prevent identical plaintext from producing identical ciphertext, Badger uses Initialization\nVectors (IVs).\n\n### SSTable Encryption\n\nEach 4KB block in SSTables uses a unique 16-byte IV stored in plaintext at the end of the encrypted\nblock. Storage overhead: 0.4% (16 bytes per 4KB block).\n\n**Security:** IVs can be stored in plaintext. Decryption requires the data key, which requires the\nmaster key. Knowledge of the IV alone is insufficient.\n\n### Value Log Encryption\n\nValue log entries are encrypted individually to match access patterns. 
To minimize storage overhead,\nBadger uses a 12-byte file-level IV combined with a 4-byte value offset to form the 16-byte IV.\n\n**Benefits:**\n\n- Saves 16 bytes per value entry\n- 12-byte overhead per vlog file (vs 16 bytes per value)\n- For 10,000 entries: 12 bytes total vs 160,000 bytes with per-value IVs\n\n## Enabling Encryption\n\n### New Database\n\nEnable encryption when creating a new database:\n\n```go\nopts := badger.DefaultOptions(\"/tmp/badger\").\n    WithEncryptionKey(masterKey).\n    WithEncryptionKeyRotationDuration(dataKeyRotationDuration) // defaults to 10 days\n```\n\n### Existing Database\n\nEnable encryption on an unencrypted database:\n\n```shell\nbadger rotate --dir=badger_dir --new-key-path=new/path\n```\n\n**Note:** This enables encryption for new files only. Existing data is encrypted during compaction\nas new files are generated. Badger operates in hybrid mode, tracking encryption status per file.\n\n### Immediate Full Encryption\n\nTo encrypt all existing data immediately:\n\n1. Export the database: `badger backup --dir=badger_dir -f backup.bak`\n2. Create a new encrypted database instance\n3. Restore the data: `badger restore --dir=new_badger_dir -f backup.bak`\n\nAlternatively, use the Stream Framework and StreamWriter interface for in-place encryption with high\nthroughput.\n\n## Security Considerations\n\n### Key Security\n\n- Store master keys securely (key management service, secure vault)\n- Rotate master keys regularly\n- Use strong, randomly generated keys\n- Protect physical access to systems performing encryption\n\n### Key Leakage\n\nKey security is more critical than key size. 
Threats include:\n\n- Side-channel attacks (electromagnetic radiation analysis)\n- Key reuse patterns enabling cryptanalysis\n- Physical access to encryption systems\n\nRegular key rotation mitigates these risks.\n\n## Terminology\n\nIn this context, \"key\" refers to:\n\n- **Database key**: The key in a key-value pair stored in Badger\n- **Encryption key**: The cryptographic key used for encryption/decryption (master key or data key)\n\nWhen ambiguous, this document uses \"encryption key\" for cryptographic keys.\n"
  },
  {
    "path": "docs/index.md",
    "content": "# Badger\n\n## What is Badger?\n\nBadgerDB is an embeddable, persistent, and fast key-value (KV) database written in pure Go. It is\nthe underlying database for [Dgraph](https://github.com/dgraph-io/dgraph), a fast, distributed graph\ndatabase. It is meant to be an efficient alternative to non-Go-based key-value stores like RocksDB.\n\n## Changelog\n\nWe keep the [repo Changelog](https://github.com/dgraph-io/badger/blob/main/CHANGELOG.md) up to date\nwith each release.\n\n[Quickstart](quickstart.md)\n\n[Design](design.md)\n\n[Encryption at rest](encryption-at-rest.md)\n\n[Troubleshooting](troubleshooting.md)\n"
  },
  {
    "path": "docs/quickstart.md",
    "content": "# Quickstart\n\n## Prerequisites\n\n- [Go](https://go.dev/doc/install) - v1.23 or higher\n- Text editor - we recommend [VS Code](https://code.visualstudio.com/)\n- Terminal - access Badger through a command-line interface (CLI)\n\n## Installing\n\nTo start using Badger, run the following command to retrieve the library.\n\n```sh\ngo get github.com/dgraph-io/badger/v4\n```\n\nThen, install the Badger command line utility into your `$GOBIN` path.\n\n```sh\ngo install github.com/dgraph-io/badger/v4/badger@latest\n```\n\n## Opening a database\n\nThe top-level object in Badger is a `DB`. It represents multiple files on disk in specific\ndirectories, which contain the data for a single database.\n\nTo open your database, use the `badger.Open()` function, with the appropriate options. The `Dir` and\n`ValueDir` options are mandatory and you must specify them in your client. To simplify, you can set\nboth options to the same value.\n\n<Note>\n  Badger obtains a lock on the directories. Multiple processes can't open the\n  same database at the same time.\n</Note>\n\n```go\npackage main\n\nimport (\n  \"log\"\n\n  badger \"github.com/dgraph-io/badger/v4\"\n)\n\nfunc main() {\n  // Open the Badger database located in the /tmp/badger directory.\n  // It is created if it doesn't exist.\n  db, err := badger.Open(badger.DefaultOptions(\"/tmp/badger\"))\n  if err != nil {\n    log.Fatal(err)\n  }\n\n  defer db.Close()\n\n  // your code here\n}\n```\n\n### In-memory/diskless mode\n\nBy default, Badger ensures all data persists to disk. It also supports a pure in-memory mode. When\nBadger is running in this mode, all data remains in memory only. Reads and writes are much faster,\nbut Badger loses all stored data in the case of a crash or close. 
To open Badger in in-memory mode,\nset the `InMemory` option.\n\n```go\nopt := badger.DefaultOptions(\"\").WithInMemory(true)\n```\n\n### Encryption mode\n\nIf you enable encryption in Badger, you also need to set the index cache size.\n\n<Tip>\n  The cache improves performance. Otherwise, reads can be very slow with\n  encryption enabled.\n</Tip>\n\nFor example, to set a `100 MB` cache:\n\n```go\nopts.IndexCacheSize = 100 << 20 // 100 MB or some other size based on the amount of data\n```\n\n## Transactions\n\n### Read-only transactions\n\nTo start a read-only transaction, you can use the `DB.View()` method:\n\n```go\nerr := db.View(func(txn *badger.Txn) error {\n  // your code here\n\n  return nil\n})\n```\n\nYou can't perform any writes or deletes within this transaction. Badger ensures that you get a\nconsistent view of the database within this closure. Any writes that happen elsewhere after the\ntransaction has started aren't seen by calls made within the closure.\n\n### Read-write transactions\n\nTo start a read-write transaction, you can use the `DB.Update()` method:\n\n```go\nerr := db.Update(func(txn *badger.Txn) error {\n  // Your code here…\n  return nil\n})\n```\n\nBadger allows all database operations inside a read-write transaction.\n\nAlways check the returned error value. If you return an error within your closure, it's passed\nthrough to the caller.\n\nAn `ErrConflict` error is reported in case of a conflict. Depending on the state of your app, you\nhave the option to retry the operation if you receive this error.\n\nAn `ErrTxnTooBig` error is reported in case the number of pending writes/deletes in the transaction\nexceeds a certain limit. In that case, it's best to commit the transaction and start a new\ntransaction immediately. 
Here is an example (we aren't checking for errors in some places for\nsimplicity):\n\n```go\nupdates := make(map[string]string)\ntxn := db.NewTransaction(true)\nfor k,v := range updates {\n  if err := txn.Set([]byte(k),[]byte(v)); err == badger.ErrTxnTooBig {\n    _ = txn.Commit()\n    txn = db.NewTransaction(true)\n    _ = txn.Set([]byte(k),[]byte(v))\n  }\n}\n_ = txn.Commit()\n```\n\n### Managing transactions manually\n\nThe `DB.View()` and `DB.Update()` methods are wrappers around the `DB.NewTransaction()` and\n`Txn.Commit()` methods (or `Txn.Discard()` in case of read-only transactions). These helper methods\nstart the transaction, execute a function, and then safely discard your transaction if an error is\nreturned. This is the recommended way to use Badger transactions.\n\nHowever, sometimes you may want to manually create and commit your transactions. You can use the\n`DB.NewTransaction()` function directly, which takes in a boolean argument to specify whether a\nread-write transaction is required. For read-write transactions, it's necessary to call\n`Txn.Commit()` to ensure the transaction is committed. For read-only transactions, calling\n`Txn.Discard()` is sufficient. `Txn.Commit()` also calls `Txn.Discard()` internally to cleanup the\ntransaction, so just calling `Txn.Commit()` is sufficient for read-write transaction. However, if\nyour code doesn’t call `Txn.Commit()` for some reason (for e.g it returns prematurely with an\nerror), then please make sure you call `Txn.Discard()` in a `defer` block. 
Refer to the code below.\n\n```go\n// Start a writable transaction.\ntxn := db.NewTransaction(true)\ndefer txn.Discard()\n\n// Use the transaction...\nerr := txn.Set([]byte(\"answer\"), []byte(\"42\"))\nif err != nil {\n    return err\n}\n\n// Commit the transaction and check for error.\nif err := txn.Commit(); err != nil {\n    return err\n}\n```\n\nThe first argument to `DB.NewTransaction()` is a boolean stating if the transaction should be\nwritable.\n\nBadger allows an optional callback to the `Txn.Commit()` method. Normally, the callback can be set\nto `nil`, and the method returns after all the writes have succeeded. However, if this callback is\nprovided, the `Txn.Commit()` method returns as soon as it has checked for any conflicts. The actual\nwriting to the disk happens asynchronously, and the callback is invoked once the writing has\nfinished, or an error has occurred. This can improve the throughput of the app in some cases. But it\nalso means that a transaction isn't durable until the callback has been invoked with a `nil` error\nvalue.\n\n## Using key/value pairs\n\nTo save a key/value pair, use the `Txn.Set()` method:\n\n```go\nerr := db.Update(func(txn *badger.Txn) error {\n  err := txn.Set([]byte(\"answer\"), []byte(\"42\"))\n  return err\n})\n```\n\nKey/Value pair can also be saved by first creating `Entry`, then setting this `Entry` using\n`Txn.SetEntry()`. `Entry` also exposes methods to set properties on it.\n\n```go\nerr := db.Update(func(txn *badger.Txn) error {\n  e := badger.NewEntry([]byte(\"answer\"), []byte(\"42\"))\n  err := txn.SetEntry(e)\n  return err\n})\n```\n\nThis sets the value of the `\"answer\"` key to `\"42\"`. 
To retrieve this value, we can use the\n`Txn.Get()` method:\n\n```go\nerr := db.View(func(txn *badger.Txn) error {\n  item, err := txn.Get([]byte(\"answer\"))\n  handle(err)\n\n  var valNot, valCopy []byte\n  err = item.Value(func(val []byte) error {\n    // This func with val would only be called if item.Value encounters no error.\n\n    // Accessing val here is valid.\n    fmt.Printf(\"The answer is: %s\\n\", val)\n\n    // Copying or parsing val is valid.\n    valCopy = append([]byte{}, val...)\n\n    // Assigning val slice to another variable is NOT OK.\n    valNot = val // Do not do this.\n    return nil\n  })\n  handle(err)\n\n  // DO NOT access val here. It is the most common cause of bugs.\n  fmt.Printf(\"NEVER do this. %s\\n\", valNot)\n\n  // You must copy it to use it outside item.Value(...).\n  fmt.Printf(\"The answer is: %s\\n\", valCopy)\n\n  // Alternatively, you could also use item.ValueCopy().\n  valCopy, err = item.ValueCopy(nil)\n  handle(err)\n  fmt.Printf(\"The answer is: %s\\n\", valCopy)\n\n  return nil\n})\n```\n\n`Txn.Get()` returns `ErrKeyNotFound` if the key isn't found.\n\nPlease note that values returned from `Get()` are only valid while the transaction is open. If you\nneed to use a value outside of the transaction then you must use `copy()` to copy it to another byte\nslice.\n\nUse the `Txn.Delete()` method to delete a key.\n\n## Monotonically increasing integers\n\nTo get unique monotonically increasing integers with strong durability, you can use the\n`DB.GetSequence` method. This method returns a `Sequence` object, which is thread-safe and can be\nused concurrently via various goroutines.\n\nBadger leases a range of integers to hand out from memory, with the bandwidth provided to\n`DB.GetSequence`. The frequency at which disk writes are done is determined by this lease bandwidth\nand the frequency of `Next` invocations. 
Setting a bandwidth too low would do more disk writes,\nsetting it too high would result in wasted integers if Badger is closed or crashes. To avoid wasted\nintegers, call `Release` before closing Badger.\n\n```go\nseq, err := db.GetSequence(key, 1000)\ndefer seq.Release()\nfor {\n  num, err := seq.Next()\n}\n```\n\n## Merge operations\n\nBadger provides support for ordered merge operations. You can define a func of type `MergeFunc`\nwhich takes in an existing value, and a value to be _merged_ with it. It returns a new value which\nis the result of the merge operation. All values are specified in byte arrays. For example, this is\na merge function (`add`) which appends a `[]byte` value to an existing `[]byte` value.\n\n```go\n// Merge function to append one byte slice to another\nfunc add(originalValue, newValue []byte) []byte {\n  return append(originalValue, newValue...)\n}\n```\n\nThis function can then be passed to the `DB.GetMergeOperator()` method, along with a key, and a\nduration value. 
The duration specifies how often the merge function is run on values that have been\nadded using the `MergeOperator.Add()` method.\n\n`MergeOperator.Get()` method can be used to retrieve the cumulative value of the key associated with\nthe merge operation.\n\n```go\nkey := []byte(\"merge\")\n\nm := db.GetMergeOperator(key, add, 200*time.Millisecond)\ndefer m.Stop()\n\nm.Add([]byte(\"A\"))\nm.Add([]byte(\"B\"))\nm.Add([]byte(\"C\"))\n\nres, _ := m.Get() // res should have value ABC encoded\n```\n\nExample: merge operator which increments a counter\n\n```go\nfunc uint64ToBytes(i uint64) []byte {\n  var buf [8]byte\n  binary.BigEndian.PutUint64(buf[:], i)\n  return buf[:]\n}\n\nfunc bytesToUint64(b []byte) uint64 {\n  return binary.BigEndian.Uint64(b)\n}\n\n// Merge function to add two uint64 numbers\nfunc add(existing, new []byte) []byte {\n  return uint64ToBytes(bytesToUint64(existing) + bytesToUint64(new))\n}\n```\n\nIt can be used as\n\n```go\nkey := []byte(\"merge\")\n\nm := db.GetMergeOperator(key, add, 200*time.Millisecond)\ndefer m.Stop()\n\nm.Add(uint64ToBytes(1))\nm.Add(uint64ToBytes(2))\nm.Add(uint64ToBytes(3))\n\nres, _ := m.Get() // res should have value 6 encoded\n```\n\n## Setting time to live and user metadata on keys\n\nBadger allows setting an optional Time to Live (TTL) value on keys. Once the TTL has elapsed, the\nkey is no longer retrievable and is eligible for garbage collection. A TTL can be set as a\n`time.Duration` value using the `Entry.WithTTL()` and `Txn.SetEntry()` API methods.\n\n```go\nerr := db.Update(func(txn *badger.Txn) error {\n  e := badger.NewEntry([]byte(\"answer\"), []byte(\"42\")).WithTTL(time.Hour)\n  err := txn.SetEntry(e)\n  return err\n})\n```\n\nAn optional user metadata value can be set on each key. A user metadata value is represented by a\nsingle byte. It can be used to set certain bits along with the key to aid in interpreting or\ndecoding the key-value pair. 
User metadata can be set using `Entry.WithMeta()` and `Txn.SetEntry()`\nAPI methods.\n\n```go\nerr := db.Update(func(txn *badger.Txn) error {\n  e := badger.NewEntry([]byte(\"answer\"), []byte(\"42\")).WithMeta(byte(1))\n  err := txn.SetEntry(e)\n  return err\n})\n```\n\n`Entry` APIs can be used to add the user metadata and TTL for same key. This `Entry` then can be set\nusing `Txn.SetEntry()`.\n\n```go\nerr := db.Update(func(txn *badger.Txn) error {\n  e := badger.NewEntry([]byte(\"answer\"), []byte(\"42\")).WithMeta(byte(1)).WithTTL(time.Hour)\n  err := txn.SetEntry(e)\n  return err\n})\n```\n\n## Iterating over keys\n\nTo iterate over keys, we can use an `Iterator`, which can be obtained using the `Txn.NewIterator()`\nmethod. Iteration happens in byte-wise lexicographical sorting order.\n\n```go\nerr := db.View(func(txn *badger.Txn) error {\n  opts := badger.DefaultIteratorOptions\n  opts.PrefetchSize = 10\n  it := txn.NewIterator(opts)\n  defer it.Close()\n  for it.Rewind(); it.Valid(); it.Next() {\n    item := it.Item()\n    k := item.Key()\n    err := item.Value(func(v []byte) error {\n      fmt.Printf(\"key=%s, value=%s\\n\", k, v)\n      return nil\n    })\n    if err != nil {\n      return err\n    }\n  }\n  return nil\n})\n```\n\nThe iterator allows you to move to a specific point in the list of keys and move forward or backward\nthrough the keys one at a time.\n\nBy default, Badger prefetches the values of the next 100 items. You can adjust that with the\n`IteratorOptions.PrefetchSize` field. However, setting it to a value higher than `GOMAXPROCS` (which\nwe recommend to be 128 or higher) shouldn’t give any additional benefits. You can also turn off the\nfetching of values altogether. 
See the section below on key-only iteration.\n\n### Prefix scans\n\nTo iterate over a key prefix, you can combine `Seek()` and `ValidForPrefix()`:\n\n```go\ndb.View(func(txn *badger.Txn) error {\n  it := txn.NewIterator(badger.DefaultIteratorOptions)\n  defer it.Close()\n  prefix := []byte(\"1234\")\n  for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {\n    item := it.Item()\n    k := item.Key()\n    err := item.Value(func(v []byte) error {\n      fmt.Printf(\"key=%s, value=%s\\n\", k, v)\n      return nil\n    })\n    if err != nil {\n      return err\n    }\n  }\n  return nil\n})\n```\n\n### Possible pagination implementation using Prefix scans\n\nConsidering that iteration happens in **byte-wise lexicographical sorting** order, it's possible to\ncreate a sorting-sensitive key. For example, a simple blog post key might look\nlike `feed:userUuid:timestamp:postUuid`. Here, the `timestamp` part of the key is treated as an\nattribute, and items are stored in the corresponding order:\n\n| Order Ascending | Key                                                           |\n| :-------------: | :------------------------------------------------------------ |\n|        1        | feed:tQpnEDVRoCxTFQDvyQEzdo:1733127486:pprRrNL2WP4yfVXsSNBSx6 |\n|        2        | feed:tQpnEDVRoCxTFQDvyQEzdo:1733127533:1Mryrou1xoekEaxzrFiHwL |\n|        3        | feed:tQpnEDVRoCxTFQDvyQEzdo:1733127889:tQpnEDVRoCxTFQDvyQEzdo |\n\nIt is important to properly configure keys for lexicographical sorting to avoid incorrect ordering.\n\nA **prefix scan** through the preceding keys can be achieved using the prefix\n`feed:tQpnEDVRoCxTFQDvyQEzdo`. All matching keys are returned, sorted by `timestamp`.  
\nSorting can be done in ascending or descending order based on `timestamp` or `reversed timestamp` as\nneeded:\n\n```go\nreversedTimestamp := math.MaxInt64 - time.Now().Unix()\n```\n\nThis makes it possible to implement simple pagination by using a limit for the number of keys and a\ncursor (the key at which the previous iteration stopped) to identify where to resume.\n\n```go\n// startCursor may look like 'feed:tQpnEDVRoCxTFQDvyQEzdo:1733127486'.\n// A prefix scan with this cursor locates the specific key where\n// the previous iteration stopped.\nerr = db.badger.View(func(txn *badger.Txn) error {\n        it := txn.NewIterator(opts)\n        defer it.Close()\n\n        // Prefix example 'feed:tQpnEDVRoCxTFQDvyQEzdo'\n        // if no cursor provided prefix scan starts from the beginning\n        p := prefix\n        if startCursor != nil {\n             p = startCursor\n        }\n        iterNum := 0 // Tracks the number of iterations to enforce the limit.\n        for it.Seek(p); it.ValidForPrefix(p); it.Next() {\n            // The method it.ValidForPrefix ensures that iteration continues\n            // as long as keys match the prefix.\n            // For example, if p = 'feed:tQpnEDVRoCxTFQDvyQEzdo:1733127486',\n            // it matches keys like\n            // 'feed:tQpnEDVRoCxTFQDvyQEzdo:1733127486:pprRrNL2WP4yfVXsSNBSx6'.\n\n            // Once the starting point for iteration is found, revert the prefix\n            // back to 'feed:tQpnEDVRoCxTFQDvyQEzdo' to continue iterating sequentially.\n            // Otherwise, iteration would stop after a single prefix-key match.\n            p = prefix\n\n            item := it.Item()\n            key := string(item.Key())\n\n            if iterNum >= limit { // Limit reached.\n                nextCursor = key // Save the next cursor for future iterations.\n                return nil\n            }\n            iterNum++ // Increment iteration count.\n\n            err := item.Value(func(v []byte) error {\n            
    fmt.Printf(\"key=%s, value=%s\\n\", key, v)\n                return nil\n            })\n            if err != nil {\n                return err\n            }\n        }\n        // If the number of iterations is less than the limit,\n        // it means there are no more items for the prefix.\n        if iterNum < limit {\n            nextCursor = \"\"\n        }\n        return nil\n    })\nreturn nextCursor, err\n```\n\n### Key-only iteration\n\nBadger supports a unique mode of iteration called _key-only_ iteration. It is several orders of\nmagnitude faster than regular iteration, because it involves access to the Log-structured merge\n(LSM)-tree only, which is usually resident entirely in RAM. To enable key-only iteration, you need\nto set the `IteratorOptions.PrefetchValues` field to `false`. This can also be used to do sparse\nreads for selected keys during an iteration, by calling `item.Value()` only when required.\n\n```go\nerr := db.View(func(txn *badger.Txn) error {\n  opts := badger.DefaultIteratorOptions\n  opts.PrefetchValues = false\n  it := txn.NewIterator(opts)\n  defer it.Close()\n  for it.Rewind(); it.Valid(); it.Next() {\n    item := it.Item()\n    k := item.Key()\n    fmt.Printf(\"key=%s\\n\", k)\n  }\n  return nil\n})\n```\n\n## Stream\n\nBadger provides a Stream framework, which concurrently iterates over all or a portion of the DB,\nconverting data into custom key-values, and streams it out serially to be sent over network, written\nto disk, or even written back to Badger. This is a much faster way to iterate over Badger than using\na single Iterator. Stream supports Badger in both managed and normal mode.\n\nStream uses the natural boundaries created by SSTables within the Log-structured merge (LSM)-tree, to\nquickly generate key ranges. Each goroutine then picks a range and runs an iterator to iterate over\nit. 
Each iterator iterates over all versions of values and is created from the same transaction,\nthus working over a snapshot of the DB. Every time a new key is encountered, it calls\n`ChooseKey(item)`, followed by `KeyToList(key, itr)`. This allows a user to select or reject that\nkey, and if selected, convert the value versions into custom key-values. The goroutine batches up 4\nMB worth of key-values, before sending it over to a channel. Another goroutine further batches up\ndata from this channel using _smart batching_ algorithm and calls `Send` serially.\n\nThis framework is designed for high throughput key-value iteration, spreading the work of iteration\nacross many goroutines. `DB.Backup` uses this framework to provide full and incremental backups\nquickly. Dgraph is a heavy user of this framework. In fact, this framework was developed and used\nwithin Dgraph, before getting ported over to Badger.\n\n```go\nstream := db.NewStream()\n// db.NewStreamAt(readTs) for managed mode.\n\n// -- Optional settings\nstream.NumGo = 16                     // Set number of goroutines to use for iteration.\nstream.Prefix = []byte(\"some-prefix\") // Leave nil for iteration over the whole DB.\nstream.LogPrefix = \"Badger.Streaming\" // For identifying stream logs. Outputs to Logger.\n\n// ChooseKey is called concurrently for every key. If left nil, assumes true by default.\nstream.ChooseKey = func(item *badger.Item) bool {\n  return bytes.HasSuffix(item.Key(), []byte(\"er\"))\n}\n\n// KeyToList is called concurrently for chosen keys. This can be used to convert\n// Badger data into custom key-values. 
If nil, uses stream.ToList, a default\n// implementation, which picks all valid key-values.\nstream.KeyToList = nil\n\n// -- End of optional settings.\n\n// Send is called serially, while Stream.Orchestrate is running.\nstream.Send = func(list *pb.KVList) error {\n  return proto.MarshalText(w, list) // Write to w.\n}\n\n// Run the stream\nif err := stream.Orchestrate(context.Background()); err != nil {\n  return err\n}\n// Done.\n```\n\n## Garbage collection\n\nBadger values need to be garbage collected, because of two reasons:\n\n- Badger keeps values separately from the Log-structure merge (LSM)-tree. This means that the\n  compaction operations that clean up the LSM tree do not touch the values at all. Values need to be\n  cleaned up separately.\n\n- Concurrent read/write transactions could leave behind multiple values for a single key, because\n  they're stored with different versions. These could accumulate, and take up unneeded space beyond\n  the time these older versions are needed.\n\nBadger relies on the client to perform garbage collection at a time of their choosing. It provides\nthe following method, which can be invoked at an appropriate time:\n\n- `DB.RunValueLogGC()`: This method is designed to do garbage collection while Badger is online.\n  Along with randomly picking a file, it uses statistics generated by the LSM tree compactions to\n  pick files that are likely to lead to maximum space reclamation. It is recommended to be called\n  during periods of low activity in your system, or periodically. One call would only result in\n  removal of at max one log file. 
As an optimization, you could also immediately re-run it whenever\n  it returns nil error (indicating a successful value log GC), as shown below.\n\n  ```go\n  ticker := time.NewTicker(5 * time.Minute)\n  defer ticker.Stop()\n  for range ticker.C {\n  again:\n    err := db.RunValueLogGC(0.7)\n    if err == nil {\n      goto again\n    }\n  }\n  ```\n\n- `DB.PurgeOlderVersions()`: This method is **DEPRECATED** since v1.5.0. Now, Badger's LSM tree\n  automatically discards older/invalid versions of keys.\n\n<Note>\n  The `RunValueLogGC` method would not garbage collect the latest value log.\n</Note>\n\n## Database backup\n\nThere are two public API methods `DB.Backup()` and `DB.Load()` which can be used to do online\nbackups and restores. Badger v0.9 provides a CLI tool `badger`, which can do offline backup/restore.\nMake sure you have `$GOPATH/bin` in your PATH to use this tool.\n\nThe command below creates a version-agnostic backup of the database, to a file `badger.bak` in the\ncurrent working directory\n\n```sh\nbadger backup --dir <path/to/badgerdb>\n```\n\nTo restore `badger.bak` in the current working directory to a new database:\n\n```sh\nbadger restore --dir <path/to/badgerdb>\n```\n\nSee `badger --help` for more details.\n\nIf you have a Badger database that was created using v0.8 (or below), you can use the\n`badger_backup` tool provided in v0.8.1, and then restore it using the preceding command to upgrade\nyour database to work with the latest version.\n\n```sh\nbadger_backup --dir <path/to/badgerdb> --backup-file badger.bak\n```\n\nWe recommend all users to use the `Backup` and `Restore` APIs and tools. However, Badger is also\nrsync-friendly because all files are immutable, barring the latest value log which is append-only.\nSo, rsync can be used as rudimentary way to perform a backup. 
In the following script, we repeat\nrsync to ensure that the LSM tree remains consistent with the MANIFEST file while doing a full\nbackup.\n\n```sh\n#!/bin/bash\nset -o history\nset -o histexpand\n# Makes a complete copy of a Badger database directory.\n# Repeat rsync if the MANIFEST and SSTables are updated.\nrsync -avz --delete db/ dst\nwhile !! | grep -qE \"(MANIFEST|\\.sst)$\"; do :; done\n```\n\n## Memory usage\n\nBadger's memory usage can be managed by tweaking several options available in the `Options` struct\nthat's passed in when opening the database using `badger.Open`.\n\n- Number of memtables (`Options.NumMemtables`)\n  - If you modify `Options.NumMemtables`, also adjust `Options.NumLevelZeroTables` and\n    `Options.NumLevelZeroTablesStall` accordingly.\n- Number of concurrent compactions (`Options.NumCompactors`)\n- Size of table (`Options.BaseTableSize`)\n- Size of value log file (`Options.ValueLogFileSize`)\n\nIf you want to decrease the memory usage of a Badger instance, tweak these options (ideally one at a\ntime) until you achieve the desired memory usage.\n"
  },
  {
    "path": "docs/troubleshooting.md",
    "content": "# Troubleshooting\n\n## Writes are getting stuck\n\n**Update: with the new `Value(func(v []byte))` API, this deadlock can no longer happen.**\n\nThe following is true for users on Badger v1.x.\n\nThis can happen if a long-running iteration has `Prefetch` set to false, but an `Item::Value`\ncall is made inside the loop. That causes Badger to acquire read locks over the value log\nfiles to avoid value log GC removing the file from underneath. As a side effect, this also blocks a\nnew value log GC file from being created, when the value log file boundary is hit.\n\nPlease see GitHub issues [#293](https://github.com/dgraph-io/badger/issues/293) and\n[#315](https://github.com/dgraph-io/badger/issues/315).\n\nThere are multiple workarounds during iteration:\n\n1. Use `Item::ValueCopy` instead of `Item::Value` when retrieving value.\n1. Set `Prefetch` to true. Badger would then copy over the value and release the file lock\n   immediately.\n1. When `Prefetch` is false, don't call `Item::Value` and do a pure key-only iteration. This might\n   be useful if you just want to delete a lot of keys.\n1. Do the writes in a separate transaction after the reads.\n\n## Writes are really slow\n\nAre you creating a new transaction for every single key update, and waiting for it to `Commit` fully\nbefore creating a new one? This leads to very low throughput.\n\nWe've created the `WriteBatch` API, which provides a way to batch up many updates into a single\ntransaction and `Commit` that transaction using callbacks to avoid blocking. This amortizes the cost\nof a transaction really well, and provides the most efficient way to do bulk writes.\n\n```go\nwb := db.NewWriteBatch()\ndefer wb.Cancel()\n\nfor i := 0; i < N; i++ {\n  err := wb.Set(key(i), value(i)) // Will create txns as needed.\n  handle(err)\n}\nhandle(wb.Flush()) // Wait for all txns to finish.\n```\n\nNote that the `WriteBatch` API doesn't allow any reads. 
For read-modify-write workloads, you should be\nusing the `Transaction` API.\n\n## I don't see any disk writes\n\nIf you're using Badger with `SyncWrites=false`, then your writes might not be written to value log\nand won't get synced to disk immediately. Writes to LSM tree are done in-memory first, before they\nget compacted to disk. The compaction would only happen once `BaseTableSize` has been reached. So,\nif you're doing a few writes and then checking, you might not see anything on disk. Once you `Close`\nthe database, you'll see these writes on disk.\n\n## Reverse iteration doesn't produce the right results\n\nJust like forward iteration goes to the first key which is equal or greater than the SEEK key,\nreverse iteration goes to the first key which is equal or lesser than the SEEK key. Therefore, SEEK\nkey would not be part of the results. You can typically add a `0xff` byte as a suffix to the SEEK\nkey to include it in the results. See the following issues:\n[#436](https://github.com/dgraph-io/badger/issues/436) and\n[#347](https://github.com/dgraph-io/badger/issues/347).\n\n## Which instances should I use for Badger?\n\nWe recommend using instances which provide local SSD storage, without any limit on the maximum IOPS.\nIn AWS, these are storage optimized instances like i3. They provide local SSDs which clock 100K IOPS\nover 4KB blocks easily.\n\n## I'm getting a closed channel error\n\n```sh\npanic: close of closed channel\npanic: send on closed channel\n```\n\nIf you're seeing panics like this, it is because you're operating on a closed DB. This can happen,\nif you call `Close()` before sending a write, or multiple times. You should ensure that you only\ncall `Close()` once, and all your read/write operations finish before closing.\n\n## Are there any Go specific settings that I should use?\n\nWe _highly_ recommend setting a high number for `GOMAXPROCS`, which allows Go to observe the full\nIOPS throughput provided by modern SSDs. 
In Dgraph, we have set it to 128. For more details,\n[see this thread](https://groups.google.com/d/topic/golang-nuts/jPb_h3TvlKE/discussion).\n\n## Are there any Linux specific settings that I should use?\n\nWe recommend setting `max file descriptors` to a high number depending upon the expected size of\nyour data. On Linux and Mac, you can check the file descriptor limit with `ulimit -n -H` for the\nhard limit and `ulimit -n -S` for the soft limit. A soft limit of `65535` is a good lower bound. You\ncan adjust the limit as needed.\n\n## I see \"manifest has unsupported version: X (we support Y)\" error\n\nThis error means you have a badger directory which was created by an older version of badger and\nyou're trying to open in a newer version of badger. The underlying data format can change across\nbadger versions and users have to migrate their data directory. Badger data can be migrated from\nversion X of badger to version Y of badger by following the steps listed below. Assume you were on\nbadger v1.6.0 and you wish to migrate to v2.0.0 version.\n\n1. Install Badger version v1.6.0\n   - `cd $GOPATH/src/github.com/dgraph-io/badger`\n   - `git checkout v1.6.0`\n   - `cd badger && go install`\n\n     This should install the old Badger binary in your `$GOBIN`.\n\n2. Create Backup\n   - `badger backup --dir path/to/badger/directory -f badger.backup`\n3. Install Badger version v2.0.0\n   - `cd $GOPATH/src/github.com/dgraph-io/badger`\n   - `git checkout v2.0.0`\n   - `cd badger && go install`\n\n     This should install the new Badger binary in your `$GOBIN`.\n\n4. Restore data from backup\n   - `badger restore --dir path/to/new/badger/directory -f badger.backup`\n\n     This creates a new directory on `path/to/new/badger/directory` and adds data in the new format\n     to it.\n\nNOTE - The preceding steps shouldn't cause any data loss but please ensure the new data is valid\nbefore deleting the old Badger directory.\n\n## Why do I need gcc to build badger? 
Does badger need Cgo?\n\nBadger doesn't directly use Cgo but it relies on https://github.com/DataDog/zstd library for zstd\ncompression and the library requires [`gcc/cgo`](https://pkg.go.dev/cmd/cgo). You can build Badger\nwithout Cgo by running `CGO_ENABLED=0 go build`. This builds Badger without the support for ZSTD\ncompression algorithm.\n\nAs of Badger versions [v2.2007.4](https://github.com/dgraph-io/badger/releases/tag/v2.2007.4) and\n[v3.2103.1](https://github.com/dgraph-io/badger/releases/tag/v3.2103.1) the DataDog ZSTD library was\nreplaced by pure Golang version and Cgo is no longer required. The new library is\n[backwards compatible in nearly all cases](https://discuss.hypermode.com/t/use-pure-go-zstd-implementation/8670/10):\n\n<Note>\n  Yes they're compatible both ways. The only exception is 0 bytes of input which\n  gives 0 bytes output with the Go zstd. But you already have the\n  zstd.WithZeroFrames(true) which wraps 0 bytes in a header so it can be fed to\n  DD zstd. This is only relevant when downgrading.\n</Note>\n"
  },
  {
    "path": "errors.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\tstderrors \"errors\"\n\t\"math\"\n)\n\nconst (\n\t// ValueThresholdLimit is the maximum permissible value of opt.ValueThreshold.\n\tValueThresholdLimit = math.MaxUint16 - 16 + 1\n)\n\nvar (\n\t// ErrValueLogSize is returned when opt.ValueLogFileSize option is not within the valid\n\t// range.\n\tErrValueLogSize = stderrors.New(\"Invalid ValueLogFileSize, must be in range [1MB, 2GB)\")\n\n\t// ErrKeyNotFound is returned when key isn't found on a txn.Get.\n\tErrKeyNotFound = stderrors.New(\"Key not found\")\n\n\t// ErrTxnTooBig is returned if too many writes are fit into a single transaction.\n\tErrTxnTooBig = stderrors.New(\"Txn is too big to fit into one request\")\n\n\t// ErrConflict is returned when a transaction conflicts with another transaction. This can\n\t// happen if the read rows had been updated concurrently by another transaction.\n\tErrConflict = stderrors.New(\"Transaction Conflict. Please retry\")\n\n\t// ErrReadOnlyTxn is returned if an update function is called on a read-only transaction.\n\tErrReadOnlyTxn = stderrors.New(\"No sets or deletes are allowed in a read-only transaction\")\n\n\t// ErrDiscardedTxn is returned if a previously discarded transaction is reused.\n\tErrDiscardedTxn = stderrors.New(\"This transaction has been discarded. Create a new one\")\n\n\t// ErrEmptyKey is returned if an empty key is passed on an update function.\n\tErrEmptyKey = stderrors.New(\"Key cannot be empty\")\n\n\t// ErrInvalidKey is returned if the key has a special !badger! prefix,\n\t// reserved for internal usage.\n\tErrInvalidKey = stderrors.New(\"Key is using a reserved !badger! 
prefix\")\n\n\t// ErrBannedKey is returned if the read/write key belongs to any banned namespace.\n\tErrBannedKey = stderrors.New(\"Key is using the banned prefix\")\n\n\t// ErrThresholdZero is returned if threshold is set to zero, and value log GC is called.\n\t// In such a case, GC can't be run.\n\tErrThresholdZero = stderrors.New(\n\t\t\"Value log GC can't run because threshold is set to zero\")\n\n\t// ErrNoRewrite is returned if a call for value log GC doesn't result in a log file rewrite.\n\tErrNoRewrite = stderrors.New(\n\t\t\"Value log GC attempt didn't result in any cleanup\")\n\n\t// ErrRejected is returned if a value log GC is called either while another GC is running, or\n\t// after DB::Close has been called.\n\tErrRejected = stderrors.New(\"Value log GC request rejected\")\n\n\t// ErrInvalidRequest is returned if the user request is invalid.\n\tErrInvalidRequest = stderrors.New(\"Invalid request\")\n\n\t// ErrManagedTxn is returned if the user tries to use an API which isn't\n\t// allowed due to external management of transactions, when using ManagedDB.\n\tErrManagedTxn = stderrors.New(\n\t\t\"Invalid API request. Not allowed to perform this action using ManagedDB\")\n\n\t// ErrNamespaceMode is returned if the user tries to use an API which is allowed only when\n\t// NamespaceOffset is non-negative.\n\tErrNamespaceMode = stderrors.New(\n\t\t\"Invalid API request. 
Not allowed to perform this action when NamespaceMode is not set.\")\n\n\t// ErrInvalidDump if a data dump made previously cannot be loaded into the database.\n\tErrInvalidDump = stderrors.New(\"Data dump cannot be read\")\n\n\t// ErrZeroBandwidth is returned if the user passes in zero bandwidth for sequence.\n\tErrZeroBandwidth = stderrors.New(\"Bandwidth must be greater than zero\")\n\n\t// ErrWindowsNotSupported is returned when opt.ReadOnly is used on Windows\n\tErrWindowsNotSupported = stderrors.New(\"Read-only mode is not supported on Windows\")\n\n\t// ErrPlan9NotSupported is returned when opt.ReadOnly is used on Plan 9\n\tErrPlan9NotSupported = stderrors.New(\"Read-only mode is not supported on Plan 9\")\n\n\t// ErrTruncateNeeded is returned when the value log gets corrupt, and requires truncation of\n\t// corrupt data to allow Badger to run properly.\n\tErrTruncateNeeded = stderrors.New(\n\t\t\"Log truncate required to run DB. This might result in data loss\")\n\n\t// ErrBlockedWrites is returned if the user called DropAll. 
During the process of dropping all\n\t// data from Badger, we stop accepting new writes, by returning this error.\n\tErrBlockedWrites = stderrors.New(\"Writes are blocked, possibly due to DropAll or Close\")\n\n\t// ErrNilCallback is returned when subscriber's callback is nil.\n\tErrNilCallback = stderrors.New(\"Callback cannot be nil\")\n\n\t// ErrEncryptionKeyMismatch is returned when the storage key is not\n\t// matched with the key previously given.\n\tErrEncryptionKeyMismatch = stderrors.New(\"Encryption key mismatch\")\n\n\t// ErrInvalidDataKeyID is returned if the datakey id is invalid.\n\tErrInvalidDataKeyID = stderrors.New(\"Invalid datakey id\")\n\n\t// ErrInvalidEncryptionKey is returned if length of encryption keys is invalid.\n\tErrInvalidEncryptionKey = stderrors.New(\"Encryption key's length should be \" +\n\t\t\"either 16, 24, or 32 bytes\")\n\n\t// ErrGCInMemoryMode is returned when db.RunValueLogGC is called in in-memory mode.\n\tErrGCInMemoryMode = stderrors.New(\"Cannot run value log GC when DB is opened in InMemory mode\")\n\n\t// ErrDBClosed is returned when a get operation is performed after closing the DB.\n\tErrDBClosed = stderrors.New(\"DB Closed\")\n)\n"
  },
  {
    "path": "fb/BlockOffset.go",
    "content": "// Code generated by the FlatBuffers compiler. DO NOT EDIT.\n\npackage fb\n\nimport (\n\tflatbuffers \"github.com/google/flatbuffers/go\"\n)\n\ntype BlockOffset struct {\n\t_tab flatbuffers.Table\n}\n\nfunc GetRootAsBlockOffset(buf []byte, offset flatbuffers.UOffsetT) *BlockOffset {\n\tn := flatbuffers.GetUOffsetT(buf[offset:])\n\tx := &BlockOffset{}\n\tx.Init(buf, n+offset)\n\treturn x\n}\n\nfunc (rcv *BlockOffset) Init(buf []byte, i flatbuffers.UOffsetT) {\n\trcv._tab.Bytes = buf\n\trcv._tab.Pos = i\n}\n\nfunc (rcv *BlockOffset) Table() flatbuffers.Table {\n\treturn rcv._tab\n}\n\nfunc (rcv *BlockOffset) Key(j int) byte {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(4))\n\tif o != 0 {\n\t\ta := rcv._tab.Vector(o)\n\t\treturn rcv._tab.GetByte(a + flatbuffers.UOffsetT(j*1))\n\t}\n\treturn 0\n}\n\nfunc (rcv *BlockOffset) KeyLength() int {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(4))\n\tif o != 0 {\n\t\treturn rcv._tab.VectorLen(o)\n\t}\n\treturn 0\n}\n\nfunc (rcv *BlockOffset) KeyBytes() []byte {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(4))\n\tif o != 0 {\n\t\treturn rcv._tab.ByteVector(o + rcv._tab.Pos)\n\t}\n\treturn nil\n}\n\nfunc (rcv *BlockOffset) MutateKey(j int, n byte) bool {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(4))\n\tif o != 0 {\n\t\ta := rcv._tab.Vector(o)\n\t\treturn rcv._tab.MutateByte(a+flatbuffers.UOffsetT(j*1), n)\n\t}\n\treturn false\n}\n\nfunc (rcv *BlockOffset) Offset() uint32 {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(6))\n\tif o != 0 {\n\t\treturn rcv._tab.GetUint32(o + rcv._tab.Pos)\n\t}\n\treturn 0\n}\n\nfunc (rcv *BlockOffset) MutateOffset(n uint32) bool {\n\treturn rcv._tab.MutateUint32Slot(6, n)\n}\n\nfunc (rcv *BlockOffset) Len() uint32 {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(8))\n\tif o != 0 {\n\t\treturn rcv._tab.GetUint32(o + rcv._tab.Pos)\n\t}\n\treturn 0\n}\n\nfunc (rcv *BlockOffset) MutateLen(n uint32) bool {\n\treturn rcv._tab.MutateUint32Slot(8, n)\n}\n\nfunc BlockOffsetStart(builder 
*flatbuffers.Builder) {\n\tbuilder.StartObject(3)\n}\nfunc BlockOffsetAddKey(builder *flatbuffers.Builder, key flatbuffers.UOffsetT) {\n\tbuilder.PrependUOffsetTSlot(0, flatbuffers.UOffsetT(key), 0)\n}\nfunc BlockOffsetStartKeyVector(builder *flatbuffers.Builder, numElems int) flatbuffers.UOffsetT {\n\treturn builder.StartVector(1, numElems, 1)\n}\nfunc BlockOffsetAddOffset(builder *flatbuffers.Builder, offset uint32) {\n\tbuilder.PrependUint32Slot(1, offset, 0)\n}\nfunc BlockOffsetAddLen(builder *flatbuffers.Builder, len uint32) {\n\tbuilder.PrependUint32Slot(2, len, 0)\n}\nfunc BlockOffsetEnd(builder *flatbuffers.Builder) flatbuffers.UOffsetT {\n\treturn builder.EndObject()\n}\n"
  },
  {
    "path": "fb/TableIndex.go",
    "content": "// Code generated by the FlatBuffers compiler. DO NOT EDIT.\n\npackage fb\n\nimport (\n\tflatbuffers \"github.com/google/flatbuffers/go\"\n)\n\ntype TableIndex struct {\n\t_tab flatbuffers.Table\n}\n\nfunc GetRootAsTableIndex(buf []byte, offset flatbuffers.UOffsetT) *TableIndex {\n\tn := flatbuffers.GetUOffsetT(buf[offset:])\n\tx := &TableIndex{}\n\tx.Init(buf, n+offset)\n\treturn x\n}\n\nfunc (rcv *TableIndex) Init(buf []byte, i flatbuffers.UOffsetT) {\n\trcv._tab.Bytes = buf\n\trcv._tab.Pos = i\n}\n\nfunc (rcv *TableIndex) Table() flatbuffers.Table {\n\treturn rcv._tab\n}\n\nfunc (rcv *TableIndex) Offsets(obj *BlockOffset, j int) bool {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(4))\n\tif o != 0 {\n\t\tx := rcv._tab.Vector(o)\n\t\tx += flatbuffers.UOffsetT(j) * 4\n\t\tx = rcv._tab.Indirect(x)\n\t\tobj.Init(rcv._tab.Bytes, x)\n\t\treturn true\n\t}\n\treturn false\n}\n\nfunc (rcv *TableIndex) OffsetsLength() int {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(4))\n\tif o != 0 {\n\t\treturn rcv._tab.VectorLen(o)\n\t}\n\treturn 0\n}\n\nfunc (rcv *TableIndex) BloomFilter(j int) byte {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(6))\n\tif o != 0 {\n\t\ta := rcv._tab.Vector(o)\n\t\treturn rcv._tab.GetByte(a + flatbuffers.UOffsetT(j*1))\n\t}\n\treturn 0\n}\n\nfunc (rcv *TableIndex) BloomFilterLength() int {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(6))\n\tif o != 0 {\n\t\treturn rcv._tab.VectorLen(o)\n\t}\n\treturn 0\n}\n\nfunc (rcv *TableIndex) BloomFilterBytes() []byte {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(6))\n\tif o != 0 {\n\t\treturn rcv._tab.ByteVector(o + rcv._tab.Pos)\n\t}\n\treturn nil\n}\n\nfunc (rcv *TableIndex) MutateBloomFilter(j int, n byte) bool {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(6))\n\tif o != 0 {\n\t\ta := rcv._tab.Vector(o)\n\t\treturn rcv._tab.MutateByte(a+flatbuffers.UOffsetT(j*1), n)\n\t}\n\treturn false\n}\n\nfunc (rcv *TableIndex) MaxVersion() uint64 {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(8))\n\tif 
o != 0 {\n\t\treturn rcv._tab.GetUint64(o + rcv._tab.Pos)\n\t}\n\treturn 0\n}\n\nfunc (rcv *TableIndex) MutateMaxVersion(n uint64) bool {\n\treturn rcv._tab.MutateUint64Slot(8, n)\n}\n\nfunc (rcv *TableIndex) KeyCount() uint32 {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(10))\n\tif o != 0 {\n\t\treturn rcv._tab.GetUint32(o + rcv._tab.Pos)\n\t}\n\treturn 0\n}\n\nfunc (rcv *TableIndex) MutateKeyCount(n uint32) bool {\n\treturn rcv._tab.MutateUint32Slot(10, n)\n}\n\nfunc (rcv *TableIndex) UncompressedSize() uint32 {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(12))\n\tif o != 0 {\n\t\treturn rcv._tab.GetUint32(o + rcv._tab.Pos)\n\t}\n\treturn 0\n}\n\nfunc (rcv *TableIndex) MutateUncompressedSize(n uint32) bool {\n\treturn rcv._tab.MutateUint32Slot(12, n)\n}\n\nfunc (rcv *TableIndex) OnDiskSize() uint32 {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(14))\n\tif o != 0 {\n\t\treturn rcv._tab.GetUint32(o + rcv._tab.Pos)\n\t}\n\treturn 0\n}\n\nfunc (rcv *TableIndex) MutateOnDiskSize(n uint32) bool {\n\treturn rcv._tab.MutateUint32Slot(14, n)\n}\n\nfunc (rcv *TableIndex) StaleDataSize() uint32 {\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(16))\n\tif o != 0 {\n\t\treturn rcv._tab.GetUint32(o + rcv._tab.Pos)\n\t}\n\treturn 0\n}\n\nfunc (rcv *TableIndex) MutateStaleDataSize(n uint32) bool {\n\treturn rcv._tab.MutateUint32Slot(16, n)\n}\n\nfunc TableIndexStart(builder *flatbuffers.Builder) {\n\tbuilder.StartObject(7)\n}\nfunc TableIndexAddOffsets(builder *flatbuffers.Builder, offsets flatbuffers.UOffsetT) {\n\tbuilder.PrependUOffsetTSlot(0, flatbuffers.UOffsetT(offsets), 0)\n}\nfunc TableIndexStartOffsetsVector(builder *flatbuffers.Builder, numElems int) flatbuffers.UOffsetT {\n\treturn builder.StartVector(4, numElems, 4)\n}\nfunc TableIndexAddBloomFilter(builder *flatbuffers.Builder, bloomFilter flatbuffers.UOffsetT) {\n\tbuilder.PrependUOffsetTSlot(1, flatbuffers.UOffsetT(bloomFilter), 0)\n}\nfunc TableIndexStartBloomFilterVector(builder *flatbuffers.Builder, numElems int) 
flatbuffers.UOffsetT {\n\treturn builder.StartVector(1, numElems, 1)\n}\nfunc TableIndexAddMaxVersion(builder *flatbuffers.Builder, maxVersion uint64) {\n\tbuilder.PrependUint64Slot(2, maxVersion, 0)\n}\nfunc TableIndexAddKeyCount(builder *flatbuffers.Builder, keyCount uint32) {\n\tbuilder.PrependUint32Slot(3, keyCount, 0)\n}\nfunc TableIndexAddUncompressedSize(builder *flatbuffers.Builder, uncompressedSize uint32) {\n\tbuilder.PrependUint32Slot(4, uncompressedSize, 0)\n}\nfunc TableIndexAddOnDiskSize(builder *flatbuffers.Builder, onDiskSize uint32) {\n\tbuilder.PrependUint32Slot(5, onDiskSize, 0)\n}\nfunc TableIndexAddStaleDataSize(builder *flatbuffers.Builder, staleDataSize uint32) {\n\tbuilder.PrependUint32Slot(6, staleDataSize, 0)\n}\nfunc TableIndexEnd(builder *flatbuffers.Builder) flatbuffers.UOffsetT {\n\treturn builder.EndObject()\n}\n"
  },
  {
    "path": "fb/flatbuffer.fbs",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\nnamespace fb;\n\ntable TableIndex {\n  offsets:[BlockOffset];\n  bloom_filter:[ubyte];\n  max_version:uint64;\n  key_count:uint32;\n  uncompressed_size:uint32;\n  on_disk_size:uint32;\n  stale_data_size:uint32;\n}\n\ntable BlockOffset {\n  key:[ubyte];\n  offset:uint;\n  len:uint;\n}\n\nroot_type TableIndex;\nroot_type BlockOffset;\n"
  },
  {
    "path": "fb/gen.sh",
    "content": "#!/usr/bin/env bash\n\nset -e\n\n## Install flatc if not present\n## ref. https://google.github.io/flatbuffers/flatbuffers_guide_building.html\ncommand -v flatc >/dev/null || { ./install_flatbuffers.sh; }\n\nflatc --go flatbuffer.fbs\n# Move files to the correct directory.\nmv fb/* ./\nrmdir fb\n"
  },
  {
    "path": "fb/install_flatbuffers.sh",
    "content": "#!/usr/bin/env bash\n\nset -e\n\ninstall_mac() {\n\tcommand -v brew >/dev/null ||\n\t\t{\n\t\t\techo \"[ERROR]: 'brew' command not not found. Exiting\" 1>&2\n\t\t\texit 1\n\t\t}\n\tbrew install flatbuffers\n}\n\ninstall_linux() {\n\tfor CMD in curl cmake g++ make; do\n\t\tcommand -v \"${CMD}\" >/dev/null ||\n\t\t\t{\n\t\t\t\techo \"[ERROR]: '${CMD}' command not not found. Exiting\" 1>&2\n\t\t\t\texit 1\n\t\t\t}\n\tdone\n\n\t## Create Temp Build Directory\n\tBUILD_DIR=$(mktemp -d)\n\tpushd \"${BUILD_DIR}\"\n\n\t## Fetch Latest Tarball\n\tLATEST_VERSION=$(curl -s https://api.github.com/repos/google/flatbuffers/releases/latest | grep -oP '(?<=tag_name\": \")[^\"]+')\n\tcurl -sLO https://github.com/google/flatbuffers/archive/\"${LATEST_VERSION}\".tar.gz\n\ttar xf \"${LATEST_VERSION}\".tar.gz\n\n\t## Build Binaries\n\tcd flatbuffers-\"${LATEST_VERSION#v}\"\n\tcmake -G \"Unix Makefiles\" -DCMAKE_BUILD_TYPE=Release\n\tmake\n\t./flattests\n\tcp flatc /usr/local/bin/flatc\n\n\t## Cleanup Temp Build Directory\n\tpopd\n\trm -rf \"${BUILD_DIR}\"\n}\n\nSYSTEM=$(uname -s)\n\ncase ${SYSTEM,,} in\nlinux)\n\tsudo bash -c \"$(declare -f install_linux); install_linux\"\n\t;;\ndarwin)\n\tinstall_mac\n\t;;\nesac\n"
  },
  {
    "path": "go.mod",
    "content": "module github.com/dgraph-io/badger/v4\n\ngo 1.23.0\n\ntoolchain go1.25.0\n\nrequire (\n\tgithub.com/cespare/xxhash/v2 v2.3.0\n\tgithub.com/dgraph-io/ristretto/v2 v2.2.0\n\tgithub.com/dustin/go-humanize v1.0.1\n\tgithub.com/google/flatbuffers v25.2.10+incompatible\n\tgithub.com/klauspost/compress v1.18.0\n\tgithub.com/spf13/cobra v1.9.1\n\tgithub.com/stretchr/testify v1.10.0\n\tgo.opentelemetry.io/contrib/zpages v0.62.0\n\tgo.opentelemetry.io/otel v1.37.0\n\tgolang.org/x/sys v0.35.0\n\tgoogle.golang.org/protobuf v1.36.7\n)\n\nrequire (\n\tgithub.com/davecgh/go-spew v1.1.1 // indirect\n\tgithub.com/go-logr/logr v1.4.3 // indirect\n\tgithub.com/go-logr/stdr v1.2.2 // indirect\n\tgithub.com/google/uuid v1.6.0 // indirect\n\tgithub.com/inconshreveable/mousetrap v1.1.0 // indirect\n\tgithub.com/pmezard/go-difflib v1.0.0 // indirect\n\tgithub.com/spf13/pflag v1.0.6 // indirect\n\tgo.opentelemetry.io/auto/sdk v1.1.0 // indirect\n\tgo.opentelemetry.io/otel/metric v1.37.0 // indirect\n\tgo.opentelemetry.io/otel/sdk v1.37.0 // indirect\n\tgo.opentelemetry.io/otel/trace v1.37.0 // indirect\n\tgopkg.in/yaml.v3 v3.0.1 // indirect\n)\n\nretract v4.0.0 // see #1888 and #1889\n\nretract v4.3.0 // see #2113 and #2121\n"
  },
  {
    "path": "go.sum",
    "content": "github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=\ngithub.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs=\ngithub.com/cpuguy83/go-md2man/v2 v2.0.6/go.mod h1:oOW0eioCTA6cOiMLiUPZOpcVxMig6NIQQ7OS05n1F4g=\ngithub.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=\ngithub.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=\ngithub.com/dgraph-io/ristretto/v2 v2.2.0 h1:bkY3XzJcXoMuELV8F+vS8kzNgicwQFAaGINAEJdWGOM=\ngithub.com/dgraph-io/ristretto/v2 v2.2.0/go.mod h1:RZrm63UmcBAaYWC1DotLYBmTvgkrs0+XhBd7Npn7/zI=\ngithub.com/dgryski/go-farm v0.0.0-20240924180020-3414d57e47da h1:aIftn67I1fkbMa512G+w+Pxci9hJPB8oMnkcP3iZF38=\ngithub.com/dgryski/go-farm v0.0.0-20240924180020-3414d57e47da/go.mod h1:SqUrOPUnsFjfmXRMNPybcSiG0BgUW2AuFH8PAnS2iTw=\ngithub.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=\ngithub.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=\ngithub.com/go-logr/logr v1.2.2/go.mod h1:jdQByPbusPIv2/zmleS9BjJVeZ6kBagPoEUsqbVz/1A=\ngithub.com/go-logr/logr v1.4.3 h1:CjnDlHq8ikf6E492q6eKboGOC0T8CDaOvkHCIg8idEI=\ngithub.com/go-logr/logr v1.4.3/go.mod h1:9T104GzyrTigFIr8wt5mBrctHMim0Nb2HLGrmQ40KvY=\ngithub.com/go-logr/stdr v1.2.2 h1:hSWxHoqTgW2S2qGc0LTAI563KZ5YKYRhT3MFKZMbjag=\ngithub.com/go-logr/stdr v1.2.2/go.mod h1:mMo/vtBO5dYbehREoey6XUKy/eSumjCCveDpRre4VKE=\ngithub.com/google/flatbuffers v25.2.10+incompatible h1:F3vclr7C3HpB1k9mxCGRMXq6FdUalZ6H/pNX4FP1v0Q=\ngithub.com/google/flatbuffers v25.2.10+incompatible/go.mod h1:1AeVuKshWv4vARoZatz6mlQ0JxURH0Kv5+zNeJKJCa8=\ngithub.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8=\ngithub.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU=\ngithub.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=\ngithub.com/google/uuid v1.6.0/go.mod 
h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=\ngithub.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8=\ngithub.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw=\ngithub.com/klauspost/compress v1.18.0 h1:c/Cqfb0r+Yi+JtIEq73FWXVkRonBlf0CRNYc8Zttxdo=\ngithub.com/klauspost/compress v1.18.0/go.mod h1:2Pp+KzxcywXVXMr50+X0Q/Lsb43OQHYWRCY2AiWywWQ=\ngithub.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=\ngithub.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk=\ngithub.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=\ngithub.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE=\ngithub.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=\ngithub.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=\ngithub.com/rogpeppe/go-internal v1.13.1 h1:KvO1DLK/DRN07sQ1LQKScxyZJuNnedQ5/wKSR38lUII=\ngithub.com/rogpeppe/go-internal v1.13.1/go.mod h1:uMEvuHeurkdAXX61udpOXGD/AzZDWNMNyH2VO9fmH0o=\ngithub.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=\ngithub.com/spf13/cobra v1.9.1 h1:CXSaggrXdbHK9CF+8ywj8Amf7PBRmPCOJugH954Nnlo=\ngithub.com/spf13/cobra v1.9.1/go.mod h1:nDyEzZ8ogv936Cinf6g1RU9MRY64Ir93oCnqb9wxYW0=\ngithub.com/spf13/pflag v1.0.6 h1:jFzHGLGAlb3ruxLB8MhbI6A8+AQX/2eW4qeyNZXNp2o=\ngithub.com/spf13/pflag v1.0.6/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=\ngithub.com/stretchr/testify v1.10.0 h1:Xv5erBjTwe/5IxqUQTdXv5kgmIvbHo3QQyRwhJsOfJA=\ngithub.com/stretchr/testify v1.10.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=\ngo.opentelemetry.io/auto/sdk v1.1.0 h1:cH53jehLUN6UFLY71z+NDOiNJqDdPRaXzTel0sJySYA=\ngo.opentelemetry.io/auto/sdk v1.1.0/go.mod h1:3wSPjt5PWp2RhlCcmmOial7AvC4DQqZb7a7wCow3W8A=\ngo.opentelemetry.io/contrib/zpages v0.62.0 
h1:9fUYTLmrK0x/lweM2uM+BOx069jLx8PxVqWhegGJ9Bo=\ngo.opentelemetry.io/contrib/zpages v0.62.0/go.mod h1:C8kXoiC1Ytvereztus2R+kqdSa6W/MZ8FfS8Zwj+LiM=\ngo.opentelemetry.io/otel v1.37.0 h1:9zhNfelUvx0KBfu/gb+ZgeAfAgtWrfHJZcAqFC228wQ=\ngo.opentelemetry.io/otel v1.37.0/go.mod h1:ehE/umFRLnuLa/vSccNq9oS1ErUlkkK71gMcN34UG8I=\ngo.opentelemetry.io/otel/metric v1.37.0 h1:mvwbQS5m0tbmqML4NqK+e3aDiO02vsf/WgbsdpcPoZE=\ngo.opentelemetry.io/otel/metric v1.37.0/go.mod h1:04wGrZurHYKOc+RKeye86GwKiTb9FKm1WHtO+4EVr2E=\ngo.opentelemetry.io/otel/sdk v1.37.0 h1:ItB0QUqnjesGRvNcmAcU0LyvkVyGJ2xftD29bWdDvKI=\ngo.opentelemetry.io/otel/sdk v1.37.0/go.mod h1:VredYzxUvuo2q3WRcDnKDjbdvmO0sCzOvVAiY+yUkAg=\ngo.opentelemetry.io/otel/trace v1.37.0 h1:HLdcFNbRQBE2imdSEgm/kwqmQj1Or1l/7bW6mxVK7z4=\ngo.opentelemetry.io/otel/trace v1.37.0/go.mod h1:TlgrlQ+PtQO5XFerSPUYG0JSgGyryXewPGyayAWSBS0=\ngo.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto=\ngo.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE=\ngolang.org/x/sys v0.35.0 h1:vz1N37gP5bs89s7He8XuIYXpyY0+QlsKmzipCbUtyxI=\ngolang.org/x/sys v0.35.0/go.mod h1:BJP2sWEmIv4KK5OTEluFJCKSidICx8ciO85XgH3Ak8k=\ngoogle.golang.org/protobuf v1.36.7 h1:IgrO7UwFQGJdRNXH/sQux4R1Dj1WAKcLElzeeRaXV2A=\ngoogle.golang.org/protobuf v1.36.7/go.mod h1:jduwjTPXsFjZGTmRluh+L6NjiWu7pchiJ2/5YcXBHnY=\ngopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=\ngopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=\ngopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=\ngopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=\ngopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=\n"
  },
  {
    "path": "histogram.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"math\"\n)\n\n// PrintHistogram builds and displays the key-value size histogram.\n// When keyPrefix is set, only the keys that have prefix \"keyPrefix\" are\n// considered for creating the histogram\nfunc (db *DB) PrintHistogram(keyPrefix []byte) {\n\tif db == nil {\n\t\tfmt.Println(\"\\nCannot build histogram: DB is nil.\")\n\t\treturn\n\t}\n\thistogram := db.buildHistogram(keyPrefix)\n\tfmt.Printf(\"Histogram of key sizes (in bytes)\\n\")\n\thistogram.keySizeHistogram.printHistogram()\n\tfmt.Printf(\"Histogram of value sizes (in bytes)\\n\")\n\thistogram.valueSizeHistogram.printHistogram()\n}\n\n// histogramData stores information about a histogram\ntype histogramData struct {\n\tbins        []int64\n\tcountPerBin []int64\n\ttotalCount  int64\n\tmin         int64\n\tmax         int64\n\tsum         int64\n}\n\n// sizeHistogram contains keySize histogram and valueSize histogram\ntype sizeHistogram struct {\n\tkeySizeHistogram, valueSizeHistogram histogramData\n}\n\n// newSizeHistogram returns a new instance of keyValueSizeHistogram with\n// properly initialized fields.\nfunc newSizeHistogram() *sizeHistogram {\n\t// TODO(ibrahim): find appropriate bin size.\n\tkeyBins := createHistogramBins(1, 16)\n\tvalueBins := createHistogramBins(1, 30)\n\treturn &sizeHistogram{\n\t\tkeySizeHistogram: histogramData{\n\t\t\tbins:        keyBins,\n\t\t\tcountPerBin: make([]int64, len(keyBins)+1),\n\t\t\tmax:         math.MinInt64,\n\t\t\tmin:         math.MaxInt64,\n\t\t\tsum:         0,\n\t\t},\n\t\tvalueSizeHistogram: histogramData{\n\t\t\tbins:        valueBins,\n\t\t\tcountPerBin: make([]int64, len(valueBins)+1),\n\t\t\tmax:         math.MinInt64,\n\t\t\tmin:         math.MaxInt64,\n\t\t\tsum:         0,\n\t\t},\n\t}\n}\n\n// createHistogramBins creates bins for an histogram. 
The bin sizes are powers\n// of two of the form [2^min_exponent, ..., 2^max_exponent].\nfunc createHistogramBins(minExponent, maxExponent uint32) []int64 {\n\tvar bins []int64\n\tfor i := minExponent; i <= maxExponent; i++ {\n\t\tbins = append(bins, int64(1)<<i)\n\t}\n\treturn bins\n}\n\n// Update records value in the histogram: it adjusts min and max if needed, adds\n// value to sum, increments totalCount, and increments the count of the first bin\n// whose upper bound exceeds value.\nfunc (histogram *histogramData) Update(value int64) {\n\tif value > histogram.max {\n\t\thistogram.max = value\n\t}\n\tif value < histogram.min {\n\t\thistogram.min = value\n\t}\n\n\thistogram.sum += value\n\thistogram.totalCount++\n\n\tfor index := 0; index <= len(histogram.bins); index++ {\n\t\t// Count the value in the last (overflow) bin if we reached the end of the bins slice.\n\t\tif index == len(histogram.bins) {\n\t\t\thistogram.countPerBin[index]++\n\t\t\tbreak\n\t\t}\n\n\t\t// Check if the value should be added to the \"index\" bin\n\t\tif value < histogram.bins[index] {\n\t\t\thistogram.countPerBin[index]++\n\t\t\tbreak\n\t\t}\n\t}\n}\n\n// buildHistogram builds the key-value size histogram.\n// When keyPrefix is set, only the keys that have prefix \"keyPrefix\" are\n// considered for creating the histogram\nfunc (db *DB) buildHistogram(keyPrefix []byte) *sizeHistogram {\n\ttxn := db.NewTransaction(false)\n\tdefer txn.Discard()\n\n\titr := txn.NewIterator(DefaultIteratorOptions)\n\tdefer itr.Close()\n\n\tbadgerHistogram := newSizeHistogram()\n\n\t// Collect key and value sizes.\n\tfor itr.Seek(keyPrefix); itr.ValidForPrefix(keyPrefix); itr.Next() {\n\t\titem := itr.Item()\n\t\tbadgerHistogram.keySizeHistogram.Update(item.KeySize())\n\t\tbadgerHistogram.valueSizeHistogram.Update(item.ValueSize())\n\t}\n\treturn badgerHistogram\n}\n\n// printHistogram prints the histogram data in a human-readable format.\nfunc (histogram histogramData) printHistogram() {\n\tfmt.Printf(\"Total count: %d\\n\", histogram.totalCount)\n\tfmt.Printf(\"Min value: %d\\n\", 
histogram.min)\n\tfmt.Printf(\"Max value: %d\\n\", histogram.max)\n\tfmt.Printf(\"Mean: %.2f\\n\", float64(histogram.sum)/float64(histogram.totalCount))\n\tfmt.Printf(\"%24s %9s\\n\", \"Range\", \"Count\")\n\n\tnumBins := len(histogram.bins)\n\tfor index, count := range histogram.countPerBin {\n\t\tif count == 0 {\n\t\t\tcontinue\n\t\t}\n\n\t\t// The last bin represents the bin that contains the range from\n\t\t// the last bin up to infinity so it's processed differently than the\n\t\t// other bins.\n\t\tif index == len(histogram.countPerBin)-1 {\n\t\t\tlowerBound := int(histogram.bins[numBins-1])\n\t\t\tfmt.Printf(\"[%10d, %10s) %9d\\n\", lowerBound, \"infinity\", count)\n\t\t\tcontinue\n\t\t}\n\n\t\tupperBound := int(histogram.bins[index])\n\t\tlowerBound := 0\n\t\tif index > 0 {\n\t\t\tlowerBound = int(histogram.bins[index-1])\n\t\t}\n\n\t\tfmt.Printf(\"[%10d, %10d) %9d\\n\", lowerBound, upperBound, count)\n\t}\n\tfmt.Println()\n}\n"
  },
  {
    "path": "histogram_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n)\n\nfunc TestBuildKeyValueSizeHistogram(t *testing.T) {\n\tt.Run(\"All same size key-values\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\tentries := int64(40)\n\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\tfor i := rune(0); i < rune(entries); i++ {\n\t\t\t\t\terr := txn.SetEntry(NewEntry([]byte(string(i)), []byte(\"B\")))\n\t\t\t\t\tif err != nil {\n\t\t\t\t\t\treturn err\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\treturn nil\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\n\t\t\thistogram := db.buildHistogram(nil)\n\t\t\tkeyHistogram := histogram.keySizeHistogram\n\t\t\tvalueHistogram := histogram.valueSizeHistogram\n\n\t\t\trequire.Equal(t, entries, keyHistogram.totalCount)\n\t\t\trequire.Equal(t, entries, valueHistogram.totalCount)\n\n\t\t\t// Each entry is of size one. So the sum of sizes should be the same\n\t\t\t// as number of entries\n\t\t\trequire.Equal(t, entries, valueHistogram.sum)\n\t\t\trequire.Equal(t, entries, keyHistogram.sum)\n\n\t\t\t// All value sizes are same. 
The first bin should have all the values.\n\t\t\trequire.Equal(t, entries, valueHistogram.countPerBin[0])\n\t\t\trequire.Equal(t, entries, keyHistogram.countPerBin[0])\n\n\t\t\trequire.Equal(t, int64(1), keyHistogram.max)\n\t\t\trequire.Equal(t, int64(1), keyHistogram.min)\n\t\t\trequire.Equal(t, int64(1), valueHistogram.max)\n\t\t\trequire.Equal(t, int64(1), valueHistogram.min)\n\t\t})\n\t})\n\n\tt.Run(\"different size key-values\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\tentries := int64(3)\n\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\tif err := txn.SetEntry(NewEntry([]byte(\"A\"), []byte(\"B\"))); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\n\t\t\t\tif err := txn.SetEntry(NewEntry([]byte(\"AA\"), []byte(\"BB\"))); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\n\t\t\t\treturn txn.SetEntry(NewEntry([]byte(\"AAA\"), []byte(\"BBB\")))\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\n\t\t\thistogram := db.buildHistogram(nil)\n\t\t\tkeyHistogram := histogram.keySizeHistogram\n\t\t\tvalueHistogram := histogram.valueSizeHistogram\n\n\t\t\trequire.Equal(t, entries, keyHistogram.totalCount)\n\t\t\trequire.Equal(t, entries, valueHistogram.totalCount)\n\n\t\t\t// The entries have sizes 1, 2, and 3. 
So the sum of sizes should be 6 for\n\t\t\t// both keys and values.\n\t\t\trequire.Equal(t, int64(6), valueHistogram.sum)\n\t\t\trequire.Equal(t, int64(6), keyHistogram.sum)\n\n\t\t\t// Length 1 entries land in the first bin; lengths 2 and 3 land in the\n\t\t\t// second bin.\n\t\t\trequire.Equal(t, int64(1), valueHistogram.countPerBin[0])\n\t\t\trequire.Equal(t, int64(2), valueHistogram.countPerBin[1])\n\t\t\trequire.Equal(t, int64(1), keyHistogram.countPerBin[0])\n\t\t\trequire.Equal(t, int64(2), keyHistogram.countPerBin[1])\n\n\t\t\trequire.Equal(t, int64(3), keyHistogram.max)\n\t\t\trequire.Equal(t, int64(1), keyHistogram.min)\n\t\t\trequire.Equal(t, int64(3), valueHistogram.max)\n\t\t\trequire.Equal(t, int64(1), valueHistogram.min)\n\t\t})\n\t})\n}\n"
  },
  {
    "path": "integration/testgc/.gitignore",
    "content": "/testgc\n"
  },
  {
    "path": "integration/testgc/main.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage main\n\nimport (\n\t\"encoding/binary\"\n\t\"fmt\"\n\t\"log\"\n\t\"math/rand\"\n\t\"net/http\"\n\t_ \"net/http/pprof\" //nolint:gosec\n\t\"os\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"time\"\n\n\t\"github.com/dgraph-io/badger/v4\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nvar maxValue int64 = 10000000\nvar suffix = make([]byte, 128)\n\ntype testSuite struct {\n\tsync.Mutex\n\tvals map[uint64]uint64\n\n\tcount atomic.Uint64 // Not under mutex lock.\n}\n\nfunc encoded(i uint64) []byte {\n\tout := make([]byte, 8)\n\tbinary.BigEndian.PutUint64(out, i)\n\treturn out\n}\n\nfunc (s *testSuite) write(db *badger.DB) error {\n\treturn db.Update(func(txn *badger.Txn) error {\n\t\tfor i := 0; i < 10; i++ {\n\t\t\t// These keys would be overwritten.\n\t\t\tkeyi := uint64(rand.Int63n(maxValue))\n\t\t\tkey := encoded(keyi)\n\t\t\tvali := s.count.Add(1)\n\t\t\tval := encoded(vali)\n\t\t\tval = append(val, suffix...)\n\t\t\tif err := txn.SetEntry(badger.NewEntry(key, val)); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\t\tfor i := 0; i < 20; i++ {\n\t\t\t// These keys would be new and never overwritten.\n\t\t\tkeyi := s.count.Add(1)\n\t\t\tif keyi%1000000 == 0 {\n\t\t\t\tlog.Printf(\"Count: %d\\n\", keyi)\n\t\t\t}\n\t\t\tkey := encoded(keyi)\n\t\t\tval := append(key, suffix...)\n\t\t\tif err := txn.SetEntry(badger.NewEntry(key, val)); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\t\treturn nil\n\t})\n}\n\nfunc (s *testSuite) read(db *badger.DB) error {\n\tmax := int64(s.count.Load())\n\tkeyi := uint64(rand.Int63n(max))\n\tkey := encoded(keyi)\n\n\terr := db.View(func(txn *badger.Txn) error {\n\t\titem, err := txn.Get(key)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tval, err := item.ValueCopy(nil)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\ty.AssertTruef(len(val) == len(suffix)+8, 
\"Found val of len: %d\\n\", len(val))\n\t\tvali := binary.BigEndian.Uint64(val[0:8])\n\t\ts.Lock()\n\t\texpected := s.vals[keyi]\n\t\tif vali < expected {\n\t\t\tlog.Fatalf(\"Expected: %d. Found: %d. Key: %d\\n\", expected, vali, keyi)\n\t\t} else if vali == expected {\n\t\t\t// pass\n\t\t} else {\n\t\t\ts.vals[keyi] = vali\n\t\t}\n\t\ts.Unlock()\n\t\treturn nil\n\t})\n\tif err == badger.ErrKeyNotFound {\n\t\treturn nil\n\t}\n\treturn err\n}\n\nfunc main() {\n\tfmt.Println(\"Badger Integration test for value log GC.\")\n\n\tdir := \"/mnt/drive/badgertest\"\n\tos.RemoveAll(dir)\n\n\tdb, err := badger.Open(badger.DefaultOptions(dir).\n\t\tWithSyncWrites(false))\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\tdefer db.Close()\n\n\tgo func() {\n\t\t_ = http.ListenAndServe(\"localhost:8080\", nil)\n\t}()\n\n\tcloser := z.NewCloser(11)\n\tgo func() {\n\t\t// Run value log GC.\n\t\tdefer closer.Done()\n\t\tvar count int\n\t\tticker := time.NewTicker(5 * time.Second)\n\t\tdefer ticker.Stop()\n\t\tfor range ticker.C {\n\t\tagain:\n\t\t\tselect {\n\t\t\tcase <-closer.HasBeenClosed():\n\t\t\t\tlog.Printf(\"Num times value log GC was successful: %d\\n\", count)\n\t\t\t\treturn\n\t\t\tdefault:\n\t\t\t}\n\t\t\tlog.Printf(\"Starting a value log GC\")\n\t\t\terr := db.RunValueLogGC(0.1)\n\t\t\tlog.Printf(\"Result of value log GC: %v\\n\", err)\n\t\t\tif err == nil {\n\t\t\t\tcount++\n\t\t\t\tgoto again\n\t\t\t}\n\t\t}\n\t}()\n\n\ts := testSuite{vals: make(map[uint64]uint64)}\n\ts.count.Store(uint64(maxValue))\n\tvar numLoops atomic.Uint64\n\tticker := time.NewTicker(5 * time.Second)\n\tfor i := 0; i < 10; i++ {\n\t\tgo func() {\n\t\t\tdefer closer.Done()\n\t\t\tfor {\n\t\t\t\tif err := s.write(db); err != nil {\n\t\t\t\t\tlog.Fatal(err)\n\t\t\t\t}\n\t\t\t\tfor j := 0; j < 10; j++ {\n\t\t\t\t\tif err := s.read(db); err != nil {\n\t\t\t\t\t\tlog.Fatal(err)\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tnl := numLoops.Add(1)\n\t\t\t\tselect {\n\t\t\t\tcase 
<-closer.HasBeenClosed():\n\t\t\t\t\treturn\n\t\t\t\tcase <-ticker.C:\n\t\t\t\t\tlog.Printf(\"Num loops: %d\\n\", nl)\n\t\t\t\tdefault:\n\t\t\t\t}\n\t\t\t}\n\t\t}()\n\t}\n\ttime.Sleep(5 * time.Minute)\n\tlog.Println(\"Signaling...\")\n\tcloser.SignalAndWait()\n\tlog.Println(\"Wait done. Now iterating over everything.\")\n\n\terr = db.View(func(txn *badger.Txn) error {\n\t\tiopts := badger.DefaultIteratorOptions\n\t\titr := txn.NewIterator(iopts)\n\t\tdefer itr.Close()\n\n\t\tvar total, tested int\n\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\titem := itr.Item()\n\t\t\tkey := item.Key()\n\t\t\tkeyi := binary.BigEndian.Uint64(key)\n\t\t\ttotal++\n\n\t\t\tval, err := item.ValueCopy(nil)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tif len(val) < 8 {\n\t\t\t\tlog.Printf(\"Unexpected value: %x\\n\", val)\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tvali := binary.BigEndian.Uint64(val[0:8])\n\n\t\t\texpected, ok := s.vals[keyi] // Not all keys must be in vals map.\n\t\t\tif ok {\n\t\t\t\ttested++\n\t\t\t\tif vali < expected {\n\t\t\t\t\t// vali must be equal or greater than what's in the map.\n\t\t\t\t\tlog.Fatalf(\"Expected: %d. Got: %d. Key: %d\\n\", expected, vali, keyi)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tlog.Printf(\"Total iterated: %d. Tested values: %d\\n\", total, tested)\n\t\treturn nil\n\t})\n\tif err != nil {\n\t\tlog.Fatalf(\"Error while iterating: %v\", err)\n\t}\n\tlog.Println(\"Iteration done. Test successful.\")\n\ttime.Sleep(time.Minute) // Time to do some poking around.\n}\n"
  },
  {
    "path": "iterator.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n\t\"hash/crc32\"\n\t\"math\"\n\t\"sort\"\n\t\"sync\"\n\t\"time\"\n\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\ntype prefetchStatus uint8\n\nconst (\n\tprefetched prefetchStatus = iota + 1\n)\n\n// Item is returned during iteration. Both the Key() and Value() output is only valid until\n// iterator.Next() is called.\ntype Item struct {\n\tkey       []byte\n\tvptr      []byte\n\tval       []byte\n\tversion   uint64\n\texpiresAt uint64\n\n\tslice *y.Slice // Used only during prefetching.\n\tnext  *Item\n\ttxn   *Txn\n\n\terr      error\n\twg       sync.WaitGroup\n\tstatus   prefetchStatus\n\tmeta     byte // We need to store meta to know about bitValuePointer.\n\tuserMeta byte\n}\n\n// String returns a string representation of Item\nfunc (item *Item) String() string {\n\treturn fmt.Sprintf(\"key=%q, version=%d, meta=%x\", item.Key(), item.Version(), item.meta)\n}\n\n// Key returns the key.\n//\n// Key is only valid as long as item is valid, or transaction is valid.  If you need to use it\n// outside its validity, please use KeyCopy.\nfunc (item *Item) Key() []byte {\n\treturn item.key\n}\n\n// KeyCopy returns a copy of the key of the item, writing it to dst slice.\n// If nil is passed, or capacity of dst isn't sufficient, a new slice would be allocated and\n// returned.\nfunc (item *Item) KeyCopy(dst []byte) []byte {\n\treturn y.SafeCopy(dst, item.key)\n}\n\n// Version returns the commit timestamp of the item.\nfunc (item *Item) Version() uint64 {\n\treturn item.version\n}\n\n// Value retrieves the value of the item from the value log.\n//\n// This method must be called within a transaction. Calling it outside a\n// transaction is considered undefined behavior. 
If an iterator is being used,\n// then Item.Value() is defined in the current iteration only, because items are\n// reused.\n//\n// If you need to use a value outside a transaction, please use Item.ValueCopy\n// instead, or copy it yourself. Value might change once discard or commit is called.\n// Use ValueCopy if you want to do a Set after Get.\nfunc (item *Item) Value(fn func(val []byte) error) error {\n\titem.wg.Wait()\n\tif item.status == prefetched {\n\t\tif item.err == nil && fn != nil {\n\t\t\tif err := fn(item.val); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\t\treturn item.err\n\t}\n\tbuf, cb, err := item.yieldItemValue()\n\tdefer runCallback(cb)\n\tif err != nil {\n\t\treturn err\n\t}\n\tif fn != nil {\n\t\treturn fn(buf)\n\t}\n\treturn nil\n}\n\n// ValueCopy returns a copy of the value of the item from the value log, writing it to dst slice.\n// If nil is passed, or capacity of dst isn't sufficient, a new slice would be allocated and\n// returned. Tip: It might make sense to reuse the returned slice as dst argument for the next call.\n//\n// This function is useful in long running iterate/update transactions to avoid a write deadlock.\n// See Github issue: https://github.com/dgraph-io/badger/issues/315\nfunc (item *Item) ValueCopy(dst []byte) ([]byte, error) {\n\titem.wg.Wait()\n\tif item.status == prefetched {\n\t\treturn y.SafeCopy(dst, item.val), item.err\n\t}\n\tbuf, cb, err := item.yieldItemValue()\n\tdefer runCallback(cb)\n\treturn y.SafeCopy(dst, buf), err\n}\n\nfunc (item *Item) hasValue() bool {\n\tif item.meta == 0 && item.vptr == nil {\n\t\t// key not found\n\t\treturn false\n\t}\n\treturn true\n}\n\n// IsDeletedOrExpired returns true if item contains deleted or expired value.\nfunc (item *Item) IsDeletedOrExpired() bool {\n\treturn isDeletedOrExpired(item.meta, item.expiresAt)\n}\n\n// DiscardEarlierVersions returns whether the item was created with the\n// option to discard earlier versions of a key when multiple are available.\nfunc 
(item *Item) DiscardEarlierVersions() bool {\n\treturn item.meta&bitDiscardEarlierVersions > 0\n}\n\nfunc (item *Item) yieldItemValue() ([]byte, func(), error) {\n\tkey := item.Key() // No need to copy.\n\tif !item.hasValue() {\n\t\treturn nil, nil, nil\n\t}\n\n\tif item.slice == nil {\n\t\titem.slice = new(y.Slice)\n\t}\n\n\tif (item.meta & bitValuePointer) == 0 {\n\t\tval := item.slice.Resize(len(item.vptr))\n\t\tcopy(val, item.vptr)\n\t\treturn val, nil, nil\n\t}\n\n\tvar vp valuePointer\n\tvp.Decode(item.vptr)\n\tdb := item.txn.db\n\tresult, cb, err := db.vlog.Read(vp, item.slice)\n\tif err != nil {\n\t\tdb.opt.Errorf(\"Unable to read: Key: %v, Version : %v, meta: %v, userMeta: %v\"+\n\t\t\t\" Error: %v\", key, item.version, item.meta, item.userMeta, err)\n\t\tvar txn *Txn\n\t\tif db.opt.managedTxns {\n\t\t\ttxn = db.NewTransactionAt(math.MaxUint64, false)\n\t\t} else {\n\t\t\ttxn = db.NewTransaction(false)\n\t\t}\n\t\tdefer txn.Discard()\n\n\t\tiopt := DefaultIteratorOptions\n\t\tiopt.AllVersions = true\n\t\tiopt.InternalAccess = true\n\t\tiopt.PrefetchValues = false\n\n\t\tit := txn.NewKeyIterator(item.Key(), iopt)\n\t\tdefer it.Close()\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\titem := it.Item()\n\t\t\tvar vp valuePointer\n\t\t\tif item.meta&bitValuePointer > 0 {\n\t\t\t\tvp.Decode(item.vptr)\n\t\t\t}\n\t\t\tdb.opt.Errorf(\"Key: %v, Version : %v, meta: %v, userMeta: %v valuePointer: %+v\",\n\t\t\t\titem.Key(), item.version, item.meta, item.userMeta, vp)\n\t\t}\n\t}\n\t// Don't return error if we cannot read the value. 
Just log the error.\n\treturn result, cb, nil\n}\n\nfunc runCallback(cb func()) {\n\tif cb != nil {\n\t\tcb()\n\t}\n}\n\nfunc (item *Item) prefetchValue() {\n\tval, cb, err := item.yieldItemValue()\n\tdefer runCallback(cb)\n\n\titem.err = err\n\titem.status = prefetched\n\tif val == nil {\n\t\treturn\n\t}\n\tbuf := item.slice.Resize(len(val))\n\tcopy(buf, val)\n\titem.val = buf\n}\n\n// EstimatedSize returns the approximate size of the key-value pair.\n//\n// This can be called while iterating through a store to quickly estimate the\n// size of a range of key-value pairs (without fetching the corresponding\n// values).\nfunc (item *Item) EstimatedSize() int64 {\n\tif !item.hasValue() {\n\t\treturn 0\n\t}\n\tif (item.meta & bitValuePointer) == 0 {\n\t\treturn int64(len(item.key) + len(item.vptr))\n\t}\n\tvar vp valuePointer\n\tvp.Decode(item.vptr)\n\treturn int64(vp.Len) // includes key length.\n}\n\n// KeySize returns the size of the key.\n// Exact size of the key is key + 8 bytes of timestamp\nfunc (item *Item) KeySize() int64 {\n\treturn int64(len(item.key))\n}\n\n// ValueSize returns the approximate size of the value.\n//\n// This can be called to quickly estimate the size of a value without fetching\n// it.\nfunc (item *Item) ValueSize() int64 {\n\tif !item.hasValue() {\n\t\treturn 0\n\t}\n\tif (item.meta & bitValuePointer) == 0 {\n\t\treturn int64(len(item.vptr))\n\t}\n\tvar vp valuePointer\n\tvp.Decode(item.vptr)\n\n\tklen := int64(len(item.key) + 8) // 8 bytes for timestamp.\n\t// 6 bytes are for the approximate length of the header. Since header is encoded in varint, we\n\t// cannot find the exact length of header without fetching it.\n\treturn int64(vp.Len) - klen - 6 - crc32.Size\n}\n\n// UserMeta returns the userMeta set by the user. 
Typically, this byte, optionally set by the user\n// is used to interpret the value.\nfunc (item *Item) UserMeta() byte {\n\treturn item.userMeta\n}\n\n// ExpiresAt returns a Unix time value indicating when the item will be\n// considered expired. 0 indicates that the item will never expire.\nfunc (item *Item) ExpiresAt() uint64 {\n\treturn item.expiresAt\n}\n\n// TODO: Switch this to use linked list container in Go.\ntype list struct {\n\thead *Item\n\ttail *Item\n}\n\nfunc (l *list) push(i *Item) {\n\ti.next = nil\n\tif l.tail == nil {\n\t\tl.head = i\n\t\tl.tail = i\n\t\treturn\n\t}\n\tl.tail.next = i\n\tl.tail = i\n}\n\nfunc (l *list) pop() *Item {\n\tif l.head == nil {\n\t\treturn nil\n\t}\n\ti := l.head\n\tif l.head == l.tail {\n\t\tl.tail = nil\n\t\tl.head = nil\n\t} else {\n\t\tl.head = i.next\n\t}\n\ti.next = nil\n\treturn i\n}\n\n// IteratorOptions is used to set options when iterating over Badger key-value\n// stores.\n//\n// This package provides DefaultIteratorOptions which contains options that\n// should work for most applications. Consider using that as a starting point\n// before customizing it for your own needs.\ntype IteratorOptions struct {\n\t// PrefetchSize is the number of KV pairs to prefetch while iterating.\n\t// Valid only if PrefetchValues is true.\n\tPrefetchSize int\n\t// PrefetchValues Indicates whether we should prefetch values during\n\t// iteration and store them.\n\tPrefetchValues bool\n\tReverse        bool // Direction of iteration. False is forward, true is backward.\n\tAllVersions    bool // Fetch all valid versions of the same key.\n\tInternalAccess bool // Used to allow internal access to badger keys.\n\n\t// The following option is used to narrow down the SSTables that iterator\n\t// picks up. 
If Prefix is specified, only tables which could have this\n\t// prefix are picked based on their range of keys.\n\tprefixIsKey bool   // If set, use the prefix for bloom filter lookup.\n\tPrefix      []byte // Only iterate over this given prefix.\n\tSinceTs     uint64 // Only read data that has version > SinceTs.\n}\n\nfunc (opt *IteratorOptions) compareToPrefix(key []byte) int {\n\t// We should compare key without timestamp. For example key - a[TS] might be > \"aa\" prefix.\n\tkey = y.ParseKey(key)\n\tif len(key) > len(opt.Prefix) {\n\t\tkey = key[:len(opt.Prefix)]\n\t}\n\treturn bytes.Compare(key, opt.Prefix)\n}\n\nfunc (opt *IteratorOptions) pickTable(t table.TableInterface) bool {\n\t// Ignore this table if its max version is less than the sinceTs.\n\tif t.MaxVersion() < opt.SinceTs {\n\t\treturn false\n\t}\n\tif len(opt.Prefix) == 0 {\n\t\treturn true\n\t}\n\tif opt.compareToPrefix(t.Smallest()) > 0 {\n\t\treturn false\n\t}\n\tif opt.compareToPrefix(t.Biggest()) < 0 {\n\t\treturn false\n\t}\n\t// Bloom filter lookup would only work if opt.Prefix does NOT have the read\n\t// timestamp as part of the key.\n\tif opt.prefixIsKey && t.DoesNotHave(y.Hash(opt.Prefix)) {\n\t\treturn false\n\t}\n\treturn true\n}\n\n// pickTables picks the necessary table for the iterator. 
This function also assumes\n// that the tables are sorted in the right order.\nfunc (opt *IteratorOptions) pickTables(all []*table.Table) []*table.Table {\n\tfilterTables := func(tables []*table.Table) []*table.Table {\n\t\tif opt.SinceTs > 0 {\n\t\t\ttmp := tables[:0]\n\t\t\tfor _, t := range tables {\n\t\t\t\tif t.MaxVersion() < opt.SinceTs {\n\t\t\t\t\tcontinue\n\t\t\t\t}\n\t\t\t\ttmp = append(tmp, t)\n\t\t\t}\n\t\t\ttables = tmp\n\t\t}\n\t\treturn tables\n\t}\n\n\tif len(opt.Prefix) == 0 {\n\t\tout := make([]*table.Table, len(all))\n\t\tcopy(out, all)\n\t\treturn filterTables(out)\n\t}\n\tsIdx := sort.Search(len(all), func(i int) bool {\n\t\t// table.Biggest >= opt.prefix\n\t\t// if opt.Prefix < table.Biggest, then surely it is not in any of the preceding tables.\n\t\treturn opt.compareToPrefix(all[i].Biggest()) >= 0\n\t})\n\tif sIdx == len(all) {\n\t\t// Not found.\n\t\treturn []*table.Table{}\n\t}\n\n\tfiltered := all[sIdx:]\n\tif !opt.prefixIsKey {\n\t\teIdx := sort.Search(len(filtered), func(i int) bool {\n\t\t\treturn opt.compareToPrefix(filtered[i].Smallest()) > 0\n\t\t})\n\t\tout := make([]*table.Table, len(filtered[:eIdx]))\n\t\tcopy(out, filtered[:eIdx])\n\t\treturn filterTables(out)\n\t}\n\n\t// opt.prefixIsKey == true. This code is optimizing for opt.prefixIsKey part.\n\tvar out []*table.Table\n\thash := y.Hash(opt.Prefix)\n\tfor _, t := range filtered {\n\t\t// When we encounter the first table whose smallest key is higher than opt.Prefix, we can\n\t\t// stop. This is an IMPORTANT optimization, just considering how often we call\n\t\t// NewKeyIterator.\n\t\tif opt.compareToPrefix(t.Smallest()) > 0 {\n\t\t\t// if table.Smallest > opt.Prefix, then this and all tables after this can be ignored.\n\t\t\tbreak\n\t\t}\n\t\t// opt.Prefix is actually the key. 
So, we can run bloom filter checks\n\t\t// as well.\n\t\tif t.DoesNotHave(hash) {\n\t\t\tcontinue\n\t\t}\n\t\tout = append(out, t)\n\t}\n\treturn filterTables(out)\n}\n\n// DefaultIteratorOptions contains default options when iterating over Badger key-value stores.\nvar DefaultIteratorOptions = IteratorOptions{\n\tPrefetchValues: true,\n\tPrefetchSize:   100,\n\tReverse:        false,\n\tAllVersions:    false,\n}\n\n// Iterator helps iterating over the KV pairs in a lexicographically sorted order.\ntype Iterator struct {\n\tiitr   y.Iterator\n\ttxn    *Txn\n\treadTs uint64\n\n\topt   IteratorOptions\n\titem  *Item\n\tdata  list\n\twaste list\n\n\tlastKey []byte // Used to skip over multiple versions of the same key.\n\n\tclosed  bool\n\tscanned int // Used to estimate the size of data scanned by iterator.\n\n\t// ThreadId is an optional value that can be set to identify which goroutine created\n\t// the iterator. It can be used, for example, to uniquely identify each of the\n\t// iterators created by the stream interface\n\tThreadId int\n\n\tAlloc *z.Allocator\n}\n\n// NewIterator returns a new iterator. Depending upon the options, either only keys, or both\n// key-value pairs would be fetched. The keys are returned in lexicographically sorted order.\n// Using prefetch is recommended if you're doing a long running iteration, for performance.\n//\n// Multiple Iterators:\n// For a read-only txn, multiple iterators can be running simultaneously. However, for a read-write\n// txn, iterators have the nuance of being a snapshot of the writes for the transaction at the time\n// iterator was created. If writes are performed after an iterator is created, then that iterator\n// will not be able to see those writes. 
Only writes performed before an iterator was created can be\n// viewed.\nfunc (txn *Txn) NewIterator(opt IteratorOptions) *Iterator {\n\tif txn.discarded {\n\t\tpanic(ErrDiscardedTxn)\n\t}\n\tif txn.db.IsClosed() {\n\t\tpanic(ErrDBClosed)\n\t}\n\n\ty.NumIteratorsCreatedAdd(txn.db.opt.MetricsEnabled, 1)\n\n\t// Keep track of the number of active iterators.\n\ttxn.numIterators.Add(1)\n\n\t// TODO: If Prefix is set, only pick those memtables which have keys with the prefix.\n\ttables, decr := txn.db.getMemTables()\n\tdefer decr()\n\ttxn.db.vlog.incrIteratorCount()\n\tvar iters []y.Iterator\n\tif itr := txn.newPendingWritesIterator(opt.Reverse); itr != nil {\n\t\titers = append(iters, itr)\n\t}\n\tfor i := 0; i < len(tables); i++ {\n\t\titers = append(iters, tables[i].sl.NewUniIterator(opt.Reverse))\n\t}\n\titers = txn.db.lc.appendIterators(iters, &opt) // This will increment references.\n\tres := &Iterator{\n\t\ttxn:    txn,\n\t\tiitr:   table.NewMergeIterator(iters, opt.Reverse),\n\t\topt:    opt,\n\t\treadTs: txn.readTs,\n\t}\n\treturn res\n}\n\n// NewKeyIterator is just like NewIterator, but allows the user to iterate over all versions of a\n// single key. 
Internally, it sets the Prefix option in provided opt, and uses that prefix to\n// additionally run bloom filter lookups before picking tables from the LSM tree.\nfunc (txn *Txn) NewKeyIterator(key []byte, opt IteratorOptions) *Iterator {\n\tif len(opt.Prefix) > 0 {\n\t\tpanic(\"opt.Prefix should be nil for NewKeyIterator.\")\n\t}\n\topt.Prefix = key // This key must be without the timestamp.\n\topt.prefixIsKey = true\n\topt.AllVersions = true\n\treturn txn.NewIterator(opt)\n}\n\nfunc (it *Iterator) newItem() *Item {\n\titem := it.waste.pop()\n\tif item == nil {\n\t\titem = &Item{slice: new(y.Slice), txn: it.txn}\n\t}\n\treturn item\n}\n\n// Item returns pointer to the current key-value pair.\n// This item is only valid until it.Next() gets called.\nfunc (it *Iterator) Item() *Item {\n\ttx := it.txn\n\ttx.addReadKey(it.item.Key())\n\treturn it.item\n}\n\n// Valid returns false when iteration is done.\nfunc (it *Iterator) Valid() bool {\n\tif it.item == nil {\n\t\treturn false\n\t}\n\tif it.opt.prefixIsKey {\n\t\treturn bytes.Equal(it.item.key, it.opt.Prefix)\n\t}\n\treturn bytes.HasPrefix(it.item.key, it.opt.Prefix)\n}\n\n// ValidForPrefix returns false when iteration is done\n// or when the current key is not prefixed by the specified prefix.\nfunc (it *Iterator) ValidForPrefix(prefix []byte) bool {\n\treturn it.Valid() && bytes.HasPrefix(it.item.key, prefix)\n}\n\n// Close would close the iterator. It is important to call this when you're done with iteration.\nfunc (it *Iterator) Close() {\n\tif it.closed {\n\t\treturn\n\t}\n\tit.closed = true\n\tif it.iitr == nil {\n\t\tit.txn.numIterators.Add(-1)\n\t\treturn\n\t}\n\n\tit.iitr.Close()\n\t// It is important to wait for the fill goroutines to finish. 
Otherwise, we might leave zombie\n\t// goroutines behind, which are waiting to acquire file read locks after DB has been closed.\n\twaitFor := func(l list) {\n\t\titem := l.pop()\n\t\tfor item != nil {\n\t\t\titem.wg.Wait()\n\t\t\titem = l.pop()\n\t\t}\n\t}\n\twaitFor(it.waste)\n\twaitFor(it.data)\n\n\t// TODO: We could handle this error.\n\t_ = it.txn.db.vlog.decrIteratorCount()\n\tit.txn.numIterators.Add(-1)\n}\n\n// Next would advance the iterator by one. Always check it.Valid() after a Next()\n// to ensure you have access to a valid it.Item().\nfunc (it *Iterator) Next() {\n\tif it.iitr == nil {\n\t\treturn\n\t}\n\t// Reuse current item\n\tit.item.wg.Wait() // Just cleaner to wait before pushing to avoid doing ref counting.\n\tit.scanned += len(it.item.key) + len(it.item.val) + len(it.item.vptr) + 2\n\tit.waste.push(it.item)\n\n\t// Set next item to current\n\tit.item = it.data.pop()\n\tfor it.iitr.Valid() && hasPrefix(it) {\n\t\tif it.parseItem() {\n\t\t\t// parseItem calls one extra next.\n\t\t\t// This is used to deal with the complexity of reverse iteration.\n\t\t\tbreak\n\t\t}\n\t}\n}\n\nfunc isDeletedOrExpired(meta byte, expiresAt uint64) bool {\n\tif meta&bitDelete > 0 {\n\t\treturn true\n\t}\n\tif expiresAt == 0 {\n\t\treturn false\n\t}\n\treturn expiresAt <= uint64(time.Now().Unix())\n}\n\n// parseItem is a complex function because it needs to handle both forward and reverse iteration\n// implementation. We store keys such that their versions are sorted in descending order. This makes\n// forward iteration efficient, but reverse iteration complicated. This tradeoff is better because\n// forward iteration is more common than reverse. 
It returns true, if either the iterator is invalid\n// or it has pushed an item into it.data list, else it returns false.\n//\n// This function advances the iterator.\nfunc (it *Iterator) parseItem() bool {\n\tmi := it.iitr\n\tkey := mi.Key()\n\n\tsetItem := func(item *Item) {\n\t\tif it.item == nil {\n\t\t\tit.item = item\n\t\t} else {\n\t\t\tit.data.push(item)\n\t\t}\n\t}\n\n\tisInternalKey := bytes.HasPrefix(key, badgerPrefix)\n\t// Skip badger keys.\n\tif !it.opt.InternalAccess && isInternalKey {\n\t\tmi.Next()\n\t\treturn false\n\t}\n\n\t// Skip any versions which are beyond the readTs.\n\tversion := y.ParseTs(key)\n\t// Ignore everything that is above the readTs and below or at the sinceTs.\n\tif version > it.readTs || (it.opt.SinceTs > 0 && version <= it.opt.SinceTs) {\n\t\tmi.Next()\n\t\treturn false\n\t}\n\n\t// Skip banned keys only if it does not have badger internal prefix.\n\tif !isInternalKey && it.txn.db.isBanned(key) != nil {\n\t\tmi.Next()\n\t\treturn false\n\t}\n\n\tif it.opt.AllVersions {\n\t\t// Return deleted or expired values also, otherwise user can't figure out\n\t\t// whether the key was deleted.\n\t\titem := it.newItem()\n\t\tit.fill(item)\n\t\tsetItem(item)\n\t\tmi.Next()\n\t\treturn true\n\t}\n\n\t// If iterating in forward direction, then just checking the last key against current key would\n\t// be sufficient.\n\tif !it.opt.Reverse {\n\t\tif y.SameKey(it.lastKey, key) {\n\t\t\tmi.Next()\n\t\t\treturn false\n\t\t}\n\t\t// Only track in forward direction.\n\t\t// We should update lastKey as soon as we find a different key in our snapshot.\n\t\t// Consider keys: a 5, b 7 (del), b 5. When iterating, lastKey = a.\n\t\t// Then we see b 7, which is deleted. If we don't store lastKey = b, we'll then return b 5,\n\t\t// which is wrong. 
Therefore, update lastKey here.\n\t\tit.lastKey = y.SafeCopy(it.lastKey, mi.Key())\n\t}\n\nFILL:\n\t// If deleted, advance and return.\n\tvs := mi.Value()\n\tif isDeletedOrExpired(vs.Meta, vs.ExpiresAt) {\n\t\tmi.Next()\n\t\treturn false\n\t}\n\n\titem := it.newItem()\n\tit.fill(item)\n\t// fill item based on current cursor position. All Next calls have returned, so reaching here\n\t// means no Next was called.\n\n\tmi.Next()                           // Advance but no fill item yet.\n\tif !it.opt.Reverse || !mi.Valid() { // Forward direction, or invalid.\n\t\tsetItem(item)\n\t\treturn true\n\t}\n\n\t// Reverse direction.\n\tnextTs := y.ParseTs(mi.Key())\n\tmik := y.ParseKey(mi.Key())\n\tif nextTs <= it.readTs && bytes.Equal(mik, item.key) {\n\t\t// This is a valid potential candidate.\n\t\tgoto FILL\n\t}\n\t// Ignore the next candidate. Return the current one.\n\tsetItem(item)\n\treturn true\n}\n\nfunc (it *Iterator) fill(item *Item) {\n\tvs := it.iitr.Value()\n\titem.meta = vs.Meta\n\titem.userMeta = vs.UserMeta\n\titem.expiresAt = vs.ExpiresAt\n\n\titem.version = y.ParseTs(it.iitr.Key())\n\titem.key = y.SafeCopy(item.key, y.ParseKey(it.iitr.Key()))\n\n\titem.vptr = y.SafeCopy(item.vptr, vs.Value)\n\titem.val = nil\n\tif it.opt.PrefetchValues {\n\t\titem.wg.Add(1)\n\t\tgo func() {\n\t\t\t// FIXME we are not handling errors here.\n\t\t\titem.prefetchValue()\n\t\t\titem.wg.Done()\n\t\t}()\n\t}\n}\n\nfunc hasPrefix(it *Iterator) bool {\n\t// We shouldn't check prefix in case the iterator is going in reverse. 
Since in reverse we expect\n\t// people to append items to the end of prefix.\n\tif !it.opt.Reverse && len(it.opt.Prefix) > 0 {\n\t\treturn bytes.HasPrefix(y.ParseKey(it.iitr.Key()), it.opt.Prefix)\n\t}\n\treturn true\n}\n\nfunc (it *Iterator) prefetch() {\n\tprefetchSize := 2\n\tif it.opt.PrefetchValues && it.opt.PrefetchSize > 1 {\n\t\tprefetchSize = it.opt.PrefetchSize\n\t}\n\n\ti := it.iitr\n\tvar count int\n\tit.item = nil\n\tfor i.Valid() && hasPrefix(it) {\n\t\tif !it.parseItem() {\n\t\t\tcontinue\n\t\t}\n\t\tcount++\n\t\tif count == prefetchSize {\n\t\t\tbreak\n\t\t}\n\t}\n}\n\n// Seek would seek to the provided key if present. If absent, it would seek to the next\n// smallest key greater than the provided key if iterating in the forward direction.\n// Behavior would be reversed if iterating backwards.\nfunc (it *Iterator) Seek(key []byte) {\n\tif it.iitr == nil {\n\t\treturn\n\t}\n\tif len(key) > 0 {\n\t\tit.txn.addReadKey(key)\n\t}\n\tfor i := it.data.pop(); i != nil; i = it.data.pop() {\n\t\ti.wg.Wait()\n\t\tit.waste.push(i)\n\t}\n\n\tit.lastKey = it.lastKey[:0]\n\tif len(key) == 0 {\n\t\tkey = it.opt.Prefix\n\t}\n\tif len(key) == 0 {\n\t\tit.iitr.Rewind()\n\t\tit.prefetch()\n\t\treturn\n\t}\n\n\tif !it.opt.Reverse {\n\t\tkey = y.KeyWithTs(key, it.txn.readTs)\n\t} else {\n\t\tkey = y.KeyWithTs(key, 0)\n\t}\n\tit.iitr.Seek(key)\n\tit.prefetch()\n}\n\n// Rewind would rewind the iterator cursor all the way to zero-th position, which would be the\n// smallest key if iterating forward, and largest if iterating backward. It does not keep track of\n// whether the cursor started with a Seek().\nfunc (it *Iterator) Rewind() {\n\tit.Seek(nil)\n}\n"
  },
  {
    "path": "iterator_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n\t\"math\"\n\t\"math/rand\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"strings\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\ntype tableMock struct {\n\tleft, right []byte\n}\n\nfunc (tm *tableMock) Smallest() []byte             { return tm.left }\nfunc (tm *tableMock) Biggest() []byte              { return tm.right }\nfunc (tm *tableMock) DoesNotHave(hash uint32) bool { return false }\nfunc (tm *tableMock) MaxVersion() uint64           { return math.MaxUint64 }\n\nfunc TestPickTables(t *testing.T) {\n\topt := DefaultIteratorOptions\n\n\twithin := func(prefix, left, right []byte) {\n\t\topt.Prefix = prefix\n\t\t// PickTable expects smallest and biggest to contain timestamps.\n\t\ttm := &tableMock{left: y.KeyWithTs(left, 1), right: y.KeyWithTs(right, 1)}\n\t\trequire.True(t, opt.pickTable(tm), \"within failed for %b %b %b\\n\", prefix, left, right)\n\t}\n\toutside := func(prefix, left, right string) {\n\t\topt.Prefix = []byte(prefix)\n\t\t// PickTable expects smallest and biggest to contain timestamps.\n\t\ttm := &tableMock{left: y.KeyWithTs([]byte(left), 1), right: y.KeyWithTs([]byte(right), 1)}\n\t\trequire.False(t, opt.pickTable(tm), \"outside failed for %b %b %b\", prefix, left, right)\n\t}\n\twithin([]byte(\"abc\"), []byte(\"ab\"), []byte(\"ad\"))\n\twithin([]byte(\"abc\"), []byte(\"abc\"), []byte(\"ad\"))\n\twithin([]byte(\"abc\"), []byte(\"abb123\"), []byte(\"ad\"))\n\twithin([]byte(\"abc\"), []byte(\"abc123\"), []byte(\"abd234\"))\n\twithin([]byte(\"abc\"), []byte(\"abc123\"), []byte(\"abc456\"))\n\t// Regression test for https://github.com/dgraph-io/badger/issues/992\n\twithin([]byte{0, 0, 1}, []byte{0}, []byte{0, 0, 
1})\n\n\toutside(\"abd\", \"abe\", \"ad\")\n\toutside(\"abd\", \"ac\", \"ad\")\n\toutside(\"abd\", \"b\", \"e\")\n\toutside(\"abd\", \"a\", \"ab\")\n\toutside(\"abd\", \"ab\", \"abc\")\n\toutside(\"abd\", \"ab\", \"abc123\")\n}\n\nfunc TestPickSortTables(t *testing.T) {\n\ttype MockKeys struct {\n\t\tsmall string\n\t\tlarge string\n\t}\n\tgenTables := func(mks ...MockKeys) []*table.Table {\n\t\tout := make([]*table.Table, 0)\n\t\tfor _, mk := range mks {\n\t\t\topts := table.Options{ChkMode: options.OnTableAndBlockRead}\n\t\t\ttbl := buildTable(t, [][]string{{mk.small, \"some value\"},\n\t\t\t\t{mk.large, \"some value\"}}, opts)\n\t\t\tdefer func() { require.NoError(t, tbl.DecrRef()) }()\n\t\t\tout = append(out, tbl)\n\t\t}\n\t\treturn out\n\t}\n\ttables := genTables(MockKeys{small: \"a\", large: \"abc\"},\n\t\tMockKeys{small: \"abcd\", large: \"cde\"},\n\t\tMockKeys{small: \"cge\", large: \"chf\"},\n\t\tMockKeys{small: \"glr\", large: \"gyup\"})\n\topt := DefaultIteratorOptions\n\topt.prefixIsKey = false\n\topt.Prefix = []byte(\"c\")\n\tfiltered := opt.pickTables(tables)\n\trequire.Equal(t, 2, len(filtered))\n\t// buildTable appends a timestamp to each key, so strip the trailing bytes before comparing.\n\trequire.Equal(t, filtered[0].Smallest()[:4], []byte(\"abcd\"))\n\trequire.Equal(t, filtered[1].Smallest()[:3], []byte(\"cge\"))\n\ttables = genTables(MockKeys{small: \"a\", large: \"abc\"},\n\t\tMockKeys{small: \"abcd\", large: \"ade\"},\n\t\tMockKeys{small: \"cge\", large: \"chf\"},\n\t\tMockKeys{small: \"glr\", large: \"gyup\"})\n\tfiltered = opt.pickTables(tables)\n\trequire.Equal(t, 1, len(filtered))\n\trequire.Equal(t, filtered[0].Smallest()[:3], []byte(\"cge\"))\n\ttables = genTables(MockKeys{small: \"a\", large: \"abc\"},\n\t\tMockKeys{small: \"abcd\", large: \"ade\"},\n\t\tMockKeys{small: \"cge\", large: \"chf\"},\n\t\tMockKeys{small: \"ckr\", large: \"cyup\"},\n\t\tMockKeys{small: \"csfr\", large: \"gyup\"})\n\tfiltered = opt.pickTables(tables)\n\trequire.Equal(t, 3, 
len(filtered))\n\trequire.Equal(t, filtered[0].Smallest()[:3], []byte(\"cge\"))\n\trequire.Equal(t, filtered[1].Smallest()[:3], []byte(\"ckr\"))\n\trequire.Equal(t, filtered[2].Smallest()[:4], []byte(\"csfr\"))\n\n\topt.Prefix = []byte(\"aa\")\n\tfiltered = opt.pickTables(tables)\n\trequire.Equal(t, y.ParseKey(filtered[0].Smallest()), []byte(\"a\"))\n\trequire.Equal(t, y.ParseKey(filtered[0].Biggest()), []byte(\"abc\"))\n}\n\nfunc TestIterateSinceTs(t *testing.T) {\n\tbkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%04d\", i))\n\t}\n\tval := []byte(\"OK\")\n\tn := 100000\n\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tbatch := db.NewWriteBatch()\n\t\tfor i := 0; i < n; i++ {\n\t\t\tif (i % 10000) == 0 {\n\t\t\t\tt.Logf(\"Put i=%d\\n\", i)\n\t\t\t}\n\t\t\trequire.NoError(t, batch.Set(bkey(i), val))\n\t\t}\n\t\trequire.NoError(t, batch.Flush())\n\n\t\tmaxVs := db.MaxVersion()\n\t\tsinceTs := maxVs - maxVs/10\n\t\tiopt := DefaultIteratorOptions\n\t\tiopt.SinceTs = sinceTs\n\n\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\tit := txn.NewIterator(iopt)\n\t\t\tdefer it.Close()\n\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\ti := it.Item()\n\t\t\t\trequire.GreaterOrEqual(t, i.Version(), sinceTs)\n\t\t\t}\n\t\t\treturn nil\n\t\t}))\n\n\t})\n}\n\nfunc TestIterateSinceTsWithPendingWrites(t *testing.T) {\n\t// The pending entries still have version=0. 
Even if IteratorOptions.SinceTs is 0, the entries\n\t// should be visible.\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\ttxn := db.NewTransaction(true)\n\t\tdefer txn.Discard()\n\t\trequire.NoError(t, txn.Set([]byte(\"key1\"), []byte(\"value1\")))\n\t\trequire.NoError(t, txn.Set([]byte(\"key2\"), []byte(\"value2\")))\n\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\tdefer itr.Close()\n\t\tcount := 0\n\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\tcount++\n\t\t}\n\t\trequire.Equal(t, 2, count)\n\t})\n}\n\nfunc TestIteratePrefix(t *testing.T) {\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\ttestIteratorPrefix := func(t *testing.T, db *DB) {\n\t\tbkey := func(i int) []byte {\n\t\t\treturn []byte(fmt.Sprintf(\"%04d\", i))\n\t\t}\n\t\tval := []byte(\"OK\")\n\t\tn := 10000\n\n\t\tbatch := db.NewWriteBatch()\n\t\tfor i := 0; i < n; i++ {\n\t\t\tif (i % 1000) == 0 {\n\t\t\t\tt.Logf(\"Put i=%d\\n\", i)\n\t\t\t}\n\t\t\trequire.NoError(t, batch.Set(bkey(i), val))\n\t\t}\n\t\trequire.NoError(t, batch.Flush())\n\n\t\tcountKeys := func(prefix string) int {\n\t\t\tt.Logf(\"Testing with prefix: %s\", prefix)\n\t\t\tvar count int\n\t\t\topt := DefaultIteratorOptions\n\t\t\topt.Prefix = []byte(prefix)\n\t\t\terr := db.View(func(txn *Txn) error {\n\t\t\t\titr := txn.NewIterator(opt)\n\t\t\t\tdefer itr.Close()\n\t\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\t\titem := itr.Item()\n\t\t\t\t\terr := item.Value(func(v []byte) error {\n\t\t\t\t\t\trequire.Equal(t, val, v)\n\t\t\t\t\t\treturn nil\n\t\t\t\t\t})\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\trequire.True(t, bytes.HasPrefix(item.Key(), opt.Prefix))\n\t\t\t\t\tcount++\n\t\t\t\t}\n\t\t\t\treturn nil\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t\treturn count\n\t\t}\n\n\t\tcountOneKey := func(key []byte) int {\n\t\t\tvar count int\n\t\t\terr := db.View(func(txn *Txn) error {\n\t\t\t\titr := txn.NewKeyIterator(key, 
DefaultIteratorOptions)\n\t\t\t\tdefer itr.Close()\n\t\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\t\titem := itr.Item()\n\t\t\t\t\terr := item.Value(func(v []byte) error {\n\t\t\t\t\t\trequire.Equal(t, val, v)\n\t\t\t\t\t\treturn nil\n\t\t\t\t\t})\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\trequire.Equal(t, key, item.Key())\n\t\t\t\t\tcount++\n\t\t\t\t}\n\t\t\t\treturn nil\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t\treturn count\n\t\t}\n\n\t\tfor i := 0; i <= 9; i++ {\n\t\t\trequire.Equal(t, 1, countKeys(fmt.Sprintf(\"%d%d%d%d\", i, i, i, i)))\n\t\t\trequire.Equal(t, 10, countKeys(fmt.Sprintf(\"%d%d%d\", i, i, i)))\n\t\t\trequire.Equal(t, 100, countKeys(fmt.Sprintf(\"%d%d\", i, i)))\n\t\t\trequire.Equal(t, 1000, countKeys(fmt.Sprintf(\"%d\", i)))\n\t\t}\n\t\trequire.Equal(t, 10000, countKeys(\"\"))\n\n\t\tt.Logf(\"Testing each key with key iterator\")\n\t\tfor i := 0; i < n; i++ {\n\t\t\trequire.Equal(t, 1, countOneKey(bkey(i)))\n\t\t}\n\t}\n\n\tt.Run(\"With Default options\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\ttestIteratorPrefix(t, db)\n\t\t})\n\t})\n\n\tt.Run(\"With Block Offsets in Cache\", func(t *testing.T) {\n\t\topts := getTestOptions(\"\")\n\t\topts.IndexCacheSize = 100 << 20\n\t\trunBadgerTest(t, &opts, func(t *testing.T, db *DB) {\n\t\t\ttestIteratorPrefix(t, db)\n\t\t})\n\t})\n\n\tt.Run(\"With Block Offsets and Blocks in Cache\", func(t *testing.T) {\n\t\topts := getTestOptions(\"\")\n\t\topts.BlockCacheSize = 100 << 20\n\t\topts.IndexCacheSize = 100 << 20\n\t\trunBadgerTest(t, &opts, func(t *testing.T, db *DB) {\n\t\t\ttestIteratorPrefix(t, db)\n\t\t})\n\t})\n\n\tt.Run(\"With Blocks in Cache\", func(t *testing.T) {\n\t\topts := getTestOptions(\"\")\n\t\topts.BlockCacheSize = 100 << 20\n\t\trunBadgerTest(t, &opts, func(t *testing.T, db *DB) {\n\t\t\ttestIteratorPrefix(t, db)\n\t\t})\n\t})\n\n}\n\n// Sanity test to verify the iterator does not crash the db in readonly mode if data does 
not exist.\nfunc TestIteratorReadOnlyWithNoData(t *testing.T) {\n\tdir, err := os.MkdirTemp(\".\", \"badger-test\")\n\ty.Check(err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\trequire.NoError(t, db.Close())\n\n\topts.ReadOnly = true\n\tdb, err = Open(opts)\n\trequire.NoError(t, err)\n\tdefer func() {\n\t\trequire.NoError(t, db.Close())\n\t}()\n\n\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\tiopts := DefaultIteratorOptions\n\t\tiopts.Prefix = []byte(\"xxx\")\n\t\titr := txn.NewIterator(iopts)\n\t\tdefer itr.Close()\n\t\treturn nil\n\t}))\n}\n\n// go test -v -run=XXX -bench=BenchmarkIterate -benchtime=3s\n// Benchmark with opt.Prefix set ===\n// goos: linux\n// goarch: amd64\n// pkg: github.com/dgraph-io/badger\n// BenchmarkIteratePrefixSingleKey/Key_lookups-4         \t   10000\t    365539 ns/op\n// --- BENCH: BenchmarkIteratePrefixSingleKey/Key_lookups-4\n//\n//\titerator_test.go:147: Inner b.N: 1\n//\titerator_test.go:147: Inner b.N: 100\n//\titerator_test.go:147: Inner b.N: 10000\n//\n// --- BENCH: BenchmarkIteratePrefixSingleKey\n//\n//\titerator_test.go:143: LSM files: 79\n//\titerator_test.go:145: Outer b.N: 1\n//\n// PASS\n// ok  \tgithub.com/dgraph-io/badger\t41.586s\n//\n// Benchmark with NO opt.Prefix set ===\n// goos: linux\n// goarch: amd64\n// pkg: github.com/dgraph-io/badger\n// BenchmarkIteratePrefixSingleKey/Key_lookups-4         \t   10000\t    460924 ns/op\n// --- BENCH: BenchmarkIteratePrefixSingleKey/Key_lookups-4\n//\n//\titerator_test.go:147: Inner b.N: 1\n//\titerator_test.go:147: Inner b.N: 100\n//\titerator_test.go:147: Inner b.N: 10000\n//\n// --- BENCH: BenchmarkIteratePrefixSingleKey\n//\n//\titerator_test.go:143: LSM files: 83\n//\titerator_test.go:145: Outer b.N: 1\n//\n// PASS\n// ok  \tgithub.com/dgraph-io/badger\t41.836s\n//\n// On my laptop there's a 20% improvement in latency with ~80 files.\nfunc BenchmarkIteratePrefixSingleKey(b *testing.B) {\n\tdir, 
err := os.MkdirTemp(\".\", \"badger-test\")\n\ty.Check(err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\tdb, err := Open(opts)\n\ty.Check(err)\n\tdefer db.Close()\n\n\tN := 100000 // Should generate around 80 SSTables.\n\tval := []byte(\"OK\")\n\tbkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%06d\", i))\n\t}\n\n\tbatch := db.NewWriteBatch()\n\tfor i := 0; i < N; i++ {\n\t\ty.Check(batch.Set(bkey(i), val))\n\t}\n\ty.Check(batch.Flush())\n\tvar lsmFiles int\n\terr = filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {\n\t\tif strings.HasSuffix(path, \".sst\") {\n\t\t\tlsmFiles++\n\t\t}\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\treturn nil\n\t})\n\ty.Check(err)\n\tb.Logf(\"LSM files: %d\", lsmFiles)\n\tb.Logf(\"Key splits: %v\", db.Ranges(nil, 10000))\n\tb.Logf(\"Key splits with prefix: %v\", db.Ranges([]byte(\"09\"), 10000))\n\n\tb.Logf(\"Outer b.N: %d\", b.N)\n\tb.Run(\"Key lookups\", func(b *testing.B) {\n\t\tb.Logf(\"Inner b.N: %d\", b.N)\n\t\tfor i := 0; i < b.N; i++ {\n\t\t\tkey := bkey(rand.Intn(N))\n\t\t\terr := db.View(func(txn *Txn) error {\n\t\t\t\topt := DefaultIteratorOptions\n\t\t\t\t// NOTE: Comment opt.Prefix out here to compare the performance\n\t\t\t\t// difference between providing Prefix as an option, v/s not. I\n\t\t\t\t// see a 20% improvement when there are ~80 SSTables.\n\t\t\t\topt.Prefix = key\n\t\t\t\topt.AllVersions = true\n\n\t\t\t\titr := txn.NewIterator(opt)\n\t\t\t\tdefer itr.Close()\n\n\t\t\t\tvar count int\n\t\t\t\tfor itr.Seek(key); itr.ValidForPrefix(key); itr.Next() {\n\t\t\t\t\tcount++\n\t\t\t\t}\n\t\t\t\tif count != 1 {\n\t\t\t\t\tb.Fatalf(\"Count must be one key: %s. Found: %d\", key, count)\n\t\t\t\t}\n\t\t\t\treturn nil\n\t\t\t})\n\t\t\tif err != nil {\n\t\t\t\tb.Fatalf(\"Error while View: %v\", err)\n\t\t\t}\n\t\t}\n\t})\n}\n"
  },
  {
    "path": "key_registry.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"crypto/aes\"\n\t\"crypto/rand\"\n\t\"encoding/binary\"\n\t\"hash/crc32\"\n\t\"io\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"sync\"\n\t\"time\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"google.golang.org/protobuf/proto\"\n)\n\nconst (\n\t// KeyRegistryFileName is the file name for the key registry file.\n\tKeyRegistryFileName = \"KEYREGISTRY\"\n\t// KeyRegistryRewriteFileName is the file name for the rewrite key registry file.\n\tKeyRegistryRewriteFileName = \"REWRITE-KEYREGISTRY\"\n)\n\n// SanityText is used to check whether the given user provided storage key is valid or not\nvar sanityText = []byte(\"Hello Badger\")\n\n// KeyRegistry used to maintain all the data keys.\ntype KeyRegistry struct {\n\tsync.RWMutex\n\tdataKeys    map[uint64]*pb.DataKey\n\tlastCreated int64 //lastCreated is the timestamp(seconds) of the last data key generated.\n\tnextKeyID   uint64\n\tfp          *os.File\n\topt         KeyRegistryOptions\n}\n\ntype KeyRegistryOptions struct {\n\tDir                           string\n\tReadOnly                      bool\n\tEncryptionKey                 []byte\n\tEncryptionKeyRotationDuration time.Duration\n\tInMemory                      bool\n}\n\n// newKeyRegistry returns KeyRegistry.\nfunc newKeyRegistry(opt KeyRegistryOptions) *KeyRegistry {\n\treturn &KeyRegistry{\n\t\tdataKeys:  make(map[uint64]*pb.DataKey),\n\t\tnextKeyID: 0,\n\t\topt:       opt,\n\t}\n}\n\n// OpenKeyRegistry opens key registry if it exists, otherwise it'll create key registry\n// and returns key registry.\nfunc OpenKeyRegistry(opt KeyRegistryOptions) (*KeyRegistry, error) {\n\t// sanity check the encryption key length.\n\tif len(opt.EncryptionKey) > 0 {\n\t\tswitch len(opt.EncryptionKey) {\n\t\tdefault:\n\t\t\treturn nil, y.Wrapf(ErrInvalidEncryptionKey, \"During 
OpenKeyRegistry\")\n\t\tcase 16, 24, 32:\n\t\t\tbreak\n\t\t}\n\t}\n\t// If db is opened in InMemory mode, we don't need to write key registry to the disk.\n\tif opt.InMemory {\n\t\treturn newKeyRegistry(opt), nil\n\t}\n\tpath := filepath.Join(opt.Dir, KeyRegistryFileName)\n\tvar flags y.Flags\n\tif opt.ReadOnly {\n\t\tflags |= y.ReadOnly\n\t} else {\n\t\tflags |= y.Sync\n\t}\n\tfp, err := y.OpenExistingFile(path, flags)\n\t// OpenExistingFile just open file.\n\t// So checking whether the file exist or not. If not\n\t// We'll create new keyregistry.\n\tif os.IsNotExist(err) {\n\t\t// Creating new registry file if not exist.\n\t\tkr := newKeyRegistry(opt)\n\t\tif opt.ReadOnly {\n\t\t\treturn kr, nil\n\t\t}\n\t\t// Writing the key registry to the file.\n\t\tif err := WriteKeyRegistry(kr, opt); err != nil {\n\t\t\treturn nil, y.Wrapf(err, \"Error while writing key registry.\")\n\t\t}\n\t\tfp, err = y.OpenExistingFile(path, flags)\n\t\tif err != nil {\n\t\t\treturn nil, y.Wrapf(err, \"Error while opening newly created key registry.\")\n\t\t}\n\t} else if err != nil {\n\t\treturn nil, y.Wrapf(err, \"Error while opening key registry.\")\n\t}\n\tkr, err := readKeyRegistry(fp, opt)\n\tif err != nil {\n\t\t// This case happens only if the file is opened properly and\n\t\t// not able to read.\n\t\tfp.Close()\n\t\treturn nil, err\n\t}\n\tif opt.ReadOnly {\n\t\t// We'll close the file in readonly mode.\n\t\treturn kr, fp.Close()\n\t}\n\tkr.fp = fp\n\treturn kr, nil\n}\n\n// keyRegistryIterator reads all the datakey from the key registry\ntype keyRegistryIterator struct {\n\tencryptionKey []byte\n\tfp            *os.File\n\t// lenCrcBuf contains crc buf and data length to move forward.\n\tlenCrcBuf [8]byte\n}\n\n// newKeyRegistryIterator returns iterator which will allow you to iterate\n// over the data key of the key registry.\nfunc newKeyRegistryIterator(fp *os.File, encryptionKey []byte) (*keyRegistryIterator, error) {\n\treturn &keyRegistryIterator{\n\t\tencryptionKey: 
encryptionKey,\n\t\tfp:            fp,\n\t\tlenCrcBuf:     [8]byte{},\n\t}, validRegistry(fp, encryptionKey)\n}\n\n// validRegistry checks that given encryption key is valid or not.\nfunc validRegistry(fp *os.File, encryptionKey []byte) error {\n\tiv := make([]byte, aes.BlockSize)\n\tvar err error\n\tif _, err = fp.Read(iv); err != nil {\n\t\treturn y.Wrapf(err, \"Error while reading IV for key registry.\")\n\t}\n\teSanityText := make([]byte, len(sanityText))\n\tif _, err = fp.Read(eSanityText); err != nil {\n\t\treturn y.Wrapf(err, \"Error while reading sanity text.\")\n\t}\n\tif len(encryptionKey) > 0 {\n\t\t// Decrypting sanity text.\n\t\tif eSanityText, err = y.XORBlockAllocate(eSanityText, encryptionKey, iv); err != nil {\n\t\t\treturn y.Wrapf(err, \"During validRegistry\")\n\t\t}\n\t}\n\t// Check the given key is valid or not.\n\tif !bytes.Equal(eSanityText, sanityText) {\n\t\treturn ErrEncryptionKeyMismatch\n\t}\n\treturn nil\n}\n\nfunc (kri *keyRegistryIterator) next() (*pb.DataKey, error) {\n\tvar err error\n\t// Read crc buf and data length.\n\tif _, err = kri.fp.Read(kri.lenCrcBuf[:]); err != nil {\n\t\t// EOF means end of the iteration.\n\t\tif err != io.EOF {\n\t\t\treturn nil, y.Wrapf(err, \"While reading crc in keyRegistryIterator.next\")\n\t\t}\n\t\treturn nil, err\n\t}\n\tl := int64(binary.BigEndian.Uint32(kri.lenCrcBuf[0:4]))\n\t// Read protobuf data.\n\tdata := make([]byte, l)\n\tif _, err = kri.fp.Read(data); err != nil {\n\t\t// EOF means end of the iteration.\n\t\tif err != io.EOF {\n\t\t\treturn nil, y.Wrapf(err, \"While reading protobuf in keyRegistryIterator.next\")\n\t\t}\n\t\treturn nil, err\n\t}\n\t// Check checksum.\n\tif crc32.Checksum(data, y.CastagnoliCrcTable) != binary.BigEndian.Uint32(kri.lenCrcBuf[4:]) {\n\t\treturn nil, y.Wrapf(y.ErrChecksumMismatch, \"Error while checking checksum for data key.\")\n\t}\n\tdataKey := &pb.DataKey{}\n\tif err = proto.Unmarshal(data, dataKey); err != nil {\n\t\treturn nil, y.Wrapf(err, \"While 
unmarshal of datakey in keyRegistryIterator.next\")\n\t}\n\tif len(kri.encryptionKey) > 0 {\n\t\t// Decrypt the key if the storage key exists.\n\t\tif dataKey.Data, err = y.XORBlockAllocate(dataKey.Data, kri.encryptionKey, dataKey.Iv); err != nil {\n\t\t\treturn nil, y.Wrapf(err, \"While decrypting datakey in keyRegistryIterator.next\")\n\t\t}\n\t}\n\treturn dataKey, nil\n}\n\n// readKeyRegistry will read the key registry file and build the key registry struct.\nfunc readKeyRegistry(fp *os.File, opt KeyRegistryOptions) (*KeyRegistry, error) {\n\titr, err := newKeyRegistryIterator(fp, opt.EncryptionKey)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tkr := newKeyRegistry(opt)\n\tvar dk *pb.DataKey\n\tdk, err = itr.next()\n\tfor err == nil && dk != nil {\n\t\tif dk.KeyId > kr.nextKeyID {\n\t\t\t// Set the maximum key ID for next key ID generation.\n\t\t\tkr.nextKeyID = dk.KeyId\n\t\t}\n\t\tif dk.CreatedAt > kr.lastCreated {\n\t\t\t// Set the last generated key timestamp.\n\t\t\tkr.lastCreated = dk.CreatedAt\n\t\t}\n\t\t// No need to lock since we are building the initial state.\n\t\tkr.dataKeys[dk.KeyId] = dk\n\t\t// Forward the iterator.\n\t\tdk, err = itr.next()\n\t}\n\t// We read all the key. So, Ignoring this error.\n\tif err == io.EOF {\n\t\terr = nil\n\t}\n\treturn kr, err\n}\n\n/*\nStructure of Key Registry.\n+-------------------+---------------------+--------------------+--------------+------------------+\n|     IV            | Sanity Text         | DataKey1           | DataKey2     | ...              |\n+-------------------+---------------------+--------------------+--------------+------------------+\n*/\n\n// WriteKeyRegistry will rewrite the existing key registry file with new one.\n// It is okay to give closed key registry. 
Since, it's using only the datakey.\nfunc WriteKeyRegistry(reg *KeyRegistry, opt KeyRegistryOptions) error {\n\tbuf := &bytes.Buffer{}\n\tiv, err := y.GenerateIV()\n\ty.Check(err)\n\t// Encrypt sanity text if the encryption key is present.\n\teSanity := sanityText\n\tif len(opt.EncryptionKey) > 0 {\n\t\tvar err error\n\t\teSanity, err = y.XORBlockAllocate(eSanity, opt.EncryptionKey, iv)\n\t\tif err != nil {\n\t\t\treturn y.Wrapf(err, \"Error while encrypting sanity text in WriteKeyRegistry\")\n\t\t}\n\t}\n\ty.Check2(buf.Write(iv))\n\ty.Check2(buf.Write(eSanity))\n\t// Write all the datakeys to the buf.\n\tfor _, k := range reg.dataKeys {\n\t\t// Writing the datakey to the given buffer.\n\t\tif err := storeDataKey(buf, opt.EncryptionKey, k); err != nil {\n\t\t\treturn y.Wrapf(err, \"Error while storing datakey in WriteKeyRegistry\")\n\t\t}\n\t}\n\ttmpPath := filepath.Join(opt.Dir, KeyRegistryRewriteFileName)\n\t// Open temporary file to write the data and do atomic rename.\n\tfp, err := y.OpenTruncFile(tmpPath, true)\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"Error while opening tmp file in WriteKeyRegistry\")\n\t}\n\t// Write buf to the disk.\n\tif _, err = fp.Write(buf.Bytes()); err != nil {\n\t\t// close the fd before returning error. 
We're not using defer\n\t\t// because on Windows the fd must be closed explicitly before\n\t\t// renaming.\n\t\tfp.Close()\n\t\treturn y.Wrapf(err, \"Error while writing buf in WriteKeyRegistry\")\n\t}\n\t// On Windows, files must be closed before they are renamed.\n\tif err = fp.Close(); err != nil {\n\t\treturn y.Wrapf(err, \"Error while closing tmp file in WriteKeyRegistry\")\n\t}\n\t// Rename to the original file.\n\tif err = os.Rename(tmpPath, filepath.Join(opt.Dir, KeyRegistryFileName)); err != nil {\n\t\treturn y.Wrapf(err, \"Error while renaming file in WriteKeyRegistry\")\n\t}\n\t// Sync Dir.\n\treturn syncDir(opt.Dir)\n}\n\n// DataKey returns the datakey for the given key ID.\nfunc (kr *KeyRegistry) DataKey(id uint64) (*pb.DataKey, error) {\n\tkr.RLock()\n\tdefer kr.RUnlock()\n\tif id == 0 {\n\t\t// A nil datakey represents plain text.\n\t\treturn nil, nil\n\t}\n\tdk, ok := kr.dataKeys[id]\n\tif !ok {\n\t\treturn nil, y.Wrapf(ErrInvalidDataKeyID, \"Error for the KEY ID %d\", id)\n\t}\n\treturn dk, nil\n}\n\n// LatestDataKey returns the most recently generated datakey, subject to the rotation\n// period. 
If the lifetime of the last generated datakey exceeds the rotation period,\n// a new datakey is created.\nfunc (kr *KeyRegistry) LatestDataKey() (*pb.DataKey, error) {\n\tif len(kr.opt.EncryptionKey) == 0 {\n\t\t// nil is for no encryption.\n\t\treturn nil, nil\n\t}\n\t// validKey returns the latest datakey if its age is less than the\n\t// rotation duration.\n\tvalidKey := func() (*pb.DataKey, bool) {\n\t\t// Time elapsed since the last key was generated.\n\t\tdiff := time.Since(time.Unix(kr.lastCreated, 0))\n\t\tif diff < kr.opt.EncryptionKeyRotationDuration {\n\t\t\treturn kr.dataKeys[kr.nextKeyID], true\n\t\t}\n\t\treturn nil, false\n\t}\n\tkr.RLock()\n\tkey, valid := validKey()\n\tkr.RUnlock()\n\tif valid {\n\t\t// The key is younger than EncryptionKeyRotationDuration, so return it.\n\t\treturn key, nil\n\t}\n\tkr.Lock()\n\tdefer kr.Unlock()\n\t// The key might have been generated by another goroutine, so\n\t// check once again.\n\tkey, valid = validKey()\n\tif valid {\n\t\treturn key, nil\n\t}\n\tk := make([]byte, len(kr.opt.EncryptionKey))\n\tiv, err := y.GenerateIV()\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t_, err = rand.Read(k)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t// Otherwise, increment the key ID and generate a new datakey.\n\tkr.nextKeyID++\n\tdk := &pb.DataKey{\n\t\tKeyId:     kr.nextKeyID,\n\t\tData:      k,\n\t\tCreatedAt: time.Now().Unix(),\n\t\tIv:        iv,\n\t}\n\t// Don't store the datakey on file if badger is running in InMemory mode.\n\tif !kr.opt.InMemory {\n\t\t// Store the datakey.\n\t\tbuf := &bytes.Buffer{}\n\t\tif err = storeDataKey(buf, kr.opt.EncryptionKey, dk); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\t// Persist the datakey to the disk\n\t\tif _, err = kr.fp.Write(buf.Bytes()); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\t// storeDataKey encrypts the datakey, so place the unencrypted key in memory.\n\tdk.Data = k\n\tkr.lastCreated = dk.CreatedAt\n\tkr.dataKeys[kr.nextKeyID] = dk\n\treturn dk, 
nil\n}\n\n// Close closes the key registry.\nfunc (kr *KeyRegistry) Close() error {\n\tif !(kr.opt.ReadOnly || kr.opt.InMemory) {\n\t\treturn kr.fp.Close()\n\t}\n\treturn nil\n}\n\n// storeDataKey stores the datakey in the given buffer, encrypted if a storage key is present.\nfunc storeDataKey(buf *bytes.Buffer, storageKey []byte, k *pb.DataKey) error {\n\t// xor encrypts the IV and XORs the result with the given data.\n\t// It is used for both encryption and decryption.\n\txor := func() error {\n\t\tif len(storageKey) == 0 {\n\t\t\treturn nil\n\t\t}\n\t\tvar err error\n\t\tk.Data, err = y.XORBlockAllocate(k.Data, storageKey, k.Iv)\n\t\treturn err\n\t}\n\t// The in-memory datakey is plain text, so encrypt it before storing it to disk.\n\tvar err error\n\tif err = xor(); err != nil {\n\t\treturn y.Wrapf(err, \"Error while encrypting datakey in storeDataKey\")\n\t}\n\tvar data []byte\n\tif data, err = proto.Marshal(k); err != nil {\n\t\terr = y.Wrapf(err, \"Error while marshaling datakey in storeDataKey\")\n\t\tvar err2 error\n\t\t// Decrypt the datakey back.\n\t\tif err2 = xor(); err2 != nil {\n\t\t\treturn y.Wrapf(err,\n\t\t\t\ty.Wrapf(err2, \"Error while decrypting datakey in storeDataKey\").Error())\n\t\t}\n\t\treturn err\n\t}\n\tvar lenCrcBuf [8]byte\n\tbinary.BigEndian.PutUint32(lenCrcBuf[0:4], uint32(len(data)))\n\tbinary.BigEndian.PutUint32(lenCrcBuf[4:8], crc32.Checksum(data, y.CastagnoliCrcTable))\n\ty.Check2(buf.Write(lenCrcBuf[:]))\n\ty.Check2(buf.Write(data))\n\t// Decrypt the datakey back since we're working on the caller's pointer.\n\treturn xor()\n}\n"
  },
  {
    "path": "key_registry_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"math/rand\"\n\t\"os\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n)\n\nfunc getRegistryTestOptions(dir string, key []byte) KeyRegistryOptions {\n\treturn KeyRegistryOptions{\n\t\tDir:           dir,\n\t\tEncryptionKey: key,\n\t\tReadOnly:      false,\n\t}\n}\nfunc TestBuildRegistry(t *testing.T) {\n\tencryptionKey := make([]byte, 32)\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\t_, err = rand.Read(encryptionKey)\n\trequire.NoError(t, err)\n\topt := getRegistryTestOptions(dir, encryptionKey)\n\n\tkr, err := OpenKeyRegistry(opt)\n\trequire.NoError(t, err)\n\tdk, err := kr.LatestDataKey()\n\trequire.NoError(t, err)\n\t// We're resetting the last created timestamp. So, it creates\n\t// new datakey.\n\tkr.lastCreated = 0\n\tdk1, err := kr.LatestDataKey()\n\t// We generated two key. So, checking the length.\n\trequire.Equal(t, 2, len(kr.dataKeys))\n\trequire.NoError(t, err)\n\trequire.NoError(t, kr.Close())\n\n\tkr2, err := OpenKeyRegistry(opt)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 2, len(kr2.dataKeys))\n\t// Asserting the correctness of the datakey after opening the registry.\n\trequire.Equal(t, dk.Data, kr.dataKeys[dk.KeyId].Data)\n\trequire.Equal(t, dk1.Data, kr.dataKeys[dk1.KeyId].Data)\n\trequire.NoError(t, kr2.Close())\n}\n\nfunc TestRewriteRegistry(t *testing.T) {\n\tencryptionKey := make([]byte, 32)\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\t_, err = rand.Read(encryptionKey)\n\trequire.NoError(t, err)\n\topt := getRegistryTestOptions(dir, encryptionKey)\n\tkr, err := OpenKeyRegistry(opt)\n\trequire.NoError(t, err)\n\t_, err = kr.LatestDataKey()\n\trequire.NoError(t, err)\n\t// We're resetting the last created timestamp. 
So, it creates\n\t// new datakey.\n\tkr.lastCreated = 0\n\t_, err = kr.LatestDataKey()\n\trequire.NoError(t, err)\n\trequire.NoError(t, kr.Close())\n\tdelete(kr.dataKeys, 1)\n\trequire.NoError(t, WriteKeyRegistry(kr, opt))\n\tkr2, err := OpenKeyRegistry(opt)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 1, len(kr2.dataKeys))\n\trequire.NoError(t, kr2.Close())\n}\n\nfunc TestMismatch(t *testing.T) {\n\tencryptionKey := make([]byte, 32)\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\t_, err = rand.Read(encryptionKey)\n\trequire.NoError(t, err)\n\topt := getRegistryTestOptions(dir, encryptionKey)\n\tkr, err := OpenKeyRegistry(opt)\n\trequire.NoError(t, err)\n\trequire.NoError(t, kr.Close())\n\t// Opening with the same key and asserting.\n\tkr, err = OpenKeyRegistry(opt)\n\trequire.NoError(t, err)\n\trequire.NoError(t, kr.Close())\n\t// Opening with the invalid key and asserting.\n\tencryptionKey = make([]byte, 32)\n\t_, err = rand.Read(encryptionKey)\n\trequire.NoError(t, err)\n\topt.EncryptionKey = encryptionKey\n\t_, err = OpenKeyRegistry(opt)\n\trequire.Error(t, err)\n\trequire.EqualError(t, err, ErrEncryptionKeyMismatch.Error())\n}\n\nfunc TestEncryptionAndDecryption(t *testing.T) {\n\tencryptionKey := make([]byte, 32)\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\t_, err = rand.Read(encryptionKey)\n\trequire.NoError(t, err)\n\topt := getRegistryTestOptions(dir, encryptionKey)\n\tkr, err := OpenKeyRegistry(opt)\n\trequire.NoError(t, err)\n\tdk, err := kr.LatestDataKey()\n\trequire.NoError(t, err)\n\trequire.NoError(t, kr.Close())\n\t// Checking the correctness of the datakey after closing and\n\t// opening the key registry.\n\tkr, err = OpenKeyRegistry(opt)\n\trequire.NoError(t, err)\n\tdk1, err := kr.DataKey(dk.GetKeyId())\n\trequire.NoError(t, err)\n\trequire.Equal(t, dk.Data, dk1.Data)\n\trequire.NoError(t, kr.Close())\n}\n\nfunc 
TestKeyRegistryInMemory(t *testing.T) {\n\tencryptionKey := make([]byte, 32)\n\t_, err := rand.Read(encryptionKey)\n\trequire.NoError(t, err)\n\n\topt := getRegistryTestOptions(\"\", encryptionKey)\n\topt.InMemory = true\n\n\tkr, err := OpenKeyRegistry(opt)\n\trequire.NoError(t, err)\n\t_, err = kr.LatestDataKey()\n\trequire.NoError(t, err)\n\t// We're resetting the last created timestamp. So, it creates\n\t// new datakey.\n\tkr.lastCreated = 0\n\t_, err = kr.LatestDataKey()\n\t// We generated two key. So, checking the length.\n\trequire.Equal(t, 2, len(kr.dataKeys))\n\trequire.NoError(t, err)\n\trequire.NoError(t, kr.Close())\n}\n"
  },
  {
    "path": "level_handler.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"sort\"\n\t\"sync\"\n\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\ntype levelHandler struct {\n\t// Guards tables, totalSize.\n\tsync.RWMutex\n\n\t// For level >= 1, tables are sorted by key ranges, which do not overlap.\n\t// For level 0, tables are sorted by time.\n\t// For level 0, newest table are at the back. Compact the oldest one first, which is at the front.\n\ttables         []*table.Table\n\ttotalSize      int64\n\ttotalStaleSize int64\n\n\t// The following are initialized once and const.\n\tlevel    int\n\tstrLevel string\n\tdb       *DB\n}\n\nfunc (s *levelHandler) isLastLevel() bool {\n\treturn s.level == s.db.opt.MaxLevels-1\n}\n\nfunc (s *levelHandler) getTotalStaleSize() int64 {\n\ts.RLock()\n\tdefer s.RUnlock()\n\treturn s.totalStaleSize\n}\n\nfunc (s *levelHandler) getTotalSize() int64 {\n\ts.RLock()\n\tdefer s.RUnlock()\n\treturn s.totalSize\n}\n\n// initTables replaces s.tables with given tables. This is done during loading.\nfunc (s *levelHandler) initTables(tables []*table.Table) {\n\ts.Lock()\n\tdefer s.Unlock()\n\n\ts.tables = tables\n\ts.totalSize = 0\n\ts.totalStaleSize = 0\n\tfor _, t := range tables {\n\t\ts.addSize(t)\n\t}\n\n\tif s.level == 0 {\n\t\t// Key range will overlap. 
Just sort by fileID in ascending order\n\t\t// because newer tables are at the end of level 0.\n\t\tsort.Slice(s.tables, func(i, j int) bool {\n\t\t\treturn s.tables[i].ID() < s.tables[j].ID()\n\t\t})\n\t} else {\n\t\t// Sort tables by keys.\n\t\tsort.Slice(s.tables, func(i, j int) bool {\n\t\t\treturn y.CompareKeys(s.tables[i].Smallest(), s.tables[j].Smallest()) < 0\n\t\t})\n\t}\n}\n\n// deleteTables removes the given tables (matched by ID) from this level.\nfunc (s *levelHandler) deleteTables(toDel []*table.Table) error {\n\ts.Lock() // s.Unlock() below\n\n\ttoDelMap := make(map[uint64]struct{})\n\tfor _, t := range toDel {\n\t\ttoDelMap[t.ID()] = struct{}{}\n\t}\n\n\t// Make a copy as iterators might be keeping a slice of tables.\n\tvar newTables []*table.Table\n\tfor _, t := range s.tables {\n\t\t_, found := toDelMap[t.ID()]\n\t\tif !found {\n\t\t\tnewTables = append(newTables, t)\n\t\t\tcontinue\n\t\t}\n\t\ts.subtractSize(t)\n\t}\n\ts.tables = newTables\n\n\ts.Unlock() // Unlock s _before_ we DecrRef our tables, which can be slow.\n\n\treturn decrRefs(toDel)\n}\n\n// replaceTables removes the tables in toDel (matched by ID) from this level and adds the tables in toAdd.\n// You must call decr() to delete the old tables _after_ writing the update to the manifest.\nfunc (s *levelHandler) replaceTables(toDel, toAdd []*table.Table) error {\n\t// Need to re-search the range of tables in this level to be replaced as other goroutines might\n\t// be changing it as well.  
(They can't touch our tables, but if they add/remove other tables,\n\t// the indices get shifted around.)\n\ts.Lock() // We s.Unlock() below.\n\n\ttoDelMap := make(map[uint64]struct{})\n\tfor _, t := range toDel {\n\t\ttoDelMap[t.ID()] = struct{}{}\n\t}\n\tvar newTables []*table.Table\n\tfor _, t := range s.tables {\n\t\t_, found := toDelMap[t.ID()]\n\t\tif !found {\n\t\t\tnewTables = append(newTables, t)\n\t\t\tcontinue\n\t\t}\n\t\ts.subtractSize(t)\n\t}\n\n\t// Increase totalSize first.\n\tfor _, t := range toAdd {\n\t\ts.addSize(t)\n\t\tt.IncrRef()\n\t\tnewTables = append(newTables, t)\n\t}\n\n\t// Assign tables.\n\ts.tables = newTables\n\tsort.Slice(s.tables, func(i, j int) bool {\n\t\treturn y.CompareKeys(s.tables[i].Smallest(), s.tables[j].Smallest()) < 0\n\t})\n\ts.Unlock() // s.Unlock before we DecrRef tables -- that can be slow.\n\treturn decrRefs(toDel)\n}\n\n// addTable adds toAdd table to levelHandler. Normally when we add tables to levelHandler, we sort\n// tables based on table.Smallest. This is required for correctness of the system. But in case of\n// stream writer this can be avoided. 
We can just add tables to levelHandler's table list\n// and after all addTable calls, we can sort the table list (see the sortTables method).\n// NOTE: levelHandler.sortTables() should be called after all addTable calls are done.\nfunc (s *levelHandler) addTable(t *table.Table) {\n\ts.Lock()\n\tdefer s.Unlock()\n\n\ts.addSize(t) // Increase totalSize first.\n\tt.IncrRef()\n\ts.tables = append(s.tables, t)\n}\n\n// sortTables sorts tables of levelHandler based on table.Smallest.\n// Normally it should be called after all addTable calls.\nfunc (s *levelHandler) sortTables() {\n\ts.Lock()\n\tdefer s.Unlock()\n\n\tsort.Slice(s.tables, func(i, j int) bool {\n\t\treturn y.CompareKeys(s.tables[i].Smallest(), s.tables[j].Smallest()) < 0\n\t})\n}\n\nfunc decrRefs(tables []*table.Table) error {\n\tfor _, table := range tables {\n\t\tif err := table.DecrRef(); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\treturn nil\n}\n\nfunc newLevelHandler(db *DB, level int) *levelHandler {\n\treturn &levelHandler{\n\t\tlevel:    level,\n\t\tstrLevel: fmt.Sprintf(\"l%d\", level),\n\t\tdb:       db,\n\t}\n}\n\n// tryAddLevel0Table returns true if ok and no stalling.\nfunc (s *levelHandler) tryAddLevel0Table(t *table.Table) bool {\n\ty.AssertTrue(s.level == 0)\n\t// Need lock as we may be deleting the first table during a level 0 compaction.\n\ts.Lock()\n\tdefer s.Unlock()\n\t// Stall (by returning false) if we are above the specified stall setting for L0.\n\tif len(s.tables) >= s.db.opt.NumLevelZeroTablesStall {\n\t\treturn false\n\t}\n\n\ts.tables = append(s.tables, t)\n\tt.IncrRef()\n\ts.addSize(t)\n\n\treturn true\n}\n\n// This should be called while holding the lock on the level.\nfunc (s *levelHandler) addSize(t *table.Table) {\n\ts.totalSize += t.Size()\n\ts.totalStaleSize += int64(t.StaleDataSize())\n}\n\n// This should be called while holding the lock on the level.\nfunc (s *levelHandler) subtractSize(t *table.Table) {\n\ts.totalSize -= t.Size()\n\ts.totalStaleSize -= 
int64(t.StaleDataSize())\n}\nfunc (s *levelHandler) numTables() int {\n\ts.RLock()\n\tdefer s.RUnlock()\n\treturn len(s.tables)\n}\n\nfunc (s *levelHandler) close() error {\n\ts.RLock()\n\tdefer s.RUnlock()\n\tvar err error\n\tfor _, t := range s.tables {\n\t\tif closeErr := t.Close(-1); closeErr != nil && err == nil {\n\t\t\terr = closeErr\n\t\t}\n\t}\n\treturn y.Wrap(err, \"levelHandler.close\")\n}\n\n// getTableForKey acquires a read-lock to access s.tables. It returns a list of tableHandlers.\nfunc (s *levelHandler) getTableForKey(key []byte) ([]*table.Table, func() error) {\n\ts.RLock()\n\tdefer s.RUnlock()\n\n\tif s.level == 0 {\n\t\t// For level 0, we need to check every table. Remember to make a copy as s.tables may change\n\t\t// once we exit this function, and we don't want to lock s.tables while seeking in tables.\n\t\t// CAUTION: Reverse the tables.\n\t\tout := make([]*table.Table, 0, len(s.tables))\n\t\tfor i := len(s.tables) - 1; i >= 0; i-- {\n\t\t\tout = append(out, s.tables[i])\n\t\t\ts.tables[i].IncrRef()\n\t\t}\n\t\treturn out, func() error {\n\t\t\tfor _, t := range out {\n\t\t\t\tif err := t.DecrRef(); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t}\n\t\t\treturn nil\n\t\t}\n\t}\n\t// For level >= 1, we can do a binary search as key range does not overlap.\n\tidx := sort.Search(len(s.tables), func(i int) bool {\n\t\treturn y.CompareKeys(s.tables[i].Biggest(), key) >= 0\n\t})\n\tif idx >= len(s.tables) {\n\t\t// Given key is strictly > than every element we have.\n\t\treturn nil, func() error { return nil }\n\t}\n\ttbl := s.tables[idx]\n\ttbl.IncrRef()\n\treturn []*table.Table{tbl}, tbl.DecrRef\n}\n\n// get returns value for a given key or the key after that. 
If not found, return nil.\nfunc (s *levelHandler) get(key []byte) (y.ValueStruct, error) {\n\ttables, decr := s.getTableForKey(key)\n\tkeyNoTs := y.ParseKey(key)\n\n\thash := y.Hash(keyNoTs)\n\tvar maxVs y.ValueStruct\n\tfor _, th := range tables {\n\t\tif th.DoesNotHave(hash) {\n\t\t\ty.NumLSMBloomHitsAdd(s.db.opt.MetricsEnabled, s.strLevel, 1)\n\t\t\tcontinue\n\t\t}\n\n\t\tit := th.NewIterator(0)\n\t\tdefer it.Close()\n\n\t\ty.NumLSMGetsAdd(s.db.opt.MetricsEnabled, s.strLevel, 1)\n\t\tit.Seek(key)\n\t\tif !it.Valid() {\n\t\t\tcontinue\n\t\t}\n\t\tif y.SameKey(key, it.Key()) {\n\t\t\tif version := y.ParseTs(it.Key()); maxVs.Version < version {\n\t\t\t\tmaxVs = it.ValueCopy()\n\t\t\t\tmaxVs.Version = version\n\t\t\t}\n\t\t}\n\t}\n\treturn maxVs, decr()\n}\n\n// appendIterators appends iterators to an array of iterators, for merging.\n// Note: This obtains references for the table handlers. Remember to close these iterators.\nfunc (s *levelHandler) appendIterators(iters []y.Iterator, opt *IteratorOptions) []y.Iterator {\n\ts.RLock()\n\tdefer s.RUnlock()\n\n\tvar topt int\n\tif opt.Reverse {\n\t\ttopt = table.REVERSED\n\t}\n\tif s.level == 0 {\n\t\t// Remember to add in reverse order!\n\t\t// The newer table at the end of s.tables should be added first as it takes precedence.\n\t\t// Level 0 tables are not in key sorted order, so we need to consider them one by one.\n\t\tvar out []*table.Table\n\t\tfor _, t := range s.tables {\n\t\t\tif opt.pickTable(t) {\n\t\t\t\tout = append(out, t)\n\t\t\t}\n\t\t}\n\t\treturn appendIteratorsReversed(iters, out, topt)\n\t}\n\n\ttables := opt.pickTables(s.tables)\n\tif len(tables) == 0 {\n\t\treturn iters\n\t}\n\treturn append(iters, table.NewConcatIterator(tables, topt))\n}\n\ntype levelHandlerRLocked struct{}\n\n// overlappingTables returns the tables that intersect with key range. 
Returns a half-open interval [left, right).\n// The caller must already hold a read lock; this is important enough that the caller must\n// pass an empty marker parameter declaring so.\nfunc (s *levelHandler) overlappingTables(_ levelHandlerRLocked, kr keyRange) (int, int) {\n\tif len(kr.left) == 0 || len(kr.right) == 0 {\n\t\treturn 0, 0\n\t}\n\tleft := sort.Search(len(s.tables), func(i int) bool {\n\t\treturn y.CompareKeys(kr.left, s.tables[i].Biggest()) <= 0\n\t})\n\tright := sort.Search(len(s.tables), func(i int) bool {\n\t\treturn y.CompareKeys(kr.right, s.tables[i].Smallest()) < 0\n\t})\n\treturn left, right\n}\n"
  },
  {
    "path": "levels.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"context\"\n\t\"encoding/hex\"\n\t\"errors\"\n\t\"fmt\"\n\t\"math\"\n\t\"math/rand\"\n\t\"os\"\n\t\"sort\"\n\t\"strings\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"time\"\n\n\t\"go.opentelemetry.io/otel\"\n\t\"go.opentelemetry.io/otel/attribute\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\ntype levelsController struct {\n\tnextFileID atomic.Uint64\n\tl0stallsMs atomic.Int64\n\n\t// The following are initialized once and const.\n\tlevels []*levelHandler\n\tkv     *DB\n\n\tcstatus compactStatus\n}\n\n// revertToManifest checks that all necessary table files exist and removes all table files not\n// referenced by the manifest. idMap is a set of table file id's that were read from the directory\n// listing.\nfunc revertToManifest(kv *DB, mf *Manifest, idMap map[uint64]struct{}) error {\n\t// 1. Check all files in manifest exist.\n\tfor id := range mf.Tables {\n\t\tif _, ok := idMap[id]; !ok {\n\t\t\treturn fmt.Errorf(\"file does not exist for table %d\", id)\n\t\t}\n\t}\n\n\t// 2. 
Delete files that shouldn't exist.\n\tfor id := range idMap {\n\t\tif _, ok := mf.Tables[id]; !ok {\n\t\t\tkv.opt.Debugf(\"Table file %d not referenced in MANIFEST\\n\", id)\n\t\t\tfilename := table.NewFilename(id, kv.opt.Dir)\n\t\t\tif err := os.Remove(filename); err != nil {\n\t\t\t\treturn y.Wrapf(err, \"While removing table %d\", id)\n\t\t\t}\n\t\t}\n\t}\n\n\treturn nil\n}\n\nfunc newLevelsController(db *DB, mf *Manifest) (*levelsController, error) {\n\ty.AssertTrue(db.opt.NumLevelZeroTablesStall > db.opt.NumLevelZeroTables)\n\ts := &levelsController{\n\t\tkv:     db,\n\t\tlevels: make([]*levelHandler, db.opt.MaxLevels),\n\t}\n\ts.cstatus.tables = make(map[uint64]struct{})\n\ts.cstatus.levels = make([]*levelCompactStatus, db.opt.MaxLevels)\n\n\tfor i := 0; i < db.opt.MaxLevels; i++ {\n\t\ts.levels[i] = newLevelHandler(db, i)\n\t\ts.cstatus.levels[i] = new(levelCompactStatus)\n\t}\n\n\tif db.opt.InMemory {\n\t\treturn s, nil\n\t}\n\t// Compare manifest against directory, check for existent/non-existent files, and remove.\n\tif err := revertToManifest(db, mf, getIDMap(db.opt.Dir)); err != nil {\n\t\treturn nil, err\n\t}\n\n\tvar mu sync.Mutex\n\ttables := make([][]*table.Table, db.opt.MaxLevels)\n\tvar maxFileID uint64\n\n\t// We found that using 3 goroutines allows disk throughput to be utilized to its max.\n\t// Disk utilization is the main thing we should focus on, while trying to read the data. 
That's\n\t// the one factor that remains constant between HDD and SSD.\n\tthrottle := y.NewThrottle(3)\n\n\tstart := time.Now()\n\tvar numOpened atomic.Int32\n\ttick := time.NewTicker(3 * time.Second)\n\tdefer tick.Stop()\n\n\tfor fileID, tf := range mf.Tables {\n\t\tfname := table.NewFilename(fileID, db.opt.Dir)\n\t\tselect {\n\t\tcase <-tick.C:\n\t\t\tdb.opt.Infof(\"%d tables out of %d opened in %s\\n\", numOpened.Load(),\n\t\t\t\tlen(mf.Tables), time.Since(start).Round(time.Millisecond))\n\t\tdefault:\n\t\t}\n\t\tif err := throttle.Do(); err != nil {\n\t\t\tcloseAllTables(tables)\n\t\t\treturn nil, err\n\t\t}\n\t\tif fileID > maxFileID {\n\t\t\tmaxFileID = fileID\n\t\t}\n\t\tgo func(fname string, tf TableManifest) {\n\t\t\tvar rerr error\n\t\t\tdefer func() {\n\t\t\t\tthrottle.Done(rerr)\n\t\t\t\tnumOpened.Add(1)\n\t\t\t}()\n\t\t\tdk, err := db.registry.DataKey(tf.KeyID)\n\t\t\tif err != nil {\n\t\t\t\trerr = y.Wrapf(err, \"Error while reading datakey\")\n\t\t\t\treturn\n\t\t\t}\n\t\t\ttopt := buildTableOptions(db)\n\t\t\t// Explicitly set Compression and DataKey based on how the table was generated.\n\t\t\ttopt.Compression = tf.Compression\n\t\t\ttopt.DataKey = dk\n\n\t\t\tmf, err := z.OpenMmapFile(fname, db.opt.getFileFlags(), 0)\n\t\t\tif err != nil {\n\t\t\t\trerr = y.Wrapf(err, \"Opening file: %q\", fname)\n\t\t\t\treturn\n\t\t\t}\n\t\t\tt, err := table.OpenTable(mf, topt)\n\t\t\tif err != nil {\n\t\t\t\tif strings.HasPrefix(err.Error(), \"CHECKSUM_MISMATCH:\") {\n\t\t\t\t\tdb.opt.Errorf(err.Error())\n\t\t\t\t\tdb.opt.Errorf(\"Ignoring table %s\", mf.Fd.Name())\n\t\t\t\t\t// Do not set rerr. 
We will continue without this table.\n\t\t\t\t} else {\n\t\t\t\t\trerr = y.Wrapf(err, \"Opening table: %q\", fname)\n\t\t\t\t}\n\t\t\t\treturn\n\t\t\t}\n\n\t\t\tmu.Lock()\n\t\t\ttables[tf.Level] = append(tables[tf.Level], t)\n\t\t\tmu.Unlock()\n\t\t}(fname, tf)\n\t}\n\tif err := throttle.Finish(); err != nil {\n\t\tcloseAllTables(tables)\n\t\treturn nil, err\n\t}\n\tdb.opt.Infof(\"All %d tables opened in %s\\n\", numOpened.Load(),\n\t\ttime.Since(start).Round(time.Millisecond))\n\ts.nextFileID.Store(maxFileID + 1)\n\tfor i, tbls := range tables {\n\t\ts.levels[i].initTables(tbls)\n\t}\n\n\t// Make sure key ranges do not overlap etc.\n\tif err := s.validate(); err != nil {\n\t\t_ = s.cleanupLevels()\n\t\treturn nil, y.Wrap(err, \"Level validation\")\n\t}\n\n\t// Sync directory (because we have at least removed some files, or previously created the\n\t// manifest file).\n\tif err := syncDir(db.opt.Dir); err != nil {\n\t\t_ = s.close()\n\t\treturn nil, err\n\t}\n\n\treturn s, nil\n}\n\n// Closes the tables, for cleanup in newLevelsController.  (We Close() instead of using DecrRef()\n// because that would delete the underlying files.)  
We ignore errors, which is OK because tables\n// are read-only.\nfunc closeAllTables(tables [][]*table.Table) {\n\tfor _, tableSlice := range tables {\n\t\tfor _, table := range tableSlice {\n\t\t\t_ = table.Close(-1)\n\t\t}\n\t}\n}\n\nfunc (s *levelsController) cleanupLevels() error {\n\tvar firstErr error\n\tfor _, l := range s.levels {\n\t\tif err := l.close(); err != nil && firstErr == nil {\n\t\t\tfirstErr = err\n\t\t}\n\t}\n\treturn firstErr\n}\n\n// dropTree picks all tables from all levels, creates a manifest changeset,\n// applies it, and then decrements the refs of these tables, which would result\n// in their deletion.\nfunc (s *levelsController) dropTree() (int, error) {\n\t// First pick all tables, so we can create a manifest changelog.\n\tvar all []*table.Table\n\tfor _, l := range s.levels {\n\t\tl.RLock()\n\t\tall = append(all, l.tables...)\n\t\tl.RUnlock()\n\t}\n\tif len(all) == 0 {\n\t\treturn 0, nil\n\t}\n\n\t// Generate the manifest changes.\n\tchanges := []*pb.ManifestChange{}\n\tfor _, table := range all {\n\t\t// Add a delete change only if the table is not in memory.\n\t\tif !table.IsInmemory {\n\t\t\tchanges = append(changes, newDeleteChange(table.ID()))\n\t\t}\n\t}\n\tchangeSet := pb.ManifestChangeSet{Changes: changes}\n\tif err := s.kv.manifest.addChanges(changeSet.Changes, s.kv.opt); err != nil {\n\t\treturn 0, err\n\t}\n\n\t// Now that manifest has been successfully written, we can delete the tables.\n\tfor _, l := range s.levels {\n\t\tl.Lock()\n\t\tl.totalSize = 0\n\t\tl.tables = l.tables[:0]\n\t\tl.Unlock()\n\t}\n\tfor _, table := range all {\n\t\tif err := table.DecrRef(); err != nil {\n\t\t\treturn 0, err\n\t\t}\n\t}\n\treturn len(all), nil\n}\n\n// dropPrefix runs a L0->L1 compaction, and then runs same level compaction on the rest of the\n// levels. 
For L0->L1 compaction, it runs compactions normally, but skips over\n// all the keys with the provided prefix.\n// For Li->Li compactions, it picks up the tables which would have the prefix. The\n// tables who only have keys with this prefix are quickly dropped. The ones which have other keys\n// are run through MergeIterator and compacted to create new tables. All the mechanisms of\n// compactions apply, i.e. level sizes and MANIFEST are updated as in the normal flow.\nfunc (s *levelsController) dropPrefixes(prefixes [][]byte) error {\n\topt := s.kv.opt\n\t// Iterate levels in the reverse order because if we were to iterate from\n\t// lower level (say level 0) to a higher level (say level 3) we could have\n\t// a state in which level 0 is compacted and an older version of a key exists in lower level.\n\t// At this point, if someone creates an iterator, they would see an old\n\t// value for a key from lower levels. Iterating in reverse order ensures we\n\t// drop the oldest data first so that lookups never return stale data.\n\tfor i := len(s.levels) - 1; i >= 0; i-- {\n\t\tl := s.levels[i]\n\n\t\tl.RLock()\n\t\tif l.level == 0 {\n\t\t\tsize := len(l.tables)\n\t\t\tl.RUnlock()\n\n\t\t\tif size > 0 {\n\t\t\t\tcp := compactionPriority{\n\t\t\t\t\tlevel: 0,\n\t\t\t\t\tscore: 1.74,\n\t\t\t\t\t// A unique number greater than 1.0 does two things. Helps identify this\n\t\t\t\t\t// function in logs, and forces a compaction.\n\t\t\t\t\tdropPrefixes: prefixes,\n\t\t\t\t}\n\t\t\t\tif err := s.doCompact(174, cp); err != nil {\n\t\t\t\t\topt.Warningf(\"While compacting level 0: %v\", err)\n\t\t\t\t\treturn nil\n\t\t\t\t}\n\t\t\t}\n\t\t\tcontinue\n\t\t}\n\n\t\t// Build a list of compaction tableGroups affecting all the prefixes we\n\t\t// need to drop. 
We need to build tableGroups that satisfy the invariant that\n\t\t// bottom tables are consecutive.\n\t\t// tableGroup contains groups of consecutive tables.\n\t\tvar tableGroups [][]*table.Table\n\t\tvar tableGroup []*table.Table\n\n\t\tfinishGroup := func() {\n\t\t\tif len(tableGroup) > 0 {\n\t\t\t\ttableGroups = append(tableGroups, tableGroup)\n\t\t\t\ttableGroup = nil\n\t\t\t}\n\t\t}\n\n\t\tfor _, table := range l.tables {\n\t\t\tif containsAnyPrefixes(table, prefixes) {\n\t\t\t\ttableGroup = append(tableGroup, table)\n\t\t\t} else {\n\t\t\t\tfinishGroup()\n\t\t\t}\n\t\t}\n\t\tfinishGroup()\n\n\t\tl.RUnlock()\n\n\t\tif len(tableGroups) == 0 {\n\t\t\tcontinue\n\t\t}\n\t\topt.Infof(\"Dropping prefix at level %d (%d tableGroups)\", l.level, len(tableGroups))\n\t\tfor _, operation := range tableGroups {\n\t\t\tcd := compactDef{\n\t\t\t\tthisLevel:    l,\n\t\t\t\tnextLevel:    l,\n\t\t\t\ttop:          nil,\n\t\t\t\tbot:          operation,\n\t\t\t\tdropPrefixes: prefixes,\n\t\t\t\tt:            s.levelTargets(),\n\t\t\t}\n\t\t\t_, span := otel.Tracer(\"\").Start(context.TODO(), \"Badger.Compaction\")\n\t\t\tspan.SetAttributes(attribute.Int(\"Compaction level\", l.level))\n\t\t\tspan.SetAttributes(attribute.String(\"Drop Prefixes\", fmt.Sprintf(\"%v\", prefixes)))\n\t\t\tcd.t.baseLevel = l.level\n\t\t\tif err := s.runCompactDef(-1, l.level, cd); err != nil {\n\t\t\t\topt.Warningf(\"While running compact def: %+v. 
Error: %v\", cd, err)\n\t\t\t\tspan.End()\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tspan.SetAttributes(\n\t\t\t\tattribute.Int(\"Top tables count\", len(cd.top)),\n\t\t\t\tattribute.Int(\"Bottom tables count\", len(cd.bot)))\n\t\t\tspan.End()\n\t\t}\n\n\t}\n\treturn nil\n}\n\nfunc (s *levelsController) startCompact(lc *z.Closer) {\n\tn := s.kv.opt.NumCompactors\n\tlc.AddRunning(n - 1)\n\tfor i := 0; i < n; i++ {\n\t\tgo s.runCompactor(i, lc)\n\t}\n}\n\ntype targets struct {\n\tbaseLevel int\n\ttargetSz  []int64\n\tfileSz    []int64\n}\n\n// levelTargets calculates the targets for levels in the LSM tree. The idea comes from Dynamic Level\n// Sizes ( https://rocksdb.org/blog/2015/07/23/dynamic-level.html ) in RocksDB. The sizes of levels\n// are calculated based on the size of the lowest level, typically L6. So, if L6 size is 1GB, then\n// L5 target size is 100MB, L4 target size is 10MB and so on.\n//\n// L0 files don't automatically go to L1. Instead, they get compacted to Lbase, where Lbase is\n// chosen based on the first level which is non-empty from top (check L1 through L6). For an empty\n// DB, that would be L6.  So, L0 compactions go to L6, then L5, L4 and so on.\n//\n// Lbase is advanced to the upper levels when its target size exceeds BaseLevelSize. For\n// example, when L6 reaches 1.1GB, then L4 target size becomes 11MB, thus exceeding the\n// BaseLevelSize of 10MB. 
L3 would then become the new Lbase, with a target size of 1MB <\n// BaseLevelSize.\nfunc (s *levelsController) levelTargets() targets {\n\tadjust := func(sz int64) int64 {\n\t\tif sz < s.kv.opt.BaseLevelSize {\n\t\t\treturn s.kv.opt.BaseLevelSize\n\t\t}\n\t\treturn sz\n\t}\n\n\tt := targets{\n\t\ttargetSz: make([]int64, len(s.levels)),\n\t\tfileSz:   make([]int64, len(s.levels)),\n\t}\n\t// DB size is the size of the last level.\n\tdbSize := s.lastLevel().getTotalSize()\n\tfor i := len(s.levels) - 1; i > 0; i-- {\n\t\tltarget := adjust(dbSize)\n\t\tt.targetSz[i] = ltarget\n\t\tif t.baseLevel == 0 && ltarget <= s.kv.opt.BaseLevelSize {\n\t\t\tt.baseLevel = i\n\t\t}\n\t\tdbSize /= int64(s.kv.opt.LevelSizeMultiplier)\n\t}\n\n\ttsz := s.kv.opt.BaseTableSize\n\tfor i := 0; i < len(s.levels); i++ {\n\t\tif i == 0 {\n\t\t\t// Use MemTableSize for Level 0. Because at Level 0, we stop compactions based on the\n\t\t\t// number of tables, not the size of the level. So, having a 1:1 size ratio between\n\t\t\t// memtable size and the size of L0 files is better than churning out 32 files per\n\t\t\t// memtable (assuming 64MB MemTableSize and 2MB BaseTableSize).\n\t\t\tt.fileSz[i] = s.kv.opt.MemTableSize\n\t\t} else if i <= t.baseLevel {\n\t\t\tt.fileSz[i] = tsz\n\t\t} else {\n\t\t\ttsz *= int64(s.kv.opt.TableSizeMultiplier)\n\t\t\tt.fileSz[i] = tsz\n\t\t}\n\t}\n\n\t// Bring the base level down to the last empty level.\n\tfor i := t.baseLevel + 1; i < len(s.levels)-1; i++ {\n\t\tif s.levels[i].getTotalSize() > 0 {\n\t\t\tbreak\n\t\t}\n\t\tt.baseLevel = i\n\t}\n\n\t// If the base level is empty and the next level size is less than the\n\t// target size, pick the next level as the base level.\n\tb := t.baseLevel\n\tlvl := s.levels\n\tif b < len(lvl)-1 && lvl[b].getTotalSize() == 0 && lvl[b+1].getTotalSize() < t.targetSz[b+1] {\n\t\tt.baseLevel++\n\t}\n\treturn t\n}\n\nfunc (s *levelsController) runCompactor(id int, lc *z.Closer) {\n\tdefer lc.Done()\n\n\trandomDelay := 
time.NewTimer(time.Duration(rand.Int31n(1000)) * time.Millisecond)\n\tselect {\n\tcase <-randomDelay.C:\n\tcase <-lc.HasBeenClosed():\n\t\trandomDelay.Stop()\n\t\treturn\n\t}\n\n\tmoveL0toFront := func(prios []compactionPriority) []compactionPriority {\n\t\tidx := -1\n\t\tfor i, p := range prios {\n\t\t\tif p.level == 0 {\n\t\t\t\tidx = i\n\t\t\t\tbreak\n\t\t\t}\n\t\t}\n\t\t// If idx == -1, we didn't find L0.\n\t\t// If idx == 0, then we don't need to do anything. L0 is already at the front.\n\t\tif idx > 0 {\n\t\t\tout := append([]compactionPriority{}, prios[idx])\n\t\t\tout = append(out, prios[:idx]...)\n\t\t\tout = append(out, prios[idx+1:]...)\n\t\t\treturn out\n\t\t}\n\t\treturn prios\n\t}\n\n\trun := func(p compactionPriority) bool {\n\t\terr := s.doCompact(id, p)\n\t\tswitch err {\n\t\tcase nil:\n\t\t\treturn true\n\t\tcase errFillTables:\n\t\t\t// pass\n\t\tdefault:\n\t\t\ts.kv.opt.Warningf(\"While running doCompact: %v\\n\", err)\n\t\t}\n\t\treturn false\n\t}\n\n\tvar priosBuffer []compactionPriority\n\trunOnce := func() bool {\n\t\tprios := s.pickCompactLevels(priosBuffer)\n\t\tdefer func() {\n\t\t\tpriosBuffer = prios\n\t\t}()\n\t\tif id == 0 {\n\t\t\t// Worker ID zero prefers to compact L0 always.\n\t\t\tprios = moveL0toFront(prios)\n\t\t}\n\t\tfor _, p := range prios {\n\t\t\tif id == 0 && p.level == 0 {\n\t\t\t\t// Allow worker zero to run level 0, irrespective of its adjusted score.\n\t\t\t} else if p.adjusted < 1.0 {\n\t\t\t\tbreak\n\t\t\t}\n\t\t\tif run(p) {\n\t\t\t\treturn true\n\t\t\t}\n\t\t}\n\n\t\treturn false\n\t}\n\n\ttryLmaxToLmaxCompaction := func() {\n\t\tp := compactionPriority{\n\t\t\tlevel: s.lastLevel().level,\n\t\t\tt:     s.levelTargets(),\n\t\t}\n\t\trun(p)\n\n\t}\n\tcount := 0\n\tticker := time.NewTicker(50 * time.Millisecond)\n\tdefer ticker.Stop()\n\tfor {\n\t\tselect {\n\t\t// Can add a done channel or other stuff.\n\t\tcase <-ticker.C:\n\t\t\tcount++\n\t\t\t// Each ticker is 50ms so 50*200=10seconds.\n\t\t\tif 
s.kv.opt.LmaxCompaction && id == 2 && count >= 200 {\n\t\t\t\ttryLmaxToLmaxCompaction()\n\t\t\t\tcount = 0\n\t\t\t} else {\n\t\t\t\trunOnce()\n\t\t\t}\n\t\tcase <-lc.HasBeenClosed():\n\t\t\treturn\n\t\t}\n\t}\n}\n\ntype compactionPriority struct {\n\tlevel        int\n\tscore        float64\n\tadjusted     float64\n\tdropPrefixes [][]byte\n\tt            targets\n}\n\nfunc (s *levelsController) lastLevel() *levelHandler {\n\treturn s.levels[len(s.levels)-1]\n}\n\n// pickCompactLevel determines which level to compact.\n// Based on: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction\n// It tries to reuse priosBuffer to reduce memory allocation,\n// passing nil is acceptable, then new memory will be allocated.\nfunc (s *levelsController) pickCompactLevels(priosBuffer []compactionPriority) (prios []compactionPriority) {\n\tt := s.levelTargets()\n\taddPriority := func(level int, score float64) {\n\t\tpri := compactionPriority{\n\t\t\tlevel:    level,\n\t\t\tscore:    score,\n\t\t\tadjusted: score,\n\t\t\tt:        t,\n\t\t}\n\t\tprios = append(prios, pri)\n\t}\n\n\t// Grow buffer to fit all levels.\n\tif cap(priosBuffer) < len(s.levels) {\n\t\tpriosBuffer = make([]compactionPriority, 0, len(s.levels))\n\t}\n\tprios = priosBuffer[:0]\n\n\t// Add L0 priority based on the number of tables.\n\taddPriority(0, float64(s.levels[0].numTables())/float64(s.kv.opt.NumLevelZeroTables))\n\n\t// All other levels use size to calculate priority.\n\tfor i := 1; i < len(s.levels); i++ {\n\t\t// Don't consider those tables that are already being compacted right now.\n\t\tdelSize := s.cstatus.delSize(i)\n\n\t\tl := s.levels[i]\n\t\tsz := l.getTotalSize() - delSize\n\t\taddPriority(i, float64(sz)/float64(t.targetSz[i]))\n\t}\n\ty.AssertTrue(len(prios) == len(s.levels))\n\n\t// The following code is borrowed from PebbleDB and results in healthier LSM tree structure.\n\t// If Li-1 has score > 1.0, then we'll divide Li-1 score by Li. 
If Li score is >= 1.0, then Li-1\n\t// score is reduced, which means we'll prioritize the compaction of lower levels (L5, L4 and so\n\t// on) over the higher levels (L0, L1 and so on). On the other hand, if Li score is < 1.0, then\n\t// we'll increase the priority of Li-1.\n\t// Overall what this means is, if the bottom level is already overflowing, then de-prioritize\n\t// compaction of the above level. If the bottom level is not full, then increase the priority of\n\t// above level.\n\tvar prevLevel int\n\tfor level := t.baseLevel; level < len(s.levels); level++ {\n\t\tif prios[prevLevel].adjusted >= 1 {\n\t\t\t// Avoid absurdly large scores by placing a floor on the score that we'll\n\t\t\t// adjust a level by. The value of 0.01 was chosen somewhat arbitrarily\n\t\t\tconst minScore = 0.01\n\t\t\tif prios[level].score >= minScore {\n\t\t\t\tprios[prevLevel].adjusted /= prios[level].adjusted\n\t\t\t} else {\n\t\t\t\tprios[prevLevel].adjusted /= minScore\n\t\t\t}\n\t\t}\n\t\tprevLevel = level\n\t}\n\n\t// Pick all the levels whose original score is >= 1.0, irrespective of their adjusted score.\n\t// We'll still sort them by their adjusted score below. Having both these scores allows us to\n\t// make better decisions about compacting L0. If we see a score >= 1.0, we can do L0->L0\n\t// compactions. 
If the adjusted score >= 1.0, then we can do L0->Lbase compactions.\n\tout := prios[:0]\n\tfor _, p := range prios[:len(prios)-1] {\n\t\tif p.score >= 1.0 {\n\t\t\tout = append(out, p)\n\t\t}\n\t}\n\tprios = out\n\n\t// Sort by the adjusted score.\n\tsort.Slice(prios, func(i, j int) bool {\n\t\treturn prios[i].adjusted > prios[j].adjusted\n\t})\n\treturn prios\n}\n\n// checkOverlap checks if the given tables overlap with any level from the given \"lev\" onwards.\nfunc (s *levelsController) checkOverlap(tables []*table.Table, lev int) bool {\n\tkr := getKeyRange(tables...)\n\tfor i, lh := range s.levels {\n\t\tif i < lev { // Skip upper levels.\n\t\t\tcontinue\n\t\t}\n\t\tlh.RLock()\n\t\tleft, right := lh.overlappingTables(levelHandlerRLocked{}, kr)\n\t\tlh.RUnlock()\n\t\tif right-left > 0 {\n\t\t\treturn true\n\t\t}\n\t}\n\treturn false\n}\n\n// subcompact runs a single sub-compaction, iterating over the specified key-range only.\n//\n// We use splits to do a single compaction concurrently. If we have >= 3 tables\n// involved in the bottom level during compaction, we choose key ranges to\n// split the main compaction up into sub-compactions. Each sub-compaction runs\n// concurrently, only iterating over the provided key range, generating tables.\n// This speeds up the compaction significantly.\nfunc (s *levelsController) subcompact(it y.Iterator, kr keyRange, cd compactDef,\n\tinflightBuilders *y.Throttle, res chan<- *table.Table) {\n\n\t// Check overlap of the top level with the levels which are not being\n\t// compacted in this compaction.\n\thasOverlap := s.checkOverlap(cd.allTables(), cd.nextLevel.level+1)\n\n\t// Pick a discard ts, so we can discard versions below this ts. We should\n\t// never discard any versions starting from above this timestamp, because\n\t// that would affect the snapshot view guarantee provided by transactions.\n\tdiscardTs := s.kv.orc.discardAtOrBelow()\n\n\t// Try to collect stats so that we can inform value log about GC. 
That would help us find which\n\t// value log file should be GCed.\n\tdiscardStats := make(map[uint32]int64)\n\tupdateStats := func(vs y.ValueStruct) {\n\t\t// We don't need to store/update discard stats when badger is running in Disk-less mode.\n\t\tif s.kv.opt.InMemory {\n\t\t\treturn\n\t\t}\n\t\tif vs.Meta&bitValuePointer > 0 {\n\t\t\tvar vp valuePointer\n\t\t\tvp.Decode(vs.Value)\n\t\t\tdiscardStats[vp.Fid] += int64(vp.Len)\n\t\t}\n\t}\n\n\t// exceedsAllowedOverlap returns true if the given key range would overlap with more than 10\n\t// tables from level below nextLevel (nextLevel+1). This helps avoid generating tables at Li\n\t// with huge overlaps with Li+1.\n\texceedsAllowedOverlap := func(kr keyRange) bool {\n\t\tn2n := cd.nextLevel.level + 1\n\t\tif n2n <= 1 || n2n >= len(s.levels) {\n\t\t\treturn false\n\t\t}\n\t\tn2nl := s.levels[n2n]\n\t\tn2nl.RLock()\n\t\tdefer n2nl.RUnlock()\n\n\t\tl, r := n2nl.overlappingTables(levelHandlerRLocked{}, kr)\n\t\treturn r-l >= 10\n\t}\n\n\tvar (\n\t\tlastKey, skipKey       []byte\n\t\tnumBuilds, numVersions int\n\t\t// Denotes if the first key is a series of duplicate keys had\n\t\t// \"DiscardEarlierVersions\" set\n\t\tfirstKeyHasDiscardSet bool\n\t)\n\n\taddKeys := func(builder *table.Builder) {\n\t\ttimeStart := time.Now()\n\t\tvar numKeys, numSkips uint64\n\t\tvar rangeCheck int\n\t\tvar tableKr keyRange\n\t\tfor ; it.Valid(); it.Next() {\n\t\t\t// See if we need to skip the prefix.\n\t\t\tif len(cd.dropPrefixes) > 0 && hasAnyPrefixes(it.Key(), cd.dropPrefixes) {\n\t\t\t\tnumSkips++\n\t\t\t\tupdateStats(it.Value())\n\t\t\t\tcontinue\n\t\t\t}\n\n\t\t\t// See if we need to skip this key.\n\t\t\tif len(skipKey) > 0 {\n\t\t\t\tif y.SameKey(it.Key(), skipKey) {\n\t\t\t\t\tnumSkips++\n\t\t\t\t\tupdateStats(it.Value())\n\t\t\t\t\tcontinue\n\t\t\t\t} else {\n\t\t\t\t\tskipKey = skipKey[:0]\n\t\t\t\t}\n\t\t\t}\n\n\t\t\tif !y.SameKey(it.Key(), lastKey) {\n\t\t\t\tfirstKeyHasDiscardSet = false\n\t\t\t\tif len(kr.right) > 0 && 
y.CompareKeys(it.Key(), kr.right) >= 0 {\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t\tif builder.ReachedCapacity() {\n\t\t\t\t\t// Only break if we are on a different key, and have reached capacity. We want\n\t\t\t\t\t// to ensure that all versions of the key are stored in the same sstable, and\n\t\t\t\t\t// not divided across multiple tables at the same level.\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t\tlastKey = y.SafeCopy(lastKey, it.Key())\n\t\t\t\tnumVersions = 0\n\t\t\t\tfirstKeyHasDiscardSet = it.Value().Meta&bitDiscardEarlierVersions > 0\n\n\t\t\t\tif len(tableKr.left) == 0 {\n\t\t\t\t\ttableKr.left = y.SafeCopy(tableKr.left, it.Key())\n\t\t\t\t}\n\t\t\t\ttableKr.right = lastKey\n\n\t\t\t\trangeCheck++\n\t\t\t\tif rangeCheck%5000 == 0 {\n\t\t\t\t\t// This table's range exceeds the allowed range overlap with the level after\n\t\t\t\t\t// next. So, we stop writing to this table. If we don't do this, then we end up\n\t\t\t\t\t// doing very expensive compactions involving too many tables. To amortize the\n\t\t\t\t\t// cost of this check, we do it only every N keys.\n\t\t\t\t\tif exceedsAllowedOverlap(tableKr) {\n\t\t\t\t\t\t// s.kv.opt.Debugf(\"L%d -> L%d Breaking due to exceedsAllowedOverlap with\n\t\t\t\t\t\t// kr: %s\\n\", cd.thisLevel.level, cd.nextLevel.level, tableKr)\n\t\t\t\t\t\tbreak\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\n\t\t\tvs := it.Value()\n\t\t\tversion := y.ParseTs(it.Key())\n\n\t\t\tisExpired := isDeletedOrExpired(vs.Meta, vs.ExpiresAt)\n\n\t\t\t// Do not discard entries inserted by merge operator. These entries will be\n\t\t\t// discarded once they're merged\n\t\t\tif version <= discardTs && vs.Meta&bitMergeEntry == 0 {\n\t\t\t\t// Keep track of the number of versions encountered for this key. 
Only consider the\n\t\t\t\t// versions which are below the minReadTs, otherwise, we might end up discarding the\n\t\t\t\t// only valid version for a running transaction.\n\t\t\t\tnumVersions++\n\t\t\t\t// Keep the current version and discard all the next versions if\n\t\t\t\t// - The `discardEarlierVersions` bit is set OR\n\t\t\t\t// - We've already processed `NumVersionsToKeep` number of versions\n\t\t\t\t// (including the current item being processed)\n\t\t\t\tlastValidVersion := vs.Meta&bitDiscardEarlierVersions > 0 ||\n\t\t\t\t\tnumVersions == s.kv.opt.NumVersionsToKeep\n\n\t\t\t\tif isExpired || lastValidVersion {\n\t\t\t\t\t// If this version of the key is deleted or expired, skip all the rest of the\n\t\t\t\t\t// versions. Ensure that we're only removing versions below readTs.\n\t\t\t\t\tskipKey = y.SafeCopy(skipKey, it.Key())\n\n\t\t\t\t\tswitch {\n\t\t\t\t\t// Add the key to the table only if it has not expired.\n\t\t\t\t\t// We don't want to add the deleted/expired keys.\n\t\t\t\t\tcase !isExpired && lastValidVersion:\n\t\t\t\t\t\t// Add this key. We have set skipKey, so the following key versions\n\t\t\t\t\t\t// would be skipped.\n\t\t\t\t\tcase hasOverlap:\n\t\t\t\t\t\t// If this key range has overlap with lower levels, then keep the deletion\n\t\t\t\t\t\t// marker with the latest version, discarding the rest. We have set skipKey,\n\t\t\t\t\t\t// so the following key versions would be skipped.\n\t\t\t\t\tdefault:\n\t\t\t\t\t\t// If no overlap, we can skip all the versions, by continuing here.\n\t\t\t\t\t\tnumSkips++\n\t\t\t\t\t\tupdateStats(vs)\n\t\t\t\t\t\tcontinue // Skip adding this key.\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tnumKeys++\n\t\t\tvar vp valuePointer\n\t\t\tif vs.Meta&bitValuePointer > 0 {\n\t\t\t\tvp.Decode(vs.Value)\n\t\t\t}\n\t\t\tswitch {\n\t\t\tcase firstKeyHasDiscardSet:\n\t\t\t\t// This key is same as the last key which had \"DiscardEarlierVersions\" set. 
The\n\t\t\t\t// next compactions will drop this key if its ts >\n\t\t\t\t// discardTs (of the next compaction).\n\t\t\t\tbuilder.AddStaleKey(it.Key(), vs, vp.Len)\n\t\t\tcase isExpired:\n\t\t\t\t// If the key is expired, the next compaction will drop it if\n\t\t\t\t// its ts > discardTs (of the next compaction).\n\t\t\t\tbuilder.AddStaleKey(it.Key(), vs, vp.Len)\n\t\t\tdefault:\n\t\t\t\tbuilder.Add(it.Key(), vs, vp.Len)\n\t\t\t}\n\t\t}\n\t\ts.kv.opt.Debugf(\"[%d] LOG Compact. Added %d keys. Skipped %d keys. Iteration took: %v\",\n\t\t\tcd.compactorId, numKeys, numSkips, time.Since(timeStart).Round(time.Millisecond))\n\t} // End of function: addKeys\n\n\tif len(kr.left) > 0 {\n\t\tit.Seek(kr.left)\n\t} else {\n\t\tit.Rewind()\n\t}\n\tfor it.Valid() {\n\t\tif len(kr.right) > 0 && y.CompareKeys(it.Key(), kr.right) >= 0 {\n\t\t\tbreak\n\t\t}\n\n\t\tbopts := buildTableOptions(s.kv)\n\t\t// Set TableSize to the target file size for that level.\n\t\tbopts.TableSize = uint64(cd.t.fileSz[cd.nextLevel.level])\n\t\tbuilder := table.NewTableBuilder(bopts)\n\n\t\t// This would do the iteration and add keys to builder.\n\t\taddKeys(builder)\n\n\t\t// It was true that it.Valid() at least once in the loop above, which means we\n\t\t// called Add() at least once, and builder is not Empty().\n\t\tif builder.Empty() {\n\t\t\t// Cleanup builder resources:\n\t\t\tbuilder.Finish()\n\t\t\tbuilder.Close()\n\t\t\tcontinue\n\t\t}\n\t\tnumBuilds++\n\t\tif err := inflightBuilders.Do(); err != nil {\n\t\t\t// Can't return from here, until I decrRef all the tables that I built so far.\n\t\t\tbreak\n\t\t}\n\t\tgo func(builder *table.Builder, fileID uint64) {\n\t\t\tvar err error\n\t\t\tdefer inflightBuilders.Done(err)\n\t\t\tdefer builder.Close()\n\n\t\t\tvar tbl *table.Table\n\t\t\tif s.kv.opt.InMemory {\n\t\t\t\ttbl, err = table.OpenInMemoryTable(builder.Finish(), fileID, &bopts)\n\t\t\t} else {\n\t\t\t\tfname := table.NewFilename(fileID, s.kv.opt.Dir)\n\t\t\t\ttbl, err = 
table.CreateTable(fname, builder)\n\t\t\t}\n\n\t\t\t// If we couldn't build the table, return fast.\n\t\t\tif err != nil {\n\t\t\t\treturn\n\t\t\t}\n\t\t\tres <- tbl\n\t\t}(builder, s.reserveFileID())\n\t}\n\ts.kv.vlog.updateDiscardStats(discardStats)\n\ts.kv.opt.Debugf(\"Discard stats: %v\", discardStats)\n}\n\n// compactBuildTables merges topTables and botTables to form a list of new tables.\nfunc (s *levelsController) compactBuildTables(\n\tlev int, cd compactDef) ([]*table.Table, func() error, error) {\n\n\ttopTables := cd.top\n\tbotTables := cd.bot\n\n\tnumTables := int64(len(topTables) + len(botTables))\n\ty.NumCompactionTablesAdd(s.kv.opt.MetricsEnabled, numTables)\n\tdefer y.NumCompactionTablesAdd(s.kv.opt.MetricsEnabled, -numTables)\n\n\tkeepTable := func(t *table.Table) bool {\n\t\tfor _, prefix := range cd.dropPrefixes {\n\t\t\tif bytes.HasPrefix(t.Smallest(), prefix) &&\n\t\t\t\tbytes.HasPrefix(t.Biggest(), prefix) {\n\t\t\t\t// All the keys in this table have the dropPrefix. 
So, this\n\t\t\t\t// table does not need to be in the iterator and can be\n\t\t\t\t// dropped immediately.\n\t\t\t\treturn false\n\t\t\t}\n\t\t}\n\t\treturn true\n\t}\n\tvar valid []*table.Table\n\tfor _, table := range botTables {\n\t\tif keepTable(table) {\n\t\t\tvalid = append(valid, table)\n\t\t}\n\t}\n\n\tnewIterator := func() []y.Iterator {\n\t\t// Create iterators across all the tables involved first.\n\t\tvar iters []y.Iterator\n\t\tswitch {\n\t\tcase lev == 0:\n\t\t\titers = appendIteratorsReversed(iters, topTables, table.NOCACHE)\n\t\tcase len(topTables) > 0:\n\t\t\ty.AssertTrue(len(topTables) == 1)\n\t\t\titers = []y.Iterator{topTables[0].NewIterator(table.NOCACHE)}\n\t\t}\n\t\t// Next level has level>=1 and we can use ConcatIterator as key ranges do not overlap.\n\t\treturn append(iters, table.NewConcatIterator(valid, table.NOCACHE))\n\t}\n\n\tres := make(chan *table.Table, 3)\n\tinflightBuilders := y.NewThrottle(8 + len(cd.splits))\n\tfor _, kr := range cd.splits {\n\t\t// Initiate Do here so we can register the goroutines for buildTables too.\n\t\tif err := inflightBuilders.Do(); err != nil {\n\t\t\ts.kv.opt.Errorf(\"cannot start subcompaction: %+v\", err)\n\t\t\treturn nil, nil, err\n\t\t}\n\t\tgo func(kr keyRange) {\n\t\t\tdefer inflightBuilders.Done(nil)\n\t\t\tit := table.NewMergeIterator(newIterator(), false)\n\t\t\tdefer it.Close()\n\t\t\ts.subcompact(it, kr, cd, inflightBuilders, res)\n\t\t}(kr)\n\t}\n\n\tvar newTables []*table.Table\n\tvar wg sync.WaitGroup\n\twg.Add(1)\n\tgo func() {\n\t\tdefer wg.Done()\n\t\tfor t := range res {\n\t\t\tnewTables = append(newTables, t)\n\t\t}\n\t}()\n\n\t// Wait for all table builders to finish and also for newTables accumulator to finish.\n\terr := inflightBuilders.Finish()\n\tclose(res)\n\twg.Wait() // Wait for all tables to be picked up.\n\n\tif err == nil {\n\t\t// Ensure created files' directory entries are visible.  
We don't mind the extra latency\n\t\t// from not doing this ASAP after all file creation has finished because this is a\n\t\t// background operation.\n\t\terr = s.kv.syncDir(s.kv.opt.Dir)\n\t}\n\n\tif err != nil {\n\t\t// An error happened.  Delete all the newly created table files (by calling DecrRef\n\t\t// -- we're the only holders of a ref).\n\t\t_ = decrRefs(newTables)\n\t\treturn nil, nil, y.Wrapf(err, \"while running compactions for: %+v\", cd)\n\t}\n\n\tsort.Slice(newTables, func(i, j int) bool {\n\t\treturn y.CompareKeys(newTables[i].Biggest(), newTables[j].Biggest()) < 0\n\t})\n\treturn newTables, func() error { return decrRefs(newTables) }, nil\n}\n\nfunc buildChangeSet(cd *compactDef, newTables []*table.Table) pb.ManifestChangeSet {\n\tchanges := []*pb.ManifestChange{}\n\tfor _, table := range newTables {\n\t\tchanges = append(changes,\n\t\t\tnewCreateChange(table.ID(), cd.nextLevel.level, table.KeyID(), table.CompressionType()))\n\t}\n\tfor _, table := range cd.top {\n\t\t// Add a delete change only if the table is not in memory.\n\t\tif !table.IsInmemory {\n\t\t\tchanges = append(changes, newDeleteChange(table.ID()))\n\t\t}\n\t}\n\tfor _, table := range cd.bot {\n\t\tchanges = append(changes, newDeleteChange(table.ID()))\n\t}\n\treturn pb.ManifestChangeSet{Changes: changes}\n}\n\nfunc hasAnyPrefixes(s []byte, listOfPrefixes [][]byte) bool {\n\tfor _, prefix := range listOfPrefixes {\n\t\tif bytes.HasPrefix(s, prefix) {\n\t\t\treturn true\n\t\t}\n\t}\n\n\treturn false\n}\n\nfunc containsPrefix(table *table.Table, prefix []byte) bool {\n\tsmallValue := table.Smallest()\n\tlargeValue := table.Biggest()\n\tif bytes.HasPrefix(smallValue, prefix) {\n\t\treturn true\n\t}\n\tif bytes.HasPrefix(largeValue, prefix) {\n\t\treturn true\n\t}\n\tisPresent := func() bool {\n\t\tti := table.NewIterator(0)\n\t\tdefer ti.Close()\n\t\t// In table iterator's Seek, we assume that key has version in last 8 bytes. 
We set\n\t\t// version=0 (ts=math.MaxUint64), so that we don't skip the key prefixed with prefix.\n\t\tti.Seek(y.KeyWithTs(prefix, math.MaxUint64))\n\t\treturn bytes.HasPrefix(ti.Key(), prefix)\n\t}\n\n\tif bytes.Compare(prefix, smallValue) > 0 &&\n\t\tbytes.Compare(prefix, largeValue) < 0 {\n\t\t// There may be a case when table contains [0x0000,...., 0xffff]. If we are searching for\n\t\t// k=0x0011, we should not directly infer that k is present. It may not be present.\n\t\treturn isPresent()\n\t}\n\n\treturn false\n}\n\nfunc containsAnyPrefixes(table *table.Table, listOfPrefixes [][]byte) bool {\n\tfor _, prefix := range listOfPrefixes {\n\t\tif containsPrefix(table, prefix) {\n\t\t\treturn true\n\t\t}\n\t}\n\n\treturn false\n}\n\ntype compactDef struct {\n\tcompactorId int\n\tt           targets\n\tp           compactionPriority\n\tthisLevel   *levelHandler\n\tnextLevel   *levelHandler\n\n\ttop []*table.Table\n\tbot []*table.Table\n\n\tthisRange keyRange\n\tnextRange keyRange\n\tsplits    []keyRange\n\n\tthisSize int64\n\n\tdropPrefixes [][]byte\n}\n\n// addSplits can allow us to run multiple sub-compactions in parallel across the split key ranges.\nfunc (s *levelsController) addSplits(cd *compactDef) {\n\tcd.splits = cd.splits[:0]\n\n\t// Let's say we have 10 tables in cd.bot and min width = 3. Then, we'll pick\n\t// 0, 1, 2 (pick), 3, 4, 5 (pick), 6, 7, 8 (pick), 9 (pick, because last table).\n\t// This gives us 4 picks for 10 tables.\n\t// In an edge case, 142 tables in bottom led to 48 splits. 
That's too many splits, because it\n\t// then uses up a lot of memory for table builder.\n\t// We should keep it so we have at max 5 splits.\n\twidth := int(math.Ceil(float64(len(cd.bot)) / 5.0))\n\tif width < 3 {\n\t\twidth = 3\n\t}\n\tskr := cd.thisRange\n\tskr.extend(cd.nextRange)\n\n\taddRange := func(right []byte) {\n\t\tskr.right = y.Copy(right)\n\t\tcd.splits = append(cd.splits, skr)\n\n\t\tskr.left = skr.right\n\t}\n\n\tfor i, t := range cd.bot {\n\t\t// last entry in bottom table.\n\t\tif i == len(cd.bot)-1 {\n\t\t\taddRange([]byte{})\n\t\t\treturn\n\t\t}\n\t\tif i%width == width-1 {\n\t\t\t// Right is assigned ts=0. The encoding ts bytes take MaxUint64-ts,\n\t\t\t// so, those with smaller TS will be considered larger for the same key.\n\t\t\t// Consider the following.\n\t\t\t// Top table is [A1...C3(deleted)]\n\t\t\t// bot table is [B1....C2]\n\t\t\t// It will generate a split [A1 ... C0], including any records of Key C.\n\t\t\tright := y.KeyWithTs(y.ParseKey(t.Biggest()), 0)\n\t\t\taddRange(right)\n\t\t}\n\t}\n}\n\nfunc (cd *compactDef) lockLevels() {\n\tcd.thisLevel.RLock()\n\tcd.nextLevel.RLock()\n}\n\nfunc (cd *compactDef) unlockLevels() {\n\tcd.nextLevel.RUnlock()\n\tcd.thisLevel.RUnlock()\n}\n\nfunc (cd *compactDef) allTables() []*table.Table {\n\tret := make([]*table.Table, 0, len(cd.top)+len(cd.bot))\n\tret = append(ret, cd.top...)\n\tret = append(ret, cd.bot...)\n\treturn ret\n}\n\nfunc (s *levelsController) fillTablesL0ToL0(cd *compactDef) bool {\n\tif cd.compactorId != 0 {\n\t\t// Only compactor zero can work on this.\n\t\treturn false\n\t}\n\n\tcd.nextLevel = s.levels[0]\n\tcd.nextRange = keyRange{}\n\tcd.bot = nil\n\n\t// Because this level and next level are both level 0, we should NOT acquire\n\t// the read lock twice, because it can result in a deadlock. 
So, we don't\n\t// call compactDef.lockLevels, instead locking the level only once and\n\t// directly here.\n\t//\n\t// As per godocs on RWMutex:\n\t// If a goroutine holds a RWMutex for reading and another goroutine might\n\t// call Lock, no goroutine should expect to be able to acquire a read lock\n\t// until the initial read lock is released. In particular, this prohibits\n\t// recursive read locking. This is to ensure that the lock eventually\n\t// becomes available; a blocked Lock call excludes new readers from\n\t// acquiring the lock.\n\ty.AssertTrue(cd.thisLevel.level == 0)\n\ty.AssertTrue(cd.nextLevel.level == 0)\n\ts.levels[0].RLock()\n\tdefer s.levels[0].RUnlock()\n\n\ts.cstatus.Lock()\n\tdefer s.cstatus.Unlock()\n\n\ttop := cd.thisLevel.tables\n\tvar out []*table.Table\n\tnow := time.Now()\n\tfor _, t := range top {\n\t\tif t.Size() >= 2*cd.t.fileSz[0] {\n\t\t\t// This file is already big, don't include it.\n\t\t\tcontinue\n\t\t}\n\t\tif now.Sub(t.CreatedAt) < 10*time.Second {\n\t\t\t// Just created it 10s ago. 
Don't pick for compaction.\n\t\t\tcontinue\n\t\t}\n\t\tif _, beingCompacted := s.cstatus.tables[t.ID()]; beingCompacted {\n\t\t\tcontinue\n\t\t}\n\t\tout = append(out, t)\n\t}\n\n\tif len(out) < 4 {\n\t\t// If we don't have enough tables to merge in L0, don't do it.\n\t\treturn false\n\t}\n\tcd.thisRange = infRange\n\tcd.top = out\n\n\t// Avoid any other L0 -> Lbase from happening, while this is going on.\n\tthisLevel := s.cstatus.levels[cd.thisLevel.level]\n\tthisLevel.ranges = append(thisLevel.ranges, infRange)\n\tfor _, t := range out {\n\t\ts.cstatus.tables[t.ID()] = struct{}{}\n\t}\n\n\t// For L0->L0 compaction, we set the target file size to max, so the output is always one file.\n\t// This significantly decreases the L0 table stalls and improves the performance.\n\tcd.t.fileSz[0] = math.MaxUint32\n\treturn true\n}\n\nfunc (s *levelsController) fillTablesL0ToLbase(cd *compactDef) bool {\n\tif cd.nextLevel.level == 0 {\n\t\tpanic(\"Base level can't be zero.\")\n\t}\n\t// We keep cd.p.adjusted > 0.0 here to allow functions in db.go to artificially trigger\n\t// L0->Lbase compactions. Those functions wouldn't be setting the adjusted score.\n\tif cd.p.adjusted > 0.0 && cd.p.adjusted < 1.0 {\n\t\t// Do not compact to Lbase if adjusted score is less than 1.0.\n\t\treturn false\n\t}\n\tcd.lockLevels()\n\tdefer cd.unlockLevels()\n\n\ttop := cd.thisLevel.tables\n\tif len(top) == 0 {\n\t\treturn false\n\t}\n\n\tvar out []*table.Table\n\tif len(cd.dropPrefixes) > 0 {\n\t\t// Use all tables if drop prefix is set. We don't want to compact only a\n\t\t// sub-range. We want to compact all the tables.\n\t\tout = top\n\n\t} else {\n\t\tvar kr keyRange\n\t\t// cd.top[0] is the oldest file. 
So we start from the oldest file first.\n\t\tfor _, t := range top {\n\t\t\tdkr := getKeyRange(t)\n\t\t\tif kr.overlapsWith(dkr) {\n\t\t\t\tout = append(out, t)\n\t\t\t\tkr.extend(dkr)\n\t\t\t} else {\n\t\t\t\tbreak\n\t\t\t}\n\t\t}\n\t}\n\tcd.thisRange = getKeyRange(out...)\n\tcd.top = out\n\n\tleft, right := cd.nextLevel.overlappingTables(levelHandlerRLocked{}, cd.thisRange)\n\tcd.bot = make([]*table.Table, right-left)\n\tcopy(cd.bot, cd.nextLevel.tables[left:right])\n\n\tif len(cd.bot) == 0 {\n\t\tcd.nextRange = cd.thisRange\n\t} else {\n\t\tcd.nextRange = getKeyRange(cd.bot...)\n\t}\n\treturn s.cstatus.compareAndAdd(thisAndNextLevelRLocked{}, *cd)\n}\n\n// fillTablesL0 would try to fill tables from L0 to be compacted with Lbase. If\n// it can not do that, it would try to compact tables from L0 -> L0.\n//\n// Say L0 has 10 tables.\n// fillTablesL0ToLbase picks up 5 tables to compact from L0 -> L5.\n// Next call to fillTablesL0 would run L0ToLbase again, which fails this time.\n// So, instead, we run fillTablesL0ToL0, which picks up rest of the 5 tables to\n// be compacted within L0. 
Additionally, it would set the compaction range in\n// cstatus to inf, so no other L0 -> Lbase compactions can happen.\n// Thus, L0 -> L0 must finish for the next L0 -> Lbase to begin.\nfunc (s *levelsController) fillTablesL0(cd *compactDef) bool {\n\tif ok := s.fillTablesL0ToLbase(cd); ok {\n\t\treturn true\n\t}\n\treturn s.fillTablesL0ToL0(cd)\n}\n\n// sortByStaleDataSize sorts tables in decreasing order of the amount of stale data they have.\n// This is useful in removing tombstones.\nfunc (s *levelsController) sortByStaleDataSize(tables []*table.Table, cd *compactDef) {\n\tif len(tables) == 0 || cd.nextLevel == nil {\n\t\treturn\n\t}\n\n\tsort.Slice(tables, func(i, j int) bool {\n\t\treturn tables[i].StaleDataSize() > tables[j].StaleDataSize()\n\t})\n}\n\n// sortByHeuristic sorts tables in increasing order of MaxVersion, so we\n// compact older tables first.\nfunc (s *levelsController) sortByHeuristic(tables []*table.Table, cd *compactDef) {\n\tif len(tables) == 0 || cd.nextLevel == nil {\n\t\treturn\n\t}\n\n\t// Sort tables by max version. 
This is what RocksDB does.\n\tsort.Slice(tables, func(i, j int) bool {\n\t\treturn tables[i].MaxVersion() < tables[j].MaxVersion()\n\t})\n}\n\n// This function should be called with lock on levels.\nfunc (s *levelsController) fillMaxLevelTables(tables []*table.Table, cd *compactDef) bool {\n\tsortedTables := make([]*table.Table, len(tables))\n\tcopy(sortedTables, tables)\n\ts.sortByStaleDataSize(sortedTables, cd)\n\n\tif len(sortedTables) > 0 && sortedTables[0].StaleDataSize() == 0 {\n\t\t// This is a maxLevel to maxLevel compaction and we don't have any stale data.\n\t\treturn false\n\t}\n\tcd.bot = []*table.Table{}\n\tcollectBotTables := func(t *table.Table, needSz int64) {\n\t\ttotalSize := t.Size()\n\n\t\tj := sort.Search(len(tables), func(i int) bool {\n\t\t\treturn y.CompareKeys(tables[i].Smallest(), t.Smallest()) >= 0\n\t\t})\n\t\ty.AssertTrue(tables[j].ID() == t.ID())\n\t\tj++\n\t\t// Collect tables until we reach the required size.\n\t\tfor j < len(tables) {\n\t\t\tnewT := tables[j]\n\t\t\ttotalSize += newT.Size()\n\n\t\t\tif totalSize >= needSz {\n\t\t\t\tbreak\n\t\t\t}\n\t\t\tcd.bot = append(cd.bot, newT)\n\t\t\tcd.nextRange.extend(getKeyRange(newT))\n\t\t\tj++\n\t\t}\n\t}\n\tnow := time.Now()\n\tfor _, t := range sortedTables {\n\t\t// If the maxVersion is above the discardTs, we won't clean anything in\n\t\t// the compaction. So skip this table.\n\t\tif t.MaxVersion() > s.kv.orc.discardAtOrBelow() {\n\t\t\tcontinue\n\t\t}\n\t\tif now.Sub(t.CreatedAt) < time.Hour {\n\t\t\t// Created less than an hour ago. Don't pick for compaction.\n\t\t\tcontinue\n\t\t}\n\t\t// If the stale data size is less than 10 MB, it might not be worth\n\t\t// rewriting the table. Skip it.\n\t\tif t.StaleDataSize() < 10<<20 {\n\t\t\tcontinue\n\t\t}\n\n\t\tcd.thisSize = t.Size()\n\t\tcd.thisRange = getKeyRange(t)\n\t\t// Set the next range as the same as the current range. 
If we don't do\n\t\t// this, we won't be able to run more than one max level compactions.\n\t\tcd.nextRange = cd.thisRange\n\t\t// If we're already compacting this range, don't do anything.\n\t\tif s.cstatus.overlapsWith(cd.thisLevel.level, cd.thisRange) {\n\t\t\tcontinue\n\t\t}\n\n\t\t// Found a valid table!\n\t\tcd.top = []*table.Table{t}\n\n\t\tneedFileSz := cd.t.fileSz[cd.thisLevel.level]\n\t\t// The table size is what we want so no need to collect more tables.\n\t\tif t.Size() >= needFileSz {\n\t\t\tbreak\n\t\t}\n\t\t// TableSize is less than what we want. Collect more tables for compaction.\n\t\t// If the level has multiple small tables, we collect all of them\n\t\t// together to form a bigger table.\n\t\tcollectBotTables(t, needFileSz)\n\t\tif !s.cstatus.compareAndAdd(thisAndNextLevelRLocked{}, *cd) {\n\t\t\tcd.bot = cd.bot[:0]\n\t\t\tcd.nextRange = keyRange{}\n\t\t\tcontinue\n\t\t}\n\t\treturn true\n\t}\n\tif len(cd.top) == 0 {\n\t\treturn false\n\t}\n\n\treturn s.cstatus.compareAndAdd(thisAndNextLevelRLocked{}, *cd)\n}\n\nfunc (s *levelsController) fillTables(cd *compactDef) bool {\n\tcd.lockLevels()\n\tdefer cd.unlockLevels()\n\n\ttables := make([]*table.Table, len(cd.thisLevel.tables))\n\tcopy(tables, cd.thisLevel.tables)\n\tif len(tables) == 0 {\n\t\treturn false\n\t}\n\t// We're doing a maxLevel to maxLevel compaction. Pick tables based on the stale data size.\n\tif cd.thisLevel.isLastLevel() {\n\t\treturn s.fillMaxLevelTables(tables, cd)\n\t}\n\t// We pick tables, so we compact older tables first. 
This is similar to\n\t// kOldestLargestSeqFirst in RocksDB.\n\ts.sortByHeuristic(tables, cd)\n\n\tfor _, t := range tables {\n\t\tcd.thisSize = t.Size()\n\t\tcd.thisRange = getKeyRange(t)\n\t\t// If we're already compacting this range, don't do anything.\n\t\tif s.cstatus.overlapsWith(cd.thisLevel.level, cd.thisRange) {\n\t\t\tcontinue\n\t\t}\n\t\tcd.top = []*table.Table{t}\n\t\tleft, right := cd.nextLevel.overlappingTables(levelHandlerRLocked{}, cd.thisRange)\n\n\t\tcd.bot = make([]*table.Table, right-left)\n\t\tcopy(cd.bot, cd.nextLevel.tables[left:right])\n\n\t\tif len(cd.bot) == 0 {\n\t\t\tcd.bot = []*table.Table{}\n\t\t\tcd.nextRange = cd.thisRange\n\t\t\tif !s.cstatus.compareAndAdd(thisAndNextLevelRLocked{}, *cd) {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\treturn true\n\t\t}\n\t\tcd.nextRange = getKeyRange(cd.bot...)\n\n\t\tif s.cstatus.overlapsWith(cd.nextLevel.level, cd.nextRange) {\n\t\t\tcontinue\n\t\t}\n\t\tif !s.cstatus.compareAndAdd(thisAndNextLevelRLocked{}, *cd) {\n\t\t\tcontinue\n\t\t}\n\t\treturn true\n\t}\n\treturn false\n}\n\nfunc (s *levelsController) runCompactDef(id, l int, cd compactDef) (err error) {\n\tif len(cd.t.fileSz) == 0 {\n\t\treturn errors.New(\"Filesizes cannot be zero. 
Targets are not set\")\n\t}\n\ttimeStart := time.Now()\n\n\tthisLevel := cd.thisLevel\n\tnextLevel := cd.nextLevel\n\n\ty.AssertTrue(len(cd.splits) == 0)\n\tif thisLevel.level == nextLevel.level {\n\t\t// don't do anything for L0 -> L0 and Lmax -> Lmax.\n\t} else {\n\t\ts.addSplits(&cd)\n\t}\n\tif len(cd.splits) == 0 {\n\t\tcd.splits = append(cd.splits, keyRange{})\n\t}\n\n\t// Tables should never be moved directly between levels; they are\n\t// always rewritten to allow discarding invalid versions.\n\n\tnewTables, decr, err := s.compactBuildTables(l, cd)\n\tif err != nil {\n\t\treturn err\n\t}\n\tdefer func() {\n\t\t// Only assign to err if it is currently nil, so an earlier error is not overwritten.\n\t\tif decErr := decr(); err == nil {\n\t\t\terr = decErr\n\t\t}\n\t}()\n\tchangeSet := buildChangeSet(&cd, newTables)\n\n\t// We write to the manifest _before_ we delete files (and after we created files)\n\tif err := s.kv.manifest.addChanges(changeSet.Changes, s.kv.opt); err != nil {\n\t\treturn err\n\t}\n\n\tgetSizes := func(tables []*table.Table) int64 {\n\t\tsize := int64(0)\n\t\tfor _, i := range tables {\n\t\t\tsize += i.Size()\n\t\t}\n\t\treturn size\n\t}\n\n\tsizeNewTables := int64(0)\n\tsizeOldTables := int64(0)\n\tif s.kv.opt.MetricsEnabled {\n\t\tsizeNewTables = getSizes(newTables)\n\t\tsizeOldTables = getSizes(cd.bot) + getSizes(cd.top)\n\t\ty.NumBytesCompactionWrittenAdd(s.kv.opt.MetricsEnabled, nextLevel.strLevel, sizeNewTables)\n\t}\n\n\t// See comment earlier in this function about the ordering of these ops, and the order in which\n\t// we access levels when reading.\n\tif err := nextLevel.replaceTables(cd.bot, newTables); err != nil {\n\t\treturn err\n\t}\n\tif err := thisLevel.deleteTables(cd.top); err != nil {\n\t\treturn err\n\t}\n\n\t// Note: For level 0, while doCompact is running, it is possible that new tables are added.\n\t// However, the tables are added only to the end, so it is ok to just delete the first table.\n\n\tfrom := append(tablesToString(cd.top), 
tablesToString(cd.bot)...)\n\tto := tablesToString(newTables)\n\tif dur := time.Since(timeStart); dur > 2*time.Second {\n\t\tvar expensive string\n\t\tif dur > time.Second {\n\t\t\texpensive = \" [E]\"\n\t\t}\n\t\ts.kv.opt.Infof(\"[%d]%s LOG Compact %d->%d (%d, %d -> %d tables with %d splits).\"+\n\t\t\t\" [%s] -> [%s], took %v\\n, deleted %d bytes\",\n\t\t\tid, expensive, thisLevel.level, nextLevel.level, len(cd.top), len(cd.bot),\n\t\t\tlen(newTables), len(cd.splits), strings.Join(from, \" \"), strings.Join(to, \" \"),\n\t\t\tdur.Round(time.Millisecond), sizeOldTables-sizeNewTables)\n\t}\n\n\tif cd.thisLevel.level != 0 && len(newTables) > 2*s.kv.opt.LevelSizeMultiplier {\n\t\ts.kv.opt.Infof(\"This Range (numTables: %d)\\nLeft:\\n%s\\nRight:\\n%s\\n\",\n\t\t\tlen(cd.top), hex.Dump(cd.thisRange.left), hex.Dump(cd.thisRange.right))\n\t\ts.kv.opt.Infof(\"Next Range (numTables: %d)\\nLeft:\\n%s\\nRight:\\n%s\\n\",\n\t\t\tlen(cd.bot), hex.Dump(cd.nextRange.left), hex.Dump(cd.nextRange.right))\n\t}\n\treturn nil\n}\n\nfunc tablesToString(tables []*table.Table) []string {\n\tvar res []string\n\tfor _, t := range tables {\n\t\tres = append(res, fmt.Sprintf(\"%05d\", t.ID()))\n\t}\n\tres = append(res, \".\")\n\treturn res\n}\n\nvar errFillTables = errors.New(\"Unable to fill tables\")\n\n// doCompact picks some table on level l and compacts it away to the next level.\nfunc (s *levelsController) doCompact(id int, p compactionPriority) error {\n\tl := p.level\n\ty.AssertTrue(l < s.kv.opt.MaxLevels) // Sanity check.\n\tif p.t.baseLevel == 0 {\n\t\tp.t = s.levelTargets()\n\t}\n\n\t_, span := otel.Tracer(\"\").Start(context.TODO(), \"Badger.Compaction\")\n\tdefer span.End()\n\n\tcd := compactDef{\n\t\tcompactorId:  id,\n\t\tp:            p,\n\t\tt:            p.t,\n\t\tthisLevel:    s.levels[l],\n\t\tdropPrefixes: p.dropPrefixes,\n\t}\n\n\t// While picking tables to be compacted, both levels' tables are expected to\n\t// remain unchanged.\n\tif l == 0 {\n\t\tcd.nextLevel = 
s.levels[p.t.baseLevel]\n\t\tif !s.fillTablesL0(&cd) {\n\t\t\treturn errFillTables\n\t\t}\n\t} else {\n\t\tcd.nextLevel = cd.thisLevel\n\t\t// We're not compacting the last level so pick the next level.\n\t\tif !cd.thisLevel.isLastLevel() {\n\t\t\tcd.nextLevel = s.levels[l+1]\n\t\t}\n\t\tif !s.fillTables(&cd) {\n\t\t\treturn errFillTables\n\t\t}\n\t}\n\tdefer s.cstatus.delete(cd) // Remove the ranges from compaction status.\n\n\tspan.SetAttributes(attribute.String(\"Compaction\", fmt.Sprintf(\"%+v\", cd)))\n\tif err := s.runCompactDef(id, l, cd); err != nil {\n\t\t// This compaction couldn't be done successfully.\n\t\ts.kv.opt.Warningf(\"[Compactor: %d] LOG Compact FAILED with error: %+v: %+v\", id, err, cd)\n\t\treturn err\n\t}\n\n\tspan.SetAttributes(\n\t\tattribute.Int(\"Top tables count\", len(cd.top)),\n\t\tattribute.Int(\"Bottom tables count\", len(cd.bot)))\n\n\ts.kv.opt.Debugf(\"[Compactor: %d] Compaction for level: %d DONE\", id, cd.thisLevel.level)\n\treturn nil\n}\n\nfunc (s *levelsController) addLevel0Table(t *table.Table) error {\n\t// Add table to manifest file only if it is not opened in memory. We don't want to add a table\n\t// to the manifest file if it exists only in memory.\n\tif !t.IsInmemory {\n\t\t// We update the manifest _before_ the table becomes part of a levelHandler, because at that\n\t\t// point it could get used in some compaction.  This ensures the manifest file gets updated in\n\t\t// the proper order. 
(That means this update happens before that of some compaction which\n\t\t// deletes the table.)\n\t\terr := s.kv.manifest.addChanges([]*pb.ManifestChange{\n\t\t\tnewCreateChange(t.ID(), 0, t.KeyID(), t.CompressionType()),\n\t\t}, s.kv.opt)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\n\tfor !s.levels[0].tryAddLevel0Table(t) {\n\t\t// Before we uninstall, we need to make sure that level 0 is healthy.\n\t\ttimeStart := time.Now()\n\t\tfor s.levels[0].numTables() >= s.kv.opt.NumLevelZeroTablesStall {\n\t\t\ttime.Sleep(10 * time.Millisecond)\n\t\t}\n\t\tdur := time.Since(timeStart)\n\t\tif dur > time.Second {\n\t\t\ts.kv.opt.Infof(\"L0 was stalled for %s\\n\", dur.Round(time.Millisecond))\n\t\t}\n\t\ts.l0stallsMs.Add(int64(dur.Round(time.Millisecond)))\n\t}\n\n\treturn nil\n}\n\nfunc (s *levelsController) close() error {\n\terr := s.cleanupLevels()\n\treturn y.Wrap(err, \"levelsController.Close\")\n}\n\n// get searches for a given key in all the levels of the LSM tree. It returns\n// key version <= the expected version (version in key). If not found,\n// it returns an empty y.ValueStruct.\nfunc (s *levelsController) get(key []byte, maxVs y.ValueStruct, startLevel int) (\n\ty.ValueStruct, error) {\n\tif s.kv.IsClosed() {\n\t\treturn y.ValueStruct{}, ErrDBClosed\n\t}\n\t// It's important that we iterate the levels from 0 on upward. The reason is, if we iterated\n\t// in opposite order, or in parallel (naively calling all the h.RLock() in some order) we could\n\t// read level L's tables post-compaction and level L+1's tables pre-compaction. (If we do\n\t// parallelize this, we will need to call the h.RLock() function by increasing order of level\n\t// number.)\n\tversion := y.ParseTs(key)\n\tfor _, h := range s.levels {\n\t\t// Ignore all levels below startLevel. 
This is useful for GC when L0 is kept in memory.\n\t\tif h.level < startLevel {\n\t\t\tcontinue\n\t\t}\n\t\tvs, err := h.get(key) // Calls h.RLock() and h.RUnlock().\n\t\tif err != nil {\n\t\t\treturn y.ValueStruct{}, y.Wrapf(err, \"get key: %q\", key)\n\t\t}\n\t\tif vs.Value == nil && vs.Meta == 0 {\n\t\t\tcontinue\n\t\t}\n\t\ty.NumBytesReadsLSMAdd(s.kv.opt.MetricsEnabled, int64(len(vs.Value)))\n\t\tif vs.Version == version {\n\t\t\treturn vs, nil\n\t\t}\n\t\tif maxVs.Version < vs.Version {\n\t\t\tmaxVs = vs\n\t\t}\n\t}\n\tif len(maxVs.Value) > 0 {\n\t\ty.NumGetsWithResultsAdd(s.kv.opt.MetricsEnabled, 1)\n\t}\n\treturn maxVs, nil\n}\n\nfunc appendIteratorsReversed(out []y.Iterator, th []*table.Table, opt int) []y.Iterator {\n\tfor i := len(th) - 1; i >= 0; i-- {\n\t\t// This will increment the reference of the table handler.\n\t\tout = append(out, th[i].NewIterator(opt))\n\t}\n\treturn out\n}\n\n// appendIterators appends iterators to an array of iterators, for merging.\n// Note: This obtains references for the table handlers. 
Remember to close these iterators.\nfunc (s *levelsController) appendIterators(\n\titers []y.Iterator, opt *IteratorOptions) []y.Iterator {\n\t// Just like with get, it's important we iterate the levels from 0 on upward, to avoid missing\n\t// data when there's a compaction.\n\tfor _, level := range s.levels {\n\t\titers = level.appendIterators(iters, opt)\n\t}\n\treturn iters\n}\n\n// TableInfo represents the information about a table.\ntype TableInfo struct {\n\tID               uint64\n\tLevel            int\n\tLeft             []byte\n\tRight            []byte\n\tKeyCount         uint32 // Number of keys in the table\n\tOnDiskSize       uint32\n\tStaleDataSize    uint32\n\tUncompressedSize uint32\n\tMaxVersion       uint64\n\tIndexSz          int\n\tBloomFilterSize  int\n}\n\nfunc (s *levelsController) getTableInfo() (result []TableInfo) {\n\tfor _, l := range s.levels {\n\t\tl.RLock()\n\t\tfor _, t := range l.tables {\n\t\t\tinfo := TableInfo{\n\t\t\t\tID:               t.ID(),\n\t\t\t\tLevel:            l.level,\n\t\t\t\tLeft:             t.Smallest(),\n\t\t\t\tRight:            t.Biggest(),\n\t\t\t\tKeyCount:         t.KeyCount(),\n\t\t\t\tOnDiskSize:       t.OnDiskSize(),\n\t\t\t\tStaleDataSize:    t.StaleDataSize(),\n\t\t\t\tIndexSz:          t.IndexSize(),\n\t\t\t\tBloomFilterSize:  t.BloomFilterSize(),\n\t\t\t\tUncompressedSize: t.UncompressedSize(),\n\t\t\t\tMaxVersion:       t.MaxVersion(),\n\t\t\t}\n\t\t\tresult = append(result, info)\n\t\t}\n\t\tl.RUnlock()\n\t}\n\tsort.Slice(result, func(i, j int) bool {\n\t\tif result[i].Level != result[j].Level {\n\t\t\treturn result[i].Level < result[j].Level\n\t\t}\n\t\treturn result[i].ID < result[j].ID\n\t})\n\treturn\n}\n\ntype LevelInfo struct {\n\tLevel          int\n\tNumTables      int\n\tSize           int64\n\tTargetSize     int64\n\tTargetFileSize int64\n\tIsBaseLevel    bool\n\tScore          float64\n\tAdjusted       float64\n\tStaleDatSize   int64\n}\n\nfunc (s *levelsController) getLevelInfo() 
[]LevelInfo {\n\tt := s.levelTargets()\n\tprios := s.pickCompactLevels(nil)\n\tresult := make([]LevelInfo, len(s.levels))\n\tfor i, l := range s.levels {\n\t\tl.RLock()\n\t\tresult[i].Level = i\n\t\tresult[i].Size = l.totalSize\n\t\tresult[i].NumTables = len(l.tables)\n\t\tresult[i].StaleDatSize = l.totalStaleSize\n\n\t\tl.RUnlock()\n\n\t\tresult[i].TargetSize = t.targetSz[i]\n\t\tresult[i].TargetFileSize = t.fileSz[i]\n\t\tresult[i].IsBaseLevel = t.baseLevel == i\n\t}\n\tfor _, p := range prios {\n\t\tresult[p.level].Score = p.score\n\t\tresult[p.level].Adjusted = p.adjusted\n\t}\n\treturn result\n}\n\n// verifyChecksum verifies checksum for all tables on all levels.\nfunc (s *levelsController) verifyChecksum() error {\n\tvar tables []*table.Table\n\tfor _, l := range s.levels {\n\t\tl.RLock()\n\t\ttables = tables[:0]\n\t\tfor _, t := range l.tables {\n\t\t\ttables = append(tables, t)\n\t\t\tt.IncrRef()\n\t\t}\n\t\tl.RUnlock()\n\n\t\tfor _, t := range tables {\n\t\t\terrChkVerify := t.VerifyChecksum()\n\t\t\tif err := t.DecrRef(); err != nil {\n\t\t\t\ts.kv.opt.Errorf(\"unable to decrease reference of table: %s while \"+\n\t\t\t\t\t\"verifying checksum with error: %s\", t.Filename(), err)\n\t\t\t}\n\n\t\t\tif errChkVerify != nil {\n\t\t\t\treturn errChkVerify\n\t\t\t}\n\t\t}\n\t}\n\n\treturn nil\n}\n\n// Returns the sorted list of splits for all the levels and tables based\n// on the block offsets.\nfunc (s *levelsController) keySplits(numPerTable int, prefix []byte) []string {\n\tsplits := make([]string, 0)\n\tfor _, l := range s.levels {\n\t\tl.RLock()\n\t\tfor _, t := range l.tables {\n\t\t\ttableSplits := t.KeySplits(numPerTable, prefix)\n\t\t\tsplits = append(splits, tableSplits...)\n\t\t}\n\t\tl.RUnlock()\n\t}\n\tsort.Strings(splits)\n\treturn splits\n}\n"
  },
  {
    "path": "levels_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"math\"\n\t\"math/rand\"\n\t\"os\"\n\t\"sort\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\n// createAndOpen creates a table with the given data and adds it to the given level.\nfunc createAndOpen(db *DB, td []keyValVersion, level int) {\n\topts := table.Options{\n\t\tBlockSize:          db.opt.BlockSize,\n\t\tBloomFalsePositive: db.opt.BloomFalsePositive,\n\t\tChkMode:            options.NoVerification,\n\t}\n\tb := table.NewTableBuilder(opts)\n\tdefer b.Close()\n\n\t// Add all keys and versions to the table.\n\tfor _, item := range td {\n\t\tkey := y.KeyWithTs([]byte(item.key), uint64(item.version))\n\t\tval := y.ValueStruct{Value: []byte(item.val), Meta: item.meta}\n\t\tb.Add(key, val, 0)\n\t}\n\tfname := table.NewFilename(db.lc.reserveFileID(), db.opt.Dir)\n\ttab, err := table.CreateTable(fname, b)\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\tif err := db.manifest.addChanges([]*pb.ManifestChange{\n\t\tnewCreateChange(tab.ID(), level, 0, tab.CompressionType()),\n\t}, db.opt); err != nil {\n\t\tpanic(err)\n\t}\n\tdb.lc.levels[level].Lock()\n\t// Add table to the given level.\n\tdb.lc.levels[level].tables = append(db.lc.levels[level].tables, tab)\n\tdb.lc.levels[level].Unlock()\n}\n\ntype keyValVersion struct {\n\tkey     string\n\tval     string\n\tversion int\n\tmeta    byte\n}\n\nfunc TestCheckOverlap(t *testing.T) {\n\tt.Run(\"overlap\", func(t *testing.T) {\n\t\t// This test consists of one table on level 0 and one on level 1.\n\t\t// There is an overlap amongst the tables but there is no overlap\n\t\t// with rest of the levels.\n\t\tt.Run(\"same keys\", func(t *testing.T) 
{\n\t\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\t\tl0 := []keyValVersion{{\"foo\", \"bar\", 3, 0}}\n\t\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 2, 0}}\n\t\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\t\tcreateAndOpen(db, l1, 1)\n\n\t\t\t\t// Level 0 should overlap with level 0 tables.\n\t\t\t\trequire.True(t, db.lc.checkOverlap(db.lc.levels[0].tables, 0))\n\t\t\t\t// Level 1 should overlap with level 0 tables (they have the same keys).\n\t\t\t\trequire.True(t, db.lc.checkOverlap(db.lc.levels[0].tables, 1))\n\t\t\t\t// Level 2 and 3 should not overlap with level 0 tables.\n\t\t\t\trequire.False(t, db.lc.checkOverlap(db.lc.levels[0].tables, 2))\n\t\t\t\trequire.False(t, db.lc.checkOverlap(db.lc.levels[1].tables, 2))\n\t\t\t\trequire.False(t, db.lc.checkOverlap(db.lc.levels[0].tables, 3))\n\t\t\t\trequire.False(t, db.lc.checkOverlap(db.lc.levels[1].tables, 3))\n\n\t\t\t})\n\t\t})\n\t\tt.Run(\"overlapping keys\", func(t *testing.T) {\n\t\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\t\tl0 := []keyValVersion{{\"a\", \"x\", 1, 0}, {\"b\", \"x\", 1, 0}, {\"foo\", \"bar\", 3, 0}}\n\t\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 2, 0}}\n\t\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\t\tcreateAndOpen(db, l1, 1)\n\n\t\t\t\t// Level 0 should overlap with level 0 tables.\n\t\t\t\trequire.True(t, db.lc.checkOverlap(db.lc.levels[0].tables, 0))\n\t\t\t\trequire.True(t, db.lc.checkOverlap(db.lc.levels[1].tables, 1))\n\t\t\t\t// Level 1 should overlap with level 0 tables, \"foo\" key is common.\n\t\t\t\trequire.True(t, db.lc.checkOverlap(db.lc.levels[0].tables, 1))\n\t\t\t\t// Level 2 and 3 should not overlap with level 0 tables.\n\t\t\t\trequire.False(t, db.lc.checkOverlap(db.lc.levels[0].tables, 2))\n\t\t\t\trequire.False(t, db.lc.checkOverlap(db.lc.levels[0].tables, 3))\n\t\t\t})\n\t\t})\n\t})\n\tt.Run(\"non-overlapping\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\tl0 := []keyValVersion{{\"a\", \"x\", 1, 0}, {\"b\", 
\"x\", 1, 0}, {\"c\", \"bar\", 3, 0}}\n\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 2, 0}}\n\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\tcreateAndOpen(db, l1, 1)\n\n\t\t\t// Level 1 should not overlap with level 0 tables\n\t\t\trequire.False(t, db.lc.checkOverlap(db.lc.levels[0].tables, 1))\n\t\t\t// Level 2 and 3 should not overlap with level 0 tables.\n\t\t\trequire.False(t, db.lc.checkOverlap(db.lc.levels[0].tables, 2))\n\t\t\trequire.False(t, db.lc.checkOverlap(db.lc.levels[0].tables, 3))\n\t\t})\n\t})\n}\n\nfunc getAllAndCheck(t *testing.T, db *DB, expected []keyValVersion) {\n\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\topt := DefaultIteratorOptions\n\t\topt.AllVersions = true\n\t\topt.InternalAccess = true\n\t\tit := txn.NewIterator(opt)\n\t\tdefer it.Close()\n\t\ti := 0\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\titem := it.Item()\n\t\t\tv, err := item.ValueCopy(nil)\n\t\t\trequire.NoError(t, err)\n\t\t\t// fmt.Printf(\"k: %s v: %d val: %s\\n\", item.key, item.Version(), v)\n\t\t\trequire.Less(t, i, len(expected), \"DB has more number of key than expected\")\n\t\t\texpect := expected[i]\n\t\t\trequire.Equal(t, expect.key, string(item.Key()), \"expected key: %s actual key: %s\",\n\t\t\t\texpect.key, item.Key())\n\t\t\trequire.Equal(t, expect.val, string(v), \"key: %s expected value: %s actual %s\",\n\t\t\t\titem.key, expect.val, v)\n\t\t\trequire.Equal(t, expect.version, int(item.Version()),\n\t\t\t\t\"key: %s expected version: %d actual %d\", item.key, expect.version, item.Version())\n\t\t\trequire.Equal(t, expect.meta, item.meta,\n\t\t\t\t\"key: %s expected meta: %d meta %d\", item.key, expect.meta, item.meta)\n\t\t\ti++\n\t\t}\n\t\trequire.Equal(t, len(expected), i, \"keys examined should be equal to keys expected\")\n\t\treturn nil\n\t}))\n}\n\nfunc TestCompaction(t *testing.T) {\n\t// Disable compactions and keep single version of each key.\n\topt := DefaultOptions(\"\").WithNumCompactors(0).WithNumVersionsToKeep(1)\n\topt.managedTxns = 
true\n\tt.Run(\"level 0 to level 1\", func(t *testing.T) {\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\tl0 := []keyValVersion{{\"foo\", \"bar\", 3, 0}, {\"fooz\", \"baz\", 1, 0}}\n\t\t\tl01 := []keyValVersion{{\"foo\", \"bar\", 2, 0}}\n\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 1, 0}}\n\t\t\t// Level 0 has table l0 and l01.\n\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\tcreateAndOpen(db, l01, 0)\n\t\t\t// Level 1 has table l1.\n\t\t\tcreateAndOpen(db, l1, 1)\n\n\t\t\t// Set a high discard timestamp so that all the keys are below the discard timestamp.\n\t\t\tdb.SetDiscardTs(10)\n\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 3, 0}, {\"foo\", \"bar\", 2, 0},\n\t\t\t\t{\"foo\", \"bar\", 1, 0}, {\"fooz\", \"baz\", 1, 0},\n\t\t\t})\n\t\t\tcdef := compactDef{\n\t\t\t\tthisLevel: db.lc.levels[0],\n\t\t\t\tnextLevel: db.lc.levels[1],\n\t\t\t\ttop:       db.lc.levels[0].tables,\n\t\t\t\tbot:       db.lc.levels[1].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 1\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 0, cdef))\n\t\t\t// foo version 2 should be dropped after compaction.\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{{\"foo\", \"bar\", 3, 0}, {\"fooz\", \"baz\", 1, 0}})\n\t\t})\n\t})\n\tt.Run(\"level 0 to level 1 with duplicates\", func(t *testing.T) {\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\t// We have foo version 3 on L0 because we gc'ed it.\n\t\t\tl0 := []keyValVersion{{\"foo\", \"barNew\", 3, 0}, {\"fooz\", \"baz\", 1, 0}}\n\t\t\tl01 := []keyValVersion{{\"foo\", \"bar\", 4, 0}}\n\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 3, 0}}\n\t\t\t// Level 0 has table l0 and l01.\n\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\tcreateAndOpen(db, l01, 0)\n\t\t\t// Level 1 has table l1.\n\t\t\tcreateAndOpen(db, l1, 1)\n\n\t\t\t// Set a high discard timestamp so that all the keys are below the discard timestamp.\n\t\t\tdb.SetDiscardTs(10)\n\n\t\t\tgetAllAndCheck(t, db, 
[]keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 4, 0}, {\"foo\", \"barNew\", 3, 0},\n\t\t\t\t{\"fooz\", \"baz\", 1, 0},\n\t\t\t})\n\t\t\tcdef := compactDef{\n\t\t\t\tthisLevel: db.lc.levels[0],\n\t\t\t\tnextLevel: db.lc.levels[1],\n\t\t\t\ttop:       db.lc.levels[0].tables,\n\t\t\t\tbot:       db.lc.levels[1].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 1\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 0, cdef))\n\t\t\t// foo version 3 (both) should be dropped after compaction.\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{{\"foo\", \"bar\", 4, 0}, {\"fooz\", \"baz\", 1, 0}})\n\t\t})\n\t})\n\n\tt.Run(\"level 0 to level 1 with lower overlap\", func(t *testing.T) {\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\tl0 := []keyValVersion{{\"foo\", \"bar\", 4, 0}, {\"fooz\", \"baz\", 1, 0}}\n\t\t\tl01 := []keyValVersion{{\"foo\", \"bar\", 3, 0}}\n\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 2, 0}}\n\t\t\tl2 := []keyValVersion{{\"foo\", \"bar\", 1, 0}}\n\t\t\t// Level 0 has table l0 and l01.\n\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\tcreateAndOpen(db, l01, 0)\n\t\t\t// Level 1 has table l1.\n\t\t\tcreateAndOpen(db, l1, 1)\n\t\t\t// Level 2 has table l2.\n\t\t\tcreateAndOpen(db, l2, 2)\n\n\t\t\t// Set a high discard timestamp so that all the keys are below the discard timestamp.\n\t\t\tdb.SetDiscardTs(10)\n\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 4, 0}, {\"foo\", \"bar\", 3, 0}, {\"foo\", \"bar\", 2, 0},\n\t\t\t\t{\"foo\", \"bar\", 1, 0}, {\"fooz\", \"baz\", 1, 0},\n\t\t\t})\n\t\t\tcdef := compactDef{\n\t\t\t\tthisLevel: db.lc.levels[0],\n\t\t\t\tnextLevel: db.lc.levels[1],\n\t\t\t\ttop:       db.lc.levels[0].tables,\n\t\t\t\tbot:       db.lc.levels[1].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 1\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 0, cdef))\n\t\t\t// foo version 2 and version 1 should be dropped after 
compaction.\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 4, 0}, {\"foo\", \"bar\", 1, 0}, {\"fooz\", \"baz\", 1, 0},\n\t\t\t})\n\t\t})\n\t})\n\n\tt.Run(\"level 1 to level 2\", func(t *testing.T) {\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 3, 0}, {\"fooz\", \"baz\", 1, 0}}\n\t\t\tl2 := []keyValVersion{{\"foo\", \"bar\", 2, 0}}\n\t\t\tcreateAndOpen(db, l1, 1)\n\t\t\tcreateAndOpen(db, l2, 2)\n\n\t\t\t// Set a high discard timestamp so that all the keys are below the discard timestamp.\n\t\t\tdb.SetDiscardTs(10)\n\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 3, 0}, {\"foo\", \"bar\", 2, 0}, {\"fooz\", \"baz\", 1, 0},\n\t\t\t})\n\t\t\tcdef := compactDef{\n\t\t\t\tthisLevel: db.lc.levels[1],\n\t\t\t\tnextLevel: db.lc.levels[2],\n\t\t\t\ttop:       db.lc.levels[1].tables,\n\t\t\t\tbot:       db.lc.levels[2].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 2\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 1, cdef))\n\t\t\t// foo version 2 should be dropped after compaction.\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{{\"foo\", \"bar\", 3, 0}, {\"fooz\", \"baz\", 1, 0}})\n\t\t})\n\t})\n\n\tt.Run(\"level 1 to level 2 with delete\", func(t *testing.T) {\n\t\tt.Run(\"with overlap\", func(t *testing.T) {\n\t\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 3, bitDelete}, {\"fooz\", \"baz\", 1, bitDelete}}\n\t\t\t\tl2 := []keyValVersion{{\"foo\", \"bar\", 2, 0}}\n\t\t\t\tl3 := []keyValVersion{{\"foo\", \"bar\", 1, 0}}\n\t\t\t\tcreateAndOpen(db, l1, 1)\n\t\t\t\tcreateAndOpen(db, l2, 2)\n\t\t\t\tcreateAndOpen(db, l3, 3)\n\n\t\t\t\t// Set a high discard timestamp so that all the keys are below the discard timestamp.\n\t\t\t\tdb.SetDiscardTs(10)\n\n\t\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t\t{\"foo\", \"bar\", 3, 1},\n\t\t\t\t\t{\"foo\", \"bar\", 2, 
0},\n\t\t\t\t\t{\"foo\", \"bar\", 1, 0},\n\t\t\t\t\t{\"fooz\", \"baz\", 1, 1},\n\t\t\t\t})\n\t\t\t\tcdef := compactDef{\n\t\t\t\t\tthisLevel: db.lc.levels[1],\n\t\t\t\t\tnextLevel: db.lc.levels[2],\n\t\t\t\t\ttop:       db.lc.levels[1].tables,\n\t\t\t\t\tbot:       db.lc.levels[2].tables,\n\t\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t\t}\n\t\t\t\tcdef.t.baseLevel = 2\n\t\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 1, cdef))\n\t\t\t\t// foo bar version 2 should be dropped after compaction. fooz\n\t\t\t\t// baz version 1 will remain because overlap exists, which is\n\t\t\t\t// expected because `hasOverlap` is only checked once at the\n\t\t\t\t// beginning of `compactBuildTables` method.\n\t\t\t\t// everything from level 1 is now in level 2.\n\t\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t\t{\"foo\", \"bar\", 3, bitDelete},\n\t\t\t\t\t{\"foo\", \"bar\", 1, 0},\n\t\t\t\t\t{\"fooz\", \"baz\", 1, 1},\n\t\t\t\t})\n\n\t\t\t\tcdef = compactDef{\n\t\t\t\t\tthisLevel: db.lc.levels[2],\n\t\t\t\t\tnextLevel: db.lc.levels[3],\n\t\t\t\t\ttop:       db.lc.levels[2].tables,\n\t\t\t\t\tbot:       db.lc.levels[3].tables,\n\t\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t\t}\n\t\t\t\tcdef.t.baseLevel = 3\n\t\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 2, cdef))\n\t\t\t\t// everything should be removed now\n\t\t\t\tgetAllAndCheck(t, db, []keyValVersion{})\n\t\t\t})\n\t\t})\n\t\tt.Run(\"with bottom overlap\", func(t *testing.T) {\n\t\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 3, bitDelete}}\n\t\t\t\tl2 := []keyValVersion{{\"foo\", \"bar\", 2, 0}, {\"fooz\", \"baz\", 2, bitDelete}}\n\t\t\t\tl3 := []keyValVersion{{\"fooz\", \"baz\", 1, 0}}\n\t\t\t\tcreateAndOpen(db, l1, 1)\n\t\t\t\tcreateAndOpen(db, l2, 2)\n\t\t\t\tcreateAndOpen(db, l3, 3)\n\n\t\t\t\t// Set a high discard timestamp so that all the keys are below the discard timestamp.\n\t\t\t\tdb.SetDiscardTs(10)\n\n\t\t\t\tgetAllAndCheck(t, db, 
[]keyValVersion{\n\t\t\t\t\t{\"foo\", \"bar\", 3, bitDelete},\n\t\t\t\t\t{\"foo\", \"bar\", 2, 0},\n\t\t\t\t\t{\"fooz\", \"baz\", 2, bitDelete},\n\t\t\t\t\t{\"fooz\", \"baz\", 1, 0},\n\t\t\t\t})\n\t\t\t\tcdef := compactDef{\n\t\t\t\t\tthisLevel: db.lc.levels[1],\n\t\t\t\t\tnextLevel: db.lc.levels[2],\n\t\t\t\t\ttop:       db.lc.levels[1].tables,\n\t\t\t\t\tbot:       db.lc.levels[2].tables,\n\t\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t\t}\n\t\t\t\tcdef.t.baseLevel = 2\n\t\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 1, cdef))\n\t\t\t\t// the top table at L1 doesn't overlap L3, but the bottom table at L2\n\t\t\t\t// does, delete keys should not be removed.\n\t\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t\t{\"foo\", \"bar\", 3, bitDelete},\n\t\t\t\t\t{\"fooz\", \"baz\", 2, bitDelete},\n\t\t\t\t\t{\"fooz\", \"baz\", 1, 0},\n\t\t\t\t})\n\t\t\t})\n\t\t})\n\t\tt.Run(\"without overlap\", func(t *testing.T) {\n\t\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 3, bitDelete}, {\"fooz\", \"baz\", 1, bitDelete}}\n\t\t\t\tl2 := []keyValVersion{{\"fooo\", \"barr\", 2, 0}}\n\t\t\t\tcreateAndOpen(db, l1, 1)\n\t\t\t\tcreateAndOpen(db, l2, 2)\n\n\t\t\t\t// Set a high discard timestamp so that all the keys are below the discard timestamp.\n\t\t\t\tdb.SetDiscardTs(10)\n\n\t\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t\t{\"foo\", \"bar\", 3, 1}, {\"fooo\", \"barr\", 2, 0}, {\"fooz\", \"baz\", 1, 1},\n\t\t\t\t})\n\t\t\t\tcdef := compactDef{\n\t\t\t\t\tthisLevel: db.lc.levels[1],\n\t\t\t\t\tnextLevel: db.lc.levels[2],\n\t\t\t\t\ttop:       db.lc.levels[1].tables,\n\t\t\t\t\tbot:       db.lc.levels[2].tables,\n\t\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t\t}\n\t\t\t\tcdef.t.baseLevel = 2\n\t\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 1, cdef))\n\t\t\t\t// foo version 2 should be dropped after compaction.\n\t\t\t\tgetAllAndCheck(t, db, []keyValVersion{{\"fooo\", \"barr\", 2, 
0}})\n\t\t\t})\n\t\t})\n\t\tt.Run(\"with splits\", func(t *testing.T) {\n\t\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\t\tl1 := []keyValVersion{{\"C\", \"bar\", 3, bitDelete}}\n\t\t\t\tl21 := []keyValVersion{{\"A\", \"bar\", 2, 0}}\n\t\t\t\tl22 := []keyValVersion{{\"B\", \"bar\", 2, 0}}\n\t\t\t\tl23 := []keyValVersion{{\"C\", \"bar\", 2, 0}}\n\t\t\t\tl24 := []keyValVersion{{\"D\", \"bar\", 2, 0}}\n\t\t\t\tl3 := []keyValVersion{{\"fooz\", \"baz\", 1, 0}}\n\t\t\t\tcreateAndOpen(db, l1, 1)\n\t\t\t\tcreateAndOpen(db, l21, 2)\n\t\t\t\tcreateAndOpen(db, l22, 2)\n\t\t\t\tcreateAndOpen(db, l23, 2)\n\t\t\t\tcreateAndOpen(db, l24, 2)\n\t\t\t\tcreateAndOpen(db, l3, 3)\n\n\t\t\t\t// Set a high discard timestamp so that all the keys are below the discard timestamp.\n\t\t\t\tdb.SetDiscardTs(10)\n\n\t\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t\t{\"A\", \"bar\", 2, 0},\n\t\t\t\t\t{\"B\", \"bar\", 2, 0},\n\t\t\t\t\t{\"C\", \"bar\", 3, bitDelete},\n\t\t\t\t\t{\"C\", \"bar\", 2, 0},\n\t\t\t\t\t{\"D\", \"bar\", 2, 0},\n\t\t\t\t\t{\"fooz\", \"baz\", 1, 0},\n\t\t\t\t})\n\t\t\t\tcdef := compactDef{\n\t\t\t\t\tthisLevel: db.lc.levels[1],\n\t\t\t\t\tnextLevel: db.lc.levels[2],\n\t\t\t\t\ttop:       db.lc.levels[1].tables,\n\t\t\t\t\tbot:       db.lc.levels[2].tables,\n\t\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t\t}\n\t\t\t\tcdef.t.baseLevel = 2\n\t\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 1, cdef))\n\t\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t\t{\"A\", \"bar\", 2, 0},\n\t\t\t\t\t{\"B\", \"bar\", 2, 0},\n\t\t\t\t\t{\"D\", \"bar\", 2, 0},\n\t\t\t\t\t{\"fooz\", \"baz\", 1, 0},\n\t\t\t\t})\n\t\t\t})\n\t\t})\n\t})\n}\n\nfunc TestCompactionTwoVersions(t *testing.T) {\n\t// Disable compactions and keep two versions of each key.\n\topt := DefaultOptions(\"\").WithNumCompactors(0).WithNumVersionsToKeep(2)\n\topt.managedTxns = true\n\tt.Run(\"with overlap\", func(t *testing.T) {\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\tl1 
:= []keyValVersion{{\"foo\", \"bar\", 3, 0}, {\"fooz\", \"baz\", 1, bitDelete}}\n\t\t\tl2 := []keyValVersion{{\"foo\", \"bar\", 2, 0}}\n\t\t\tl3 := []keyValVersion{{\"foo\", \"bar\", 1, 0}}\n\t\t\tcreateAndOpen(db, l1, 1)\n\t\t\tcreateAndOpen(db, l2, 2)\n\t\t\tcreateAndOpen(db, l3, 3)\n\n\t\t\t// Set a high discard timestamp so that all the keys are below the discard timestamp.\n\t\t\tdb.SetDiscardTs(10)\n\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 3, 0},\n\t\t\t\t{\"foo\", \"bar\", 2, 0},\n\t\t\t\t{\"foo\", \"bar\", 1, 0},\n\t\t\t\t{\"fooz\", \"baz\", 1, 1},\n\t\t\t})\n\t\t\tcdef := compactDef{\n\t\t\t\tthisLevel: db.lc.levels[1],\n\t\t\t\tnextLevel: db.lc.levels[2],\n\t\t\t\ttop:       db.lc.levels[1].tables,\n\t\t\t\tbot:       db.lc.levels[2].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 2\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 1, cdef))\n\t\t\t// Nothing should be dropped after compaction because number of\n\t\t\t// versions to keep is 2.\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 3, 0},\n\t\t\t\t{\"foo\", \"bar\", 2, 0},\n\t\t\t\t{\"foo\", \"bar\", 1, 0},\n\t\t\t\t{\"fooz\", \"baz\", 1, 1},\n\t\t\t})\n\n\t\t\tcdef = compactDef{\n\t\t\t\tthisLevel: db.lc.levels[2],\n\t\t\t\tnextLevel: db.lc.levels[3],\n\t\t\t\ttop:       db.lc.levels[2].tables,\n\t\t\t\tbot:       db.lc.levels[3].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 3\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 2, cdef))\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 3, 0},\n\t\t\t\t{\"foo\", \"bar\", 2, 0},\n\t\t\t})\n\t\t})\n\t})\n}\n\nfunc TestCompactionAllVersions(t *testing.T) {\n\t// Disable compactions and keep all versions of each key.\n\topt := DefaultOptions(\"\").WithNumCompactors(0).WithNumVersionsToKeep(math.MaxInt32)\n\topt.managedTxns = true\n\tt.Run(\"without overlap\", func(t *testing.T) 
{\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 3, 0}, {\"fooz\", \"baz\", 1, bitDelete}}\n\t\t\tl2 := []keyValVersion{{\"foo\", \"bar\", 2, 0}}\n\t\t\tl3 := []keyValVersion{{\"foo\", \"bar\", 1, 0}}\n\t\t\tcreateAndOpen(db, l1, 1)\n\t\t\tcreateAndOpen(db, l2, 2)\n\t\t\tcreateAndOpen(db, l3, 3)\n\n\t\t\t// Set a high discard timestamp so that all the keys are below the discard timestamp.\n\t\t\tdb.SetDiscardTs(10)\n\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 3, 0},\n\t\t\t\t{\"foo\", \"bar\", 2, 0},\n\t\t\t\t{\"foo\", \"bar\", 1, 0},\n\t\t\t\t{\"fooz\", \"baz\", 1, 1},\n\t\t\t})\n\t\t\tcdef := compactDef{\n\t\t\t\tthisLevel: db.lc.levels[1],\n\t\t\t\tnextLevel: db.lc.levels[2],\n\t\t\t\ttop:       db.lc.levels[1].tables,\n\t\t\t\tbot:       db.lc.levels[2].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 2\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 1, cdef))\n\t\t\t// Nothing should be dropped after compaction because all versions\n\t\t\t// should be kept.\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 3, 0},\n\t\t\t\t{\"foo\", \"bar\", 2, 0},\n\t\t\t\t{\"foo\", \"bar\", 1, 0},\n\t\t\t\t{\"fooz\", \"baz\", 1, 1},\n\t\t\t})\n\n\t\t\tcdef = compactDef{\n\t\t\t\tthisLevel: db.lc.levels[2],\n\t\t\t\tnextLevel: db.lc.levels[3],\n\t\t\t\ttop:       db.lc.levels[2].tables,\n\t\t\t\tbot:       db.lc.levels[3].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 3\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 2, cdef))\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 3, 0},\n\t\t\t\t{\"foo\", \"bar\", 2, 0},\n\t\t\t\t{\"foo\", \"bar\", 1, 0},\n\t\t\t})\n\t\t})\n\t})\n\tt.Run(\"without overlap\", func(t *testing.T) {\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 3, bitDelete}, {\"fooz\", \"baz\", 1, 
bitDelete}}\n\t\t\tl2 := []keyValVersion{{\"fooo\", \"barr\", 2, 0}}\n\t\t\tcreateAndOpen(db, l1, 1)\n\t\t\tcreateAndOpen(db, l2, 2)\n\n\t\t\t// Set a high discard timestamp so that all the keys are below the discard timestamp.\n\t\t\tdb.SetDiscardTs(10)\n\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 3, 1}, {\"fooo\", \"barr\", 2, 0}, {\"fooz\", \"baz\", 1, 1},\n\t\t\t})\n\t\t\tcdef := compactDef{\n\t\t\t\tthisLevel: db.lc.levels[1],\n\t\t\t\tnextLevel: db.lc.levels[2],\n\t\t\t\ttop:       db.lc.levels[1].tables,\n\t\t\t\tbot:       db.lc.levels[2].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 2\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 1, cdef))\n\t\t\t// The deleted keys foo and fooz should be dropped after compaction.\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{{\"fooo\", \"barr\", 2, 0}})\n\t\t})\n\t})\n}\n\nfunc TestDiscardTs(t *testing.T) {\n\t// Disable compactions and keep a single version of each key.\n\topt := DefaultOptions(\"\").WithNumCompactors(0).WithNumVersionsToKeep(1)\n\topt.managedTxns = true\n\n\tt.Run(\"all keys above discardTs\", func(t *testing.T) {\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\tl0 := []keyValVersion{{\"foo\", \"bar\", 4, 0}, {\"fooz\", \"baz\", 3, 0}}\n\t\t\tl01 := []keyValVersion{{\"foo\", \"bar\", 3, 0}}\n\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 2, 0}}\n\t\t\t// Level 0 has tables l0 and l01.\n\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\tcreateAndOpen(db, l01, 0)\n\t\t\t// Level 1 has table l1.\n\t\t\tcreateAndOpen(db, l1, 1)\n\n\t\t\t// Set discardTs to 1. 
All the keys are above discardTs.\n\t\t\tdb.SetDiscardTs(1)\n\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 4, 0}, {\"foo\", \"bar\", 3, 0},\n\t\t\t\t{\"foo\", \"bar\", 2, 0}, {\"fooz\", \"baz\", 3, 0},\n\t\t\t})\n\t\t\tcdef := compactDef{\n\t\t\t\tthisLevel: db.lc.levels[0],\n\t\t\t\tnextLevel: db.lc.levels[1],\n\t\t\t\ttop:       db.lc.levels[0].tables,\n\t\t\t\tbot:       db.lc.levels[1].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 1\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 0, cdef))\n\t\t\t// No keys should be dropped.\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 4, 0}, {\"foo\", \"bar\", 3, 0},\n\t\t\t\t{\"foo\", \"bar\", 2, 0}, {\"fooz\", \"baz\", 3, 0},\n\t\t\t})\n\t\t})\n\t})\n\tt.Run(\"some keys above discardTs\", func(t *testing.T) {\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\tl0 := []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 4, 0}, {\"foo\", \"bar\", 3, 0},\n\t\t\t\t{\"foo\", \"bar\", 2, 0}, {\"fooz\", \"baz\", 2, 0},\n\t\t\t}\n\t\t\tl1 := []keyValVersion{{\"foo\", \"bbb\", 1, 0}}\n\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\tcreateAndOpen(db, l1, 1)\n\n\t\t\t// Set discardTs to 3. 
foo2 and foo1 should be dropped.\n\t\t\tdb.SetDiscardTs(3)\n\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 4, 0}, {\"foo\", \"bar\", 3, 0}, {\"foo\", \"bar\", 2, 0},\n\t\t\t\t{\"foo\", \"bbb\", 1, 0}, {\"fooz\", \"baz\", 2, 0},\n\t\t\t})\n\t\t\tcdef := compactDef{\n\t\t\t\tthisLevel: db.lc.levels[0],\n\t\t\t\tnextLevel: db.lc.levels[1],\n\t\t\t\ttop:       db.lc.levels[0].tables,\n\t\t\t\tbot:       db.lc.levels[1].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 1\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 0, cdef))\n\t\t\t// foo1 and foo2 should be dropped.\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 4, 0}, {\"foo\", \"bar\", 3, 0}, {\"fooz\", \"baz\", 2, 0},\n\t\t\t})\n\t\t})\n\t})\n\tt.Run(\"all keys below discardTs\", func(t *testing.T) {\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\tl0 := []keyValVersion{{\"foo\", \"bar\", 4, 0}, {\"fooz\", \"baz\", 3, 0}}\n\t\t\tl01 := []keyValVersion{{\"foo\", \"bar\", 3, 0}}\n\t\t\tl1 := []keyValVersion{{\"foo\", \"bar\", 2, 0}}\n\t\t\t// Level 0 has table l0 and l01.\n\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\tcreateAndOpen(db, l01, 0)\n\t\t\t// Level 1 has table l1.\n\t\t\tcreateAndOpen(db, l1, 1)\n\n\t\t\t// Set discardTs to 10. 
All the keys are below discardTs.\n\t\t\tdb.SetDiscardTs(10)\n\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 4, 0}, {\"foo\", \"bar\", 3, 0},\n\t\t\t\t{\"foo\", \"bar\", 2, 0}, {\"fooz\", \"baz\", 3, 0},\n\t\t\t})\n\t\t\tcdef := compactDef{\n\t\t\t\tthisLevel: db.lc.levels[0],\n\t\t\t\tnextLevel: db.lc.levels[1],\n\t\t\t\ttop:       db.lc.levels[0].tables,\n\t\t\t\tbot:       db.lc.levels[1].tables,\n\t\t\t\tt:         db.lc.levelTargets(),\n\t\t\t}\n\t\t\tcdef.t.baseLevel = 1\n\t\t\trequire.NoError(t, db.lc.runCompactDef(-1, 0, cdef))\n\t\t\t// Only one version of every key should be left.\n\t\t\tgetAllAndCheck(t, db, []keyValVersion{{\"foo\", \"bar\", 4, 0}, {\"fooz\", \"baz\", 3, 0}})\n\t\t})\n\t})\n}\n\n// This is a test to ensure that the first entry with DiscardEarlierversion bit < DiscardTs\n// is kept around (when numversionstokeep is infinite).\nfunc TestDiscardFirstVersion(t *testing.T) {\n\topt := DefaultOptions(\"\")\n\topt.NumCompactors = 0\n\topt.NumVersionsToKeep = math.MaxInt32\n\topt.managedTxns = true\n\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tl0 := []keyValVersion{{\"foo\", \"bar\", 1, 0}}\n\t\tl01 := []keyValVersion{{\"foo\", \"bar\", 2, bitDiscardEarlierVersions}}\n\t\tl02 := []keyValVersion{{\"foo\", \"bar\", 3, 0}}\n\t\tl03 := []keyValVersion{{\"foo\", \"bar\", 4, 0}}\n\t\tl04 := []keyValVersion{{\"foo\", \"bar\", 9, 0}}\n\t\tl05 := []keyValVersion{{\"foo\", \"bar\", 10, bitDiscardEarlierVersions}}\n\n\t\t// Level 0 has all the tables.\n\t\tcreateAndOpen(db, l0, 0)\n\t\tcreateAndOpen(db, l01, 0)\n\t\tcreateAndOpen(db, l02, 0)\n\t\tcreateAndOpen(db, l03, 0)\n\t\tcreateAndOpen(db, l04, 0)\n\t\tcreateAndOpen(db, l05, 0)\n\n\t\t// Discard Time stamp is set to 7.\n\t\tdb.SetDiscardTs(7)\n\n\t\t// Compact L0 to L1\n\t\tcdef := compactDef{\n\t\t\tthisLevel: db.lc.levels[0],\n\t\t\tnextLevel: db.lc.levels[1],\n\t\t\ttop:       db.lc.levels[0].tables,\n\t\t\tbot:       db.lc.levels[1].tables,\n\t\t\tt:    
     db.lc.levelTargets(),\n\t\t}\n\t\tcdef.t.baseLevel = 1\n\t\trequire.NoError(t, db.lc.runCompactDef(-1, 0, cdef))\n\n\t\t// - Version 10, 9 lie above version 7 so they should be there.\n\t\t// - Version 4, 3, 2 lie below the discardTs but they don't have the\n\t\t//   \"bitDiscardEarlierVersions\" versions set so they should not be removed because number\n\t\t//    of versions to keep is set to infinite.\n\t\t// - Version 1 is below DiscardTS and below the first \"bitDiscardEarlierVersions\"\n\t\t//   marker so IT WILL BE REMOVED.\n\t\tExpectedKeys := []keyValVersion{\n\t\t\t{\"foo\", \"bar\", 10, bitDiscardEarlierVersions},\n\t\t\t{\"foo\", \"bar\", 9, 0},\n\t\t\t{\"foo\", \"bar\", 4, 0},\n\t\t\t{\"foo\", \"bar\", 3, 0},\n\t\t\t{\"foo\", \"bar\", 2, bitDiscardEarlierVersions}}\n\n\t\tgetAllAndCheck(t, db, ExpectedKeys)\n\t})\n}\n\n// This test ensures we don't stall when L1's size is greater than opt.LevelOneSize.\n// We should stall only when L0 tables more than the opt.NumLevelZeroTableStall.\nfunc TestL1Stall(t *testing.T) {\n\t// TODO(ibrahim): Is this test still valid?\n\tt.Skip()\n\topt := DefaultOptions(\"\")\n\t// Disable all compactions.\n\topt.NumCompactors = 0\n\t// Number of level zero tables.\n\topt.NumLevelZeroTables = 3\n\t// Addition of new tables will stall if there are 4 or more L0 tables.\n\topt.NumLevelZeroTablesStall = 4\n\t// Level 1 size is 10 bytes.\n\topt.BaseLevelSize = 10\n\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t// Level 0 has 4 tables.\n\t\tdb.lc.levels[0].Lock()\n\t\tdb.lc.levels[0].tables = []*table.Table{createEmptyTable(db), createEmptyTable(db),\n\t\t\tcreateEmptyTable(db), createEmptyTable(db)}\n\t\tdb.lc.levels[0].Unlock()\n\n\t\ttimeout := time.After(5 * time.Second)\n\t\tdone := make(chan bool)\n\n\t\t// This is important. 
Set level 1 size more than the opt.LevelOneSize (we've set it to 10).\n\t\tdb.lc.levels[1].totalSize = 100\n\t\tgo func() {\n\t\t\ttab := createEmptyTable(db)\n\t\t\trequire.NoError(t, db.lc.addLevel0Table(tab))\n\t\t\trequire.NoError(t, tab.DecrRef())\n\t\t\tdone <- true\n\t\t}()\n\t\ttime.Sleep(time.Second)\n\n\t\tdb.lc.levels[0].Lock()\n\t\t// Drop two tables from Level 0 so that addLevel0Table can make progress. Earlier table\n\t\t// count was 4 which is equal to L0 stall count.\n\t\ttoDrop := db.lc.levels[0].tables[:2]\n\t\trequire.NoError(t, decrRefs(toDrop))\n\t\tdb.lc.levels[0].tables = db.lc.levels[0].tables[2:]\n\t\tdb.lc.levels[0].Unlock()\n\n\t\tselect {\n\t\tcase <-timeout:\n\t\t\tt.Fatal(\"Test didn't finish in time\")\n\t\tcase <-done:\n\t\t}\n\t})\n}\n\nfunc createEmptyTable(db *DB) *table.Table {\n\topts := table.Options{\n\t\tBloomFalsePositive: db.opt.BloomFalsePositive,\n\t\tChkMode:            options.NoVerification,\n\t}\n\tb := table.NewTableBuilder(opts)\n\tdefer b.Close()\n\t// Add one key so that we can open this table.\n\tb.Add(y.KeyWithTs([]byte(\"foo\"), 1), y.ValueStruct{}, 0)\n\n\t// Open table in memory to avoid adding changes to manifest file.\n\ttab, err := table.OpenInMemoryTable(b.Finish(), db.lc.reserveFileID(), &opts)\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\n\treturn tab\n}\n\nfunc TestL0Stall(t *testing.T) {\n\t// TODO(ibrahim): Is this test still valid?\n\tt.Skip()\n\topt := DefaultOptions(\"\")\n\t// Disable all compactions.\n\topt.NumCompactors = 0\n\t// Number of level zero tables.\n\topt.NumLevelZeroTables = 3\n\t// Addition of new tables will stall if there are 4 or more L0 tables.\n\topt.NumLevelZeroTablesStall = 4\n\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tdb.lc.levels[0].Lock()\n\t\t// Add NumLevelZeroTableStall+1 number of tables to level 0. 
This would fill up level\n\t\t// zero and all new additions are expected to stall if L0 is in memory.\n\t\tfor i := 0; i < opt.NumLevelZeroTablesStall+1; i++ {\n\t\t\tdb.lc.levels[0].tables = append(db.lc.levels[0].tables, createEmptyTable(db))\n\t\t}\n\t\tdb.lc.levels[0].Unlock()\n\n\t\ttimeout := time.After(5 * time.Second)\n\t\tdone := make(chan bool)\n\n\t\tgo func() {\n\t\t\ttab := createEmptyTable(db)\n\t\t\trequire.NoError(t, db.lc.addLevel0Table(tab))\n\t\t\trequire.NoError(t, tab.DecrRef())\n\t\t\tdone <- true\n\t\t}()\n\t\t// Let it stall for a second.\n\t\ttime.Sleep(time.Second)\n\n\t\tselect {\n\t\tcase <-timeout:\n\t\t\tt.Log(\"Timeout triggered\")\n\t\t\t// Mark this test as successful since L0 is in memory and the\n\t\t\t// addition of a new table to L0 is supposed to stall.\n\n\t\t\t// Remove tables from level 0 so that the stalled\n\t\t\t// compaction can make progress. This does not have any\n\t\t\t// effect on the test. This is done so that the goroutine\n\t\t\t// stuck on addLevel0Table can make progress and end.\n\t\t\tdb.lc.levels[0].Lock()\n\t\t\tdb.lc.levels[0].tables = nil\n\t\t\tdb.lc.levels[0].Unlock()\n\t\t\t<-done\n\t\tcase <-done:\n\t\t\t// The addition completed before the 5 second timeout, so it did not stall. Fail the test.\n\t\t\tt.Fatal(\"Test did not stall\")\n\t\t}\n\t})\n}\n\nfunc TestLevelGet(t *testing.T) {\n\tcreateLevel := func(db *DB, level int, data [][]keyValVersion) {\n\t\tfor _, v := range data {\n\t\t\tcreateAndOpen(db, v, level)\n\t\t}\n\t}\n\ttype testData struct {\n\t\tname string\n\t\t// Keys on each level. 
keyValVersion[0] is the first table and so on.\n\t\tlevelData map[int][][]keyValVersion\n\t\texpect    []keyValVersion\n\t}\n\ttest := func(t *testing.T, ti testData, db *DB) {\n\t\tfor level, data := range ti.levelData {\n\t\t\tcreateLevel(db, level, data)\n\t\t}\n\t\tfor _, item := range ti.expect {\n\t\t\tkey := y.KeyWithTs([]byte(item.key), uint64(item.version))\n\t\t\tvs, err := db.get(key)\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, item.val, string(vs.Value), \"key:%s ver:%d\", item.key, item.version)\n\t\t}\n\t}\n\ttt := []testData{\n\t\t{\n\t\t\t\"Normal\",\n\t\t\tmap[int][][]keyValVersion{\n\t\t\t\t0: { // Level 0 has 2 tables and each table has single key.\n\t\t\t\t\t{{\"foo\", \"bar10\", 10, 0}},\n\t\t\t\t\t{{\"foo\", \"barSeven\", 7, 0}},\n\t\t\t\t},\n\t\t\t\t1: { // Level 1 has 1 table with a single key.\n\t\t\t\t\t{{\"foo\", \"bar\", 1, 0}},\n\t\t\t\t},\n\t\t\t},\n\t\t\t[]keyValVersion{\n\t\t\t\t{\"foo\", \"bar\", 1, 0},\n\t\t\t\t{\"foo\", \"barSeven\", 7, 0},\n\t\t\t\t{\"foo\", \"bar10\", 10, 0},\n\t\t\t\t{\"foo\", \"bar10\", 11, 0},     // ver 11 doesn't exist so we should get bar10.\n\t\t\t\t{\"foo\", \"barSeven\", 9, 0},   // ver 9 doesn't exist so we should get barSeven.\n\t\t\t\t{\"foo\", \"bar10\", 100000, 0}, // ver doesn't exist so we should get bar10.\n\t\t\t},\n\t\t},\n\t\t{\"after gc\",\n\t\t\tmap[int][][]keyValVersion{\n\t\t\t\t0: { // Level 0 has 3 tables and each table has single key.\n\t\t\t\t\t{{\"foo\", \"barNew\", 1, 0}}, // foo1 is above foo10 because of the GC.\n\t\t\t\t\t{{\"foo\", \"bar10\", 10, 0}},\n\t\t\t\t\t{{\"foo\", \"barSeven\", 7, 0}},\n\t\t\t\t},\n\t\t\t\t1: { // Level 1 has 1 table with a single key.\n\t\t\t\t\t{{\"foo\", \"bar\", 1, 0}},\n\t\t\t\t},\n\t\t\t},\n\t\t\t[]keyValVersion{\n\t\t\t\t{\"foo\", \"barNew\", 1, 0},\n\t\t\t\t{\"foo\", \"barSeven\", 7, 0},\n\t\t\t\t{\"foo\", \"bar10\", 10, 0},\n\t\t\t\t{\"foo\", \"bar10\", 11, 0}, // Should return biggest version.\n\t\t\t},\n\t\t},\n\t\t{\"after two 
gc\",\n\t\t\tmap[int][][]keyValVersion{\n\t\t\t\t0: { // Level 0 has 3 tables and each table has single key.\n\t\t\t\t\t{{\"foo\", \"barL0\", 1, 0}}, // foo1 is above foo10 because of the GC.\n\t\t\t\t\t{{\"foo\", \"bar10\", 10, 0}},\n\t\t\t\t\t{{\"foo\", \"barSeven\", 7, 0}},\n\t\t\t\t},\n\t\t\t\t1: { // Level 1 has 1 table with a single key.\n\t\t\t\t\t// Level 1 also has a foo because it was moved twice during GC.\n\t\t\t\t\t{{\"foo\", \"barL1\", 1, 0}},\n\t\t\t\t},\n\t\t\t\t2: { // Level 2 has 1 table with a single key.\n\t\t\t\t\t{{\"foo\", \"bar\", 1, 0}},\n\t\t\t\t},\n\t\t\t},\n\t\t\t[]keyValVersion{\n\t\t\t\t{\"foo\", \"barL0\", 1, 0},\n\t\t\t\t{\"foo\", \"barSeven\", 7, 0},\n\t\t\t\t{\"foo\", \"bar10\", 10, 0},\n\t\t\t\t{\"foo\", \"bar10\", 11, 0}, // Should return biggest version.\n\t\t\t},\n\t\t},\n\t}\n\tfor _, ti := range tt {\n\t\tt.Run(ti.name, func(t *testing.T) {\n\t\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\t\ttest(t, ti, db)\n\t\t\t})\n\n\t\t})\n\t}\n}\n\nfunc TestKeyVersions(t *testing.T) {\n\tinMemoryOpt := DefaultOptions(\"\").\n\t\tWithSyncWrites(false).\n\t\tWithInMemory(true)\n\n\tt.Run(\"disk\", func(t *testing.T) {\n\t\tt.Run(\"small table\", func(t *testing.T) {\n\t\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\t\tl0 := make([]keyValVersion, 0)\n\t\t\t\tfor i := 0; i < 10; i++ {\n\t\t\t\t\tl0 = append(l0, keyValVersion{fmt.Sprintf(\"%05d\", i), \"foo\", 1, 0})\n\t\t\t\t}\n\t\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\t\trequire.Equal(t, 2, len(db.Ranges(nil, 10000)))\n\t\t\t})\n\t\t})\n\t\tt.Run(\"medium table\", func(t *testing.T) {\n\t\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\t\tl0 := make([]keyValVersion, 0)\n\t\t\t\tfor i := 0; i < 1000; i++ {\n\t\t\t\t\tl0 = append(l0, keyValVersion{fmt.Sprintf(\"%05d\", i), \"foo\", 1, 0})\n\t\t\t\t}\n\t\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\t\trequire.Equal(t, 8, len(db.Ranges(nil, 10000)))\n\t\t\t})\n\t\t})\n\t\tt.Run(\"large table\", func(t *testing.T) 
{\n\t\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\t\tl0 := make([]keyValVersion, 0)\n\t\t\t\tfor i := 0; i < 10000; i++ {\n\t\t\t\t\tl0 = append(l0, keyValVersion{fmt.Sprintf(\"%05d\", i), \"foo\", 1, 0})\n\t\t\t\t}\n\t\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\t\trequire.Equal(t, 62, len(db.Ranges(nil, 10000)))\n\t\t\t})\n\t\t})\n\t\tt.Run(\"prefix\", func(t *testing.T) {\n\t\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\t\tl0 := make([]keyValVersion, 0)\n\t\t\t\tfor i := 0; i < 1000; i++ {\n\t\t\t\t\tl0 = append(l0, keyValVersion{fmt.Sprintf(\"%05d\", i), \"foo\", 1, 0})\n\t\t\t\t}\n\t\t\t\tcreateAndOpen(db, l0, 0)\n\t\t\t\trequire.Equal(t, 1, len(db.Ranges([]byte(\"a\"), 10000)))\n\t\t\t})\n\t\t})\n\t})\n\n\tt.Run(\"in-memory\", func(t *testing.T) {\n\t\tt.Run(\"small table\", func(t *testing.T) {\n\t\t\trunBadgerTest(t, &inMemoryOpt, func(t *testing.T, db *DB) {\n\t\t\t\twriter := db.newWriteBatch(false)\n\t\t\t\tfor i := 0; i < 10; i++ {\n\t\t\t\t\trequire.NoError(t, writer.Set([]byte(fmt.Sprintf(\"%05d\", i)), []byte(\"foo\")))\n\t\t\t\t}\n\t\t\t\trequire.NoError(t, writer.Flush())\n\t\t\t\trequire.Equal(t, 2, len(db.Ranges(nil, 10000)))\n\t\t\t})\n\t\t})\n\t\tt.Run(\"large table\", func(t *testing.T) {\n\t\t\trunBadgerTest(t, &inMemoryOpt, func(t *testing.T, db *DB) {\n\t\t\t\twriter := db.newWriteBatch(false)\n\t\t\t\tfor i := 0; i < 100000; i++ {\n\t\t\t\t\trequire.NoError(t, writer.Set([]byte(fmt.Sprintf(\"%05d\", i)), []byte(\"foo\")))\n\t\t\t\t}\n\t\t\t\trequire.NoError(t, writer.Flush())\n\t\t\t\trequire.Equal(t, 11, len(db.Ranges(nil, 10000)))\n\t\t\t})\n\t\t})\n\t\tt.Run(\"prefix\", func(t *testing.T) {\n\t\t\trunBadgerTest(t, &inMemoryOpt, func(t *testing.T, db *DB) {\n\t\t\t\twriter := db.newWriteBatch(false)\n\t\t\t\tfor i := 0; i < 10000; i++ {\n\t\t\t\t\trequire.NoError(t, writer.Set([]byte(fmt.Sprintf(\"%05d\", i)), []byte(\"foo\")))\n\t\t\t\t}\n\t\t\t\trequire.NoError(t, writer.Flush())\n\t\t\t\trequire.Equal(t, 1, 
len(db.Ranges([]byte(\"a\"), 10000)))\n\t\t\t})\n\t\t})\n\t})\n}\n\nfunc TestSameLevel(t *testing.T) {\n\topt := DefaultOptions(\"\")\n\topt.NumCompactors = 0\n\topt.NumVersionsToKeep = math.MaxInt32\n\topt.managedTxns = true\n\topt.LmaxCompaction = true\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tl6 := []keyValVersion{\n\t\t\t{\"A\", \"bar\", 4, bitDiscardEarlierVersions}, {\"A\", \"bar\", 3, 0},\n\t\t\t{\"A\", \"bar\", 2, 0}, {\"Afoo\", \"baz\", 2, 0},\n\t\t}\n\t\tl61 := []keyValVersion{\n\t\t\t{\"B\", \"bar\", 4, bitDiscardEarlierVersions}, {\"B\", \"bar\", 3, 0},\n\t\t\t{\"B\", \"bar\", 2, 0}, {\"Bfoo\", \"baz\", 2, 0},\n\t\t}\n\t\tl62 := []keyValVersion{\n\t\t\t{\"C\", \"bar\", 4, bitDiscardEarlierVersions}, {\"C\", \"bar\", 3, 0},\n\t\t\t{\"C\", \"bar\", 2, 0}, {\"Cfoo\", \"baz\", 2, 0},\n\t\t}\n\t\tcreateAndOpen(db, l6, 6)\n\t\tcreateAndOpen(db, l61, 6)\n\t\tcreateAndOpen(db, l62, 6)\n\t\trequire.NoError(t, db.lc.validate())\n\n\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t{\"A\", \"bar\", 4, bitDiscardEarlierVersions}, {\"A\", \"bar\", 3, 0},\n\t\t\t{\"A\", \"bar\", 2, 0}, {\"Afoo\", \"baz\", 2, 0},\n\t\t\t{\"B\", \"bar\", 4, bitDiscardEarlierVersions}, {\"B\", \"bar\", 3, 0},\n\t\t\t{\"B\", \"bar\", 2, 0}, {\"Bfoo\", \"baz\", 2, 0},\n\t\t\t{\"C\", \"bar\", 4, bitDiscardEarlierVersions}, {\"C\", \"bar\", 3, 0},\n\t\t\t{\"C\", \"bar\", 2, 0}, {\"Cfoo\", \"baz\", 2, 0},\n\t\t})\n\n\t\tcdef := compactDef{\n\t\t\tthisLevel: db.lc.levels[6],\n\t\t\tnextLevel: db.lc.levels[6],\n\t\t\ttop:       []*table.Table{db.lc.levels[6].tables[0]},\n\t\t\tbot:       db.lc.levels[6].tables[1:],\n\t\t\tt:         db.lc.levelTargets(),\n\t\t}\n\t\tcdef.t.baseLevel = 1\n\t\t// Set discardTs to 3. 
Nothing should be dropped yet.\n\t\tdb.SetDiscardTs(3)\n\t\trequire.NoError(t, db.lc.runCompactDef(-1, 6, cdef))\n\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t{\"A\", \"bar\", 4, bitDiscardEarlierVersions}, {\"A\", \"bar\", 3, 0},\n\t\t\t{\"A\", \"bar\", 2, 0}, {\"Afoo\", \"baz\", 2, 0},\n\t\t\t{\"B\", \"bar\", 4, bitDiscardEarlierVersions}, {\"B\", \"bar\", 3, 0},\n\t\t\t{\"B\", \"bar\", 2, 0}, {\"Bfoo\", \"baz\", 2, 0},\n\t\t\t{\"C\", \"bar\", 4, bitDiscardEarlierVersions}, {\"C\", \"bar\", 3, 0},\n\t\t\t{\"C\", \"bar\", 2, 0}, {\"Cfoo\", \"baz\", 2, 0},\n\t\t})\n\n\t\trequire.NoError(t, db.lc.validate())\n\t\t// Set discardTs to 7.\n\t\tdb.SetDiscardTs(7)\n\t\tcdef = compactDef{\n\t\t\tthisLevel: db.lc.levels[6],\n\t\t\tnextLevel: db.lc.levels[6],\n\t\t\ttop:       []*table.Table{db.lc.levels[6].tables[0]},\n\t\t\tbot:       db.lc.levels[6].tables[1:],\n\t\t\tt:         db.lc.levelTargets(),\n\t\t}\n\t\tcdef.t.baseLevel = 1\n\t\trequire.NoError(t, db.lc.runCompactDef(-1, 6, cdef))\n\t\tgetAllAndCheck(t, db, []keyValVersion{\n\t\t\t{\"A\", \"bar\", 4, bitDiscardEarlierVersions}, {\"Afoo\", \"baz\", 2, 0},\n\t\t\t{\"B\", \"bar\", 4, bitDiscardEarlierVersions}, {\"Bfoo\", \"baz\", 2, 0},\n\t\t\t{\"C\", \"bar\", 4, bitDiscardEarlierVersions}, {\"Cfoo\", \"baz\", 2, 0}})\n\t\trequire.NoError(t, db.lc.validate())\n\t})\n}\n\nfunc TestTableContainsPrefix(t *testing.T) {\n\topts := table.Options{\n\t\tBlockSize:          4 * 1024,\n\t\tBloomFalsePositive: 0.01,\n\t}\n\n\tbuildTable := func(keys []string) *table.Table {\n\t\tfilename := fmt.Sprintf(\"%s%s%d.sst\", os.TempDir(), string(os.PathSeparator), rand.Uint32())\n\t\tb := table.NewTableBuilder(opts)\n\t\tdefer b.Close()\n\n\t\tv := []byte(\"value\")\n\t\tsort.Slice(keys, func(i, j int) bool {\n\t\t\treturn keys[i] < keys[j]\n\t\t})\n\t\tfor _, k := range keys {\n\t\t\tb.Add(y.KeyWithTs([]byte(k), 1), y.ValueStruct{Value: v}, 0)\n\t\t\tb.Add(y.KeyWithTs([]byte(k), 2), y.ValueStruct{Value: v}, 
0)\n\t\t}\n\t\ttbl, err := table.CreateTable(filename, b)\n\t\trequire.NoError(t, err)\n\t\treturn tbl\n\t}\n\n\ttbl := buildTable([]string{\"key1\", \"key3\", \"key31\", \"key32\", \"key4\"})\n\tdefer func() { require.NoError(t, tbl.DecrRef()) }()\n\n\trequire.True(t, containsPrefix(tbl, []byte(\"key\")))\n\trequire.True(t, containsPrefix(tbl, []byte(\"key1\")))\n\trequire.True(t, containsPrefix(tbl, []byte(\"key3\")))\n\trequire.True(t, containsPrefix(tbl, []byte(\"key32\")))\n\trequire.True(t, containsPrefix(tbl, []byte(\"key4\")))\n\n\trequire.False(t, containsPrefix(tbl, []byte(\"key0\")))\n\trequire.False(t, containsPrefix(tbl, []byte(\"key2\")))\n\trequire.False(t, containsPrefix(tbl, []byte(\"key323\")))\n\trequire.False(t, containsPrefix(tbl, []byte(\"key5\")))\n}\n\n// Test that if a compaction fails during fill tables process, its tables are  cleaned up and we are able\n// to do compaction on them again.\nfunc TestFillTableCleanup(t *testing.T) {\n\topt := DefaultOptions(\"\")\n\topt.managedTxns = true\n\t// Stop normal compactions from happening. 
They can take up the entire space causing the test to fail.\n\topt.NumCompactors = 0\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\topts := table.Options{\n\t\t\tBlockSize:          4 * 1024,\n\t\t\tBloomFalsePositive: 0.01,\n\t\t}\n\t\tbuildTable := func(prefix byte) *table.Table {\n\t\t\tfilename := table.NewFilename(db.lc.reserveFileID(), db.opt.Dir)\n\t\t\tb := table.NewTableBuilder(opts)\n\t\t\tdefer b.Close()\n\t\t\tkey := make([]byte, 100)\n\t\t\tval := make([]byte, 500)\n\t\t\trand.Read(val)\n\n\t\t\tcopy(key, []byte{prefix})\n\t\t\tcount := uint64(40000)\n\t\t\tfor i := count; i > 0; i-- {\n\t\t\t\tvar meta byte\n\t\t\t\tif i == 0 {\n\t\t\t\t\tmeta = bitDiscardEarlierVersions\n\t\t\t\t}\n\t\t\t\tb.AddStaleKey(y.KeyWithTs(key, i), y.ValueStruct{Meta: meta, Value: val}, 0)\n\t\t\t}\n\t\t\ttbl, err := table.CreateTable(filename, b)\n\t\t\trequire.NoError(t, err)\n\t\t\treturn tbl\n\t\t}\n\n\t\tbuildLevel := func(level int, num_tab int) {\n\t\t\tlh := db.lc.levels[level]\n\t\t\tfor i := byte(1); i < byte(num_tab); i++ {\n\t\t\t\ttab := buildTable(i)\n\t\t\t\trequire.NoError(t, db.manifest.addChanges([]*pb.ManifestChange{\n\t\t\t\t\tnewCreateChange(tab.ID(), level, 0, tab.CompressionType()),\n\t\t\t\t}, db.opt))\n\t\t\t\ttab.CreatedAt = time.Now().Add(-10 * time.Hour)\n\t\t\t\t// Add table to the given level.\n\t\t\t\tlh.addTable(tab)\n\t\t\t}\n\t\t\trequire.NoError(t, db.lc.validate())\n\n\t\t\trequire.NotZero(t, lh.getTotalStaleSize())\n\t\t}\n\n\t\tlevel := 6\n\t\tbuildLevel(level-1, 2)\n\t\tbuildLevel(level, 2)\n\n\t\tdb.SetDiscardTs(1 << 30)\n\t\t// Modify the target file size so that we can compact all tables at once.\n\t\ttt := db.lc.levelTargets()\n\t\ttt.fileSz[6] = 1 << 30\n\t\tprio := compactionPriority{level: 6, t: tt}\n\n\t\tcd := compactDef{\n\t\t\tcompactorId: 0,\n\t\t\tp:           prio,\n\t\t\tt:           prio.t,\n\t\t\tthisLevel:   db.lc.levels[level-1],\n\t\t\tnextLevel:   db.lc.levels[level],\n\t\t}\n\n\t\t// Fill tables passes 
first.\n\t\trequire.Equal(t, db.lc.fillTables(&cd), true)\n\t\t// Make sure that running compaction again fails, as the tables are being compacted.\n\t\trequire.Equal(t, db.lc.fillTables(&cd), false)\n\n\t\t// Reset the compaction status so the tables are no longer marked as being compacted.\n\t\tdb.lc.cstatus.delete(cd)\n\t\t// Test that compaction should be able to run again on these tables.\n\t\trequire.Equal(t, db.lc.fillTables(&cd), true)\n\t})\n}\n\nfunc TestStaleDataCleanup(t *testing.T) {\n\topt := DefaultOptions(\"\")\n\topt.managedTxns = true\n\topt.LmaxCompaction = true\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\topts := table.Options{\n\t\t\tBlockSize:          4 * 1024,\n\t\t\tBloomFalsePositive: 0.01,\n\t\t}\n\t\tbuildStaleTable := func(prefix byte) *table.Table {\n\t\t\tfilename := table.NewFilename(db.lc.reserveFileID(), db.opt.Dir)\n\t\t\tb := table.NewTableBuilder(opts)\n\t\t\tdefer b.Close()\n\t\t\tkey := make([]byte, 100)\n\t\t\tval := make([]byte, 500)\n\t\t\trand.Read(val)\n\n\t\t\tcopy(key, []byte{prefix})\n\t\t\tcount := uint64(40000)\n\t\t\tfor i := count; i > 0; i-- {\n\t\t\t\tvar meta byte\n\t\t\t\tif i == 0 {\n\t\t\t\t\tmeta = bitDiscardEarlierVersions\n\t\t\t\t}\n\t\t\t\tb.AddStaleKey(y.KeyWithTs(key, i), y.ValueStruct{Meta: meta, Value: val}, 0)\n\t\t\t}\n\t\t\ttbl, err := table.CreateTable(filename, b)\n\t\t\trequire.NoError(t, err)\n\t\t\treturn tbl\n\t\t}\n\n\t\tlevel := 6\n\t\tlh := db.lc.levels[level]\n\t\tfor i := byte(1); i < 5; i++ {\n\t\t\ttab := buildStaleTable(i)\n\t\t\trequire.NoError(t, db.manifest.addChanges([]*pb.ManifestChange{\n\t\t\t\tnewCreateChange(tab.ID(), level, 0, tab.CompressionType()),\n\t\t\t}, db.opt))\n\t\t\ttab.CreatedAt = time.Now().Add(-10 * time.Hour)\n\t\t\t// Add table to the given level.\n\t\t\tlh.addTable(tab)\n\t\t}\n\t\trequire.NoError(t, db.lc.validate())\n\n\t\trequire.NotZero(t, lh.getTotalStaleSize())\n\n\t\tdb.SetDiscardTs(1 << 30)\n\t\t// Modify the target file size so that we can compact all tables at 
once.\n\t\ttt := db.lc.levelTargets()\n\t\ttt.fileSz[6] = 1 << 30\n\t\tprio := compactionPriority{level: 6, t: tt}\n\t\trequire.NoError(t, db.lc.doCompact(-1, prio))\n\n\t\trequire.Zero(t, lh.getTotalStaleSize())\n\n\t})\n}\n"
  },
  {
    "path": "logger.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"log\"\n\t\"os\"\n)\n\n// Logger is implemented by any logging system that is used for standard logs.\ntype Logger interface {\n\tErrorf(string, ...interface{})\n\tWarningf(string, ...interface{})\n\tInfof(string, ...interface{})\n\tDebugf(string, ...interface{})\n}\n\n// Errorf logs an ERROR message to the logger specified in opts. It is a no-op\n// if no logger is specified in opts.\nfunc (opt *Options) Errorf(format string, v ...interface{}) {\n\tif opt.Logger == nil {\n\t\treturn\n\t}\n\topt.Logger.Errorf(format, v...)\n}\n\n// Infof logs an INFO message to the logger specified in opts.\nfunc (opt *Options) Infof(format string, v ...interface{}) {\n\tif opt.Logger == nil {\n\t\treturn\n\t}\n\topt.Logger.Infof(format, v...)\n}\n\n// Warningf logs a WARNING message to the logger specified in opts.\nfunc (opt *Options) Warningf(format string, v ...interface{}) {\n\tif opt.Logger == nil {\n\t\treturn\n\t}\n\topt.Logger.Warningf(format, v...)\n}\n\n// Debugf logs a DEBUG message to the logger specified in opts.\nfunc (opt *Options) Debugf(format string, v ...interface{}) {\n\tif opt.Logger == nil {\n\t\treturn\n\t}\n\topt.Logger.Debugf(format, v...)\n}\n\ntype loggingLevel int\n\nconst (\n\tDEBUG loggingLevel = iota\n\tINFO\n\tWARNING\n\tERROR\n)\n\ntype defaultLog struct {\n\t*log.Logger\n\tlevel loggingLevel\n}\n\nfunc defaultLogger(level loggingLevel) *defaultLog {\n\treturn &defaultLog{Logger: log.New(os.Stderr, \"badger \", log.LstdFlags), level: level}\n}\n\nfunc (l *defaultLog) Errorf(f string, v ...interface{}) {\n\tif l.level <= ERROR {\n\t\tl.Printf(\"ERROR: \"+f, v...)\n\t}\n}\n\nfunc (l *defaultLog) Warningf(f string, v ...interface{}) {\n\tif l.level <= WARNING {\n\t\tl.Printf(\"WARNING: \"+f, v...)\n\t}\n}\n\nfunc (l *defaultLog) Infof(f string, v ...interface{}) {\n\tif l.level <= 
INFO {\n\t\tl.Printf(\"INFO: \"+f, v...)\n\t}\n}\n\nfunc (l *defaultLog) Debugf(f string, v ...interface{}) {\n\tif l.level <= DEBUG {\n\t\tl.Printf(\"DEBUG: \"+f, v...)\n\t}\n}\n"
  },
  {
    "path": "logger_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n)\n\ntype mockLogger struct {\n\toutput string\n}\n\nfunc (l *mockLogger) Errorf(f string, v ...interface{}) {\n\tl.output = fmt.Sprintf(\"ERROR: \"+f, v...)\n}\n\nfunc (l *mockLogger) Infof(f string, v ...interface{}) {\n\tl.output = fmt.Sprintf(\"INFO: \"+f, v...)\n}\n\nfunc (l *mockLogger) Warningf(f string, v ...interface{}) {\n\tl.output = fmt.Sprintf(\"WARNING: \"+f, v...)\n}\n\nfunc (l *mockLogger) Debugf(f string, v ...interface{}) {\n\tl.output = fmt.Sprintf(\"DEBUG: \"+f, v...)\n}\n\n// Test that the DB-specific log is used instead of the global log.\nfunc TestDbLog(t *testing.T) {\n\tl := &mockLogger{}\n\topt := Options{Logger: l}\n\n\topt.Errorf(\"test\")\n\trequire.Equal(t, \"ERROR: test\", l.output)\n\topt.Infof(\"test\")\n\trequire.Equal(t, \"INFO: test\", l.output)\n\topt.Warningf(\"test\")\n\trequire.Equal(t, \"WARNING: test\", l.output)\n}\n\n// Test that a logger assigned to Options after construction is used.\nfunc TestNoDbLog(t *testing.T) {\n\tl := &mockLogger{}\n\topt := Options{}\n\topt.Logger = l\n\n\topt.Errorf(\"test\")\n\trequire.Equal(t, \"ERROR: test\", l.output)\n\topt.Infof(\"test\")\n\trequire.Equal(t, \"INFO: test\", l.output)\n\topt.Warningf(\"test\")\n\trequire.Equal(t, \"WARNING: test\", l.output)\n}\n"
  },
  {
    "path": "managed_db.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\n// OpenManaged returns a new DB, which allows more control over setting\n// transaction timestamps, aka managed mode.\n//\n// This is only useful for databases built on top of Badger (like Dgraph), and\n// can be ignored by most users.\nfunc OpenManaged(opts Options) (*DB, error) {\n\topts.managedTxns = true\n\treturn Open(opts)\n}\n\n// NewTransactionAt follows the same logic as DB.NewTransaction(), but uses the\n// provided read timestamp.\n//\n// This is only useful for databases built on top of Badger (like Dgraph), and\n// can be ignored by most users.\nfunc (db *DB) NewTransactionAt(readTs uint64, update bool) *Txn {\n\tif !db.opt.managedTxns {\n\t\tpanic(\"Cannot use NewTransactionAt with managedDB=false. Use NewTransaction instead.\")\n\t}\n\ttxn := db.newTransaction(update, true)\n\ttxn.readTs = readTs\n\treturn txn\n}\n\n// NewWriteBatchAt is similar to NewWriteBatch, but it allows the user to set the commit timestamp.\n// NewWriteBatchAt is supposed to be used only in managed mode.\nfunc (db *DB) NewWriteBatchAt(commitTs uint64) *WriteBatch {\n\tif !db.opt.managedTxns {\n\t\tpanic(\"cannot use NewWriteBatchAt with managedDB=false. Use NewWriteBatch instead\")\n\t}\n\n\twb := db.newWriteBatch(true)\n\twb.commitTs = commitTs\n\twb.txn.commitTs = commitTs\n\treturn wb\n}\n\n// NewManagedWriteBatch returns a WriteBatch for managed mode without a preset\n// commit timestamp. It panics if the DB is not in managed mode.\nfunc (db *DB) NewManagedWriteBatch() *WriteBatch {\n\tif !db.opt.managedTxns {\n\t\tpanic(\"cannot use NewManagedWriteBatch with managedDB=false. Use NewWriteBatch instead\")\n\t}\n\n\twb := db.newWriteBatch(true)\n\treturn wb\n}\n\n// CommitAt commits the transaction, following the same logic as Commit(), but\n// at the given commit timestamp. 
This will panic if not used with managed transactions.\n// If callback is nil, the result of Commit() is returned; otherwise the result is\n// delivered to callback via CommitWith() and CommitAt returns nil.\n//\n// This is only useful for databases built on top of Badger (like Dgraph), and\n// can be ignored by most users.\nfunc (txn *Txn) CommitAt(commitTs uint64, callback func(error)) error {\n\tif !txn.db.opt.managedTxns {\n\t\tpanic(\"Cannot use CommitAt with managedDB=false. Use Commit instead.\")\n\t}\n\ttxn.commitTs = commitTs\n\tif callback == nil {\n\t\treturn txn.Commit()\n\t}\n\ttxn.CommitWith(callback)\n\treturn nil\n}\n\n// SetDiscardTs sets a timestamp at or below which any invalid or deleted\n// versions can be discarded from the LSM tree, and subsequently from the value\n// log, to reclaim disk space. It panics if the DB is not in managed mode.\nfunc (db *DB) SetDiscardTs(ts uint64) {\n\tif !db.opt.managedTxns {\n\t\tpanic(\"Cannot use SetDiscardTs with managedDB=false.\")\n\t}\n\tdb.orc.setDiscardTs(ts)\n}\n"
  },
  {
    "path": "managed_db_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"math\"\n\t\"math/rand\"\n\t\"os\"\n\t\"runtime\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nfunc val(large bool) []byte {\n\tvar buf []byte\n\tif large {\n\t\tbuf = make([]byte, 8192)\n\t} else {\n\t\tbuf = make([]byte, 16)\n\t}\n\trand.Read(buf)\n\treturn buf\n}\n\nfunc numKeys(db *DB) int {\n\tvar count int\n\terr := db.View(func(txn *Txn) error {\n\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\tdefer itr.Close()\n\n\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\tcount++\n\t\t}\n\t\treturn nil\n\t})\n\ty.Check(err)\n\treturn count\n}\n\nfunc numKeysManaged(db *DB, readTs uint64) int {\n\ttxn := db.NewTransactionAt(readTs, false)\n\tdefer txn.Discard()\n\n\titr := txn.NewIterator(DefaultIteratorOptions)\n\tdefer itr.Close()\n\n\tvar count int\n\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\tcount++\n\t}\n\treturn count\n}\n\nfunc TestDropAllManaged(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\topts.managedTxns = true\n\topts.ValueLogFileSize = 5 << 20\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\n\tN := uint64(10000)\n\tpopulate := func(db *DB, start uint64) {\n\t\tvar wg sync.WaitGroup\n\t\tfor i := start; i < start+N; i++ {\n\t\t\twg.Add(1)\n\t\t\ttxn := db.NewTransactionAt(math.MaxUint64, true)\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(key(\"key\", int(i))), val(true))))\n\t\t\trequire.NoError(t, txn.CommitAt(i, func(err error) {\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\twg.Done()\n\t\t\t}))\n\t\t}\n\t\twg.Wait()\n\t}\n\n\tpopulate(db, N)\n\trequire.Equal(t, int(N), numKeysManaged(db, 
math.MaxUint64))\n\n\trequire.NoError(t, db.DropAll())\n\trequire.NoError(t, db.DropAll()) // Just call it twice, for fun.\n\trequire.Equal(t, 0, numKeysManaged(db, math.MaxUint64))\n\n\t// Check that we can still write to db, and using lower timestamps.\n\tpopulate(db, 1)\n\trequire.Equal(t, int(N), numKeysManaged(db, math.MaxUint64))\n\trequire.NoError(t, db.Close())\n\n\t// Ensure that value log is correctly replayed, that we are preserving badgerHead.\n\topts.managedTxns = true\n\tdb2, err := Open(opts)\n\trequire.NoError(t, err)\n\trequire.Equal(t, int(N), numKeysManaged(db2, math.MaxUint64))\n\trequire.NoError(t, db2.Close())\n}\n\nfunc TestDropAll(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 5 << 20\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\n\tN := uint64(10000)\n\tpopulate := func(db *DB) {\n\t\twriter := db.NewWriteBatch()\n\t\tfor i := uint64(0); i < N; i++ {\n\t\t\trequire.NoError(t, writer.Set([]byte(key(\"key\", int(i))), val(true)))\n\t\t}\n\t\trequire.NoError(t, writer.Flush())\n\t}\n\n\tpopulate(db)\n\trequire.Equal(t, int(N), numKeys(db))\n\n\trequire.NoError(t, db.DropAll())\n\trequire.Equal(t, 0, numKeys(db))\n\n\t// Check that we can still write to mdb, and using lower timestamps.\n\tpopulate(db)\n\trequire.Equal(t, int(N), numKeys(db))\n\trequire.NoError(t, db.Close())\n\n\t// Ensure that value log is correctly replayed.\n\tdb2, err := Open(opts)\n\trequire.NoError(t, err)\n\trequire.Equal(t, int(N), numKeys(db2))\n\trequire.NoError(t, db2.Close())\n}\n\nfunc TestDropAllTwice(t *testing.T) {\n\ttest := func(t *testing.T, opts Options) {\n\t\tdb, err := Open(opts)\n\n\t\trequire.NoError(t, err)\n\t\tdefer func() {\n\t\t\trequire.NoError(t, db.Close())\n\t\t}()\n\n\t\tN := uint64(10000)\n\t\tpopulate := func(db *DB) {\n\t\t\twriter := db.NewWriteBatch()\n\t\t\tfor i := uint64(0); i < N; i++ 
{\n\t\t\t\trequire.NoError(t, writer.Set([]byte(key(\"key\", int(i))), val(false)))\n\t\t\t}\n\t\t\trequire.NoError(t, writer.Flush())\n\t\t}\n\n\t\tpopulate(db)\n\t\trequire.Equal(t, int(N), numKeys(db))\n\n\t\trequire.NoError(t, db.DropAll())\n\t\trequire.Equal(t, 0, numKeys(db))\n\n\t\t// Call DropAll again.\n\t\trequire.NoError(t, db.DropAll())\n\t\trequire.NoError(t, db.Close())\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\t\tdefer removeDir(dir)\n\t\topts := getTestOptions(dir)\n\t\topts.ValueLogFileSize = 5 << 20\n\t\ttest(t, opts)\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\topts := getTestOptions(\"\")\n\t\topts.InMemory = true\n\t\ttest(t, opts)\n\t})\n}\n\nfunc TestDropAllWithPendingTxn(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 5 << 20\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\tdefer func() {\n\t\trequire.NoError(t, db.Close())\n\t}()\n\n\tN := uint64(10000)\n\tpopulate := func(db *DB) {\n\t\twriter := db.NewWriteBatch()\n\t\tfor i := uint64(0); i < N; i++ {\n\t\t\trequire.NoError(t, writer.Set([]byte(key(\"key\", int(i))), val(true)))\n\t\t}\n\t\trequire.NoError(t, writer.Flush())\n\t}\n\n\tpopulate(db)\n\trequire.Equal(t, int(N), numKeys(db))\n\n\ttxn := db.NewTransaction(true)\n\n\tvar wg sync.WaitGroup\n\twg.Add(1)\n\tgo func() {\n\t\tdefer wg.Done()\n\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\tdefer itr.Close()\n\n\t\tvar keys []string\n\t\tfor {\n\t\t\tvar count int\n\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\tcount++\n\t\t\t\titem := itr.Item()\n\t\t\t\tkeys = append(keys, string(item.KeyCopy(nil)))\n\t\t\t\t_, err := item.ValueCopy(nil)\n\t\t\t\tif err != nil {\n\t\t\t\t\tt.Logf(\"Got error during value copy: %v\", 
err)\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t}\n\t\t\tt.Logf(\"Got number of keys: %d\\n\", count)\n\t\t\tfor _, key := range keys {\n\t\t\t\titem, err := txn.Get([]byte(key))\n\t\t\t\tif err != nil {\n\t\t\t\t\tt.Logf(\"Got error during key lookup: %v\", err)\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t\tif _, err := item.ValueCopy(nil); err != nil {\n\t\t\t\t\tt.Logf(\"Got error during second value copy: %v\", err)\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}()\n\t// Do not cancel txn.\n\n\twg.Add(1)\n\tgo func() {\n\t\tdefer wg.Done()\n\t\ttime.Sleep(2 * time.Second)\n\t\trequire.NoError(t, db.DropAll())\n\t}()\n\twg.Wait()\n}\n\nfunc TestDropReadOnly(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 5 << 20\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\tN := uint64(1000)\n\tpopulate := func(db *DB) {\n\t\twriter := db.NewWriteBatch()\n\t\tfor i := uint64(0); i < N; i++ {\n\t\t\trequire.NoError(t, writer.Set([]byte(key(\"key\", int(i))), val(true)))\n\t\t}\n\t\trequire.NoError(t, writer.Flush())\n\t}\n\n\tpopulate(db)\n\trequire.Equal(t, int(N), numKeys(db))\n\trequire.NoError(t, db.Close())\n\n\topts.ReadOnly = true\n\tdb2, err := Open(opts)\n\t// acquireDirectoryLock returns ErrWindowsNotSupported on Windows. 
It can be ignored safely.\n\tif runtime.GOOS == \"windows\" {\n\t\trequire.Equal(t, err, ErrWindowsNotSupported)\n\t} else {\n\t\trequire.NoError(t, err)\n\t\trequire.Panics(t, func() { _ = db2.DropAll() })\n\t\trequire.NoError(t, db2.Close())\n\t}\n}\n\nfunc TestWriteAfterClose(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 5 << 20\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\tN := uint64(1000)\n\tpopulate := func(db *DB) {\n\t\twriter := db.NewWriteBatch()\n\t\tfor i := uint64(0); i < N; i++ {\n\t\t\trequire.NoError(t, writer.Set([]byte(key(\"key\", int(i))), val(true)))\n\t\t}\n\t\trequire.NoError(t, writer.Flush())\n\t}\n\n\tpopulate(db)\n\trequire.Equal(t, int(N), numKeys(db))\n\trequire.NoError(t, db.Close())\n\terr = db.Update(func(txn *Txn) error {\n\t\treturn txn.SetEntry(NewEntry([]byte(\"a\"), []byte(\"b\")))\n\t})\n\trequire.Equal(t, ErrDBClosed, err)\n}\n\nfunc TestDropAllRace(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\topts.managedTxns = true\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\n\tN := 10000\n\t// Start a goroutine to keep trying to write to DB while DropAll happens.\n\tcloser := z.NewCloser(1)\n\tgo func() {\n\t\tdefer closer.Done()\n\t\tticker := time.NewTicker(time.Millisecond)\n\t\tdefer ticker.Stop()\n\n\t\ti := N + 1 // Writes would happen above N.\n\t\tvar errors atomic.Int32\n\t\tfor {\n\t\t\tselect {\n\t\t\tcase <-ticker.C:\n\t\t\t\ti++\n\t\t\t\ttxn := db.NewTransactionAt(math.MaxUint64, true)\n\t\t\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(key(\"key\", i)), val(false))))\n\t\t\t\tif err := txn.CommitAt(uint64(i), func(err error) {\n\t\t\t\t\tif err != nil {\n\t\t\t\t\t\terrors.Add(1)\n\t\t\t\t\t}\n\t\t\t\t}); err != nil {\n\t\t\t\t\terrors.Add(1)\n\t\t\t\t}\n\t\t\tcase 
<-closer.HasBeenClosed():\n\t\t\t\t// The following causes a data race.\n\t\t\t\t// t.Logf(\"i: %d. Number of (expected) write errors: %d.\\n\", i, errors)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t}()\n\n\tvar wg sync.WaitGroup\n\tfor i := 1; i <= N; i++ {\n\t\twg.Add(1)\n\t\ttxn := db.NewTransactionAt(math.MaxUint64, true)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(key(\"key\", i)), val(false))))\n\t\trequire.NoError(t, txn.CommitAt(uint64(i), func(err error) {\n\t\t\trequire.NoError(t, err)\n\t\t\twg.Done()\n\t\t}))\n\t}\n\twg.Wait()\n\n\tbefore := numKeysManaged(db, math.MaxUint64)\n\trequire.True(t, before > N)\n\n\trequire.NoError(t, db.DropAll())\n\tcloser.SignalAndWait()\n\n\tafter := numKeysManaged(db, math.MaxUint64)\n\tt.Logf(\"Before: %d. After dropall: %d\\n\", before, after)\n\trequire.True(t, after < before)\n\tdb.Close()\n}\n\nfunc TestDropPrefix(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 5 << 20\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\n\tN := uint64(10000)\n\tpopulate := func(db *DB) {\n\t\twriter := db.NewWriteBatch()\n\t\tfor i := uint64(0); i < N; i++ {\n\t\t\trequire.NoError(t, writer.Set([]byte(key(\"key\", int(i))), val(true)))\n\t\t}\n\t\trequire.NoError(t, writer.Flush())\n\t}\n\n\tpopulate(db)\n\trequire.Equal(t, int(N), numKeys(db))\n\n\trequire.NoError(t, db.DropPrefix([]byte(\"key000\")))\n\trequire.Equal(t, int(N)-10, numKeys(db))\n\n\trequire.NoError(t, db.DropPrefix([]byte(\"key00\")))\n\trequire.Equal(t, int(N)-100, numKeys(db))\n\n\texpected := int(N)\n\tfor i := 0; i < 10; i++ {\n\t\trequire.NoError(t, db.DropPrefix([]byte(fmt.Sprintf(\"key%d\", i))))\n\t\texpected -= 1000\n\t\trequire.Equal(t, expected, numKeys(db))\n\t}\n\trequire.NoError(t, db.DropPrefix([]byte(\"key1\")))\n\trequire.Equal(t, 0, numKeys(db))\n\trequire.NoError(t, db.DropPrefix([]byte(\"key\")))\n\trequire.Equal(t, 
0, numKeys(db))\n\n\t// Check that we can still write to mdb, and using lower timestamps.\n\tpopulate(db)\n\trequire.Equal(t, int(N), numKeys(db))\n\trequire.NoError(t, db.DropPrefix([]byte(\"key\")))\n\trequire.Equal(t, 0, numKeys(db))\n\trequire.NoError(t, db.Close())\n\n\t// Ensure that value log is correctly replayed.\n\tdb2, err := Open(opts)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 0, numKeys(db2))\n\trequire.NoError(t, db2.Close())\n}\n\nfunc TestDropPrefixWithPendingTxn(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 5 << 20\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\tdefer func() {\n\t\trequire.NoError(t, db.Close())\n\t}()\n\n\tN := uint64(10000)\n\tpopulate := func(db *DB) {\n\t\twriter := db.NewWriteBatch()\n\t\tfor i := uint64(0); i < N; i++ {\n\t\t\trequire.NoError(t, writer.Set([]byte(key(\"key\", int(i))), val(true)))\n\t\t}\n\t\trequire.NoError(t, writer.Flush())\n\t}\n\n\tpopulate(db)\n\trequire.Equal(t, int(N), numKeys(db))\n\n\ttxn := db.NewTransaction(true)\n\n\tvar wg sync.WaitGroup\n\twg.Add(2)\n\tgo func() {\n\t\tdefer wg.Done()\n\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\tdefer itr.Close()\n\n\t\tvar keys []string\n\t\tfor {\n\t\t\tvar count int\n\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\tcount++\n\t\t\t\titem := itr.Item()\n\t\t\t\tkeys = append(keys, string(item.KeyCopy(nil)))\n\t\t\t\t_, err := item.ValueCopy(nil)\n\t\t\t\tif err != nil {\n\t\t\t\t\tt.Logf(\"Got error during value copy: %v\", err)\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t}\n\t\t\tt.Logf(\"Got number of keys: %d\\n\", count)\n\t\t\tfor _, key := range keys {\n\t\t\t\titem, err := txn.Get([]byte(key))\n\t\t\t\tif err != nil {\n\t\t\t\t\tt.Logf(\"Got error during key lookup: %v\", err)\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t\tif _, err := item.ValueCopy(nil); err != nil {\n\t\t\t\t\tt.Logf(\"Got error during 
second value copy: %v\", err)\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}()\n\t// Do not cancel txn.\n\n\tgo func() {\n\t\tdefer wg.Done()\n\t\ttime.Sleep(2 * time.Second)\n\t\trequire.NoError(t, db.DropPrefix([]byte(\"key0\")))\n\t\trequire.NoError(t, db.DropPrefix([]byte(\"key00\")))\n\t\trequire.NoError(t, db.DropPrefix([]byte(\"key\")))\n\t}()\n\twg.Wait()\n}\n\nfunc TestDropPrefixReadOnly(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 5 << 20\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\tN := uint64(1000)\n\tpopulate := func(db *DB) {\n\t\twriter := db.NewWriteBatch()\n\t\tfor i := uint64(0); i < N; i++ {\n\t\t\trequire.NoError(t, writer.Set([]byte(key(\"key\", int(i))), val(true)))\n\t\t}\n\t\trequire.NoError(t, writer.Flush())\n\t}\n\n\tpopulate(db)\n\trequire.Equal(t, int(N), numKeys(db))\n\trequire.NoError(t, db.Close())\n\n\topts.ReadOnly = true\n\tdb2, err := Open(opts)\n\t// acquireDirectoryLock returns ErrWindowsNotSupported on Windows. 
It can be ignored safely.\n\tif runtime.GOOS == \"windows\" {\n\t\trequire.Equal(t, err, ErrWindowsNotSupported)\n\t} else {\n\t\trequire.NoError(t, err)\n\t\trequire.Panics(t, func() { _ = db2.DropPrefix([]byte(\"key0\")) })\n\t\trequire.NoError(t, db2.Close())\n\t}\n}\n\nfunc TestDropPrefixRace(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topts := getTestOptions(dir)\n\topts.managedTxns = true\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\n\tN := 10000\n\t// Start a goroutine to keep trying to write to DB while DropPrefix happens.\n\tcloser := z.NewCloser(1)\n\tgo func() {\n\t\tdefer closer.Done()\n\t\tticker := time.NewTicker(time.Millisecond)\n\t\tdefer ticker.Stop()\n\n\t\ti := N + 1 // Writes would happen above N.\n\t\tvar errors atomic.Int32\n\t\tfor {\n\t\t\tselect {\n\t\t\tcase <-ticker.C:\n\t\t\t\ti++\n\t\t\t\ttxn := db.NewTransactionAt(math.MaxUint64, true)\n\t\t\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(key(\"key\", i)), val(false))))\n\t\t\t\tif err := txn.CommitAt(uint64(i), func(err error) {\n\t\t\t\t\tif err != nil {\n\t\t\t\t\t\terrors.Add(1)\n\t\t\t\t\t}\n\t\t\t\t}); err != nil {\n\t\t\t\t\terrors.Add(1)\n\t\t\t\t}\n\t\t\tcase <-closer.HasBeenClosed():\n\t\t\t\t// The following causes a data race.\n\t\t\t\t// t.Logf(\"i: %d. 
Number of (expected) write errors: %d.\\n\", i, errors)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t}()\n\n\tvar wg sync.WaitGroup\n\tfor i := 1; i <= N; i++ {\n\t\twg.Add(1)\n\t\ttxn := db.NewTransactionAt(math.MaxUint64, true)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(key(\"key\", i)), val(false))))\n\t\trequire.NoError(t, txn.CommitAt(uint64(i), func(err error) {\n\t\t\trequire.NoError(t, err)\n\t\t\twg.Done()\n\t\t}))\n\t}\n\twg.Wait()\n\n\tbefore := numKeysManaged(db, math.MaxUint64)\n\trequire.True(t, before > N)\n\n\trequire.NoError(t, db.DropPrefix([]byte(\"key00\")))\n\trequire.NoError(t, db.DropPrefix([]byte(\"key1\")))\n\trequire.NoError(t, db.DropPrefix([]byte(\"key\")))\n\tcloser.SignalAndWait()\n\n\tafter := numKeysManaged(db, math.MaxUint64)\n\tt.Logf(\"Before: %d. After dropprefix: %d\\n\", before, after)\n\trequire.True(t, after < before)\n\trequire.NoError(t, db.Close())\n}\n\nfunc TestWriteBatchManagedMode(t *testing.T) {\n\tkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%10d\", i))\n\t}\n\tval := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%128d\", i))\n\t}\n\topt := DefaultOptions(\"\")\n\topt.managedTxns = true\n\topt.BaseTableSize = 1 << 20 // This would create multiple transactions in write batch.\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\twb := db.NewWriteBatchAt(1)\n\t\tdefer wb.Cancel()\n\n\t\tN, M := 50000, 1000\n\t\tstart := time.Now()\n\n\t\tfor i := 0; i < N; i++ {\n\t\t\trequire.NoError(t, wb.Set(key(i), val(i)))\n\t\t}\n\t\tfor i := 0; i < M; i++ {\n\t\t\trequire.NoError(t, wb.Delete(key(i)))\n\t\t}\n\t\trequire.NoError(t, wb.Flush())\n\t\tt.Logf(\"Time taken for %d writes (w/ test options): %s\\n\", N+M, time.Since(start))\n\n\t\terr := db.View(func(txn *Txn) error {\n\t\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\t\tdefer itr.Close()\n\n\t\t\ti := M\n\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\titem := itr.Item()\n\t\t\t\trequire.Equal(t, string(key(i)), 
string(item.Key()))\n\t\t\t\trequire.Equal(t, item.Version(), uint64(1))\n\t\t\t\tvalcopy, err := item.ValueCopy(nil)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\trequire.Equal(t, val(i), valcopy)\n\t\t\t\ti++\n\t\t\t}\n\t\t\trequire.Equal(t, N, i)\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t})\n}\nfunc TestWriteBatchManaged(t *testing.T) {\n\tkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%10d\", i))\n\t}\n\tval := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%128d\", i))\n\t}\n\topt := DefaultOptions(\"\")\n\topt.managedTxns = true\n\topt.BaseTableSize = 1 << 15 // This would create multiple transactions in write batch.\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\twb := db.NewManagedWriteBatch()\n\t\tdefer wb.Cancel()\n\n\t\tN, M := 50000, 1000\n\t\tstart := time.Now()\n\n\t\tfor i := 0; i < N; i++ {\n\t\t\trequire.NoError(t, wb.SetEntryAt(&Entry{Key: key(i), Value: val(i)}, 1))\n\t\t}\n\t\tfor i := 0; i < M; i++ {\n\t\t\trequire.NoError(t, wb.DeleteAt(key(i), 2))\n\t\t}\n\t\trequire.NoError(t, wb.Flush())\n\t\tt.Logf(\"Time taken for %d writes (w/ test options): %s\\n\", N+M, time.Since(start))\n\n\t\terr := db.View(func(txn *Txn) error {\n\t\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\t\tdefer itr.Close()\n\n\t\t\ti := M\n\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\titem := itr.Item()\n\t\t\t\trequire.Equal(t, string(key(i)), string(item.Key()))\n\t\t\t\trequire.Equal(t, item.Version(), uint64(1))\n\t\t\t\tvalcopy, err := item.ValueCopy(nil)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\trequire.Equal(t, val(i), valcopy)\n\t\t\t\ti++\n\t\t\t}\n\t\t\trequire.Equal(t, N, i)\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t})\n}\n\nfunc TestWriteBatchDuplicate(t *testing.T) {\n\tN := 10\n\tk := []byte(\"key\")\n\tv := []byte(\"val\")\n\treadVerify := func(t *testing.T, db *DB, n int, versions []int) {\n\t\terr := db.View(func(txn *Txn) error {\n\t\t\tiopt := 
DefaultIteratorOptions\n\t\t\tiopt.AllVersions = true\n\t\t\titr := txn.NewIterator(iopt)\n\t\t\tdefer itr.Close()\n\n\t\t\ti := 0\n\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\titem := itr.Item()\n\t\t\t\trequire.Equal(t, k, item.Key())\n\t\t\t\trequire.Equal(t, uint64(versions[i]), item.Version())\n\t\t\t\terr := item.Value(func(val []byte) error {\n\t\t\t\t\trequire.Equal(t, v, val)\n\t\t\t\t\treturn nil\n\t\t\t\t})\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\ti++\n\t\t\t}\n\t\t\trequire.Equal(t, n, i)\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t}\n\n\tt.Run(\"writebatch\", func(t *testing.T) {\n\t\topt := DefaultOptions(\"\")\n\t\topt.BaseTableSize = 1 << 15 // This would create multiple transactions in write batch.\n\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\twb := db.NewWriteBatch()\n\t\t\tdefer wb.Cancel()\n\n\t\t\tfor i := uint64(0); i < uint64(N); i++ {\n\t\t\t\t// Multiple versions of the same key.\n\t\t\t\trequire.NoError(t, wb.SetEntry(&Entry{Key: k, Value: v}))\n\t\t\t}\n\t\t\trequire.NoError(t, wb.Flush())\n\t\t\treadVerify(t, db, 1, []int{1})\n\t\t})\n\t})\n\tt.Run(\"writebatch at\", func(t *testing.T) {\n\t\topt := DefaultOptions(\"\")\n\t\topt.BaseTableSize = 1 << 15 // This would create multiple transactions in write batch.\n\t\topt.managedTxns = true\n\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\twb := db.NewWriteBatchAt(10)\n\t\t\tdefer wb.Cancel()\n\n\t\t\tfor i := uint64(0); i < uint64(N); i++ {\n\t\t\t\t// Multiple versions of the same key.\n\t\t\t\trequire.NoError(t, wb.SetEntry(&Entry{Key: k, Value: v}))\n\t\t\t}\n\t\t\trequire.NoError(t, wb.Flush())\n\t\t\treadVerify(t, db, 1, []int{10})\n\t\t})\n\n\t})\n\tt.Run(\"managed writebatch\", func(t *testing.T) {\n\t\topt := DefaultOptions(\"\")\n\t\topt.managedTxns = true\n\t\topt.BaseTableSize = 1 << 15 // This would create multiple transactions in write batch.\n\t\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\t\twb := 
db.NewManagedWriteBatch()\n\t\t\tdefer wb.Cancel()\n\n\t\t\tfor i := uint64(1); i <= uint64(N); i++ {\n\t\t\t\t// Multiple versions of the same key.\n\t\t\t\trequire.NoError(t, wb.SetEntryAt(&Entry{Key: k, Value: v}, i))\n\t\t\t}\n\t\t\trequire.NoError(t, wb.Flush())\n\t\t\treadVerify(t, db, N, []int{10, 9, 8, 7, 6, 5, 4, 3, 2, 1})\n\t\t})\n\t})\n}\n\nfunc TestZeroDiscardStats(t *testing.T) {\n\tN := uint64(10000)\n\tpopulate := func(t *testing.T, db *DB) {\n\t\twriter := db.NewWriteBatch()\n\t\tfor i := uint64(0); i < N; i++ {\n\t\t\trequire.NoError(t, writer.Set([]byte(key(\"key\", int(i))), val(true)))\n\t\t}\n\t\trequire.NoError(t, writer.Flush())\n\t}\n\n\tt.Run(\"after rewrite\", func(t *testing.T) {\n\t\topts := getTestOptions(\"\")\n\t\topts.ValueLogFileSize = 5 << 20\n\t\topts.ValueThreshold = 1 << 10\n\t\topts.MemTableSize = 1 << 15\n\t\trunBadgerTest(t, &opts, func(t *testing.T, db *DB) {\n\t\t\tpopulate(t, db)\n\t\t\trequire.Equal(t, int(N), numKeys(db))\n\n\t\t\tfids := db.vlog.sortedFids()\n\t\t\tfor _, fid := range fids {\n\t\t\t\tdb.vlog.discardStats.Update(fid, 1)\n\t\t\t}\n\n\t\t\t// Ensure we have some valid fids.\n\t\t\trequire.True(t, len(fids) > 2)\n\t\t\tfid := fids[0]\n\t\t\trequire.NoError(t, db.vlog.rewrite(db.vlog.filesMap[fid]))\n\t\t\t// All data should still be present.\n\t\t\trequire.Equal(t, int(N), numKeys(db))\n\n\t\t\tdb.vlog.discardStats.Iterate(func(id, val uint64) {\n\t\t\t\t// Vlog with id=fid has been rewritten, so its discard stats should be zero.\n\t\t\t\tif uint32(id) == fid {\n\t\t\t\t\trequire.Zero(t, val)\n\t\t\t\t}\n\t\t\t})\n\t\t})\n\t})\n\tt.Run(\"after dropall\", func(t *testing.T) {\n\t\topts := getTestOptions(\"\")\n\t\topts.ValueLogFileSize = 5 << 20\n\t\trunBadgerTest(t, &opts, func(t *testing.T, db *DB) {\n\t\t\tpopulate(t, db)\n\t\t\trequire.Equal(t, int(N), numKeys(db))\n\n\t\t\t// Fill discard stats. 
Normally these are filled by compaction.\n\t\t\tfids := db.vlog.sortedFids()\n\t\t\tfor _, fid := range fids {\n\t\t\t\tdb.vlog.discardStats.Update(fid, 1)\n\t\t\t}\n\n\t\t\tdb.vlog.discardStats.Iterate(func(id, val uint64) { require.NotZero(t, val) })\n\t\t\trequire.NoError(t, db.DropAll())\n\t\t\trequire.Equal(t, 0, numKeys(db))\n\t\t\t// We've deleted everything. DS should be zero.\n\t\t\tdb.vlog.discardStats.Iterate(func(id, val uint64) { require.Zero(t, val) })\n\t\t})\n\t})\n}\n"
  },
  {
    "path": "manifest.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bufio\"\n\t\"bytes\"\n\t\"encoding/binary\"\n\t\"errors\"\n\t\"fmt\"\n\t\"hash/crc32\"\n\t\"io\"\n\t\"math\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"sync\"\n\n\t\"google.golang.org/protobuf/proto\"\n\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\n// Manifest represents the contents of the MANIFEST file in a Badger store.\n//\n// The MANIFEST file describes the startup state of the db -- all LSM files and what level they're\n// at.\n//\n// It consists of a sequence of ManifestChangeSet objects.  Each of these is treated atomically,\n// and contains a sequence of ManifestChange's (file creations/deletions) which we use to\n// reconstruct the manifest at startup.\ntype Manifest struct {\n\tLevels []levelManifest\n\tTables map[uint64]TableManifest\n\n\t// Contains total number of creation and deletion changes in the manifest -- used to compute\n\t// whether it'd be useful to rewrite the manifest.\n\tCreations int\n\tDeletions int\n}\n\nfunc createManifest() Manifest {\n\tlevels := make([]levelManifest, 0)\n\treturn Manifest{\n\t\tLevels: levels,\n\t\tTables: make(map[uint64]TableManifest),\n\t}\n}\n\n// levelManifest contains information about LSM tree levels\n// in the MANIFEST file.\ntype levelManifest struct {\n\tTables map[uint64]struct{} // Set of table id's\n}\n\n// TableManifest contains information about a specific table\n// in the LSM tree.\ntype TableManifest struct {\n\tLevel       uint8\n\tKeyID       uint64\n\tCompression options.CompressionType\n}\n\n// manifestFile holds the file pointer (and other info) about the manifest file, which is a log\n// file we append to.\ntype manifestFile struct {\n\tfp        *os.File\n\tdirectory string\n\n\t// The external magic number used by the application running 
badger.\n\texternalMagic uint16\n\n\t// We make this configurable so that unit tests can hit rewrite() code quickly\n\tdeletionsRewriteThreshold int\n\n\t// Guards appends, which includes access to the manifest field.\n\tappendLock sync.Mutex\n\n\t// Used to track the current state of the manifest, used when rewriting.\n\tmanifest Manifest\n\n\t// Used to indicate if badger was opened in InMemory mode.\n\tinMemory bool\n}\n\nconst (\n\t// ManifestFilename is the filename for the manifest file.\n\tManifestFilename                  = \"MANIFEST\"\n\tmanifestRewriteFilename           = \"MANIFEST-REWRITE\"\n\tmanifestDeletionsRewriteThreshold = 10000\n\tmanifestDeletionsRatio            = 10\n)\n\n// asChanges returns a sequence of changes that could be used to recreate the Manifest in its\n// present state.\nfunc (m *Manifest) asChanges() []*pb.ManifestChange {\n\tchanges := make([]*pb.ManifestChange, 0, len(m.Tables))\n\tfor id, tm := range m.Tables {\n\t\tchanges = append(changes, newCreateChange(id, int(tm.Level), tm.KeyID, tm.Compression))\n\t}\n\treturn changes\n}\n\nfunc (m *Manifest) clone(opt Options) Manifest {\n\tchangeSet := pb.ManifestChangeSet{Changes: m.asChanges()}\n\tret := createManifest()\n\ty.Check(applyChangeSet(&ret, &changeSet, opt))\n\treturn ret\n}\n\n// openOrCreateManifestFile opens a Badger manifest file if it exists, or creates one if\n// it doesn’t exist.\nfunc openOrCreateManifestFile(opt Options) (\n\tret *manifestFile, result Manifest, err error) {\n\tif opt.InMemory {\n\t\treturn &manifestFile{inMemory: true}, Manifest{}, nil\n\t}\n\treturn helpOpenOrCreateManifestFile(opt.Dir, opt.ReadOnly, opt.ExternalMagicVersion,\n\t\tmanifestDeletionsRewriteThreshold, opt)\n}\n\nfunc helpOpenOrCreateManifestFile(dir string, readOnly bool, extMagic uint16,\n\tdeletionsThreshold int, opt Options) (*manifestFile, Manifest, error) {\n\n\tpath := filepath.Join(dir, ManifestFilename)\n\tvar flags y.Flags\n\tif readOnly {\n\t\tflags |= 
y.ReadOnly\n\t}\n\tfp, err := y.OpenExistingFile(path, flags) // We explicitly sync in addChanges, outside the lock.\n\tif err != nil {\n\t\tif !os.IsNotExist(err) {\n\t\t\treturn nil, Manifest{}, err\n\t\t}\n\t\tif readOnly {\n\t\t\treturn nil, Manifest{}, fmt.Errorf(\"no manifest found, required for read-only db\")\n\t\t}\n\t\tm := createManifest()\n\t\tfp, netCreations, err := helpRewrite(dir, &m, extMagic)\n\t\tif err != nil {\n\t\t\treturn nil, Manifest{}, err\n\t\t}\n\t\ty.AssertTrue(netCreations == 0)\n\t\tmf := &manifestFile{\n\t\t\tfp:                        fp,\n\t\t\tdirectory:                 dir,\n\t\t\texternalMagic:             extMagic,\n\t\t\tmanifest:                  m.clone(opt),\n\t\t\tdeletionsRewriteThreshold: deletionsThreshold,\n\t\t}\n\t\treturn mf, m, nil\n\t}\n\n\tmanifest, truncOffset, err := ReplayManifestFile(fp, extMagic, opt)\n\tif err != nil {\n\t\t_ = fp.Close()\n\t\treturn nil, Manifest{}, err\n\t}\n\n\tif !readOnly {\n\t\t// Truncate file so we don't have a half-written entry at the end.\n\t\tif err := fp.Truncate(truncOffset); err != nil {\n\t\t\t_ = fp.Close()\n\t\t\treturn nil, Manifest{}, err\n\t\t}\n\t}\n\tif _, err = fp.Seek(0, io.SeekEnd); err != nil {\n\t\t_ = fp.Close()\n\t\treturn nil, Manifest{}, err\n\t}\n\n\tmf := &manifestFile{\n\t\tfp:                        fp,\n\t\tdirectory:                 dir,\n\t\texternalMagic:             extMagic,\n\t\tmanifest:                  manifest.clone(opt),\n\t\tdeletionsRewriteThreshold: deletionsThreshold,\n\t}\n\treturn mf, manifest, nil\n}\n\nfunc (mf *manifestFile) close() error {\n\tif mf.inMemory {\n\t\treturn nil\n\t}\n\treturn mf.fp.Close()\n}\n\n// addChanges writes a batch of changes, atomically, to the file.  By \"atomically\" that means when\n// we replay the MANIFEST file, we'll either replay all the changes or none of them.  
(The truth of\n// this depends on the filesystem -- some might append garbage data if a system crash happens at\n// the wrong time.)\nfunc (mf *manifestFile) addChanges(changesParam []*pb.ManifestChange, opt Options) error {\n\tif mf.inMemory {\n\t\treturn nil\n\t}\n\tchanges := pb.ManifestChangeSet{Changes: changesParam}\n\tbuf, err := proto.Marshal(&changes)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\t// Maybe we could use O_APPEND instead (on certain file systems)\n\tmf.appendLock.Lock()\n\tdefer mf.appendLock.Unlock()\n\tif err := applyChangeSet(&mf.manifest, &changes, opt); err != nil {\n\t\treturn err\n\t}\n\t// Rewrite manifest if it'd shrink by 1/10 and it's big enough to care\n\tif mf.manifest.Deletions > mf.deletionsRewriteThreshold &&\n\t\tmf.manifest.Deletions > manifestDeletionsRatio*(mf.manifest.Creations-mf.manifest.Deletions) {\n\t\tif err := mf.rewrite(); err != nil {\n\t\t\treturn err\n\t\t}\n\t} else {\n\t\tvar lenCrcBuf [8]byte\n\t\tbinary.BigEndian.PutUint32(lenCrcBuf[0:4], uint32(len(buf)))\n\t\tbinary.BigEndian.PutUint32(lenCrcBuf[4:8], crc32.Checksum(buf, y.CastagnoliCrcTable))\n\t\tbuf = append(lenCrcBuf[:], buf...)\n\t\tif _, err := mf.fp.Write(buf); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\n\treturn syncFunc(mf.fp)\n}\n\n// this function is saved here to allow injection of fake filesystem latency at test time.\nvar syncFunc = func(f *os.File) error { return f.Sync() }\n\n// Has to be 4 bytes.  The value can never change, ever, anyway.\nvar magicText = [4]byte{'B', 'd', 'g', 'r'}\n\n// The magic version number. 
It is allocated 2 bytes, so its value must be <= math.MaxUint16\nconst badgerMagicVersion = 8\n\nfunc helpRewrite(dir string, m *Manifest, extMagic uint16) (*os.File, int, error) {\n\trewritePath := filepath.Join(dir, manifestRewriteFilename)\n\t// We explicitly sync.\n\tfp, err := y.OpenTruncFile(rewritePath, false)\n\tif err != nil {\n\t\treturn nil, 0, err\n\t}\n\n\t// magic bytes are structured as\n\t// +---------------------+-------------------------+-----------------------+\n\t// | magicText (4 bytes) | externalMagic (2 bytes) | badgerMagic (2 bytes) |\n\t// +---------------------+-------------------------+-----------------------+\n\n\ty.AssertTrue(badgerMagicVersion <= math.MaxUint16)\n\tbuf := make([]byte, 8)\n\tcopy(buf[0:4], magicText[:])\n\tbinary.BigEndian.PutUint16(buf[4:6], extMagic)\n\tbinary.BigEndian.PutUint16(buf[6:8], badgerMagicVersion)\n\n\tnetCreations := len(m.Tables)\n\tchanges := m.asChanges()\n\tset := pb.ManifestChangeSet{Changes: changes}\n\n\tchangeBuf, err := proto.Marshal(&set)\n\tif err != nil {\n\t\tfp.Close()\n\t\treturn nil, 0, err\n\t}\n\tvar lenCrcBuf [8]byte\n\tbinary.BigEndian.PutUint32(lenCrcBuf[0:4], uint32(len(changeBuf)))\n\tbinary.BigEndian.PutUint32(lenCrcBuf[4:8], crc32.Checksum(changeBuf, y.CastagnoliCrcTable))\n\tbuf = append(buf, lenCrcBuf[:]...)\n\tbuf = append(buf, changeBuf...)\n\tif _, err := fp.Write(buf); err != nil {\n\t\tfp.Close()\n\t\treturn nil, 0, err\n\t}\n\tif err := fp.Sync(); err != nil {\n\t\tfp.Close()\n\t\treturn nil, 0, err\n\t}\n\n\t// In Windows the files should be closed before doing a Rename.\n\tif err = fp.Close(); err != nil {\n\t\treturn nil, 0, err\n\t}\n\tmanifestPath := filepath.Join(dir, ManifestFilename)\n\tif err := os.Rename(rewritePath, manifestPath); err != nil {\n\t\treturn nil, 0, err\n\t}\n\tfp, err = y.OpenExistingFile(manifestPath, 0)\n\tif err != nil {\n\t\treturn nil, 0, err\n\t}\n\tif _, err := fp.Seek(0, io.SeekEnd); err != nil {\n\t\tfp.Close()\n\t\treturn nil, 0, 
err\n\t}\n\tif err := syncDir(dir); err != nil {\n\t\tfp.Close()\n\t\treturn nil, 0, err\n\t}\n\n\treturn fp, netCreations, nil\n}\n\n// Must be called while appendLock is held.\nfunc (mf *manifestFile) rewrite() error {\n\t// In Windows the files should be closed before doing a Rename.\n\tif err := mf.fp.Close(); err != nil {\n\t\treturn err\n\t}\n\tfp, netCreations, err := helpRewrite(mf.directory, &mf.manifest, mf.externalMagic)\n\tif err != nil {\n\t\treturn err\n\t}\n\tmf.fp = fp\n\tmf.manifest.Creations = netCreations\n\tmf.manifest.Deletions = 0\n\n\treturn nil\n}\n\ntype countingReader struct {\n\twrapped *bufio.Reader\n\tcount   int64\n}\n\nfunc (r *countingReader) Read(p []byte) (n int, err error) {\n\tn, err = r.wrapped.Read(p)\n\tr.count += int64(n)\n\treturn\n}\n\nfunc (r *countingReader) ReadByte() (b byte, err error) {\n\tb, err = r.wrapped.ReadByte()\n\tif err == nil {\n\t\tr.count++\n\t}\n\treturn\n}\n\nvar (\n\terrBadMagic    = errors.New(\"manifest has bad magic\")\n\terrBadChecksum = errors.New(\"manifest has checksum mismatch\")\n)\n\n// ReplayManifestFile reads the manifest file and constructs two manifest objects.  (We need one\n// immutable copy and one mutable copy of the manifest.  Easiest way is to construct two of them.)\n// Also, returns the last offset after a completely read manifest entry -- the file must be\n// truncated at that point before further appends are made (if there is a partial entry after\n// that).  
In normal conditions, truncOffset is the file size.\nfunc ReplayManifestFile(fp *os.File, extMagic uint16, opt Options) (Manifest, int64, error) {\n\tr := countingReader{wrapped: bufio.NewReader(fp)}\n\n\tvar magicBuf [8]byte\n\tif _, err := io.ReadFull(&r, magicBuf[:]); err != nil {\n\t\treturn Manifest{}, 0, errBadMagic\n\t}\n\tif !bytes.Equal(magicBuf[0:4], magicText[:]) {\n\t\treturn Manifest{}, 0, errBadMagic\n\t}\n\n\textVersion := y.BytesToU16(magicBuf[4:6])\n\tversion := y.BytesToU16(magicBuf[6:8])\n\n\tif version != badgerMagicVersion {\n\t\treturn Manifest{}, 0,\n\t\t\t//nolint:lll\n\t\t\tfmt.Errorf(\"manifest has unsupported version: %d (we support %d).\\n\"+\n\t\t\t\t\"Please see https://github.com/dgraph-io/badger/blob/main/docs/troubleshooting.md#i-see-manifest-has-unsupported-version-x-we-support-y-error\"+\n\t\t\t\t\" on how to fix this\",\n\t\t\t\tversion, badgerMagicVersion)\n\t}\n\tif extVersion != extMagic {\n\t\treturn Manifest{}, 0,\n\t\t\tfmt.Errorf(\"cannot open DB because the external magic number doesn't match, \"+\n\t\t\t\t\"expected: %d, version present in manifest: %d\", extMagic, extVersion)\n\t}\n\n\tstat, err := fp.Stat()\n\tif err != nil {\n\t\treturn Manifest{}, 0, err\n\t}\n\n\tbuild := createManifest()\n\tvar offset int64\n\tfor {\n\t\toffset = r.count\n\t\tvar lenCrcBuf [8]byte\n\t\t_, err := io.ReadFull(&r, lenCrcBuf[:])\n\t\tif err != nil {\n\t\t\tif err == io.EOF || err == io.ErrUnexpectedEOF {\n\t\t\t\tbreak\n\t\t\t}\n\t\t\treturn Manifest{}, 0, err\n\t\t}\n\t\tlength := y.BytesToU32(lenCrcBuf[0:4])\n\t\t// Sanity check to ensure we don't over-allocate memory.\n\t\tif length > uint32(stat.Size()) {\n\t\t\treturn Manifest{}, 0, fmt.Errorf(\n\t\t\t\t\"Buffer length: %d greater than file size: %d. 
Manifest file might be corrupted\",\n\t\t\t\tlength, stat.Size())\n\t\t}\n\t\tvar buf = make([]byte, length)\n\t\tif _, err := io.ReadFull(&r, buf); err != nil {\n\t\t\tif err == io.EOF || err == io.ErrUnexpectedEOF {\n\t\t\t\tbreak\n\t\t\t}\n\t\t\treturn Manifest{}, 0, err\n\t\t}\n\t\tif crc32.Checksum(buf, y.CastagnoliCrcTable) != y.BytesToU32(lenCrcBuf[4:8]) {\n\t\t\treturn Manifest{}, 0, errBadChecksum\n\t\t}\n\n\t\tvar changeSet pb.ManifestChangeSet\n\t\tif err := proto.Unmarshal(buf, &changeSet); err != nil {\n\t\t\treturn Manifest{}, 0, err\n\t\t}\n\n\t\tif err := applyChangeSet(&build, &changeSet, opt); err != nil {\n\t\t\treturn Manifest{}, 0, err\n\t\t}\n\t}\n\n\treturn build, offset, nil\n}\n\nfunc applyManifestChange(build *Manifest, tc *pb.ManifestChange, opt Options) error {\n\tswitch tc.Op {\n\tcase pb.ManifestChange_CREATE:\n\t\tif _, ok := build.Tables[tc.Id]; ok {\n\t\t\treturn fmt.Errorf(\"MANIFEST invalid, table %d exists\", tc.Id)\n\t\t}\n\t\tbuild.Tables[tc.Id] = TableManifest{\n\t\t\tLevel:       uint8(tc.Level),\n\t\t\tKeyID:       tc.KeyId,\n\t\t\tCompression: options.CompressionType(tc.Compression),\n\t\t}\n\t\tfor len(build.Levels) <= int(tc.Level) {\n\t\t\tbuild.Levels = append(build.Levels, levelManifest{make(map[uint64]struct{})})\n\t\t}\n\t\tbuild.Levels[tc.Level].Tables[tc.Id] = struct{}{}\n\t\tbuild.Creations++\n\tcase pb.ManifestChange_DELETE:\n\t\ttm, ok := build.Tables[tc.Id]\n\t\tif !ok {\n\t\t\topt.Warningf(\"MANIFEST delete: table %d has already been removed\", tc.Id)\n\t\t\tfor _, level := range build.Levels {\n\t\t\t\tdelete(level.Tables, tc.Id)\n\t\t\t}\n\t\t} else {\n\t\t\tdelete(build.Levels[tm.Level].Tables, tc.Id)\n\t\t\tdelete(build.Tables, tc.Id)\n\t\t}\n\t\tbuild.Deletions++\n\tdefault:\n\t\treturn fmt.Errorf(\"MANIFEST file has invalid manifestChange op\")\n\t}\n\treturn nil\n}\n\n// This is not a \"recoverable\" error -- opening the KV store fails because the MANIFEST file is\n// just plain broken.\nfunc 
applyChangeSet(build *Manifest, changeSet *pb.ManifestChangeSet, opt Options) error {\n\tfor _, change := range changeSet.Changes {\n\t\tif err := applyManifestChange(build, change, opt); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\treturn nil\n}\n\nfunc newCreateChange(\n\tid uint64, level int, keyID uint64, c options.CompressionType) *pb.ManifestChange {\n\treturn &pb.ManifestChange{\n\t\tId:    id,\n\t\tOp:    pb.ManifestChange_CREATE,\n\t\tLevel: uint32(level),\n\t\tKeyId: keyID,\n\t\t// Hard coding it, since we're supporting only AES for now.\n\t\tEncryptionAlgo: pb.EncryptionAlgo_aes,\n\t\tCompression:    uint32(c),\n\t}\n}\n\nfunc newDeleteChange(id uint64) *pb.ManifestChange {\n\treturn &pb.ManifestChange{\n\t\tId: id,\n\t\tOp: pb.ManifestChange_DELETE,\n\t}\n}\n"
  },
  {
    "path": "manifest_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"math/rand\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"sort\"\n\t\"sync\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\nfunc TestManifestBasic(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topt := getTestOptions(dir)\n\t{\n\t\tkv, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\tn := 5000\n\t\tfor i := 0; i < n; i++ {\n\t\t\tif (i % 10000) == 0 {\n\t\t\t\tfmt.Printf(\"Putting i=%d\\n\", i)\n\t\t\t}\n\t\t\tk := []byte(fmt.Sprintf(\"%16x\", rand.Int63()))\n\t\t\ttxnSet(t, kv, k, k, 0x00)\n\t\t}\n\t\ttxnSet(t, kv, []byte(\"testkey\"), []byte(\"testval\"), 0x05)\n\t\trequire.NoError(t, kv.validate())\n\t\trequire.NoError(t, kv.Close())\n\t}\n\n\tkv, err := Open(opt)\n\trequire.NoError(t, err)\n\n\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\titem, err := txn.Get([]byte(\"testkey\"))\n\t\trequire.NoError(t, err)\n\t\trequire.EqualValues(t, \"testval\", string(getItemValue(t, item)))\n\t\trequire.EqualValues(t, byte(0x05), item.UserMeta())\n\t\treturn nil\n\t}))\n\trequire.NoError(t, kv.Close())\n}\n\nfunc helpTestManifestFileCorruption(t *testing.T, off int64, errorContent string) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topt := getTestOptions(dir)\n\t{\n\t\tkv, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\trequire.NoError(t, kv.Close())\n\t}\n\tfp, err := os.OpenFile(filepath.Join(dir, ManifestFilename), os.O_RDWR, 0)\n\trequire.NoError(t, err)\n\t// Mess with magic value or version to force error\n\t_, err = fp.WriteAt([]byte{'X'}, 
off)\n\trequire.NoError(t, err)\n\trequire.NoError(t, fp.Close())\n\tkv, err := Open(opt)\n\tdefer func() {\n\t\tif kv != nil {\n\t\t\tkv.Close()\n\t\t}\n\t}()\n\trequire.Error(t, err)\n\trequire.Contains(t, err.Error(), errorContent)\n}\n\nfunc TestManifestMagic(t *testing.T) {\n\thelpTestManifestFileCorruption(t, 3, \"bad magic\")\n}\n\nfunc TestManifestVersion(t *testing.T) {\n\thelpTestManifestFileCorruption(t, 6, \"unsupported version\")\n}\n\nfunc TestManifestChecksum(t *testing.T) {\n\thelpTestManifestFileCorruption(t, 15, \"checksum mismatch\")\n}\n\nfunc key(prefix string, i int) string {\n\treturn prefix + fmt.Sprintf(\"%04d\", i)\n}\n\n// TODO - Move these to somewhere where table package can also use it.\n// keyValues is n by 2 where n is number of pairs.\nfunc buildTable(t *testing.T, keyValues [][]string, bopts table.Options) *table.Table {\n\tif bopts.BloomFalsePositive == 0 {\n\t\tbopts.BloomFalsePositive = 0.01\n\t}\n\tif bopts.BlockSize == 0 {\n\t\tbopts.BlockSize = 4 * 1024\n\t}\n\tb := table.NewTableBuilder(bopts)\n\tdefer b.Close()\n\t// TODO: Add test for file garbage collection here. No files should be left after the tests here.\n\n\tfilename := fmt.Sprintf(\"%s%s%d.sst\", os.TempDir(), string(os.PathSeparator), rand.Uint32())\n\n\tsort.Slice(keyValues, func(i, j int) bool {\n\t\treturn keyValues[i][0] < keyValues[j][0]\n\t})\n\tfor _, kv := range keyValues {\n\t\ty.AssertTrue(len(kv) == 2)\n\t\tb.Add(y.KeyWithTs([]byte(kv[0]), 10), y.ValueStruct{\n\t\t\tValue:    []byte(kv[1]),\n\t\t\tMeta:     'A',\n\t\t\tUserMeta: 0,\n\t\t}, 0)\n\t}\n\n\ttbl, err := table.CreateTable(filename, b)\n\trequire.NoError(t, err)\n\treturn tbl\n}\n\nfunc TestOverlappingKeyRangeError(t *testing.T) {\n\t// [Aman] This test is not making sense to me right now. 
When fixing warnings from\n\t// linter, I realized that the runCompactDef function below always returns error.\n\tt.Skip()\n\n\tbuildTestTable := func(t *testing.T, prefix string, n int, opts table.Options) *table.Table {\n\t\ty.AssertTrue(n <= 10000)\n\t\tkeyValues := make([][]string, n)\n\t\tfor i := 0; i < n; i++ {\n\t\t\tk := key(prefix, i)\n\t\t\tv := fmt.Sprintf(\"%d\", i)\n\t\t\tkeyValues[i] = []string{k, v}\n\t\t}\n\t\treturn buildTable(t, keyValues, opts)\n\t}\n\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\tkv, err := Open(DefaultOptions(dir))\n\trequire.NoError(t, err)\n\tdefer func() { require.NoError(t, kv.Close()) }()\n\n\tlh0 := newLevelHandler(kv, 0)\n\tlh1 := newLevelHandler(kv, 1)\n\topts := table.Options{ChkMode: options.OnTableAndBlockRead}\n\tt1 := buildTestTable(t, \"k\", 2, opts)\n\tdefer func() { require.NoError(t, t1.DecrRef()) }()\n\n\tdone := lh0.tryAddLevel0Table(t1)\n\trequire.Equal(t, true, done)\n\tcd := compactDef{\n\t\tthisLevel: lh0,\n\t\tnextLevel: lh1,\n\t\tt:         kv.lc.levelTargets(),\n\t}\n\tcd.t.baseLevel = 1\n\n\tmanifest := createManifest()\n\tlc, err := newLevelsController(kv, &manifest)\n\trequire.NoError(t, err)\n\tdone = lc.fillTablesL0(&cd)\n\trequire.Equal(t, true, done)\n\trequire.NoError(t, lc.runCompactDef(-1, 0, cd))\n\n\tt2 := buildTestTable(t, \"l\", 2, opts)\n\tdefer func() { require.NoError(t, t2.DecrRef()) }()\n\tdone = lh0.tryAddLevel0Table(t2)\n\trequire.Equal(t, true, done)\n\n\tcd = compactDef{\n\t\tthisLevel: lh0,\n\t\tnextLevel: lh1,\n\t\tt:         kv.lc.levelTargets(),\n\t}\n\tcd.t.baseLevel = 1\n\tlc.fillTablesL0(&cd)\n\trequire.NoError(t, lc.runCompactDef(-1, 0, cd))\n}\n\nfunc TestManifestRewrite(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\n\tdb, err := Open(DefaultOptions(dir))\n\trequire.NoError(t, err, \"error while opening db\")\n\n\tdefer func() {\n\t\trequire.NoError(t, 
db.Close())\n\t\tremoveDir(dir)\n\t}()\n\n\tdeletionsThreshold := 10\n\n\tmf, m, err := helpOpenOrCreateManifestFile(dir, false, 0, deletionsThreshold, db.opt)\n\tdefer func() {\n\t\tif mf != nil {\n\t\t\tmf.close()\n\t\t}\n\t}()\n\trequire.NoError(t, err)\n\trequire.Equal(t, 0, m.Creations)\n\trequire.Equal(t, 0, m.Deletions)\n\n\terr = mf.addChanges([]*pb.ManifestChange{\n\t\tnewCreateChange(0, 0, 0, 0),\n\t}, db.opt)\n\trequire.NoError(t, err)\n\n\tfor i := uint64(0); i < uint64(deletionsThreshold*3); i++ {\n\t\tch := []*pb.ManifestChange{\n\t\t\tnewCreateChange(i+1, 0, 0, 0),\n\t\t\tnewDeleteChange(i),\n\t\t}\n\t\terr := mf.addChanges(ch, db.opt)\n\t\trequire.NoError(t, err)\n\t}\n\terr = mf.close()\n\trequire.NoError(t, err)\n\tmf = nil\n\tmf, m, err = helpOpenOrCreateManifestFile(dir, false, 0, deletionsThreshold, db.opt)\n\trequire.NoError(t, err)\n\trequire.Equal(t, map[uint64]TableManifest{\n\t\tuint64(deletionsThreshold * 3): {Level: 0},\n\t}, m.Tables)\n}\n\nfunc TestConcurrentManifestCompaction(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := Open(DefaultOptions(dir))\n\trequire.NoError(t, err, \"error while opening db\")\n\tdefer func() {\n\t\trequire.NoError(t, db.Close())\n\t}()\n\n\t// overwrite the sync function to make this race condition easily reproducible\n\tsyncFunc = func(f *os.File) error {\n\t\t// effectively making the Sync() take around 1s makes this reproduce every time\n\t\ttime.Sleep(1 * time.Second)\n\t\treturn f.Sync()\n\t}\n\n\tmf, _, err := helpOpenOrCreateManifestFile(dir, false, 0, 0, db.opt)\n\trequire.NoError(t, err)\n\n\tcs := &pb.ManifestChangeSet{}\n\tfor i := uint64(0); i < 1000; i++ {\n\t\tcs.Changes = append(cs.Changes,\n\t\t\tnewCreateChange(i, 0, 0, 0),\n\t\t\tnewDeleteChange(i),\n\t\t)\n\t}\n\n\t// simulate 2 concurrent compaction threads\n\tn := 2\n\twg := sync.WaitGroup{}\n\twg.Add(n)\n\tfor i := 0; i < n; i++ {\n\t\tgo func() 
{\n\t\t\tdefer wg.Done()\n\t\t\trequire.NoError(t, mf.addChanges(cs.Changes, db.opt))\n\t\t}()\n\t}\n\twg.Wait()\n\n\trequire.NoError(t, mf.close())\n}\n"
  },
  {
    "path": "memtable.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bufio\"\n\t\"bytes\"\n\t\"crypto/aes\"\n\tcryptorand \"crypto/rand\"\n\t\"encoding/binary\"\n\t\"fmt\"\n\t\"hash/crc32\"\n\t\"io\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"sort\"\n\t\"strconv\"\n\t\"strings\"\n\t\"sync\"\n\t\"sync/atomic\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/skl\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\n// memTable structure stores a skiplist and a corresponding WAL. Writes to memTable are written\n// both to the WAL and the skiplist. On a crash, the WAL is replayed to bring the skiplist back to\n// its pre-crash form.\ntype memTable struct {\n\t// TODO: Give skiplist z.Calloc'd []byte.\n\tsl         *skl.Skiplist\n\twal        *logFile\n\tmaxVersion uint64\n\topt        Options\n\tbuf        *bytes.Buffer\n}\n\nfunc (db *DB) openMemTables(opt Options) error {\n\t// We don't need to open any tables in in-memory mode.\n\tif db.opt.InMemory {\n\t\treturn nil\n\t}\n\tfiles, err := os.ReadDir(db.opt.Dir)\n\tif err != nil {\n\t\treturn errFile(err, db.opt.Dir, \"Unable to open mem dir.\")\n\t}\n\n\tvar fids []int\n\tfor _, file := range files {\n\t\tif !strings.HasSuffix(file.Name(), memFileExt) {\n\t\t\tcontinue\n\t\t}\n\t\tfsz := len(file.Name())\n\t\tfid, err := strconv.ParseInt(file.Name()[:fsz-len(memFileExt)], 10, 64)\n\t\tif err != nil {\n\t\t\treturn errFile(err, file.Name(), \"Unable to parse log id.\")\n\t\t}\n\t\tfids = append(fids, int(fid))\n\t}\n\n\t// Sort in ascending order.\n\tsort.Slice(fids, func(i, j int) bool {\n\t\treturn fids[i] < fids[j]\n\t})\n\tfor _, fid := range fids {\n\t\tflags := os.O_RDWR\n\t\tif db.opt.ReadOnly {\n\t\t\tflags = os.O_RDONLY\n\t\t}\n\t\tmt, err := db.openMemTable(fid, flags)\n\t\tif err != nil {\n\t\t\treturn y.Wrapf(err, \"while opening fid: %d\", 
fid)\n\t\t}\n\t\t// If this memtable is empty we don't need to add it. This is a\n\t\t// memtable that was completely truncated.\n\t\tif mt.sl.Empty() {\n\t\t\tmt.DecrRef()\n\t\t\tcontinue\n\t\t}\n\t\t// These should no longer be written to. So, make them part of the imm.\n\t\tdb.imm = append(db.imm, mt)\n\t}\n\tif len(fids) != 0 {\n\t\tdb.nextMemFid = fids[len(fids)-1]\n\t}\n\tdb.nextMemFid++\n\treturn nil\n}\n\nconst memFileExt string = \".mem\"\n\nfunc (db *DB) openMemTable(fid, flags int) (*memTable, error) {\n\tfilepath := db.mtFilePath(fid)\n\ts := skl.NewSkiplist(arenaSize(db.opt))\n\tmt := &memTable{\n\t\tsl:  s,\n\t\topt: db.opt,\n\t\tbuf: &bytes.Buffer{},\n\t}\n\t// We don't need to create the wal for the skiplist in in-memory mode so return the mt.\n\tif db.opt.InMemory {\n\t\treturn mt, z.NewFile\n\t}\n\n\tmt.wal = &logFile{\n\t\tfid:      uint32(fid),\n\t\tpath:     filepath,\n\t\tregistry: db.registry,\n\t\twriteAt:  vlogHeaderSize,\n\t\topt:      db.opt,\n\t}\n\tlerr := mt.wal.open(filepath, flags, 2*db.opt.MemTableSize)\n\tif lerr != z.NewFile && lerr != nil {\n\t\treturn nil, y.Wrapf(lerr, \"While opening memtable: %s\", filepath)\n\t}\n\n\t// Have a callback set to delete WAL when skiplist reference count goes down to zero. 
That is,\n\t// when it gets flushed to L0.\n\ts.OnClose = func() {\n\t\tif err := mt.wal.Delete(); err != nil {\n\t\t\tdb.opt.Errorf(\"while deleting file: %s, err: %v\", filepath, err)\n\t\t}\n\t}\n\n\tif lerr == z.NewFile {\n\t\treturn mt, lerr\n\t}\n\terr := mt.UpdateSkipList()\n\treturn mt, y.Wrapf(err, \"while updating skiplist\")\n}\n\nfunc (db *DB) newMemTable() (*memTable, error) {\n\tmt, err := db.openMemTable(db.nextMemFid, os.O_CREATE|os.O_RDWR)\n\tif err == z.NewFile {\n\t\tdb.nextMemFid++\n\t\treturn mt, nil\n\t}\n\n\tif err != nil {\n\t\tdb.opt.Errorf(\"Got error: %v for id: %d\\n\", err, db.nextMemFid)\n\t\treturn nil, y.Wrapf(err, \"newMemTable\")\n\t}\n\treturn nil, fmt.Errorf(\"File %s already exists\", mt.wal.Fd.Name())\n}\n\nfunc (db *DB) mtFilePath(fid int) string {\n\treturn filepath.Join(db.opt.Dir, fmt.Sprintf(\"%05d%s\", fid, memFileExt))\n}\n\nfunc (mt *memTable) SyncWAL() error {\n\treturn mt.wal.Sync()\n}\n\nfunc (mt *memTable) isFull() bool {\n\tif mt.sl.MemSize() >= mt.opt.MemTableSize {\n\t\treturn true\n\t}\n\tif mt.opt.InMemory {\n\t\t// InMemory mode doesn't have any WAL.\n\t\treturn false\n\t}\n\treturn int64(mt.wal.writeAt) >= mt.opt.MemTableSize\n}\n\nfunc (mt *memTable) Put(key []byte, value y.ValueStruct) error {\n\tentry := &Entry{\n\t\tKey:       key,\n\t\tValue:     value.Value,\n\t\tUserMeta:  value.UserMeta,\n\t\tmeta:      value.Meta,\n\t\tExpiresAt: value.ExpiresAt,\n\t}\n\n\t// wal is nil only when badger is running in in-memory mode and we don't need the wal.\n\tif mt.wal != nil {\n\t\t// If WAL exceeds opt.ValueLogFileSize, we'll force flush the memTable. 
See logic in\n\t\t// ensureRoomForWrite.\n\t\tif err := mt.wal.writeEntry(mt.buf, entry, mt.opt); err != nil {\n\t\t\treturn y.Wrapf(err, \"cannot write entry to WAL file\")\n\t\t}\n\t}\n\t// We insert the finish marker in the WAL but not in the memtable.\n\tif entry.meta&bitFinTxn > 0 {\n\t\treturn nil\n\t}\n\n\t// Write to skiplist and update maxVersion encountered.\n\tmt.sl.Put(key, value)\n\tif ts := y.ParseTs(entry.Key); ts > mt.maxVersion {\n\t\tmt.maxVersion = ts\n\t}\n\ty.NumBytesWrittenToL0Add(mt.opt.MetricsEnabled, entry.estimateSizeAndSetThreshold(mt.opt.ValueThreshold))\n\treturn nil\n}\n\nfunc (mt *memTable) UpdateSkipList() error {\n\tif mt.wal == nil || mt.sl == nil {\n\t\treturn nil\n\t}\n\tendOff, err := mt.wal.iterate(true, 0, mt.replayFunction(mt.opt))\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"while iterating wal: %s\", mt.wal.Fd.Name())\n\t}\n\tif endOff < mt.wal.size.Load() && mt.opt.ReadOnly {\n\t\treturn y.Wrapf(ErrTruncateNeeded, \"end offset: %d < size: %d\", endOff, mt.wal.size.Load())\n\t}\n\treturn mt.wal.Truncate(int64(endOff))\n}\n\n// IncrRef increases the refcount\nfunc (mt *memTable) IncrRef() {\n\tmt.sl.IncrRef()\n}\n\n// DecrRef decrements the refcount, deallocating the Skiplist when done using it\nfunc (mt *memTable) DecrRef() {\n\tmt.sl.DecrRef()\n}\n\nfunc (mt *memTable) replayFunction(opt Options) func(Entry, valuePointer) error {\n\tfirst := true\n\treturn func(e Entry, _ valuePointer) error { // Function for replaying.\n\t\tif first {\n\t\t\topt.Debugf(\"First key=%q\\n\", e.Key)\n\t\t}\n\t\tfirst = false\n\t\tif ts := y.ParseTs(e.Key); ts > mt.maxVersion {\n\t\t\tmt.maxVersion = ts\n\t\t}\n\t\tv := y.ValueStruct{\n\t\t\tValue:     e.Value,\n\t\t\tMeta:      e.meta,\n\t\t\tUserMeta:  e.UserMeta,\n\t\t\tExpiresAt: e.ExpiresAt,\n\t\t}\n\t\t// This is already encoded correctly. Value would be either a vptr, or a full value\n\t\t// depending upon how big the original value was. 
Skiplist makes a copy of the key and\n\t\t// value.\n\t\tmt.sl.Put(e.Key, v)\n\t\treturn nil\n\t}\n}\n\ntype logFile struct {\n\t*z.MmapFile\n\tpath string\n\t// This is a lock on the log file. It guards the fd’s value, the file’s\n\t// existence and the file’s memory map.\n\t//\n\t// Use shared ownership when reading/writing the file or memory map, use\n\t// exclusive ownership to open/close the descriptor, unmap or remove the file.\n\tlock     sync.RWMutex\n\tfid      uint32\n\tsize     atomic.Uint32\n\tdataKey  *pb.DataKey\n\tbaseIV   []byte\n\tregistry *KeyRegistry\n\twriteAt  uint32\n\topt      Options\n}\n\nfunc (lf *logFile) Truncate(end int64) error {\n\tif fi, err := lf.Fd.Stat(); err != nil {\n\t\treturn fmt.Errorf(\"while file.stat on file: %s, error: %v\\n\", lf.Fd.Name(), err)\n\t} else if fi.Size() == end {\n\t\treturn nil\n\t}\n\ty.AssertTrue(!lf.opt.ReadOnly)\n\tlf.size.Store(uint32(end))\n\treturn lf.MmapFile.Truncate(end)\n}\n\n// encodeEntry will encode entry to the buf\n// layout of entry\n// +--------+-----+-------+-------+\n// | header | key | value | crc32 |\n// +--------+-----+-------+-------+\nfunc (lf *logFile) encodeEntry(buf *bytes.Buffer, e *Entry, offset uint32) (int, error) {\n\th := header{\n\t\tklen:      uint32(len(e.Key)),\n\t\tvlen:      uint32(len(e.Value)),\n\t\texpiresAt: e.ExpiresAt,\n\t\tmeta:      e.meta,\n\t\tuserMeta:  e.UserMeta,\n\t}\n\n\thash := crc32.New(y.CastagnoliCrcTable)\n\twriter := io.MultiWriter(buf, hash)\n\n\t// encode header.\n\tvar headerEnc [maxHeaderSize]byte\n\tsz := h.Encode(headerEnc[:])\n\ty.Check2(writer.Write(headerEnc[:sz]))\n\t// we'll encrypt only key and value.\n\tif lf.encryptionEnabled() {\n\t\t// TODO: no need to allocate the bytes. we can calculate the encrypted buf one by one\n\t\t// since we're using ctr mode of AES encryption. Ordering won't change. 
Need some\n\t\t// refactoring in XORBlock which will work like stream cipher.\n\t\teBuf := make([]byte, 0, len(e.Key)+len(e.Value))\n\t\teBuf = append(eBuf, e.Key...)\n\t\teBuf = append(eBuf, e.Value...)\n\t\tif err := y.XORBlockStream(\n\t\t\twriter, eBuf, lf.dataKey.Data, lf.generateIV(offset)); err != nil {\n\t\t\treturn 0, y.Wrapf(err, \"Error while encoding entry for vlog.\")\n\t\t}\n\t} else {\n\t\t// Encryption is disabled so writing directly to the buffer.\n\t\ty.Check2(writer.Write(e.Key))\n\t\ty.Check2(writer.Write(e.Value))\n\t}\n\t// write crc32 hash.\n\tvar crcBuf [crc32.Size]byte\n\tbinary.BigEndian.PutUint32(crcBuf[:], hash.Sum32())\n\ty.Check2(buf.Write(crcBuf[:]))\n\t// return encoded length.\n\treturn len(headerEnc[:sz]) + len(e.Key) + len(e.Value) + len(crcBuf), nil\n}\n\nfunc (lf *logFile) writeEntry(buf *bytes.Buffer, e *Entry, opt Options) error {\n\tbuf.Reset()\n\tplen, err := lf.encodeEntry(buf, e, lf.writeAt)\n\tif err != nil {\n\t\treturn err\n\t}\n\ty.AssertTrue(plen == copy(lf.Data[lf.writeAt:], buf.Bytes()))\n\tlf.writeAt += uint32(plen)\n\n\tlf.zeroNextEntry()\n\treturn nil\n}\n\nfunc (lf *logFile) decodeEntry(buf []byte, offset uint32) (*Entry, error) {\n\tvar h header\n\thlen := h.Decode(buf)\n\tkv := buf[hlen:]\n\tif lf.encryptionEnabled() {\n\t\tvar err error\n\t\t// No need to worry about mmap. because, XORBlock allocates a byte array to do the\n\t\t// xor. 
So, the given slice is not being mutated.\n\t\tif kv, err = lf.decryptKV(kv, offset); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\te := &Entry{\n\t\tmeta:      h.meta,\n\t\tUserMeta:  h.userMeta,\n\t\tExpiresAt: h.expiresAt,\n\t\toffset:    offset,\n\t\tKey:       kv[:h.klen],\n\t\tValue:     kv[h.klen : h.klen+h.vlen],\n\t}\n\treturn e, nil\n}\n\nfunc (lf *logFile) decryptKV(buf []byte, offset uint32) ([]byte, error) {\n\treturn y.XORBlockAllocate(buf, lf.dataKey.Data, lf.generateIV(offset))\n}\n\n// KeyID returns datakey's ID.\nfunc (lf *logFile) keyID() uint64 {\n\tif lf.dataKey == nil {\n\t\t// If there is no datakey, then we'll return 0. Which means no encryption.\n\t\treturn 0\n\t}\n\treturn lf.dataKey.KeyId\n}\n\nfunc (lf *logFile) encryptionEnabled() bool {\n\treturn lf.dataKey != nil\n}\n\n// Acquire lock on mmap/file if you are calling this\nfunc (lf *logFile) read(p valuePointer) (buf []byte, err error) {\n\toffset := p.Offset\n\t// Do not convert size to uint32, because the lf.Data can be of size\n\t// 4GB, which overflows the uint32 during conversion to make the size 0,\n\t// causing the read to fail with ErrEOF. See issue #585.\n\tsize := int64(len(lf.Data))\n\tvalsz := p.Len\n\tlfsz := lf.size.Load()\n\tif int64(offset) >= size || int64(offset+valsz) > size ||\n\t\t// Ensure that the read is within the file's actual size. It might be possible that\n\t\t// the offset+valsz length is beyond the file's actual size. 
This could happen when\n\t\t// dropAll and iterations are running simultaneously.\n\t\tint64(offset+valsz) > int64(lfsz) {\n\t\terr = y.ErrEOF\n\t} else {\n\t\tbuf = lf.Data[offset : offset+valsz]\n\t}\n\treturn buf, err\n}\n\n// generateIV will generate IV by appending given offset with the base IV.\nfunc (lf *logFile) generateIV(offset uint32) []byte {\n\tiv := make([]byte, aes.BlockSize)\n\t// baseIV is of 12 bytes.\n\ty.AssertTrue(12 == copy(iv[:12], lf.baseIV))\n\t// remaining 4 bytes is obtained from offset.\n\tbinary.BigEndian.PutUint32(iv[12:], offset)\n\treturn iv\n}\n\nfunc (lf *logFile) doneWriting(offset uint32) error {\n\tif lf.opt.SyncWrites {\n\t\tif err := lf.Sync(); err != nil {\n\t\t\treturn y.Wrapf(err, \"Unable to sync value log: %q\", lf.path)\n\t\t}\n\t}\n\n\t// Before we were acquiring a lock here on lf.lock, because we were invalidating the file\n\t// descriptor due to reopening it as read-only. Now, we don't invalidate the fd, but unmap it,\n\t// truncate it and remap it. That creates a window where we have segfaults because the mmap is\n\t// no longer valid, while someone might be reading it. Therefore, we need a lock here again.\n\tlf.lock.Lock()\n\tdefer lf.lock.Unlock()\n\n\tif err := lf.Truncate(int64(offset)); err != nil {\n\t\treturn y.Wrapf(err, \"Unable to truncate file: %q\", lf.path)\n\t}\n\n\t// Previously we used to close the file after it was written and reopen it in read-only mode.\n\t// We no longer open files in read-only mode. We keep all vlog files open in read-write mode.\n\treturn nil\n}\n\n// iterate iterates over log file. 
It does not allocate new memory for every kv pair.\n// Therefore, the kv pair is only valid for the duration of fn call.\nfunc (lf *logFile) iterate(readOnly bool, offset uint32, fn logEntry) (uint32, error) {\n\tif offset == 0 {\n\t\t// If offset is set to zero, let's advance past the encryption key header.\n\t\toffset = vlogHeaderSize\n\t}\n\n\t// For now, read directly from file, because it allows\n\treader := bufio.NewReader(lf.NewReader(int(offset)))\n\tread := &safeRead{\n\t\tk:            make([]byte, 10),\n\t\tv:            make([]byte, 10),\n\t\trecordOffset: offset,\n\t\tlf:           lf,\n\t}\n\n\tvar lastCommit uint64\n\tvar validEndOffset uint32 = offset\n\n\tvar entries []*Entry\n\tvar vptrs []valuePointer\n\nloop:\n\tfor {\n\t\te, err := read.Entry(reader)\n\t\tswitch {\n\t\t// We have not reached the end of the file but the entry we read is\n\t\t// zero. This happens because we have truncated the file and\n\t\t// zero'ed it out.\n\t\tcase err == io.EOF:\n\t\t\tbreak loop\n\t\tcase err == io.ErrUnexpectedEOF || err == errTruncate:\n\t\t\tbreak loop\n\t\tcase err != nil:\n\t\t\treturn 0, err\n\t\tcase e == nil:\n\t\t\tcontinue\n\t\tcase e.isZero():\n\t\t\tbreak loop\n\t\t}\n\n\t\tvar vp valuePointer\n\t\tvp.Len = uint32(e.hlen + len(e.Key) + len(e.Value) + crc32.Size)\n\t\tread.recordOffset += vp.Len\n\n\t\tvp.Offset = e.offset\n\t\tvp.Fid = lf.fid\n\n\t\tswitch {\n\t\tcase e.meta&bitTxn > 0:\n\t\t\ttxnTs := y.ParseTs(e.Key)\n\t\t\tif lastCommit == 0 {\n\t\t\t\tlastCommit = txnTs\n\t\t\t}\n\t\t\tif lastCommit != txnTs {\n\t\t\t\tbreak loop\n\t\t\t}\n\t\t\tentries = append(entries, e)\n\t\t\tvptrs = append(vptrs, vp)\n\n\t\tcase e.meta&bitFinTxn > 0:\n\t\t\ttxnTs, err := strconv.ParseUint(string(e.Value), 10, 64)\n\t\t\tif err != nil || lastCommit != txnTs {\n\t\t\t\tbreak loop\n\t\t\t}\n\t\t\t// Got the end of txn. 
Now we can store them.\n\t\t\tlastCommit = 0\n\t\t\tvalidEndOffset = read.recordOffset\n\n\t\t\tfor i, e := range entries {\n\t\t\t\tvp := vptrs[i]\n\t\t\t\tif err := fn(*e, vp); err != nil {\n\t\t\t\t\tif err == errStop {\n\t\t\t\t\t\tbreak\n\t\t\t\t\t}\n\t\t\t\t\treturn 0, errFile(err, lf.path, \"Iteration function\")\n\t\t\t\t}\n\t\t\t}\n\t\t\tentries = entries[:0]\n\t\t\tvptrs = vptrs[:0]\n\n\t\tdefault:\n\t\t\tif lastCommit != 0 {\n\t\t\t\t// This is most likely an entry which was moved as part of GC.\n\t\t\t\t// We shouldn't get this entry in the middle of a transaction.\n\t\t\t\tbreak loop\n\t\t\t}\n\t\t\tvalidEndOffset = read.recordOffset\n\n\t\t\tif err := fn(*e, vp); err != nil {\n\t\t\t\tif err == errStop {\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t\treturn 0, errFile(err, lf.path, \"Iteration function\")\n\t\t\t}\n\t\t}\n\t}\n\treturn validEndOffset, nil\n}\n\n// Zero out the next entry to deal with any crashes.\nfunc (lf *logFile) zeroNextEntry() {\n\tz.ZeroOut(lf.Data, int(lf.writeAt), int(lf.writeAt+maxHeaderSize))\n}\n\nfunc (lf *logFile) open(path string, flags int, fsize int64) error {\n\tmf, ferr := z.OpenMmapFile(path, flags, int(fsize))\n\tlf.MmapFile = mf\n\n\tif ferr == z.NewFile {\n\t\tif err := lf.bootstrap(); err != nil {\n\t\t\tos.Remove(path)\n\t\t\treturn err\n\t\t}\n\t\tlf.size.Store(vlogHeaderSize)\n\n\t} else if ferr != nil {\n\t\treturn y.Wrapf(ferr, \"while opening file: %s\", path)\n\t}\n\tlf.size.Store(uint32(len(lf.Data)))\n\n\tif lf.size.Load() < vlogHeaderSize {\n\t\t// Every vlog file should have at least vlogHeaderSize. If it is less than vlogHeaderSize\n\t\t// then it must have been corrupted. But no need to handle here. log replayer will truncate\n\t\t// and bootstrap the logfile. 
So ignoring here.\n\t\treturn nil\n\t}\n\n\t// Copy over the encryption registry data.\n\tbuf := make([]byte, vlogHeaderSize)\n\n\ty.AssertTruef(vlogHeaderSize == copy(buf, lf.Data),\n\t\t\"Unable to copy from %s, size %d\", path, lf.size.Load())\n\tkeyID := binary.BigEndian.Uint64(buf[:8])\n\t// retrieve datakey.\n\tif dk, err := lf.registry.DataKey(keyID); err != nil {\n\t\treturn y.Wrapf(err, \"While opening vlog file %d\", lf.fid)\n\t} else {\n\t\tlf.dataKey = dk\n\t}\n\tlf.baseIV = buf[8:]\n\ty.AssertTrue(len(lf.baseIV) == 12)\n\n\t// Preserved ferr so we can return if this was a new file.\n\treturn ferr\n}\n\n// bootstrap will initialize the log file with key id and baseIV.\n// The below figure shows the layout of log file.\n// +----------------+------------------+------------------+\n// | keyID(8 bytes) |  baseIV(12 bytes)|\t entry...     |\n// +----------------+------------------+------------------+\nfunc (lf *logFile) bootstrap() error {\n\tvar err error\n\n\t// generate data key for the log file.\n\tvar dk *pb.DataKey\n\tif dk, err = lf.registry.LatestDataKey(); err != nil {\n\t\treturn y.Wrapf(err, \"Error while retrieving datakey in logFile.bootstarp\")\n\t}\n\tlf.dataKey = dk\n\n\t// We'll always preserve vlogHeaderSize for key id and baseIV.\n\tbuf := make([]byte, vlogHeaderSize)\n\n\t// write key id to the buf.\n\t// key id will be zero if the logfile is in plain text.\n\tbinary.BigEndian.PutUint64(buf[:8], lf.keyID())\n\t// generate base IV. It'll be used with offset of the vptr to encrypt the entry.\n\tif _, err := cryptorand.Read(buf[8:]); err != nil {\n\t\treturn y.Wrapf(err, \"Error while creating base IV, while creating logfile\")\n\t}\n\n\t// Initialize base IV.\n\tlf.baseIV = buf[8:]\n\ty.AssertTrue(len(lf.baseIV) == 12)\n\n\t// Copy over to the logFile.\n\ty.AssertTrue(vlogHeaderSize == copy(lf.Data[0:], buf))\n\n\t// Zero out the next entry.\n\tlf.zeroNextEntry()\n\treturn nil\n}\n"
  },
  {
    "path": "merge.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\tstderrors \"errors\"\n\t\"sync\"\n\t\"time\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\n// MergeOperator represents a Badger merge operator.\ntype MergeOperator struct {\n\tsync.RWMutex\n\tf      MergeFunc\n\tdb     *DB\n\tkey    []byte\n\tcloser *z.Closer\n}\n\n// MergeFunc accepts two byte slices, one representing an existing value, and\n// another representing a new value that needs to be ‘merged’ into it. MergeFunc\n// contains the logic to perform the ‘merge’ and return an updated value.\n// MergeFunc could perform operations like integer addition, list appends etc.\n// Note that the ordering of the operands is maintained.\ntype MergeFunc func(existingVal, newVal []byte) []byte\n\n// GetMergeOperator creates a new MergeOperator for a given key and returns a\n// pointer to it. It also fires off a goroutine that performs a compaction using\n// the merge function that runs periodically, as specified by dur.\nfunc (db *DB) GetMergeOperator(key []byte,\n\tf MergeFunc, dur time.Duration) *MergeOperator {\n\top := &MergeOperator{\n\t\tf:      f,\n\t\tdb:     db,\n\t\tkey:    key,\n\t\tcloser: z.NewCloser(1),\n\t}\n\n\tgo op.runCompactions(dur)\n\treturn op\n}\n\nvar errNoMerge = stderrors.New(\"No need for merge\")\n\nfunc (op *MergeOperator) iterateAndMerge() (newVal []byte, latest uint64, err error) {\n\ttxn := op.db.NewTransaction(false)\n\tdefer txn.Discard()\n\topt := DefaultIteratorOptions\n\topt.AllVersions = true\n\tit := txn.NewKeyIterator(op.key, opt)\n\tdefer it.Close()\n\n\tvar numVersions int\n\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\titem := it.Item()\n\t\tif item.IsDeletedOrExpired() {\n\t\t\tbreak\n\t\t}\n\t\tnumVersions++\n\t\tif numVersions == 1 {\n\t\t\t// This should be the newVal, considering this is the latest version.\n\t\t\tnewVal, err 
= item.ValueCopy(newVal)\n\t\t\tif err != nil {\n\t\t\t\treturn nil, 0, err\n\t\t\t}\n\t\t\tlatest = item.Version()\n\t\t} else {\n\t\t\tif err := item.Value(func(oldVal []byte) error {\n\t\t\t\t// The merge should always be on the newVal considering it has the merge result of\n\t\t\t\t// the latest version. The value read should be the oldVal.\n\t\t\t\tnewVal = op.f(oldVal, newVal)\n\t\t\t\treturn nil\n\t\t\t}); err != nil {\n\t\t\t\treturn nil, 0, err\n\t\t\t}\n\t\t}\n\t\tif item.DiscardEarlierVersions() {\n\t\t\tbreak\n\t\t}\n\t}\n\tif numVersions == 0 {\n\t\treturn nil, latest, ErrKeyNotFound\n\t} else if numVersions == 1 {\n\t\treturn newVal, latest, errNoMerge\n\t}\n\treturn newVal, latest, nil\n}\n\nfunc (op *MergeOperator) compact() error {\n\top.Lock()\n\tdefer op.Unlock()\n\tval, version, err := op.iterateAndMerge()\n\tif err == ErrKeyNotFound || err == errNoMerge {\n\t\treturn nil\n\t} else if err != nil {\n\t\treturn err\n\t}\n\tentries := []*Entry{\n\t\t{\n\t\t\tKey:   y.KeyWithTs(op.key, version),\n\t\t\tValue: val,\n\t\t\tmeta:  bitDiscardEarlierVersions,\n\t\t},\n\t}\n\t// Write value back to the DB. It is important that we do not set the bitMergeEntry bit\n\t// here. 
When compaction happens, all the older merged entries will be removed.\n\treturn op.db.batchSetAsync(entries, func(err error) {\n\t\tif err != nil {\n\t\t\top.db.opt.Errorf(\"failed to insert the result of merge compaction: %s\", err)\n\t\t}\n\t})\n}\n\nfunc (op *MergeOperator) runCompactions(dur time.Duration) {\n\tticker := time.NewTicker(dur)\n\tdefer op.closer.Done()\n\tvar stop bool\n\tfor {\n\t\tselect {\n\t\tcase <-op.closer.HasBeenClosed():\n\t\t\tstop = true\n\t\tcase <-ticker.C: // wait for tick\n\t\t}\n\t\tif err := op.compact(); err != nil {\n\t\t\top.db.opt.Errorf(\"failure while running merge operation: %s\", err)\n\t\t}\n\t\tif stop {\n\t\t\tticker.Stop()\n\t\t\tbreak\n\t\t}\n\t}\n}\n\n// Add records a value in Badger which will eventually be merged by a background\n// routine into the values that were recorded by previous invocations to Add().\nfunc (op *MergeOperator) Add(val []byte) error {\n\treturn op.db.Update(func(txn *Txn) error {\n\t\treturn txn.SetEntry(NewEntry(op.key, val).withMergeBit())\n\t})\n}\n\n// Get returns the latest value for the merge operator, which is derived by\n// applying the merge function to all the values added so far.\n//\n// If Add has not been called even once, Get will return ErrKeyNotFound.\nfunc (op *MergeOperator) Get() ([]byte, error) {\n\top.RLock()\n\tdefer op.RUnlock()\n\tvar existing []byte\n\terr := op.db.View(func(txn *Txn) (err error) {\n\t\texisting, _, err = op.iterateAndMerge()\n\t\treturn err\n\t})\n\tif err == errNoMerge {\n\t\treturn existing, nil\n\t}\n\treturn existing, err\n}\n\n// Stop waits for any pending merge to complete and then stops the background\n// goroutine.\nfunc (op *MergeOperator) Stop() {\n\top.closer.SignalAndWait()\n}\n"
  },
  {
    "path": "merge_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"encoding/binary\"\n\t\"os\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n)\n\nfunc TestGetMergeOperator(t *testing.T) {\n\tt.Run(\"Get before Add\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\tm := db.GetMergeOperator([]byte(\"merge\"), add, 200*time.Millisecond)\n\t\t\tdefer m.Stop()\n\n\t\t\tval, err := m.Get()\n\t\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\t\trequire.Nil(t, val)\n\t\t})\n\t})\n\tt.Run(\"Add and Get\", func(t *testing.T) {\n\t\tkey := []byte(\"merge\")\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\tm := db.GetMergeOperator(key, add, 200*time.Millisecond)\n\t\t\tdefer m.Stop()\n\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(1)))\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(2)))\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(3)))\n\n\t\t\tres, err := m.Get()\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, uint64(6), bytesToUint64(res))\n\t\t})\n\n\t})\n\tt.Run(\"Add and Get slices\", func(t *testing.T) {\n\t\t// Merge function to merge two byte slices\n\t\tadd := func(originalValue, newValue []byte) []byte {\n\t\t\treturn append(originalValue, newValue...)\n\t\t}\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\tm := db.GetMergeOperator([]byte(\"fooprefix\"), add, 2*time.Millisecond)\n\t\t\tdefer m.Stop()\n\n\t\t\trequire.NoError(t, m.Add([]byte(\"A\")))\n\t\t\trequire.NoError(t, m.Add([]byte(\"B\")))\n\t\t\trequire.NoError(t, m.Add([]byte(\"C\")))\n\n\t\t\tvalue, err := m.Get()\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, \"ABC\", string(value))\n\t\t})\n\t})\n\tt.Run(\"Get Before Compact\", func(t *testing.T) {\n\t\tkey := []byte(\"merge\")\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\tm := db.GetMergeOperator(key, add, 500*time.Millisecond)\n\t\t\tdefer 
m.Stop()\n\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(1)))\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(2)))\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(3)))\n\n\t\t\tres, err := m.Get()\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, uint64(6), bytesToUint64(res))\n\t\t})\n\t})\n\n\tt.Run(\"Get after Delete\", func(t *testing.T) {\n\t\tkey := []byte(\"merge\")\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\tm := db.GetMergeOperator(key, add, 200*time.Millisecond)\n\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(1)))\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(2)))\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(3)))\n\n\t\t\tm.Stop()\n\t\t\tres, err := m.Get()\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, uint64(6), bytesToUint64(res))\n\n\t\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.Delete(key)\n\t\t\t}))\n\n\t\t\tm = db.GetMergeOperator(key, add, 200*time.Millisecond)\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(1)))\n\t\t\tm.Stop()\n\n\t\t\tres, err = m.Get()\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, uint64(1), bytesToUint64(res))\n\t\t})\n\t})\n\n\tt.Run(\"Get after Stop\", func(t *testing.T) {\n\t\tkey := []byte(\"merge\")\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\tm := db.GetMergeOperator(key, add, 1*time.Second)\n\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(1)))\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(2)))\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(3)))\n\n\t\t\tm.Stop()\n\t\t\tres, err := m.Get()\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, uint64(6), bytesToUint64(res))\n\t\t})\n\t})\n\tt.Run(\"Old keys should be removed after compaction\", func(t *testing.T) {\n\t\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\t\tdefer removeDir(dir)\n\n\t\t// This test relies on CompactL0OnClose\n\t\topts := getTestOptions(dir).WithCompactL0OnClose(true)\n\t\tdb, err := Open(opts)\n\t\trequire.NoError(t, 
err)\n\t\tmergeKey := []byte(\"foo\")\n\t\tm := db.GetMergeOperator(mergeKey, add, 2*time.Millisecond)\n\n\t\tcount := 5000 // This will cause compaction from L0->L1\n\t\tfor i := 0; i < count; i++ {\n\t\t\trequire.NoError(t, m.Add(uint64ToBytes(1)))\n\t\t}\n\t\tvalue, err := m.Get()\n\t\trequire.Nil(t, err)\n\t\trequire.Equal(t, uint64(count), bytesToUint64(value))\n\t\tm.Stop()\n\n\t\t// Force compaction by closing DB. The compaction should discard all the old merged values\n\t\trequire.Nil(t, db.Close())\n\t\tdb, err = Open(opts)\n\t\trequire.NoError(t, err)\n\t\tdefer db.Close()\n\n\t\tkeyCount := 0\n\t\ttxn := db.NewTransaction(false)\n\t\tdefer txn.Discard()\n\t\tiopt := DefaultIteratorOptions\n\t\tiopt.AllVersions = true\n\t\tit := txn.NewKeyIterator(mergeKey, iopt)\n\t\tdefer it.Close()\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\tkeyCount++\n\t\t}\n\t\t// We should have only one key in badger. All the other keys should've been removed by\n\t\t// compaction\n\t\trequire.Equal(t, 1, keyCount)\n\t})\n\n}\n\nfunc uint64ToBytes(i uint64) []byte {\n\tvar buf [8]byte\n\tbinary.BigEndian.PutUint64(buf[:], i)\n\treturn buf[:]\n}\n\nfunc bytesToUint64(b []byte) uint64 {\n\treturn binary.BigEndian.Uint64(b)\n}\n\n// Merge function to add two uint64 numbers\nfunc add(existing, latest []byte) []byte {\n\treturn uint64ToBytes(bytesToUint64(existing) + bytesToUint64(latest))\n}\n"
  },
  {
    "path": "metrics_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"expvar\"\n\t\"math/rand\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n)\n\nfunc clearAllMetrics() {\n\texpvar.Do(func(kv expvar.KeyValue) {\n\t\t// Reset the value of each expvar variable based on its type\n\t\tswitch v := kv.Value.(type) {\n\t\tcase *expvar.Int:\n\t\t\tv.Set(0)\n\t\tcase *expvar.Float:\n\t\t\tv.Set(0)\n\t\tcase *expvar.Map:\n\t\t\tv.Init()\n\t\tcase *expvar.String:\n\t\t\tv.Set(\"\")\n\t\t}\n\t})\n}\n\nfunc TestWriteMetrics(t *testing.T) {\n\topt := getTestOptions(\"\")\n\topt.managedTxns = true\n\topt.CompactL0OnClose = true\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tclearAllMetrics()\n\t\tnum := 10\n\t\tval := make([]byte, 1<<12)\n\t\tkey := make([]byte, 40)\n\t\tfor i := 0; i < num; i++ {\n\t\t\t_, err := rand.Read(key)\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = rand.Read(val)\n\t\t\trequire.NoError(t, err)\n\n\t\t\twriter := db.NewManagedWriteBatch()\n\t\t\trequire.NoError(t, writer.SetEntryAt(NewEntry(key, val), 1))\n\t\t\twriter.Flush()\n\t\t}\n\n\t\texpectedSize := int64(len(val)) + 48 + 2 // 48 := size of key (40 + 8(ts)), 2 := meta\n\t\twrite_metric := expvar.Get(\"badger_write_bytes_user\")\n\t\trequire.Equal(t, expectedSize*int64(num), write_metric.(*expvar.Int).Value())\n\n\t\tput_metric := expvar.Get(\"badger_put_num_user\")\n\t\trequire.Equal(t, int64(num), put_metric.(*expvar.Int).Value())\n\n\t\tlsm_metric := expvar.Get(\"badger_write_bytes_l0\")\n\t\trequire.Equal(t, expectedSize*int64(num), lsm_metric.(*expvar.Int).Value())\n\n\t\tcompactionMetric := expvar.Get(\"badger_write_bytes_compaction\").(*expvar.Map)\n\t\trequire.Equal(t, nil, compactionMetric.Get(\"l6\"))\n\n\t\t// Force compaction\n\t\tdb.Close()\n\n\t\t_, err := OpenManaged(opt)\n\t\trequire.NoError(t, err)\n\n\t\tcompactionMetric = 
expvar.Get(\"badger_write_bytes_compaction\").(*expvar.Map)\n\t\trequire.GreaterOrEqual(t, expectedSize*int64(num)+int64(num*200), compactionMetric.Get(\"l6\").(*expvar.Int).Value())\n\t\t// Because we have random values, compression is not able to do much, so we incur a cost on total size\n\t})\n}\n\nfunc TestVlogMetrics(t *testing.T) {\n\topt := getTestOptions(\"\")\n\topt.managedTxns = true\n\topt.CompactL0OnClose = true\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tclearAllMetrics()\n\t\tnum := 10\n\t\tval := make([]byte, 1<<20) // Large Value\n\t\tkey := make([]byte, 40)\n\t\tfor i := 0; i < num; i++ {\n\t\t\t_, err := rand.Read(key)\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = rand.Read(val)\n\t\t\trequire.NoError(t, err)\n\n\t\t\twriter := db.NewManagedWriteBatch()\n\t\t\trequire.NoError(t, writer.SetEntryAt(NewEntry(key, val), 1))\n\t\t\twriter.Flush()\n\t\t}\n\n\t\texpectedSize := int64(len(val)) + 200 // vlog expected size\n\n\t\ttotalWrites := expvar.Get(\"badger_write_num_vlog\")\n\t\trequire.Equal(t, int64(num), totalWrites.(*expvar.Int).Value())\n\n\t\tbytesWritten := expvar.Get(\"badger_write_bytes_vlog\")\n\t\trequire.GreaterOrEqual(t, expectedSize*int64(num), bytesWritten.(*expvar.Int).Value())\n\n\t\ttxn := db.NewTransactionAt(2, false)\n\t\titem, err := txn.Get(key)\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, uint64(1), item.Version())\n\n\t\terr = item.Value(func(val []byte) error {\n\t\t\ttotalReads := expvar.Get(\"badger_read_num_vlog\")\n\t\t\tbytesRead := expvar.Get(\"badger_read_bytes_vlog\")\n\t\t\trequire.Equal(t, int64(1), totalReads.(*expvar.Int).Value())\n\t\t\trequire.GreaterOrEqual(t, expectedSize, bytesRead.(*expvar.Int).Value())\n\t\t\treturn nil\n\t\t})\n\n\t\trequire.NoError(t, err)\n\t})\n}\n\nfunc TestReadMetrics(t *testing.T) {\n\topt := getTestOptions(\"\")\n\topt.managedTxns = true\n\topt.CompactL0OnClose = true\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tclearAllMetrics()\n\t\tnum := 
10\n\t\tval := make([]byte, 1<<15)\n\t\tkeys := [][]byte{}\n\t\twriter := db.NewManagedWriteBatch()\n\t\tfor i := 0; i < num; i++ {\n\t\t\tkeyB := key(\"byte\", 1)\n\t\t\tkeys = append(keys, []byte(keyB))\n\n\t\t\t_, err := rand.Read(val)\n\t\t\trequire.NoError(t, err)\n\n\t\t\trequire.NoError(t, writer.SetEntryAt(NewEntry([]byte(keyB), val), 1))\n\t\t}\n\t\twriter.Flush()\n\n\t\ttxn := db.NewTransactionAt(2, false)\n\t\titem, err := txn.Get(keys[0])\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, uint64(1), item.Version())\n\n\t\ttotalGets := expvar.Get(\"badger_get_num_user\")\n\t\trequire.Equal(t, int64(1), totalGets.(*expvar.Int).Value())\n\n\t\ttotalMemtableReads := expvar.Get(\"badger_get_num_memtable\")\n\t\trequire.Equal(t, int64(1), totalMemtableReads.(*expvar.Int).Value())\n\n\t\ttotalLSMGets := expvar.Get(\"badger_get_num_lsm\")\n\t\trequire.Nil(t, totalLSMGets.(*expvar.Map).Get(\"l6\"))\n\n\t\t// Force compaction\n\t\tdb.Close()\n\n\t\tdb, err = OpenManaged(opt)\n\t\trequire.NoError(t, err)\n\n\t\ttxn = db.NewTransactionAt(2, false)\n\t\titem, err = txn.Get(keys[0])\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, uint64(1), item.Version())\n\n\t\t_, err = txn.Get([]byte(key(\"abdbyte\", 1000))) // val should be far enough that bloom filter doesn't hit\n\t\trequire.Error(t, err)\n\n\t\ttotalLSMGets = expvar.Get(\"badger_get_num_lsm\")\n\t\trequire.Equal(t, int64(0x1), totalLSMGets.(*expvar.Map).Get(\"l6\").(*expvar.Int).Value())\n\n\t\ttotalBloom := expvar.Get(\"badger_hit_num_lsm_bloom_filter\")\n\t\trequire.Equal(t, int64(0x1), totalBloom.(*expvar.Map).Get(\"l6\").(*expvar.Int).Value())\n\t\trequire.Equal(t, int64(0x1), totalBloom.(*expvar.Map).Get(\"DoesNotHave_HIT\").(*expvar.Int).Value())\n\t\trequire.Equal(t, int64(0x2), totalBloom.(*expvar.Map).Get(\"DoesNotHave_ALL\").(*expvar.Int).Value())\n\n\t\tbytesLSM := expvar.Get(\"badger_read_bytes_lsm\")\n\t\trequire.Equal(t, int64(len(val)), bytesLSM.(*expvar.Int).Value())\n\n\t\tgetWithResult := 
expvar.Get(\"badger_get_with_result_num_user\")\n\t\trequire.Equal(t, int64(2), getWithResult.(*expvar.Int).Value())\n\n\t\titerOpts := DefaultIteratorOptions\n\t\titer := txn.NewKeyIterator(keys[0], iterOpts)\n\t\titer.Seek(keys[0])\n\n\t\trangeQueries := expvar.Get(\"badger_iterator_num_user\")\n\t\trequire.Equal(t, int64(1), rangeQueries.(*expvar.Int).Value())\n\t})\n}\n"
  },
  {
    "path": "options/options.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage options\n\n// ChecksumVerificationMode tells when should DB verify checksum for SSTable blocks.\ntype ChecksumVerificationMode int\n\nconst (\n\t// NoVerification indicates DB should not verify checksum for SSTable blocks.\n\tNoVerification ChecksumVerificationMode = iota\n\t// OnTableRead indicates checksum should be verified while opening SSTtable.\n\tOnTableRead\n\t// OnBlockRead indicates checksum should be verified on every SSTable block read.\n\tOnBlockRead\n\t// OnTableAndBlockRead indicates checksum should be verified\n\t// on SSTable opening and on every block read.\n\tOnTableAndBlockRead\n)\n\n// CompressionType specifies how a block should be compressed.\ntype CompressionType uint32\n\nconst (\n\t// None mode indicates that a block is not compressed.\n\tNone CompressionType = 0\n\t// Snappy mode indicates that a block is compressed using Snappy algorithm.\n\tSnappy CompressionType = 1\n\t// ZSTD mode indicates that a block is compressed using ZSTD algorithm.\n\tZSTD CompressionType = 2\n)\n"
  },
  {
    "path": "options.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"os\"\n\t\"reflect\"\n\t\"strconv\"\n\t\"strings\"\n\t\"time\"\n\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\n// Note: If you add a new option X make sure you also add a WithX method on Options.\n\n// Options are params for creating DB object.\n//\n// This package provides DefaultOptions which contains options that should\n// work for most applications. Consider using that as a starting point before\n// customizing it for your own needs.\n//\n// Each option X is documented on the WithX method.\ntype Options struct {\n\ttestOnlyOptions\n\n\t// Required options.\n\n\tDir      string\n\tValueDir string\n\n\t// Usually modified options.\n\n\tSyncWrites        bool\n\tNumVersionsToKeep int\n\tReadOnly          bool\n\tLogger            Logger\n\tCompression       options.CompressionType\n\tInMemory          bool\n\tMetricsEnabled    bool\n\t// Sets the Stream.numGo field\n\tNumGoroutines int\n\n\t// Fine tuning options.\n\n\tMemTableSize        int64\n\tBaseTableSize       int64\n\tBaseLevelSize       int64\n\tLevelSizeMultiplier int\n\tTableSizeMultiplier int\n\tMaxLevels           int\n\n\tVLogPercentile float64\n\tValueThreshold int64\n\tNumMemtables   int\n\t// Changing BlockSize across DB runs will not break badger. 
The block size is\n\t// read from the block index stored at the end of the table.\n\tBlockSize          int\n\tBloomFalsePositive float64\n\tBlockCacheSize     int64\n\tIndexCacheSize     int64\n\n\tNumLevelZeroTables      int\n\tNumLevelZeroTablesStall int\n\n\tValueLogFileSize   int64\n\tValueLogMaxEntries uint32\n\n\tNumCompactors        int\n\tCompactL0OnClose     bool\n\tLmaxCompaction       bool\n\tZSTDCompressionLevel int\n\n\t// When set, checksum will be validated for each entry read from the value log file.\n\tVerifyValueChecksum bool\n\n\t// Encryption related options.\n\tEncryptionKey                 []byte        // encryption key\n\tEncryptionKeyRotationDuration time.Duration // key rotation duration\n\n\t// BypassLockGuard will bypass the lock guard on badger. Bypassing lock\n\t// guard can cause data corruption if multiple badger instances are using\n\t// the same directory. Use this option with caution.\n\tBypassLockGuard bool\n\n\t// ChecksumVerificationMode decides when db should verify checksums for SSTable blocks.\n\tChecksumVerificationMode options.ChecksumVerificationMode\n\n\t// DetectConflicts determines whether the transactions would be checked for\n\t// conflicts. The transactions can be processed at a higher rate when\n\t// conflict detection is disabled.\n\tDetectConflicts bool\n\n\t// NamespaceOffset specifies the offset from where the next 8 bytes contains the namespace.\n\tNamespaceOffset int\n\n\t// Magic version used by the application using badger to ensure that it doesn't open the DB\n\t// with incompatible data format.\n\tExternalMagicVersion uint16\n\n\t// Transaction start and commit timestamps are managed by end-user.\n\t// This is only useful for databases built on top of Badger (like Dgraph).\n\t// Not recommended for most users.\n\tmanagedTxns bool\n\n\t// 4. 
Flags for testing purposes\n\t// ------------------------------\n\tmaxBatchCount int64 // max entries in batch\n\tmaxBatchSize  int64 // max batch size in bytes\n\n\tmaxValueThreshold float64\n}\n\n// DefaultOptions sets a list of recommended options for good performance.\n// Feel free to modify these to suit your needs with the WithX methods.\nfunc DefaultOptions(path string) Options {\n\treturn Options{\n\t\tDir:      path,\n\t\tValueDir: path,\n\n\t\tMemTableSize:        64 << 20,\n\t\tBaseTableSize:       2 << 20,\n\t\tBaseLevelSize:       10 << 20,\n\t\tTableSizeMultiplier: 2,\n\t\tLevelSizeMultiplier: 10,\n\t\tMaxLevels:           7,\n\t\tNumGoroutines:       8,\n\t\tMetricsEnabled:      true,\n\n\t\tNumCompactors:           4, // Run at least 2 compactors. Zero-th compactor prioritizes L0.\n\t\tNumLevelZeroTables:      5,\n\t\tNumLevelZeroTablesStall: 15,\n\t\tNumMemtables:            5,\n\t\tBloomFalsePositive:      0.01,\n\t\tBlockSize:               4 * 1024,\n\t\tSyncWrites:              false,\n\t\tNumVersionsToKeep:       1,\n\t\tCompactL0OnClose:        false,\n\t\tVerifyValueChecksum:     false,\n\t\tCompression:             options.Snappy,\n\t\tBlockCacheSize:          256 << 20,\n\t\tIndexCacheSize:          0,\n\n\t\t// The following benchmarks were done on a 4 KB block size (default block size). The\n\t\t// compression is ratio supposed to increase with increasing compression level but since the\n\t\t// input for compression algorithm is small (4 KB), we don't get significant benefit at\n\t\t// level 3.\n\t\t// NOTE: The benchmarks are with DataDog ZSTD that requires CGO. 
Hence, no longer valid.\n\t\t// no_compression-16              10\t 502848865 ns/op\t 165.46 MB/s\t-\n\t\t// zstd_compression/level_1-16     7\t 739037966 ns/op\t 112.58 MB/s\t2.93\n\t\t// zstd_compression/level_3-16     7\t 756950250 ns/op\t 109.91 MB/s\t2.72\n\t\t// zstd_compression/level_15-16    1\t11135686219 ns/op\t   7.47 MB/s\t4.38\n\t\t// Benchmark code can be found in table/builder_test.go file\n\t\tZSTDCompressionLevel: 1,\n\n\t\t// (2^30 - 1)*2 when mmapping < 2^31 - 1, max int32.\n\t\t// -1 so 2*ValueLogFileSize won't overflow on 32-bit systems.\n\t\tValueLogFileSize: 1<<30 - 1,\n\n\t\tValueLogMaxEntries: 1000000,\n\n\t\tVLogPercentile: 0.0,\n\t\tValueThreshold: maxValueThreshold,\n\n\t\tLogger:                        defaultLogger(INFO),\n\t\tEncryptionKey:                 []byte{},\n\t\tEncryptionKeyRotationDuration: 10 * 24 * time.Hour, // Default 10 days.\n\t\tDetectConflicts:               true,\n\t\tNamespaceOffset:               -1,\n\t}\n}\n\nfunc buildTableOptions(db *DB) table.Options {\n\topt := db.opt\n\tdk, err := db.registry.LatestDataKey()\n\ty.Check(err)\n\treturn table.Options{\n\t\tReadOnly:             opt.ReadOnly,\n\t\tMetricsEnabled:       db.opt.MetricsEnabled,\n\t\tTableSize:            uint64(opt.BaseTableSize),\n\t\tBlockSize:            opt.BlockSize,\n\t\tBloomFalsePositive:   opt.BloomFalsePositive,\n\t\tChkMode:              opt.ChecksumVerificationMode,\n\t\tCompression:          opt.Compression,\n\t\tZSTDCompressionLevel: opt.ZSTDCompressionLevel,\n\t\tBlockCache:           db.blockCache,\n\t\tIndexCache:           db.indexCache,\n\t\tAllocPool:            db.allocPool,\n\t\tDataKey:              dk,\n\t}\n}\n\nconst (\n\tmaxValueThreshold = (1 << 20) // 1 MB\n)\n\n// LSMOnlyOptions follows from DefaultOptions, but sets a higher ValueThreshold\n// so values would be collocated with the LSM tree, with value log largely acting\n// as a write-ahead log only. 
These options would reduce the disk usage of value\n// log, and make Badger act more like a typical LSM tree.\nfunc LSMOnlyOptions(path string) Options {\n\t// Let's not set any other options, because they can cause issues with the\n\t// size of key-value a user can pass to Badger. For example, if we set\n\t// ValueLogFileSize to 64MB, a user can't pass a value more than that.\n\t// Setting ValueLogMaxEntries to 1000 can generate too many files.\n\t// These options are better configured on a usage basis, than broadly here.\n\t// The ValueThreshold is the most important setting a user needs to do to\n\t// achieve a heavier usage of LSM tree.\n\t// NOTE: If a user does not want to set 1MB as the ValueThreshold because\n\t// of performance reasons, 1KB would be a good option too, allowing\n\t// values smaller than 1KB to be collocated with the keys in the LSM tree.\n\treturn DefaultOptions(path).WithValueThreshold(maxValueThreshold /* 1 MB */)\n}\n\n// parseCompression returns the options.CompressionType and compression level given a compression\n// string of format compression-type:compression-level\nfunc parseCompression(cStr string) (options.CompressionType, int, error) {\n\tcStrSplit := strings.Split(cStr, \":\")\n\tcType := cStrSplit[0]\n\tlevel := 3\n\n\tvar err error\n\tif len(cStrSplit) == 2 {\n\t\tlevel, err = strconv.Atoi(cStrSplit[1])\n\t\ty.Check(err)\n\t\tif level <= 0 {\n\t\t\treturn 0, 0,\n\t\t\t\tfmt.Errorf(\"ERROR: compression level(%v) must be greater than zero\", level)\n\t\t}\n\t} else if len(cStrSplit) > 2 {\n\t\treturn 0, 0, fmt.Errorf(\"ERROR: Invalid badger.compression argument\")\n\t}\n\tswitch cType {\n\tcase \"zstd\":\n\t\treturn options.ZSTD, level, nil\n\tcase \"snappy\":\n\t\treturn options.Snappy, 0, nil\n\tcase \"none\":\n\t\treturn options.None, 0, nil\n\t}\n\treturn 0, 0, fmt.Errorf(\"ERROR: compression type (%s) invalid\", cType)\n}\n\n// generateSuperFlag generates an identical SuperFlag string from the provided Options.\nfunc 
generateSuperFlag(options Options) string {\n\tsuperflag := \"\"\n\tv := reflect.ValueOf(&options).Elem()\n\toptionsStruct := v.Type()\n\tfor i := 0; i < v.NumField(); i++ {\n\t\tif field := v.Field(i); field.CanInterface() {\n\t\t\tname := strings.ToLower(optionsStruct.Field(i).Name)\n\t\t\tkind := v.Field(i).Kind()\n\t\t\tswitch kind {\n\t\t\tcase reflect.Bool:\n\t\t\t\tsuperflag += name + \"=\"\n\t\t\t\tsuperflag += fmt.Sprintf(\"%v; \", field.Bool())\n\t\t\tcase reflect.Int, reflect.Int64:\n\t\t\t\tsuperflag += name + \"=\"\n\t\t\t\tsuperflag += fmt.Sprintf(\"%v; \", field.Int())\n\t\t\tcase reflect.Uint32, reflect.Uint64:\n\t\t\t\tsuperflag += name + \"=\"\n\t\t\t\tsuperflag += fmt.Sprintf(\"%v; \", field.Uint())\n\t\t\tcase reflect.Float64:\n\t\t\t\tsuperflag += name + \"=\"\n\t\t\t\tsuperflag += fmt.Sprintf(\"%v; \", field.Float())\n\t\t\tcase reflect.String:\n\t\t\t\tsuperflag += name + \"=\"\n\t\t\t\tsuperflag += fmt.Sprintf(\"%v; \", field.String())\n\t\t\tdefault:\n\t\t\t\tcontinue\n\t\t\t}\n\t\t}\n\t}\n\treturn superflag\n}\n\n// FromSuperFlag fills Options fields for each flag within the superflag. For\n// example, replacing the default Options.NumGoroutines:\n//\n//\toptions := DefaultOptions(\"\").FromSuperFlag(\"numgoroutines=4\")\n//\n// It's important to note that if you pass an empty Options struct, FromSuperFlag\n// will not fill it with default values. 
FromSuperFlag only writes to the fields\n// present within the superflag string (case insensitive).\n//\n// It specially handles compression subflag.\n// Valid options are {none,snappy,zstd:<level>}\n// Example: compression=zstd:3;\n// Unsupported: Options.Logger, Options.EncryptionKey\nfunc (opt Options) FromSuperFlag(superflag string) Options {\n\t// currentOptions act as a default value for the options superflag.\n\tcurrentOptions := generateSuperFlag(opt)\n\tcurrentOptions += \"compression=;\"\n\n\tflags := z.NewSuperFlag(superflag).MergeAndCheckDefault(currentOptions)\n\tv := reflect.ValueOf(&opt).Elem()\n\toptionsStruct := v.Type()\n\tfor i := 0; i < v.NumField(); i++ {\n\t\t// only iterate over exported fields\n\t\tif field := v.Field(i); field.CanInterface() {\n\t\t\t// z.SuperFlag stores keys as lowercase, keep everything case\n\t\t\t// insensitive\n\t\t\tname := strings.ToLower(optionsStruct.Field(i).Name)\n\t\t\tif name == \"compression\" {\n\t\t\t\t// We will specially handle this later. 
Skip it here.\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tkind := v.Field(i).Kind()\n\t\t\tswitch kind {\n\t\t\tcase reflect.Bool:\n\t\t\t\tfield.SetBool(flags.GetBool(name))\n\t\t\tcase reflect.Int, reflect.Int64:\n\t\t\t\tfield.SetInt(flags.GetInt64(name))\n\t\t\tcase reflect.Uint32, reflect.Uint64:\n\t\t\t\tfield.SetUint(flags.GetUint64(name))\n\t\t\tcase reflect.Float64:\n\t\t\t\tfield.SetFloat(flags.GetFloat64(name))\n\t\t\tcase reflect.String:\n\t\t\t\tfield.SetString(flags.GetString(name))\n\t\t\t}\n\t\t}\n\t}\n\n\t// Only update the options for special flags that were present in the input superflag.\n\tinputFlag := z.NewSuperFlag(superflag)\n\tif inputFlag.Has(\"compression\") {\n\t\tctype, clevel, err := parseCompression(flags.GetString(\"compression\"))\n\t\tswitch err {\n\t\tcase nil:\n\t\t\topt.Compression = ctype\n\t\t\topt.ZSTDCompressionLevel = clevel\n\t\tdefault:\n\t\t\tctype = options.CompressionType(flags.GetUint32(\"compression\"))\n\t\t\ty.AssertTruef(ctype <= 2, \"ERROR: Invalid format or compression type. 
Got: %s\",\n\t\t\t\tflags.GetString(\"compression\"))\n\t\t\topt.Compression = ctype\n\t\t}\n\t}\n\n\treturn opt\n}\n\n// WithDir returns a new Options value with Dir set to the given value.\n//\n// Dir is the path of the directory where key data will be stored in.\n// If it doesn't exist, Badger will try to create it for you.\n// This is set automatically to be the path given to `DefaultOptions`.\nfunc (opt Options) WithDir(val string) Options {\n\topt.Dir = val\n\treturn opt\n}\n\n// WithValueDir returns a new Options value with ValueDir set to the given value.\n//\n// ValueDir is the path of the directory where value data will be stored in.\n// If it doesn't exist, Badger will try to create it for you.\n// This is set automatically to be the path given to `DefaultOptions`.\nfunc (opt Options) WithValueDir(val string) Options {\n\topt.ValueDir = val\n\treturn opt\n}\n\n// WithSyncWrites returns a new Options value with SyncWrites set to the given value.\n//\n// Badger does all writes via mmap. So, all writes can survive process crashes or k8s environments\n// with SyncWrites set to false.\n//\n// When set to true, Badger would call an additional msync after writes to flush mmap buffer over to\n// disk to survive hard reboots. 
Most users of Badger should not need to do this.\n//\n// The default value of SyncWrites is false.\nfunc (opt Options) WithSyncWrites(val bool) Options {\n\topt.SyncWrites = val\n\treturn opt\n}\n\n// WithNumVersionsToKeep returns a new Options value with NumVersionsToKeep set to the given value.\n//\n// NumVersionsToKeep sets how many versions to keep per key at most.\n//\n// The default value of NumVersionsToKeep is 1.\nfunc (opt Options) WithNumVersionsToKeep(val int) Options {\n\topt.NumVersionsToKeep = val\n\treturn opt\n}\n\n// WithNumGoroutines sets the number of goroutines to be used in Stream.\n//\n// The default value of NumGoroutines is 8.\nfunc (opt Options) WithNumGoroutines(val int) Options {\n\topt.NumGoroutines = val\n\treturn opt\n}\n\n// WithReadOnly returns a new Options value with ReadOnly set to the given value.\n//\n// When ReadOnly is true the DB will be opened on read-only mode.\n// Multiple processes can open the same Badger DB.\n// Note: if the DB being opened had crashed before and has vlog data to be replayed,\n// ReadOnly will cause Open to fail with an appropriate message.\n//\n// The default value of ReadOnly is false.\nfunc (opt Options) WithReadOnly(val bool) Options {\n\topt.ReadOnly = val\n\treturn opt\n}\n\n// WithMetricsEnabled returns a new Options value with MetricsEnabled set to the given value.\n//\n// When MetricsEnabled is set to false, then the DB will be opened and no badger metrics\n// will be logged. Metrics are defined in metric.go file.\n//\n// This flag is useful for use cases like in Dgraph where we open temporary badger instances to\n// index data. 
In those cases we don't want badger metrics to be polluted with the noise from\n// those temporary instances.\n//\n// Default value is set to true\nfunc (opt Options) WithMetricsEnabled(val bool) Options {\n\topt.MetricsEnabled = val\n\treturn opt\n}\n\n// WithLogger returns a new Options value with Logger set to the given value.\n//\n// Logger provides a way to configure what logger each value of badger.DB uses.\n//\n// The default value of Logger writes to stderr using the log package from the Go standard library.\nfunc (opt Options) WithLogger(val Logger) Options {\n\topt.Logger = val\n\treturn opt\n}\n\n// WithLoggingLevel returns a new Options value with logging level of the\n// default logger set to the given value.\n// LoggingLevel sets the level of logging. It should be one of DEBUG, INFO,\n// WARNING or ERROR levels.\n//\n// The default value of LoggingLevel is INFO.\nfunc (opt Options) WithLoggingLevel(val loggingLevel) Options {\n\topt.Logger = defaultLogger(val)\n\treturn opt\n}\n\n// WithBaseTableSize returns a new Options value with BaseTableSize set to the given value.\n//\n// BaseTableSize sets the maximum size in bytes for LSM table or file in the base level.\n//\n// The default value of BaseTableSize is 2MB.\nfunc (opt Options) WithBaseTableSize(val int64) Options {\n\topt.BaseTableSize = val\n\treturn opt\n}\n\n// WithLevelSizeMultiplier returns a new Options value with LevelSizeMultiplier set to the given\n// value.\n//\n// LevelSizeMultiplier sets the ratio between the maximum sizes of contiguous levels in the LSM.\n// Once a level grows to be larger than this ratio allowed, the compaction process will be\n// triggered.\n//\n// The default value of LevelSizeMultiplier is 10.\nfunc (opt Options) WithLevelSizeMultiplier(val int) Options {\n\topt.LevelSizeMultiplier = val\n\treturn opt\n}\n\n// WithMaxLevels returns a new Options value with MaxLevels set to the given value.\n//\n// Maximum number of levels of compaction allowed in the LSM.\n//\n// 
The default value of MaxLevels is 7.\nfunc (opt Options) WithMaxLevels(val int) Options {\n\topt.MaxLevels = val\n\treturn opt\n}\n\n// WithValueThreshold returns a new Options value with ValueThreshold set to the given value.\n//\n// ValueThreshold sets the threshold used to decide whether a value is stored directly in the LSM\n// tree or separately in the value log files.\n//\n// The default value of ValueThreshold is 1 MB, and LSMOnlyOptions sets it to maxValueThreshold\n// which is set to 1 MB too.\nfunc (opt Options) WithValueThreshold(val int64) Options {\n\topt.ValueThreshold = val\n\treturn opt\n}\n\n// WithVLogPercentile returns a new Options value with VLogPercentile set to the given value.\n//\n// VLogPercentile with 0.0 means no dynamic thresholding is enabled.\n// ValueThreshold will always act as the value threshold.\n//\n// VLogPercentile with value 0.99 means 99 percent of values will be put in the LSM tree\n// and only 1 percent in the vlog. The value threshold will be dynamically updated within the\n// range [ValueThreshold, Options.maxValueThreshold].\n//\n// VLogPercentile with 1.0 means the threshold will eventually be set to Options.maxValueThreshold.\n//\n// The default value of VLogPercentile is 0.0.\nfunc (opt Options) WithVLogPercentile(t float64) Options {\n\topt.VLogPercentile = t\n\treturn opt\n}\n\n// WithNumMemtables returns a new Options value with NumMemtables set to the given value.\n//\n// NumMemtables sets the maximum number of tables to keep in memory before stalling.\n//\n// The default value of NumMemtables is 5.\nfunc (opt Options) WithNumMemtables(val int) Options {\n\topt.NumMemtables = val\n\treturn opt\n}\n\n// WithMemTableSize returns a new Options value with MemTableSize set to the given value.\n//\n// MemTableSize sets the maximum size in bytes for a memtable.\n//\n// The default value of MemTableSize is 64MB.\nfunc (opt Options) WithMemTableSize(val int64) Options {\n\topt.MemTableSize = val\n\treturn opt\n}\n\n// 
WithBloomFalsePositive returns a new Options value with BloomFalsePositive set\n// to the given value.\n//\n// BloomFalsePositive sets the false positive probability of the bloom filter in any SSTable.\n// Before reading a key from table, the bloom filter is checked for key existence.\n// BloomFalsePositive might impact read performance of DB. Lower BloomFalsePositive value might\n// consume more memory.\n//\n// The default value of BloomFalsePositive is 0.01.\n//\n// Setting this to 0 disables the bloom filter completely.\nfunc (opt Options) WithBloomFalsePositive(val float64) Options {\n\topt.BloomFalsePositive = val\n\treturn opt\n}\n\n// WithBlockSize returns a new Options value with BlockSize set to the given value.\n//\n// BlockSize sets the size of any block in SSTable. SSTable is divided into multiple blocks\n// internally. Each block is compressed using prefix diff encoding.\n//\n// The default value of BlockSize is 4KB.\nfunc (opt Options) WithBlockSize(val int) Options {\n\topt.BlockSize = val\n\treturn opt\n}\n\n// WithNumLevelZeroTables sets the maximum number of Level 0 tables before compaction starts.\n//\n// The default value of NumLevelZeroTables is 5.\nfunc (opt Options) WithNumLevelZeroTables(val int) Options {\n\topt.NumLevelZeroTables = val\n\treturn opt\n}\n\n// WithNumLevelZeroTablesStall sets the number of Level 0 tables that once reached causes the DB to\n// stall until compaction succeeds.\n//\n// The default value of NumLevelZeroTablesStall is 15.\nfunc (opt Options) WithNumLevelZeroTablesStall(val int) Options {\n\topt.NumLevelZeroTablesStall = val\n\treturn opt\n}\n\n// WithBaseLevelSize sets the maximum size target for the base level.\n//\n// The default value is 10MB.\nfunc (opt Options) WithBaseLevelSize(val int64) Options {\n\topt.BaseLevelSize = val\n\treturn opt\n}\n\n// WithValueLogFileSize sets the maximum size of a single value log file.\n//\n// The default value of ValueLogFileSize is 1GB.\nfunc (opt Options) 
WithValueLogFileSize(val int64) Options {\n\topt.ValueLogFileSize = val\n\treturn opt\n}\n\n// WithValueLogMaxEntries sets the maximum number of entries a value log file\n// can hold approximately.  The actual size limit of a value log file is the\n// minimum of ValueLogFileSize and ValueLogMaxEntries.\n//\n// The default value of ValueLogMaxEntries is one million (1000000).\nfunc (opt Options) WithValueLogMaxEntries(val uint32) Options {\n\topt.ValueLogMaxEntries = val\n\treturn opt\n}\n\n// WithNumCompactors sets the number of compaction workers to run concurrently.  Setting this to\n// zero stops compactions, which could eventually cause writes to block forever.\n//\n// The default value of NumCompactors is 4. One is dedicated just for L0 and L1.\nfunc (opt Options) WithNumCompactors(val int) Options {\n\topt.NumCompactors = val\n\treturn opt\n}\n\n// WithCompactL0OnClose determines whether Level 0 should be compacted before closing the DB.  This\n// ensures that both reads and writes are efficient when the DB is opened later.\n//\n// The default value of CompactL0OnClose is false.\nfunc (opt Options) WithCompactL0OnClose(val bool) Options {\n\topt.CompactL0OnClose = val\n\treturn opt\n}\n\n// WithEncryptionKey is used to encrypt the data with AES. Type of AES is used based on the key\n// size. For example 16 bytes will use AES-128. 24 bytes will use AES-192. 32 bytes will\n// use AES-256.\nfunc (opt Options) WithEncryptionKey(key []byte) Options {\n\topt.EncryptionKey = key\n\treturn opt\n}\n\n// WithEncryptionKeyRotationDuration returns new Options value with the duration set to\n// the given value.\n//\n// Key Registry will use this duration to create new keys. If the previously generated\n// key is older than the given duration, the key registry will create a new key.\n//\n// The default value is set to 10 days.\nfunc (opt Options) WithEncryptionKeyRotationDuration(d time.Duration) Options {\n\topt.EncryptionKeyRotationDuration = d\n\treturn opt\n}\n\n// WithCompression is used to enable or disable compression. When compression is enabled, every\n// block will be compressed using the specified algorithm.  This option doesn't affect existing\n// tables. Only the newly created tables will be compressed.\n//\n// The default compression algorithm used is snappy. Compression is enabled by default.\nfunc (opt Options) WithCompression(cType options.CompressionType) Options {\n\topt.Compression = cType\n\treturn opt\n}\n\n// WithVerifyValueChecksum is used to set VerifyValueChecksum. When VerifyValueChecksum is set to\n// true, checksum will be verified for every entry read from the value log. If the value is stored\n// in SST (value size less than value threshold) then the checksum validation will not be done.\n//\n// The default value of VerifyValueChecksum is false.\nfunc (opt Options) WithVerifyValueChecksum(val bool) Options {\n\topt.VerifyValueChecksum = val\n\treturn opt\n}\n\n// WithChecksumVerificationMode returns a new Options value with ChecksumVerificationMode set to\n// the given value.\n//\n// ChecksumVerificationMode indicates when the db should verify checksums for SSTable blocks.\n//\n// The default value of ChecksumVerificationMode is options.NoVerification.\nfunc (opt Options) WithChecksumVerificationMode(cvMode options.ChecksumVerificationMode) Options {\n\topt.ChecksumVerificationMode = cvMode\n\treturn opt\n}\n\n// WithBlockCacheSize returns a new Options value with BlockCacheSize set to the given value.\n//\n// This value specifies how much data cache should hold in memory. A small size\n// of cache means lower memory consumption and lookups/iterations would take\n// longer. 
It is recommended to use a cache if you're using compression or encryption.\n// If compression and encryption both are disabled, adding a cache will lead to\n// unnecessary overhead which will affect the read performance. Setting size to\n// zero disables the cache altogether.\n//\n// Default value of BlockCacheSize is 256 MB.\nfunc (opt Options) WithBlockCacheSize(size int64) Options {\n\topt.BlockCacheSize = size\n\treturn opt\n}\n\n// WithInMemory returns a new Options value with InMemory set to the given value.\n//\n// When badger is running in InMemory mode, everything is stored in memory. No value/sst files are\n// created. In case of a crash all data will be lost.\nfunc (opt Options) WithInMemory(b bool) Options {\n\topt.InMemory = b\n\treturn opt\n}\n\n// WithZSTDCompressionLevel returns a new Options value with ZSTDCompressionLevel set\n// to the given value.\n//\n// The ZSTD compression algorithm supports 20 compression levels. The higher the compression\n// level, the better the compression ratio but the lower the performance.\n// We recommend using ZSTD compression level 1. Any level higher than 1 seems to\n// deteriorate badger's performance.\n// The following benchmarks were done on a 4 KB block size (default block size). The compression\n// ratio is supposed to increase with the compression level, but since the input to the\n// compression algorithm is small (4 KB), we don't get a significant benefit at level 3. It is\n// advised to write your own benchmarks before choosing a compression algorithm or level.\n//\n// NOTE: The benchmarks are with DataDog ZSTD that requires CGO. 
Hence, no longer valid.\n// no_compression-16              10\t 502848865 ns/op\t 165.46 MB/s\t-\n// zstd_compression/level_1-16     7\t 739037966 ns/op\t 112.58 MB/s\t2.93\n// zstd_compression/level_3-16     7\t 756950250 ns/op\t 109.91 MB/s\t2.72\n// zstd_compression/level_15-16    1\t11135686219 ns/op\t   7.47 MB/s\t4.38\n// Benchmark code can be found in table/builder_test.go file\nfunc (opt Options) WithZSTDCompressionLevel(cLevel int) Options {\n\topt.ZSTDCompressionLevel = cLevel\n\treturn opt\n}\n\n// WithBypassLockGuard returns a new Options value with BypassLockGuard\n// set to the given value.\n//\n// When BypassLockGuard option is set, badger will not acquire a lock on the\n// directory. This could lead to data corruption if multiple badger instances\n// write to the same data directory. Use this option with caution.\n//\n// The default value of BypassLockGuard is false.\nfunc (opt Options) WithBypassLockGuard(b bool) Options {\n\topt.BypassLockGuard = b\n\treturn opt\n}\n\n// WithIndexCacheSize returns a new Options value with IndexCacheSize set to\n// the given value.\n//\n// This value specifies how much memory should be used by table indices. These\n// indices include the block offsets and the bloomfilters. Badger uses bloom\n// filters to speed up lookups. Each table has its own bloom\n// filter and each bloom filter is approximately of 5 MB.\n//\n// Zero value for IndexCacheSize means all the indices will be kept in\n// memory and the cache is disabled.\n//\n// The default value of IndexCacheSize is 0 which means all indices are kept in\n// memory.\nfunc (opt Options) WithIndexCacheSize(size int64) Options {\n\topt.IndexCacheSize = size\n\treturn opt\n}\n\n// WithDetectConflicts returns a new Options value with DetectConflicts set to the given value.\n//\n// Detect conflicts options determines if the transactions would be checked for\n// conflicts before committing them. 
When this option is set to false\n// (detectConflicts=false) badger can process transactions at a higher rate.\n// Setting this option to false might be useful when the user application\n// deals with conflict detection and resolution.\n//\n// The default value of DetectConflicts is true.\nfunc (opt Options) WithDetectConflicts(b bool) Options {\n\topt.DetectConflicts = b\n\treturn opt\n}\n\n// WithNamespaceOffset returns a new Options value with NamespaceOffset set to the given value. DB\n// will expect the namespace in each key at the 8 bytes starting from NamespaceOffset. A negative\n// value means that namespace is not stored in the key.\n//\n// The default value for NamespaceOffset is -1.\nfunc (opt Options) WithNamespaceOffset(offset int) Options {\n\topt.NamespaceOffset = offset\n\treturn opt\n}\n\n// WithExternalMagic returns a new Options value with ExternalMagicVersion set to the given value.\n// The DB will fail to start if either the internal or the external magic number fails validation.\nfunc (opt Options) WithExternalMagic(magic uint16) Options {\n\topt.ExternalMagicVersion = magic\n\treturn opt\n}\n\nfunc (opt Options) getFileFlags() int {\n\tvar flags int\n\t// opt.SyncWrites would be using msync to sync. All writes go through mmap.\n\tif opt.ReadOnly {\n\t\tflags |= os.O_RDONLY\n\t} else {\n\t\tflags |= os.O_RDWR\n\t}\n\treturn flags\n}\n"
  },
  {
    "path": "options_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"reflect\"\n\t\"testing\"\n\n\t\"github.com/dgraph-io/badger/v4/options\"\n)\n\nfunc TestOptions(t *testing.T) {\n\tt.Run(\"default options\", func(t *testing.T) {\n\t\t// copy all the default options over to a big SuperFlag string\n\t\tdefaultSuperFlag := generateSuperFlag(DefaultOptions(\"\"))\n\t\t// fill an empty Options with values from the SuperFlag\n\t\tgenerated := Options{}.FromSuperFlag(defaultSuperFlag)\n\t\t// make sure they're equal\n\t\tif !optionsEqual(DefaultOptions(\"\"), generated) {\n\t\t\tt.Fatal(\"generated default SuperFlag != default Options\")\n\t\t}\n\t\t// check that values are overwritten properly\n\t\toverwritten := DefaultOptions(\"\").FromSuperFlag(\"numgoroutines=1234\")\n\t\tif overwritten.NumGoroutines != 1234 {\n\t\t\tt.Fatal(\"Option value not overwritten by SuperFlag value\")\n\t\t}\n\t})\n\n\tt.Run(\"special flags\", func(t *testing.T) {\n\t\to1 := DefaultOptions(\"\")\n\t\to1.NamespaceOffset = 10\n\t\to1.Compression = options.ZSTD\n\t\to1.ZSTDCompressionLevel = 2\n\t\to1.NumGoroutines = 20\n\n\t\to2 := DefaultOptions(\"\")\n\t\to2.NamespaceOffset = 10\n\t\to2 = o2.FromSuperFlag(\"compression=zstd:2; numgoroutines=20;\")\n\n\t\t// make sure they're equal\n\t\tif !optionsEqual(o1, o2) {\n\t\t\tt.Fatal(\"generated superFlag != expected options\")\n\t\t}\n\t})\n}\n\n// optionsEqual just compares the values of two Options structs\nfunc optionsEqual(o1, o2 Options) bool {\n\to1v := reflect.ValueOf(&o1).Elem()\n\to2v := reflect.ValueOf(&o2).Elem()\n\tfor i := 0; i < o1v.NumField(); i++ {\n\t\tif o1v.Field(i).CanInterface() {\n\t\t\tkind := o1v.Field(i).Kind()\n\t\t\t// compare values\n\t\t\tswitch kind {\n\t\t\tcase reflect.Bool:\n\t\t\t\tif o1v.Field(i).Bool() != o2v.Field(i).Bool() {\n\t\t\t\t\treturn false\n\t\t\t\t}\n\t\t\tcase reflect.Int, reflect.Int64:\n\t\t\t\tif 
o1v.Field(i).Int() != o2v.Field(i).Int() {\n\t\t\t\t\treturn false\n\t\t\t\t}\n\t\t\tcase reflect.Uint32, reflect.Uint64:\n\t\t\t\tif o1v.Field(i).Uint() != o2v.Field(i).Uint() {\n\t\t\t\t\treturn false\n\t\t\t\t}\n\t\t\tcase reflect.Float64:\n\t\t\t\tif o1v.Field(i).Float() != o2v.Field(i).Float() {\n\t\t\t\t\treturn false\n\t\t\t\t}\n\t\t\tcase reflect.String:\n\t\t\t\tif o1v.Field(i).String() != o2v.Field(i).String() {\n\t\t\t\t\treturn false\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\treturn true\n}\n"
  },
  {
    "path": "pb/badgerpb4.pb.go",
    "content": "//\n// SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n// SPDX-License-Identifier: Apache-2.0\n\n// Use protos/gen.sh to generate .pb.go files.\n\n// Code generated by protoc-gen-go. DO NOT EDIT.\n// versions:\n// \tprotoc-gen-go v1.31.0\n// \tprotoc        v3.21.12\n// source: badgerpb4.proto\n\npackage pb\n\nimport (\n\tprotoreflect \"google.golang.org/protobuf/reflect/protoreflect\"\n\tprotoimpl \"google.golang.org/protobuf/runtime/protoimpl\"\n\treflect \"reflect\"\n\tsync \"sync\"\n)\n\nconst (\n\t// Verify that this generated code is sufficiently up-to-date.\n\t_ = protoimpl.EnforceVersion(20 - protoimpl.MinVersion)\n\t// Verify that runtime/protoimpl is sufficiently up-to-date.\n\t_ = protoimpl.EnforceVersion(protoimpl.MaxVersion - 20)\n)\n\ntype EncryptionAlgo int32\n\nconst (\n\tEncryptionAlgo_aes EncryptionAlgo = 0\n)\n\n// Enum value maps for EncryptionAlgo.\nvar (\n\tEncryptionAlgo_name = map[int32]string{\n\t\t0: \"aes\",\n\t}\n\tEncryptionAlgo_value = map[string]int32{\n\t\t\"aes\": 0,\n\t}\n)\n\nfunc (x EncryptionAlgo) Enum() *EncryptionAlgo {\n\tp := new(EncryptionAlgo)\n\t*p = x\n\treturn p\n}\n\nfunc (x EncryptionAlgo) String() string {\n\treturn protoimpl.X.EnumStringOf(x.Descriptor(), protoreflect.EnumNumber(x))\n}\n\nfunc (EncryptionAlgo) Descriptor() protoreflect.EnumDescriptor {\n\treturn file_badgerpb4_proto_enumTypes[0].Descriptor()\n}\n\nfunc (EncryptionAlgo) Type() protoreflect.EnumType {\n\treturn &file_badgerpb4_proto_enumTypes[0]\n}\n\nfunc (x EncryptionAlgo) Number() protoreflect.EnumNumber {\n\treturn protoreflect.EnumNumber(x)\n}\n\n// Deprecated: Use EncryptionAlgo.Descriptor instead.\nfunc (EncryptionAlgo) EnumDescriptor() ([]byte, []int) {\n\treturn file_badgerpb4_proto_rawDescGZIP(), []int{0}\n}\n\ntype ManifestChange_Operation int32\n\nconst (\n\tManifestChange_CREATE ManifestChange_Operation = 0\n\tManifestChange_DELETE ManifestChange_Operation = 1\n)\n\n// Enum value maps for 
ManifestChange_Operation.\nvar (\n\tManifestChange_Operation_name = map[int32]string{\n\t\t0: \"CREATE\",\n\t\t1: \"DELETE\",\n\t}\n\tManifestChange_Operation_value = map[string]int32{\n\t\t\"CREATE\": 0,\n\t\t\"DELETE\": 1,\n\t}\n)\n\nfunc (x ManifestChange_Operation) Enum() *ManifestChange_Operation {\n\tp := new(ManifestChange_Operation)\n\t*p = x\n\treturn p\n}\n\nfunc (x ManifestChange_Operation) String() string {\n\treturn protoimpl.X.EnumStringOf(x.Descriptor(), protoreflect.EnumNumber(x))\n}\n\nfunc (ManifestChange_Operation) Descriptor() protoreflect.EnumDescriptor {\n\treturn file_badgerpb4_proto_enumTypes[1].Descriptor()\n}\n\nfunc (ManifestChange_Operation) Type() protoreflect.EnumType {\n\treturn &file_badgerpb4_proto_enumTypes[1]\n}\n\nfunc (x ManifestChange_Operation) Number() protoreflect.EnumNumber {\n\treturn protoreflect.EnumNumber(x)\n}\n\n// Deprecated: Use ManifestChange_Operation.Descriptor instead.\nfunc (ManifestChange_Operation) EnumDescriptor() ([]byte, []int) {\n\treturn file_badgerpb4_proto_rawDescGZIP(), []int{3, 0}\n}\n\ntype Checksum_Algorithm int32\n\nconst (\n\tChecksum_CRC32C   Checksum_Algorithm = 0\n\tChecksum_XXHash64 Checksum_Algorithm = 1\n)\n\n// Enum value maps for Checksum_Algorithm.\nvar (\n\tChecksum_Algorithm_name = map[int32]string{\n\t\t0: \"CRC32C\",\n\t\t1: \"XXHash64\",\n\t}\n\tChecksum_Algorithm_value = map[string]int32{\n\t\t\"CRC32C\":   0,\n\t\t\"XXHash64\": 1,\n\t}\n)\n\nfunc (x Checksum_Algorithm) Enum() *Checksum_Algorithm {\n\tp := new(Checksum_Algorithm)\n\t*p = x\n\treturn p\n}\n\nfunc (x Checksum_Algorithm) String() string {\n\treturn protoimpl.X.EnumStringOf(x.Descriptor(), protoreflect.EnumNumber(x))\n}\n\nfunc (Checksum_Algorithm) Descriptor() protoreflect.EnumDescriptor {\n\treturn file_badgerpb4_proto_enumTypes[2].Descriptor()\n}\n\nfunc (Checksum_Algorithm) Type() protoreflect.EnumType {\n\treturn &file_badgerpb4_proto_enumTypes[2]\n}\n\nfunc (x Checksum_Algorithm) Number() protoreflect.EnumNumber 
{\n\treturn protoreflect.EnumNumber(x)\n}\n\n// Deprecated: Use Checksum_Algorithm.Descriptor instead.\nfunc (Checksum_Algorithm) EnumDescriptor() ([]byte, []int) {\n\treturn file_badgerpb4_proto_rawDescGZIP(), []int{4, 0}\n}\n\ntype KV struct {\n\tstate         protoimpl.MessageState\n\tsizeCache     protoimpl.SizeCache\n\tunknownFields protoimpl.UnknownFields\n\n\tKey       []byte `protobuf:\"bytes,1,opt,name=key,proto3\" json:\"key,omitempty\"`\n\tValue     []byte `protobuf:\"bytes,2,opt,name=value,proto3\" json:\"value,omitempty\"`\n\tUserMeta  []byte `protobuf:\"bytes,3,opt,name=user_meta,json=userMeta,proto3\" json:\"user_meta,omitempty\"`\n\tVersion   uint64 `protobuf:\"varint,4,opt,name=version,proto3\" json:\"version,omitempty\"`\n\tExpiresAt uint64 `protobuf:\"varint,5,opt,name=expires_at,json=expiresAt,proto3\" json:\"expires_at,omitempty\"`\n\tMeta      []byte `protobuf:\"bytes,6,opt,name=meta,proto3\" json:\"meta,omitempty\"`\n\t// Stream id is used to identify which stream the KV came from.\n\tStreamId uint32 `protobuf:\"varint,10,opt,name=stream_id,json=streamId,proto3\" json:\"stream_id,omitempty\"`\n\t// Stream done is used to indicate end of stream.\n\tStreamDone bool `protobuf:\"varint,11,opt,name=stream_done,json=streamDone,proto3\" json:\"stream_done,omitempty\"`\n}\n\nfunc (x *KV) Reset() {\n\t*x = KV{}\n\tif protoimpl.UnsafeEnabled {\n\t\tmi := &file_badgerpb4_proto_msgTypes[0]\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tms.StoreMessageInfo(mi)\n\t}\n}\n\nfunc (x *KV) String() string {\n\treturn protoimpl.X.MessageStringOf(x)\n}\n\nfunc (*KV) ProtoMessage() {}\n\nfunc (x *KV) ProtoReflect() protoreflect.Message {\n\tmi := &file_badgerpb4_proto_msgTypes[0]\n\tif protoimpl.UnsafeEnabled && x != nil {\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tif ms.LoadMessageInfo() == nil {\n\t\t\tms.StoreMessageInfo(mi)\n\t\t}\n\t\treturn ms\n\t}\n\treturn mi.MessageOf(x)\n}\n\n// Deprecated: Use 
KV.ProtoReflect.Descriptor instead.\nfunc (*KV) Descriptor() ([]byte, []int) {\n\treturn file_badgerpb4_proto_rawDescGZIP(), []int{0}\n}\n\nfunc (x *KV) GetKey() []byte {\n\tif x != nil {\n\t\treturn x.Key\n\t}\n\treturn nil\n}\n\nfunc (x *KV) GetValue() []byte {\n\tif x != nil {\n\t\treturn x.Value\n\t}\n\treturn nil\n}\n\nfunc (x *KV) GetUserMeta() []byte {\n\tif x != nil {\n\t\treturn x.UserMeta\n\t}\n\treturn nil\n}\n\nfunc (x *KV) GetVersion() uint64 {\n\tif x != nil {\n\t\treturn x.Version\n\t}\n\treturn 0\n}\n\nfunc (x *KV) GetExpiresAt() uint64 {\n\tif x != nil {\n\t\treturn x.ExpiresAt\n\t}\n\treturn 0\n}\n\nfunc (x *KV) GetMeta() []byte {\n\tif x != nil {\n\t\treturn x.Meta\n\t}\n\treturn nil\n}\n\nfunc (x *KV) GetStreamId() uint32 {\n\tif x != nil {\n\t\treturn x.StreamId\n\t}\n\treturn 0\n}\n\nfunc (x *KV) GetStreamDone() bool {\n\tif x != nil {\n\t\treturn x.StreamDone\n\t}\n\treturn false\n}\n\ntype KVList struct {\n\tstate         protoimpl.MessageState\n\tsizeCache     protoimpl.SizeCache\n\tunknownFields protoimpl.UnknownFields\n\n\tKv []*KV `protobuf:\"bytes,1,rep,name=kv,proto3\" json:\"kv,omitempty\"`\n\t// alloc_ref used internally for memory management.\n\tAllocRef uint64 `protobuf:\"varint,10,opt,name=alloc_ref,json=allocRef,proto3\" json:\"alloc_ref,omitempty\"`\n}\n\nfunc (x *KVList) Reset() {\n\t*x = KVList{}\n\tif protoimpl.UnsafeEnabled {\n\t\tmi := &file_badgerpb4_proto_msgTypes[1]\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tms.StoreMessageInfo(mi)\n\t}\n}\n\nfunc (x *KVList) String() string {\n\treturn protoimpl.X.MessageStringOf(x)\n}\n\nfunc (*KVList) ProtoMessage() {}\n\nfunc (x *KVList) ProtoReflect() protoreflect.Message {\n\tmi := &file_badgerpb4_proto_msgTypes[1]\n\tif protoimpl.UnsafeEnabled && x != nil {\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tif ms.LoadMessageInfo() == nil {\n\t\t\tms.StoreMessageInfo(mi)\n\t\t}\n\t\treturn ms\n\t}\n\treturn mi.MessageOf(x)\n}\n\n// Deprecated: 
Use KVList.ProtoReflect.Descriptor instead.\nfunc (*KVList) Descriptor() ([]byte, []int) {\n\treturn file_badgerpb4_proto_rawDescGZIP(), []int{1}\n}\n\nfunc (x *KVList) GetKv() []*KV {\n\tif x != nil {\n\t\treturn x.Kv\n\t}\n\treturn nil\n}\n\nfunc (x *KVList) GetAllocRef() uint64 {\n\tif x != nil {\n\t\treturn x.AllocRef\n\t}\n\treturn 0\n}\n\ntype ManifestChangeSet struct {\n\tstate         protoimpl.MessageState\n\tsizeCache     protoimpl.SizeCache\n\tunknownFields protoimpl.UnknownFields\n\n\t// A set of changes that are applied atomically.\n\tChanges []*ManifestChange `protobuf:\"bytes,1,rep,name=changes,proto3\" json:\"changes,omitempty\"`\n}\n\nfunc (x *ManifestChangeSet) Reset() {\n\t*x = ManifestChangeSet{}\n\tif protoimpl.UnsafeEnabled {\n\t\tmi := &file_badgerpb4_proto_msgTypes[2]\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tms.StoreMessageInfo(mi)\n\t}\n}\n\nfunc (x *ManifestChangeSet) String() string {\n\treturn protoimpl.X.MessageStringOf(x)\n}\n\nfunc (*ManifestChangeSet) ProtoMessage() {}\n\nfunc (x *ManifestChangeSet) ProtoReflect() protoreflect.Message {\n\tmi := &file_badgerpb4_proto_msgTypes[2]\n\tif protoimpl.UnsafeEnabled && x != nil {\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tif ms.LoadMessageInfo() == nil {\n\t\t\tms.StoreMessageInfo(mi)\n\t\t}\n\t\treturn ms\n\t}\n\treturn mi.MessageOf(x)\n}\n\n// Deprecated: Use ManifestChangeSet.ProtoReflect.Descriptor instead.\nfunc (*ManifestChangeSet) Descriptor() ([]byte, []int) {\n\treturn file_badgerpb4_proto_rawDescGZIP(), []int{2}\n}\n\nfunc (x *ManifestChangeSet) GetChanges() []*ManifestChange {\n\tif x != nil {\n\t\treturn x.Changes\n\t}\n\treturn nil\n}\n\ntype ManifestChange struct {\n\tstate         protoimpl.MessageState\n\tsizeCache     protoimpl.SizeCache\n\tunknownFields protoimpl.UnknownFields\n\n\tId             uint64                   `protobuf:\"varint,1,opt,name=Id,proto3\" json:\"Id,omitempty\"` // Table ID.\n\tOp             
ManifestChange_Operation `protobuf:\"varint,2,opt,name=Op,proto3,enum=badgerpb4.ManifestChange_Operation\" json:\"Op,omitempty\"`\n\tLevel          uint32                   `protobuf:\"varint,3,opt,name=Level,proto3\" json:\"Level,omitempty\"` // Only used for CREATE.\n\tKeyId          uint64                   `protobuf:\"varint,4,opt,name=key_id,json=keyId,proto3\" json:\"key_id,omitempty\"`\n\tEncryptionAlgo EncryptionAlgo           `protobuf:\"varint,5,opt,name=encryption_algo,json=encryptionAlgo,proto3,enum=badgerpb4.EncryptionAlgo\" json:\"encryption_algo,omitempty\"`\n\tCompression    uint32                   `protobuf:\"varint,6,opt,name=compression,proto3\" json:\"compression,omitempty\"` // Only used for CREATE Op.\n}\n\nfunc (x *ManifestChange) Reset() {\n\t*x = ManifestChange{}\n\tif protoimpl.UnsafeEnabled {\n\t\tmi := &file_badgerpb4_proto_msgTypes[3]\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tms.StoreMessageInfo(mi)\n\t}\n}\n\nfunc (x *ManifestChange) String() string {\n\treturn protoimpl.X.MessageStringOf(x)\n}\n\nfunc (*ManifestChange) ProtoMessage() {}\n\nfunc (x *ManifestChange) ProtoReflect() protoreflect.Message {\n\tmi := &file_badgerpb4_proto_msgTypes[3]\n\tif protoimpl.UnsafeEnabled && x != nil {\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tif ms.LoadMessageInfo() == nil {\n\t\t\tms.StoreMessageInfo(mi)\n\t\t}\n\t\treturn ms\n\t}\n\treturn mi.MessageOf(x)\n}\n\n// Deprecated: Use ManifestChange.ProtoReflect.Descriptor instead.\nfunc (*ManifestChange) Descriptor() ([]byte, []int) {\n\treturn file_badgerpb4_proto_rawDescGZIP(), []int{3}\n}\n\nfunc (x *ManifestChange) GetId() uint64 {\n\tif x != nil {\n\t\treturn x.Id\n\t}\n\treturn 0\n}\n\nfunc (x *ManifestChange) GetOp() ManifestChange_Operation {\n\tif x != nil {\n\t\treturn x.Op\n\t}\n\treturn ManifestChange_CREATE\n}\n\nfunc (x *ManifestChange) GetLevel() uint32 {\n\tif x != nil {\n\t\treturn x.Level\n\t}\n\treturn 0\n}\n\nfunc (x *ManifestChange) 
GetKeyId() uint64 {\n\tif x != nil {\n\t\treturn x.KeyId\n\t}\n\treturn 0\n}\n\nfunc (x *ManifestChange) GetEncryptionAlgo() EncryptionAlgo {\n\tif x != nil {\n\t\treturn x.EncryptionAlgo\n\t}\n\treturn EncryptionAlgo_aes\n}\n\nfunc (x *ManifestChange) GetCompression() uint32 {\n\tif x != nil {\n\t\treturn x.Compression\n\t}\n\treturn 0\n}\n\ntype Checksum struct {\n\tstate         protoimpl.MessageState\n\tsizeCache     protoimpl.SizeCache\n\tunknownFields protoimpl.UnknownFields\n\n\tAlgo Checksum_Algorithm `protobuf:\"varint,1,opt,name=algo,proto3,enum=badgerpb4.Checksum_Algorithm\" json:\"algo,omitempty\"` // For storing type of Checksum algorithm used\n\tSum  uint64             `protobuf:\"varint,2,opt,name=sum,proto3\" json:\"sum,omitempty\"`\n}\n\nfunc (x *Checksum) Reset() {\n\t*x = Checksum{}\n\tif protoimpl.UnsafeEnabled {\n\t\tmi := &file_badgerpb4_proto_msgTypes[4]\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tms.StoreMessageInfo(mi)\n\t}\n}\n\nfunc (x *Checksum) String() string {\n\treturn protoimpl.X.MessageStringOf(x)\n}\n\nfunc (*Checksum) ProtoMessage() {}\n\nfunc (x *Checksum) ProtoReflect() protoreflect.Message {\n\tmi := &file_badgerpb4_proto_msgTypes[4]\n\tif protoimpl.UnsafeEnabled && x != nil {\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tif ms.LoadMessageInfo() == nil {\n\t\t\tms.StoreMessageInfo(mi)\n\t\t}\n\t\treturn ms\n\t}\n\treturn mi.MessageOf(x)\n}\n\n// Deprecated: Use Checksum.ProtoReflect.Descriptor instead.\nfunc (*Checksum) Descriptor() ([]byte, []int) {\n\treturn file_badgerpb4_proto_rawDescGZIP(), []int{4}\n}\n\nfunc (x *Checksum) GetAlgo() Checksum_Algorithm {\n\tif x != nil {\n\t\treturn x.Algo\n\t}\n\treturn Checksum_CRC32C\n}\n\nfunc (x *Checksum) GetSum() uint64 {\n\tif x != nil {\n\t\treturn x.Sum\n\t}\n\treturn 0\n}\n\ntype DataKey struct {\n\tstate         protoimpl.MessageState\n\tsizeCache     protoimpl.SizeCache\n\tunknownFields protoimpl.UnknownFields\n\n\tKeyId     uint64 
`protobuf:\"varint,1,opt,name=key_id,json=keyId,proto3\" json:\"key_id,omitempty\"`\n\tData      []byte `protobuf:\"bytes,2,opt,name=data,proto3\" json:\"data,omitempty\"`\n\tIv        []byte `protobuf:\"bytes,3,opt,name=iv,proto3\" json:\"iv,omitempty\"`\n\tCreatedAt int64  `protobuf:\"varint,4,opt,name=created_at,json=createdAt,proto3\" json:\"created_at,omitempty\"`\n}\n\nfunc (x *DataKey) Reset() {\n\t*x = DataKey{}\n\tif protoimpl.UnsafeEnabled {\n\t\tmi := &file_badgerpb4_proto_msgTypes[5]\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tms.StoreMessageInfo(mi)\n\t}\n}\n\nfunc (x *DataKey) String() string {\n\treturn protoimpl.X.MessageStringOf(x)\n}\n\nfunc (*DataKey) ProtoMessage() {}\n\nfunc (x *DataKey) ProtoReflect() protoreflect.Message {\n\tmi := &file_badgerpb4_proto_msgTypes[5]\n\tif protoimpl.UnsafeEnabled && x != nil {\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tif ms.LoadMessageInfo() == nil {\n\t\t\tms.StoreMessageInfo(mi)\n\t\t}\n\t\treturn ms\n\t}\n\treturn mi.MessageOf(x)\n}\n\n// Deprecated: Use DataKey.ProtoReflect.Descriptor instead.\nfunc (*DataKey) Descriptor() ([]byte, []int) {\n\treturn file_badgerpb4_proto_rawDescGZIP(), []int{5}\n}\n\nfunc (x *DataKey) GetKeyId() uint64 {\n\tif x != nil {\n\t\treturn x.KeyId\n\t}\n\treturn 0\n}\n\nfunc (x *DataKey) GetData() []byte {\n\tif x != nil {\n\t\treturn x.Data\n\t}\n\treturn nil\n}\n\nfunc (x *DataKey) GetIv() []byte {\n\tif x != nil {\n\t\treturn x.Iv\n\t}\n\treturn nil\n}\n\nfunc (x *DataKey) GetCreatedAt() int64 {\n\tif x != nil {\n\t\treturn x.CreatedAt\n\t}\n\treturn 0\n}\n\ntype Match struct {\n\tstate         protoimpl.MessageState\n\tsizeCache     protoimpl.SizeCache\n\tunknownFields protoimpl.UnknownFields\n\n\tPrefix      []byte `protobuf:\"bytes,1,opt,name=prefix,proto3\" json:\"prefix,omitempty\"`\n\tIgnoreBytes string `protobuf:\"bytes,2,opt,name=ignore_bytes,json=ignoreBytes,proto3\" json:\"ignore_bytes,omitempty\"` // Comma separated with 
dash to represent ranges \"1, 2-3, 4-7, 9\"\n}\n\nfunc (x *Match) Reset() {\n\t*x = Match{}\n\tif protoimpl.UnsafeEnabled {\n\t\tmi := &file_badgerpb4_proto_msgTypes[6]\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tms.StoreMessageInfo(mi)\n\t}\n}\n\nfunc (x *Match) String() string {\n\treturn protoimpl.X.MessageStringOf(x)\n}\n\nfunc (*Match) ProtoMessage() {}\n\nfunc (x *Match) ProtoReflect() protoreflect.Message {\n\tmi := &file_badgerpb4_proto_msgTypes[6]\n\tif protoimpl.UnsafeEnabled && x != nil {\n\t\tms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x))\n\t\tif ms.LoadMessageInfo() == nil {\n\t\t\tms.StoreMessageInfo(mi)\n\t\t}\n\t\treturn ms\n\t}\n\treturn mi.MessageOf(x)\n}\n\n// Deprecated: Use Match.ProtoReflect.Descriptor instead.\nfunc (*Match) Descriptor() ([]byte, []int) {\n\treturn file_badgerpb4_proto_rawDescGZIP(), []int{6}\n}\n\nfunc (x *Match) GetPrefix() []byte {\n\tif x != nil {\n\t\treturn x.Prefix\n\t}\n\treturn nil\n}\n\nfunc (x *Match) GetIgnoreBytes() string {\n\tif x != nil {\n\t\treturn x.IgnoreBytes\n\t}\n\treturn \"\"\n}\n\nvar File_badgerpb4_proto protoreflect.FileDescriptor\n\nvar file_badgerpb4_proto_rawDesc = []byte{\n\t0x0a, 0x0f, 0x62, 0x61, 0x64, 0x67, 0x65, 0x72, 0x70, 0x62, 0x34, 0x2e, 0x70, 0x72, 0x6f, 0x74,\n\t0x6f, 0x12, 0x09, 0x62, 0x61, 0x64, 0x67, 0x65, 0x72, 0x70, 0x62, 0x34, 0x22, 0xd4, 0x01, 0x0a,\n\t0x02, 0x4b, 0x56, 0x12, 0x10, 0x0a, 0x03, 0x6b, 0x65, 0x79, 0x18, 0x01, 0x20, 0x01, 0x28, 0x0c,\n\t0x52, 0x03, 0x6b, 0x65, 0x79, 0x12, 0x14, 0x0a, 0x05, 0x76, 0x61, 0x6c, 0x75, 0x65, 0x18, 0x02,\n\t0x20, 0x01, 0x28, 0x0c, 0x52, 0x05, 0x76, 0x61, 0x6c, 0x75, 0x65, 0x12, 0x1b, 0x0a, 0x09, 0x75,\n\t0x73, 0x65, 0x72, 0x5f, 0x6d, 0x65, 0x74, 0x61, 0x18, 0x03, 0x20, 0x01, 0x28, 0x0c, 0x52, 0x08,\n\t0x75, 0x73, 0x65, 0x72, 0x4d, 0x65, 0x74, 0x61, 0x12, 0x18, 0x0a, 0x07, 0x76, 0x65, 0x72, 0x73,\n\t0x69, 0x6f, 0x6e, 0x18, 0x04, 0x20, 0x01, 0x28, 0x04, 0x52, 0x07, 0x76, 0x65, 0x72, 0x73, 0x69,\n\t0x6f, 0x6e, 
0x12, 0x1d, 0x0a, 0x0a, 0x65, 0x78, 0x70, 0x69, 0x72, 0x65, 0x73, 0x5f, 0x61, 0x74,\n\t0x18, 0x05, 0x20, 0x01, 0x28, 0x04, 0x52, 0x09, 0x65, 0x78, 0x70, 0x69, 0x72, 0x65, 0x73, 0x41,\n\t0x74, 0x12, 0x12, 0x0a, 0x04, 0x6d, 0x65, 0x74, 0x61, 0x18, 0x06, 0x20, 0x01, 0x28, 0x0c, 0x52,\n\t0x04, 0x6d, 0x65, 0x74, 0x61, 0x12, 0x1b, 0x0a, 0x09, 0x73, 0x74, 0x72, 0x65, 0x61, 0x6d, 0x5f,\n\t0x69, 0x64, 0x18, 0x0a, 0x20, 0x01, 0x28, 0x0d, 0x52, 0x08, 0x73, 0x74, 0x72, 0x65, 0x61, 0x6d,\n\t0x49, 0x64, 0x12, 0x1f, 0x0a, 0x0b, 0x73, 0x74, 0x72, 0x65, 0x61, 0x6d, 0x5f, 0x64, 0x6f, 0x6e,\n\t0x65, 0x18, 0x0b, 0x20, 0x01, 0x28, 0x08, 0x52, 0x0a, 0x73, 0x74, 0x72, 0x65, 0x61, 0x6d, 0x44,\n\t0x6f, 0x6e, 0x65, 0x22, 0x44, 0x0a, 0x06, 0x4b, 0x56, 0x4c, 0x69, 0x73, 0x74, 0x12, 0x1d, 0x0a,\n\t0x02, 0x6b, 0x76, 0x18, 0x01, 0x20, 0x03, 0x28, 0x0b, 0x32, 0x0d, 0x2e, 0x62, 0x61, 0x64, 0x67,\n\t0x65, 0x72, 0x70, 0x62, 0x34, 0x2e, 0x4b, 0x56, 0x52, 0x02, 0x6b, 0x76, 0x12, 0x1b, 0x0a, 0x09,\n\t0x61, 0x6c, 0x6c, 0x6f, 0x63, 0x5f, 0x72, 0x65, 0x66, 0x18, 0x0a, 0x20, 0x01, 0x28, 0x04, 0x52,\n\t0x08, 0x61, 0x6c, 0x6c, 0x6f, 0x63, 0x52, 0x65, 0x66, 0x22, 0x48, 0x0a, 0x11, 0x4d, 0x61, 0x6e,\n\t0x69, 0x66, 0x65, 0x73, 0x74, 0x43, 0x68, 0x61, 0x6e, 0x67, 0x65, 0x53, 0x65, 0x74, 0x12, 0x33,\n\t0x0a, 0x07, 0x63, 0x68, 0x61, 0x6e, 0x67, 0x65, 0x73, 0x18, 0x01, 0x20, 0x03, 0x28, 0x0b, 0x32,\n\t0x19, 0x2e, 0x62, 0x61, 0x64, 0x67, 0x65, 0x72, 0x70, 0x62, 0x34, 0x2e, 0x4d, 0x61, 0x6e, 0x69,\n\t0x66, 0x65, 0x73, 0x74, 0x43, 0x68, 0x61, 0x6e, 0x67, 0x65, 0x52, 0x07, 0x63, 0x68, 0x61, 0x6e,\n\t0x67, 0x65, 0x73, 0x22, 0x8d, 0x02, 0x0a, 0x0e, 0x4d, 0x61, 0x6e, 0x69, 0x66, 0x65, 0x73, 0x74,\n\t0x43, 0x68, 0x61, 0x6e, 0x67, 0x65, 0x12, 0x0e, 0x0a, 0x02, 0x49, 0x64, 0x18, 0x01, 0x20, 0x01,\n\t0x28, 0x04, 0x52, 0x02, 0x49, 0x64, 0x12, 0x33, 0x0a, 0x02, 0x4f, 0x70, 0x18, 0x02, 0x20, 0x01,\n\t0x28, 0x0e, 0x32, 0x23, 0x2e, 0x62, 0x61, 0x64, 0x67, 0x65, 0x72, 0x70, 0x62, 0x34, 0x2e, 0x4d,\n\t0x61, 0x6e, 0x69, 0x66, 0x65, 
0x73, 0x74, 0x43, 0x68, 0x61, 0x6e, 0x67, 0x65, 0x2e, 0x4f, 0x70,\n\t0x65, 0x72, 0x61, 0x74, 0x69, 0x6f, 0x6e, 0x52, 0x02, 0x4f, 0x70, 0x12, 0x14, 0x0a, 0x05, 0x4c,\n\t0x65, 0x76, 0x65, 0x6c, 0x18, 0x03, 0x20, 0x01, 0x28, 0x0d, 0x52, 0x05, 0x4c, 0x65, 0x76, 0x65,\n\t0x6c, 0x12, 0x15, 0x0a, 0x06, 0x6b, 0x65, 0x79, 0x5f, 0x69, 0x64, 0x18, 0x04, 0x20, 0x01, 0x28,\n\t0x04, 0x52, 0x05, 0x6b, 0x65, 0x79, 0x49, 0x64, 0x12, 0x42, 0x0a, 0x0f, 0x65, 0x6e, 0x63, 0x72,\n\t0x79, 0x70, 0x74, 0x69, 0x6f, 0x6e, 0x5f, 0x61, 0x6c, 0x67, 0x6f, 0x18, 0x05, 0x20, 0x01, 0x28,\n\t0x0e, 0x32, 0x19, 0x2e, 0x62, 0x61, 0x64, 0x67, 0x65, 0x72, 0x70, 0x62, 0x34, 0x2e, 0x45, 0x6e,\n\t0x63, 0x72, 0x79, 0x70, 0x74, 0x69, 0x6f, 0x6e, 0x41, 0x6c, 0x67, 0x6f, 0x52, 0x0e, 0x65, 0x6e,\n\t0x63, 0x72, 0x79, 0x70, 0x74, 0x69, 0x6f, 0x6e, 0x41, 0x6c, 0x67, 0x6f, 0x12, 0x20, 0x0a, 0x0b,\n\t0x63, 0x6f, 0x6d, 0x70, 0x72, 0x65, 0x73, 0x73, 0x69, 0x6f, 0x6e, 0x18, 0x06, 0x20, 0x01, 0x28,\n\t0x0d, 0x52, 0x0b, 0x63, 0x6f, 0x6d, 0x70, 0x72, 0x65, 0x73, 0x73, 0x69, 0x6f, 0x6e, 0x22, 0x23,\n\t0x0a, 0x09, 0x4f, 0x70, 0x65, 0x72, 0x61, 0x74, 0x69, 0x6f, 0x6e, 0x12, 0x0a, 0x0a, 0x06, 0x43,\n\t0x52, 0x45, 0x41, 0x54, 0x45, 0x10, 0x00, 0x12, 0x0a, 0x0a, 0x06, 0x44, 0x45, 0x4c, 0x45, 0x54,\n\t0x45, 0x10, 0x01, 0x22, 0x76, 0x0a, 0x08, 0x43, 0x68, 0x65, 0x63, 0x6b, 0x73, 0x75, 0x6d, 0x12,\n\t0x31, 0x0a, 0x04, 0x61, 0x6c, 0x67, 0x6f, 0x18, 0x01, 0x20, 0x01, 0x28, 0x0e, 0x32, 0x1d, 0x2e,\n\t0x62, 0x61, 0x64, 0x67, 0x65, 0x72, 0x70, 0x62, 0x34, 0x2e, 0x43, 0x68, 0x65, 0x63, 0x6b, 0x73,\n\t0x75, 0x6d, 0x2e, 0x41, 0x6c, 0x67, 0x6f, 0x72, 0x69, 0x74, 0x68, 0x6d, 0x52, 0x04, 0x61, 0x6c,\n\t0x67, 0x6f, 0x12, 0x10, 0x0a, 0x03, 0x73, 0x75, 0x6d, 0x18, 0x02, 0x20, 0x01, 0x28, 0x04, 0x52,\n\t0x03, 0x73, 0x75, 0x6d, 0x22, 0x25, 0x0a, 0x09, 0x41, 0x6c, 0x67, 0x6f, 0x72, 0x69, 0x74, 0x68,\n\t0x6d, 0x12, 0x0a, 0x0a, 0x06, 0x43, 0x52, 0x43, 0x33, 0x32, 0x43, 0x10, 0x00, 0x12, 0x0c, 0x0a,\n\t0x08, 0x58, 0x58, 0x48, 0x61, 0x73, 0x68, 0x36, 
0x34, 0x10, 0x01, 0x22, 0x63, 0x0a, 0x07, 0x44,\n\t0x61, 0x74, 0x61, 0x4b, 0x65, 0x79, 0x12, 0x15, 0x0a, 0x06, 0x6b, 0x65, 0x79, 0x5f, 0x69, 0x64,\n\t0x18, 0x01, 0x20, 0x01, 0x28, 0x04, 0x52, 0x05, 0x6b, 0x65, 0x79, 0x49, 0x64, 0x12, 0x12, 0x0a,\n\t0x04, 0x64, 0x61, 0x74, 0x61, 0x18, 0x02, 0x20, 0x01, 0x28, 0x0c, 0x52, 0x04, 0x64, 0x61, 0x74,\n\t0x61, 0x12, 0x0e, 0x0a, 0x02, 0x69, 0x76, 0x18, 0x03, 0x20, 0x01, 0x28, 0x0c, 0x52, 0x02, 0x69,\n\t0x76, 0x12, 0x1d, 0x0a, 0x0a, 0x63, 0x72, 0x65, 0x61, 0x74, 0x65, 0x64, 0x5f, 0x61, 0x74, 0x18,\n\t0x04, 0x20, 0x01, 0x28, 0x03, 0x52, 0x09, 0x63, 0x72, 0x65, 0x61, 0x74, 0x65, 0x64, 0x41, 0x74,\n\t0x22, 0x42, 0x0a, 0x05, 0x4d, 0x61, 0x74, 0x63, 0x68, 0x12, 0x16, 0x0a, 0x06, 0x70, 0x72, 0x65,\n\t0x66, 0x69, 0x78, 0x18, 0x01, 0x20, 0x01, 0x28, 0x0c, 0x52, 0x06, 0x70, 0x72, 0x65, 0x66, 0x69,\n\t0x78, 0x12, 0x21, 0x0a, 0x0c, 0x69, 0x67, 0x6e, 0x6f, 0x72, 0x65, 0x5f, 0x62, 0x79, 0x74, 0x65,\n\t0x73, 0x18, 0x02, 0x20, 0x01, 0x28, 0x09, 0x52, 0x0b, 0x69, 0x67, 0x6e, 0x6f, 0x72, 0x65, 0x42,\n\t0x79, 0x74, 0x65, 0x73, 0x2a, 0x19, 0x0a, 0x0e, 0x45, 0x6e, 0x63, 0x72, 0x79, 0x70, 0x74, 0x69,\n\t0x6f, 0x6e, 0x41, 0x6c, 0x67, 0x6f, 0x12, 0x07, 0x0a, 0x03, 0x61, 0x65, 0x73, 0x10, 0x00, 0x42,\n\t0x23, 0x5a, 0x21, 0x67, 0x69, 0x74, 0x68, 0x75, 0x62, 0x2e, 0x63, 0x6f, 0x6d, 0x2f, 0x64, 0x67,\n\t0x72, 0x61, 0x70, 0x68, 0x2d, 0x69, 0x6f, 0x2f, 0x62, 0x61, 0x64, 0x67, 0x65, 0x72, 0x2f, 0x76,\n\t0x34, 0x2f, 0x70, 0x62, 0x62, 0x06, 0x70, 0x72, 0x6f, 0x74, 0x6f, 0x33,\n}\n\nvar (\n\tfile_badgerpb4_proto_rawDescOnce sync.Once\n\tfile_badgerpb4_proto_rawDescData = file_badgerpb4_proto_rawDesc\n)\n\nfunc file_badgerpb4_proto_rawDescGZIP() []byte {\n\tfile_badgerpb4_proto_rawDescOnce.Do(func() {\n\t\tfile_badgerpb4_proto_rawDescData = protoimpl.X.CompressGZIP(file_badgerpb4_proto_rawDescData)\n\t})\n\treturn file_badgerpb4_proto_rawDescData\n}\n\nvar file_badgerpb4_proto_enumTypes = make([]protoimpl.EnumInfo, 3)\nvar file_badgerpb4_proto_msgTypes = 
make([]protoimpl.MessageInfo, 7)\nvar file_badgerpb4_proto_goTypes = []interface{}{\n\t(EncryptionAlgo)(0),           // 0: badgerpb4.EncryptionAlgo\n\t(ManifestChange_Operation)(0), // 1: badgerpb4.ManifestChange.Operation\n\t(Checksum_Algorithm)(0),       // 2: badgerpb4.Checksum.Algorithm\n\t(*KV)(nil),                    // 3: badgerpb4.KV\n\t(*KVList)(nil),                // 4: badgerpb4.KVList\n\t(*ManifestChangeSet)(nil),     // 5: badgerpb4.ManifestChangeSet\n\t(*ManifestChange)(nil),        // 6: badgerpb4.ManifestChange\n\t(*Checksum)(nil),              // 7: badgerpb4.Checksum\n\t(*DataKey)(nil),               // 8: badgerpb4.DataKey\n\t(*Match)(nil),                 // 9: badgerpb4.Match\n}\nvar file_badgerpb4_proto_depIdxs = []int32{\n\t3, // 0: badgerpb4.KVList.kv:type_name -> badgerpb4.KV\n\t6, // 1: badgerpb4.ManifestChangeSet.changes:type_name -> badgerpb4.ManifestChange\n\t1, // 2: badgerpb4.ManifestChange.Op:type_name -> badgerpb4.ManifestChange.Operation\n\t0, // 3: badgerpb4.ManifestChange.encryption_algo:type_name -> badgerpb4.EncryptionAlgo\n\t2, // 4: badgerpb4.Checksum.algo:type_name -> badgerpb4.Checksum.Algorithm\n\t5, // [5:5] is the sub-list for method output_type\n\t5, // [5:5] is the sub-list for method input_type\n\t5, // [5:5] is the sub-list for extension type_name\n\t5, // [5:5] is the sub-list for extension extendee\n\t0, // [0:5] is the sub-list for field type_name\n}\n\nfunc init() { file_badgerpb4_proto_init() }\nfunc file_badgerpb4_proto_init() {\n\tif File_badgerpb4_proto != nil {\n\t\treturn\n\t}\n\tif !protoimpl.UnsafeEnabled {\n\t\tfile_badgerpb4_proto_msgTypes[0].Exporter = func(v interface{}, i int) interface{} {\n\t\t\tswitch v := v.(*KV); i {\n\t\t\tcase 0:\n\t\t\t\treturn &v.state\n\t\t\tcase 1:\n\t\t\t\treturn &v.sizeCache\n\t\t\tcase 2:\n\t\t\t\treturn &v.unknownFields\n\t\t\tdefault:\n\t\t\t\treturn nil\n\t\t\t}\n\t\t}\n\t\tfile_badgerpb4_proto_msgTypes[1].Exporter = func(v interface{}, i int) interface{} 
{\n\t\t\tswitch v := v.(*KVList); i {\n\t\t\tcase 0:\n\t\t\t\treturn &v.state\n\t\t\tcase 1:\n\t\t\t\treturn &v.sizeCache\n\t\t\tcase 2:\n\t\t\t\treturn &v.unknownFields\n\t\t\tdefault:\n\t\t\t\treturn nil\n\t\t\t}\n\t\t}\n\t\tfile_badgerpb4_proto_msgTypes[2].Exporter = func(v interface{}, i int) interface{} {\n\t\t\tswitch v := v.(*ManifestChangeSet); i {\n\t\t\tcase 0:\n\t\t\t\treturn &v.state\n\t\t\tcase 1:\n\t\t\t\treturn &v.sizeCache\n\t\t\tcase 2:\n\t\t\t\treturn &v.unknownFields\n\t\t\tdefault:\n\t\t\t\treturn nil\n\t\t\t}\n\t\t}\n\t\tfile_badgerpb4_proto_msgTypes[3].Exporter = func(v interface{}, i int) interface{} {\n\t\t\tswitch v := v.(*ManifestChange); i {\n\t\t\tcase 0:\n\t\t\t\treturn &v.state\n\t\t\tcase 1:\n\t\t\t\treturn &v.sizeCache\n\t\t\tcase 2:\n\t\t\t\treturn &v.unknownFields\n\t\t\tdefault:\n\t\t\t\treturn nil\n\t\t\t}\n\t\t}\n\t\tfile_badgerpb4_proto_msgTypes[4].Exporter = func(v interface{}, i int) interface{} {\n\t\t\tswitch v := v.(*Checksum); i {\n\t\t\tcase 0:\n\t\t\t\treturn &v.state\n\t\t\tcase 1:\n\t\t\t\treturn &v.sizeCache\n\t\t\tcase 2:\n\t\t\t\treturn &v.unknownFields\n\t\t\tdefault:\n\t\t\t\treturn nil\n\t\t\t}\n\t\t}\n\t\tfile_badgerpb4_proto_msgTypes[5].Exporter = func(v interface{}, i int) interface{} {\n\t\t\tswitch v := v.(*DataKey); i {\n\t\t\tcase 0:\n\t\t\t\treturn &v.state\n\t\t\tcase 1:\n\t\t\t\treturn &v.sizeCache\n\t\t\tcase 2:\n\t\t\t\treturn &v.unknownFields\n\t\t\tdefault:\n\t\t\t\treturn nil\n\t\t\t}\n\t\t}\n\t\tfile_badgerpb4_proto_msgTypes[6].Exporter = func(v interface{}, i int) interface{} {\n\t\t\tswitch v := v.(*Match); i {\n\t\t\tcase 0:\n\t\t\t\treturn &v.state\n\t\t\tcase 1:\n\t\t\t\treturn &v.sizeCache\n\t\t\tcase 2:\n\t\t\t\treturn &v.unknownFields\n\t\t\tdefault:\n\t\t\t\treturn nil\n\t\t\t}\n\t\t}\n\t}\n\ttype x struct{}\n\tout := protoimpl.TypeBuilder{\n\t\tFile: protoimpl.DescBuilder{\n\t\t\tGoPackagePath: reflect.TypeOf(x{}).PkgPath(),\n\t\t\tRawDescriptor: 
file_badgerpb4_proto_rawDesc,\n\t\t\tNumEnums:      3,\n\t\t\tNumMessages:   7,\n\t\t\tNumExtensions: 0,\n\t\t\tNumServices:   0,\n\t\t},\n\t\tGoTypes:           file_badgerpb4_proto_goTypes,\n\t\tDependencyIndexes: file_badgerpb4_proto_depIdxs,\n\t\tEnumInfos:         file_badgerpb4_proto_enumTypes,\n\t\tMessageInfos:      file_badgerpb4_proto_msgTypes,\n\t}.Build()\n\tFile_badgerpb4_proto = out.File\n\tfile_badgerpb4_proto_rawDesc = nil\n\tfile_badgerpb4_proto_goTypes = nil\n\tfile_badgerpb4_proto_depIdxs = nil\n}\n"
  },
  {
    "path": "pb/badgerpb4.proto",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\n// Use protos/gen.sh to generate .pb.go files.\nsyntax = \"proto3\";\n\npackage badgerpb4;\n\noption go_package = \"github.com/dgraph-io/badger/v4/pb\";\n\nmessage KV {\n  bytes key = 1;\n  bytes value = 2;\n  bytes user_meta = 3;\n  uint64 version = 4;\n  uint64 expires_at = 5;\n  bytes meta = 6;\n\n  // Stream id is used to identify which stream the KV came from.\n  uint32 stream_id = 10;\n  // Stream done is used to indicate end of stream.\n  bool stream_done = 11;\n}\n\nmessage KVList {\n  repeated KV kv = 1;\n\n  // alloc_ref used internally for memory management.\n  uint64 alloc_ref = 10;\n}\n\nmessage ManifestChangeSet {\n  // A set of changes that are applied atomically.\n  repeated ManifestChange changes = 1;\n}\n\nenum EncryptionAlgo {\n  aes = 0;\n}\n\nmessage ManifestChange {\n  uint64 Id = 1;            // Table ID.\n  enum Operation {\n    CREATE = 0;\n    DELETE = 1;\n  }\n  Operation Op   = 2;\n  uint32 Level   = 3;       // Only used for CREATE.\n  uint64 key_id  = 4;\n  EncryptionAlgo encryption_algo = 5;\n  uint32 compression = 6;   // Only used for CREATE Op.\n}\n\nmessage Checksum {\n  enum Algorithm {\n    CRC32C = 0;\n    XXHash64 = 1;\n  }\n  Algorithm algo = 1; // For storing type of Checksum algorithm used\n  uint64 sum = 2;\n}\n\nmessage DataKey {\n  uint64 key_id      = 1;\n  bytes  data       = 2;\n  bytes  iv         = 3;\n  int64  created_at = 4;\n}\n\nmessage Match {\n    bytes prefix = 1;\n    string ignore_bytes = 2; // Comma separated with dash to represent ranges \"1, 2-3, 4-7, 9\"\n}\n"
  },
  {
    "path": "pb/gen.sh",
    "content": "#!/bin/bash\n\n# Run this script from its directory, so that badgerpb4.proto is where it's expected to\n# be.\n\ngo install google.golang.org/protobuf/cmd/protoc-gen-go@v1.31.0\nprotoc --go_out=. --go_opt=paths=source_relative badgerpb4.proto\n"
  },
  {
    "path": "pb/protos_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage pb\n\nimport (\n\t\"os/exec\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n)\n\nfunc Exec(argv ...string) error {\n\tcmd := exec.Command(argv[0], argv[1:]...)\n\n\tif err := cmd.Start(); err != nil {\n\t\treturn err\n\t}\n\treturn cmd.Wait()\n}\n\nfunc TestProtosRegenerate(t *testing.T) {\n\terr := Exec(\"./gen.sh\")\n\trequire.NoError(t, err, \"Got error while regenerating protos: %v\\n\", err)\n\n\tgeneratedProtos := \"badgerpb4.pb.go\"\n\terr = Exec(\"git\", \"diff\", \"--quiet\", \"--\", generatedProtos)\n\trequire.NoError(t, err, \"badgerpb4.pb.go changed after regenerating\")\n}\n"
  },
  {
    "path": "publisher.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"sync\"\n\t\"sync/atomic\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/trie\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\ntype subscriber struct {\n\tid        uint64\n\tmatches   []pb.Match\n\tsendCh    chan *pb.KVList\n\tsubCloser *z.Closer\n\t// this will be atomic pointer which will be used to\n\t// track whether the subscriber is active or not\n\tactive *atomic.Uint64\n}\n\ntype publisher struct {\n\tsync.Mutex\n\tpubCh       chan requests\n\tsubscribers map[uint64]subscriber\n\tnextID      uint64\n\tindexer     *trie.Trie\n}\n\nfunc newPublisher() *publisher {\n\treturn &publisher{\n\t\tpubCh:       make(chan requests, 1000),\n\t\tsubscribers: make(map[uint64]subscriber),\n\t\tnextID:      0,\n\t\tindexer:     trie.NewTrie(),\n\t}\n}\n\nfunc (p *publisher) listenForUpdates(c *z.Closer) {\n\tdefer func() {\n\t\tp.cleanSubscribers()\n\t\tc.Done()\n\t}()\n\tslurp := func(batch requests) {\n\t\tfor {\n\t\t\tselect {\n\t\t\tcase reqs := <-p.pubCh:\n\t\t\t\tbatch = append(batch, reqs...)\n\t\t\tdefault:\n\t\t\t\tp.publishUpdates(batch)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t}\n\tfor {\n\t\tselect {\n\t\tcase <-c.HasBeenClosed():\n\t\t\treturn\n\t\tcase reqs := <-p.pubCh:\n\t\t\tslurp(reqs)\n\t\t}\n\t}\n}\n\nfunc (p *publisher) publishUpdates(reqs requests) {\n\tp.Lock()\n\tdefer func() {\n\t\tp.Unlock()\n\t\t// Release all the request.\n\t\treqs.DecrRef()\n\t}()\n\tbatchedUpdates := make(map[uint64]*pb.KVList)\n\tfor _, req := range reqs {\n\t\tfor _, e := range req.Entries {\n\t\t\tids := p.indexer.Get(e.Key)\n\t\t\tif len(ids) == 0 {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tk := y.SafeCopy(nil, e.Key)\n\t\t\tkv := &pb.KV{\n\t\t\t\tKey:       y.ParseKey(k),\n\t\t\t\tValue:     y.SafeCopy(nil, e.Value),\n\t\t\t\tMeta:      
[]byte{e.UserMeta},\n\t\t\t\tExpiresAt: e.ExpiresAt,\n\t\t\t\tVersion:   y.ParseTs(k),\n\t\t\t}\n\t\t\tfor id := range ids {\n\t\t\t\tif _, ok := batchedUpdates[id]; !ok {\n\t\t\t\t\tbatchedUpdates[id] = &pb.KVList{}\n\t\t\t\t}\n\t\t\t\tbatchedUpdates[id].Kv = append(batchedUpdates[id].Kv, kv)\n\t\t\t}\n\t\t}\n\t}\n\n\tfor id, kvs := range batchedUpdates {\n\t\tif p.subscribers[id].active.Load() == 1 {\n\t\t\tp.subscribers[id].sendCh <- kvs\n\t\t}\n\t}\n}\n\n// newSubscriber registers a subscriber for the given matches and returns it.\nfunc (p *publisher) newSubscriber(c *z.Closer, matches []pb.Match) (subscriber, error) {\n\tp.Lock()\n\tdefer p.Unlock()\n\tch := make(chan *pb.KVList, 1000)\n\tid := p.nextID\n\t// Increment next ID.\n\tp.nextID++\n\ts := subscriber{\n\t\tid:        id,\n\t\tmatches:   matches,\n\t\tsendCh:    ch,\n\t\tsubCloser: c,\n\t\tactive:    new(atomic.Uint64),\n\t}\n\ts.active.Store(1)\n\n\tp.subscribers[id] = s\n\tfor _, m := range matches {\n\t\tif err := p.indexer.AddMatch(m, id); err != nil {\n\t\t\treturn subscriber{}, err\n\t\t}\n\t}\n\treturn s, nil\n}\n\n// cleanSubscribers stops all the subscribers. Ideally, it should be called while closing the DB.\nfunc (p *publisher) cleanSubscribers() {\n\tp.Lock()\n\tdefer p.Unlock()\n\tfor id, s := range p.subscribers {\n\t\tfor _, m := range s.matches {\n\t\t\t_ = p.indexer.DeleteMatch(m, id)\n\t\t}\n\t\tdelete(p.subscribers, id)\n\t\ts.subCloser.SignalAndWait()\n\t}\n}\n\n// deleteSubscriber removes the subscriber with the given id and its matches from the indexer.\nfunc (p *publisher) deleteSubscriber(id uint64) {\n\tp.Lock()\n\tdefer p.Unlock()\n\tif s, ok := p.subscribers[id]; ok {\n\t\tfor _, m := range s.matches {\n\t\t\t_ = p.indexer.DeleteMatch(m, id)\n\t\t}\n\t}\n\tdelete(p.subscribers, id)\n}\n\n// sendUpdates queues reqs for publishing if there is at least one subscriber.\n// The reference taken here is released by publishUpdates once the batch is published.\nfunc (p *publisher) sendUpdates(reqs requests) {\n\tif p.noOfSubscribers() != 0 {\n\t\treqs.IncrRef()\n\t\tp.pubCh <- reqs\n\t}\n}\n\n// noOfSubscribers returns the current number of subscribers.\nfunc (p *publisher) noOfSubscribers() int {\n\tp.Lock()\n\tdefer p.Unlock()\n\treturn len(p.subscribers)\n}\n"
  },
  {
    "path": "publisher_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"context\"\n\t\"errors\"\n\t\"fmt\"\n\t\"runtime\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n)\n\n// This test will result in deadlock for commits before this.\n// Exiting this test gracefully will be the proof that the\n// publisher is no longer stuck in deadlock.\nfunc TestPublisherDeadlock(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tvar subWg sync.WaitGroup\n\t\tsubWg.Add(1)\n\n\t\tvar firstUpdate sync.WaitGroup\n\t\tfirstUpdate.Add(1)\n\n\t\tvar allUpdatesDone sync.WaitGroup\n\t\tallUpdatesDone.Add(1)\n\t\tvar subDone sync.WaitGroup\n\t\tsubDone.Add(1)\n\t\tgo func() {\n\t\t\tsubWg.Done()\n\t\t\tmatch := pb.Match{Prefix: []byte(\"ke\"), IgnoreBytes: \"\"}\n\t\t\terr := db.Subscribe(context.Background(), func(kvs *pb.KVList) error {\n\t\t\t\tfirstUpdate.Done()\n\t\t\t\t// Before exiting Subscribe process, we will wait until each of the\n\t\t\t\t// 1110 updates (defined below) have been completed.\n\t\t\t\tallUpdatesDone.Wait()\n\t\t\t\treturn errors.New(\"error returned\")\n\t\t\t}, []pb.Match{match})\n\t\t\trequire.Error(t, err, errors.New(\"error returned\"))\n\t\t\tsubDone.Done()\n\t\t}()\n\t\tsubWg.Wait()\n\t\tgo func() {\n\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\te := NewEntry([]byte(fmt.Sprintf(\"key%d\", 0)), []byte(fmt.Sprintf(\"value%d\", 0)))\n\t\t\t\treturn txn.SetEntry(e)\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}()\n\n\t\tfirstUpdate.Wait()\n\t\tvar req atomic.Int64\n\t\tfor i := 1; i < 1110; i++ {\n\t\t\tgo func(i int) {\n\t\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\t\te := NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), []byte(fmt.Sprintf(\"value%d\", i)))\n\t\t\t\t\treturn txn.SetEntry(e)\n\t\t\t\t})\n\t\t\t\trequire.NoError(t, 
err)\n\t\t\t\treq.Add(1)\n\t\t\t}(i)\n\t\t}\n\t\tfor {\n\t\t\tif req.Load() == 1109 {\n\t\t\t\tbreak\n\t\t\t}\n\t\t\t// FYI: This does the same as \"thread.yield()\" from other languages.\n\t\t\t//      In other words, it tells the goroutine scheduler to switch\n\t\t\t//      to another goroutine. This is strongly preferred over\n\t\t\t//      time.Sleep(...).\n\t\t\truntime.Gosched()\n\t\t}\n\t\t// Free up the subscriber, which is waiting for updates to finish.\n\t\tallUpdatesDone.Done()\n\t\t// Exit when the subscription process has exited.\n\t\tsubDone.Wait()\n\t})\n}\n\nfunc TestPublisherOrdering(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\torder := []string{}\n\t\tvar wg sync.WaitGroup\n\t\twg.Add(1)\n\t\tvar subWg sync.WaitGroup\n\t\tsubWg.Add(1)\n\t\tgo func() {\n\t\t\tsubWg.Done()\n\t\t\tupdates := 0\n\t\t\tmatch := pb.Match{Prefix: []byte(\"ke\"), IgnoreBytes: \"\"}\n\t\t\terr := db.Subscribe(context.Background(), func(kvs *pb.KVList) error {\n\t\t\t\tupdates += len(kvs.GetKv())\n\t\t\t\tfor _, kv := range kvs.GetKv() {\n\t\t\t\t\torder = append(order, string(kv.Value))\n\t\t\t\t}\n\t\t\t\tif updates == 5 {\n\t\t\t\t\twg.Done()\n\t\t\t\t}\n\t\t\t\treturn nil\n\t\t\t}, []pb.Match{match})\n\t\t\tif err != nil {\n\t\t\t\trequire.Equal(t, err.Error(), context.Canceled.Error())\n\t\t\t}\n\t\t}()\n\t\tsubWg.Wait()\n\t\tfor i := 0; i < 5; i++ {\n\t\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\t\te := NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), []byte(fmt.Sprintf(\"value%d\", i)))\n\t\t\t\treturn txn.SetEntry(e)\n\t\t\t}))\n\t\t}\n\t\twg.Wait()\n\t\tfor i := 0; i < 5; i++ {\n\t\t\trequire.Equal(t, fmt.Sprintf(\"value%d\", i), order[i])\n\t\t}\n\t})\n}\n\nfunc TestMultiplePrefix(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tvar wg sync.WaitGroup\n\t\twg.Add(1)\n\t\tvar subWg sync.WaitGroup\n\t\tsubWg.Add(1)\n\t\tgo func() {\n\t\t\tsubWg.Done()\n\t\t\tupdates := 0\n\t\t\tmatch1 := 
pb.Match{Prefix: []byte(\"ke\"), IgnoreBytes: \"\"}\n\t\t\tmatch2 := pb.Match{Prefix: []byte(\"hel\"), IgnoreBytes: \"\"}\n\t\t\terr := db.Subscribe(context.Background(), func(kvs *pb.KVList) error {\n\t\t\t\tupdates += len(kvs.GetKv())\n\t\t\t\tfor _, kv := range kvs.GetKv() {\n\t\t\t\t\tif string(kv.Key) == \"key\" {\n\t\t\t\t\t\trequire.Equal(t, string(kv.Value), \"value\")\n\t\t\t\t\t} else {\n\t\t\t\t\t\trequire.Equal(t, string(kv.Value), \"badger\")\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif updates == 2 {\n\t\t\t\t\twg.Done()\n\t\t\t\t}\n\t\t\t\treturn nil\n\t\t\t}, []pb.Match{match1, match2})\n\t\t\tif err != nil {\n\t\t\t\trequire.Equal(t, err.Error(), context.Canceled.Error())\n\t\t\t}\n\t\t}()\n\t\tsubWg.Wait()\n\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\treturn txn.SetEntry(NewEntry([]byte(\"key\"), []byte(\"value\")))\n\t\t}))\n\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\treturn txn.SetEntry(NewEntry([]byte(\"hello\"), []byte(\"badger\")))\n\t\t}))\n\t\twg.Wait()\n\t})\n}\n"
  },
  {
    "path": "skl/README.md",
    "content": "This is much better than `skiplist` and `slist`.\n\n```sh\nBenchmarkReadWrite/frac_0-8            3000000         537 ns/op\nBenchmarkReadWrite/frac_1-8            3000000         503 ns/op\nBenchmarkReadWrite/frac_2-8            3000000         492 ns/op\nBenchmarkReadWrite/frac_3-8            3000000         475 ns/op\nBenchmarkReadWrite/frac_4-8            3000000         440 ns/op\nBenchmarkReadWrite/frac_5-8            5000000         442 ns/op\nBenchmarkReadWrite/frac_6-8            5000000         380 ns/op\nBenchmarkReadWrite/frac_7-8            5000000         338 ns/op\nBenchmarkReadWrite/frac_8-8            5000000         294 ns/op\nBenchmarkReadWrite/frac_9-8            10000000        268 ns/op\nBenchmarkReadWrite/frac_10-8           100000000       26.3 ns/op\n```\n\nAnd even better than a simple map with read-write lock:\n\n```sh\nBenchmarkReadWriteMap/frac_0-8         2000000         774 ns/op\nBenchmarkReadWriteMap/frac_1-8         2000000         647 ns/op\nBenchmarkReadWriteMap/frac_2-8         3000000         605 ns/op\nBenchmarkReadWriteMap/frac_3-8         3000000         603 ns/op\nBenchmarkReadWriteMap/frac_4-8         3000000         556 ns/op\nBenchmarkReadWriteMap/frac_5-8         3000000         472 ns/op\nBenchmarkReadWriteMap/frac_6-8         3000000         476 ns/op\nBenchmarkReadWriteMap/frac_7-8         3000000         457 ns/op\nBenchmarkReadWriteMap/frac_8-8         5000000         444 ns/op\nBenchmarkReadWriteMap/frac_9-8         5000000         361 ns/op\nBenchmarkReadWriteMap/frac_10-8        10000000        212 ns/op\n```\n\n# Node Pooling\n\nCommand used\n\n```sh\nrm -Rf tmp && /usr/bin/time -l ./populate -keys_mil 10\n```\n\nFor pprof results, we run without using /usr/bin/time. 
There are four runs below.\n\nResults seem to vary quite a bit between runs.\n\n## Before node pooling\n\n```sh\n1311.53MB of 1338.69MB total (97.97%)\nDropped 30 nodes (cum <= 6.69MB)\nShowing top 10 nodes out of 37 (cum >= 12.50MB)\n      flat  flat%   sum%        cum   cum%\n  523.04MB 39.07% 39.07%   523.04MB 39.07%  github.com/dgraph-io/badger/skl.(*Skiplist).Put\n  184.51MB 13.78% 52.85%   184.51MB 13.78%  runtime.stringtoslicebyte\n  166.01MB 12.40% 65.25%   689.04MB 51.47%  github.com/dgraph-io/badger/mem.(*Table).Put\n     165MB 12.33% 77.58%      165MB 12.33%  runtime.convT2E\n  116.92MB  8.73% 86.31%   116.92MB  8.73%  bytes.makeSlice\n   62.50MB  4.67% 90.98%    62.50MB  4.67%  main.newValue\n   34.50MB  2.58% 93.56%    34.50MB  2.58%  github.com/dgraph-io/badger/table.(*BlockIterator).parseKV\n   25.50MB  1.90% 95.46%   100.06MB  7.47%  github.com/dgraph-io/badger/y.(*MergeIterator).Next\n   21.06MB  1.57% 97.04%    21.06MB  1.57%  github.com/dgraph-io/badger/table.(*Table).read\n   12.50MB  0.93% 97.97%    12.50MB  0.93%  github.com/dgraph-io/badger/table.header.Encode\n\n      128.31 real       329.37 user        17.11 sys\n3355660288  maximum resident set size\n         0  average shared memory size\n         0  average unshared data size\n         0  average unshared stack size\n   2203080  page reclaims\n       764  page faults\n         0  swaps\n       275  block input operations\n        76  block output operations\n         0  messages sent\n         0  messages received\n         0  signals received\n     49173  voluntary context switches\n    599922  involuntary context switches\n```\n\n## After node pooling\n\n```sh\n1963.13MB of 2026.09MB total (96.89%)\nDropped 29 nodes (cum <= 10.13MB)\nShowing top 10 nodes out of 41 (cum >= 185.62MB)\n      flat  flat%   sum%        cum   cum%\n  658.05MB 32.48% 32.48%   658.05MB 32.48%  github.com/dgraph-io/badger/skl.glob..func1\n  297.51MB 14.68% 47.16%   297.51MB 14.68%  runtime.convT2E\n  257.51MB 
12.71% 59.87%   257.51MB 12.71%  runtime.stringtoslicebyte\n  249.01MB 12.29% 72.16%  1007.06MB 49.70%  github.com/dgraph-io/badger/mem.(*Table).Put\n  142.43MB  7.03% 79.19%   142.43MB  7.03%  bytes.makeSlice\n     100MB  4.94% 84.13%   758.05MB 37.41%  github.com/dgraph-io/badger/skl.newNode\n   99.50MB  4.91% 89.04%    99.50MB  4.91%  main.newValue\n      75MB  3.70% 92.74%       75MB  3.70%  github.com/dgraph-io/badger/table.(*BlockIterator).parseKV\n   44.62MB  2.20% 94.94%    44.62MB  2.20%  github.com/dgraph-io/badger/table.(*Table).read\n   39.50MB  1.95% 96.89%   185.62MB  9.16%  github.com/dgraph-io/badger/y.(*MergeIterator).Next\n\n      135.58 real       374.29 user        17.65 sys\n3740614656  maximum resident set size\n         0  average shared memory size\n         0  average unshared data size\n         0  average unshared stack size\n   2276566  page reclaims\n       770  page faults\n         0  swaps\n       128  block input operations\n        90  block output operations\n         0  messages sent\n         0  messages received\n         0  signals received\n     46434  voluntary context switches\n    597049  involuntary context switches\n```\n"
  },
  {
    "path": "skl/arena.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage skl\n\nimport (\n\t\"sync/atomic\"\n\t\"unsafe\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\nconst (\n\toffsetSize = int(unsafe.Sizeof(uint32(0)))\n\n\t// Always align nodes on 64-bit boundaries, even on 32-bit architectures,\n\t// so that the node.value field is 64-bit aligned. This is necessary because\n\t// node.getValueOffset uses atomic.LoadUint64, which expects its input\n\t// pointer to be 64-bit aligned.\n\tnodeAlign = int(unsafe.Sizeof(uint64(0))) - 1\n)\n\n// Arena should be lock-free.\ntype Arena struct {\n\tn   atomic.Uint32\n\tbuf []byte\n}\n\n// newArena returns a new arena.\nfunc newArena(n int64) *Arena {\n\t// Don't store data at position 0 in order to reserve offset=0 as a kind\n\t// of nil pointer.\n\tout := &Arena{buf: make([]byte, n)}\n\tout.n.Store(1)\n\treturn out\n}\n\nfunc (s *Arena) size() int64 {\n\treturn int64(s.n.Load())\n}\n\n// putNode allocates a node in the arena. The node is aligned on a pointer-sized\n// boundary. The arena offset of the node is returned.\nfunc (s *Arena) putNode(height int) uint32 {\n\t// Compute the amount of the tower that will never be used, since the height\n\t// is less than maxHeight.\n\tunusedSize := (maxHeight - height) * offsetSize\n\n\t// Pad the allocation with enough bytes to ensure pointer alignment.\n\tl := uint32(MaxNodeSize - unusedSize + nodeAlign)\n\tn := s.n.Add(l)\n\ty.AssertTruef(int(n) <= len(s.buf),\n\t\t\"Arena too small, toWrite:%d newTotal:%d limit:%d\",\n\t\tl, n, len(s.buf))\n\n\t// Return the aligned offset.\n\tm := (n - l + uint32(nodeAlign)) & ^uint32(nodeAlign)\n\treturn m\n}\n\n// Put will *copy* val into arena. To make better use of this, reuse your input\n// val buffer. Returns an offset into buf. User is responsible for remembering\n// size of val. 
We could also store this size inside arena but the encoding and\n// decoding will incur some overhead.\nfunc (s *Arena) putVal(v y.ValueStruct) uint32 {\n\tl := v.EncodedSize()\n\tn := s.n.Add(l)\n\ty.AssertTruef(int(n) <= len(s.buf),\n\t\t\"Arena too small, toWrite:%d newTotal:%d limit:%d\",\n\t\tl, n, len(s.buf))\n\tm := n - l\n\tv.Encode(s.buf[m:])\n\treturn m\n}\n\nfunc (s *Arena) putKey(key []byte) uint32 {\n\tl := uint32(len(key))\n\tn := s.n.Add(l)\n\ty.AssertTruef(int(n) <= len(s.buf),\n\t\t\"Arena too small, toWrite:%d newTotal:%d limit:%d\",\n\t\tl, n, len(s.buf))\n\t// m is the offset where you should write.\n\t// n = new len - key len give you the offset at which you should write.\n\tm := n - l\n\t// Copy to buffer from m:n\n\ty.AssertTrue(len(key) == copy(s.buf[m:n], key))\n\treturn m\n}\n\n// getNode returns a pointer to the node located at offset. If the offset is\n// zero, then the nil node pointer is returned.\nfunc (s *Arena) getNode(offset uint32) *node {\n\tif offset == 0 {\n\t\treturn nil\n\t}\n\n\treturn (*node)(unsafe.Pointer(&s.buf[offset]))\n}\n\n// getKey returns byte slice at offset.\nfunc (s *Arena) getKey(offset uint32, size uint16) []byte {\n\treturn s.buf[offset : offset+uint32(size)]\n}\n\n// getVal returns byte slice at offset. The given size should be just the value\n// size and should NOT include the meta bytes.\nfunc (s *Arena) getVal(offset uint32, size uint32) (ret y.ValueStruct) {\n\tret.Decode(s.buf[offset : offset+size])\n\treturn\n}\n\n// getNodeOffset returns the offset of node in the arena. If the node pointer is\n// nil, then the zero offset is returned.\nfunc (s *Arena) getNodeOffset(nd *node) uint32 {\n\tif nd == nil {\n\t\treturn 0\n\t}\n\n\treturn uint32(uintptr(unsafe.Pointer(nd)) - uintptr(unsafe.Pointer(&s.buf[0])))\n}\n"
  },
  {
    "path": "skl/skl.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\n/*\nAdapted from RocksDB inline skiplist.\n\nKey differences:\n- No optimization for sequential inserts (no \"prev\").\n- No custom comparator.\n- Support overwrites. This requires care when we see the same key when inserting.\n  For RocksDB or LevelDB, overwrites are implemented as a newer sequence number in the key, so\n\tthere is no need for values. We don't intend to support versioning. In-place updates of values\n\twould be more efficient.\n- We discard all non-concurrent code.\n- We do not support Splices. This simplifies the code a lot.\n- No AllocateNode or other pointer arithmetic.\n- We combine the findLessThan, findGreaterOrEqual, etc into one function.\n*/\n\npackage skl\n\nimport (\n\t\"math\"\n\t\"sync/atomic\"\n\t\"unsafe\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nconst (\n\tmaxHeight      = 20\n\theightIncrease = math.MaxUint32 / 3\n)\n\n// MaxNodeSize is the memory footprint of a node of maximum height.\nconst MaxNodeSize = int(unsafe.Sizeof(node{}))\n\ntype node struct {\n\t// Multiple parts of the value are encoded as a single uint64 so that it\n\t// can be atomically loaded and stored:\n\t//   value offset: uint32 (bits 0-31)\n\t//   value size  : uint32 (bits 32-63)\n\tvalue atomic.Uint64\n\n\t// A byte slice is 24 bytes. We are trying to save space here.\n\tkeyOffset uint32 // Immutable. No need to lock to access key.\n\tkeySize   uint16 // Immutable. No need to lock to access key.\n\n\t// Height of the tower.\n\theight uint16\n\n\t// Most nodes do not need to use the full height of the tower, since the\n\t// probability of each successive level decreases exponentially. 
Because\n\t// these elements are never accessed, they do not need to be allocated.\n\t// Therefore, when a node is allocated in the arena, its memory footprint\n\t// is deliberately truncated to not include unneeded tower elements.\n\t//\n\t// All accesses to elements should use CAS operations, with no need to lock.\n\ttower [maxHeight]atomic.Uint32\n}\n\ntype Skiplist struct {\n\theight  atomic.Int32 // Current height. 1 <= height <= maxHeight. CAS.\n\thead    *node\n\tref     atomic.Int32\n\tarena   *Arena\n\tOnClose func()\n}\n\n// IncrRef increases the refcount\nfunc (s *Skiplist) IncrRef() {\n\ts.ref.Add(1)\n}\n\n// DecrRef decrements the refcount, deallocating the Skiplist when done using it\nfunc (s *Skiplist) DecrRef() {\n\tnewRef := s.ref.Add(-1)\n\tif newRef > 0 {\n\t\treturn\n\t}\n\tif s.OnClose != nil {\n\t\ts.OnClose()\n\t}\n\n\t// Indicate we are closed. Good for testing.  Also, lets GC reclaim memory. Race condition\n\t// here would suggest we are accessing skiplist when we are supposed to have no reference!\n\ts.arena = nil\n\t// Since the head references the arena's buf, as long as the head is kept around\n\t// GC can't release the buf.\n\ts.head = nil\n}\n\nfunc newNode(arena *Arena, key []byte, v y.ValueStruct, height int) *node {\n\t// The base level is already allocated in the node struct.\n\toffset := arena.putNode(height)\n\tnode := arena.getNode(offset)\n\tnode.keyOffset = arena.putKey(key)\n\tnode.keySize = uint16(len(key))\n\tnode.height = uint16(height)\n\tnode.value.Store(encodeValue(arena.putVal(v), v.EncodedSize()))\n\treturn node\n}\n\nfunc encodeValue(valOffset uint32, valSize uint32) uint64 {\n\treturn uint64(valSize)<<32 | uint64(valOffset)\n}\n\nfunc decodeValue(value uint64) (valOffset uint32, valSize uint32) {\n\tvalOffset = uint32(value)\n\tvalSize = uint32(value >> 32)\n\treturn\n}\n\n// NewSkiplist makes a new empty skiplist, with a given arena size\nfunc NewSkiplist(arenaSize int64) *Skiplist {\n\tarena := 
newArena(arenaSize)\n\thead := newNode(arena, nil, y.ValueStruct{}, maxHeight)\n\ts := &Skiplist{head: head, arena: arena}\n\ts.height.Store(1)\n\ts.ref.Store(1)\n\treturn s\n}\n\nfunc (s *node) getValueOffset() (uint32, uint32) {\n\tvalue := s.value.Load()\n\treturn decodeValue(value)\n}\n\nfunc (s *node) key(arena *Arena) []byte {\n\treturn arena.getKey(s.keyOffset, s.keySize)\n}\n\nfunc (s *node) setValue(arena *Arena, v y.ValueStruct) {\n\tvalOffset := arena.putVal(v)\n\tvalue := encodeValue(valOffset, v.EncodedSize())\n\ts.value.Store(value)\n}\n\nfunc (s *node) getNextOffset(h int) uint32 {\n\treturn s.tower[h].Load()\n}\n\nfunc (s *node) casNextOffset(h int, old, val uint32) bool {\n\treturn s.tower[h].CompareAndSwap(old, val)\n}\n\n// Returns true if key is strictly > n.key.\n// If n is nil, this is an \"end\" marker and we return false.\n//func (s *Skiplist) keyIsAfterNode(key []byte, n *node) bool {\n//\ty.AssertTrue(n != s.head)\n//\treturn n != nil && y.CompareKeys(key, n.key) > 0\n//}\n\nfunc (s *Skiplist) randomHeight() int {\n\th := 1\n\tfor h < maxHeight && z.FastRand() <= heightIncrease {\n\t\th++\n\t}\n\treturn h\n}\n\nfunc (s *Skiplist) getNext(nd *node, height int) *node {\n\treturn s.arena.getNode(nd.getNextOffset(height))\n}\n\n// findNear finds the node near to key.\n// If less=true, it finds rightmost node such that node.key < key (if allowEqual=false) or\n// node.key <= key (if allowEqual=true).\n// If less=false, it finds leftmost node such that node.key > key (if allowEqual=false) or\n// node.key >= key (if allowEqual=true).\n// Returns the node found. 
The bool returned is true if the node has key equal to given key.\nfunc (s *Skiplist) findNear(key []byte, less bool, allowEqual bool) (*node, bool) {\n\tx := s.head\n\tlevel := int(s.getHeight() - 1)\n\tfor {\n\t\t// Assume x.key < key.\n\t\tnext := s.getNext(x, level)\n\t\tif next == nil {\n\t\t\t// x.key < key < END OF LIST\n\t\t\tif level > 0 {\n\t\t\t\t// Can descend further to iterate closer to the end.\n\t\t\t\tlevel--\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\t// Level=0. Cannot descend further. Let's return something that makes sense.\n\t\t\tif !less {\n\t\t\t\treturn nil, false\n\t\t\t}\n\t\t\t// Try to return x. Make sure it is not a head node.\n\t\t\tif x == s.head {\n\t\t\t\treturn nil, false\n\t\t\t}\n\t\t\treturn x, false\n\t\t}\n\n\t\tnextKey := next.key(s.arena)\n\t\tcmp := y.CompareKeys(key, nextKey)\n\t\tif cmp > 0 {\n\t\t\t// x.key < next.key < key. We can continue to move right.\n\t\t\tx = next\n\t\t\tcontinue\n\t\t}\n\t\tif cmp == 0 {\n\t\t\t// x.key < key == next.key.\n\t\t\tif allowEqual {\n\t\t\t\treturn next, true\n\t\t\t}\n\t\t\tif !less {\n\t\t\t\t// We want >, so go to base level to grab the next bigger node.\n\t\t\t\treturn s.getNext(next, 0), false\n\t\t\t}\n\t\t\t// We want <. If not base level, we should go closer in the next level.\n\t\t\tif level > 0 {\n\t\t\t\tlevel--\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\t// On base level. Return x.\n\t\t\tif x == s.head {\n\t\t\t\treturn nil, false\n\t\t\t}\n\t\t\treturn x, false\n\t\t}\n\t\t// cmp < 0. In other words, x.key < key < next.key.\n\t\tif level > 0 {\n\t\t\tlevel--\n\t\t\tcontinue\n\t\t}\n\t\t// At base level. Need to return something.\n\t\tif !less {\n\t\t\treturn next, false\n\t\t}\n\t\t// Try to return x. 
Make sure it is not a head node.\n\t\tif x == s.head {\n\t\t\treturn nil, false\n\t\t}\n\t\treturn x, false\n\t}\n}\n\n// findSpliceForLevel returns (outBefore, outAfter) with outBefore.key <= key <= outAfter.key.\n// The input \"before\" tells us where to start looking.\n// If we found a node with the same key, then we return outBefore = outAfter.\n// Otherwise, outBefore.key < key < outAfter.key.\nfunc (s *Skiplist) findSpliceForLevel(key []byte, before *node, level int) (*node, *node) {\n\tfor {\n\t\t// Assume before.key < key.\n\t\tnext := s.getNext(before, level)\n\t\tif next == nil {\n\t\t\treturn before, next\n\t\t}\n\t\tnextKey := next.key(s.arena)\n\t\tcmp := y.CompareKeys(key, nextKey)\n\t\tif cmp == 0 {\n\t\t\t// Equality case.\n\t\t\treturn next, next\n\t\t}\n\t\tif cmp < 0 {\n\t\t\t// before.key < key < next.key. We are done for this level.\n\t\t\treturn before, next\n\t\t}\n\t\tbefore = next // Keep moving right on this level.\n\t}\n}\n\nfunc (s *Skiplist) getHeight() int32 {\n\treturn s.height.Load()\n}\n\n// Put inserts the key-value pair.\nfunc (s *Skiplist) Put(key []byte, v y.ValueStruct) {\n\t// Since we allow overwrite, we may not need to create a new node. We might not even need to\n\t// increase the height. 
Let's defer these actions.\n\n\tlistHeight := s.getHeight()\n\tvar prev [maxHeight + 1]*node\n\tvar next [maxHeight + 1]*node\n\tprev[listHeight] = s.head\n\tnext[listHeight] = nil\n\tfor i := int(listHeight) - 1; i >= 0; i-- {\n\t\t// Use higher level to speed up for current level.\n\t\tprev[i], next[i] = s.findSpliceForLevel(key, prev[i+1], i)\n\t\tif prev[i] == next[i] {\n\t\t\tprev[i].setValue(s.arena, v)\n\t\t\treturn\n\t\t}\n\t}\n\n\t// We do need to create a new node.\n\theight := s.randomHeight()\n\tx := newNode(s.arena, key, v, height)\n\n\t// Try to increase s.height via CAS.\n\tlistHeight = s.getHeight()\n\tfor height > int(listHeight) {\n\t\tif s.height.CompareAndSwap(listHeight, int32(height)) {\n\t\t\t// Successfully increased skiplist.height.\n\t\t\tbreak\n\t\t}\n\t\tlistHeight = s.getHeight()\n\t}\n\n\t// We always insert from the base level and up. After you add a node in base level, we cannot\n\t// create a node in the level above because it would have discovered the node in the base level.\n\tfor i := 0; i < height; i++ {\n\t\tfor {\n\t\t\tif prev[i] == nil {\n\t\t\t\ty.AssertTrue(i > 1) // This cannot happen in base level.\n\t\t\t\t// We haven't computed prev, next for this level because height exceeds old listHeight.\n\t\t\t\t// For these levels, we expect the lists to be sparse, so we can just search from head.\n\t\t\t\tprev[i], next[i] = s.findSpliceForLevel(key, s.head, i)\n\t\t\t\t// Someone adds the exact same key before we are able to do so. This can only happen on\n\t\t\t\t// the base level. But we know we are not on the base level.\n\t\t\t\ty.AssertTrue(prev[i] != next[i])\n\t\t\t}\n\t\t\tnextOffset := s.arena.getNodeOffset(next[i])\n\t\t\tx.tower[i].Store(nextOffset)\n\t\t\tif prev[i].casNextOffset(i, nextOffset, s.arena.getNodeOffset(x)) {\n\t\t\t\t// Managed to insert x between prev[i] and next[i]. Go to the next level.\n\t\t\t\tbreak\n\t\t\t}\n\t\t\t// CAS failed. 
We need to recompute prev and next.\n\t\t\t// It is unlikely to be helpful to try to use a different level as we redo the search,\n\t\t\t// because it is unlikely that lots of nodes are inserted between prev[i] and next[i].\n\t\t\tprev[i], next[i] = s.findSpliceForLevel(key, prev[i], i)\n\t\t\tif prev[i] == next[i] {\n\t\t\t\ty.AssertTruef(i == 0, \"Equality can happen only on base level: %d\", i)\n\t\t\t\tprev[i].setValue(s.arena, v)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t}\n}\n\n// Empty returns if the Skiplist is empty.\nfunc (s *Skiplist) Empty() bool {\n\treturn s.findLast() == nil\n}\n\n// findLast returns the last element. If head (empty list), we return nil. All the find functions\n// will NEVER return the head nodes.\nfunc (s *Skiplist) findLast() *node {\n\tn := s.head\n\tlevel := int(s.getHeight()) - 1\n\tfor {\n\t\tnext := s.getNext(n, level)\n\t\tif next != nil {\n\t\t\tn = next\n\t\t\tcontinue\n\t\t}\n\t\tif level == 0 {\n\t\t\tif n == s.head {\n\t\t\t\treturn nil\n\t\t\t}\n\t\t\treturn n\n\t\t}\n\t\tlevel--\n\t}\n}\n\n// Get gets the value associated with the key. It returns a valid value if it finds equal or earlier\n// version of the same key.\nfunc (s *Skiplist) Get(key []byte) y.ValueStruct {\n\tn, _ := s.findNear(key, false, true) // findGreaterOrEqual.\n\tif n == nil {\n\t\treturn y.ValueStruct{}\n\t}\n\n\tnextKey := s.arena.getKey(n.keyOffset, n.keySize)\n\tif !y.SameKey(key, nextKey) {\n\t\treturn y.ValueStruct{}\n\t}\n\n\tvalOffset, valSize := n.getValueOffset()\n\tvs := s.arena.getVal(valOffset, valSize)\n\tvs.Version = y.ParseTs(nextKey)\n\treturn vs\n}\n\n// NewIterator returns a skiplist iterator.  
You have to Close() the iterator.\nfunc (s *Skiplist) NewIterator() *Iterator {\n\ts.IncrRef()\n\treturn &Iterator{list: s}\n}\n\n// MemSize returns the size of the Skiplist in terms of how much memory is used within its internal\n// arena.\nfunc (s *Skiplist) MemSize() int64 { return s.arena.size() }\n\n// Iterator is an iterator over skiplist object. For new objects, you just\n// need to initialize Iterator.list.\ntype Iterator struct {\n\tlist *Skiplist\n\tn    *node\n}\n\n// Close frees the resources held by the iterator\nfunc (s *Iterator) Close() error {\n\ts.list.DecrRef()\n\treturn nil\n}\n\n// Valid returns true iff the iterator is positioned at a valid node.\nfunc (s *Iterator) Valid() bool { return s.n != nil }\n\n// Key returns the key at the current position.\nfunc (s *Iterator) Key() []byte {\n\treturn s.list.arena.getKey(s.n.keyOffset, s.n.keySize)\n}\n\n// Value returns value.\nfunc (s *Iterator) Value() y.ValueStruct {\n\tvalOffset, valSize := s.n.getValueOffset()\n\treturn s.list.arena.getVal(valOffset, valSize)\n}\n\n// ValueUint64 returns the uint64 value of the current node.\nfunc (s *Iterator) ValueUint64() uint64 {\n\treturn s.n.value.Load()\n}\n\n// Next advances to the next position.\nfunc (s *Iterator) Next() {\n\ty.AssertTrue(s.Valid())\n\ts.n = s.list.getNext(s.n, 0)\n}\n\n// Prev advances to the previous position.\nfunc (s *Iterator) Prev() {\n\ty.AssertTrue(s.Valid())\n\ts.n, _ = s.list.findNear(s.Key(), true, false) // find <. 
No equality allowed.\n}\n\n// Seek advances to the first entry with a key >= target.\nfunc (s *Iterator) Seek(target []byte) {\n\ts.n, _ = s.list.findNear(target, false, true) // find >=.\n}\n\n// SeekForPrev finds an entry with key <= target.\nfunc (s *Iterator) SeekForPrev(target []byte) {\n\ts.n, _ = s.list.findNear(target, true, true) // find <=.\n}\n\n// SeekToFirst seeks position at the first entry in list.\n// Final state of iterator is Valid() iff list is not empty.\nfunc (s *Iterator) SeekToFirst() {\n\ts.n = s.list.getNext(s.list.head, 0)\n}\n\n// SeekToLast seeks position at the last entry in list.\n// Final state of iterator is Valid() iff list is not empty.\nfunc (s *Iterator) SeekToLast() {\n\ts.n = s.list.findLast()\n}\n\n// UniIterator is a unidirectional memtable iterator. It is a thin wrapper around\n// Iterator. We like to keep Iterator as before, because it is more powerful and\n// we might support bidirectional iterators in the future.\ntype UniIterator struct {\n\titer     *Iterator\n\treversed bool\n}\n\n// NewUniIterator returns a UniIterator.\nfunc (s *Skiplist) NewUniIterator(reversed bool) *UniIterator {\n\treturn &UniIterator{\n\t\titer:     s.NewIterator(),\n\t\treversed: reversed,\n\t}\n}\n\n// Next implements y.Interface\nfunc (s *UniIterator) Next() {\n\tif !s.reversed {\n\t\ts.iter.Next()\n\t} else {\n\t\ts.iter.Prev()\n\t}\n}\n\n// Rewind implements y.Interface\nfunc (s *UniIterator) Rewind() {\n\tif !s.reversed {\n\t\ts.iter.SeekToFirst()\n\t} else {\n\t\ts.iter.SeekToLast()\n\t}\n}\n\n// Seek implements y.Interface\nfunc (s *UniIterator) Seek(key []byte) {\n\tif !s.reversed {\n\t\ts.iter.Seek(key)\n\t} else {\n\t\ts.iter.SeekForPrev(key)\n\t}\n}\n\n// Key implements y.Interface\nfunc (s *UniIterator) Key() []byte { return s.iter.Key() }\n\n// Value implements y.Interface\nfunc (s *UniIterator) Value() y.ValueStruct { return s.iter.Value() }\n\n// Valid implements y.Interface\nfunc (s *UniIterator) Valid() bool { return 
s.iter.Valid() }\n\n// Close implements y.Interface (and frees up the iter's resources)\nfunc (s *UniIterator) Close() error { return s.iter.Close() }\n"
  },
  {
    "path": "skl/skl_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage skl\n\nimport (\n\t\"encoding/binary\"\n\t\"fmt\"\n\t\"math/rand\"\n\t\"strconv\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\nconst arenaSize = 1 << 20\n\nfunc (s *Skiplist) valid() bool { return s.arena != nil }\n\nfunc newValue(v int) []byte {\n\treturn []byte(fmt.Sprintf(\"%05d\", v))\n}\n\n// length iterates over skiplist to give exact size.\nfunc length(s *Skiplist) int {\n\tx := s.getNext(s.head, 0)\n\tcount := 0\n\tfor x != nil {\n\t\tcount++\n\t\tx = s.getNext(x, 0)\n\t}\n\treturn count\n}\n\nfunc TestEmpty(t *testing.T) {\n\tkey := []byte(\"aaa\")\n\tl := NewSkiplist(arenaSize)\n\n\tv := l.Get(key)\n\trequire.True(t, v.Value == nil) // Cannot use require.Nil for unsafe.Pointer nil.\n\n\tfor _, less := range []bool{true, false} {\n\t\tfor _, allowEqual := range []bool{true, false} {\n\t\t\tn, found := l.findNear(key, less, allowEqual)\n\t\t\trequire.Nil(t, n)\n\t\t\trequire.False(t, found)\n\t\t}\n\t}\n\n\tit := l.NewIterator()\n\trequire.False(t, it.Valid())\n\n\tit.SeekToFirst()\n\trequire.False(t, it.Valid())\n\n\tit.SeekToLast()\n\trequire.False(t, it.Valid())\n\n\tit.Seek(key)\n\trequire.False(t, it.Valid())\n\n\tl.DecrRef()\n\trequire.True(t, l.valid()) // Check the reference counting.\n\n\tit.Close()\n\trequire.False(t, l.valid()) // Check the reference counting.\n}\n\n// TestBasic tests single-threaded inserts and updates and gets.\nfunc TestBasic(t *testing.T) {\n\tl := NewSkiplist(arenaSize)\n\tval1 := newValue(42)\n\tval2 := newValue(52)\n\tval3 := newValue(62)\n\tval4 := newValue(72)\n\tval5 := []byte(fmt.Sprintf(\"%0102400d\", 1)) // Have size 100 KB which is > math.MaxUint16.\n\n\t// Try inserting values.\n\t// Somehow require.Nil doesn't work when checking for 
unsafe.Pointer(nil).\n\tl.Put(y.KeyWithTs([]byte(\"key1\"), 0), y.ValueStruct{Value: val1, Meta: 55, UserMeta: 0})\n\tl.Put(y.KeyWithTs([]byte(\"key2\"), 2), y.ValueStruct{Value: val2, Meta: 56, UserMeta: 0})\n\tl.Put(y.KeyWithTs([]byte(\"key3\"), 0), y.ValueStruct{Value: val3, Meta: 57, UserMeta: 0})\n\n\tv := l.Get(y.KeyWithTs([]byte(\"key\"), 0))\n\trequire.True(t, v.Value == nil)\n\n\tv = l.Get(y.KeyWithTs([]byte(\"key1\"), 0))\n\trequire.True(t, v.Value != nil)\n\trequire.EqualValues(t, \"00042\", string(v.Value))\n\trequire.EqualValues(t, 55, v.Meta)\n\n\tv = l.Get(y.KeyWithTs([]byte(\"key2\"), 0))\n\trequire.True(t, v.Value == nil)\n\n\tv = l.Get(y.KeyWithTs([]byte(\"key3\"), 0))\n\trequire.True(t, v.Value != nil)\n\trequire.EqualValues(t, \"00062\", string(v.Value))\n\trequire.EqualValues(t, 57, v.Meta)\n\n\tl.Put(y.KeyWithTs([]byte(\"key3\"), 1), y.ValueStruct{Value: val4, Meta: 12, UserMeta: 0})\n\tv = l.Get(y.KeyWithTs([]byte(\"key3\"), 1))\n\trequire.True(t, v.Value != nil)\n\trequire.EqualValues(t, \"00072\", string(v.Value))\n\trequire.EqualValues(t, 12, v.Meta)\n\n\tl.Put(y.KeyWithTs([]byte(\"key4\"), 1), y.ValueStruct{Value: val5, Meta: 60, UserMeta: 0})\n\tv = l.Get(y.KeyWithTs([]byte(\"key4\"), 1))\n\trequire.NotNil(t, v.Value)\n\trequire.EqualValues(t, val5, v.Value)\n\trequire.EqualValues(t, 60, v.Meta)\n}\n\n// TestConcurrentBasic tests concurrent writes followed by concurrent reads.\nfunc TestConcurrentBasic(t *testing.T) {\n\tconst n = 1000\n\tl := NewSkiplist(arenaSize)\n\tvar wg sync.WaitGroup\n\tkey := func(i int) []byte {\n\t\treturn y.KeyWithTs([]byte(fmt.Sprintf(\"%05d\", i)), 0)\n\t}\n\tfor i := 0; i < n; i++ {\n\t\twg.Add(1)\n\t\tgo func(i int) {\n\t\t\tdefer wg.Done()\n\t\t\tl.Put(key(i),\n\t\t\t\ty.ValueStruct{Value: newValue(i), Meta: 0, UserMeta: 0})\n\t\t}(i)\n\t}\n\twg.Wait()\n\t// Check values. 
Concurrent reads.\n\tfor i := 0; i < n; i++ {\n\t\twg.Add(1)\n\t\tgo func(i int) {\n\t\t\tdefer wg.Done()\n\t\t\tv := l.Get(key(i))\n\t\t\trequire.True(t, v.Value != nil)\n\t\t\trequire.EqualValues(t, newValue(i), v.Value)\n\t\t}(i)\n\t}\n\twg.Wait()\n\trequire.EqualValues(t, n, length(l))\n}\n\nfunc TestConcurrentBasicBigValues(t *testing.T) {\n\tconst n = 100\n\tarenaSize := int64(120 << 20) // 120 MB\n\tl := NewSkiplist(arenaSize)\n\tvar wg sync.WaitGroup\n\tkey := func(i int) []byte {\n\t\treturn y.KeyWithTs([]byte(fmt.Sprintf(\"%05d\", i)), 0)\n\t}\n\tBigValue := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%01048576d\", i)) // Have 1 MB value which is > math.MaxUint16.\n\t}\n\tfor i := 0; i < n; i++ {\n\t\twg.Add(1)\n\t\tgo func(i int) {\n\t\t\tdefer wg.Done()\n\t\t\tl.Put(key(i),\n\t\t\t\ty.ValueStruct{Value: BigValue(i), Meta: 0, UserMeta: 0})\n\t\t}(i)\n\t}\n\twg.Wait()\n\t// Check values. Concurrent reads.\n\tfor i := 0; i < n; i++ {\n\t\twg.Add(1)\n\t\tgo func(i int) {\n\t\t\tdefer wg.Done()\n\t\t\tv := l.Get(key(i))\n\t\t\trequire.NotNil(t, v.Value)\n\t\t\trequire.EqualValues(t, BigValue(i), v.Value)\n\t\t}(i)\n\t}\n\twg.Wait()\n\trequire.EqualValues(t, n, length(l))\n}\n\n// TestOneKey will read while writing to one single key.\nfunc TestOneKey(t *testing.T) {\n\tconst n = 100\n\tkey := y.KeyWithTs([]byte(\"thekey\"), 0)\n\tl := NewSkiplist(arenaSize)\n\tdefer l.DecrRef()\n\n\tvar wg sync.WaitGroup\n\tfor i := 0; i < n; i++ {\n\t\twg.Add(1)\n\t\tgo func(i int) {\n\t\t\tdefer wg.Done()\n\t\t\tl.Put(key, y.ValueStruct{Value: newValue(i), Meta: 0, UserMeta: 0})\n\t\t}(i)\n\t}\n\t// We expect that at least some write made it such that some read returns a value.\n\tvar sawValue atomic.Int32\n\tfor i := 0; i < n; i++ {\n\t\twg.Add(1)\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\t\t\tp := l.Get(key)\n\t\t\tif p.Value == nil {\n\t\t\t\treturn\n\t\t\t}\n\t\t\tsawValue.Add(1)\n\t\t\tv, err := strconv.Atoi(string(p.Value))\n\t\t\trequire.NoError(t, 
err)\n\t\t\trequire.True(t, 0 <= v && v < n, fmt.Sprintf(\"invalid value %d\", v))\n\t\t}()\n\t}\n\twg.Wait()\n\trequire.True(t, sawValue.Load() > 0)\n\trequire.EqualValues(t, 1, length(l))\n}\n\nfunc TestFindNear(t *testing.T) {\n\tl := NewSkiplist(arenaSize)\n\tdefer l.DecrRef()\n\tfor i := 0; i < 1000; i++ {\n\t\tkey := fmt.Sprintf(\"%05d\", i*10+5)\n\t\tl.Put(y.KeyWithTs([]byte(key), 0), y.ValueStruct{Value: newValue(i), Meta: 0, UserMeta: 0})\n\t}\n\n\tn, eq := l.findNear(y.KeyWithTs([]byte(\"00001\"), 0), false, false)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"00005\"), 0), string(n.key(l.arena)))\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"00001\"), 0), false, true)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"00005\"), 0), string(n.key(l.arena)))\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"00001\"), 0), true, false)\n\trequire.Nil(t, n)\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"00001\"), 0), true, true)\n\trequire.Nil(t, n)\n\trequire.False(t, eq)\n\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"00005\"), 0), false, false)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"00015\"), 0), string(n.key(l.arena)))\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"00005\"), 0), false, true)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"00005\"), 0), string(n.key(l.arena)))\n\trequire.True(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"00005\"), 0), true, false)\n\trequire.Nil(t, n)\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"00005\"), 0), true, true)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"00005\"), 0), string(n.key(l.arena)))\n\trequire.True(t, eq)\n\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"05555\"), 0), false, false)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"05565\"), 0), 
string(n.key(l.arena)))\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"05555\"), 0), false, true)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"05555\"), 0), string(n.key(l.arena)))\n\trequire.True(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"05555\"), 0), true, false)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"05545\"), 0), string(n.key(l.arena)))\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"05555\"), 0), true, true)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"05555\"), 0), string(n.key(l.arena)))\n\trequire.True(t, eq)\n\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"05558\"), 0), false, false)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"05565\"), 0), string(n.key(l.arena)))\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"05558\"), 0), false, true)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"05565\"), 0), string(n.key(l.arena)))\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"05558\"), 0), true, false)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"05555\"), 0), string(n.key(l.arena)))\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"05558\"), 0), true, true)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"05555\"), 0), string(n.key(l.arena)))\n\trequire.False(t, eq)\n\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"09995\"), 0), false, false)\n\trequire.Nil(t, n)\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"09995\"), 0), false, true)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"09995\"), 0), string(n.key(l.arena)))\n\trequire.True(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"09995\"), 0), true, false)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"09985\"), 0), string(n.key(l.arena)))\n\trequire.False(t, eq)\n\tn, eq 
= l.findNear(y.KeyWithTs([]byte(\"09995\"), 0), true, true)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"09995\"), 0), string(n.key(l.arena)))\n\trequire.True(t, eq)\n\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"59995\"), 0), false, false)\n\trequire.Nil(t, n)\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"59995\"), 0), false, true)\n\trequire.Nil(t, n)\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"59995\"), 0), true, false)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"09995\"), 0), string(n.key(l.arena)))\n\trequire.False(t, eq)\n\tn, eq = l.findNear(y.KeyWithTs([]byte(\"59995\"), 0), true, true)\n\trequire.NotNil(t, n)\n\trequire.EqualValues(t, y.KeyWithTs([]byte(\"09995\"), 0), string(n.key(l.arena)))\n\trequire.False(t, eq)\n}\n\n// TestIteratorNext tests a basic iteration over all nodes from the beginning.\nfunc TestIteratorNext(t *testing.T) {\n\tconst n = 100\n\tl := NewSkiplist(arenaSize)\n\tdefer l.DecrRef()\n\tit := l.NewIterator()\n\tdefer it.Close()\n\trequire.False(t, it.Valid())\n\tit.SeekToFirst()\n\trequire.False(t, it.Valid())\n\tfor i := n - 1; i >= 0; i-- {\n\t\tl.Put(y.KeyWithTs([]byte(fmt.Sprintf(\"%05d\", i)), 0),\n\t\t\ty.ValueStruct{Value: newValue(i), Meta: 0, UserMeta: 0})\n\t}\n\tit.SeekToFirst()\n\tfor i := 0; i < n; i++ {\n\t\trequire.True(t, it.Valid())\n\t\tv := it.Value()\n\t\trequire.EqualValues(t, newValue(i), v.Value)\n\t\tit.Next()\n\t}\n\trequire.False(t, it.Valid())\n}\n\n// TestIteratorPrev tests a basic iteration over all nodes from the end.\nfunc TestIteratorPrev(t *testing.T) {\n\tconst n = 100\n\tl := NewSkiplist(arenaSize)\n\tdefer l.DecrRef()\n\tit := l.NewIterator()\n\tdefer it.Close()\n\trequire.False(t, it.Valid())\n\tit.SeekToFirst()\n\trequire.False(t, it.Valid())\n\tfor i := 0; i < n; i++ {\n\t\tl.Put(y.KeyWithTs([]byte(fmt.Sprintf(\"%05d\", i)), 0),\n\t\t\ty.ValueStruct{Value: newValue(i), Meta: 0, UserMeta: 
0})\n\t}\n\tit.SeekToLast()\n\tfor i := n - 1; i >= 0; i-- {\n\t\trequire.True(t, it.Valid())\n\t\tv := it.Value()\n\t\trequire.EqualValues(t, newValue(i), v.Value)\n\t\tit.Prev()\n\t}\n\trequire.False(t, it.Valid())\n}\n\n// TestIteratorSeek tests Seek and SeekForPrev.\nfunc TestIteratorSeek(t *testing.T) {\n\tconst n = 100\n\tl := NewSkiplist(arenaSize)\n\tdefer l.DecrRef()\n\n\tit := l.NewIterator()\n\tdefer it.Close()\n\n\trequire.False(t, it.Valid())\n\tit.SeekToFirst()\n\trequire.False(t, it.Valid())\n\t// 1000, 1010, 1020, ..., 1990.\n\tfor i := n - 1; i >= 0; i-- {\n\t\tv := i*10 + 1000\n\t\tl.Put(y.KeyWithTs([]byte(fmt.Sprintf(\"%05d\", i*10+1000)), 0),\n\t\t\ty.ValueStruct{Value: newValue(v), Meta: 0, UserMeta: 0})\n\t}\n\tit.SeekToFirst()\n\trequire.True(t, it.Valid())\n\tv := it.Value()\n\trequire.EqualValues(t, \"01000\", v.Value)\n\n\tit.Seek(y.KeyWithTs([]byte(\"01000\"), 0))\n\trequire.True(t, it.Valid())\n\tv = it.Value()\n\trequire.EqualValues(t, \"01000\", v.Value)\n\n\tit.Seek(y.KeyWithTs([]byte(\"01005\"), 0))\n\trequire.True(t, it.Valid())\n\tv = it.Value()\n\trequire.EqualValues(t, \"01010\", v.Value)\n\n\tit.Seek(y.KeyWithTs([]byte(\"01010\"), 0))\n\trequire.True(t, it.Valid())\n\tv = it.Value()\n\trequire.EqualValues(t, \"01010\", v.Value)\n\n\tit.Seek(y.KeyWithTs([]byte(\"99999\"), 0))\n\trequire.False(t, it.Valid())\n\n\t// Try SeekForPrev.\n\tit.SeekForPrev(y.KeyWithTs([]byte(\"00\"), 0))\n\trequire.False(t, it.Valid())\n\n\tit.SeekForPrev(y.KeyWithTs([]byte(\"01000\"), 0))\n\trequire.True(t, it.Valid())\n\tv = it.Value()\n\trequire.EqualValues(t, \"01000\", v.Value)\n\n\tit.SeekForPrev(y.KeyWithTs([]byte(\"01005\"), 0))\n\trequire.True(t, it.Valid())\n\tv = it.Value()\n\trequire.EqualValues(t, \"01000\", v.Value)\n\n\tit.SeekForPrev(y.KeyWithTs([]byte(\"01010\"), 0))\n\trequire.True(t, it.Valid())\n\tv = it.Value()\n\trequire.EqualValues(t, \"01010\", v.Value)\n\n\tit.SeekForPrev(y.KeyWithTs([]byte(\"99999\"), 0))\n\trequire.True(t, 
it.Valid())\n\tv = it.Value()\n\trequire.EqualValues(t, \"01990\", v.Value)\n}\n\nfunc randomKey(rng *rand.Rand) []byte {\n\tb := make([]byte, 8)\n\tkey := rng.Uint32()\n\tkey2 := rng.Uint32()\n\tbinary.LittleEndian.PutUint32(b, key)\n\tbinary.LittleEndian.PutUint32(b[4:], key2)\n\treturn y.KeyWithTs(b, 0)\n}\n\n// Standard test. Some fraction is read. Some fraction is write. Writes have\n// to go through mutex lock.\nfunc BenchmarkReadWrite(b *testing.B) {\n\tvalue := newValue(123)\n\tfor i := 0; i <= 10; i++ {\n\t\treadFrac := float32(i) / 10.0\n\t\tb.Run(fmt.Sprintf(\"frac_%d\", i), func(b *testing.B) {\n\t\t\tl := NewSkiplist(int64((b.N + 1) * MaxNodeSize))\n\t\t\tdefer l.DecrRef()\n\t\t\tb.ResetTimer()\n\t\t\tvar count int\n\t\t\tb.RunParallel(func(pb *testing.PB) {\n\t\t\t\trng := rand.New(rand.NewSource(time.Now().UnixNano()))\n\t\t\t\tfor pb.Next() {\n\t\t\t\t\tif rng.Float32() < readFrac {\n\t\t\t\t\t\tv := l.Get(randomKey(rng))\n\t\t\t\t\t\tif v.Value != nil {\n\t\t\t\t\t\t\tcount++\n\t\t\t\t\t\t}\n\t\t\t\t\t} else {\n\t\t\t\t\t\tl.Put(randomKey(rng), y.ValueStruct{Value: value, Meta: 0, UserMeta: 0})\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t})\n\t\t})\n\t}\n}\n\n// Standard test. Some fraction is read. Some fraction is write. 
Writes have\n// to go through mutex lock.\nfunc BenchmarkReadWriteMap(b *testing.B) {\n\tvalue := newValue(123)\n\tfor i := 0; i <= 10; i++ {\n\t\treadFrac := float32(i) / 10.0\n\t\tb.Run(fmt.Sprintf(\"frac_%d\", i), func(b *testing.B) {\n\t\t\tm := make(map[string][]byte)\n\t\t\tvar mutex sync.RWMutex\n\t\t\tb.ResetTimer()\n\t\t\tvar count int\n\t\t\tb.RunParallel(func(pb *testing.PB) {\n\t\t\t\trng := rand.New(rand.NewSource(time.Now().UnixNano()))\n\t\t\t\tfor pb.Next() {\n\t\t\t\t\tif rng.Float32() < readFrac {\n\t\t\t\t\t\tmutex.RLock()\n\t\t\t\t\t\t_, ok := m[string(randomKey(rng))]\n\t\t\t\t\t\tmutex.RUnlock()\n\t\t\t\t\t\tif ok {\n\t\t\t\t\t\t\tcount++\n\t\t\t\t\t\t}\n\t\t\t\t\t} else {\n\t\t\t\t\t\tmutex.Lock()\n\t\t\t\t\t\tm[string(randomKey(rng))] = value\n\t\t\t\t\t\tmutex.Unlock()\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t})\n\t\t})\n\t}\n}\n\nfunc BenchmarkWrite(b *testing.B) {\n\tvalue := newValue(123)\n\tl := NewSkiplist(int64((b.N + 1) * MaxNodeSize))\n\tdefer l.DecrRef()\n\tb.ResetTimer()\n\tb.RunParallel(func(pb *testing.PB) {\n\t\trng := rand.New(rand.NewSource(time.Now().UnixNano()))\n\t\tfor pb.Next() {\n\t\t\tl.Put(randomKey(rng), y.ValueStruct{Value: value, Meta: 0, UserMeta: 0})\n\t\t}\n\t})\n}\n"
  },
  {
    "path": "stream.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"context\"\n\t\"sort\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"time\"\n\n\thumanize \"github.com/dustin/go-humanize\"\n\t\"google.golang.org/protobuf/proto\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nconst batchSize = 16 << 20 // 16 MB\n\n// maxStreamSize is the maximum allowed size of a stream batch. This is a soft limit\n// as a single list that is still over the limit will have to be sent as is since it\n// cannot be split further. This limit prevents the framework from creating batches\n// so big that sending them causes issues (e.g running into the max size gRPC limit).\nvar maxStreamSize = uint64(100 << 20) // 100MB\n\n// Stream provides a framework to concurrently iterate over a snapshot of Badger, pick up\n// key-values, batch them up and call Send. Stream does concurrent iteration over many smaller key\n// ranges. It does NOT send keys in lexicographical sorted order. To get keys in sorted\n// order, use Iterator.\ntype Stream struct {\n\t// Prefix to only iterate over certain range of keys. If set to nil (default), Stream would\n\t// iterate over the entire DB.\n\tPrefix []byte\n\n\t// Number of goroutines to use for iterating over key ranges. Defaults to 8.\n\tNumGo int\n\n\t// Badger would produce log entries in Infof to indicate the progress of Stream. LogPrefix can\n\t// be used to help differentiate them from other activities. Default is \"Badger.Stream\".\n\tLogPrefix string\n\n\t// ChooseKey is invoked each time a new key is encountered. Note that this is not called\n\t// on every version of the value, only the first encountered version (i.e. the highest version\n\t// of the value a key has). 
ChooseKey can be left nil to select all keys.\n\t//\n\t// Note: Calls to ChooseKey are concurrent.\n\tChooseKey func(item *Item) bool\n\n\t// MaxSize is the maximum allowed size of a stream batch. This is a soft limit\n\t// as a single list that is still over the limit will have to be sent as is since it\n\t// cannot be split further. This limit prevents the framework from creating batches\n\t// so big that sending them causes issues (e.g. running into the max size gRPC limit).\n\t// If necessary, set it before the Stream starts synchronisation.\n\t// This setting is not concurrency-safe.\n\tMaxSize uint64\n\n\t// KeyToList, similar to ChooseKey, is only invoked on the highest version of the value. It\n\t// is up to the caller to iterate over the versions and generate zero, one or more KVs. It\n\t// is expected that the user would advance the iterator to go through the versions of the\n\t// values. However, the user MUST immediately return from this function on the first encounter\n\t// with a mismatching key. See example usage in ToList function. Can be left nil to use ToList\n\t// function by default.\n\t//\n\t// KeyToList has access to z.Allocator accessible via stream.Allocator(itr.ThreadId). This\n\t// allocator can be used to allocate KVs, to decrease the memory pressure on Go GC. Stream\n\t// framework takes care of releasing those resources after calling Send. AllocRef does\n\t// NOT need to be set in the returned KVList, as Stream framework would ignore that field,\n\t// instead using the allocator assigned to that thread id.\n\t//\n\t// Note: Calls to KeyToList are concurrent.\n\tKeyToList func(key []byte, itr *Iterator) (*pb.KVList, error)\n\t// UseKeyToListWithThreadId is used to indicate that KeyToListWithThreadId should be used\n\t// instead of KeyToList. This is a new API that can be used to figure out parallelism\n\t// of the stream. Each threadId would be run serially. 
KeyToList being concurrent makes you\n\t// take care of concurrency in KeyToList. Here threadId could be used to do some things serially.\n\t// Once a thread finishes, FinishThread() is called.\n\tUseKeyToListWithThreadId bool\n\tKeyToListWithThreadId    func(key []byte, itr *Iterator, threadId int) (*pb.KVList, error)\n\tFinishThread             func(threadId int) (*pb.KVList, error)\n\n\t// This is the method where Stream sends the final output. All calls to Send are done by a\n\t// single goroutine, i.e. logic within Send method can expect single-threaded execution.\n\tSend func(buf *z.Buffer) error\n\n\t// Read data above the sinceTs. All keys with version <= sinceTs will be ignored.\n\tSinceTs      uint64\n\treadTs       uint64\n\tdb           *DB\n\trangeCh      chan keyRange\n\tkvChan       chan *z.Buffer\n\tnextStreamId atomic.Uint32\n\tdoneMarkers  bool\n\tscanned      atomic.Uint64 // used to estimate the ETA for data scan.\n\tnumProducers atomic.Int32\n}\n\n// SendDoneMarkers when true would send out done markers on the stream. False by default.\nfunc (st *Stream) SendDoneMarkers(done bool) {\n\tst.doneMarkers = done\n}\n\n// ToList is a default implementation of KeyToList. 
It picks up all valid versions of the key,\n// skipping over deleted or expired keys.\nfunc (st *Stream) ToList(key []byte, itr *Iterator) (*pb.KVList, error) {\n\ta := itr.Alloc\n\tka := a.Copy(key)\n\n\tlist := &pb.KVList{}\n\tfor ; itr.Valid(); itr.Next() {\n\t\titem := itr.Item()\n\t\tif item.IsDeletedOrExpired() {\n\t\t\tbreak\n\t\t}\n\t\tif !bytes.Equal(key, item.Key()) {\n\t\t\t// Break out on the first encounter with another key.\n\t\t\tbreak\n\t\t}\n\n\t\tkv := y.NewKV(a)\n\t\tkv.Key = ka\n\n\t\tif err := item.Value(func(val []byte) error {\n\t\t\tkv.Value = a.Copy(val)\n\t\t\treturn nil\n\n\t\t}); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tkv.Version = item.Version()\n\t\tkv.ExpiresAt = item.ExpiresAt()\n\t\tkv.UserMeta = a.Copy([]byte{item.UserMeta()})\n\n\t\tlist.Kv = append(list.Kv, kv)\n\t\tif st.db.opt.NumVersionsToKeep == 1 {\n\t\t\tbreak\n\t\t}\n\n\t\tif item.DiscardEarlierVersions() {\n\t\t\tbreak\n\t\t}\n\t}\n\treturn list, nil\n}\n\n// keyRange is [start, end), including start, excluding end. 
Do ensure that the start,\n// end byte slices are owned by keyRange struct.\nfunc (st *Stream) produceRanges(ctx context.Context) {\n\tranges := st.db.Ranges(st.Prefix, st.NumGo)\n\ty.AssertTrue(len(ranges) > 0)\n\ty.AssertTrue(ranges[0].left == nil)\n\ty.AssertTrue(ranges[len(ranges)-1].right == nil)\n\tst.db.opt.Infof(\"Number of ranges found: %d\\n\", len(ranges))\n\n\t// Sort in descending order of size.\n\tsort.Slice(ranges, func(i, j int) bool {\n\t\treturn ranges[i].size > ranges[j].size\n\t})\n\tfor i, r := range ranges {\n\t\tst.rangeCh <- *r\n\t\tst.db.opt.Infof(\"Sent range %d for iteration: [%x, %x) of size: %s\\n\",\n\t\t\ti, r.left, r.right, humanize.IBytes(uint64(r.size)))\n\t}\n\tclose(st.rangeCh)\n}\n\n// produceKVs picks up ranges from rangeCh, generates KV lists and sends them to kvChan.\nfunc (st *Stream) produceKVs(ctx context.Context, threadId int) error {\n\tst.numProducers.Add(1)\n\tdefer st.numProducers.Add(-1)\n\n\tvar txn *Txn\n\tif st.readTs > 0 {\n\t\ttxn = st.db.NewTransactionAt(st.readTs, false)\n\t} else {\n\t\ttxn = st.db.NewTransaction(false)\n\t}\n\tdefer txn.Discard()\n\n\t// produceKVs is running iterate serially. So, we can define the outList here.\n\toutList := z.NewBuffer(2*batchSize, \"Stream.ProduceKVs\")\n\tdefer func() {\n\t\t// The outList variable changes. So, we need to evaluate the variable in the defer. 
DO NOT\n\t\t// call `defer outList.Release()`.\n\t\t_ = outList.Release()\n\t}()\n\n\titerate := func(kr keyRange) error {\n\t\titerOpts := DefaultIteratorOptions\n\t\titerOpts.AllVersions = true\n\t\titerOpts.Prefix = st.Prefix\n\t\titerOpts.PrefetchValues = true\n\t\titerOpts.SinceTs = st.SinceTs\n\t\titr := txn.NewIterator(iterOpts)\n\t\titr.ThreadId = threadId\n\t\tdefer itr.Close()\n\n\t\titr.Alloc = z.NewAllocator(1<<20, \"Stream.Iterate\")\n\t\tdefer itr.Alloc.Release()\n\n\t\t// This unique stream id is used to identify all the keys from this iteration.\n\t\tstreamId := st.nextStreamId.Add(1)\n\t\tvar scanned int\n\n\t\tsendIt := func() error {\n\t\t\tselect {\n\t\t\tcase st.kvChan <- outList:\n\t\t\t\toutList = z.NewBuffer(2*batchSize, \"Stream.ProduceKVs\")\n\t\t\t\tst.scanned.Add(uint64(itr.scanned - scanned))\n\t\t\t\tscanned = itr.scanned\n\t\t\tcase <-ctx.Done():\n\t\t\t\treturn ctx.Err()\n\t\t\t}\n\t\t\treturn nil\n\t\t}\n\n\t\tvar prevKey []byte\n\t\tfor itr.Seek(kr.left); itr.Valid(); {\n\t\t\t// it.Valid would only return true for keys with the provided Prefix in iterOpts.\n\t\t\titem := itr.Item()\n\t\t\tif bytes.Equal(item.Key(), prevKey) {\n\t\t\t\titr.Next()\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tprevKey = append(prevKey[:0], item.Key()...)\n\n\t\t\t// Check if we reached the end of the key range.\n\t\t\tif len(kr.right) > 0 && bytes.Compare(item.Key(), kr.right) >= 0 {\n\t\t\t\tbreak\n\t\t\t}\n\n\t\t\t// Check if we should pick this key.\n\t\t\tif st.ChooseKey != nil && !st.ChooseKey(item) {\n\t\t\t\tcontinue\n\t\t\t}\n\n\t\t\t// Now convert to key value.\n\t\t\titr.Alloc.Reset()\n\t\t\tvar list *pb.KVList\n\t\t\tvar err error\n\t\t\tif st.UseKeyToListWithThreadId {\n\t\t\t\tlist, err = st.KeyToListWithThreadId(item.KeyCopy(nil), itr, threadId)\n\t\t\t} else {\n\t\t\t\tlist, err = st.KeyToList(item.KeyCopy(nil), itr)\n\t\t\t}\n\t\t\tif err != nil {\n\t\t\t\tst.db.opt.Warningf(\"While reading key: %x, got error: %v\", item.Key(), 
err)\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tif list == nil || len(list.Kv) == 0 {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tfor _, kv := range list.Kv {\n\t\t\t\tkv.StreamId = streamId\n\t\t\t\tKVToBuffer(kv, outList)\n\t\t\t\tif outList.LenNoPadding() < batchSize {\n\t\t\t\t\tcontinue\n\t\t\t\t}\n\t\t\t\tif err := sendIt(); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\n\t\tif st.UseKeyToListWithThreadId {\n\t\t\tif kvs, err := st.FinishThread(threadId); err != nil {\n\t\t\t\treturn err\n\t\t\t} else {\n\t\t\t\tfor _, kv := range kvs.Kv {\n\t\t\t\t\tkv.StreamId = streamId\n\t\t\t\t\tKVToBuffer(kv, outList)\n\t\t\t\t\tif outList.LenNoPadding() < batchSize {\n\t\t\t\t\t\tcontinue\n\t\t\t\t\t}\n\t\t\t\t\tif err := sendIt(); err != nil {\n\t\t\t\t\t\treturn err\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\t// Mark the stream as done.\n\t\tif st.doneMarkers {\n\t\t\tkv := &pb.KV{\n\t\t\t\tStreamId:   streamId,\n\t\t\t\tStreamDone: true,\n\t\t\t}\n\t\t\tKVToBuffer(kv, outList)\n\t\t}\n\t\treturn sendIt()\n\t}\n\n\tfor {\n\t\tselect {\n\t\tcase kr, ok := <-st.rangeCh:\n\t\t\tif !ok {\n\t\t\t\t// Done with the keys.\n\t\t\t\treturn nil\n\t\t\t}\n\t\t\tif err := iterate(kr); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\tcase <-ctx.Done():\n\t\t\treturn ctx.Err()\n\t\t}\n\t}\n}\n\nfunc (st *Stream) streamKVs(ctx context.Context) error {\n\tonDiskSize, uncompressedSize := st.db.EstimateSize(st.Prefix)\n\t// Manish has seen uncompressed size to be in 20% error margin.\n\tuncompressedSize = uint64(float64(uncompressedSize) * 1.2)\n\tst.db.opt.Infof(\"%s Streaming about %s of uncompressed data (%s on disk)\\n\",\n\t\tst.LogPrefix, humanize.IBytes(uncompressedSize), humanize.IBytes(onDiskSize))\n\n\ttickerDur := 5 * time.Second\n\tvar bytesSent uint64\n\tt := time.NewTicker(tickerDur)\n\tdefer t.Stop()\n\tnow := time.Now()\n\n\tsendBatch := func(batch *z.Buffer) error {\n\t\tdefer func() { _ = batch.Release() }()\n\t\tsz := uint64(batch.LenNoPadding())\n\t\tif sz == 0 
{\n\t\t\treturn nil\n\t\t}\n\t\tbytesSent += sz\n\t\t// st.db.opt.Infof(\"%s Sending batch of size: %s.\\n\", st.LogPrefix, humanize.IBytes(sz))\n\t\tif err := st.Send(batch); err != nil {\n\t\t\tst.db.opt.Warningf(\"Error while sending: %v\\n\", err)\n\t\t\treturn err\n\t\t}\n\t\treturn nil\n\t}\n\n\tslurp := func(batch *z.Buffer) error {\n\tloop:\n\t\tfor {\n\t\t\t// Send the batch immediately if it already exceeds the maximum allowed size.\n\t\t\t// If the size of the batch exceeds maxStreamSize, break from the loop to\n\t\t\t// avoid creating a batch that is so big that certain limits are reached.\n\t\t\tif uint64(batch.LenNoPadding()) > st.MaxSize {\n\t\t\t\tbreak loop\n\t\t\t}\n\t\t\tselect {\n\t\t\tcase kvs, ok := <-st.kvChan:\n\t\t\t\tif !ok {\n\t\t\t\t\tbreak loop\n\t\t\t\t}\n\t\t\t\ty.AssertTrue(kvs != nil)\n\t\t\t\ty.Check2(batch.Write(kvs.Bytes()))\n\t\t\t\ty.Check(kvs.Release())\n\n\t\t\tdefault:\n\t\t\t\tbreak loop\n\t\t\t}\n\t\t}\n\t\treturn sendBatch(batch)\n\t} // end of slurp.\n\n\twriteRate := y.NewRateMonitor(20)\n\tscanRate := y.NewRateMonitor(20)\nouter:\n\tfor {\n\t\tvar batch *z.Buffer\n\t\tselect {\n\t\tcase <-ctx.Done():\n\t\t\treturn ctx.Err()\n\n\t\tcase <-t.C:\n\t\t\t// Instead of calculating speed over the entire lifetime, we average the speed over\n\t\t\t// ticker duration.\n\t\t\twriteRate.Capture(bytesSent)\n\t\t\tscanned := st.scanned.Load()\n\t\t\tscanRate.Capture(scanned)\n\t\t\tnumProducers := st.numProducers.Load()\n\n\t\t\tst.db.opt.Infof(\"%s [%s] Scan (%d): ~%s/%s at %s/sec. 
Sent: %s at %s/sec.\"+\n\t\t\t\t\" jemalloc: %s\\n\",\n\t\t\t\tst.LogPrefix, y.FixedDuration(time.Since(now)), numProducers,\n\t\t\t\ty.IBytesToString(scanned, 1), humanize.IBytes(uncompressedSize),\n\t\t\t\thumanize.IBytes(scanRate.Rate()),\n\t\t\t\ty.IBytesToString(bytesSent, 1), humanize.IBytes(writeRate.Rate()),\n\t\t\t\thumanize.IBytes(uint64(z.NumAllocBytes())))\n\n\t\tcase kvs, ok := <-st.kvChan:\n\t\t\tif !ok {\n\t\t\t\tbreak outer\n\t\t\t}\n\t\t\ty.AssertTrue(kvs != nil)\n\t\t\tbatch = kvs\n\n\t\t\t// Otherwise, slurp more keys into this batch.\n\t\t\tif err := slurp(batch); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\t}\n\n\tst.db.opt.Infof(\"%s Sent data of size %s\\n\", st.LogPrefix, humanize.IBytes(bytesSent))\n\treturn nil\n}\n\n// Orchestrate runs Stream. It picks up ranges from the SSTables, then runs NumGo number of\n// goroutines to iterate over these ranges and batch up KVs in lists. It concurrently runs a single\n// goroutine to pick these lists, batch them up further and send to Output.Send. Orchestrate also\n// spits logs out to Infof, using provided LogPrefix. Note that all calls to Output.Send\n// are serial. In case any of these steps encounter an error, Orchestrate would stop execution and\n// return that error. Orchestrate can be called multiple times, but in serial order.\nfunc (st *Stream) Orchestrate(ctx context.Context) error {\n\tctx, cancel := context.WithCancel(ctx)\n\tdefer cancel()\n\tst.rangeCh = make(chan keyRange, 3) // Contains keys for posting lists.\n\n\t// kvChan should only have a small capacity to ensure that we don't buffer up too much data if\n\t// sending is slow. Page size is set to 4MB, which is used to lazily cap the size of each\n\t// KVList. 
To get 128MB buffer, we can set the channel size to 32.\n\tst.kvChan = make(chan *z.Buffer, 32)\n\n\tif st.KeyToList == nil {\n\t\tst.KeyToList = st.ToList\n\t}\n\n\t// Picks up ranges from Badger, and sends them to rangeCh.\n\tgo st.produceRanges(ctx)\n\n\terrCh := make(chan error, st.NumGo) // Stores errors from produceKVs.\n\tvar wg sync.WaitGroup\n\tfor i := 0; i < st.NumGo; i++ {\n\t\twg.Add(1)\n\n\t\tgo func(threadId int) {\n\t\t\tdefer wg.Done()\n\t\t\t// Picks up ranges from rangeCh, generates KV lists, and sends them to kvChan.\n\t\t\tif err := st.produceKVs(ctx, threadId); err != nil {\n\t\t\t\tselect {\n\t\t\t\tcase errCh <- err:\n\t\t\t\tdefault:\n\t\t\t\t}\n\t\t\t}\n\t\t}(i)\n\t}\n\n\t// Pick up key-values from kvChan and send to stream.\n\tkvErr := make(chan error, 1)\n\tgo func() {\n\t\t// Picks up KV lists from kvChan, and sends them to Output.\n\t\terr := st.streamKVs(ctx)\n\t\tif err != nil {\n\t\t\tcancel() // Stop all the goroutines.\n\t\t}\n\t\tkvErr <- err\n\t}()\n\twg.Wait()        // Wait for produceKVs to be over.\n\tclose(st.kvChan) // Now we can close kvChan.\n\tdefer func() {\n\t\t// If due to some error, we have buffers left in kvChan, we should release them.\n\t\tfor buf := range st.kvChan {\n\t\t\t_ = buf.Release()\n\t\t}\n\t}()\n\n\tselect {\n\tcase err := <-errCh: // Check error from produceKVs.\n\t\treturn err\n\tdefault:\n\t}\n\n\t// Wait for key streaming to be over.\n\terr := <-kvErr\n\treturn err\n}\n\nfunc (db *DB) newStream() *Stream {\n\treturn &Stream{\n\t\tdb:        db,\n\t\tNumGo:     db.opt.NumGoroutines,\n\t\tLogPrefix: \"Badger.Stream\",\n\t\tMaxSize:   maxStreamSize,\n\t}\n}\n\n// NewStream creates a new Stream.\nfunc (db *DB) NewStream() *Stream {\n\tif db.opt.managedTxns {\n\t\tpanic(\"This API can not be called in managed mode.\")\n\t}\n\treturn db.newStream()\n}\n\n// NewStreamAt creates a new Stream at a particular timestamp. 
Should only be used with managed DB.\nfunc (db *DB) NewStreamAt(readTs uint64) *Stream {\n\tif !db.opt.managedTxns {\n\t\tpanic(\"This API can only be called in managed mode.\")\n\t}\n\tstream := db.newStream()\n\tstream.readTs = readTs\n\treturn stream\n}\n\nfunc BufferToKVList(buf *z.Buffer) (*pb.KVList, error) {\n\tvar list pb.KVList\n\terr := buf.SliceIterate(func(s []byte) error {\n\t\tkv := new(pb.KV)\n\t\tif err := proto.Unmarshal(s, kv); err != nil {\n\t\t\treturn err\n\t\t}\n\t\tlist.Kv = append(list.Kv, kv)\n\t\treturn nil\n\t})\n\treturn &list, err\n}\n\nfunc KVToBuffer(kv *pb.KV, buf *z.Buffer) {\n\tin := buf.SliceAllocate(proto.Size(kv))[:0]\n\t_, err := proto.MarshalOptions{}.MarshalAppend(in, kv)\n\ty.AssertTrue(err == nil)\n}\n"
  },
  {
    "path": "stream_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"context\"\n\t\"fmt\"\n\t\"math\"\n\t\"os\"\n\t\"strconv\"\n\t\"strings\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n\t\"google.golang.org/protobuf/proto\"\n\n\tbpb \"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nfunc keyWithPrefix(prefix string, k int) []byte {\n\treturn []byte(fmt.Sprintf(\"%s-%d\", prefix, k))\n}\n\nfunc keyToInt(k []byte) (string, int) {\n\tsplits := strings.Split(string(k), \"-\")\n\tkey, err := strconv.Atoi(splits[1])\n\ty.Check(err)\n\treturn splits[0], key\n}\n\nfunc value(k int) []byte {\n\treturn []byte(fmt.Sprintf(\"%08d\", k))\n}\n\ntype collector struct {\n\tkv []*bpb.KV\n}\n\nfunc (c *collector) Send(buf *z.Buffer) error {\n\tlist, err := BufferToKVList(buf)\n\tif err != nil {\n\t\treturn err\n\t}\n\tfor _, kv := range list.Kv {\n\t\tif kv.StreamDone == true {\n\t\t\treturn nil\n\t\t}\n\t\tcp := proto.Clone(kv).(*bpb.KV)\n\t\tc.kv = append(c.kv, cp)\n\t}\n\treturn err\n}\n\nvar ctxb = context.Background()\n\nfunc TestStream(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := OpenManaged(DefaultOptions(dir))\n\trequire.NoError(t, err)\n\n\tvar count int\n\tfor _, prefix := range []string{\"p0\", \"p1\", \"p2\"} {\n\t\ttxn := db.NewTransactionAt(math.MaxUint64, true)\n\t\tfor i := 1; i <= 100; i++ {\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(keyWithPrefix(prefix, i), value(i))))\n\t\t\tcount++\n\t\t}\n\t\trequire.NoError(t, txn.CommitAt(5, nil))\n\t}\n\n\tstream := db.NewStreamAt(math.MaxUint64)\n\tstream.LogPrefix = \"Testing\"\n\tc := &collector{}\n\tstream.Send = c.Send\n\n\t// Test case 1. 
Retrieve everything.\n\terr = stream.Orchestrate(ctxb)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 300, len(c.kv), \"Expected 300. Got: %d\", len(c.kv))\n\n\tm := make(map[string]int)\n\tfor _, kv := range c.kv {\n\t\tprefix, ki := keyToInt(kv.Key)\n\t\texpected := value(ki)\n\t\trequire.Equal(t, expected, kv.Value)\n\t\tm[prefix]++\n\t}\n\trequire.Equal(t, 3, len(m))\n\tfor pred, count := range m {\n\t\trequire.Equal(t, 100, count, \"Count mismatch for pred: %s\", pred)\n\t}\n\n\t// Test case 2. Retrieve only 1 predicate.\n\tstream.Prefix = []byte(\"p1\")\n\tc.kv = c.kv[:0]\n\terr = stream.Orchestrate(ctxb)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 100, len(c.kv), \"Expected 100. Got: %d\", len(c.kv))\n\n\tm = make(map[string]int)\n\tfor _, kv := range c.kv {\n\t\tprefix, ki := keyToInt(kv.Key)\n\t\texpected := value(ki)\n\t\trequire.Equal(t, expected, kv.Value)\n\t\tm[prefix]++\n\t}\n\trequire.Equal(t, 1, len(m))\n\tfor pred, count := range m {\n\t\trequire.Equal(t, 100, count, \"Count mismatch for pred: %s\", pred)\n\t}\n\n\t// Test case 3. Retrieve select keys within the predicate.\n\tc.kv = c.kv[:0]\n\tstream.ChooseKey = func(item *Item) bool {\n\t\t_, k := keyToInt(item.Key())\n\t\treturn k%2 == 0\n\t}\n\terr = stream.Orchestrate(ctxb)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 50, len(c.kv), \"Expected 50. Got: %d\", len(c.kv))\n\n\tm = make(map[string]int)\n\tfor _, kv := range c.kv {\n\t\tprefix, ki := keyToInt(kv.Key)\n\t\texpected := value(ki)\n\t\trequire.Equal(t, expected, kv.Value)\n\t\tm[prefix]++\n\t}\n\trequire.Equal(t, 1, len(m))\n\tfor pred, count := range m {\n\t\trequire.Equal(t, 50, count, \"Count mismatch for pred: %s\", pred)\n\t}\n\n\t// Test case 4. Retrieve select keys from all predicates.\n\tc.kv = c.kv[:0]\n\tstream.Prefix = []byte{}\n\terr = stream.Orchestrate(ctxb)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 150, len(c.kv), \"Expected 150. 
Got: %d\", len(c.kv))\n\n\tm = make(map[string]int)\n\tfor _, kv := range c.kv {\n\t\tprefix, ki := keyToInt(kv.Key)\n\t\texpected := value(ki)\n\t\trequire.Equal(t, expected, kv.Value)\n\t\tm[prefix]++\n\t}\n\trequire.Equal(t, 3, len(m))\n\tfor pred, count := range m {\n\t\trequire.Equal(t, 50, count, \"Count mismatch for pred: %s\", pred)\n\t}\n\trequire.NoError(t, db.Close())\n}\n\nfunc TestStreamMaxSize(t *testing.T) {\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\t// Set the maxStreamSize to 1MB for the duration of the test so that the it can use a smaller\n\t// dataset than it would otherwise need.\n\toriginalMaxStreamSize := maxStreamSize\n\tmaxStreamSize = 1 << 20\n\tdefer func() {\n\t\tmaxStreamSize = originalMaxStreamSize\n\t}()\n\n\ttestSize := int(1e6)\n\tdir, err := os.MkdirTemp(\"\", \"badger-big-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := OpenManaged(DefaultOptions(dir))\n\trequire.NoError(t, err)\n\n\tvar count int\n\twb := db.NewWriteBatchAt(5)\n\tfor _, prefix := range []string{\"p0\", \"p1\", \"p2\"} {\n\t\tfor i := 1; i <= testSize; i++ {\n\t\t\trequire.NoError(t, wb.SetEntry(NewEntry(keyWithPrefix(prefix, i), value(i))))\n\t\t\tcount++\n\t\t}\n\t}\n\trequire.NoError(t, wb.Flush())\n\n\tstream := db.NewStreamAt(math.MaxUint64)\n\tstream.LogPrefix = \"Testing\"\n\tc := &collector{}\n\tstream.Send = c.Send\n\n\t// default value\n\trequire.Equal(t, stream.MaxSize, maxStreamSize)\n\n\t// reset maxsize\n\tstream.MaxSize = 1024 * 1024 * 50\n\n\t// Test case 1. Retrieve everything.\n\terr = stream.Orchestrate(ctxb)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 3*testSize, len(c.kv), \"Expected 30000. 
Got: %d\", len(c.kv))\n\n\tm := make(map[string]int)\n\tfor _, kv := range c.kv {\n\t\tprefix, ki := keyToInt(kv.Key)\n\t\texpected := value(ki)\n\t\trequire.Equal(t, expected, kv.Value)\n\t\tm[prefix]++\n\t}\n\trequire.Equal(t, 3, len(m))\n\tfor pred, count := range m {\n\t\trequire.Equal(t, testSize, count, \"Count mismatch for pred: %s\", pred)\n\t}\n\trequire.NoError(t, db.Close())\n}\n\nfunc TestStreamWithThreadId(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := OpenManaged(DefaultOptions(dir))\n\trequire.NoError(t, err)\n\n\tvar count int\n\tfor _, prefix := range []string{\"p0\", \"p1\", \"p2\"} {\n\t\ttxn := db.NewTransactionAt(math.MaxUint64, true)\n\t\tfor i := 1; i <= 100; i++ {\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(keyWithPrefix(prefix, i), value(i))))\n\t\t\tcount++\n\t\t}\n\t\trequire.NoError(t, txn.CommitAt(5, nil))\n\t}\n\n\tstream := db.NewStreamAt(math.MaxUint64)\n\tstream.LogPrefix = \"Testing\"\n\tstream.KeyToList = func(key []byte, itr *Iterator) (\n\t\t*bpb.KVList, error) {\n\t\trequire.Less(t, itr.ThreadId, stream.NumGo)\n\t\treturn stream.ToList(key, itr)\n\t}\n\tc := &collector{}\n\tstream.Send = c.Send\n\n\terr = stream.Orchestrate(ctxb)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 300, len(c.kv), \"Expected 300. 
Got: %d\", len(c.kv))\n\n\tm := make(map[string]int)\n\tfor _, kv := range c.kv {\n\t\tprefix, ki := keyToInt(kv.Key)\n\t\texpected := value(ki)\n\t\trequire.Equal(t, expected, kv.Value)\n\t\tm[prefix]++\n\t}\n\trequire.Equal(t, 3, len(m))\n\tfor pred, count := range m {\n\t\trequire.Equal(t, 100, count, \"Count mismatch for pred: %s\", pred)\n\t}\n\trequire.NoError(t, db.Close())\n}\n\nfunc TestBigStream(t *testing.T) {\n\tif !*manual {\n\t\tt.Skip(\"Skipping test meant to be run manually.\")\n\t\treturn\n\t}\n\t// Set the maxStreamSize to 1MB for the duration of the test so that the it can use a smaller\n\t// dataset than it would otherwise need.\n\toriginalMaxStreamSize := maxStreamSize\n\tmaxStreamSize = 1 << 20\n\tdefer func() {\n\t\tmaxStreamSize = originalMaxStreamSize\n\t}()\n\n\ttestSize := int(1e6)\n\tdir, err := os.MkdirTemp(\"\", \"badger-big-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := OpenManaged(DefaultOptions(dir))\n\trequire.NoError(t, err)\n\n\tvar count int\n\twb := db.NewWriteBatchAt(5)\n\tfor _, prefix := range []string{\"p0\", \"p1\", \"p2\"} {\n\t\tfor i := 1; i <= testSize; i++ {\n\t\t\trequire.NoError(t, wb.SetEntry(NewEntry(keyWithPrefix(prefix, i), value(i))))\n\t\t\tcount++\n\t\t}\n\t}\n\trequire.NoError(t, wb.Flush())\n\n\tstream := db.NewStreamAt(math.MaxUint64)\n\tstream.LogPrefix = \"Testing\"\n\tc := &collector{}\n\tstream.Send = c.Send\n\n\t// Test case 1. Retrieve everything.\n\terr = stream.Orchestrate(ctxb)\n\trequire.NoError(t, err)\n\trequire.Equal(t, 3*testSize, len(c.kv), \"Expected 30000. 
Got: %d\", len(c.kv))\n\n\tm := make(map[string]int)\n\tfor _, kv := range c.kv {\n\t\tprefix, ki := keyToInt(kv.Key)\n\t\texpected := value(ki)\n\t\trequire.Equal(t, expected, kv.Value)\n\t\tm[prefix]++\n\t}\n\trequire.Equal(t, 3, len(m))\n\tfor pred, count := range m {\n\t\trequire.Equal(t, testSize, count, \"Count mismatch for pred: %s\", pred)\n\t}\n\trequire.NoError(t, db.Close())\n}\n\n// There was a bug in the stream writer code which would cause allocators to be\n// freed up twice if the default keyToList was not used. This test verifies that issue.\nfunc TestStreamCustomKeyToList(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tdb, err := OpenManaged(DefaultOptions(dir))\n\trequire.NoError(t, err)\n\n\tvar count int\n\tfor _, key := range []string{\"p0\", \"p1\", \"p2\"} {\n\t\tfor i := 1; i <= 100; i++ {\n\t\t\ttxn := db.NewTransactionAt(math.MaxUint64, true)\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(key), value(i))))\n\t\t\tcount++\n\t\t\trequire.NoError(t, txn.CommitAt(uint64(i), nil))\n\t\t}\n\t}\n\n\tstream := db.NewStreamAt(math.MaxUint64)\n\tstream.LogPrefix = \"Testing\"\n\tstream.KeyToList = func(key []byte, itr *Iterator) (*bpb.KVList, error) {\n\t\titem := itr.Item()\n\t\tval, err := item.ValueCopy(nil)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tkv := &bpb.KV{\n\t\t\tKey:   y.Copy(item.Key()),\n\t\t\tValue: val,\n\t\t}\n\t\treturn &bpb.KVList{\n\t\t\tKv: []*bpb.KV{kv},\n\t\t}, nil\n\t}\n\tres := map[string]struct{}{\"p0\": {}, \"p1\": {}, \"p2\": {}}\n\tstream.Send = func(buf *z.Buffer) error {\n\t\tlist, err := BufferToKVList(buf)\n\t\trequire.NoError(t, err)\n\t\tfor _, kv := range list.Kv {\n\t\t\tkey := string(kv.Key)\n\t\t\tif _, ok := res[key]; !ok {\n\t\t\t\tpanic(fmt.Sprintf(\"%s key not found\", key))\n\t\t\t}\n\t\t\tdelete(res, key)\n\t\t}\n\t\treturn nil\n\t}\n\trequire.NoError(t, stream.Orchestrate(ctxb))\n\trequire.Zero(t, 
len(res))\n}\n"
  },
  {
    "path": "stream_writer.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"encoding/hex\"\n\t\"fmt\"\n\t\"sync\"\n\n\t\"github.com/dustin/go-humanize\"\n\t\"google.golang.org/protobuf/proto\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\n// StreamWriter is used to write data coming from multiple streams. The streams must not have any\n// overlapping key ranges. Within each stream, the keys must be sorted. Badger Stream framework is\n// capable of generating such an output. So, this StreamWriter can be used at the other end to build\n// BadgerDB at a much faster pace by writing SSTables (and value logs) directly to LSM tree levels\n// without causing any compactions at all. This is way faster than using batched writer or using\n// transactions, but only applicable in situations where the keys are pre-sorted and the DB is being\n// bootstrapped. Existing data would get deleted when using this writer. So, this is only useful\n// when restoring from backup or replicating DB across servers.\n//\n// StreamWriter should not be called on in-use DB instances. It is designed only to bootstrap new\n// DBs.\ntype StreamWriter struct {\n\twriteLock  sync.Mutex\n\tdb         *DB\n\tdone       func()\n\tthrottle   *y.Throttle\n\tmaxVersion uint64\n\twriters    map[uint32]*sortedWriter\n\tprevLevel  int\n}\n\n// NewStreamWriter creates a StreamWriter. Right after creating StreamWriter, Prepare must be\n// called. The memory usage of a StreamWriter is directly proportional to the number of streams\n// possible. So, efforts must be made to keep the number of streams low. 
Stream framework would\n// typically use 16 goroutines and hence create 16 streams.\nfunc (db *DB) NewStreamWriter() *StreamWriter {\n\treturn &StreamWriter{\n\t\tdb: db,\n\t\t// throttle shouldn't make much difference. Memory consumption is based on the number of\n\t\t// concurrent streams being processed.\n\t\tthrottle: y.NewThrottle(16),\n\t\twriters:  make(map[uint32]*sortedWriter),\n\t}\n}\n\n// Prepare should be called before writing any entry to StreamWriter. It deletes all data present in\n// existing DB, stops compactions and any writes being done by other means. Be very careful when\n// calling Prepare, because it could result in permanent data loss. Not calling Prepare would result\n// in a corrupt Badger instance. Use PrepareIncremental to do incremental stream write.\nfunc (sw *StreamWriter) Prepare() error {\n\tsw.writeLock.Lock()\n\tdefer sw.writeLock.Unlock()\n\n\tdone, err := sw.db.dropAll()\n\t// Ensure that done() is never called more than once.\n\tvar once sync.Once\n\tsw.done = func() { once.Do(done) }\n\treturn err\n}\n\n// PrepareIncremental should be called before writing any entry to StreamWriter incrementally.\n// In incremental stream write, the tables are written at one level above the current base level.\nfunc (sw *StreamWriter) PrepareIncremental() error {\n\tsw.writeLock.Lock()\n\tdefer sw.writeLock.Unlock()\n\n\t// Ensure that done() is never called more than once.\n\tvar once sync.Once\n\n\t// prepareToDrop will stop all the incoming writes and process any pending flush tasks.\n\t// Before we start writing, we'll stop the compactions because no one else should be writing to\n\t// the same level as the stream writer is writing to.\n\tf, err := sw.db.prepareToDrop()\n\tif err != nil {\n\t\tsw.done = func() { once.Do(f) }\n\t\treturn err\n\t}\n\tsw.db.stopCompactions()\n\tdone := func() {\n\t\tsw.db.startCompactions()\n\t\tf()\n\t}\n\tsw.done = func() { once.Do(done) }\n\n\tmts, decr := sw.db.getMemTables()\n\tdefer decr()\n\tfor _, m 
:= range mts {\n\t\tif !m.sl.Empty() {\n\t\t\treturn fmt.Errorf(\"Unable to do incremental writes because MemTable has data\")\n\t\t}\n\t}\n\n\tisEmptyDB := true\n\tfor _, level := range sw.db.Levels() {\n\t\tif level.NumTables > 0 {\n\t\t\tsw.prevLevel = level.Level\n\t\t\tisEmptyDB = false\n\t\t\tbreak\n\t\t}\n\t}\n\tif isEmptyDB {\n\t\t// If DB is empty, we should allow doing incremental stream write.\n\t\treturn nil\n\t}\n\tif sw.prevLevel == 0 {\n\t\t// It seems that data is present in all levels from Lmax to L0. If we call flatten\n\t\t// on the tree, all the data will go to Lmax. All the levels above will be empty\n\t\t// after flatten call. Now, we should be able to use incremental stream writer again.\n\t\tif err := sw.db.Flatten(3); err != nil {\n\t\t\treturn fmt.Errorf(\"error during flatten in StreamWriter: %w\", err)\n\t\t}\n\t\tsw.prevLevel = len(sw.db.Levels()) - 1\n\t}\n\treturn nil\n}\n\n// Write writes KVList to DB. Each KV within the list contains the stream id which StreamWriter\n// would use to demux the writes. Write is thread safe and can be called concurrently by multiple\n// goroutines.\nfunc (sw *StreamWriter) Write(buf *z.Buffer) error {\n\tif buf.LenNoPadding() == 0 {\n\t\treturn nil\n\t}\n\n\t// closedStreams keeps track of all streams which are going to be marked as done. 
We are\n\t// keeping track of all streams so that we can close them at the end, after inserting all\n\t// the valid kvs.\n\tclosedStreams := make(map[uint32]struct{})\n\tstreamReqs := make(map[uint32]*request)\n\n\terr := buf.SliceIterate(func(s []byte) error {\n\t\tvar kv pb.KV\n\t\tif err := proto.Unmarshal(s, &kv); err != nil {\n\t\t\treturn err\n\t\t}\n\t\tif kv.StreamDone {\n\t\t\tclosedStreams[kv.StreamId] = struct{}{}\n\t\t\treturn nil\n\t\t}\n\n\t\t// Panic if some kv comes after stream has been marked as closed.\n\t\tif _, ok := closedStreams[kv.StreamId]; ok {\n\t\t\tpanic(fmt.Sprintf(\"write performed on closed stream: %d\", kv.StreamId))\n\t\t}\n\n\t\tsw.writeLock.Lock()\n\t\tif sw.maxVersion < kv.Version {\n\t\t\tsw.maxVersion = kv.Version\n\t\t}\n\t\tif sw.prevLevel == 0 {\n\t\t\t// If prevLevel is 0, that means that we have not written anything yet.\n\t\t\t// So, we can write to the maxLevel. newWriter writes to prevLevel - 1,\n\t\t\t// so we can set prevLevel to len(levels).\n\t\t\tsw.prevLevel = len(sw.db.lc.levels)\n\t\t}\n\t\tsw.writeLock.Unlock()\n\n\t\tvar meta, userMeta byte\n\t\tif len(kv.Meta) > 0 {\n\t\t\tmeta = kv.Meta[0]\n\t\t}\n\t\tif len(kv.UserMeta) > 0 {\n\t\t\tuserMeta = kv.UserMeta[0]\n\t\t}\n\t\te := &Entry{\n\t\t\tKey:       y.KeyWithTs(kv.Key, kv.Version),\n\t\t\tValue:     y.Copy(kv.Value),\n\t\t\tUserMeta:  userMeta,\n\t\t\tExpiresAt: kv.ExpiresAt,\n\t\t\tmeta:      meta,\n\t\t}\n\t\t// If the value can be collocated with the key in LSM tree, we can skip\n\t\t// writing the value to value log.\n\t\treq := streamReqs[kv.StreamId]\n\t\tif req == nil {\n\t\t\treq = &request{}\n\t\t\tstreamReqs[kv.StreamId] = req\n\t\t}\n\t\treq.Entries = append(req.Entries, e)\n\t\treturn nil\n\t})\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tall := make([]*request, 0, len(streamReqs))\n\tfor _, req := range streamReqs {\n\t\tall = append(all, req)\n\t}\n\n\tsw.writeLock.Lock()\n\tdefer sw.writeLock.Unlock()\n\n\t// We are writing all requests to 
vlog even if some request belongs to already closed stream.\n\t// It is safe to do because we are panicking while writing to sorted writer, which will be nil\n\t// for closed stream. At restart, stream writer will drop all the data in Prepare function.\n\tif err := sw.db.vlog.write(all); err != nil {\n\t\treturn err\n\t}\n\n\tfor streamID, req := range streamReqs {\n\t\twriter, ok := sw.writers[streamID]\n\t\tif !ok {\n\t\t\tvar err error\n\t\t\twriter, err = sw.newWriter(streamID)\n\t\t\tif err != nil {\n\t\t\t\treturn y.Wrapf(err, \"failed to create writer with ID %d\", streamID)\n\t\t\t}\n\t\t\tsw.writers[streamID] = writer\n\t\t}\n\n\t\tif writer == nil {\n\t\t\tpanic(fmt.Sprintf(\"write performed on closed stream: %d\", streamID))\n\t\t}\n\n\t\twriter.reqCh <- req\n\t}\n\n\t// Now we can close any streams if required. We will make writer for\n\t// the closed streams as nil.\n\tfor streamId := range closedStreams {\n\t\twriter, ok := sw.writers[streamId]\n\t\tif !ok {\n\t\t\tsw.db.opt.Warningf(\"Trying to close stream: %d, but no sorted \"+\n\t\t\t\t\"writer found for it\", streamId)\n\t\t\tcontinue\n\t\t}\n\n\t\twriter.closer.SignalAndWait()\n\t\tif err := writer.Done(); err != nil {\n\t\t\treturn err\n\t\t}\n\n\t\tsw.writers[streamId] = nil\n\t}\n\treturn nil\n}\n\n// Flush is called once we are done writing all the entries. It syncs DB directories. 
It also\n// updates Oracle with maxVersion found in all entries (if DB is not managed).\nfunc (sw *StreamWriter) Flush() error {\n\tsw.writeLock.Lock()\n\tdefer sw.writeLock.Unlock()\n\n\tdefer sw.done()\n\n\tfor _, writer := range sw.writers {\n\t\tif writer != nil {\n\t\t\twriter.closer.SignalAndWait()\n\t\t}\n\t}\n\n\tfor _, writer := range sw.writers {\n\t\tif writer == nil {\n\t\t\tcontinue\n\t\t}\n\t\tif err := writer.Done(); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\n\tif !sw.db.opt.managedTxns {\n\t\tif sw.db.orc != nil {\n\t\t\tsw.db.orc.Stop()\n\t\t}\n\n\t\tif curMax := sw.db.orc.readTs(); curMax >= sw.maxVersion {\n\t\t\tsw.maxVersion = curMax\n\t\t}\n\n\t\tsw.db.orc = newOracle(sw.db.opt)\n\t\tsw.db.orc.nextTxnTs = sw.maxVersion\n\t\tsw.db.orc.txnMark.Done(sw.maxVersion)\n\t\tsw.db.orc.readMark.Done(sw.maxVersion)\n\t\tsw.db.orc.incrementNextTs()\n\t}\n\n\t// Wait for all files to be written.\n\tif err := sw.throttle.Finish(); err != nil {\n\t\treturn err\n\t}\n\n\t// Sort tables at the end.\n\tfor _, l := range sw.db.lc.levels {\n\t\tl.sortTables()\n\t}\n\n\t// Now sync the directories, so all the files are registered.\n\tif sw.db.opt.ValueDir != sw.db.opt.Dir {\n\t\tif err := sw.db.syncDir(sw.db.opt.ValueDir); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\tif err := sw.db.syncDir(sw.db.opt.Dir); err != nil {\n\t\treturn err\n\t}\n\treturn sw.db.lc.validate()\n}\n\n// Cancel signals all goroutines to exit. Calling defer sw.Cancel() immediately after creating a new StreamWriter\n// ensures that writes are unblocked even upon early return. 
Note that dropAll() is not called here, so any\n// partially written data will not be erased until a new StreamWriter is initialized.\nfunc (sw *StreamWriter) Cancel() {\n\tsw.writeLock.Lock()\n\tdefer sw.writeLock.Unlock()\n\n\tfor _, writer := range sw.writers {\n\t\tif writer != nil {\n\t\t\twriter.closer.Signal()\n\t\t}\n\t}\n\tfor _, writer := range sw.writers {\n\t\tif writer != nil {\n\t\t\twriter.closer.Wait()\n\t\t}\n\t}\n\n\tif err := sw.throttle.Finish(); err != nil {\n\t\tsw.db.opt.Errorf(\"error in throttle.Finish: %+v\", err)\n\t}\n\n\t// Handle Cancel() being called before Prepare().\n\tif sw.done != nil {\n\t\tsw.done()\n\t}\n}\n\ntype sortedWriter struct {\n\tdb       *DB\n\tthrottle *y.Throttle\n\topts     table.Options\n\n\tbuilder  *table.Builder\n\tlastKey  []byte\n\tlevel    int\n\tstreamID uint32\n\treqCh    chan *request\n\t// Have separate closer for each writer, as it can be closed at any time.\n\tcloser *z.Closer\n}\n\nfunc (sw *StreamWriter) newWriter(streamID uint32) (*sortedWriter, error) {\n\tbopts := buildTableOptions(sw.db)\n\tfor i := 2; i < sw.db.opt.MaxLevels; i++ {\n\t\tbopts.TableSize *= uint64(sw.db.opt.TableSizeMultiplier)\n\t}\n\tw := &sortedWriter{\n\t\tdb:       sw.db,\n\t\topts:     bopts,\n\t\tstreamID: streamID,\n\t\tthrottle: sw.throttle,\n\t\tbuilder:  table.NewTableBuilder(bopts),\n\t\treqCh:    make(chan *request, 3),\n\t\tcloser:   z.NewCloser(1),\n\t\tlevel:    sw.prevLevel - 1, // Write at the level just above the one we were writing to.\n\t}\n\n\tgo w.handleRequests()\n\treturn w, nil\n}\n\nfunc (w *sortedWriter) handleRequests() {\n\tdefer w.closer.Done()\n\n\tprocess := func(req *request) {\n\t\tfor i, e := range req.Entries {\n\t\t\t// If badger is running in InMemory mode, len(req.Ptrs) == 0.\n\t\t\tvar vs y.ValueStruct\n\t\t\tif e.skipVlogAndSetThreshold(w.db.valueThreshold()) {\n\t\t\t\tvs = y.ValueStruct{\n\t\t\t\t\tValue:     e.Value,\n\t\t\t\t\tMeta:      e.meta,\n\t\t\t\t\tUserMeta:  
e.UserMeta,\n\t\t\t\t\tExpiresAt: e.ExpiresAt,\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tvptr := req.Ptrs[i]\n\t\t\t\tvs = y.ValueStruct{\n\t\t\t\t\tValue:     vptr.Encode(),\n\t\t\t\t\tMeta:      e.meta | bitValuePointer,\n\t\t\t\t\tUserMeta:  e.UserMeta,\n\t\t\t\t\tExpiresAt: e.ExpiresAt,\n\t\t\t\t}\n\t\t\t}\n\t\t\tif err := w.Add(e.Key, vs); err != nil {\n\t\t\t\tpanic(err)\n\t\t\t}\n\t\t}\n\t}\n\n\tfor {\n\t\tselect {\n\t\tcase req := <-w.reqCh:\n\t\t\tprocess(req)\n\t\tcase <-w.closer.HasBeenClosed():\n\t\t\tclose(w.reqCh)\n\t\t\tfor req := range w.reqCh {\n\t\t\t\tprocess(req)\n\t\t\t}\n\t\t\treturn\n\t\t}\n\t}\n}\n\n// Add adds key and vs to sortedWriter.\nfunc (w *sortedWriter) Add(key []byte, vs y.ValueStruct) error {\n\tif len(w.lastKey) > 0 && y.CompareKeys(key, w.lastKey) <= 0 {\n\t\treturn fmt.Errorf(\"keys not in sorted order (last key: %s, key: %s)\",\n\t\t\thex.Dump(w.lastKey), hex.Dump(key))\n\t}\n\n\tsameKey := y.SameKey(key, w.lastKey)\n\n\t// Same keys should go into the same SSTable.\n\tif !sameKey && w.builder.ReachedCapacity() {\n\t\tif err := w.send(false); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\n\tw.lastKey = y.SafeCopy(w.lastKey, key)\n\tvar vp valuePointer\n\tif vs.Meta&bitValuePointer > 0 {\n\t\tvp.Decode(vs.Value)\n\t}\n\n\tw.builder.Add(key, vs, vp.Len)\n\treturn nil\n}\n\nfunc (w *sortedWriter) send(done bool) error {\n\tif err := w.throttle.Do(); err != nil {\n\t\treturn err\n\t}\n\tgo func(builder *table.Builder) {\n\t\terr := w.createTable(builder)\n\t\tw.throttle.Done(err)\n\t}(w.builder)\n\t// If done is true, this indicates we can close the writer.\n\t// No need to allocate underlying TableBuilder now.\n\tif done {\n\t\tw.builder = nil\n\t\treturn nil\n\t}\n\n\tw.builder = table.NewTableBuilder(w.opts)\n\treturn nil\n}\n\n// Done is called once we are done writing all keys and valueStructs\n// to sortedWriter. 
It completes writing current SST to disk.\nfunc (w *sortedWriter) Done() error {\n\tif w.builder.Empty() {\n\t\tw.builder.Close()\n\t\t// Assign builder as nil, so that underlying memory can be garbage collected.\n\t\tw.builder = nil\n\t\treturn nil\n\t}\n\n\treturn w.send(true)\n}\n\nfunc (w *sortedWriter) createTable(builder *table.Builder) error {\n\tdefer builder.Close()\n\tif builder.Empty() {\n\t\tbuilder.Finish()\n\t\treturn nil\n\t}\n\n\tfileID := w.db.lc.reserveFileID()\n\tvar tbl *table.Table\n\tif w.db.opt.InMemory {\n\t\tdata := builder.Finish()\n\t\tvar err error\n\t\tif tbl, err = table.OpenInMemoryTable(data, fileID, builder.Opts()); err != nil {\n\t\t\treturn err\n\t\t}\n\t} else {\n\t\tvar err error\n\t\tfname := table.NewFilename(fileID, w.db.opt.Dir)\n\t\tif tbl, err = table.CreateTable(fname, builder); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\tlc := w.db.lc\n\n\tlhandler := lc.levels[w.level]\n\t// Now that table can be opened successfully, let's add this to the MANIFEST.\n\tchange := &pb.ManifestChange{\n\t\tId:          tbl.ID(),\n\t\tKeyId:       tbl.KeyID(),\n\t\tOp:          pb.ManifestChange_CREATE,\n\t\tLevel:       uint32(lhandler.level),\n\t\tCompression: uint32(tbl.CompressionType()),\n\t}\n\tif err := w.db.manifest.addChanges([]*pb.ManifestChange{change}, w.db.opt); err != nil {\n\t\treturn err\n\t}\n\n\t// We are not calling lhandler.replaceTables() here, as it sorts tables on every addition.\n\t// We can sort all tables only once during Flush() call.\n\tlhandler.addTable(tbl)\n\n\t// Release the ref held by OpenTable.\n\t_ = tbl.DecrRef()\n\tw.db.opt.Infof(\"Table created: %d at level: %d for stream: %d. Size: %s\\n\",\n\t\tfileID, lhandler.level, w.streamID, humanize.IBytes(uint64(tbl.Size())))\n\treturn nil\n}\n"
  },
  {
    "path": "stream_writer_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"encoding/binary\"\n\t\"fmt\"\n\t\"math\"\n\t\"math/rand\"\n\t\"os\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nfunc getSortedKVList(valueSize, listSize int) *z.Buffer {\n\tvalue := make([]byte, valueSize)\n\ty.Check2(rand.Read(value))\n\tbuf := z.NewBuffer(10<<20, \"test\")\n\tfor i := 0; i < listSize; i++ {\n\t\tkey := make([]byte, 8)\n\t\tbinary.BigEndian.PutUint64(key, uint64(i))\n\t\tKVToBuffer(&pb.KV{\n\t\t\tKey:     key,\n\t\t\tValue:   value,\n\t\t\tVersion: 20,\n\t\t}, buf)\n\t}\n\n\treturn buf\n}\n\n// check if we can read values after writing using stream writer\nfunc TestStreamWriter1(t *testing.T) {\n\ttest := func(t *testing.T, opts *Options) {\n\t\trunBadgerTest(t, opts, func(t *testing.T, db *DB) {\n\t\t\t// write entries using stream writer\n\t\t\tnoOfKeys := 10\n\t\t\tvalueSize := 128\n\t\t\tlist := getSortedKVList(valueSize, noOfKeys)\n\t\t\tsw := db.NewStreamWriter()\n\t\t\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\t\t\trequire.NoError(t, sw.Write(list), \"sw.Write() failed\")\n\t\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\n\t\t\terr := db.View(func(txn *Txn) error {\n\t\t\t\t// read any random key from inserted keys\n\t\t\t\tkeybyte := make([]byte, 8)\n\t\t\t\tkeyNo := uint64(rand.Int63n(int64(noOfKeys)))\n\t\t\t\tbinary.BigEndian.PutUint64(keybyte, keyNo)\n\t\t\t\t_, err := txn.Get(keybyte)\n\t\t\t\trequire.Nil(t, err, \"key should be found\")\n\n\t\t\t\t// count all keys written using stream writer\n\t\t\t\tkeysCount := 0\n\t\t\t\titrOps := DefaultIteratorOptions\n\t\t\t\tit := txn.NewIterator(itrOps)\n\t\t\t\tdefer it.Close()\n\t\t\t\tfor it.Rewind(); it.Valid(); it.Next() 
{\n\t\t\t\t\tkeysCount++\n\t\t\t\t}\n\t\t\t\trequire.True(t, keysCount == noOfKeys, \"count of keys should be matched\")\n\t\t\t\treturn nil\n\t\t\t})\n\t\t\trequire.NoError(t, err, \"error while retrieving key\")\n\t\t})\n\t}\n\tt.Run(\"Normal mode\", func(t *testing.T) {\n\t\tnormalModeOpts := getTestOptions(\"\")\n\t\ttest(t, &normalModeOpts)\n\t})\n\tt.Run(\"Managed mode\", func(t *testing.T) {\n\t\tmanagedModeOpts := getTestOptions(\"\")\n\t\tmanagedModeOpts.managedTxns = true\n\t\ttest(t, &managedModeOpts)\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\tdiskLessModeOpts := getTestOptions(\"\")\n\t\tdiskLessModeOpts.InMemory = true\n\t\ttest(t, &diskLessModeOpts)\n\t})\n}\n\n// write more keys to db after writing keys using stream writer\nfunc TestStreamWriter2(t *testing.T) {\n\ttest := func(t *testing.T, opts *Options) {\n\t\trunBadgerTest(t, opts, func(t *testing.T, db *DB) {\n\t\t\t// write entries using stream writer\n\t\t\tnoOfKeys := 1000\n\t\t\tvalueSize := 128\n\t\t\tlist := getSortedKVList(valueSize, noOfKeys)\n\t\t\tsw := db.NewStreamWriter()\n\t\t\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\t\t\trequire.NoError(t, sw.Write(list), \"sw.Write() failed\")\n\t\t\t// get max version of sw, will be used in transactions for managed mode\n\t\t\tmaxVs := sw.maxVersion\n\t\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\n\t\t\t// delete all the inserted keys\n\t\t\tval := make([]byte, valueSize)\n\t\t\ty.Check2(rand.Read(val))\n\t\t\tfor i := 0; i < noOfKeys; i++ {\n\t\t\t\ttxn := db.newTransaction(true, opts.managedTxns)\n\t\t\t\tif opts.managedTxns {\n\t\t\t\t\ttxn.readTs = math.MaxUint64\n\t\t\t\t\ttxn.commitTs = maxVs\n\t\t\t\t}\n\t\t\t\tkeybyte := make([]byte, 8)\n\t\t\t\tkeyNo := uint64(i)\n\t\t\t\tbinary.BigEndian.PutUint64(keybyte, keyNo)\n\t\t\t\trequire.NoError(t, txn.Delete(keybyte), \"error while deleting keys\")\n\t\t\t\trequire.NoError(t, txn.Commit(), \"error while commit\")\n\t\t\t}\n\n\t\t\t// verify 
while iteration count of keys should be 0\n\t\t\terr := db.View(func(txn *Txn) error {\n\t\t\t\tkeysCount := 0\n\t\t\t\titrOps := DefaultIteratorOptions\n\t\t\t\tit := txn.NewIterator(itrOps)\n\t\t\t\tdefer it.Close()\n\t\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\t\tkeysCount++\n\t\t\t\t}\n\t\t\t\trequire.Zero(t, keysCount, \"count of keys should be 0\")\n\t\t\t\treturn nil\n\t\t\t})\n\n\t\t\trequire.Nil(t, err, \"error should be nil while iterating\")\n\t\t})\n\t}\n\tt.Run(\"Normal mode\", func(t *testing.T) {\n\t\tnormalModeOpts := getTestOptions(\"\")\n\t\ttest(t, &normalModeOpts)\n\t})\n\tt.Run(\"Managed mode\", func(t *testing.T) {\n\t\tmanagedModeOpts := getTestOptions(\"\")\n\t\tmanagedModeOpts.managedTxns = true\n\t\ttest(t, &managedModeOpts)\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\tdiskLessModeOpts := getTestOptions(\"\")\n\t\tdiskLessModeOpts.InMemory = true\n\t\ttest(t, &diskLessModeOpts)\n\t})\n}\n\nfunc TestStreamWriter3(t *testing.T) {\n\ttest := func(t *testing.T, opts *Options) {\n\t\trunBadgerTest(t, opts, func(t *testing.T, db *DB) {\n\t\t\t// write entries using stream writer\n\t\t\tnoOfKeys := 1000\n\t\t\tvalueSize := 128\n\n\t\t\t// insert keys which are even\n\t\t\tvalue := make([]byte, valueSize)\n\t\t\ty.Check2(rand.Read(value))\n\t\t\tcounter := 0\n\t\t\tbuf := z.NewBuffer(10<<20, \"test\")\n\t\t\tdefer func() { require.NoError(t, buf.Release()) }()\n\t\t\tfor i := 0; i < noOfKeys; i++ {\n\t\t\t\tkey := make([]byte, 8)\n\t\t\t\tbinary.BigEndian.PutUint64(key, uint64(counter))\n\t\t\t\tKVToBuffer(&pb.KV{\n\t\t\t\t\tKey:     key,\n\t\t\t\t\tValue:   value,\n\t\t\t\t\tVersion: 20,\n\t\t\t\t}, buf)\n\t\t\t\tcounter = counter + 2\n\t\t\t}\n\n\t\t\tsw := db.NewStreamWriter()\n\t\t\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\t\t\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\t\t\t// get max version of sw, will be used in transactions for managed mode\n\t\t\tmaxVs := 
sw.maxVersion\n\t\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\n\t\t\t// insert keys which are odd\n\t\t\tval := make([]byte, valueSize)\n\t\t\ty.Check2(rand.Read(val))\n\t\t\tcounter = 1\n\t\t\tfor i := 0; i < noOfKeys; i++ {\n\t\t\t\ttxn := db.newTransaction(true, opts.managedTxns)\n\t\t\t\tif opts.managedTxns {\n\t\t\t\t\ttxn.readTs = math.MaxUint64\n\t\t\t\t\ttxn.commitTs = maxVs\n\t\t\t\t}\n\t\t\t\tkeybyte := make([]byte, 8)\n\t\t\t\tkeyNo := uint64(counter)\n\t\t\t\tbinary.BigEndian.PutUint64(keybyte, keyNo)\n\t\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(keybyte, val)),\n\t\t\t\t\t\"error while inserting entries\")\n\t\t\t\trequire.NoError(t, txn.Commit(), \"error while commit\")\n\t\t\t\tcounter = counter + 2\n\t\t\t}\n\n\t\t\t// verify while iteration keys are in sorted order\n\t\t\terr := db.View(func(txn *Txn) error {\n\t\t\t\tkeysCount := 0\n\t\t\t\titrOps := DefaultIteratorOptions\n\t\t\t\tit := txn.NewIterator(itrOps)\n\t\t\t\tdefer it.Close()\n\t\t\t\tprev := uint64(0)\n\t\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\t\titem := it.Item()\n\t\t\t\t\tkey := item.Key()\n\t\t\t\t\tcurrent := binary.BigEndian.Uint64(key)\n\t\t\t\t\tif prev != 0 && current != (prev+uint64(1)) {\n\t\t\t\t\t\tt.Fatal(\"keys should be in increasing order\")\n\t\t\t\t\t}\n\t\t\t\t\tkeysCount++\n\t\t\t\t\tprev = current\n\t\t\t\t}\n\t\t\t\trequire.True(t, keysCount == 2*noOfKeys, \"count of keys is not matching\")\n\t\t\t\treturn nil\n\t\t\t})\n\n\t\t\trequire.Nil(t, err, \"error should be nil while iterating\")\n\t\t})\n\t}\n\tt.Run(\"Normal mode\", func(t *testing.T) {\n\t\tnormalModeOpts := getTestOptions(\"\")\n\t\ttest(t, &normalModeOpts)\n\t})\n\tt.Run(\"Managed mode\", func(t *testing.T) {\n\t\tmanagedModeOpts := getTestOptions(\"\")\n\t\tmanagedModeOpts.managedTxns = true\n\t\ttest(t, &managedModeOpts)\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\tdiskLessModeOpts := getTestOptions(\"\")\n\t\tdiskLessModeOpts.InMemory = 
true\n\t\ttest(t, &diskLessModeOpts)\n\t})\n}\n\n// After inserting all data from streams, StreamWriter reinitializes Oracle and updates its nextTs\n// to the maxVersion found in all inserted entries (if the DB is running in non-managed mode). It\n// also updates Oracle's txnMark and readMark. If Oracle is not reinitialized, it might cause\n// issues while updating readMark and txnMark when its nextTs is ahead of maxVersion. This test\n// verifies that Oracle reinitialization is happening. Try commenting out line 171 in\n// stream_writer.go (sw.db.orc = newOracle(sw.db.opt)); this test should fail.\nfunc TestStreamWriter4(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t// first insert some entries in db\n\t\tfor i := 0; i < 10; i++ {\n\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\tkey := []byte(fmt.Sprintf(\"key-%d\", i))\n\t\t\t\tvalue := []byte(fmt.Sprintf(\"val-%d\", i))\n\t\t\t\treturn txn.Set(key, value)\n\t\t\t})\n\t\t\trequire.NoError(t, err, \"error while updating db\")\n\t\t}\n\n\t\tbuf := z.NewBuffer(10<<20, \"test\")\n\t\tdefer func() { require.NoError(t, buf.Release()) }()\n\t\tKVToBuffer(&pb.KV{\n\t\t\tKey:     []byte(\"key-1\"),\n\t\t\tValue:   []byte(\"value-1\"),\n\t\t\tVersion: 1,\n\t\t}, buf)\n\n\t\tsw := db.NewStreamWriter()\n\t\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\t\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\t})\n}\n\nfunc TestStreamWriter5(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tleft := make([]byte, 6)\n\t\tleft[0] = 0x00\n\t\tcopy(left[1:], []byte(\"break\"))\n\n\t\tright := make([]byte, 6)\n\t\tright[0] = 0xff\n\t\tcopy(right[1:], []byte(\"break\"))\n\n\t\tbuf := z.NewBuffer(10<<20, \"test\")\n\t\tdefer func() { require.NoError(t, buf.Release()) }()\n\t\tKVToBuffer(&pb.KV{\n\t\t\tKey:     left,\n\t\t\tValue:   []byte(\"val\"),\n\t\t\tVersion: 1,\n\t\t}, 
buf)\n\t\tKVToBuffer(&pb.KV{\n\t\t\tKey:     right,\n\t\t\tValue:   []byte(\"val\"),\n\t\t\tVersion: 1,\n\t\t}, buf)\n\n\t\tsw := db.NewStreamWriter()\n\t\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\t\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\t\trequire.NoError(t, db.Close())\n\n\t\tvar err error\n\t\tdb, err = Open(db.opt)\n\t\trequire.NoError(t, err)\n\t\trequire.NoError(t, db.Close())\n\t})\n}\n\n// This test inserts multiple equal keys (ignoring the version) and verifies\n// that they all go to the same table.\nfunc TestStreamWriter6(t *testing.T) {\n\topt := getTestOptions(\"\")\n\topt.BaseTableSize = 1 << 15\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tstr := []string{\"a\", \"b\", \"c\"}\n\t\tver := uint64(0)\n\t\t// The base table size is 32 KB (1<<15) and the max table size for a level\n\t\t// 6 table is 1 MB (see the newWriter function). Since all the tables\n\t\t// will be written to level 6, we need to insert at least 1 MB of data.\n\t\t// Setting keyCount below 32 would cause this test to fail.\n\t\tkeyCount := 40\n\t\tbuf := z.NewBuffer(10<<20, \"test\")\n\t\tdefer func() { require.NoError(t, buf.Release()) }()\n\t\tfor i := range str {\n\t\t\tfor j := 0; j < keyCount; j++ {\n\t\t\t\tver++\n\t\t\t\tkv := &pb.KV{\n\t\t\t\t\tKey:     bytes.Repeat([]byte(str[i]), int(db.opt.BaseTableSize)),\n\t\t\t\t\tValue:   []byte(\"val\"),\n\t\t\t\t\tVersion: uint64(keyCount - j),\n\t\t\t\t}\n\t\t\t\tKVToBuffer(kv, buf)\n\t\t\t}\n\t\t}\n\n\t\t// list has 3 pairs for equal keys. Since each Key has size equal to BaseTableSize\n\t\t// we would have 6 tables, if keys are not equal. 
Here we should have 3 tables.\n\t\tsw := db.NewStreamWriter()\n\t\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\t\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\n\t\ttables := db.Tables()\n\t\trequire.Equal(t, 3, len(tables), \"Count of tables not matching\")\n\t\tfor _, tab := range tables {\n\t\t\trequire.Equal(t, keyCount, int(tab.KeyCount),\n\t\t\t\tfmt.Sprintf(\"failed for level: %d\", tab.Level))\n\t\t}\n\t\trequire.NoError(t, db.Close())\n\t\tdb, err := Open(db.opt)\n\t\trequire.NoError(t, err)\n\t\trequire.NoError(t, db.Close())\n\t})\n}\n\n// This test uses a StreamWriter without calling Flush() at the end.\nfunc TestStreamWriterCancel(t *testing.T) {\n\topt := getTestOptions(\"\")\n\topt.BaseTableSize = 1 << 15\n\trunBadgerTest(t, &opt, func(t *testing.T, db *DB) {\n\t\tstr := []string{\"a\", \"a\", \"b\", \"b\", \"c\", \"c\"}\n\t\tver := 1\n\t\tbuf := z.NewBuffer(10<<20, \"test\")\n\t\tdefer func() { require.NoError(t, buf.Release()) }()\n\t\tfor i := range str {\n\t\t\tkv := &pb.KV{\n\t\t\t\tKey:     bytes.Repeat([]byte(str[i]), int(db.opt.BaseTableSize)),\n\t\t\t\tValue:   []byte(\"val\"),\n\t\t\t\tVersion: uint64(ver),\n\t\t\t}\n\t\t\tKVToBuffer(kv, buf)\n\t\t\tver = (ver + 1) % 2\n\t\t}\n\n\t\tsw := db.NewStreamWriter()\n\t\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\t\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\t\tsw.Cancel()\n\n\t\t// Use the API incorrectly.\n\t\tsw1 := db.NewStreamWriter()\n\t\tdefer sw1.Cancel()\n\t\trequire.NoError(t, sw1.Prepare())\n\t\tdefer sw1.Cancel()\n\t\tsw1.Flush()\n\t})\n}\n\nfunc TestStreamDone(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tsw := db.NewStreamWriter()\n\t\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\n\t\tvar val [10]byte\n\t\trand.Read(val[:])\n\t\tfor i := 0; i < 10; i++ {\n\t\t\tbuf := z.NewBuffer(10<<20, \"test\")\n\t\t\tdefer func() { 
require.NoError(t, buf.Release()) }()\n\t\t\tkv1 := &pb.KV{\n\t\t\t\tKey:      []byte(fmt.Sprintf(\"%d\", i)),\n\t\t\t\tValue:    val[:],\n\t\t\t\tVersion:  1,\n\t\t\t\tStreamId: uint32(i),\n\t\t\t}\n\t\t\tkv2 := &pb.KV{\n\t\t\t\tStreamId:   uint32(i),\n\t\t\t\tStreamDone: true,\n\t\t\t}\n\t\t\tKVToBuffer(kv1, buf)\n\t\t\tKVToBuffer(kv2, buf)\n\t\t\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\t\t}\n\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\t\trequire.NoError(t, db.Close())\n\n\t\tvar err error\n\t\tdb, err = Open(db.opt)\n\t\trequire.NoError(t, err)\n\t\trequire.NoError(t, db.Close())\n\t})\n}\n\nfunc TestSendOnClosedStream(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer func() {\n\t\trequire.NoError(t, os.RemoveAll(dir))\n\t}()\n\topts := getTestOptions(dir)\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\n\tsw := db.NewStreamWriter()\n\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\n\tvar val [10]byte\n\trand.Read(val[:])\n\tbuf := z.NewBuffer(10<<20, \"test\")\n\tdefer func() { require.NoError(t, buf.Release()) }()\n\tkv1 := &pb.KV{\n\t\tKey:      []byte(fmt.Sprintf(\"%d\", 1)),\n\t\tValue:    val[:],\n\t\tVersion:  1,\n\t\tStreamId: uint32(1),\n\t}\n\tkv2 := &pb.KV{\n\t\tStreamId:   uint32(1),\n\t\tStreamDone: true,\n\t}\n\tKVToBuffer(kv1, buf)\n\tKVToBuffer(kv2, buf)\n\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\n\t// Defer for panic.\n\tdefer func() {\n\t\trequire.NotNil(t, recover(), \"should have panicked\")\n\t\trequire.NoError(t, sw.Flush())\n\t\trequire.NoError(t, db.Close())\n\t}()\n\t// Send once stream is closed.\n\tbuf1 := z.NewBuffer(10<<20, \"test\")\n\tdefer func() { require.NoError(t, buf1.Release()) }()\n\tkv1 = &pb.KV{\n\t\tKey:      []byte(fmt.Sprintf(\"%d\", 2)),\n\t\tValue:    val[:],\n\t\tVersion:  1,\n\t\tStreamId: uint32(1),\n\t}\n\tKVToBuffer(kv1, buf1)\n\trequire.NoError(t, sw.Write(buf1))\n}\n\nfunc 
TestSendOnClosedStream2(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer func() {\n\t\trequire.NoError(t, os.RemoveAll(dir))\n\t}()\n\topts := getTestOptions(dir)\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\n\tsw := db.NewStreamWriter()\n\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\n\tvar val [10]byte\n\trand.Read(val[:])\n\tbuf := z.NewBuffer(10<<20, \"test\")\n\tdefer func() { require.NoError(t, buf.Release()) }()\n\tkv1 := &pb.KV{\n\t\tKey:      []byte(fmt.Sprintf(\"%d\", 1)),\n\t\tValue:    val[:],\n\t\tVersion:  1,\n\t\tStreamId: uint32(1),\n\t}\n\tkv2 := &pb.KV{\n\t\tStreamId:   uint32(1),\n\t\tStreamDone: true,\n\t}\n\tkv3 := &pb.KV{\n\t\tKey:      []byte(fmt.Sprintf(\"%d\", 2)),\n\t\tValue:    val[:],\n\t\tVersion:  1,\n\t\tStreamId: uint32(1),\n\t}\n\tKVToBuffer(kv1, buf)\n\tKVToBuffer(kv2, buf)\n\tKVToBuffer(kv3, buf)\n\n\t// Defer for panic.\n\tdefer func() {\n\t\trequire.NotNil(t, recover(), \"should have panicked\")\n\t\trequire.NoError(t, sw.Flush())\n\t\trequire.NoError(t, db.Close())\n\t}()\n\n\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n}\n\nfunc TestStreamWriterEncrypted(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\n\topts := DefaultOptions(dir)\n\tdefer removeDir(dir)\n\n\topts = opts.WithEncryptionKey([]byte(\"badgerkey16bytes\"))\n\topts.BlockCacheSize = 100 << 20\n\topts.IndexCacheSize = 100 << 20\n\tdb, err := Open(opts)\n\trequire.NoError(t, err)\n\tkey := []byte(\"mykey\")\n\tvalue := []byte(\"myvalue\")\n\n\tbuf := z.NewBuffer(10<<20, \"test\")\n\tdefer func() { require.NoError(t, buf.Release()) }()\n\tKVToBuffer(&pb.KV{\n\t\tKey:     key,\n\t\tValue:   value,\n\t\tVersion: 20,\n\t}, buf)\n\n\tsw := db.NewStreamWriter()\n\trequire.NoError(t, sw.Prepare(), \"Prepare failed\")\n\trequire.NoError(t, sw.Write(buf), \"Write failed\")\n\trequire.NoError(t, sw.Flush(), \"Flush failed\")\n\n\terr = 
db.View(func(txn *Txn) error {\n\t\titem, err := txn.Get(key)\n\t\trequire.NoError(t, err)\n\t\tval, err := item.ValueCopy(nil)\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, value, val)\n\t\treturn nil\n\t})\n\trequire.NoError(t, err, \"Error while retrieving key\")\n\trequire.NoError(t, db.Close())\n\n\tdb, err = Open(opts)\n\trequire.NoError(t, err)\n\trequire.NoError(t, db.Close())\n\n}\n\n// Test that stream writer does not crash with large values in managed mode.\nfunc TestStreamWriterWithLargeValue(t *testing.T) {\n\topts := DefaultOptions(\"\")\n\topts.managedTxns = true\n\trunBadgerTest(t, &opts, func(t *testing.T, db *DB) {\n\t\tbuf := z.NewBuffer(10<<20, \"test\")\n\t\tdefer func() { require.NoError(t, buf.Release()) }()\n\t\tval := make([]byte, 10<<20)\n\t\t_, err := rand.Read(val)\n\t\trequire.NoError(t, err)\n\t\tKVToBuffer(&pb.KV{\n\t\t\tKey:     []byte(\"key\"),\n\t\t\tValue:   val,\n\t\t\tVersion: 1,\n\t\t}, buf)\n\n\t\tsw := db.NewStreamWriter()\n\t\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\t\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\t})\n}\n\nfunc TestStreamWriterIncremental(t *testing.T) {\n\taddIncremental := func(t *testing.T, db *DB, keys [][]byte) {\n\t\tbuf := z.NewBuffer(10<<20, \"test\")\n\t\tdefer func() { require.NoError(t, buf.Release()) }()\n\t\tfor _, key := range keys {\n\t\t\tKVToBuffer(&pb.KV{\n\t\t\t\tKey:     key,\n\t\t\t\tValue:   []byte(\"val\"),\n\t\t\t\tVersion: 1,\n\t\t\t}, buf)\n\t\t}\n\t\t// Now do an incremental stream write.\n\t\tsw := db.NewStreamWriter()\n\t\trequire.NoError(t, sw.PrepareIncremental(), \"sw.PrepareIncremental() failed\")\n\t\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\t}\n\n\tt.Run(\"incremental on non-empty DB\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\tbuf := z.NewBuffer(10<<20, 
\"test\")\n\t\t\tdefer func() { require.NoError(t, buf.Release()) }()\n\t\t\tKVToBuffer(&pb.KV{\n\t\t\t\tKey:     []byte(\"key-1\"),\n\t\t\t\tValue:   []byte(\"val\"),\n\t\t\t\tVersion: 1,\n\t\t\t}, buf)\n\t\t\tsw := db.NewStreamWriter()\n\t\t\trequire.NoError(t, sw.Prepare(), \"sw.Prepare() failed\")\n\t\t\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\t\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"key-2\")})\n\n\t\t\ttxn := db.NewTransaction(false)\n\t\t\tdefer txn.Discard()\n\t\t\t_, err := txn.Get([]byte(\"key-1\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"key-2\"))\n\t\t\trequire.NoError(t, err)\n\t\t})\n\t})\n\n\tt.Run(\"incremental on empty DB\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"key-1\")})\n\t\t\ttxn := db.NewTransaction(false)\n\t\t\tdefer txn.Discard()\n\t\t\t_, err := txn.Get([]byte(\"key-1\"))\n\t\t\trequire.NoError(t, err)\n\t\t})\n\t})\n\n\tt.Run(\"multiple incremental\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a1\"), []byte(\"c1\")})\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a2\"), []byte(\"c2\")})\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a3\"), []byte(\"c3\")})\n\t\t\ttxn := db.NewTransaction(false)\n\t\t\tdefer txn.Discard()\n\t\t\t_, err := txn.Get([]byte(\"a1\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c1\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"a2\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c2\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"a3\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c3\"))\n\t\t\trequire.NoError(t, err)\n\t\t})\n\t})\n\n\tt.Run(\"write between incremental writes\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, 
db *DB) {\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a1\"), []byte(\"c1\")})\n\t\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.Set([]byte(\"a3\"), []byte(\"c3\"))\n\t\t\t}))\n\n\t\t\tsw := db.NewStreamWriter()\n\t\t\tdefer sw.Cancel()\n\t\t\trequire.EqualError(t, sw.PrepareIncremental(), \"Unable to do incremental writes because MemTable has data\")\n\n\t\t\ttxn := db.NewTransaction(false)\n\t\t\tdefer txn.Discard()\n\t\t\t_, err := txn.Get([]byte(\"a1\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c1\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"a3\"))\n\t\t\trequire.NoError(t, err)\n\t\t})\n\t})\n\n\tt.Run(\"incremental writes > #levels\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a1\"), []byte(\"c1\")})\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a2\"), []byte(\"c2\")})\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a3\"), []byte(\"c3\")})\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a4\"), []byte(\"c4\")})\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a5\"), []byte(\"c5\")})\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a6\"), []byte(\"c6\")})\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a7\"), []byte(\"c7\")})\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a8\"), []byte(\"c8\")})\n\t\t\taddIncremental(t, db, [][]byte{[]byte(\"a9\"), []byte(\"c9\")})\n\n\t\t\ttxn := db.NewTransaction(false)\n\t\t\tdefer txn.Discard()\n\t\t\t_, err := txn.Get([]byte(\"a1\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c1\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"a2\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c2\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"a3\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c3\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = 
txn.Get([]byte(\"a4\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c4\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"a5\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c5\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"a6\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c6\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"a7\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c7\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"a8\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c8\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"a9\"))\n\t\t\trequire.NoError(t, err)\n\t\t\t_, err = txn.Get([]byte(\"c9\"))\n\t\t\trequire.NoError(t, err)\n\t\t})\n\t})\n\n\tt.Run(\"multiple incremental with older data first\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\tbuf := z.NewBuffer(10<<20, \"test\")\n\t\t\tdefer func() { require.NoError(t, buf.Release()) }()\n\t\t\tKVToBuffer(&pb.KV{\n\t\t\t\tKey:     []byte(\"a1\"),\n\t\t\t\tValue:   []byte(\"val1\"),\n\t\t\t\tVersion: 11,\n\t\t\t}, buf)\n\t\t\tsw := db.NewStreamWriter()\n\t\t\trequire.NoError(t, sw.PrepareIncremental(), \"sw.PrepareIncremental() failed\")\n\t\t\trequire.NoError(t, sw.Write(buf), \"sw.Write() failed\")\n\t\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\n\t\t\tbuf2 := z.NewBuffer(10<<20, \"test\")\n\t\t\tdefer func() { require.NoError(t, buf2.Release()) }()\n\t\t\tKVToBuffer(&pb.KV{\n\t\t\t\tKey:     []byte(\"a2\"),\n\t\t\t\tValue:   []byte(\"val2\"),\n\t\t\t\tVersion: 9,\n\t\t\t}, buf2)\n\t\t\tsw = db.NewStreamWriter()\n\t\t\trequire.NoError(t, sw.PrepareIncremental(), \"sw.PrepareIncremental() failed\")\n\t\t\trequire.NoError(t, sw.Write(buf2), \"sw.Write() failed\")\n\t\t\trequire.NoError(t, sw.Flush(), \"sw.Flush() failed\")\n\n\t\t\t// This will move the maxTs to 10 
(earlier, without the fix)\n\t\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.Set([]byte(\"a1\"), []byte(\"val3\"))\n\t\t\t}))\n\t\t\t// This will move the maxTs to 11 (earlier, without the fix)\n\t\t\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\t\t\treturn txn.Set([]byte(\"a3\"), []byte(\"val4\"))\n\t\t\t}))\n\n\t\t\t// And now, the first write with val1 will resurface (without the fix)\n\t\t\trequire.NoError(t, db.View(func(txn *Txn) error {\n\t\t\t\titem, err := txn.Get([]byte(\"a1\"))\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\tval, err := item.ValueCopy(nil)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\trequire.Equal(t, \"val3\", string(val))\n\t\t\t\treturn nil\n\t\t\t}))\n\t\t})\n\t})\n}\n"
  },
  {
    "path": "structs.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"encoding/binary\"\n\t\"fmt\"\n\t\"time\"\n\t\"unsafe\"\n)\n\ntype valuePointer struct {\n\tFid    uint32\n\tLen    uint32\n\tOffset uint32\n}\n\nconst vptrSize = unsafe.Sizeof(valuePointer{})\n\nfunc (p valuePointer) Less(o valuePointer) bool {\n\tif p.Fid != o.Fid {\n\t\treturn p.Fid < o.Fid\n\t}\n\tif p.Offset != o.Offset {\n\t\treturn p.Offset < o.Offset\n\t}\n\treturn p.Len < o.Len\n}\n\nfunc (p valuePointer) IsZero() bool {\n\treturn p.Fid == 0 && p.Offset == 0 && p.Len == 0\n}\n\n// Encode encodes Pointer into byte buffer.\nfunc (p valuePointer) Encode() []byte {\n\tb := make([]byte, vptrSize)\n\t// Copy over the content from p to b.\n\t*(*valuePointer)(unsafe.Pointer(&b[0])) = p\n\treturn b\n}\n\n// Decode decodes the value pointer into the provided byte buffer.\nfunc (p *valuePointer) Decode(b []byte) {\n\t// Copy over data from b into p. Using *p=unsafe.pointer(...) leads to\n\t// pointer alignment issues. See https://github.com/dgraph-io/badger/issues/1096\n\t// and comment https://github.com/dgraph-io/badger/pull/1097#pullrequestreview-307361714\n\tcopy(((*[vptrSize]byte)(unsafe.Pointer(p))[:]), b[:vptrSize])\n}\n\n// header is used in value log as a header before Entry.\ntype header struct {\n\tklen      uint32\n\tvlen      uint32\n\texpiresAt uint64\n\tmeta      byte\n\tuserMeta  byte\n}\n\nconst (\n\t// Maximum possible size of the header. The maximum size of header struct will be 18 but the\n\t// maximum size of varint encoded header will be 22.\n\tmaxHeaderSize = 22\n)\n\n// Encode encodes the header into []byte. The provided []byte should be at least 5 bytes. 
The\n// function will panic if out []byte isn't large enough to hold all the values.\n// The encoded header looks like\n// +------+----------+------------+--------------+-----------+\n// | Meta | UserMeta | Key Length | Value Length | ExpiresAt |\n// +------+----------+------------+--------------+-----------+\nfunc (h header) Encode(out []byte) int {\n\tout[0], out[1] = h.meta, h.userMeta\n\tindex := 2\n\tindex += binary.PutUvarint(out[index:], uint64(h.klen))\n\tindex += binary.PutUvarint(out[index:], uint64(h.vlen))\n\tindex += binary.PutUvarint(out[index:], h.expiresAt)\n\treturn index\n}\n\n// Decode decodes the given header from the provided byte slice.\n// Returns the number of bytes read.\nfunc (h *header) Decode(buf []byte) int {\n\th.meta, h.userMeta = buf[0], buf[1]\n\tindex := 2\n\tklen, count := binary.Uvarint(buf[index:])\n\th.klen = uint32(klen)\n\tindex += count\n\tvlen, count := binary.Uvarint(buf[index:])\n\th.vlen = uint32(vlen)\n\tindex += count\n\th.expiresAt, count = binary.Uvarint(buf[index:])\n\treturn index + count\n}\n\n// DecodeFrom reads the header from the hashReader.\n// Returns the number of bytes read.\nfunc (h *header) DecodeFrom(reader *hashReader) (int, error) {\n\tvar err error\n\th.meta, err = reader.ReadByte()\n\tif err != nil {\n\t\treturn 0, err\n\t}\n\th.userMeta, err = reader.ReadByte()\n\tif err != nil {\n\t\treturn 0, err\n\t}\n\tklen, err := binary.ReadUvarint(reader)\n\tif err != nil {\n\t\treturn 0, err\n\t}\n\th.klen = uint32(klen)\n\tvlen, err := binary.ReadUvarint(reader)\n\tif err != nil {\n\t\treturn 0, err\n\t}\n\th.vlen = uint32(vlen)\n\th.expiresAt, err = binary.ReadUvarint(reader)\n\tif err != nil {\n\t\treturn 0, err\n\t}\n\treturn reader.bytesRead, nil\n}\n\n// Entry provides Key, Value, UserMeta and ExpiresAt. 
This struct can be used by\n// the user to set data.\ntype Entry struct {\n\tKey       []byte\n\tValue     []byte\n\tExpiresAt uint64 // time.Unix\n\tversion   uint64\n\toffset    uint32 // offset is an internal field.\n\tUserMeta  byte\n\tmeta      byte\n\n\t// Fields maintained internally.\n\thlen         int // Length of the header.\n\tvalThreshold int64\n}\n\nfunc (e *Entry) isZero() bool {\n\treturn len(e.Key) == 0\n}\n\nfunc (e *Entry) estimateSizeAndSetThreshold(threshold int64) int64 {\n\tif e.valThreshold == 0 {\n\t\te.valThreshold = threshold\n\t}\n\tk := int64(len(e.Key))\n\tv := int64(len(e.Value))\n\tif v < e.valThreshold {\n\t\treturn k + v + 2 // Meta, UserMeta\n\t}\n\treturn k + 12 + 2 // 12 for ValuePointer, 2 for metas.\n}\n\nfunc (e *Entry) skipVlogAndSetThreshold(threshold int64) bool {\n\tif e.valThreshold == 0 {\n\t\te.valThreshold = threshold\n\t}\n\treturn int64(len(e.Value)) < e.valThreshold\n}\n\n//nolint:unused\nfunc (e Entry) print(prefix string) {\n\tfmt.Printf(\"%s Key: %s Meta: %d UserMeta: %d Offset: %d len(val)=%d\",\n\t\tprefix, e.Key, e.meta, e.UserMeta, e.offset, len(e.Value))\n}\n\n// NewEntry creates a new entry with key and value passed in args. This newly created entry can be\n// set in a transaction by calling txn.SetEntry(). All other properties of Entry can be set by\n// calling WithMeta, WithDiscard, WithTTL methods on it.\n// This function uses key and value reference, hence users must\n// not modify key and value until the end of transaction.\nfunc NewEntry(key, value []byte) *Entry {\n\treturn &Entry{\n\t\tKey:   key,\n\t\tValue: value,\n\t}\n}\n\n// WithMeta adds meta data to Entry e. This byte is stored alongside the key\n// and can be used as an aid to interpret the value or store other contextual\n// bits corresponding to the key-value pair of entry.\nfunc (e *Entry) WithMeta(meta byte) *Entry {\n\te.UserMeta = meta\n\treturn e\n}\n\n// WithDiscard adds a marker to Entry e. 
This means all the previous versions of the key (of the\n// Entry) will be eligible for garbage collection.\n// This method is only useful if you have set a higher limit for options.NumVersionsToKeep. The\n// default setting is 1, in which case, this function doesn't add any more benefit. If however, you\n// have a higher setting for NumVersionsToKeep (in Dgraph, we set it to infinity), you can use this\n// method to indicate that all the older versions can be discarded and removed during compactions.\nfunc (e *Entry) WithDiscard() *Entry {\n\te.meta = bitDiscardEarlierVersions\n\treturn e\n}\n\n// WithTTL adds time to live duration to Entry e. Entry stored with a TTL would automatically expire\n// after the time has elapsed, and will be eligible for garbage collection.\nfunc (e *Entry) WithTTL(dur time.Duration) *Entry {\n\te.ExpiresAt = uint64(time.Now().Add(dur).Unix())\n\treturn e\n}\n\n// withMergeBit sets merge bit in entry's metadata. This\n// function is called by MergeOperator's Add method.\nfunc (e *Entry) withMergeBit() *Entry {\n\te.meta = bitMergeEntry\n\treturn e\n}\n"
  },
  {
    "path": "structs_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"math\"\n\t\"reflect\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n)\n\n// Regression test for github.com/dgraph-io/badger/pull/1800\nfunc TestLargeEncode(t *testing.T) {\n\tvar headerEnc [maxHeaderSize]byte\n\th := header{math.MaxUint32, math.MaxUint32, math.MaxUint64, math.MaxUint8, math.MaxUint8}\n\trequire.NotPanics(t, func() { _ = h.Encode(headerEnc[:]) })\n}\n\nfunc TestNumFieldsHeader(t *testing.T) {\n\t// maxHeaderSize must correspond with any changes made to header\n\trequire.Equal(t, 5, reflect.TypeOf(header{}).NumField())\n}\n"
  },
  {
    "path": "table/README.md",
    "content": "Size of table is 123,217,667 bytes for all benchmarks.\n\n# BenchmarkRead\n\n```sh\n$ go test -bench ^BenchmarkRead$ -run ^$ -count 3\ngoos: linux\ngoarch: amd64\npkg: github.com/dgraph-io/badger/table\nBenchmarkRead-16              10       154074944 ns/op\nBenchmarkRead-16              10       154340411 ns/op\nBenchmarkRead-16              10       151914489 ns/op\nPASS\nok       github.com/dgraph-io/badger/table       22.467s\n```\n\nSize of table is 123,217,667 bytes, which is ~118MB.\n\nThe rate is ~762MB/s using LoadToRAM (when table is in RAM).\n\nTo read a 64MB table, this would take ~0.084s, which is negligible.\n\n# BenchmarkReadAndBuild\n\n```sh\n$ go test -bench BenchmarkReadAndBuild -run ^$ -count 3\ngoos: linux\ngoarch: amd64\npkg: github.com/dgraph-io/badger/table\nBenchmarkReadAndBuild-16              1       1026755231 ns/op\nBenchmarkReadAndBuild-16              1       1009543316 ns/op\nBenchmarkReadAndBuild-16              1       1039920546 ns/op\nPASS\nok       github.com/dgraph-io/badger/table       12.081s\n```\n\nThe rate is ~123MB/s. To build a 64MB table, this would take ~0.56s. Note that this does NOT include\nthe flushing of the table to disk. All we are doing above is reading one table (which is in RAM) and\nwrite one table in memory.\n\nThe table building takes 0.56-0.084s ~ 0.4823s.\n\n# BenchmarkReadMerged\n\nBelow, we merge 5 tables. The total size remains unchanged at ~122M.\n\n```sh\n$ go test -bench ReadMerged -run ^$ -count 3\ngoos: linux\ngoarch: amd64\npkg: github.com/dgraph-io/badger/table\nBenchmarkReadMerged-16              2       977588975 ns/op\nBenchmarkReadMerged-16              2       982140738 ns/op\nBenchmarkReadMerged-16              2       962046017 ns/op\nPASS\nok       github.com/dgraph-io/badger/table       27.433s\n```\n\nThe rate is ~120MB/s. 
To read a 64MB table using the merge iterator, this would take ~0.53s.\n\n# BenchmarkRandomRead\n\n```sh\n$ go test -bench BenchmarkRandomRead$ -run ^$ -count 3\ngoos: linux\ngoarch: amd64\npkg: github.com/dgraph-io/badger/table\nBenchmarkRandomRead-16              500000       2645 ns/op\nBenchmarkRandomRead-16              500000       2648 ns/op\nBenchmarkRandomRead-16              500000       2614 ns/op\nPASS\nok       github.com/dgraph-io/badger/table       50.850s\n```\n\nFor random read benchmarking, we are randomly reading a key and verifying its value.\n\n# DB Open benchmark\n\n1. Create a Badger DB with 2 billion key-value pairs (about 380GB of data)\n\n   ```sh\n   badger fill -m 2000 --dir=\"/tmp/data\" --sorted\n   ```\n\n2. Clear buffers and swap memory\n\n   ```sh\n   free -mh && sync && echo 3 | sudo tee /proc/sys/vm/drop_caches && sudo swapoff -a && sudo swapon -a && free -mh\n   ```\n\n   Also flush disk buffers\n\n   ```sh\n   blockdev --flushbufs /dev/nvme0n1p4\n   ```\n\n3. 
Run the benchmark\n\n   ```sh\n   go test -run=^$ github.com/dgraph-io/badger -bench ^BenchmarkDBOpen$ -benchdir=\"/tmp/data\" -v\n\n   badger 2019/06/04 17:15:56 INFO: 126 tables out of 1028 opened in 3.017s\n   badger 2019/06/04 17:15:59 INFO: 257 tables out of 1028 opened in 6.014s\n   badger 2019/06/04 17:16:02 INFO: 387 tables out of 1028 opened in 9.017s\n   badger 2019/06/04 17:16:05 INFO: 516 tables out of 1028 opened in 12.025s\n   badger 2019/06/04 17:16:08 INFO: 645 tables out of 1028 opened in 15.013s\n   badger 2019/06/04 17:16:11 INFO: 775 tables out of 1028 opened in 18.008s\n   badger 2019/06/04 17:16:14 INFO: 906 tables out of 1028 opened in 21.003s\n   badger 2019/06/04 17:16:17 INFO: All 1028 tables opened in 23.851s\n   badger 2019/06/04 17:16:17 INFO: Replaying file id: 1998 at offset: 332000\n   badger 2019/06/04 17:16:17 INFO: Replay took: 9.81µs\n   goos: linux\n   goarch: amd64\n   pkg: github.com/dgraph-io/badger\n   BenchmarkDBOpen-16              1       23930082140 ns/op\n   PASS\n   ok       github.com/dgraph-io/badger       24.076s\n\n   ```\n\n   It takes about 23.851s to open a DB with 2 billion sorted key-value entries.\n"
  },
  {
    "path": "table/builder.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage table\n\nimport (\n\t\"crypto/aes\"\n\t\"errors\"\n\t\"math\"\n\t\"runtime\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"unsafe\"\n\n\tfbs \"github.com/google/flatbuffers/go\"\n\t\"github.com/klauspost/compress/s2\"\n\t\"google.golang.org/protobuf/proto\"\n\n\t\"github.com/dgraph-io/badger/v4/fb\"\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nconst (\n\tKB = 1024\n\tMB = KB * 1024\n\n\t// When a block is encrypted, it's length increases. We add 256 bytes of padding to\n\t// handle cases when block size increases. This is an approximate number.\n\tpadding = 256\n)\n\ntype header struct {\n\toverlap uint16 // Overlap with base key.\n\tdiff    uint16 // Length of the diff.\n}\n\nconst headerSize = uint16(unsafe.Sizeof(header{}))\n\n// Encode encodes the header.\nfunc (h header) Encode() []byte {\n\tvar b [4]byte\n\t*(*header)(unsafe.Pointer(&b[0])) = h\n\treturn b[:]\n}\n\n// Decode decodes the header.\nfunc (h *header) Decode(buf []byte) {\n\t// Copy over data from buf into h. Using *h=unsafe.pointer(...) leads to\n\t// pointer alignment issues. See https://github.com/dgraph-io/badger/issues/1096\n\t// and comment https://github.com/dgraph-io/badger/pull/1097#pullrequestreview-307361714\n\tcopy(((*[headerSize]byte)(unsafe.Pointer(h))[:]), buf[:headerSize])\n}\n\n// bblock represents a block that is being compressed/encrypted in the background.\ntype bblock struct {\n\tdata         []byte\n\tbaseKey      []byte   // Base key for the current block.\n\tentryOffsets []uint32 // Offsets of entries present in current block.\n\tend          int      // Points to the end offset of the block.\n}\n\n// Builder is used in building a table.\ntype Builder struct {\n\t// Typically tens or hundreds of meg. 
This is for one single file.\n\talloc            *z.Allocator\n\tcurBlock         *bblock\n\tcompressedSize   atomic.Uint32\n\tuncompressedSize atomic.Uint32\n\n\tlenOffsets    uint32\n\tkeyHashes     []uint32 // Used for building the bloomfilter.\n\topts          *Options\n\tmaxVersion    uint64\n\tonDiskSize    uint32\n\tstaleDataSize int\n\n\t// Used to concurrently compress/encrypt blocks.\n\twg        sync.WaitGroup\n\tblockChan chan *bblock\n\tblockList []*bblock\n}\n\nfunc (b *Builder) allocate(need int) []byte {\n\tbb := b.curBlock\n\tif len(bb.data[bb.end:]) < need {\n\t\t// We need to reallocate. 1GB is the max size that the allocator can allocate.\n\t\t// While reallocating, if doubling exceeds that limit, then put the upper bound on it.\n\t\tsz := 2 * len(bb.data)\n\t\tif sz > (1 << 30) {\n\t\t\tsz = 1 << 30\n\t\t}\n\t\tif bb.end+need > sz {\n\t\t\tsz = bb.end + need\n\t\t}\n\t\ttmp := b.alloc.Allocate(sz)\n\t\tcopy(tmp, bb.data)\n\t\tbb.data = tmp\n\t}\n\tbb.end += need\n\treturn bb.data[bb.end-need : bb.end]\n}\n\n// append appends to curBlock.data\nfunc (b *Builder) append(data []byte) {\n\tdst := b.allocate(len(data))\n\ty.AssertTrue(len(data) == copy(dst, data))\n}\n\nconst maxAllocatorInitialSz = 256 << 20\n\n// NewTableBuilder makes a new TableBuilder.\nfunc NewTableBuilder(opts Options) *Builder {\n\tsz := 2 * int(opts.TableSize)\n\tif sz > maxAllocatorInitialSz {\n\t\tsz = maxAllocatorInitialSz\n\t}\n\tb := &Builder{\n\t\talloc: opts.AllocPool.Get(sz, \"TableBuilder\"),\n\t\topts:  &opts,\n\t}\n\tb.alloc.Tag = \"Builder\"\n\tb.curBlock = &bblock{\n\t\tdata: b.alloc.Allocate(opts.BlockSize + padding),\n\t}\n\tb.opts.tableCapacity = uint64(float64(b.opts.TableSize) * 0.95)\n\n\t// If encryption or compression is not enabled, do not start compression/encryption goroutines\n\t// and write directly to the buffer.\n\tif b.opts.Compression == options.None && b.opts.DataKey == nil {\n\t\treturn b\n\t}\n\n\tcount := 2 * runtime.NumCPU()\n\tb.blockChan = 
make(chan *bblock, count*2)\n\n\tb.wg.Add(count)\n\tfor i := 0; i < count; i++ {\n\t\tgo b.handleBlock()\n\t}\n\treturn b\n}\n\nfunc maxEncodedLen(ctype options.CompressionType, sz int) int {\n\tswitch ctype {\n\tcase options.Snappy:\n\t\treturn s2.MaxEncodedLen(sz)\n\tcase options.ZSTD:\n\t\treturn y.ZSTDCompressBound(sz)\n\t}\n\treturn sz\n}\n\nfunc (b *Builder) handleBlock() {\n\tdefer b.wg.Done()\n\n\tdoCompress := b.opts.Compression != options.None\n\tfor item := range b.blockChan {\n\t\t// Extract the block.\n\t\tblockBuf := item.data[:item.end]\n\t\t// Compress the block.\n\t\tif doCompress {\n\t\t\tout, err := b.compressData(blockBuf)\n\t\t\ty.Check(err)\n\t\t\tblockBuf = out\n\t\t}\n\t\tif b.shouldEncrypt() {\n\t\t\tout, err := b.encrypt(blockBuf)\n\t\t\ty.Check(y.Wrapf(err, \"Error while encrypting block in table builder.\"))\n\t\t\tblockBuf = out\n\t\t}\n\n\t\t// blockBuf should always be less than or equal to the allocated space. If blockBuf is\n\t\t// greater than the allocated space, the data from this block cannot be stored in its\n\t\t// existing location.\n\t\tallocatedSpace := maxEncodedLen(b.opts.Compression, (item.end)) + padding + 1\n\t\ty.AssertTrue(len(blockBuf) <= allocatedSpace)\n\n\t\t// blockBuf was allocated on allocator. 
So, we don't need to copy it over.\n\t\titem.data = blockBuf\n\t\titem.end = len(blockBuf)\n\t\tb.compressedSize.Add(uint32(len(blockBuf)))\n\t}\n}\n\n// Close closes the TableBuilder.\nfunc (b *Builder) Close() {\n\tb.opts.AllocPool.Return(b.alloc)\n}\n\n// Empty returns whether it's empty.\nfunc (b *Builder) Empty() bool { return len(b.keyHashes) == 0 }\n\n// keyDiff returns a suffix of newKey that is different from b.baseKey.\nfunc (b *Builder) keyDiff(newKey []byte) []byte {\n\tvar i int\n\tfor i = 0; i < len(newKey) && i < len(b.curBlock.baseKey); i++ {\n\t\tif newKey[i] != b.curBlock.baseKey[i] {\n\t\t\tbreak\n\t\t}\n\t}\n\treturn newKey[i:]\n}\n\nfunc (b *Builder) addHelper(key []byte, v y.ValueStruct, vpLen uint32) {\n\tb.keyHashes = append(b.keyHashes, y.Hash(y.ParseKey(key)))\n\n\tif version := y.ParseTs(key); version > b.maxVersion {\n\t\tb.maxVersion = version\n\t}\n\n\t// diffKey stores the difference of key with baseKey.\n\tvar diffKey []byte\n\tif len(b.curBlock.baseKey) == 0 {\n\t\t// Make a copy. Builder should not keep references. Otherwise, caller has to be very careful\n\t\t// and will have to make copies of keys every time they add to builder, which is even worse.\n\t\tb.curBlock.baseKey = append(b.curBlock.baseKey[:0], key...)\n\t\tdiffKey = key\n\t} else {\n\t\tdiffKey = b.keyDiff(key)\n\t}\n\n\ty.AssertTrue(len(key)-len(diffKey) <= math.MaxUint16)\n\ty.AssertTrue(len(diffKey) <= math.MaxUint16)\n\n\th := header{\n\t\toverlap: uint16(len(key) - len(diffKey)),\n\t\tdiff:    uint16(len(diffKey)),\n\t}\n\n\t// store current entry's offset\n\tb.curBlock.entryOffsets = append(b.curBlock.entryOffsets, uint32(b.curBlock.end))\n\n\t// Layout: header, diffKey, value.\n\tb.append(h.Encode())\n\tb.append(diffKey)\n\n\tdst := b.allocate(int(v.EncodedSize()))\n\tv.Encode(dst)\n\n\t// Add the vpLen to the onDisk size. 
We'll add the size of the block to\n\t// onDisk size in Finish() function.\n\tb.onDiskSize += vpLen\n}\n\n/*\nStructure of Block.\n+-------------------+---------------------+--------------------+--------------+------------------+\n| Entry1            | Entry2              | Entry3             | Entry4       | Entry5           |\n+-------------------+---------------------+--------------------+--------------+------------------+\n| Entry6            | ...                 | ...                | ...          | EntryN           |\n+-------------------+---------------------+--------------------+--------------+------------------+\n| Block Meta(contains list of offsets used| Block Meta Size    | Block        | Checksum Size    |\n| to perform binary search in the block)  | (4 Bytes)          | Checksum     | (4 Bytes)        |\n+-----------------------------------------+--------------------+--------------+------------------+\n*/\n// In case the data is encrypted, the \"IV\" is added to the end of the block.\nfunc (b *Builder) finishBlock() {\n\tif len(b.curBlock.entryOffsets) == 0 {\n\t\treturn\n\t}\n\t// Append the entryOffsets and its length.\n\tb.append(y.U32SliceToBytes(b.curBlock.entryOffsets))\n\tb.append(y.U32ToBytes(uint32(len(b.curBlock.entryOffsets))))\n\n\tchecksum := b.calculateChecksum(b.curBlock.data[:b.curBlock.end])\n\n\t// Append the block checksum and its length.\n\tb.append(checksum)\n\tb.append(y.U32ToBytes(uint32(len(checksum))))\n\n\tb.blockList = append(b.blockList, b.curBlock)\n\tb.uncompressedSize.Add(uint32(b.curBlock.end))\n\n\t// Add length of baseKey (rounded to next multiple of 4 because of alignment).\n\t// Add another 40 Bytes, these additional 40 bytes consists of\n\t// 12 bytes of metadata of flatbuffer\n\t// 8 bytes for Key in flat buffer\n\t// 8 bytes for offset\n\t// 8 bytes for the len\n\t// 4 bytes for the size of slice while SliceAllocate\n\tb.lenOffsets += uint32(int(math.Ceil(float64(len(b.curBlock.baseKey))/4))*4) + 40\n\n\t// If 
compression/encryption is enabled, we need to send the block to the blockChan.\n\tif b.blockChan != nil {\n\t\tb.blockChan <- b.curBlock\n\t}\n}\n\nfunc (b *Builder) shouldFinishBlock(key []byte, value y.ValueStruct) bool {\n\t// If there is no entry till now, we will return false.\n\tif len(b.curBlock.entryOffsets) <= 0 {\n\t\treturn false\n\t}\n\n\t// Integer overflow check for statements below.\n\ty.AssertTrue((uint32(len(b.curBlock.entryOffsets))+1)*4+4+8+4 < math.MaxUint32)\n\t// We should include current entry also in size, that's why +1 to len(b.entryOffsets).\n\tentriesOffsetsSize := uint32((len(b.curBlock.entryOffsets)+1)*4 +\n\t\t4 + // size of list\n\t\t8 + // Sum64 in checksum proto\n\t\t4) // checksum length\n\testimatedSize := uint32(b.curBlock.end) + uint32(6 /*header size for entry*/) +\n\t\tuint32(len(key)) + value.EncodedSize() + entriesOffsetsSize\n\n\tif b.shouldEncrypt() {\n\t\t// IV is added at the end of the block, while encrypting.\n\t\t// So, size of IV is added to estimatedSize.\n\t\testimatedSize += aes.BlockSize\n\t}\n\n\t// Integer overflow check for table size.\n\ty.AssertTrue(uint64(b.curBlock.end)+uint64(estimatedSize) < math.MaxUint32)\n\n\treturn estimatedSize > uint32(b.opts.BlockSize)\n}\n\n// AddStaleKey is the same as the Add function, but it also increments the internal\n// staleDataSize counter. 
This value will be used to prioritize this table for\n// compaction.\nfunc (b *Builder) AddStaleKey(key []byte, v y.ValueStruct, valueLen uint32) {\n\t// Rough estimate based on how much space it will occupy in the SST.\n\tb.staleDataSize += len(key) + len(v.Value) + 4 /* entry offset */ + 4 /* header size */\n\tb.addInternal(key, v, valueLen, true)\n}\n\n// Add adds a key-value pair to the block.\nfunc (b *Builder) Add(key []byte, value y.ValueStruct, valueLen uint32) {\n\tb.addInternal(key, value, valueLen, false)\n}\n\nfunc (b *Builder) addInternal(key []byte, value y.ValueStruct, valueLen uint32, isStale bool) {\n\tif b.shouldFinishBlock(key, value) {\n\t\tif isStale {\n\t\t\t// This key will be added to tableIndex and it is stale.\n\t\t\tb.staleDataSize += len(key) + 4 /* len */ + 4 /* offset */\n\t\t}\n\t\tb.finishBlock()\n\t\t// Create a new block and start writing.\n\t\tb.curBlock = &bblock{\n\t\t\tdata: b.alloc.Allocate(b.opts.BlockSize + padding),\n\t\t}\n\t}\n\tb.addHelper(key, value, valueLen)\n}\n\n// TODO: vvv this was the comment on ReachedCapacity.\n// FinalSize returns the *rough* final size of the array, counting the header which is\n// not yet written.\n// TODO: Look into why there is a discrepancy. I suspect it is because of Write(empty, empty)\n// at the end. The diff can vary.\n\n// ReachedCapacity returns true if we... roughly (?) 
reached capacity?\nfunc (b *Builder) ReachedCapacity() bool {\n\t// If encryption/compression is enabled then use the compressed size.\n\tsumBlockSizes := b.compressedSize.Load()\n\tif b.opts.Compression == options.None && b.opts.DataKey == nil {\n\t\tsumBlockSizes = b.uncompressedSize.Load()\n\t}\n\tblocksSize := sumBlockSizes + // actual length of current buffer\n\t\tuint32(len(b.curBlock.entryOffsets)*4) + // all entry offsets size\n\t\t4 + // count of all entry offsets\n\t\t8 + // checksum bytes\n\t\t4 // checksum length\n\n\testimateSz := blocksSize +\n\t\t4 + // Index length\n\t\tb.lenOffsets\n\n\treturn uint64(estimateSz) > b.opts.tableCapacity\n}\n\n// Finish finishes the table by appending the index.\n/*\nThe table structure looks like\n+---------+------------+-----------+---------------+\n| Block 1 | Block 2    | Block 3   | Block 4       |\n+---------+------------+-----------+---------------+\n| Block 5 | Block 6    | Block ... | Block N       |\n+---------+------------+-----------+---------------+\n| Index   | Index Size | Checksum  | Checksum Size |\n+---------+------------+-----------+---------------+\n*/\n// In case the data is encrypted, the \"IV\" is added to the end of the index.\nfunc (b *Builder) Finish() []byte {\n\tbd := b.Done()\n\tbuf := make([]byte, bd.Size)\n\twritten := bd.Copy(buf)\n\ty.AssertTrue(written == len(buf))\n\treturn buf\n}\n\ntype buildData struct {\n\tblockList []*bblock\n\tindex     []byte\n\tchecksum  []byte\n\tSize      int\n\talloc     *z.Allocator\n}\n\nfunc (bd *buildData) Copy(dst []byte) int {\n\tvar written int\n\tfor _, bl := range bd.blockList {\n\t\twritten += copy(dst[written:], bl.data[:bl.end])\n\t}\n\twritten += copy(dst[written:], bd.index)\n\twritten += copy(dst[written:], y.U32ToBytes(uint32(len(bd.index))))\n\n\twritten += copy(dst[written:], bd.checksum)\n\twritten += copy(dst[written:], y.U32ToBytes(uint32(len(bd.checksum))))\n\treturn written\n}\n\nfunc (b *Builder) Done() buildData 
{\n\tb.finishBlock() // This will never start a new block.\n\tif b.blockChan != nil {\n\t\tclose(b.blockChan)\n\t}\n\t// Wait for block handler to finish.\n\tb.wg.Wait()\n\n\tif len(b.blockList) == 0 {\n\t\treturn buildData{}\n\t}\n\tbd := buildData{\n\t\tblockList: b.blockList,\n\t\talloc:     b.alloc,\n\t}\n\n\tvar f y.Filter\n\tif b.opts.BloomFalsePositive > 0 {\n\t\tbits := y.BloomBitsPerKey(len(b.keyHashes), b.opts.BloomFalsePositive)\n\t\tf = y.NewFilter(b.keyHashes, bits)\n\t}\n\tindex, dataSize := b.buildIndex(f)\n\n\tvar err error\n\tif b.shouldEncrypt() {\n\t\tindex, err = b.encrypt(index)\n\t\ty.Check(err)\n\t}\n\tchecksum := b.calculateChecksum(index)\n\n\tbd.index = index\n\tbd.checksum = checksum\n\tbd.Size = int(dataSize) + len(index) + len(checksum) + 4 + 4\n\treturn bd\n}\n\nfunc (b *Builder) calculateChecksum(data []byte) []byte {\n\t// Build the checksum for the given data (a block or the index).\n\tchecksum := pb.Checksum{\n\t\t// TODO: The checksum type should be configurable from the\n\t\t// options.\n\t\t// We chose to use CRC32 as the default option because\n\t\t// it performed better compared to xxHash64.\n\t\t// See the BenchmarkChecksum in table_test.go file\n\t\t// Size     =>   1024 B        2048 B\n\t\t// CRC32    => 63.7 ns/op     112 ns/op\n\t\t// xxHash64 => 87.5 ns/op     158 ns/op\n\t\tSum:  y.CalculateChecksum(data, pb.Checksum_CRC32C),\n\t\tAlgo: pb.Checksum_CRC32C,\n\t}\n\n\t// Marshal the checksum.\n\tchksum, err := proto.Marshal(&checksum)\n\ty.Check(err)\n\t// The caller is responsible for appending the checksum size.\n\treturn chksum\n}\n\n// DataKey returns datakey of the builder.\nfunc (b *Builder) DataKey() *pb.DataKey {\n\treturn b.opts.DataKey\n}\n\nfunc (b *Builder) Opts() *Options {\n\treturn b.opts\n}\n\n// encrypt encrypts the given data and appends the IV to the end of the encrypted data.\n// It should be called only after checking the shouldEncrypt method.\nfunc (b *Builder) encrypt(data []byte) ([]byte, error) {\n\tiv, err := y.GenerateIV()\n\tif err != nil {\n\t\treturn 
data, y.Wrapf(err, \"Error while generating IV in Builder.encrypt\")\n\t}\n\tneedSz := len(data) + len(iv)\n\tdst := b.alloc.Allocate(needSz)\n\n\tif err = y.XORBlock(dst[:len(data)], data, b.DataKey().Data, iv); err != nil {\n\t\treturn data, y.Wrapf(err, \"Error while encrypting in Builder.encrypt\")\n\t}\n\n\ty.AssertTrue(len(iv) == copy(dst[len(data):], iv))\n\treturn dst, nil\n}\n\n// shouldEncrypt tells us whether to encrypt the data or not.\n// We encrypt only if the data key exists. Otherwise, not.\nfunc (b *Builder) shouldEncrypt() bool {\n\treturn b.opts.DataKey != nil\n}\n\n// compressData compresses the given data.\nfunc (b *Builder) compressData(data []byte) ([]byte, error) {\n\tswitch b.opts.Compression {\n\tcase options.None:\n\t\treturn data, nil\n\tcase options.Snappy:\n\t\tsz := s2.MaxEncodedLen(len(data))\n\t\tdst := b.alloc.Allocate(sz)\n\t\treturn s2.EncodeSnappy(dst, data), nil\n\tcase options.ZSTD:\n\t\tsz := y.ZSTDCompressBound(len(data))\n\t\tdst := b.alloc.Allocate(sz)\n\t\treturn y.ZSTDCompress(dst, data, b.opts.ZSTDCompressionLevel)\n\t}\n\treturn nil, errors.New(\"Unsupported compression type\")\n}\n\nfunc (b *Builder) buildIndex(bloom []byte) ([]byte, uint32) {\n\tbuilder := fbs.NewBuilder(3 << 20)\n\n\tboList, dataSize := b.writeBlockOffsets(builder)\n\t// Write the block offset vector to the builder.\n\tfb.TableIndexStartOffsetsVector(builder, len(boList))\n\n\t// Write individual block offsets in reverse order to work around how Flatbuffers expects it.\n\tfor i := len(boList) - 1; i >= 0; i-- {\n\t\tbuilder.PrependUOffsetT(boList[i])\n\t}\n\tboEnd := builder.EndVector(len(boList))\n\n\tvar bfoff fbs.UOffsetT\n\t// Write the bloom filter.\n\tif len(bloom) > 0 {\n\t\tbfoff = builder.CreateByteVector(bloom)\n\t}\n\tb.onDiskSize += dataSize\n\tfb.TableIndexStart(builder)\n\tfb.TableIndexAddOffsets(builder, boEnd)\n\tfb.TableIndexAddBloomFilter(builder, bfoff)\n\tfb.TableIndexAddMaxVersion(builder, 
b.maxVersion)\n\tfb.TableIndexAddUncompressedSize(builder, b.uncompressedSize.Load())\n\tfb.TableIndexAddKeyCount(builder, uint32(len(b.keyHashes)))\n\tfb.TableIndexAddOnDiskSize(builder, b.onDiskSize)\n\tfb.TableIndexAddStaleDataSize(builder, uint32(b.staleDataSize))\n\tbuilder.Finish(fb.TableIndexEnd(builder))\n\n\tbuf := builder.FinishedBytes()\n\tindex := fb.GetRootAsTableIndex(buf, 0)\n\t// Mutate the ondisk size to include the size of the index as well.\n\ty.AssertTrue(index.MutateOnDiskSize(index.OnDiskSize() + uint32(len(buf))))\n\treturn buf, dataSize\n}\n\n// writeBlockOffsets writes a block offset for every block in b.blockList and returns the\n// offsets of the newly written items, along with the total size of the blocks.\nfunc (b *Builder) writeBlockOffsets(builder *fbs.Builder) ([]fbs.UOffsetT, uint32) {\n\tvar startOffset uint32\n\tvar uoffs []fbs.UOffsetT\n\tfor _, bl := range b.blockList {\n\t\tuoff := b.writeBlockOffset(builder, bl, startOffset)\n\t\tuoffs = append(uoffs, uoff)\n\t\tstartOffset += uint32(bl.end)\n\t}\n\treturn uoffs, startOffset\n}\n\n// writeBlockOffset writes the given key,offset,len triple to the index builder.\n// It returns the offset of the newly written block offset.\nfunc (b *Builder) writeBlockOffset(\n\tbuilder *fbs.Builder, bl *bblock, startOffset uint32) fbs.UOffsetT {\n\t// Write the key to the buffer.\n\tk := builder.CreateByteVector(bl.baseKey)\n\n\t// Build the blockOffset.\n\tfb.BlockOffsetStart(builder)\n\tfb.BlockOffsetAddKey(builder, k)\n\tfb.BlockOffsetAddOffset(builder, startOffset)\n\tfb.BlockOffsetAddLen(builder, uint32(bl.end))\n\treturn fb.BlockOffsetEnd(builder)\n}\n"
  },
  {
    "path": "table/builder_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage table\n\nimport (\n\t\"fmt\"\n\t\"math/rand\"\n\t\"os\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/fb\"\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2\"\n)\n\nfunc TestTableIndex(t *testing.T) {\n\trand.Seed(time.Now().Unix())\n\tkeysCount := 100000\n\tkey := make([]byte, 32)\n\t_, err := rand.Read(key)\n\trequire.NoError(t, err)\n\tcache, err := ristretto.NewCache[uint64, *fb.TableIndex](&ristretto.Config[uint64, *fb.TableIndex]{\n\t\tNumCounters: 1000,\n\t\tMaxCost:     1 << 20,\n\t\tBufferItems: 64,\n\t})\n\trequire.NoError(t, err)\n\tsubTest := []struct {\n\t\tname string\n\t\topts Options\n\t}{\n\t\t{\n\t\t\tname: \"No encryption/compression\",\n\t\t\topts: Options{\n\t\t\t\tBlockSize:          4 * 1024,\n\t\t\t\tBloomFalsePositive: 0.01,\n\t\t\t\tTableSize:          30 << 20,\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\t// Encryption mode.\n\t\t\tname: \"Only encryption\",\n\t\t\topts: Options{\n\t\t\t\tBlockSize:          4 * 1024,\n\t\t\t\tBloomFalsePositive: 0.01,\n\t\t\t\tTableSize:          30 << 20,\n\t\t\t\tDataKey:            &pb.DataKey{Data: key},\n\t\t\t\tIndexCache:         cache,\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\t// Compression mode.\n\t\t\tname: \"Only compression\",\n\t\t\topts: Options{\n\t\t\t\tBlockSize:            4 * 1024,\n\t\t\t\tBloomFalsePositive:   0.01,\n\t\t\t\tTableSize:            30 << 20,\n\t\t\t\tCompression:          options.ZSTD,\n\t\t\t\tZSTDCompressionLevel: 3,\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\t// Compression mode and encryption.\n\t\t\tname: \"Compression and encryption\",\n\t\t\topts: Options{\n\t\t\t\tBlockSize:            4 * 1024,\n\t\t\t\tBloomFalsePositive:   0.01,\n\t\t\t\tTableSize:            30 << 
20,\n\t\t\t\tCompression:          options.ZSTD,\n\t\t\t\tZSTDCompressionLevel: 3,\n\t\t\t\tDataKey:              &pb.DataKey{Data: key},\n\t\t\t\tIndexCache:           cache,\n\t\t\t},\n\t\t},\n\t}\n\n\tfor _, tt := range subTest {\n\t\tt.Run(tt.name, func(t *testing.T) {\n\t\t\topt := tt.opts\n\t\t\tbuilder := NewTableBuilder(opt)\n\t\t\tdefer builder.Close()\n\t\t\tfilename := fmt.Sprintf(\"%s%c%d.sst\", os.TempDir(), os.PathSeparator, rand.Uint32())\n\n\t\t\tblockFirstKeys := make([][]byte, 0)\n\t\t\tblockCount := 0\n\t\t\tfor i := 0; i < keysCount; i++ {\n\t\t\t\tk := y.KeyWithTs([]byte(fmt.Sprintf(\"%016x\", i)), uint64(i+1))\n\t\t\t\tv := fmt.Sprintf(\"%d\", i)\n\t\t\t\tvs := y.ValueStruct{Value: []byte(v)}\n\t\t\t\tif i == 0 { // This is first key for first block.\n\t\t\t\t\tblockFirstKeys = append(blockFirstKeys, k)\n\t\t\t\t\tblockCount = 1\n\t\t\t\t} else if builder.shouldFinishBlock(k, vs) {\n\t\t\t\t\tblockCount++\n\t\t\t\t\tblockFirstKeys = append(blockFirstKeys, k)\n\t\t\t\t}\n\t\t\t\tbuilder.Add(k, vs, 0)\n\t\t\t}\n\t\t\ttbl, err := CreateTable(filename, builder)\n\t\t\trequire.NoError(t, err, \"unable to open table\")\n\n\t\t\tif opt.DataKey == nil {\n\t\t\t\t// key id is zero if there is no datakey.\n\t\t\t\trequire.Equal(t, tbl.KeyID(), uint64(0))\n\t\t\t}\n\n\t\t\t// Ensure index is built correctly\n\t\t\trequire.Equal(t, blockCount, tbl.offsetsLength())\n\t\t\tidx, err := tbl.readTableIndex()\n\t\t\trequire.NoError(t, err)\n\t\t\tfor i := 0; i < idx.OffsetsLength(); i++ {\n\t\t\t\tvar bo fb.BlockOffset\n\t\t\t\trequire.True(t, idx.Offsets(&bo, i))\n\t\t\t\trequire.Equal(t, blockFirstKeys[i], bo.KeyBytes())\n\t\t\t}\n\t\t\trequire.Equal(t, keysCount, int(tbl.MaxVersion()))\n\t\t\ttbl.Close(-1)\n\t\t\trequire.NoError(t, os.RemoveAll(filename))\n\t\t})\n\t}\n}\n\nfunc TestInvalidCompression(t *testing.T) {\n\tkeyPrefix := \"key\"\n\topts := Options{BlockSize: 4 << 10, Compression: options.ZSTD}\n\ttbl := buildTestTable(t, keyPrefix, 1000, 
opts)\n\tdefer func() { require.NoError(t, tbl.DecrRef()) }()\n\tmf := tbl.MmapFile\n\tt.Run(\"with correct decompression algo\", func(t *testing.T) {\n\t\t_, err := OpenTable(mf, opts)\n\t\trequire.NoError(t, err)\n\t})\n\tt.Run(\"with incorrect decompression algo\", func(t *testing.T) {\n\t\t// Set incorrect compression algorithm.\n\t\topts.Compression = options.Snappy\n\t\t_, err := OpenTable(mf, opts)\n\t\trequire.Error(t, err)\n\t})\n}\n\nfunc BenchmarkBuilder(b *testing.B) {\n\trand.Seed(time.Now().Unix())\n\tkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%032d\", i))\n\t}\n\n\tval := make([]byte, 32)\n\trand.Read(val)\n\tvs := y.ValueStruct{Value: val}\n\n\tkeysCount := 1300000 // This number of entries consumes ~64MB of memory.\n\n\tvar keyList [][]byte\n\tfor i := 0; i < keysCount; i++ {\n\t\tkeyList = append(keyList, key(i))\n\t}\n\tbench := func(b *testing.B, opt *Options) {\n\t\tb.SetBytes(int64(keysCount) * (32 + 32))\n\t\topt.BlockSize = 4 * 1024\n\t\topt.BloomFalsePositive = 0.01\n\t\topt.TableSize = 5 << 20\n\n\t\tb.ResetTimer()\n\t\tb.ReportAllocs()\n\t\tfor i := 0; i < b.N; i++ {\n\t\t\tbuilder := NewTableBuilder(*opt)\n\t\t\tfor j := 0; j < keysCount; j++ {\n\t\t\t\tbuilder.Add(keyList[j], vs, 0)\n\t\t\t}\n\n\t\t\t_ = builder.Finish()\n\t\t\tbuilder.Close()\n\t\t}\n\t}\n\n\tb.Run(\"no compression\", func(b *testing.B) {\n\t\tvar opt Options\n\t\topt.Compression = options.None\n\t\tbench(b, &opt)\n\t})\n\tb.Run(\"encryption\", func(b *testing.B) {\n\t\tvar opt Options\n\t\tcache, err := ristretto.NewCache(&ristretto.Config[uint64, *fb.TableIndex]{\n\t\t\tNumCounters: 1000,\n\t\t\tMaxCost:     1 << 20,\n\t\t\tBufferItems: 64,\n\t\t})\n\t\trequire.NoError(b, err)\n\t\topt.IndexCache = cache\n\t\tkey := make([]byte, 32)\n\t\trand.Read(key)\n\t\topt.DataKey = &pb.DataKey{Data: key}\n\t\tbench(b, &opt)\n\t})\n\tb.Run(\"snappy compression\", func(b *testing.B) {\n\t\tvar opt Options\n\t\topt.Compression = options.Snappy\n\t\tbench(b, 
&opt)\n\t})\n\tb.Run(\"zstd compression\", func(b *testing.B) {\n\t\tvar opt Options\n\t\topt.Compression = options.ZSTD\n\t\tb.Run(\"level 1\", func(b *testing.B) {\n\t\t\topt.ZSTDCompressionLevel = 1\n\t\t\tbench(b, &opt)\n\t\t})\n\t\tb.Run(\"level 3\", func(b *testing.B) {\n\t\t\topt.ZSTDCompressionLevel = 3\n\t\t\tbench(b, &opt)\n\t\t})\n\t\tb.Run(\"level 15\", func(b *testing.B) {\n\t\t\topt.ZSTDCompressionLevel = 15\n\t\t\tbench(b, &opt)\n\t\t})\n\t})\n}\n\nfunc TestBloomfilter(t *testing.T) {\n\tkeyPrefix := \"p\"\n\tkeyCount := 1000\n\n\tcreateAndTest := func(t *testing.T, withBlooms bool) {\n\t\topts := Options{\n\t\t\tBloomFalsePositive: 0.0,\n\t\t}\n\t\tif withBlooms {\n\t\t\topts.BloomFalsePositive = 0.01\n\t\t}\n\t\ttab := buildTestTable(t, keyPrefix, keyCount, opts)\n\t\tdefer func() { require.NoError(t, tab.DecrRef()) }()\n\t\trequire.Equal(t, withBlooms, tab.hasBloomFilter)\n\t\t// Forward iteration\n\t\tit := tab.NewIterator(0)\n\t\tc := 0\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\tc++\n\t\t\thash := y.Hash(y.ParseKey(it.Key()))\n\t\t\trequire.False(t, tab.DoesNotHave(hash))\n\t\t}\n\t\trequire.Equal(t, keyCount, c)\n\n\t\t// Backward iteration\n\t\tit = tab.NewIterator(REVERSED)\n\t\tc = 0\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\tc++\n\t\t\thash := y.Hash(y.ParseKey(it.Key()))\n\t\t\trequire.False(t, tab.DoesNotHave(hash))\n\t\t}\n\t\trequire.Equal(t, keyCount, c)\n\n\t\t// Ensure tab.DoesNotHave works\n\t\thash := y.Hash([]byte(\"foo\"))\n\t\trequire.Equal(t, withBlooms, tab.DoesNotHave(hash))\n\t}\n\n\tt.Run(\"build with bloom filter\", func(t *testing.T) {\n\t\tcreateAndTest(t, true)\n\t})\n\tt.Run(\"build without bloom filter\", func(t *testing.T) {\n\t\tcreateAndTest(t, false)\n\t})\n}\nfunc TestEmptyBuilder(t *testing.T) {\n\topts := Options{BloomFalsePositive: 0.1}\n\tb := NewTableBuilder(opts)\n\tdefer b.Close()\n\trequire.Equal(t, []byte{}, b.Finish())\n\n}\n"
  },
  {
    "path": "table/iterator.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage table\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n\t\"io\"\n\t\"sort\"\n\n\t\"github.com/dgraph-io/badger/v4/fb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\ntype blockIterator struct {\n\tdata         []byte\n\tidx          int // Idx of the entry inside a block\n\terr          error\n\tbaseKey      []byte\n\tkey          []byte\n\tval          []byte\n\tentryOffsets []uint32\n\tblock        *Block\n\n\ttableID uint64\n\tblockID int\n\t// prevOverlap stores the overlap of the previous key with the base key.\n\t// This avoids unnecessary copy of base key when the overlap is same for multiple keys.\n\tprevOverlap uint16\n}\n\nfunc (itr *blockIterator) setBlock(b *Block) {\n\t// Decrement the ref for the old block. If the old block was compressed, we\n\t// might be able to reuse it.\n\titr.block.decrRef()\n\n\titr.block = b\n\titr.err = nil\n\titr.idx = 0\n\titr.baseKey = itr.baseKey[:0]\n\titr.prevOverlap = 0\n\titr.key = itr.key[:0]\n\titr.val = itr.val[:0]\n\t// Drop the index from the block. 
We don't need it anymore.\n\titr.data = b.data[:b.entriesIndexStart]\n\titr.entryOffsets = b.entryOffsets\n}\n\n// setIdx sets the iterator to the entry at index i and sets its key and value.\nfunc (itr *blockIterator) setIdx(i int) {\n\titr.idx = i\n\tif i >= len(itr.entryOffsets) || i < 0 {\n\t\titr.err = io.EOF\n\t\treturn\n\t}\n\titr.err = nil\n\tstartOffset := int(itr.entryOffsets[i])\n\n\t// Set base key.\n\tif len(itr.baseKey) == 0 {\n\t\tvar baseHeader header\n\t\tbaseHeader.Decode(itr.data)\n\t\titr.baseKey = itr.data[headerSize : headerSize+baseHeader.diff]\n\t}\n\n\tvar endOffset int\n\t// idx points to the last entry in the block.\n\tif itr.idx+1 == len(itr.entryOffsets) {\n\t\tendOffset = len(itr.data)\n\t} else {\n\t\t// idx points to some entry other than the last one in the block.\n\t\t// The end offset of the current entry is the start offset of the next entry.\n\t\tendOffset = int(itr.entryOffsets[itr.idx+1])\n\t}\n\tdefer func() {\n\t\tif r := recover(); r != nil {\n\t\t\tvar debugBuf bytes.Buffer\n\t\t\tfmt.Fprintf(&debugBuf, \"==== Recovered====\\n\")\n\t\t\tfmt.Fprintf(&debugBuf, \"Table ID: %d\\nBlock ID: %d\\nEntry Idx: %d\\nData len: %d\\n\"+\n\t\t\t\t\"StartOffset: %d\\nEndOffset: %d\\nEntryOffsets len: %d\\nEntryOffsets: %v\\n\",\n\t\t\t\titr.tableID, itr.blockID, itr.idx, len(itr.data), startOffset, endOffset,\n\t\t\t\tlen(itr.entryOffsets), itr.entryOffsets)\n\t\t\tpanic(debugBuf.String())\n\t\t}\n\t}()\n\n\tentryData := itr.data[startOffset:endOffset]\n\tvar h header\n\th.Decode(entryData)\n\t// Header contains the length of key overlap and difference compared to the base key. If the key\n\t// before this one had the same or better key overlap, we can avoid copying that part into\n\t// itr.key. 
But, if the overlap was lesser, we could copy over just that portion.\n\tif h.overlap > itr.prevOverlap {\n\t\titr.key = append(itr.key[:itr.prevOverlap], itr.baseKey[itr.prevOverlap:h.overlap]...)\n\t}\n\titr.prevOverlap = h.overlap\n\tvalueOff := headerSize + h.diff\n\tdiffKey := entryData[headerSize:valueOff]\n\titr.key = append(itr.key[:h.overlap], diffKey...)\n\titr.val = entryData[valueOff:]\n}\n\nfunc (itr *blockIterator) Valid() bool {\n\treturn itr != nil && itr.err == nil\n}\n\nfunc (itr *blockIterator) Error() error {\n\treturn itr.err\n}\n\nfunc (itr *blockIterator) Close() {\n\titr.block.decrRef()\n}\n\nvar (\n\torigin  = 0\n\tcurrent = 1\n)\n\n// seek brings us to the first block element that is >= input key.\nfunc (itr *blockIterator) seek(key []byte, whence int) {\n\titr.err = nil\n\tstartIndex := 0 // This tells from which index we should start binary search.\n\n\tswitch whence {\n\tcase origin:\n\t\t// We don't need to do anything. startIndex is already at 0\n\tcase current:\n\t\tstartIndex = itr.idx\n\t}\n\n\tfoundEntryIdx := sort.Search(len(itr.entryOffsets), func(idx int) bool {\n\t\t// If idx is less than start index then just return false.\n\t\tif idx < startIndex {\n\t\t\treturn false\n\t\t}\n\t\titr.setIdx(idx)\n\t\treturn y.CompareKeys(itr.key, key) >= 0\n\t})\n\titr.setIdx(foundEntryIdx)\n}\n\n// seekToFirst brings us to the first element.\nfunc (itr *blockIterator) seekToFirst() {\n\titr.setIdx(0)\n}\n\n// seekToLast brings us to the last element.\nfunc (itr *blockIterator) seekToLast() {\n\titr.setIdx(len(itr.entryOffsets) - 1)\n}\n\nfunc (itr *blockIterator) next() {\n\titr.setIdx(itr.idx + 1)\n}\n\nfunc (itr *blockIterator) prev() {\n\titr.setIdx(itr.idx - 1)\n}\n\n// Iterator is an iterator for a Table.\ntype Iterator struct {\n\tt    *Table\n\tbpos int\n\tbi   blockIterator\n\terr  error\n\n\t// Internally, Iterator is bidirectional. 
However, we only expose the\n\t// unidirectional functionality for now.\n\topt int // Valid options are REVERSED and NOCACHE.\n}\n\n// NewIterator returns a new iterator of the Table\nfunc (t *Table) NewIterator(opt int) *Iterator {\n\tt.IncrRef() // Important.\n\tti := &Iterator{t: t, opt: opt}\n\treturn ti\n}\n\n// Close closes the iterator (and it must be called).\nfunc (itr *Iterator) Close() error {\n\titr.bi.Close()\n\treturn itr.t.DecrRef()\n}\n\nfunc (itr *Iterator) reset() {\n\titr.bpos = 0\n\titr.err = nil\n}\n\n// Valid follows the y.Iterator interface\nfunc (itr *Iterator) Valid() bool {\n\treturn itr.err == nil\n}\n\nfunc (itr *Iterator) useCache() bool {\n\treturn itr.opt&NOCACHE == 0\n}\n\nfunc (itr *Iterator) seekToFirst() {\n\tnumBlocks := itr.t.offsetsLength()\n\tif numBlocks == 0 {\n\t\titr.err = io.EOF\n\t\treturn\n\t}\n\titr.bpos = 0\n\tblock, err := itr.t.block(itr.bpos, itr.useCache())\n\tif err != nil {\n\t\titr.err = err\n\t\treturn\n\t}\n\titr.bi.tableID = itr.t.id\n\titr.bi.blockID = itr.bpos\n\titr.bi.setBlock(block)\n\titr.bi.seekToFirst()\n\titr.err = itr.bi.Error()\n}\n\nfunc (itr *Iterator) seekToLast() {\n\tnumBlocks := itr.t.offsetsLength()\n\tif numBlocks == 0 {\n\t\titr.err = io.EOF\n\t\treturn\n\t}\n\titr.bpos = numBlocks - 1\n\tblock, err := itr.t.block(itr.bpos, itr.useCache())\n\tif err != nil {\n\t\titr.err = err\n\t\treturn\n\t}\n\titr.bi.tableID = itr.t.id\n\titr.bi.blockID = itr.bpos\n\titr.bi.setBlock(block)\n\titr.bi.seekToLast()\n\titr.err = itr.bi.Error()\n}\n\nfunc (itr *Iterator) seekHelper(blockIdx int, key []byte) {\n\titr.bpos = blockIdx\n\tblock, err := itr.t.block(blockIdx, itr.useCache())\n\tif err != nil {\n\t\titr.err = err\n\t\treturn\n\t}\n\titr.bi.tableID = itr.t.id\n\titr.bi.blockID = itr.bpos\n\titr.bi.setBlock(block)\n\titr.bi.seek(key, origin)\n\titr.err = itr.bi.Error()\n}\n\n// seekFrom brings us to a key that is >= input key.\nfunc (itr *Iterator) seekFrom(key []byte, whence int) {\n\titr.err = 
nil\n\tswitch whence {\n\tcase origin:\n\t\titr.reset()\n\tcase current:\n\t}\n\n\tvar ko fb.BlockOffset\n\tidx := sort.Search(itr.t.offsetsLength(), func(idx int) bool {\n\t\t// Offsets should never return false since we're iterating within the OffsetsLength.\n\t\ty.AssertTrue(itr.t.offsets(&ko, idx))\n\t\treturn y.CompareKeys(ko.KeyBytes(), key) > 0\n\t})\n\tif idx == 0 {\n\t\t// The smallest key in our table is already strictly > key. We can return that.\n\t\t// This is like a SeekToFirst.\n\t\titr.seekHelper(0, key)\n\t\treturn\n\t}\n\n\t// block[idx].smallest is > key.\n\t// Since idx>0, we know block[idx-1].smallest is <= key.\n\t// There are two cases.\n\t// 1) Everything in block[idx-1] is strictly < key. In this case, we should go to the first\n\t//    element of block[idx].\n\t// 2) Some element in block[idx-1] is >= key. We should go to that element.\n\titr.seekHelper(idx-1, key)\n\tif itr.err == io.EOF {\n\t\t// Case 1. Need to visit block[idx].\n\t\tif idx == itr.t.offsetsLength() {\n\t\t\t// If idx == len(itr.t.blockIndex), then input key is greater than ANY element of table.\n\t\t\t// There's nothing we can do. Valid() should return false as we seek to end of table.\n\t\t\treturn\n\t\t}\n\t\t// Since block[idx].smallest is > key. This is essentially a block[idx].SeekToFirst.\n\t\titr.seekHelper(idx, key)\n\t}\n\t// Case 2: No need to do anything. We already did the seek in block[idx-1].\n}\n\n// seek will reset iterator and seek to >= key.\nfunc (itr *Iterator) seek(key []byte) {\n\titr.seekFrom(key, origin)\n}\n\n// seekForPrev will reset iterator and seek to <= key.\nfunc (itr *Iterator) seekForPrev(key []byte) {\n\t// TODO: Optimize this. 
We shouldn't have to take a Prev step.\n\titr.seekFrom(key, origin)\n\tif !bytes.Equal(itr.Key(), key) {\n\t\titr.prev()\n\t}\n}\n\nfunc (itr *Iterator) next() {\n\titr.err = nil\n\n\tif itr.bpos >= itr.t.offsetsLength() {\n\t\titr.err = io.EOF\n\t\treturn\n\t}\n\n\tif len(itr.bi.data) == 0 {\n\t\tblock, err := itr.t.block(itr.bpos, itr.useCache())\n\t\tif err != nil {\n\t\t\titr.err = err\n\t\t\treturn\n\t\t}\n\t\titr.bi.tableID = itr.t.id\n\t\titr.bi.blockID = itr.bpos\n\t\titr.bi.setBlock(block)\n\t\titr.bi.seekToFirst()\n\t\titr.err = itr.bi.Error()\n\t\treturn\n\t}\n\n\titr.bi.next()\n\tif !itr.bi.Valid() {\n\t\titr.bpos++\n\t\titr.bi.data = nil\n\t\titr.next()\n\t\treturn\n\t}\n}\n\nfunc (itr *Iterator) prev() {\n\titr.err = nil\n\tif itr.bpos < 0 {\n\t\titr.err = io.EOF\n\t\treturn\n\t}\n\n\tif len(itr.bi.data) == 0 {\n\t\tblock, err := itr.t.block(itr.bpos, itr.useCache())\n\t\tif err != nil {\n\t\t\titr.err = err\n\t\t\treturn\n\t\t}\n\t\titr.bi.tableID = itr.t.id\n\t\titr.bi.blockID = itr.bpos\n\t\titr.bi.setBlock(block)\n\t\titr.bi.seekToLast()\n\t\titr.err = itr.bi.Error()\n\t\treturn\n\t}\n\n\titr.bi.prev()\n\tif !itr.bi.Valid() {\n\t\titr.bpos--\n\t\titr.bi.data = nil\n\t\titr.prev()\n\t\treturn\n\t}\n}\n\n// Key follows the y.Iterator interface.\n// Returns the key with timestamp.\nfunc (itr *Iterator) Key() []byte {\n\treturn itr.bi.key\n}\n\n// Value follows the y.Iterator interface\nfunc (itr *Iterator) Value() (ret y.ValueStruct) {\n\tret.Decode(itr.bi.val)\n\treturn\n}\n\n// ValueCopy copies the current value and returns it as decoded\n// ValueStruct.\nfunc (itr *Iterator) ValueCopy() (ret y.ValueStruct) {\n\tdst := y.Copy(itr.bi.val)\n\tret.Decode(dst)\n\treturn\n}\n\n// Next follows the y.Iterator interface\nfunc (itr *Iterator) Next() {\n\tif itr.opt&REVERSED == 0 {\n\t\titr.next()\n\t} else {\n\t\titr.prev()\n\t}\n}\n\n// Rewind follows the y.Iterator interface\nfunc (itr *Iterator) Rewind() {\n\tif itr.opt&REVERSED == 0 
{\n\t\titr.seekToFirst()\n\t} else {\n\t\titr.seekToLast()\n\t}\n}\n\n// Seek follows the y.Iterator interface\nfunc (itr *Iterator) Seek(key []byte) {\n\tif itr.opt&REVERSED == 0 {\n\t\titr.seek(key)\n\t} else {\n\t\titr.seekForPrev(key)\n\t}\n}\n\nvar (\n\tREVERSED int = 2\n\tNOCACHE  int = 4\n)\n\n// ConcatIterator concatenates the sequences defined by several iterators.  (It only works with\n// TableIterators, probably just because it's faster to not be so generic.)\ntype ConcatIterator struct {\n\tidx     int // Which iterator is active now.\n\tcur     *Iterator\n\titers   []*Iterator // Corresponds to tables.\n\ttables  []*Table    // Disregarding reversed, this is in ascending order.\n\toptions int         // Valid options are REVERSED and NOCACHE.\n}\n\n// NewConcatIterator creates a new concatenated iterator\nfunc NewConcatIterator(tbls []*Table, opt int) *ConcatIterator {\n\titers := make([]*Iterator, len(tbls))\n\tfor i := 0; i < len(tbls); i++ {\n\t\t// Increment the reference count. 
Since, we're not creating the iterator right now.\n\t\t// Here, We'll hold the reference of the tables, till the lifecycle of the iterator.\n\t\ttbls[i].IncrRef()\n\n\t\t// Save cycles by not initializing the iterators until needed.\n\t\t// iters[i] = tbls[i].NewIterator(reversed)\n\t}\n\treturn &ConcatIterator{\n\t\toptions: opt,\n\t\titers:   iters,\n\t\ttables:  tbls,\n\t\tidx:     -1, // Not really necessary because s.it.Valid()=false, but good to have.\n\t}\n}\n\nfunc (s *ConcatIterator) setIdx(idx int) {\n\ts.idx = idx\n\tif idx < 0 || idx >= len(s.iters) {\n\t\ts.cur = nil\n\t\treturn\n\t}\n\tif s.iters[idx] == nil {\n\t\ts.iters[idx] = s.tables[idx].NewIterator(s.options)\n\t}\n\ts.cur = s.iters[s.idx]\n}\n\n// Rewind implements y.Interface\nfunc (s *ConcatIterator) Rewind() {\n\tif len(s.iters) == 0 {\n\t\treturn\n\t}\n\tif s.options&REVERSED == 0 {\n\t\ts.setIdx(0)\n\t} else {\n\t\ts.setIdx(len(s.iters) - 1)\n\t}\n\ts.cur.Rewind()\n}\n\n// Valid implements y.Interface\nfunc (s *ConcatIterator) Valid() bool {\n\treturn s.cur != nil && s.cur.Valid()\n}\n\n// Key implements y.Interface\nfunc (s *ConcatIterator) Key() []byte {\n\treturn s.cur.Key()\n}\n\n// Value implements y.Interface\nfunc (s *ConcatIterator) Value() y.ValueStruct {\n\treturn s.cur.Value()\n}\n\n// Seek brings us to element >= key if reversed is false. Otherwise, <= key.\nfunc (s *ConcatIterator) Seek(key []byte) {\n\tvar idx int\n\tif s.options&REVERSED == 0 {\n\t\tidx = sort.Search(len(s.tables), func(i int) bool {\n\t\t\treturn y.CompareKeys(s.tables[i].Biggest(), key) >= 0\n\t\t})\n\t} else {\n\t\tn := len(s.tables)\n\t\tidx = n - 1 - sort.Search(n, func(i int) bool {\n\t\t\treturn y.CompareKeys(s.tables[n-1-i].Smallest(), key) <= 0\n\t\t})\n\t}\n\tif idx >= len(s.tables) || idx < 0 {\n\t\ts.setIdx(-1)\n\t\treturn\n\t}\n\t// For reversed=false, we know s.tables[i-1].Biggest() < key. 
Thus, the\n\t// previous table cannot possibly contain key.\n\ts.setIdx(idx)\n\ts.cur.Seek(key)\n}\n\n// Next advances our concat iterator.\nfunc (s *ConcatIterator) Next() {\n\ts.cur.Next()\n\tif s.cur.Valid() {\n\t\t// Nothing to do. Just stay with the current table.\n\t\treturn\n\t}\n\tfor { // In case there are empty tables.\n\t\tif s.options&REVERSED == 0 {\n\t\t\ts.setIdx(s.idx + 1)\n\t\t} else {\n\t\t\ts.setIdx(s.idx - 1)\n\t\t}\n\t\tif s.cur == nil {\n\t\t\t// End of list. Valid will become false.\n\t\t\treturn\n\t\t}\n\t\ts.cur.Rewind()\n\t\tif s.cur.Valid() {\n\t\t\tbreak\n\t\t}\n\t}\n}\n\n// Close implements y.Iterator.\nfunc (s *ConcatIterator) Close() error {\n\tfor _, t := range s.tables {\n\t\t// Dereference the tables while closing the iterator.\n\t\tif err := t.DecrRef(); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\tfor _, it := range s.iters {\n\t\tif it == nil {\n\t\t\tcontinue\n\t\t}\n\t\tif err := it.Close(); err != nil {\n\t\t\treturn y.Wrap(err, \"ConcatIterator\")\n\t\t}\n\t}\n\treturn nil\n}\n"
  },
  {
    "path": "table/merge_iterator.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage table\n\nimport (\n\t\"bytes\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\n// MergeIterator merges multiple iterators.\n// NOTE: MergeIterator owns the array of iterators and is responsible for closing them.\ntype MergeIterator struct {\n\tleft  node\n\tright node\n\tsmall *node\n\n\tcurKey  []byte\n\treverse bool\n}\n\ntype node struct {\n\tvalid bool\n\tkey   []byte\n\titer  y.Iterator\n\n\t// The two iterators are type asserted from `y.Iterator`, used to inline more function calls.\n\t// Calling functions on concrete types is much faster (about 25-30%) than calling the\n\t// interface's function.\n\tmerge  *MergeIterator\n\tconcat *ConcatIterator\n}\n\nfunc (n *node) setIterator(iter y.Iterator) {\n\tn.iter = iter\n\t// It's okay if the type assertion below fails and n.merge/n.concat are set to nil.\n\t// We handle the nil values of merge and concat in all the methods.\n\tn.merge, _ = iter.(*MergeIterator)\n\tn.concat, _ = iter.(*ConcatIterator)\n}\n\nfunc (n *node) setKey() {\n\tswitch {\n\tcase n.merge != nil:\n\t\tn.valid = n.merge.small.valid\n\t\tif n.valid {\n\t\t\tn.key = n.merge.small.key\n\t\t}\n\tcase n.concat != nil:\n\t\tn.valid = n.concat.Valid()\n\t\tif n.valid {\n\t\t\tn.key = n.concat.Key()\n\t\t}\n\tdefault:\n\t\tn.valid = n.iter.Valid()\n\t\tif n.valid {\n\t\t\tn.key = n.iter.Key()\n\t\t}\n\t}\n}\n\nfunc (n *node) next() {\n\tswitch {\n\tcase n.merge != nil:\n\t\tn.merge.Next()\n\tcase n.concat != nil:\n\t\tn.concat.Next()\n\tdefault:\n\t\tn.iter.Next()\n\t}\n\tn.setKey()\n}\n\nfunc (n *node) rewind() {\n\tn.iter.Rewind()\n\tn.setKey()\n}\n\nfunc (n *node) seek(key []byte) {\n\tn.iter.Seek(key)\n\tn.setKey()\n}\n\nfunc (mi *MergeIterator) fix() {\n\tif !mi.bigger().valid {\n\t\treturn\n\t}\n\tif !mi.small.valid {\n\t\tmi.swapSmall()\n\t\treturn\n\t}\n\tcmp := y.CompareKeys(mi.small.key, 
mi.bigger().key)\n\tswitch {\n\tcase cmp == 0: // Both the keys are equal.\n\t\t// In case of same keys, move the right iterator ahead.\n\t\tmi.right.next()\n\t\tif &mi.right == mi.small {\n\t\t\tmi.swapSmall()\n\t\t}\n\t\treturn\n\tcase cmp < 0: // Small is less than bigger().\n\t\tif mi.reverse {\n\t\t\tmi.swapSmall()\n\t\t} else { //nolint:staticcheck\n\t\t\t// we don't need to do anything. Small already points to the smallest.\n\t\t}\n\t\treturn\n\tdefault: // bigger() is less than small.\n\t\tif mi.reverse {\n\t\t\t// Do nothing since we're iterating in reverse. Small currently points to\n\t\t\t// the bigger key and that's okay in reverse iteration.\n\t\t} else {\n\t\t\tmi.swapSmall()\n\t\t}\n\t\treturn\n\t}\n}\n\nfunc (mi *MergeIterator) bigger() *node {\n\tif mi.small == &mi.left {\n\t\treturn &mi.right\n\t}\n\treturn &mi.left\n}\n\nfunc (mi *MergeIterator) swapSmall() {\n\tif mi.small == &mi.left {\n\t\tmi.small = &mi.right\n\t\treturn\n\t}\n\tif mi.small == &mi.right {\n\t\tmi.small = &mi.left\n\t\treturn\n\t}\n}\n\n// Next returns the next element. 
If it is the same as the current key, ignore it.\nfunc (mi *MergeIterator) Next() {\n\tfor mi.Valid() {\n\t\tif !bytes.Equal(mi.small.key, mi.curKey) {\n\t\t\tbreak\n\t\t}\n\t\tmi.small.next()\n\t\tmi.fix()\n\t}\n\tmi.setCurrent()\n}\n\nfunc (mi *MergeIterator) setCurrent() {\n\tmi.curKey = append(mi.curKey[:0], mi.small.key...)\n}\n\n// Rewind seeks to first element (or last element for reverse iterator).\nfunc (mi *MergeIterator) Rewind() {\n\tmi.left.rewind()\n\tmi.right.rewind()\n\tmi.fix()\n\tmi.setCurrent()\n}\n\n// Seek brings us to element with key >= given key.\nfunc (mi *MergeIterator) Seek(key []byte) {\n\tmi.left.seek(key)\n\tmi.right.seek(key)\n\tmi.fix()\n\tmi.setCurrent()\n}\n\n// Valid returns whether the MergeIterator is at a valid element.\nfunc (mi *MergeIterator) Valid() bool {\n\treturn mi.small.valid\n}\n\n// Key returns the key associated with the current iterator.\nfunc (mi *MergeIterator) Key() []byte {\n\treturn mi.small.key\n}\n\n// Value returns the value associated with the iterator.\nfunc (mi *MergeIterator) Value() y.ValueStruct {\n\treturn mi.small.iter.Value()\n}\n\n// Close implements y.Iterator.\nfunc (mi *MergeIterator) Close() error {\n\terr1 := mi.left.iter.Close()\n\terr2 := mi.right.iter.Close()\n\tif err1 != nil {\n\t\treturn y.Wrap(err1, \"MergeIterator\")\n\t}\n\treturn y.Wrap(err2, \"MergeIterator\")\n}\n\n// NewMergeIterator creates a merge iterator.\nfunc NewMergeIterator(iters []y.Iterator, reverse bool) y.Iterator {\n\tswitch len(iters) {\n\tcase 0:\n\t\treturn nil\n\tcase 1:\n\t\treturn iters[0]\n\tcase 2:\n\t\tmi := &MergeIterator{\n\t\t\treverse: reverse,\n\t\t}\n\t\tmi.left.setIterator(iters[0])\n\t\tmi.right.setIterator(iters[1])\n\t\t// Assign left iterator randomly. 
This will be fixed when the user calls Rewind/Seek.\n\t\tmi.small = &mi.left\n\t\treturn mi\n\t}\n\tmid := len(iters) / 2\n\treturn NewMergeIterator(\n\t\t[]y.Iterator{\n\t\t\tNewMergeIterator(iters[:mid], reverse),\n\t\t\tNewMergeIterator(iters[mid:], reverse),\n\t\t}, reverse)\n}\n"
  },
  {
    "path": "table/merge_iterator_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage table\n\nimport (\n\t\"sort\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\ntype SimpleIterator struct {\n\tkeys     [][]byte\n\tvals     [][]byte\n\tidx      int\n\treversed bool\n}\n\nvar (\n\tcloseCount int\n)\n\nfunc (s *SimpleIterator) Close() error { closeCount++; return nil }\n\nfunc (s *SimpleIterator) Next() {\n\tif !s.reversed {\n\t\ts.idx++\n\t} else {\n\t\ts.idx--\n\t}\n}\n\nfunc (s *SimpleIterator) Rewind() {\n\tif !s.reversed {\n\t\ts.idx = 0\n\t} else {\n\t\ts.idx = len(s.keys) - 1\n\t}\n}\n\nfunc (s *SimpleIterator) Seek(key []byte) {\n\tkey = y.KeyWithTs(key, 0)\n\tif !s.reversed {\n\t\ts.idx = sort.Search(len(s.keys), func(i int) bool {\n\t\t\treturn y.CompareKeys(s.keys[i], key) >= 0\n\t\t})\n\t} else {\n\t\tn := len(s.keys)\n\t\ts.idx = n - 1 - sort.Search(n, func(i int) bool {\n\t\t\treturn y.CompareKeys(s.keys[n-1-i], key) <= 0\n\t\t})\n\t}\n}\n\nfunc (s *SimpleIterator) Key() []byte { return s.keys[s.idx] }\nfunc (s *SimpleIterator) Value() y.ValueStruct {\n\treturn y.ValueStruct{\n\t\tValue:    s.vals[s.idx],\n\t\tUserMeta: 55,\n\t\tMeta:     0,\n\t}\n}\nfunc (s *SimpleIterator) Valid() bool {\n\treturn s.idx >= 0 && s.idx < len(s.keys)\n}\n\nvar _ y.Iterator = &SimpleIterator{}\n\nfunc newSimpleIterator(keys []string, vals []string, reversed bool) *SimpleIterator {\n\tk := make([][]byte, len(keys))\n\tv := make([][]byte, len(vals))\n\ty.AssertTrue(len(keys) == len(vals))\n\tfor i := 0; i < len(keys); i++ {\n\t\tk[i] = y.KeyWithTs([]byte(keys[i]), 0)\n\t\tv[i] = []byte(vals[i])\n\t}\n\treturn &SimpleIterator{\n\t\tkeys:     k,\n\t\tvals:     v,\n\t\tidx:      -1,\n\t\treversed: reversed,\n\t}\n}\n\nfunc getAll(it y.Iterator) ([]string, []string) {\n\tvar keys, vals []string\n\tfor ; it.Valid(); it.Next() {\n\t\tk := it.Key()\n\t\tkeys = 
append(keys, string(y.ParseKey(k)))\n\t\tv := it.Value()\n\t\tvals = append(vals, string(v.Value))\n\t}\n\treturn keys, vals\n}\n\nfunc closeAndCheck(t *testing.T, it y.Iterator, expected int) {\n\tcloseCount = 0\n\tit.Close()\n\trequire.EqualValues(t, expected, closeCount)\n}\n\nfunc TestSimpleIterator(t *testing.T) {\n\tkeys := []string{\"1\", \"2\", \"3\"}\n\tvals := []string{\"v1\", \"v2\", \"v3\"}\n\tit := newSimpleIterator(keys, vals, false)\n\tit.Rewind()\n\tk, v := getAll(it)\n\trequire.EqualValues(t, keys, k)\n\trequire.EqualValues(t, vals, v)\n\n\tcloseAndCheck(t, it, 1)\n}\n\nfunc reversed(a []string) []string {\n\tvar out []string\n\tfor i := len(a) - 1; i >= 0; i-- {\n\t\tout = append(out, a[i])\n\t}\n\treturn out\n}\n\nfunc TestMergeSingle(t *testing.T) {\n\tkeys := []string{\"1\", \"2\", \"3\"}\n\tvals := []string{\"v1\", \"v2\", \"v3\"}\n\tit := newSimpleIterator(keys, vals, false)\n\tmergeIt := NewMergeIterator([]y.Iterator{it}, false)\n\tmergeIt.Rewind()\n\tk, v := getAll(mergeIt)\n\trequire.EqualValues(t, keys, k)\n\trequire.EqualValues(t, vals, v)\n\tcloseAndCheck(t, mergeIt, 1)\n}\n\nfunc TestMergeSingleReversed(t *testing.T) {\n\tkeys := []string{\"1\", \"2\", \"3\"}\n\tvals := []string{\"v1\", \"v2\", \"v3\"}\n\tit := newSimpleIterator(keys, vals, true)\n\tmergeIt := NewMergeIterator([]y.Iterator{it}, true)\n\tmergeIt.Rewind()\n\tk, v := getAll(mergeIt)\n\trequire.EqualValues(t, reversed(keys), k)\n\trequire.EqualValues(t, reversed(vals), v)\n\tcloseAndCheck(t, mergeIt, 1)\n}\n\nfunc TestMergeMore(t *testing.T) {\n\tit := newSimpleIterator([]string{\"1\", \"3\", \"7\"}, []string{\"a1\", \"a3\", \"a7\"}, false)\n\tit2 := newSimpleIterator([]string{\"2\", \"3\", \"5\"}, []string{\"b2\", \"b3\", \"b5\"}, false)\n\tit3 := newSimpleIterator([]string{\"1\"}, []string{\"c1\"}, false)\n\tit4 := newSimpleIterator([]string{\"1\", \"7\", \"9\"}, []string{\"d1\", \"d7\", \"d9\"}, false)\n\tt.Run(\"forward\", func(t *testing.T) {\n\t\texpectedKeys := 
[]string{\"1\", \"2\", \"3\", \"5\", \"7\", \"9\"}\n\t\texpectedVals := []string{\"a1\", \"b2\", \"a3\", \"b5\", \"a7\", \"d9\"}\n\t\tt.Run(\"no duplicates\", func(t *testing.T) {\n\t\t\tmergeIt := NewMergeIterator([]y.Iterator{it, it2, it3, it4}, false)\n\t\t\tmergeIt.Rewind()\n\t\t\tk, v := getAll(mergeIt)\n\t\t\trequire.EqualValues(t, expectedKeys, k)\n\t\t\trequire.EqualValues(t, expectedVals, v)\n\t\t\tcloseAndCheck(t, mergeIt, 4)\n\t\t})\n\t\tt.Run(\"duplicates\", func(t *testing.T) {\n\t\t\tit5 := newSimpleIterator(\n\t\t\t\t[]string{\"1\", \"1\", \"3\", \"7\"},\n\t\t\t\t[]string{\"a1\", \"a1-1\", \"a3\", \"a7\"},\n\t\t\t\tfalse)\n\t\t\tmergeIt := NewMergeIterator([]y.Iterator{it5, it2, it3, it4}, false)\n\t\t\texpectedKeys := []string{\"1\", \"2\", \"3\", \"5\", \"7\", \"9\"}\n\t\t\texpectedVals := []string{\"a1\", \"b2\", \"a3\", \"b5\", \"a7\", \"d9\"}\n\t\t\tmergeIt.Rewind()\n\t\t\tk, v := getAll(mergeIt)\n\t\t\trequire.EqualValues(t, expectedKeys, k)\n\t\t\trequire.EqualValues(t, expectedVals, v)\n\t\t\tcloseAndCheck(t, mergeIt, 4)\n\t\t})\n\t})\n\tt.Run(\"reverse\", func(t *testing.T) {\n\t\tit.reversed = true\n\t\tit2.reversed = true\n\t\tit3.reversed = true\n\t\tit4.reversed = true\n\t\tt.Run(\"no duplicates\", func(t *testing.T) {\n\t\t\tmergeIt := NewMergeIterator([]y.Iterator{it, it2, it3, it4}, true)\n\t\t\texpectedKeys := []string{\"9\", \"7\", \"5\", \"3\", \"2\", \"1\"}\n\t\t\texpectedVals := []string{\"d9\", \"a7\", \"b5\", \"a3\", \"b2\", \"a1\"}\n\t\t\tmergeIt.Rewind()\n\t\t\tk, v := getAll(mergeIt)\n\t\t\trequire.EqualValues(t, expectedKeys, k)\n\t\t\trequire.EqualValues(t, expectedVals, v)\n\t\t\tcloseAndCheck(t, mergeIt, 4)\n\t\t})\n\t\tt.Run(\"duplicates\", func(t *testing.T) {\n\t\t\tit5 := newSimpleIterator(\n\t\t\t\t[]string{\"1\", \"1\", \"3\", \"7\"},\n\t\t\t\t[]string{\"a1\", \"a1-1\", \"a3\", \"a7\"},\n\t\t\t\ttrue)\n\t\t\tmergeIt := NewMergeIterator([]y.Iterator{it5, it2, it3, it4}, true)\n\t\t\texpectedKeys := []string{\"9\", 
\"7\", \"5\", \"3\", \"2\", \"1\"}\n\t\t\texpectedVals := []string{\"d9\", \"a7\", \"b5\", \"a3\", \"b2\", \"a1-1\"}\n\t\t\tmergeIt.Rewind()\n\t\t\tk, v := getAll(mergeIt)\n\t\t\trequire.EqualValues(t, expectedKeys, k)\n\t\t\trequire.EqualValues(t, expectedVals, v)\n\t\t\tcloseAndCheck(t, mergeIt, 4)\n\t\t})\n\t})\n}\n\n// Ensure MergeIterator satisfies the Iterator interface\nfunc TestMergeIteratorNested(t *testing.T) {\n\tkeys := []string{\"1\", \"2\", \"3\"}\n\tvals := []string{\"v1\", \"v2\", \"v3\"}\n\tit := newSimpleIterator(keys, vals, false)\n\tmergeIt := NewMergeIterator([]y.Iterator{it}, false)\n\tmergeIt2 := NewMergeIterator([]y.Iterator{mergeIt}, false)\n\tmergeIt2.Rewind()\n\tk, v := getAll(mergeIt2)\n\trequire.EqualValues(t, keys, k)\n\trequire.EqualValues(t, vals, v)\n\tcloseAndCheck(t, mergeIt2, 1)\n}\n\nfunc TestMergeIteratorSeek(t *testing.T) {\n\tit := newSimpleIterator([]string{\"1\", \"3\", \"7\"}, []string{\"a1\", \"a3\", \"a7\"}, false)\n\tit2 := newSimpleIterator([]string{\"2\", \"3\", \"5\"}, []string{\"b2\", \"b3\", \"b5\"}, false)\n\tit3 := newSimpleIterator([]string{\"1\"}, []string{\"c1\"}, false)\n\tit4 := newSimpleIterator([]string{\"1\", \"7\", \"9\"}, []string{\"d1\", \"d7\", \"d9\"}, false)\n\tmergeIt := NewMergeIterator([]y.Iterator{it, it2, it3, it4}, false)\n\tmergeIt.Seek([]byte(\"4\"))\n\tk, v := getAll(mergeIt)\n\trequire.EqualValues(t, []string{\"5\", \"7\", \"9\"}, k)\n\trequire.EqualValues(t, []string{\"b5\", \"a7\", \"d9\"}, v)\n\tcloseAndCheck(t, mergeIt, 4)\n}\n\nfunc TestMergeIteratorSeekReversed(t *testing.T) {\n\tit := newSimpleIterator([]string{\"1\", \"3\", \"7\"}, []string{\"a1\", \"a3\", \"a7\"}, true)\n\tit2 := newSimpleIterator([]string{\"2\", \"3\", \"5\"}, []string{\"b2\", \"b3\", \"b5\"}, true)\n\tit3 := newSimpleIterator([]string{\"1\"}, []string{\"c1\"}, true)\n\tit4 := newSimpleIterator([]string{\"1\", \"7\", \"9\"}, []string{\"d1\", \"d7\", \"d9\"}, true)\n\tmergeIt := NewMergeIterator([]y.Iterator{it, 
it2, it3, it4}, true)\n\tmergeIt.Seek([]byte(\"5\"))\n\tk, v := getAll(mergeIt)\n\trequire.EqualValues(t, []string{\"5\", \"3\", \"2\", \"1\"}, k)\n\trequire.EqualValues(t, []string{\"b5\", \"a3\", \"b2\", \"a1\"}, v)\n\tcloseAndCheck(t, mergeIt, 4)\n}\n\nfunc TestMergeIteratorSeekInvalid(t *testing.T) {\n\tit := newSimpleIterator([]string{\"1\", \"3\", \"7\"}, []string{\"a1\", \"a3\", \"a7\"}, false)\n\tit2 := newSimpleIterator([]string{\"2\", \"3\", \"5\"}, []string{\"b2\", \"b3\", \"b5\"}, false)\n\tit3 := newSimpleIterator([]string{\"1\"}, []string{\"c1\"}, false)\n\tit4 := newSimpleIterator([]string{\"1\", \"7\", \"9\"}, []string{\"d1\", \"d7\", \"d9\"}, false)\n\tmergeIt := NewMergeIterator([]y.Iterator{it, it2, it3, it4}, false)\n\tmergeIt.Seek([]byte(\"f\"))\n\trequire.False(t, mergeIt.Valid())\n\tcloseAndCheck(t, mergeIt, 4)\n}\n\nfunc TestMergeIteratorSeekInvalidReversed(t *testing.T) {\n\tit := newSimpleIterator([]string{\"1\", \"3\", \"7\"}, []string{\"a1\", \"a3\", \"a7\"}, true)\n\tit2 := newSimpleIterator([]string{\"2\", \"3\", \"5\"}, []string{\"b2\", \"b3\", \"b5\"}, true)\n\tit3 := newSimpleIterator([]string{\"1\"}, []string{\"c1\"}, true)\n\tit4 := newSimpleIterator([]string{\"1\", \"7\", \"9\"}, []string{\"d1\", \"d7\", \"d9\"}, true)\n\tmergeIt := NewMergeIterator([]y.Iterator{it, it2, it3, it4}, true)\n\tmergeIt.Seek([]byte(\"0\"))\n\trequire.False(t, mergeIt.Valid())\n\tcloseAndCheck(t, mergeIt, 4)\n}\n\nfunc TestMergeIteratorDuplicate(t *testing.T) {\n\tit1 := newSimpleIterator([]string{\"0\", \"1\", \"2\"}, []string{\"a0\", \"a1\", \"a2\"}, false)\n\tit2 := newSimpleIterator([]string{\"1\", \"3\"}, []string{\"b1\", \"b3\"}, false)\n\tit3 := newSimpleIterator([]string{\"0\", \"1\", \"2\"}, []string{\"c0\", \"c1\", \"c2\"}, false)\n\tt.Run(\"forward\", func(t *testing.T) {\n\t\tt.Run(\"only duplicates\", func(t *testing.T) {\n\t\t\tit := NewMergeIterator([]y.Iterator{it1, it3}, false)\n\t\t\texpectedKeys := []string{\"0\", \"1\", 
\"2\"}\n\t\t\texpectedVals := []string{\"a0\", \"a1\", \"a2\"}\n\t\t\tit.Rewind()\n\t\t\tk, v := getAll(it)\n\t\t\trequire.Equal(t, expectedKeys, k)\n\t\t\trequire.Equal(t, expectedVals, v)\n\t\t})\n\t\tt.Run(\"one\", func(t *testing.T) {\n\t\t\tit := NewMergeIterator([]y.Iterator{it3, it2, it1}, false)\n\t\t\texpectedKeys := []string{\"0\", \"1\", \"2\", \"3\"}\n\t\t\texpectedVals := []string{\"c0\", \"c1\", \"c2\", \"b3\"}\n\t\t\tit.Rewind()\n\t\t\tk, v := getAll(it)\n\t\t\trequire.Equal(t, expectedKeys, k)\n\t\t\trequire.Equal(t, expectedVals, v)\n\t\t})\n\t\tt.Run(\"two\", func(t *testing.T) {\n\t\t\tit1 := newSimpleIterator([]string{\"0\", \"1\", \"2\"}, []string{\"0\", \"1\", \"2\"}, false)\n\t\t\tit2 := newSimpleIterator([]string{\"1\"}, []string{\"1\"}, false)\n\t\t\tit3 := newSimpleIterator([]string{\"2\"}, []string{\"2\"}, false)\n\t\t\tit := NewMergeIterator([]y.Iterator{it3, it2, it1}, false)\n\n\t\t\tvar cnt int\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\trequire.EqualValues(t, cnt+48, it.Key()[0])\n\t\t\t\tcnt++\n\t\t\t}\n\t\t\trequire.Equal(t, 3, cnt)\n\t\t})\n\t})\n\n\tt.Run(\"reverse\", func(t *testing.T) {\n\t\tit1.reversed = true\n\t\tit2.reversed = true\n\t\tit3.reversed = true\n\n\t\tit := NewMergeIterator([]y.Iterator{it3, it2, it1}, true)\n\n\t\texpectedKeys := []string{\"3\", \"2\", \"1\", \"0\"}\n\t\texpectedVals := []string{\"b3\", \"c2\", \"c1\", \"c0\"}\n\t\tit.Rewind()\n\t\tk, v := getAll(it)\n\t\trequire.Equal(t, expectedKeys, k)\n\t\trequire.Equal(t, expectedVals, v)\n\t})\n}\n\nfunc TestMergeDuplicates(t *testing.T) {\n\tit := newSimpleIterator([]string{\"1\", \"1\", \"1\"}, []string{\"a1\", \"a3\", \"a7\"}, false)\n\tit2 := newSimpleIterator([]string{\"1\", \"1\", \"1\"}, []string{\"b2\", \"b3\", \"b5\"}, false)\n\tit3 := newSimpleIterator([]string{\"1\"}, []string{\"c1\"}, false)\n\tit4 := newSimpleIterator([]string{\"1\", \"1\", \"2\"}, []string{\"d1\", \"d7\", \"d9\"}, false)\n\tt.Run(\"forward\", func(t *testing.T) 
{\n\t\texpectedKeys := []string{\"1\", \"2\"}\n\t\texpectedVals := []string{\"a1\", \"d9\"}\n\t\tmergeIt := NewMergeIterator([]y.Iterator{it, it2, it3, it4}, false)\n\t\tmergeIt.Rewind()\n\t\tk, v := getAll(mergeIt)\n\t\trequire.EqualValues(t, expectedKeys, k)\n\t\trequire.EqualValues(t, expectedVals, v)\n\t\tcloseAndCheck(t, mergeIt, 4)\n\t})\n\n\tt.Run(\"reverse\", func(t *testing.T) {\n\t\tit.reversed = true\n\t\tit2.reversed = true\n\t\tit3.reversed = true\n\t\tit4.reversed = true\n\t\texpectedKeys := []string{\"2\", \"1\"}\n\t\texpectedVals := []string{\"d9\", \"a7\"}\n\t\tmergeIt := NewMergeIterator([]y.Iterator{it, it2, it3, it4}, true)\n\t\tmergeIt.Rewind()\n\t\tk, v := getAll(mergeIt)\n\t\trequire.EqualValues(t, expectedKeys, k)\n\t\trequire.EqualValues(t, expectedVals, v)\n\t\tcloseAndCheck(t, mergeIt, 4)\n\t})\n}\n"
  },
  {
    "path": "table/table.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage table\n\nimport (\n\t\"bytes\"\n\t\"crypto/aes\"\n\t\"encoding/binary\"\n\t\"errors\"\n\t\"fmt\"\n\t\"math\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"strconv\"\n\t\"strings\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"time\"\n\t\"unsafe\"\n\n\t\"github.com/klauspost/compress/snappy\"\n\t\"github.com/klauspost/compress/zstd\"\n\t\"google.golang.org/protobuf/proto\"\n\n\t\"github.com/dgraph-io/badger/v4/fb\"\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nconst fileSuffix = \".sst\"\nconst intSize = int(unsafe.Sizeof(int(0)))\n\n// Options contains configurable options for Table/Builder.\ntype Options struct {\n\t// Options for Opening/Building Table.\n\n\t// Open tables in read only mode.\n\tReadOnly       bool\n\tMetricsEnabled bool\n\n\t// Maximum size of the table.\n\tTableSize     uint64\n\ttableCapacity uint64 // 0.9x TableSize.\n\n\t// ChkMode is the checksum verification mode for Table.\n\tChkMode options.ChecksumVerificationMode\n\n\t// Options for Table builder.\n\n\t// BloomFalsePositive is the false positive probabiltiy of bloom filter.\n\tBloomFalsePositive float64\n\n\t// BlockSize is the size of each block inside SSTable in bytes.\n\tBlockSize int\n\n\t// DataKey is the key used to decrypt the encrypted text.\n\tDataKey *pb.DataKey\n\n\t// Compression indicates the compression algorithm used for block compression.\n\tCompression options.CompressionType\n\n\t// Block cache is used to cache decompressed and decrypted blocks.\n\tBlockCache *ristretto.Cache[[]byte, *Block]\n\tIndexCache *ristretto.Cache[uint64, *fb.TableIndex]\n\n\tAllocPool *z.AllocatorPool\n\n\t// ZSTDCompressionLevel is the ZSTD compression level used for compressing blocks.\n\tZSTDCompressionLevel 
int\n}\n\n// TableInterface is useful for testing.\ntype TableInterface interface {\n\tSmallest() []byte\n\tBiggest() []byte\n\tDoesNotHave(hash uint32) bool\n\tMaxVersion() uint64\n}\n\n// Table represents a loaded table file with the info we have about it.\ntype Table struct {\n\tsync.Mutex\n\t*z.MmapFile\n\n\ttableSize int // Initialized in OpenTable, using fd.Stat().\n\n\t_index *fb.TableIndex // Nil if encryption is enabled. Use fetchIndex to access.\n\t_cheap *cheapIndex\n\tref    atomic.Int32 // For file garbage collection\n\n\t// The following are initialized once and const.\n\tsmallest, biggest []byte // Smallest and largest keys (with timestamps).\n\tid                uint64 // file id, part of filename\n\n\tChecksum       []byte\n\tCreatedAt      time.Time\n\tindexStart     int\n\tindexLen       int\n\thasBloomFilter bool\n\n\tIsInmemory bool // Set to true if the table is on level 0 and opened in memory.\n\topt        *Options\n}\n\ntype cheapIndex struct {\n\tMaxVersion        uint64\n\tKeyCount          uint32\n\tUncompressedSize  uint32\n\tOnDiskSize        uint32\n\tBloomFilterLength int\n\tOffsetsLength     int\n}\n\nfunc (t *Table) cheapIndex() *cheapIndex {\n\treturn t._cheap\n}\nfunc (t *Table) offsetsLength() int { return t.cheapIndex().OffsetsLength }\n\n// MaxVersion returns the maximum version across all keys stored in this table.\nfunc (t *Table) MaxVersion() uint64 { return t.cheapIndex().MaxVersion }\n\n// BloomFilterSize returns the size of the bloom filter in bytes stored in memory.\nfunc (t *Table) BloomFilterSize() int { return t.cheapIndex().BloomFilterLength }\n\n// UncompressedSize is the size uncompressed data stored in this file.\nfunc (t *Table) UncompressedSize() uint32 { return t.cheapIndex().UncompressedSize }\n\n// KeyCount is the total number of keys in this table.\nfunc (t *Table) KeyCount() uint32 { return t.cheapIndex().KeyCount }\n\n// OnDiskSize returns the total size of key-values stored in this table (including 
the\n// disk space occupied on the value log).\nfunc (t *Table) OnDiskSize() uint32 { return t.cheapIndex().OnDiskSize }\n\n// CompressionType returns the compression algorithm used for block compression.\nfunc (t *Table) CompressionType() options.CompressionType {\n\treturn t.opt.Compression\n}\n\n// IncrRef increments the refcount (having to do with whether the file should be deleted)\nfunc (t *Table) IncrRef() {\n\tt.ref.Add(1)\n}\n\n// DecrRef decrements the refcount and possibly deletes the table\nfunc (t *Table) DecrRef() error {\n\tnewRef := t.ref.Add(-1)\n\tif newRef == 0 {\n\t\t// We can safely delete this file, because for all the current files, we always have\n\t\t// at least one reference pointing to them.\n\n\t\t// Delete all blocks from the cache.\n\t\tfor i := 0; i < t.offsetsLength(); i++ {\n\t\t\tt.opt.BlockCache.Del(t.blockCacheKey(i))\n\t\t}\n\t\tif err := t.Delete(); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\treturn nil\n}\n\n// BlockEvictHandler is used to reuse the byte slice stored in the block on cache eviction.\nfunc BlockEvictHandler(b *Block) {\n\tb.decrRef()\n}\n\ntype Block struct {\n\toffset            int\n\tdata              []byte\n\tchecksum          []byte\n\tentriesIndexStart int      // start index of entryOffsets list\n\tentryOffsets      []uint32 // used to binary search an entry in the block.\n\tchkLen            int      // checksum length.\n\tfreeMe            bool     // used to determine if the blocked should be reused.\n\tref               atomic.Int32\n}\n\nvar NumBlocks atomic.Int32\n\n// incrRef increments the ref of a block and return a bool indicating if the\n// increment was successful. A true value indicates that the block can be used.\nfunc (b *Block) incrRef() bool {\n\tfor {\n\t\t// We can't blindly add 1 to ref. 
We need to check whether it has\n\t\t// reached zero first, because if it did, then we should absolutely not\n\t\t// use this block.\n\t\tref := b.ref.Load()\n\t\t// The ref would not be equal to 0 unless the existing\n\t\t// block got evicted before this line. If the ref is zero, it means that\n\t\t// the block is already added to the blockPool and cannot be used\n\t\t// anymore. The ref of a new block is 1 so the following condition will\n\t\t// be true only if the block got reused before we could increment its\n\t\t// ref.\n\t\tif ref == 0 {\n\t\t\treturn false\n\t\t}\n\t\t// Increment the ref only if it is not zero and has not changed between\n\t\t// the time we read it and we're updating it.\n\t\t//\n\t\tif b.ref.CompareAndSwap(ref, ref+1) {\n\t\t\treturn true\n\t\t}\n\t}\n}\nfunc (b *Block) decrRef() {\n\tif b == nil {\n\t\treturn\n\t}\n\n\t// Insert the []byte into pool only if the block is reusable. When a block\n\t// is reusable a new []byte is used for decompression and this []byte can\n\t// be reused.\n\t// In case of an uncompressed block, the []byte is a reference to the\n\t// table.mmap []byte slice. 
Any attempt to write data to the mmap []byte\n\t// will lead to SEGFAULT.\n\tif b.ref.Add(-1) == 0 {\n\t\tif b.freeMe {\n\t\t\tz.Free(b.data)\n\t\t}\n\t\tNumBlocks.Add(-1)\n\t\t// blockPool.Put(&b.data)\n\t}\n\ty.AssertTrue(b.ref.Load() >= 0)\n}\nfunc (b *Block) size() int64 {\n\treturn int64(3*intSize /* Size of the offset, entriesIndexStart and chkLen */ +\n\t\tcap(b.data) + cap(b.checksum) + cap(b.entryOffsets)*4)\n}\n\nfunc (b *Block) verifyCheckSum() error {\n\tcs := &pb.Checksum{}\n\tif err := proto.Unmarshal(b.checksum, cs); err != nil {\n\t\treturn y.Wrapf(err, \"unable to unmarshal checksum for block\")\n\t}\n\treturn y.VerifyChecksum(b.data, cs)\n}\n\nfunc CreateTable(fname string, builder *Builder) (*Table, error) {\n\tbd := builder.Done()\n\tmf, err := z.OpenMmapFile(fname, os.O_CREATE|os.O_RDWR|os.O_EXCL, bd.Size)\n\tif err == z.NewFile {\n\t\t// Expected.\n\t} else if err != nil {\n\t\treturn nil, y.Wrapf(err, \"while creating table: %s\", fname)\n\t} else {\n\t\treturn nil, fmt.Errorf(\"file already exists: %s\", fname)\n\t}\n\n\twritten := bd.Copy(mf.Data)\n\ty.AssertTrue(written == len(mf.Data))\n\tif err := z.Msync(mf.Data); err != nil {\n\t\treturn nil, y.Wrapf(err, \"while calling msync on %s\", fname)\n\t}\n\treturn OpenTable(mf, *builder.opts)\n}\n\n// OpenTable assumes file has only one table and opens it. Takes ownership of fd upon function\n// entry. Returns a table with one reference count on it (decrementing which may delete the file!\n// -- consider t.Close() instead). The fd has to writeable because we call Truncate on it before\n// deleting. Checksum for all blocks of table is verified based on value of chkMode.\nfunc OpenTable(mf *z.MmapFile, opts Options) (*Table, error) {\n\t// BlockSize is used to compute the approximate size of the decompressed\n\t// block. 
It should not be zero if the table is compressed.\n\tif opts.BlockSize == 0 && opts.Compression != options.None {\n\t\treturn nil, errors.New(\"Block size cannot be zero\")\n\t}\n\tfileInfo, err := mf.Fd.Stat()\n\tif err != nil {\n\t\tmf.Close(-1)\n\t\treturn nil, y.Wrap(err, \"\")\n\t}\n\n\tfilename := fileInfo.Name()\n\tid, ok := ParseFileID(filename)\n\tif !ok {\n\t\tmf.Close(-1)\n\t\treturn nil, fmt.Errorf(\"Invalid filename: %s\", filename)\n\t}\n\tt := &Table{\n\t\tMmapFile:   mf,\n\t\tid:         id,\n\t\topt:        &opts,\n\t\tIsInmemory: false,\n\t\ttableSize:  int(fileInfo.Size()),\n\t\tCreatedAt:  fileInfo.ModTime(),\n\t}\n\t// Caller is given one reference.\n\tt.ref.Store(1)\n\n\tif err := t.initBiggestAndSmallest(); err != nil {\n\t\treturn nil, y.Wrapf(err, \"failed to initialize table\")\n\t}\n\n\tif opts.ChkMode == options.OnTableRead || opts.ChkMode == options.OnTableAndBlockRead {\n\t\tif err := t.VerifyChecksum(); err != nil {\n\t\t\tmf.Close(-1)\n\t\t\treturn nil, y.Wrapf(err, \"failed to verify checksum\")\n\t\t}\n\t}\n\n\treturn t, nil\n}\n\n// OpenInMemoryTable is similar to OpenTable but it opens a new table from the provided data.\n// OpenInMemoryTable is used for L0 tables.\nfunc OpenInMemoryTable(data []byte, id uint64, opt *Options) (*Table, error) {\n\tmf := &z.MmapFile{\n\t\tData: data,\n\t\tFd:   nil,\n\t}\n\tt := &Table{\n\t\tMmapFile:   mf,\n\t\topt:        opt,\n\t\ttableSize:  len(data),\n\t\tIsInmemory: true,\n\t\tid:         id, // It is important that each table gets a unique ID.\n\t}\n\t// Caller is given one reference.\n\tt.ref.Store(1)\n\n\tif err := t.initBiggestAndSmallest(); err != nil {\n\t\treturn nil, err\n\t}\n\treturn t, nil\n}\n\nfunc (t *Table) initBiggestAndSmallest() error {\n\t// This defer will help gathering debugging info in case initIndex crashes.\n\tdefer func() {\n\t\tif r := recover(); r != nil {\n\t\t\t// Use defer for printing info because there may be an intermediate panic.\n\t\t\tvar debugBuf 
bytes.Buffer\n\t\t\tdefer func() {\n\t\t\t\tpanic(fmt.Sprintf(\"%s\\n== Recovered ==\\n\", debugBuf.String()))\n\t\t\t}()\n\n\t\t\t// Get the count of null bytes at the end of file. This is to make sure if there was an\n\t\t\t// issue with mmap sync or file copy.\n\t\t\tcount := 0\n\t\t\tfor i := len(t.Data) - 1; i >= 0; i-- {\n\t\t\t\tif t.Data[i] != 0 {\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t\tcount++\n\t\t\t}\n\n\t\t\tfmt.Fprintf(&debugBuf, \"\\n== Recovering from initIndex crash ==\\n\")\n\t\t\tfmt.Fprintf(&debugBuf, \"File Info: [ID: %d, Size: %d, Zeros: %d]\\n\",\n\t\t\t\tt.id, t.tableSize, count)\n\n\t\t\tfmt.Fprintf(&debugBuf, \"isEnrypted: %v \", t.shouldDecrypt())\n\n\t\t\treadPos := t.tableSize\n\n\t\t\t// Read checksum size.\n\t\t\treadPos -= 4\n\t\t\tbuf := t.readNoFail(readPos, 4)\n\t\t\tchecksumLen := int(y.BytesToU32(buf))\n\t\t\tfmt.Fprintf(&debugBuf, \"checksumLen: %d \", checksumLen)\n\n\t\t\t// Read checksum.\n\t\t\tchecksum := &pb.Checksum{}\n\t\t\treadPos -= checksumLen\n\t\t\tbuf = t.readNoFail(readPos, checksumLen)\n\t\t\t_ = proto.Unmarshal(buf, checksum)\n\t\t\tfmt.Fprintf(&debugBuf, \"checksum: %+v \", checksum)\n\n\t\t\t// Read index size from the footer.\n\t\t\treadPos -= 4\n\t\t\tbuf = t.readNoFail(readPos, 4)\n\t\t\tindexLen := int(y.BytesToU32(buf))\n\t\t\tfmt.Fprintf(&debugBuf, \"indexLen: %d \", indexLen)\n\n\t\t\t// Read index.\n\t\t\treadPos -= t.indexLen\n\t\t\tt.indexStart = readPos\n\t\t\tindexData := t.readNoFail(readPos, t.indexLen)\n\t\t\tfmt.Fprintf(&debugBuf, \"index: %v \", indexData)\n\t\t}\n\t}()\n\n\tvar err error\n\tvar ko *fb.BlockOffset\n\tif ko, err = t.initIndex(); err != nil {\n\t\treturn y.Wrapf(err, \"failed to read index.\")\n\t}\n\n\tt.smallest = y.Copy(ko.KeyBytes())\n\n\tit2 := t.NewIterator(REVERSED | NOCACHE)\n\tdefer it2.Close()\n\tit2.Rewind()\n\tif !it2.Valid() {\n\t\treturn y.Wrapf(it2.err, \"failed to initialize biggest for table %s\", t.Filename())\n\t}\n\tt.biggest = y.Copy(it2.Key())\n\treturn 
nil\n}\n\nfunc (t *Table) read(off, sz int) ([]byte, error) {\n\treturn t.Bytes(off, sz)\n}\n\nfunc (t *Table) readNoFail(off, sz int) []byte {\n\tres, err := t.read(off, sz)\n\ty.Check(err)\n\treturn res\n}\n\n// initIndex reads the index and populate the necessary table fields and returns\n// first block offset\nfunc (t *Table) initIndex() (*fb.BlockOffset, error) {\n\treadPos := t.tableSize\n\n\t// Read checksum len from the last 4 bytes.\n\treadPos -= 4\n\tbuf := t.readNoFail(readPos, 4)\n\tchecksumLen := int(y.BytesToU32(buf))\n\tif checksumLen < 0 {\n\t\treturn nil, errors.New(\"checksum length less than zero. Data corrupted\")\n\t}\n\n\t// Read checksum.\n\texpectedChk := &pb.Checksum{}\n\treadPos -= checksumLen\n\tbuf = t.readNoFail(readPos, checksumLen)\n\tif err := proto.Unmarshal(buf, expectedChk); err != nil {\n\t\treturn nil, err\n\t}\n\n\t// Read index size from the footer.\n\treadPos -= 4\n\tbuf = t.readNoFail(readPos, 4)\n\tt.indexLen = int(y.BytesToU32(buf))\n\n\t// Read index.\n\treadPos -= t.indexLen\n\tt.indexStart = readPos\n\tdata := t.readNoFail(readPos, t.indexLen)\n\n\tif err := y.VerifyChecksum(data, expectedChk); err != nil {\n\t\treturn nil, y.Wrapf(err, \"failed to verify checksum for table: %s\", t.Filename())\n\t}\n\n\tindex, err := t.readTableIndex()\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tif !t.shouldDecrypt() {\n\t\t// If there's no encryption, this points to the mmap'ed buffer.\n\t\tt._index = index\n\t}\n\tt._cheap = &cheapIndex{\n\t\tMaxVersion:        index.MaxVersion(),\n\t\tKeyCount:          index.KeyCount(),\n\t\tUncompressedSize:  index.UncompressedSize(),\n\t\tOnDiskSize:        index.OnDiskSize(),\n\t\tOffsetsLength:     index.OffsetsLength(),\n\t\tBloomFilterLength: index.BloomFilterLength(),\n\t}\n\n\tt.hasBloomFilter = len(index.BloomFilterBytes()) > 0\n\n\tvar bo fb.BlockOffset\n\ty.AssertTrue(index.Offsets(&bo, 0))\n\treturn &bo, nil\n}\n\n// KeySplits splits the table into at least n ranges based on the 
block offsets.\nfunc (t *Table) KeySplits(n int, prefix []byte) []string {\n\tif n == 0 {\n\t\treturn nil\n\t}\n\n\toLen := t.offsetsLength()\n\tjump := oLen / n\n\tif jump == 0 {\n\t\tjump = 1\n\t}\n\n\tvar bo fb.BlockOffset\n\tvar res []string\n\tfor i := 0; i < oLen; i += jump {\n\t\tif i >= oLen {\n\t\t\ti = oLen - 1\n\t\t}\n\t\ty.AssertTrue(t.offsets(&bo, i))\n\t\tif bytes.HasPrefix(bo.KeyBytes(), prefix) {\n\t\t\tres = append(res, string(bo.KeyBytes()))\n\t\t}\n\t}\n\treturn res\n}\n\nfunc (t *Table) fetchIndex() *fb.TableIndex {\n\tif !t.shouldDecrypt() {\n\t\treturn t._index\n\t}\n\n\tif t.opt.IndexCache == nil {\n\t\tpanic(\"Index Cache must be set for encrypted workloads\")\n\t}\n\tif val, ok := t.opt.IndexCache.Get(t.indexKey()); ok && val != nil {\n\t\treturn val\n\t}\n\n\tindex, err := t.readTableIndex()\n\ty.Check(err)\n\tt.opt.IndexCache.Set(t.indexKey(), index, int64(t.indexLen))\n\treturn index\n}\n\nfunc (t *Table) offsets(ko *fb.BlockOffset, i int) bool {\n\treturn t.fetchIndex().Offsets(ko, i)\n}\n\n// block function return a new block. Each block holds a ref and the byte\n// slice stored in the block will be reused when the ref becomes zero. The\n// caller should release the block by calling block.decrRef() on it.\nfunc (t *Table) block(idx int, useCache bool) (*Block, error) {\n\ty.AssertTruef(idx >= 0, \"idx=%d\", idx)\n\tif idx >= t.offsetsLength() {\n\t\treturn nil, errors.New(\"block out of index\")\n\t}\n\tif t.opt.BlockCache != nil {\n\t\tkey := t.blockCacheKey(idx)\n\t\tblk, ok := t.opt.BlockCache.Get(key)\n\t\tif ok && blk != nil {\n\t\t\t// Use the block only if the increment was successful. 
The block\n\t\t\t// could get evicted from the cache between the Get() call and the\n\t\t\t// incrRef() call.\n\t\t\tif blk.incrRef() {\n\t\t\t\treturn blk, nil\n\t\t\t}\n\t\t}\n\t}\n\n\tvar ko fb.BlockOffset\n\ty.AssertTrue(t.offsets(&ko, idx))\n\tblk := &Block{offset: int(ko.Offset())}\n\tblk.ref.Store(1)\n\tdefer blk.decrRef() // Deal with any errors, where blk would not be returned.\n\tNumBlocks.Add(1)\n\n\tvar err error\n\tif blk.data, err = t.read(blk.offset, int(ko.Len())); err != nil {\n\t\treturn nil, y.Wrapf(err,\n\t\t\t\"failed to read from file: %s at offset: %d, len: %d\",\n\t\t\tt.Fd.Name(), blk.offset, ko.Len())\n\t}\n\n\tif t.shouldDecrypt() {\n\t\t// Decrypt the block if it is encrypted.\n\t\tif blk.data, err = t.decrypt(blk.data, true); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\t// blk.data is allocated via Calloc. So, do free.\n\t\tblk.freeMe = true\n\t}\n\n\tif err = t.decompress(blk); err != nil {\n\t\treturn nil, y.Wrapf(err,\n\t\t\t\"failed to decode compressed data in file: %s at offset: %d, len: %d\",\n\t\t\tt.Fd.Name(), blk.offset, ko.Len())\n\t}\n\n\t// Read meta data related to block.\n\treadPos := len(blk.data) - 4 // First read checksum length.\n\tblk.chkLen = int(y.BytesToU32(blk.data[readPos : readPos+4]))\n\n\t// Checksum length greater than block size could happen if the table was compressed and\n\t// it was opened with an incorrect compression algorithm (or the data was corrupted).\n\tif blk.chkLen > len(blk.data) {\n\t\treturn nil, errors.New(\"invalid checksum length. 
Either the data is \" +\n\t\t\t\"corrupted or the table options are incorrectly set\")\n\t}\n\n\t// Read checksum and store it\n\treadPos -= blk.chkLen\n\tblk.checksum = blk.data[readPos : readPos+blk.chkLen]\n\t// Move back and read numEntries in the block.\n\treadPos -= 4\n\tnumEntries := int(y.BytesToU32(blk.data[readPos : readPos+4]))\n\tentriesIndexStart := readPos - (numEntries * 4)\n\tentriesIndexEnd := entriesIndexStart + numEntries*4\n\n\tblk.entryOffsets = y.BytesToU32Slice(blk.data[entriesIndexStart:entriesIndexEnd])\n\n\tblk.entriesIndexStart = entriesIndexStart\n\n\t// Drop checksum and checksum length.\n\t// The checksum is calculated for actual data + entry index + index length\n\tblk.data = blk.data[:readPos+4]\n\n\t// Verify the checksum if the verification mode is OnBlockRead or OnTableAndBlockRead.\n\tif t.opt.ChkMode == options.OnBlockRead || t.opt.ChkMode == options.OnTableAndBlockRead {\n\t\tif err = blk.verifyCheckSum(); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\n\tblk.incrRef()\n\tif useCache && t.opt.BlockCache != nil {\n\t\tkey := t.blockCacheKey(idx)\n\t\t// incrRef should never return false here because we're calling it on a\n\t\t// new block with ref=1.\n\t\ty.AssertTrue(blk.incrRef())\n\n\t\t// Decrement the block ref if we could not insert it in the cache.\n\t\tif !t.opt.BlockCache.Set(key, blk, blk.size()) {\n\t\t\tblk.decrRef()\n\t\t}\n\t\t// We have added an OnReject func in our cache, which gets called in case the block is not\n\t\t// admitted to the cache. 
So, every block would be accounted for.\n\t}\n\treturn blk, nil\n}\n\n// blockCacheKey is used to store blocks in the block cache.\nfunc (t *Table) blockCacheKey(idx int) []byte {\n\ty.AssertTrue(t.id < math.MaxUint32)\n\ty.AssertTrue(uint32(idx) < math.MaxUint32)\n\n\tbuf := make([]byte, 8)\n\t// Assume t.ID does not overflow uint32.\n\tbinary.BigEndian.PutUint32(buf[:4], uint32(t.ID()))\n\tbinary.BigEndian.PutUint32(buf[4:], uint32(idx))\n\treturn buf\n}\n\n// indexKey returns the cache key for block offsets. blockOffsets\n// are stored in the index cache.\nfunc (t *Table) indexKey() uint64 {\n\treturn t.id\n}\n\n// IndexSize is the size of table index in bytes.\nfunc (t *Table) IndexSize() int {\n\treturn t.indexLen\n}\n\n// Size is its file size in bytes\nfunc (t *Table) Size() int64 { return int64(t.tableSize) }\n\n// StaleDataSize is the amount of stale data (that can be dropped by a compaction )in this SST.\nfunc (t *Table) StaleDataSize() uint32 { return t.fetchIndex().StaleDataSize() }\n\n// Smallest is its smallest key, or nil if there are none\nfunc (t *Table) Smallest() []byte { return t.smallest }\n\n// Biggest is its biggest key, or nil if there are none\nfunc (t *Table) Biggest() []byte { return t.biggest }\n\n// Filename is NOT the file name.  
Just kidding, it is.\nfunc (t *Table) Filename() string { return t.Fd.Name() }\n\n// ID is the table's ID number (used to make the file name).\nfunc (t *Table) ID() uint64 { return t.id }\n\n// DoesNotHave returns true if and only if the table does not have the key hash.\n// It does a bloom filter lookup.\nfunc (t *Table) DoesNotHave(hash uint32) bool {\n\tif !t.hasBloomFilter {\n\t\treturn false\n\t}\n\n\ty.NumLSMBloomHitsAdd(t.opt.MetricsEnabled, \"DoesNotHave_ALL\", 1)\n\tindex := t.fetchIndex()\n\tbf := index.BloomFilterBytes()\n\tmayContain := y.Filter(bf).MayContain(hash)\n\tif !mayContain {\n\t\ty.NumLSMBloomHitsAdd(t.opt.MetricsEnabled, \"DoesNotHave_HIT\", 1)\n\t}\n\treturn !mayContain\n}\n\n// readTableIndex reads table index from the sst and returns its pb format.\nfunc (t *Table) readTableIndex() (*fb.TableIndex, error) {\n\tdata := t.readNoFail(t.indexStart, t.indexLen)\n\tvar err error\n\t// Decrypt the table index if it is encrypted.\n\tif t.shouldDecrypt() {\n\t\tif data, err = t.decrypt(data, false); err != nil {\n\t\t\treturn nil, y.Wrapf(err,\n\t\t\t\t\"Error while decrypting table index for the table %d in readTableIndex\", t.id)\n\t\t}\n\t}\n\treturn fb.GetRootAsTableIndex(data, 0), nil\n}\n\n// VerifyChecksum verifies checksum for all blocks of table. This function is called by\n// OpenTable() function. 
This function is also called inside levelsController.VerifyChecksum().\nfunc (t *Table) VerifyChecksum() error {\n\tti := t.fetchIndex()\n\tfor i := 0; i < ti.OffsetsLength(); i++ {\n\t\tb, err := t.block(i, true)\n\t\tif err != nil {\n\t\t\t// b is nil when block() fails, so do not dereference it here.\n\t\t\treturn y.Wrapf(err, \"checksum validation failed for table: %s, block: %d\",\n\t\t\t\tt.Filename(), i)\n\t\t}\n\t\t// We should not call incrRef here, because the block already has one ref when created.\n\t\tdefer b.decrRef()\n\t\t// OnBlockRead or OnTableAndBlockRead, we don't need to call verify checksum\n\t\t// on block, verification would be done while reading block itself.\n\t\tif !(t.opt.ChkMode == options.OnBlockRead || t.opt.ChkMode == options.OnTableAndBlockRead) {\n\t\t\tif err = b.verifyCheckSum(); err != nil {\n\t\t\t\treturn y.Wrapf(err,\n\t\t\t\t\t\"checksum validation failed for table: %s, block: %d, offset:%d\",\n\t\t\t\t\tt.Filename(), i, b.offset)\n\t\t\t}\n\t\t}\n\t}\n\treturn nil\n}\n\n// shouldDecrypt tells whether to decrypt or not. We decrypt only if the datakey exists\n// for the table.\nfunc (t *Table) shouldDecrypt() bool {\n\treturn t.opt.DataKey != nil\n}\n\n// KeyID returns data key id.\nfunc (t *Table) KeyID() uint64 {\n\tif t.opt.DataKey != nil {\n\t\treturn t.opt.DataKey.KeyId\n\t}\n\t// By default it's 0, if it is plain text.\n\treturn 0\n}\n\n// decrypt decrypts the given data. 
It should be called only after checking shouldDecrypt.\nfunc (t *Table) decrypt(data []byte, viaCalloc bool) ([]byte, error) {\n\t// Last BlockSize bytes of the data is the IV.\n\tiv := data[len(data)-aes.BlockSize:]\n\t// Rest all bytes are data.\n\tdata = data[:len(data)-aes.BlockSize]\n\n\tvar dst []byte\n\tif viaCalloc {\n\t\tdst = z.Calloc(len(data), \"Table.Decrypt\")\n\t} else {\n\t\tdst = make([]byte, len(data))\n\t}\n\tif err := y.XORBlock(dst, data, t.opt.DataKey.Data, iv); err != nil {\n\t\treturn nil, y.Wrapf(err, \"while decrypt\")\n\t}\n\treturn dst, nil\n}\n\n// ParseFileID reads the file id out of a filename.\nfunc ParseFileID(name string) (uint64, bool) {\n\tname = filepath.Base(name)\n\tif !strings.HasSuffix(name, fileSuffix) {\n\t\treturn 0, false\n\t}\n\t//\tsuffix := name[len(fileSuffix):]\n\tname = strings.TrimSuffix(name, fileSuffix)\n\tid, err := strconv.Atoi(name)\n\tif err != nil {\n\t\treturn 0, false\n\t}\n\ty.AssertTrue(id >= 0)\n\treturn uint64(id), true\n}\n\n// IDToFilename does the inverse of ParseFileID\nfunc IDToFilename(id uint64) string {\n\treturn fmt.Sprintf(\"%06d\", id) + fileSuffix\n}\n\n// NewFilename should be named TableFilepath -- it combines the dir with the ID to make a table\n// filepath.\nfunc NewFilename(id uint64, dir string) string {\n\treturn filepath.Join(dir, IDToFilename(id))\n}\n\n// decompress decompresses the data stored in a block.\nfunc (t *Table) decompress(b *Block) error {\n\tvar dst []byte\n\tvar err error\n\n\t// Point to the original b.data\n\tsrc := b.data\n\n\tswitch t.opt.Compression {\n\tcase options.None:\n\t\t// Nothing to be done here.\n\t\treturn nil\n\tcase options.Snappy:\n\t\tif sz, err := snappy.DecodedLen(b.data); err == nil {\n\t\t\tdst = z.Calloc(sz, \"Table.Decompress\")\n\t\t} else {\n\t\t\tdst = z.Calloc(len(b.data)*4, \"Table.Decompress\") // Take a guess.\n\t\t}\n\t\tb.data, err = snappy.Decode(dst, b.data)\n\t\tif err != nil {\n\t\t\tz.Free(dst)\n\t\t\treturn y.Wrap(err, 
\"failed to decompress\")\n\t\t}\n\tcase options.ZSTD:\n\t\tsz := int(float64(t.opt.BlockSize) * 1.2)\n\t\t// Get frame content size from header.\n\t\tvar hdr zstd.Header\n\t\tif err := hdr.Decode(b.data); err == nil && hdr.HasFCS && hdr.FrameContentSize < uint64(t.opt.BlockSize*2) {\n\t\t\tsz = int(hdr.FrameContentSize)\n\t\t}\n\t\tdst = z.Calloc(sz, \"Table.Decompress\")\n\t\tb.data, err = y.ZSTDDecompress(dst, b.data)\n\t\tif err != nil {\n\t\t\tz.Free(dst)\n\t\t\treturn y.Wrap(err, \"failed to decompress\")\n\t\t}\n\tdefault:\n\t\treturn errors.New(\"Unsupported compression type\")\n\t}\n\n\tif b.freeMe {\n\t\tz.Free(src)\n\t\tb.freeMe = false\n\t}\n\n\tif len(b.data) > 0 && len(dst) > 0 && &dst[0] != &b.data[0] {\n\t\tz.Free(dst)\n\t} else {\n\t\tb.freeMe = true\n\t}\n\treturn nil\n}\n"
  },
  {
    "path": "table/table_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage table\n\nimport (\n\t\"bytes\"\n\t\"crypto/sha256\"\n\t\"fmt\"\n\t\"hash/crc32\"\n\t\"math/rand\"\n\t\"os\"\n\t\"sort\"\n\t\"strings\"\n\t\"sync\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/cespare/xxhash/v2\"\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/options\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2\"\n)\n\nfunc key(prefix string, i int) string {\n\treturn prefix + fmt.Sprintf(\"%04d\", i)\n}\n\nfunc getTestTableOptions() Options {\n\treturn Options{\n\t\tCompression:          options.ZSTD,\n\t\tZSTDCompressionLevel: 15,\n\t\tBlockSize:            4 * 1024,\n\t\tBloomFalsePositive:   0.01,\n\t}\n\n}\nfunc buildTestTable(t *testing.T, prefix string, n int, opts Options) *Table {\n\tif opts.BlockSize == 0 {\n\t\topts.BlockSize = 4 * 1024\n\t}\n\ty.AssertTrue(n <= 10000)\n\tkeyValues := make([][]string, n)\n\tfor i := 0; i < n; i++ {\n\t\tk := key(prefix, i)\n\t\tv := fmt.Sprintf(\"%d\", i)\n\t\tkeyValues[i] = []string{k, v}\n\t}\n\treturn buildTable(t, keyValues, opts)\n}\n\n// keyValues is n by 2 where n is number of pairs.\nfunc buildTable(t *testing.T, keyValues [][]string, opts Options) *Table {\n\tb := NewTableBuilder(opts)\n\tdefer b.Close()\n\t// TODO: Add test for file garbage collection here. 
No files should be left after the tests here.\n\n\tfilename := fmt.Sprintf(\"%s%s%d.sst\", os.TempDir(), string(os.PathSeparator), rand.Uint32())\n\n\tsort.Slice(keyValues, func(i, j int) bool {\n\t\treturn keyValues[i][0] < keyValues[j][0]\n\t})\n\tfor _, kv := range keyValues {\n\t\ty.AssertTrue(len(kv) == 2)\n\t\tb.Add(y.KeyWithTs([]byte(kv[0]), 0),\n\t\t\ty.ValueStruct{Value: []byte(kv[1]), Meta: 'A', UserMeta: 0}, 0)\n\t}\n\ttbl, err := CreateTable(filename, b)\n\trequire.NoError(t, err, \"writing to file failed\")\n\treturn tbl\n}\n\nfunc TestTableIterator(t *testing.T) {\n\tfor _, n := range []int{99, 100, 101} {\n\t\tt.Run(fmt.Sprintf(\"n=%d\", n), func(t *testing.T) {\n\t\t\topts := getTestTableOptions()\n\t\t\ttable := buildTestTable(t, \"key\", n, opts)\n\t\t\tdefer func() { require.NoError(t, table.DecrRef()) }()\n\t\t\tit := table.NewIterator(0)\n\t\t\tdefer it.Close()\n\t\t\tcount := 0\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\tv := it.Value()\n\t\t\t\tk := y.KeyWithTs([]byte(key(\"key\", count)), 0)\n\t\t\t\trequire.EqualValues(t, k, it.Key())\n\t\t\t\trequire.EqualValues(t, fmt.Sprintf(\"%d\", count), string(v.Value))\n\t\t\t\tcount++\n\t\t\t}\n\t\t\trequire.Equal(t, count, n)\n\t\t})\n\t}\n}\n\nfunc TestSeekToFirst(t *testing.T) {\n\tfor _, n := range []int{99, 100, 101, 199, 200, 250, 9999, 10000} {\n\t\tt.Run(fmt.Sprintf(\"n=%d\", n), func(t *testing.T) {\n\t\t\topts := getTestTableOptions()\n\t\t\ttable := buildTestTable(t, \"key\", n, opts)\n\t\t\tdefer func() { require.NoError(t, table.DecrRef()) }()\n\t\t\tit := table.NewIterator(0)\n\t\t\tdefer it.Close()\n\t\t\tit.seekToFirst()\n\t\t\trequire.True(t, it.Valid())\n\t\t\tv := it.Value()\n\t\t\trequire.EqualValues(t, \"0\", string(v.Value))\n\t\t\trequire.EqualValues(t, 'A', v.Meta)\n\t\t})\n\t}\n}\n\nfunc TestSeekToLast(t *testing.T) {\n\tfor _, n := range []int{99, 100, 101, 199, 200, 250, 9999, 10000} {\n\t\tt.Run(fmt.Sprintf(\"n=%d\", n), func(t *testing.T) {\n\t\t\topts := 
getTestTableOptions()\n\t\t\ttable := buildTestTable(t, \"key\", n, opts)\n\t\t\tdefer func() { require.NoError(t, table.DecrRef()) }()\n\t\t\tit := table.NewIterator(0)\n\t\t\tdefer it.Close()\n\t\t\tit.seekToLast()\n\t\t\trequire.True(t, it.Valid())\n\t\t\tv := it.Value()\n\t\t\trequire.EqualValues(t, fmt.Sprintf(\"%d\", n-1), string(v.Value))\n\t\t\trequire.EqualValues(t, 'A', v.Meta)\n\t\t\tit.prev()\n\t\t\trequire.True(t, it.Valid())\n\t\t\tv = it.Value()\n\t\t\trequire.EqualValues(t, fmt.Sprintf(\"%d\", n-2), string(v.Value))\n\t\t\trequire.EqualValues(t, 'A', v.Meta)\n\t\t})\n\t}\n}\n\nfunc TestSeek(t *testing.T) {\n\topts := getTestTableOptions()\n\ttable := buildTestTable(t, \"k\", 10000, opts)\n\tdefer func() { require.NoError(t, table.DecrRef()) }()\n\n\tit := table.NewIterator(0)\n\tdefer it.Close()\n\n\tvar data = []struct {\n\t\tin    string\n\t\tvalid bool\n\t\tout   string\n\t}{\n\t\t{\"abc\", true, \"k0000\"},\n\t\t{\"k0100\", true, \"k0100\"},\n\t\t{\"k0100b\", true, \"k0101\"}, // Test case where we jump to next block.\n\t\t{\"k1234\", true, \"k1234\"},\n\t\t{\"k1234b\", true, \"k1235\"},\n\t\t{\"k9999\", true, \"k9999\"},\n\t\t{\"z\", false, \"\"},\n\t}\n\n\tfor _, tt := range data {\n\t\tit.seek(y.KeyWithTs([]byte(tt.in), 0))\n\t\tif !tt.valid {\n\t\t\trequire.False(t, it.Valid())\n\t\t\tcontinue\n\t\t}\n\t\trequire.True(t, it.Valid())\n\t\tk := it.Key()\n\t\trequire.EqualValues(t, tt.out, string(y.ParseKey(k)))\n\t}\n}\n\nfunc TestSeekForPrev(t *testing.T) {\n\topts := getTestTableOptions()\n\ttable := buildTestTable(t, \"k\", 10000, opts)\n\tdefer func() { require.NoError(t, table.DecrRef()) }()\n\n\tit := table.NewIterator(0)\n\tdefer it.Close()\n\n\tvar data = []struct {\n\t\tin    string\n\t\tvalid bool\n\t\tout   string\n\t}{\n\t\t{\"abc\", false, \"\"},\n\t\t{\"k0100\", true, \"k0100\"},\n\t\t{\"k0100b\", true, \"k0100\"}, // Test case where we jump to next block.\n\t\t{\"k1234\", true, \"k1234\"},\n\t\t{\"k1234b\", true, 
\"k1234\"},\n\t\t{\"k9999\", true, \"k9999\"},\n\t\t{\"z\", true, \"k9999\"},\n\t}\n\n\tfor _, tt := range data {\n\t\tit.seekForPrev(y.KeyWithTs([]byte(tt.in), 0))\n\t\tif !tt.valid {\n\t\t\trequire.False(t, it.Valid())\n\t\t\tcontinue\n\t\t}\n\t\trequire.True(t, it.Valid())\n\t\tk := it.Key()\n\t\trequire.EqualValues(t, tt.out, string(y.ParseKey(k)))\n\t}\n}\n\nfunc TestIterateFromStart(t *testing.T) {\n\t// Vary the number of elements added.\n\tfor _, n := range []int{99, 100, 101, 199, 200, 250, 9999, 10000} {\n\t\tt.Run(fmt.Sprintf(\"n=%d\", n), func(t *testing.T) {\n\t\t\topts := getTestTableOptions()\n\t\t\ttable := buildTestTable(t, \"key\", n, opts)\n\t\t\tdefer func() { require.NoError(t, table.DecrRef()) }()\n\t\t\tti := table.NewIterator(0)\n\t\t\tdefer ti.Close()\n\t\t\tti.reset()\n\t\t\tti.seekToFirst()\n\t\t\trequire.True(t, ti.Valid())\n\t\t\t// No need to do a Next.\n\t\t\t// ti.Seek brings us to the first key >= \"\". Essentially a SeekToFirst.\n\t\t\tvar count int\n\t\t\tfor ; ti.Valid(); ti.next() {\n\t\t\t\tv := ti.Value()\n\t\t\t\trequire.EqualValues(t, fmt.Sprintf(\"%d\", count), string(v.Value))\n\t\t\t\trequire.EqualValues(t, 'A', v.Meta)\n\t\t\t\tcount++\n\t\t\t}\n\t\t\trequire.EqualValues(t, n, count)\n\t\t})\n\t}\n}\n\nfunc TestIterateFromEnd(t *testing.T) {\n\t// Vary the number of elements added.\n\tfor _, n := range []int{99, 100, 101, 199, 200, 250, 9999, 10000} {\n\t\tt.Run(fmt.Sprintf(\"n=%d\", n), func(t *testing.T) {\n\t\t\topts := getTestTableOptions()\n\t\t\ttable := buildTestTable(t, \"key\", n, opts)\n\t\t\tdefer func() { require.NoError(t, table.DecrRef()) }()\n\t\t\tti := table.NewIterator(0)\n\t\t\tdefer ti.Close()\n\t\t\tti.reset()\n\t\t\tti.seek(y.KeyWithTs([]byte(\"zzzzzz\"), 0)) // Seek to end, an invalid element.\n\t\t\trequire.False(t, ti.Valid())\n\t\t\tfor i := n - 1; i >= 0; i-- {\n\t\t\t\tti.prev()\n\t\t\t\trequire.True(t, ti.Valid())\n\t\t\t\tv := ti.Value()\n\t\t\t\trequire.EqualValues(t, fmt.Sprintf(\"%d\", 
i), string(v.Value))\n\t\t\t\trequire.EqualValues(t, 'A', v.Meta)\n\t\t\t}\n\t\t\tti.prev()\n\t\t\trequire.False(t, ti.Valid())\n\t\t})\n\t}\n}\n\nfunc TestTable(t *testing.T) {\n\topts := getTestTableOptions()\n\ttable := buildTestTable(t, \"key\", 10000, opts)\n\tdefer func() { require.NoError(t, table.DecrRef()) }()\n\tti := table.NewIterator(0)\n\tdefer ti.Close()\n\tkid := 1010\n\tseek := y.KeyWithTs([]byte(key(\"key\", kid)), 0)\n\tfor ti.seek(seek); ti.Valid(); ti.next() {\n\t\tk := ti.Key()\n\t\trequire.EqualValues(t, string(y.ParseKey(k)), key(\"key\", kid))\n\t\tkid++\n\t}\n\tif kid != 10000 {\n\t\tt.Errorf(\"Expected kid: 10000. Got: %v\", kid)\n\t}\n\n\tti.seek(y.KeyWithTs([]byte(key(\"key\", 99999)), 0))\n\trequire.False(t, ti.Valid())\n\n\tti.seek(y.KeyWithTs([]byte(key(\"key\", -1)), 0))\n\trequire.True(t, ti.Valid())\n\tk := ti.Key()\n\trequire.EqualValues(t, string(y.ParseKey(k)), key(\"key\", 0))\n}\n\nfunc TestIterateBackAndForth(t *testing.T) {\n\topts := getTestTableOptions()\n\ttable := buildTestTable(t, \"key\", 10000, opts)\n\tdefer func() { require.NoError(t, table.DecrRef()) }()\n\n\tseek := y.KeyWithTs([]byte(key(\"key\", 1010)), 0)\n\tit := table.NewIterator(0)\n\tdefer it.Close()\n\tit.seek(seek)\n\trequire.True(t, it.Valid())\n\tk := it.Key()\n\trequire.EqualValues(t, seek, k)\n\n\tit.prev()\n\tit.prev()\n\trequire.True(t, it.Valid())\n\tk = it.Key()\n\trequire.EqualValues(t, key(\"key\", 1008), string(y.ParseKey(k)))\n\n\tit.next()\n\tit.next()\n\trequire.True(t, it.Valid())\n\tk = it.Key()\n\trequire.EqualValues(t, key(\"key\", 1010), y.ParseKey(k))\n\n\tit.seek(y.KeyWithTs([]byte(key(\"key\", 2000)), 0))\n\trequire.True(t, it.Valid())\n\tk = it.Key()\n\trequire.EqualValues(t, key(\"key\", 2000), y.ParseKey(k))\n\n\tit.prev()\n\trequire.True(t, it.Valid())\n\tk = it.Key()\n\trequire.EqualValues(t, key(\"key\", 1999), y.ParseKey(k))\n\n\tit.seekToFirst()\n\tk = it.Key()\n\trequire.EqualValues(t, key(\"key\", 0), 
string(y.ParseKey(k)))\n}\n\nfunc TestUniIterator(t *testing.T) {\n\topts := getTestTableOptions()\n\ttable := buildTestTable(t, \"key\", 10000, opts)\n\tdefer func() { require.NoError(t, table.DecrRef()) }()\n\t{\n\t\tit := table.NewIterator(0)\n\t\tdefer it.Close()\n\t\tvar count int\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\tv := it.Value()\n\t\t\trequire.EqualValues(t, fmt.Sprintf(\"%d\", count), string(v.Value))\n\t\t\trequire.EqualValues(t, 'A', v.Meta)\n\t\t\tcount++\n\t\t}\n\t\trequire.EqualValues(t, 10000, count)\n\t}\n\t{\n\t\tit := table.NewIterator(REVERSED)\n\t\tdefer it.Close()\n\t\tvar count int\n\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\tv := it.Value()\n\t\t\trequire.EqualValues(t, fmt.Sprintf(\"%d\", 10000-1-count), string(v.Value))\n\t\t\trequire.EqualValues(t, 'A', v.Meta)\n\t\t\tcount++\n\t\t}\n\t\trequire.EqualValues(t, 10000, count)\n\t}\n}\n\n// Try having only one table.\nfunc TestConcatIteratorOneTable(t *testing.T) {\n\topts := getTestTableOptions()\n\ttbl := buildTable(t, [][]string{\n\t\t{\"k1\", \"a1\"},\n\t\t{\"k2\", \"a2\"},\n\t}, opts)\n\tdefer func() { require.NoError(t, tbl.DecrRef()) }()\n\n\tit := NewConcatIterator([]*Table{tbl}, 0)\n\tdefer it.Close()\n\n\tit.Rewind()\n\trequire.True(t, it.Valid())\n\tk := it.Key()\n\trequire.EqualValues(t, \"k1\", string(y.ParseKey(k)))\n\tvs := it.Value()\n\trequire.EqualValues(t, \"a1\", string(vs.Value))\n\trequire.EqualValues(t, 'A', vs.Meta)\n}\n\nfunc TestConcatIterator(t *testing.T) {\n\topts := getTestTableOptions()\n\ttbl := buildTestTable(t, \"keya\", 10000, opts)\n\tdefer func() { require.NoError(t, tbl.DecrRef()) }()\n\ttbl2 := buildTestTable(t, \"keyb\", 10000, opts)\n\tdefer func() { require.NoError(t, tbl2.DecrRef()) }()\n\ttbl3 := buildTestTable(t, \"keyc\", 10000, opts)\n\tdefer func() { require.NoError(t, tbl3.DecrRef()) }()\n\n\t{\n\t\tit := NewConcatIterator([]*Table{tbl, tbl2, tbl3}, 0)\n\t\tdefer it.Close()\n\t\tit.Rewind()\n\t\trequire.True(t, 
it.Valid())\n\t\tvar count int\n\t\tfor ; it.Valid(); it.Next() {\n\t\t\tvs := it.Value()\n\t\t\trequire.EqualValues(t, fmt.Sprintf(\"%d\", count%10000), string(vs.Value))\n\t\t\trequire.EqualValues(t, 'A', vs.Meta)\n\t\t\tcount++\n\t\t}\n\t\trequire.EqualValues(t, 30000, count)\n\n\t\tit.Seek(y.KeyWithTs([]byte(\"a\"), 0))\n\t\trequire.EqualValues(t, \"keya0000\", string(y.ParseKey(it.Key())))\n\t\tvs := it.Value()\n\t\trequire.EqualValues(t, \"0\", string(vs.Value))\n\n\t\tit.Seek(y.KeyWithTs([]byte(\"keyb\"), 0))\n\t\trequire.EqualValues(t, \"keyb0000\", string(y.ParseKey(it.Key())))\n\t\tvs = it.Value()\n\t\trequire.EqualValues(t, \"0\", string(vs.Value))\n\n\t\tit.Seek(y.KeyWithTs([]byte(\"keyb9999b\"), 0))\n\t\trequire.EqualValues(t, \"keyc0000\", string(y.ParseKey(it.Key())))\n\t\tvs = it.Value()\n\t\trequire.EqualValues(t, \"0\", string(vs.Value))\n\n\t\tit.Seek(y.KeyWithTs([]byte(\"keyd\"), 0))\n\t\trequire.False(t, it.Valid())\n\t}\n\t{\n\t\tit := NewConcatIterator([]*Table{tbl, tbl2, tbl3}, REVERSED)\n\t\tdefer it.Close()\n\t\tit.Rewind()\n\t\trequire.True(t, it.Valid())\n\t\tvar count int\n\t\tfor ; it.Valid(); it.Next() {\n\t\t\tvs := it.Value()\n\t\t\trequire.EqualValues(t, fmt.Sprintf(\"%d\", 10000-(count%10000)-1), string(vs.Value))\n\t\t\trequire.EqualValues(t, 'A', vs.Meta)\n\t\t\tcount++\n\t\t}\n\t\trequire.EqualValues(t, 30000, count)\n\n\t\tit.Seek(y.KeyWithTs([]byte(\"a\"), 0))\n\t\trequire.False(t, it.Valid())\n\n\t\tit.Seek(y.KeyWithTs([]byte(\"keyb\"), 0))\n\t\trequire.EqualValues(t, \"keya9999\", string(y.ParseKey(it.Key())))\n\t\tvs := it.Value()\n\t\trequire.EqualValues(t, \"9999\", string(vs.Value))\n\n\t\tit.Seek(y.KeyWithTs([]byte(\"keyb9999b\"), 0))\n\t\trequire.EqualValues(t, \"keyb9999\", string(y.ParseKey(it.Key())))\n\t\tvs = it.Value()\n\t\trequire.EqualValues(t, \"9999\", string(vs.Value))\n\n\t\tit.Seek(y.KeyWithTs([]byte(\"keyd\"), 0))\n\t\trequire.EqualValues(t, \"keyc9999\", string(y.ParseKey(it.Key())))\n\t\tvs = 
it.Value()\n\t\trequire.EqualValues(t, \"9999\", string(vs.Value))\n\t}\n}\n\nfunc TestMergingIterator(t *testing.T) {\n\topts := getTestTableOptions()\n\ttbl1 := buildTable(t, [][]string{\n\t\t{\"k1\", \"a1\"},\n\t\t{\"k4\", \"a4\"},\n\t\t{\"k5\", \"a5\"},\n\t}, opts)\n\tdefer func() { require.NoError(t, tbl1.DecrRef()) }()\n\n\ttbl2 := buildTable(t, [][]string{\n\t\t{\"k2\", \"b2\"},\n\t\t{\"k3\", \"b3\"},\n\t\t{\"k4\", \"b4\"},\n\t}, opts)\n\tdefer func() { require.NoError(t, tbl2.DecrRef()) }()\n\n\texpected := []struct {\n\t\tkey   string\n\t\tvalue string\n\t}{\n\t\t{\"k1\", \"a1\"},\n\t\t{\"k2\", \"b2\"},\n\t\t{\"k3\", \"b3\"},\n\t\t{\"k4\", \"a4\"},\n\t\t{\"k5\", \"a5\"},\n\t}\n\n\tit1 := tbl1.NewIterator(0)\n\tit2 := NewConcatIterator([]*Table{tbl2}, 0)\n\tit := NewMergeIterator([]y.Iterator{it1, it2}, false)\n\tdefer it.Close()\n\n\tvar i int\n\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\tk := it.Key()\n\t\tvs := it.Value()\n\t\trequire.EqualValues(t, expected[i].key, string(y.ParseKey(k)))\n\t\trequire.EqualValues(t, expected[i].value, string(vs.Value))\n\t\trequire.EqualValues(t, 'A', vs.Meta)\n\t\ti++\n\t}\n\trequire.Equal(t, i, len(expected))\n\trequire.False(t, it.Valid())\n}\n\nfunc TestMergingIteratorReversed(t *testing.T) {\n\topts := getTestTableOptions()\n\ttbl1 := buildTable(t, [][]string{\n\t\t{\"k1\", \"a1\"},\n\t\t{\"k2\", \"a2\"},\n\t\t{\"k4\", \"a4\"},\n\t\t{\"k5\", \"a5\"},\n\t}, opts)\n\tdefer func() { require.NoError(t, tbl1.DecrRef()) }()\n\n\ttbl2 := buildTable(t, [][]string{\n\t\t{\"k1\", \"b2\"},\n\t\t{\"k3\", \"b3\"},\n\t\t{\"k4\", \"b4\"},\n\t\t{\"k5\", \"b5\"},\n\t}, opts)\n\tdefer func() { require.NoError(t, tbl2.DecrRef()) }()\n\n\texpected := []struct {\n\t\tkey   string\n\t\tvalue string\n\t}{\n\t\t{\"k5\", \"a5\"},\n\t\t{\"k4\", \"a4\"},\n\t\t{\"k3\", \"b3\"},\n\t\t{\"k2\", \"a2\"},\n\t\t{\"k1\", \"a1\"},\n\t}\n\n\tit1 := tbl1.NewIterator(REVERSED)\n\tit2 := NewConcatIterator([]*Table{tbl2}, REVERSED)\n\tit := 
NewMergeIterator([]y.Iterator{it1, it2}, true)\n\tdefer it.Close()\n\n\tvar i int\n\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\tk := it.Key()\n\t\tvs := it.Value()\n\t\trequire.EqualValues(t, expected[i].key, string(y.ParseKey(k)))\n\t\trequire.EqualValues(t, expected[i].value, string(vs.Value))\n\t\trequire.EqualValues(t, 'A', vs.Meta)\n\t\ti++\n\t}\n\n\trequire.Equal(t, i, len(expected))\n\trequire.False(t, it.Valid())\n}\n\n// Take only the first iterator.\nfunc TestMergingIteratorTakeOne(t *testing.T) {\n\topts := getTestTableOptions()\n\tt1 := buildTable(t, [][]string{\n\t\t{\"k1\", \"a1\"},\n\t\t{\"k2\", \"a2\"},\n\t}, opts)\n\tdefer func() { require.NoError(t, t1.DecrRef()) }()\n\tt2 := buildTable(t, [][]string{{\"l1\", \"b1\"}}, opts)\n\tdefer func() { require.NoError(t, t2.DecrRef()) }()\n\n\tit1 := NewConcatIterator([]*Table{t1}, 0)\n\tit2 := NewConcatIterator([]*Table{t2}, 0)\n\tit := NewMergeIterator([]y.Iterator{it1, it2}, false)\n\tdefer it.Close()\n\n\tit.Rewind()\n\trequire.True(t, it.Valid())\n\tk := it.Key()\n\trequire.EqualValues(t, \"k1\", string(y.ParseKey(k)))\n\tvs := it.Value()\n\trequire.EqualValues(t, \"a1\", string(vs.Value))\n\trequire.EqualValues(t, 'A', vs.Meta)\n\tit.Next()\n\n\trequire.True(t, it.Valid())\n\tk = it.Key()\n\trequire.EqualValues(t, \"k2\", string(y.ParseKey(k)))\n\tvs = it.Value()\n\trequire.EqualValues(t, \"a2\", string(vs.Value))\n\trequire.EqualValues(t, 'A', vs.Meta)\n\tit.Next()\n\n\tk = it.Key()\n\trequire.EqualValues(t, \"l1\", string(y.ParseKey(k)))\n\tvs = it.Value()\n\trequire.EqualValues(t, \"b1\", string(vs.Value))\n\trequire.EqualValues(t, 'A', vs.Meta)\n\tit.Next()\n\n\trequire.False(t, it.Valid())\n}\n\n// Take only the second iterator.\nfunc TestMergingIteratorTakeTwo(t *testing.T) {\n\topts := getTestTableOptions()\n\tt1 := buildTable(t, [][]string{{\"l1\", \"b1\"}}, opts)\n\tdefer func() { require.NoError(t, t1.DecrRef()) }()\n\n\tt2 := buildTable(t, [][]string{\n\t\t{\"k1\", \"a1\"},\n\t\t{\"k2\", 
\"a2\"},\n\t}, opts)\n\tdefer func() { require.NoError(t, t2.DecrRef()) }()\n\n\tit1 := NewConcatIterator([]*Table{t1}, 0)\n\tit2 := NewConcatIterator([]*Table{t2}, 0)\n\tit := NewMergeIterator([]y.Iterator{it1, it2}, false)\n\tdefer it.Close()\n\n\tit.Rewind()\n\trequire.True(t, it.Valid())\n\tk := it.Key()\n\trequire.EqualValues(t, \"k1\", string(y.ParseKey(k)))\n\tvs := it.Value()\n\trequire.EqualValues(t, \"a1\", string(vs.Value))\n\trequire.EqualValues(t, 'A', vs.Meta)\n\tit.Next()\n\n\trequire.True(t, it.Valid())\n\tk = it.Key()\n\trequire.EqualValues(t, \"k2\", string(y.ParseKey(k)))\n\tvs = it.Value()\n\trequire.EqualValues(t, \"a2\", string(vs.Value))\n\trequire.EqualValues(t, 'A', vs.Meta)\n\tit.Next()\n\trequire.True(t, it.Valid())\n\n\tk = it.Key()\n\trequire.EqualValues(t, \"l1\", string(y.ParseKey(k)))\n\tvs = it.Value()\n\trequire.EqualValues(t, \"b1\", string(vs.Value))\n\trequire.EqualValues(t, 'A', vs.Meta)\n\tit.Next()\n\n\trequire.False(t, it.Valid())\n}\n\nfunc TestTableBigValues(t *testing.T) {\n\tvalue := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"%01048576d\", i)) // Return 1MB value which is > math.MaxUint16.\n\t}\n\n\trand.Seed(time.Now().UnixNano())\n\n\tn := 100 // Insert 100 keys.\n\topts := Options{Compression: options.ZSTD, BlockSize: 4 * 1024, BloomFalsePositive: 0.01,\n\t\tTableSize: uint64(n) * 1 << 20}\n\tbuilder := NewTableBuilder(opts)\n\tdefer builder.Close()\n\tfor i := 0; i < n; i++ {\n\t\tkey := y.KeyWithTs([]byte(key(\"\", i)), uint64(i+1))\n\t\tvs := y.ValueStruct{Value: value(i)}\n\t\tbuilder.Add(key, vs, 0)\n\t}\n\n\tfilename := fmt.Sprintf(\"%s%s%d.sst\", os.TempDir(), string(os.PathSeparator), rand.Uint32())\n\ttbl, err := CreateTable(filename, builder)\n\trequire.NoError(t, err, \"unable to open table\")\n\tdefer func() { require.NoError(t, tbl.DecrRef()) }()\n\n\titr := tbl.NewIterator(0)\n\trequire.True(t, itr.Valid())\n\n\tcount := 0\n\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\trequire.Equal(t, 
[]byte(key(\"\", count)), y.ParseKey(itr.Key()), \"keys are not equal\")\n\t\trequire.Equal(t, value(count), itr.Value().Value, \"values are not equal\")\n\t\tcount++\n\t}\n\trequire.False(t, itr.Valid(), \"table iterator should be invalid now\")\n\trequire.Equal(t, n, count)\n\trequire.Equal(t, n, int(tbl.MaxVersion()))\n}\n\n// This test is for verifying checksum failure during table open.\nfunc TestTableChecksum(t *testing.T) {\n\trand.Seed(time.Now().Unix())\n\t// we are going to write random byte at random location in table file.\n\trb := make([]byte, 100)\n\trand.Read(rb)\n\topts := getTestTableOptions()\n\topts.ChkMode = options.OnTableAndBlockRead\n\t// When verifying checksum capability, we find it simpler to disable compression\n\t// since randomly initializing bytes can kill the compression storage.\n\topts.Compression = options.None\n\ttbl := buildTestTable(t, \"k\", 10000, opts)\n\tdefer func() { require.NoError(t, tbl.DecrRef()) }()\n\t// Write random bytes at location guaranteed to not be in range of\n\t// metadata for block. 
(No particular reason for the value 128,\n\t// it just avoids the sensitive block size or other metadata blocks).\n\tstart := 128\n\tn := copy(tbl.Data[start:], rb)\n\trequire.Equal(t, n, len(rb))\n\n\trequire.Panics(t, func() {\n\t\t// Either OpenTable will panic on corrupted data or the checksum verification will fail.\n\t\t_, err := OpenTable(tbl.MmapFile, opts)\n\t\tif strings.Contains(err.Error(), \"checksum\") {\n\t\t\tpanic(\"checksum mismatch\")\n\t\t} else {\n\t\t\trequire.NoError(t, err)\n\t\t}\n\t})\n}\n\nvar cacheConfig = ristretto.Config[[]byte, *Block]{\n\tNumCounters: 1000000 * 10,\n\tMaxCost:     1000000,\n\tBufferItems: 64,\n\tMetrics:     true,\n}\n\nfunc BenchmarkRead(b *testing.B) {\n\tn := int(5 * 1e6)\n\ttbl := getTableForBenchmarks(b, n, nil)\n\tdefer func() { _ = tbl.DecrRef() }()\n\n\tb.ResetTimer()\n\t// Iterate b.N times over the entire table.\n\tfor i := 0; i < b.N; i++ {\n\t\tfunc() {\n\t\t\tit := tbl.NewIterator(0)\n\t\t\tdefer it.Close()\n\t\t\tfor it.seekToFirst(); it.Valid(); it.next() {\n\t\t\t}\n\t\t}()\n\t}\n}\n\nfunc BenchmarkReadAndBuild(b *testing.B) {\n\tn := int(5 * 1e6)\n\n\tvar cache, _ = ristretto.NewCache(&cacheConfig)\n\ttbl := getTableForBenchmarks(b, n, cache)\n\tdefer func() { _ = tbl.DecrRef() }()\n\n\tb.ResetTimer()\n\t// Iterate b.N times over the entire table.\n\tfor i := 0; i < b.N; i++ {\n\t\tfunc() {\n\t\t\topts := Options{Compression: options.ZSTD, BlockSize: 4 * 1024, BloomFalsePositive: 0.01}\n\t\t\topts.BlockCache = cache\n\t\t\tnewBuilder := NewTableBuilder(opts)\n\t\t\tit := tbl.NewIterator(0)\n\t\t\tdefer it.Close()\n\t\t\tfor it.seekToFirst(); it.Valid(); it.next() {\n\t\t\t\tvs := it.Value()\n\t\t\t\tnewBuilder.Add(it.Key(), vs, 0)\n\t\t\t}\n\t\t\tnewBuilder.Finish()\n\t\t}()\n\t}\n}\n\nfunc BenchmarkReadMerged(b *testing.B) {\n\tn := int(5 * 1e6)\n\tm := 5 // Number of tables.\n\ty.AssertTrue((n % m) == 0)\n\ttableSize := n / m\n\tvar tables []*Table\n\n\tvar cache, err = 
ristretto.NewCache(&cacheConfig)\n\trequire.NoError(b, err)\n\n\tfor i := 0; i < m; i++ {\n\t\tfilename := fmt.Sprintf(\"%s%s%d.sst\", os.TempDir(), string(os.PathSeparator), rand.Uint32())\n\t\topts := Options{Compression: options.ZSTD, BlockSize: 4 * 1024, BloomFalsePositive: 0.01}\n\t\topts.BlockCache = cache\n\t\tbuilder := NewTableBuilder(opts)\n\t\tfor j := 0; j < tableSize; j++ {\n\t\t\tid := j*m + i // Arrays are interleaved.\n\t\t\t// id := i*tableSize+j (not interleaved)\n\t\t\tk := fmt.Sprintf(\"%016x\", id)\n\t\t\tv := fmt.Sprintf(\"%d\", id)\n\t\t\tbuilder.Add([]byte(k), y.ValueStruct{Value: []byte(v), Meta: 123, UserMeta: 0}, 0)\n\t\t}\n\t\ttbl, err := CreateTable(filename, builder)\n\t\ty.Check(err)\n\t\tbuilder.Close()\n\t\ttables = append(tables, tbl)\n\t\tdefer func() { _ = tbl.DecrRef() }()\n\t}\n\n\tb.ResetTimer()\n\t// Iterate b.N times over the entire table.\n\tfor i := 0; i < b.N; i++ {\n\t\tfunc() {\n\t\t\tvar iters []y.Iterator\n\t\t\tfor _, tbl := range tables {\n\t\t\t\titers = append(iters, tbl.NewIterator(0))\n\t\t\t}\n\t\t\tit := NewMergeIterator(iters, false)\n\t\t\tdefer it.Close()\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t}\n\t\t}()\n\t}\n}\n\nfunc BenchmarkChecksum(b *testing.B) {\n\tkeySz := []int{KB, 2 * KB, 4 * KB, 8 * KB, 16 * KB, 32 * KB, 64 * KB, 128 * KB, 256 * KB, MB}\n\tfor _, kz := range keySz {\n\t\tkey := make([]byte, kz)\n\t\tb.Run(fmt.Sprintf(\"CRC %d\", kz), func(b *testing.B) {\n\t\t\tfor i := 0; i < b.N; i++ {\n\t\t\t\tcrc32.ChecksumIEEE(key)\n\t\t\t}\n\t\t})\n\t\tb.Run(fmt.Sprintf(\"xxHash64 %d\", kz), func(b *testing.B) {\n\t\t\tfor i := 0; i < b.N; i++ {\n\t\t\t\txxhash.Sum64(key)\n\t\t\t}\n\t\t})\n\t\tb.Run(fmt.Sprintf(\"SHA256 %d\", kz), func(b *testing.B) {\n\t\t\tfor i := 0; i < b.N; i++ {\n\t\t\t\tsha256.Sum256(key)\n\t\t\t}\n\t\t})\n\t}\n}\n\nfunc BenchmarkRandomRead(b *testing.B) {\n\tn := int(5 * 1e6)\n\ttbl := getTableForBenchmarks(b, n, nil)\n\tdefer func() { _ = tbl.DecrRef() }()\n\n\tr 
:= rand.New(rand.NewSource(time.Now().Unix()))\n\n\tb.ResetTimer()\n\tfor i := 0; i < b.N; i++ {\n\t\titr := tbl.NewIterator(0)\n\t\tno := r.Intn(n)\n\t\tk := []byte(fmt.Sprintf(\"%016x\", no))\n\t\tv := []byte(fmt.Sprintf(\"%d\", no))\n\t\titr.Seek(k)\n\t\tif !itr.Valid() {\n\t\t\tb.Fatal(\"itr should be valid\")\n\t\t}\n\t\tv1 := itr.Value().Value\n\n\t\tif !bytes.Equal(v, v1) {\n\t\t\tfmt.Println(\"value does not match\")\n\t\t\tb.Fatal()\n\t\t}\n\t\titr.Close()\n\t}\n}\n\nfunc getTableForBenchmarks(b *testing.B, count int, cache *ristretto.Cache[[]byte, *Block]) *Table {\n\trand.Seed(time.Now().Unix())\n\topts := Options{Compression: options.ZSTD, BlockSize: 4 * 1024, BloomFalsePositive: 0.01}\n\tif cache == nil {\n\t\tvar err error\n\t\tcache, err = ristretto.NewCache(&cacheConfig)\n\t\trequire.NoError(b, err)\n\t}\n\topts.BlockCache = cache\n\tbuilder := NewTableBuilder(opts)\n\tdefer builder.Close()\n\tfilename := fmt.Sprintf(\"%s%s%d.sst\", os.TempDir(), string(os.PathSeparator), rand.Uint32())\n\tfor i := 0; i < count; i++ {\n\t\tk := fmt.Sprintf(\"%016x\", i)\n\t\tv := fmt.Sprintf(\"%d\", i)\n\t\tbuilder.Add([]byte(k), y.ValueStruct{Value: []byte(v)}, 0)\n\t}\n\n\ttbl, err := CreateTable(filename, builder)\n\trequire.NoError(b, err, \"unable to open table\")\n\treturn tbl\n}\n\nfunc TestMain(m *testing.M) {\n\trand.Seed(time.Now().UTC().UnixNano())\n\tos.Exit(m.Run())\n}\n\n// Run this test with command \"go test -race -run TestDoesNotHaveRace\"\nfunc TestDoesNotHaveRace(t *testing.T) {\n\topts := getTestTableOptions()\n\ttable := buildTestTable(t, \"key\", 10000, opts)\n\tdefer func() { require.NoError(t, table.DecrRef()) }()\n\n\tvar wg sync.WaitGroup\n\twg.Add(5)\n\tfor i := 0; i < 5; i++ {\n\t\tgo func() {\n\t\t\trequire.True(t, table.DoesNotHave(uint32(1237882)))\n\t\t\twg.Done()\n\t\t}()\n\t}\n\twg.Wait()\n}\n\nfunc TestMaxVersion(t *testing.T) {\n\topt := getTestTableOptions()\n\tb := NewTableBuilder(opt)\n\tdefer b.Close()\n\n\tfilename := 
fmt.Sprintf(\"%s%s%d.sst\", os.TempDir(), string(os.PathSeparator), rand.Uint32())\n\tN := 1000\n\tfor i := 0; i < N; i++ {\n\t\tb.Add(y.KeyWithTs([]byte(fmt.Sprintf(\"foo:%d\", i)), uint64(i+1)), y.ValueStruct{}, 0)\n\t}\n\ttable, err := CreateTable(filename, b)\n\trequire.NoError(t, err)\n\trequire.Equal(t, N, int(table.MaxVersion()))\n}\n"
  },
  {
    "path": "test.sh",
    "content": "#!/bin/bash\n\nset -eo pipefail\n\ngo version\n\n# Check if Github Actions is running\nif [ \"$CI\" = \"true\" ]; then\n\t# Enable code coverage\n\t# export because tests run in a subprocess\n\texport covermode=\"-covermode=atomic\"\n\texport coverprofile=\"-coverprofile=cover_tmp.out\"\n\techo \"mode: atomic\" >>cover.out\nfi\n\n# Run `go list` BEFORE setting GOFLAGS so that the output is in the right\n# format for grep.\n# export packages because the test will run in a sub process.\nexport packages=$(go list ./... | grep \"github.com/dgraph-io/badger/v4/\")\n\ntags=\"-tags=jemalloc\"\n\n# Compile the Badger binary\npushd badger\ngo build -v $tags .\npopd\n\n# Run the memory intensive tests first.\nmanual() {\n\ttimeout=\"-timeout 5m\"\n\techo \"==> Running package tests for $packages\"\n\tset -e\n\tgo env -w GOTOOLCHAIN=go1.25.0+auto\n\tfor pkg in $packages; do\n\t\techo \"===> Testing $pkg\"\n\t\tgo test $tags -timeout=25m $covermode $coverprofile -failfast -race -parallel 16 $pkg && write_coverage || return 1\n\tdone\n\techo \"==> DONE package tests\"\n\n\techo \"==> Running manual tests\"\n\t# Run the special Truncate test.\n\trm -rf p\n\tset -e\n\tgo test $tags $timeout $covermode $coverprofile -run='TestTruncateVlogNoClose$' -failfast --manual=true && write_coverage || return 1\n\ttruncate --size=4096 p/000000.vlog\n\tgo test $tags $timeout $covermode $coverprofile -run='TestTruncateVlogNoClose2$' -failfast --manual=true && write_coverage || return 1\n\tgo test $tags $timeout $covermode $coverprofile -run='TestTruncateVlogNoClose3$' -failfast --manual=true && write_coverage || return 1\n\trm -rf p\n\n\t# TODO(ibrahim): Let's make these tests have Manual prefix.\n\t# go test $tags -run='TestManual' --manual=true --parallel=2\n\t# TestWriteBatch\n\t# TestValueGCManaged\n\t# TestDropPrefix\n\t# TestDropAllManaged\n\tgo test $tags $timeout $covermode $coverprofile -failfast -run='TestBigKeyValuePairs$' --manual=true && write_coverage || return 
1\n\tgo test $tags $timeout $covermode $coverprofile -failfast -run='TestPushValueLogLimit' --manual=true && write_coverage || return 1\n\tgo test $tags $timeout $covermode $coverprofile -failfast -run='TestKeyCount' --manual=true && write_coverage || return 1\n\tgo test $tags $timeout $covermode $coverprofile -failfast -run='TestIteratePrefix' --manual=true && write_coverage || return 1\n\tgo test $tags $timeout $covermode $coverprofile -failfast -run='TestIterateParallel' --manual=true && write_coverage || return 1\n\tgo test $tags $timeout $covermode $coverprofile -failfast -run='TestBigStream' --manual=true && write_coverage || return 1\n\tgo test $tags $timeout $covermode $coverprofile -failfast -run='TestGoroutineLeak' --manual=true && write_coverage || return 1\n\tgo test $tags $timeout $covermode $coverprofile -failfast -run='TestGetMore' --manual=true && write_coverage || return 1\n\n\techo \"==> DONE manual tests\"\n}\n\nroot() {\n\t# Run the normal tests.\n\t# go test -timeout=25m -v -race github.com/dgraph-io/badger/v4/...\n\n\techo \"==> Running root level tests.\"\n\tgo test $tags -v -race -parallel=16 -timeout=25m -failfast $covermode $coverprofile . 
&& write_coverage || return 1\n\techo \"==> DONE root level tests\"\n}\n\nstream() {\n\tset -eo pipefail\n\tpushd badger\n\tbaseDir=$(mktemp -d -p .)\n\t./badger benchmark write -s --dir=$baseDir/test | tee $baseDir/log.txt\n\t./badger benchmark read --dir=$baseDir/test --full-scan | tee --append $baseDir/log.txt\n\t./badger benchmark read --dir=$baseDir/test -d=30s | tee --append $baseDir/log.txt\n\t./badger stream --dir=$baseDir/test -o \"$baseDir/test2\" | tee --append $baseDir/log.txt\n\tcount=$(cat \"$baseDir/log.txt\" | grep \"at program end: 0 B\" | wc -l)\n\trm -rf $baseDir\n\tif [ $count -ne 4 ]; then\n\t\techo \"LEAK detected in Badger stream.\"\n\t\treturn 1\n\tfi\n\techo \"==> DONE stream test\"\n\tpopd\n\treturn 0\n}\n\nwrite_coverage() {\n\tif [[ $CI == \"true\" ]]; then\n\t\tif [[ -f cover_tmp.out ]]; then\n\t\t\tsed -i '1d' cover_tmp.out\n\t\t\tcat cover_tmp.out >>cover.out && rm cover_tmp.out\n\t\tfi\n\tfi\n\n}\n\n# parallel tests currently not working\n# parallel --halt now,fail=1 --progress --line-buffer ::: stream manual root\n# run tests in sequence\nroot\nstream\nmanual\n"
  },
  {
    "path": "test_extensions.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\n// Important: Do NOT import the \"testing\" package, as otherwise, that\n// will pull in imports into the production class that we do not want.\n\n// TODO: Consider using this with specific compilation tags so that it only\n//       shows up when performing testing (e.g., specify build tag=unit).\n//       We are not yet ready to do that, as it may impact customer usage as\n//       well as requiring us to update the CI build flags. Moreover, the\n//       current model does not actually incur any significant cost.\n//       If we do this, we will also want to introduce a parallel file that\n//       overrides some of these structs and functions with empty contents.\n\n// String constants for messages to be pushed to syncChan.\nconst (\n\tupdateDiscardStatsMsg = \"updateDiscardStats iteration done\"\n\tendVLogInitMsg        = \"End: vlog.init(db)\"\n)\n\n// testOnlyOptions specifies an extension to the type Options that we want to\n// use only in the context of testing.\ntype testOnlyOptions struct {\n\t// syncChan is used to listen for specific messages related to activities\n\t// that can occur in a DB instance. Currently, this is only used in\n\t// testing activities.\n\tsyncChan chan string\n}\n\n// testOnlyDBExtensions specifies an extension to the type DB that we want to\n// use only in the context of testing.\ntype testOnlyDBExtensions struct {\n\tsyncChan chan string\n\n\t// onCloseDiscardCapture will be populated by a DB instance during the\n\t// process of performing the Close operation. Currently, we only consider\n\t// using this during testing.\n\tonCloseDiscardCapture map[uint64]uint64\n}\n\n// logToSyncChan sends a message to the DB's syncChan. 
Note that we expect\n// that the DB never closes this channel; the responsibility for\n// allocating and closing the channel belongs to the test module.\n// If db.syncChan is nil (never initialized), the message is silently\n// dropped.\nfunc (db *DB) logToSyncChan(msg string) {\n\tif db.syncChan != nil {\n\t\tdb.syncChan <- msg\n\t}\n}\n\n// captureDiscardStats copies the contents of the discardStats file\n// maintained by vlog into the db.onCloseDiscardCapture map.\n// If db.onCloseDiscardCapture is nil (as expected for a production\n// system as opposed to a test system), this is a no-op.\nfunc (db *DB) captureDiscardStats() {\n\tif db.onCloseDiscardCapture != nil {\n\t\tdb.vlog.discardStats.Lock()\n\t\tdb.vlog.discardStats.Iterate(func(id, val uint64) {\n\t\t\tdb.onCloseDiscardCapture[id] = val\n\t\t})\n\t\tdb.vlog.discardStats.Unlock()\n\t}\n}\n"
  },
  {
    "path": "trie/trie.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage trie\n\nimport (\n\t\"fmt\"\n\t\"strconv\"\n\t\"strings\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\ntype node struct {\n\tchildren map[byte]*node\n\tignore   *node\n\tids      []uint64\n}\n\nfunc (n *node) isEmpty() bool {\n\treturn len(n.children) == 0 && len(n.ids) == 0 && n.ignore == nil\n}\n\nfunc newNode() *node {\n\treturn &node{\n\t\tchildren: make(map[byte]*node),\n\t\tids:      []uint64{},\n\t}\n}\n\n// Trie datastructure.\ntype Trie struct {\n\troot *node\n}\n\n// NewTrie returns Trie.\nfunc NewTrie() *Trie {\n\treturn &Trie{\n\t\troot: newNode(),\n\t}\n}\n\n// parseIgnoreBytes would parse the ignore string, and convert it into a list of bools, where\n// bool[idx] = true implies that key[idx] can be ignored during comparison.\nfunc parseIgnoreBytes(ig string) ([]bool, error) {\n\tvar out []bool\n\tif ig == \"\" {\n\t\treturn out, nil\n\t}\n\n\tfor _, each := range strings.Split(strings.TrimSpace(ig), \",\") {\n\t\tr := strings.Split(strings.TrimSpace(each), \"-\")\n\t\tif len(r) == 0 || len(r) > 2 {\n\t\t\treturn out, fmt.Errorf(\"Invalid range: %s\", each)\n\t\t}\n\t\tstart, end := -1, -1 //nolint:ineffassign\n\t\tif len(r) == 2 {\n\t\t\tidx, err := strconv.Atoi(strings.TrimSpace(r[1]))\n\t\t\tif err != nil {\n\t\t\t\treturn out, err\n\t\t\t}\n\t\t\tend = idx\n\t\t}\n\t\t{\n\t\t\t// Always consider r[0]\n\t\t\tidx, err := strconv.Atoi(strings.TrimSpace(r[0]))\n\t\t\tif err != nil {\n\t\t\t\treturn out, err\n\t\t\t}\n\t\t\tstart = idx\n\t\t}\n\t\tif start == -1 {\n\t\t\treturn out, fmt.Errorf(\"Invalid range: %s\", each)\n\t\t}\n\t\tfor start >= len(out) {\n\t\t\tout = append(out, false)\n\t\t}\n\t\tfor end >= len(out) { // end could be -1, so do have the start loop above.\n\t\t\tout = append(out, false)\n\t\t}\n\t\tif end == -1 {\n\t\t\tout[start] = true\n\t\t} else 
{\n\t\t\tfor i := start; i <= end; i++ {\n\t\t\t\tout[i] = true\n\t\t\t}\n\t\t}\n\t}\n\treturn out, nil\n}\n\n// Add adds the id in the trie for the given prefix path.\nfunc (t *Trie) Add(prefix []byte, id uint64) {\n\tm := pb.Match{\n\t\tPrefix: prefix,\n\t}\n\ty.Check(t.AddMatch(m, id))\n}\n\n// AddMatch allows you to send in a prefix match, with \"holes\" in the prefix. The holes are\n// specified via IgnoreBytes in a comma-separated list of indices starting from 0. A dash can be\n// used to denote a range. Valid example is \"3, 5-8, 10, 12-15\". Length of IgnoreBytes does not need\n// to match the length of the Prefix passed.\n//\n// Consider a prefix = \"aaaa\". If the IgnoreBytes is set to \"0, 2\", then along with key \"aaaa...\",\n// a key \"baba...\" would also match.\nfunc (t *Trie) AddMatch(m pb.Match, id uint64) error {\n\treturn t.fix(m, id, set)\n}\n\nconst (\n\tset = iota\n\tdel\n)\n\nfunc (t *Trie) fix(m pb.Match, id uint64, op int) error {\n\tcurNode := t.root\n\n\tignore, err := parseIgnoreBytes(m.IgnoreBytes)\n\tif err != nil {\n\t\treturn fmt.Errorf(\"while parsing ignore bytes: %s: %w\", m.IgnoreBytes, err)\n\t}\n\tfor len(ignore) < len(m.Prefix) {\n\t\tignore = append(ignore, false)\n\t}\n\tfor idx, byt := range m.Prefix {\n\t\tvar child *node\n\t\tif ignore[idx] {\n\t\t\tchild = curNode.ignore\n\t\t\tif child == nil {\n\t\t\t\tif op == del {\n\t\t\t\t\t// No valid node found for delete operation. Return immediately.\n\t\t\t\t\treturn nil\n\t\t\t\t}\n\t\t\t\tchild = newNode()\n\t\t\t\tcurNode.ignore = child\n\t\t\t}\n\t\t} else {\n\t\t\tchild = curNode.children[byt]\n\t\t\tif child == nil {\n\t\t\t\tif op == del {\n\t\t\t\t\t// No valid node found for delete operation. 
Return immediately.\n\t\t\t\t\treturn nil\n\t\t\t\t}\n\t\t\t\tchild = newNode()\n\t\t\t\tcurNode.children[byt] = child\n\t\t\t}\n\t\t}\n\t\tcurNode = child\n\t}\n\n\t// Only the last node on the prefix path stores (or drops) the id.\n\tif op == set {\n\t\tcurNode.ids = append(curNode.ids, id)\n\t} else if op == del {\n\t\tout := curNode.ids[:0]\n\t\tfor _, cid := range curNode.ids {\n\t\t\tif id != cid {\n\t\t\t\tout = append(out, cid)\n\t\t\t}\n\t\t}\n\t\tcurNode.ids = out\n\t} else {\n\t\ty.AssertTrue(false)\n\t}\n\treturn nil\n}\n\n// Get returns the ids of all prefixes that match the given key.\nfunc (t *Trie) Get(key []byte) map[uint64]struct{} {\n\treturn t.get(t.root, key)\n}\n\n// get collects the ids stored at curNode and, recursively, at every node\n// reachable by consuming key through exact-byte children and ignore nodes.\nfunc (t *Trie) get(curNode *node, key []byte) map[uint64]struct{} {\n\ty.AssertTrue(curNode != nil)\n\n\tout := make(map[uint64]struct{})\n\t// If any node in the path of the key has ids, pick them up.\n\t// This would also match nil prefixes.\n\tfor _, i := range curNode.ids {\n\t\tout[i] = struct{}{}\n\t}\n\tif len(key) == 0 {\n\t\treturn out\n\t}\n\n\t// If we found an ignore node, traverse that path.\n\tif curNode.ignore != nil {\n\t\tres := t.get(curNode.ignore, key[1:])\n\t\tfor id := range res {\n\t\t\tout[id] = struct{}{}\n\t\t}\n\t}\n\n\tif child := curNode.children[key[0]]; child != nil {\n\t\tres := t.get(child, key[1:])\n\t\tfor id := range res {\n\t\t\tout[id] = struct{}{}\n\t\t}\n\t}\n\treturn out\n}\n\n// removeEmpty prunes empty descendants of curNode and reports whether\n// curNode itself is empty afterwards.\nfunc removeEmpty(curNode *node) bool {\n\t// Go depth first.\n\tif curNode.ignore != nil {\n\t\tif empty := removeEmpty(curNode.ignore); empty {\n\t\t\tcurNode.ignore = nil\n\t\t}\n\t}\n\n\tfor byt, n := range curNode.children {\n\t\tif empty := removeEmpty(n); empty {\n\t\t\tdelete(curNode.children, byt)\n\t\t}\n\t}\n\n\treturn curNode.isEmpty()\n}\n\n// Delete removes the id if it exists at the given prefix path.\nfunc (t *Trie) Delete(prefix []byte, id uint64) error {\n\treturn t.DeleteMatch(pb.Match{Prefix: prefix}, id)\n}\n\n// DeleteMatch removes the id registered for the given match, if present.\nfunc (t *Trie) DeleteMatch(m pb.Match, id 
uint64) error {\n\tif err := t.fix(m, id, del); err != nil {\n\t\treturn err\n\t}\n\t// Recursively remove empty nodes.\n\t// t.root itself is kept even if it is empty.\n\tremoveEmpty(t.root)\n\treturn nil\n}\n\n// numNodes returns the number of nodes in the subtree rooted at curNode,\n// counting curNode itself.\nfunc numNodes(curNode *node) int {\n\tif curNode == nil {\n\t\treturn 0\n\t}\n\n\tnum := numNodes(curNode.ignore)\n\tfor _, n := range curNode.children {\n\t\tnum += numNodes(n)\n\t}\n\treturn num + 1\n}\n"
  },
  {
    "path": "trie/trie_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage trie\n\nimport (\n\t\"sort\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n)\n\nfunc TestGet(t *testing.T) {\n\ttrie := NewTrie()\n\ttrie.Add([]byte(\"hello\"), 1)\n\ttrie.Add([]byte(\"hello\"), 3)\n\ttrie.Add([]byte(\"hello\"), 4)\n\ttrie.Add([]byte(\"hel\"), 20)\n\ttrie.Add([]byte(\"he\"), 20)\n\ttrie.Add([]byte(\"badger\"), 30)\n\n\ttrie.Add(nil, 10)\n\trequire.Equal(t, map[uint64]struct{}{10: {}}, trie.Get([]byte(\"A\")))\n\n\tids := trie.Get([]byte(\"hel\"))\n\trequire.Equal(t, 2, len(ids))\n\trequire.Equal(t, map[uint64]struct{}{10: {}, 20: {}}, ids)\n\n\tids = trie.Get([]byte(\"badger\"))\n\trequire.Equal(t, 2, len(ids))\n\trequire.Equal(t, map[uint64]struct{}{10: {}, 30: {}}, ids)\n\n\tids = trie.Get([]byte(\"hello\"))\n\trequire.Equal(t, 5, len(ids))\n\trequire.Equal(t, map[uint64]struct{}{10: {}, 1: {}, 3: {}, 4: {}, 20: {}}, ids)\n\n\ttrie.Add([]byte{}, 11)\n\trequire.Equal(t, map[uint64]struct{}{10: {}, 11: {}}, trie.Get([]byte(\"A\")))\n}\n\nfunc TestTrieDelete(t *testing.T) {\n\ttrie := NewTrie()\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\trequire.Equal(t, 1, numNodes(trie.root))\n\n\ttrie.Add([]byte(\"hello\"), 1)\n\ttrie.Add([]byte(\"hello\"), 3)\n\ttrie.Add([]byte(\"hello\"), 4)\n\ttrie.Add(nil, 5)\n\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\n\trequire.NoError(t, trie.Delete([]byte(\"hello\"), 4))\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\n\trequire.Equal(t, map[uint64]struct{}{5: {}, 1: {}, 3: {}}, trie.Get([]byte(\"hello\")))\n\n\trequire.NoError(t, trie.Delete(nil, 5))\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\trequire.Equal(t, map[uint64]struct{}{1: {}, 3: {}}, trie.Get([]byte(\"hello\")))\n\n\trequire.NoError(t, trie.Delete([]byte(\"hello\"), 1))\n\trequire.NoError(t, trie.Delete([]byte(\"hello\"), 
3))\n\trequire.NoError(t, trie.Delete([]byte(\"hello\"), 4))\n\trequire.NoError(t, trie.Delete([]byte(\"hello\"), 5))\n\trequire.NoError(t, trie.Delete([]byte(\"hello\"), 6))\n\n\trequire.Equal(t, 1, numNodes(trie.root))\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\n\trequire.Equal(t, true, trie.root.isEmpty())\n\trequire.Equal(t, map[uint64]struct{}{}, trie.Get([]byte(\"hello\")))\n}\n\nfunc TestParseIgnoreBytes(t *testing.T) {\n\tout, err := parseIgnoreBytes(\"1\")\n\trequire.NoError(t, err)\n\trequire.Equal(t, []bool{false, true}, out)\n\n\tout, err = parseIgnoreBytes(\"0\")\n\trequire.NoError(t, err)\n\trequire.Equal(t, []bool{true}, out)\n\n\tout, err = parseIgnoreBytes(\"0, 3 - 5, 7\")\n\trequire.NoError(t, err)\n\trequire.Equal(t, []bool{true, false, false, true, true, true, false, true}, out)\n}\n\nfunc TestPrefixMatchWithHoles(t *testing.T) {\n\ttrie := NewTrie()\n\n\tadd := func(prefix, ignore string, id uint64) {\n\t\tm := pb.Match{\n\t\t\tPrefix:      []byte(prefix),\n\t\t\tIgnoreBytes: ignore,\n\t\t}\n\t\trequire.NoError(t, trie.AddMatch(m, id))\n\t}\n\n\tadd(\"\", \"\", 1)\n\tadd(\"aaaa\", \"\", 2)\n\tadd(\"aaaaaa\", \"2-10\", 3)\n\tadd(\"aaaaaaaaa\", \"0, 4 - 6, 8\", 4)\n\n\tget := func(k string) []uint64 {\n\t\tvar ids []uint64\n\t\tm := trie.Get([]byte(k))\n\t\tfor id := range m {\n\t\t\tids = append(ids, id)\n\t\t}\n\t\tsort.Slice(ids, func(i, j int) bool {\n\t\t\treturn ids[i] < ids[j]\n\t\t})\n\t\treturn ids\n\t}\n\n\t// Everything matches 1\n\trequire.Equal(t, []uint64{1}, get(\"\"))\n\trequire.Equal(t, []uint64{1}, get(\"aax\"))\n\n\t// aaaaa would match 2, but not 3 because 3's length is 6.\n\trequire.Equal(t, []uint64{1, 2}, get(\"aaaaa\"))\n\n\t// aa and enough length is sufficient to match 3.\n\trequire.Equal(t, []uint64{1, 3}, get(\"aabbbbbbbb\"))\n\n\t// has differences in the right place to match 4.\n\trequire.Equal(t, []uint64{1, 4}, get(\"baaabbbabba\"))\n\n\t// Even with differences matches everything.\n\trequire.Equal(t, 
[]uint64{1, 2, 3, 4}, get(\"aaaabbbabba\"))\n\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\n\tdel := func(prefix, ignore string, id uint64) {\n\t\tm := pb.Match{\n\t\t\tPrefix:      []byte(prefix),\n\t\t\tIgnoreBytes: ignore,\n\t\t}\n\t\trequire.NoError(t, trie.DeleteMatch(m, id))\n\t}\n\n\tdel(\"aaaaaaaaa\", \"0, 4 - 6, 8\", 5)\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\n\tdel(\"aaaaaaaaa\", \"0, 4 - 6, 8\", 4)\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\n\tdel(\"aaaaaa\", \"2-10\", 3)\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\n\tdel(\"aaaa\", \"\", 2)\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\n\tdel(\"\", \"\", 1)\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\n\tdel(\"abracadabra\", \"\", 4)\n\tt.Logf(\"Num nodes: %d\", numNodes(trie.root))\n\n\trequire.Equal(t, 1, numNodes(trie.root))\n}\n"
  },
  {
    "path": "txn.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"context\"\n\t\"encoding/hex\"\n\t\"errors\"\n\t\"fmt\"\n\t\"math\"\n\t\"sort\"\n\t\"strconv\"\n\t\"sync\"\n\t\"sync/atomic\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\ntype oracle struct {\n\tisManaged       bool // Does not change value, so no locking required.\n\tdetectConflicts bool // Determines if the txns should be checked for conflicts.\n\n\tsync.Mutex // For nextTxnTs and commits.\n\t// writeChLock lock is for ensuring that transactions go to the write\n\t// channel in the same order as their commit timestamps.\n\twriteChLock sync.Mutex\n\tnextTxnTs   uint64\n\n\t// Used to block NewTransaction, so all previous commits are visible to a new read.\n\ttxnMark *y.WaterMark\n\n\t// Either of these is used to determine which versions can be permanently\n\t// discarded during compaction.\n\tdiscardTs uint64       // Used by ManagedDB.\n\treadMark  *y.WaterMark // Used by DB.\n\n\t// committedTxns contains all committed writes (contains fingerprints\n\t// of keys written and their latest commit counter).\n\tcommittedTxns []committedTxn\n\tlastCleanupTs uint64\n\n\t// closer is used to stop watermarks.\n\tcloser *z.Closer\n}\n\ntype committedTxn struct {\n\tts uint64\n\t// ConflictKeys Keeps track of the entries written at timestamp ts.\n\tconflictKeys map[uint64]struct{}\n}\n\nfunc newOracle(opt Options) *oracle {\n\torc := &oracle{\n\t\tisManaged:       opt.managedTxns,\n\t\tdetectConflicts: opt.DetectConflicts,\n\t\t// We're not initializing nextTxnTs and readOnlyTs. 
It would be done after replay in Open.\n\t\t//\n\t\t// WaterMarks must be 64-bit aligned for atomic package, hence we must use pointers here.\n\t\t// See https://golang.org/pkg/sync/atomic/#pkg-note-BUG.\n\t\treadMark: &y.WaterMark{Name: \"badger.PendingReads\"},\n\t\ttxnMark:  &y.WaterMark{Name: \"badger.TxnTimestamp\"},\n\t\tcloser:   z.NewCloser(2),\n\t}\n\torc.readMark.Init(orc.closer)\n\torc.txnMark.Init(orc.closer)\n\treturn orc\n}\n\nfunc (o *oracle) Stop() {\n\to.closer.SignalAndWait()\n}\n\nfunc (o *oracle) readTs() uint64 {\n\tif o.isManaged {\n\t\tpanic(\"ReadTs should not be retrieved for managed DB\")\n\t}\n\n\tvar readTs uint64\n\to.Lock()\n\treadTs = o.nextTxnTs - 1\n\to.readMark.Begin(readTs)\n\to.Unlock()\n\n\t// Wait for all txns which have no conflicts, have been assigned a commit\n\t// timestamp and are going through the write to value log and LSM tree\n\t// process. Not waiting here could mean that some txns which have been\n\t// committed would not be read.\n\ty.Check(o.txnMark.WaitForMark(context.Background(), readTs))\n\treturn readTs\n}\n\nfunc (o *oracle) nextTs() uint64 {\n\to.Lock()\n\tdefer o.Unlock()\n\treturn o.nextTxnTs\n}\n\nfunc (o *oracle) incrementNextTs() {\n\to.Lock()\n\tdefer o.Unlock()\n\to.nextTxnTs++\n}\n\n// Any deleted or invalid versions at or below ts would be discarded during\n// compaction to reclaim disk space in LSM tree and thence value log.\nfunc (o *oracle) setDiscardTs(ts uint64) {\n\to.Lock()\n\tdefer o.Unlock()\n\to.discardTs = ts\n\to.cleanupCommittedTransactions()\n}\n\nfunc (o *oracle) discardAtOrBelow() uint64 {\n\tif o.isManaged {\n\t\to.Lock()\n\t\tdefer o.Unlock()\n\t\treturn o.discardTs\n\t}\n\treturn o.readMark.DoneUntil()\n}\n\n// hasConflict must be called while having a lock.\nfunc (o *oracle) hasConflict(txn *Txn) bool {\n\tif len(txn.reads) == 0 {\n\t\treturn false\n\t}\n\tfor _, committedTxn := range o.committedTxns {\n\t\t// If the committedTxn.ts is less than txn.readTs that implies that 
the\n\t\t// committedTxn finished before the current transaction started.\n\t\t// We don't need to check for conflict in that case.\n\t\t// This change assumes linearizability. Lack of linearizability could\n\t\t// cause the read ts of a new txn to be lower than the commit ts of\n\t\t// a txn before it (@mrjn).\n\t\tif committedTxn.ts <= txn.readTs {\n\t\t\tcontinue\n\t\t}\n\n\t\tfor _, ro := range txn.reads {\n\t\t\tif _, has := committedTxn.conflictKeys[ro]; has {\n\t\t\t\treturn true\n\t\t\t}\n\t\t}\n\t}\n\n\treturn false\n}\n\nfunc (o *oracle) newCommitTs(txn *Txn) (uint64, bool) {\n\to.Lock()\n\tdefer o.Unlock()\n\n\tif o.hasConflict(txn) {\n\t\treturn 0, true\n\t}\n\n\tvar ts uint64\n\tif !o.isManaged {\n\t\to.doneRead(txn)\n\t\to.cleanupCommittedTransactions()\n\n\t\t// This is the general case, when user doesn't specify the read and commit ts.\n\t\tts = o.nextTxnTs\n\t\to.nextTxnTs++\n\t\to.txnMark.Begin(ts)\n\n\t} else {\n\t\t// If commitTs is set, use it instead.\n\t\tts = txn.commitTs\n\t}\n\n\ty.AssertTrue(ts >= o.lastCleanupTs)\n\n\tif o.detectConflicts {\n\t\t// We should ensure that txns are not added to o.committedTxns slice when\n\t\t// conflict detection is disabled otherwise this slice would keep growing.\n\t\to.committedTxns = append(o.committedTxns, committedTxn{\n\t\t\tts:           ts,\n\t\t\tconflictKeys: txn.conflictKeys,\n\t\t})\n\t}\n\n\treturn ts, false\n}\n\nfunc (o *oracle) doneRead(txn *Txn) {\n\tif !txn.doneRead {\n\t\ttxn.doneRead = true\n\t\to.readMark.Done(txn.readTs)\n\t}\n}\n\nfunc (o *oracle) cleanupCommittedTransactions() { // Must be called under o.Lock\n\tif !o.detectConflicts {\n\t\t// When detectConflicts is set to false, we do not store any\n\t\t// committedTxns and so there's nothing to clean up.\n\t\treturn\n\t}\n\t// Same logic as discardAtOrBelow but unlocked\n\tvar maxReadTs uint64\n\tif o.isManaged {\n\t\tmaxReadTs = o.discardTs\n\t} else {\n\t\tmaxReadTs = o.readMark.DoneUntil()\n\t}\n\n\ty.AssertTrue(maxReadTs >= 
o.lastCleanupTs)\n\n\t// do not run clean up if the maxReadTs (read timestamp of the\n\t// oldest transaction that is still in flight) has not increased\n\tif maxReadTs == o.lastCleanupTs {\n\t\treturn\n\t}\n\to.lastCleanupTs = maxReadTs\n\n\ttmp := o.committedTxns[:0]\n\tfor _, txn := range o.committedTxns {\n\t\tif txn.ts <= maxReadTs {\n\t\t\tcontinue\n\t\t}\n\t\ttmp = append(tmp, txn)\n\t}\n\to.committedTxns = tmp\n}\n\nfunc (o *oracle) doneCommit(cts uint64) {\n\tif o.isManaged {\n\t\t// No need to update anything.\n\t\treturn\n\t}\n\to.txnMark.Done(cts)\n}\n\n// Txn represents a Badger transaction.\ntype Txn struct {\n\treadTs   uint64\n\tcommitTs uint64\n\tsize     int64\n\tcount    int64\n\tdb       *DB\n\n\treads []uint64 // contains fingerprints of keys read.\n\t// contains fingerprints of keys written. This is used for conflict detection.\n\tconflictKeys map[uint64]struct{}\n\treadsLock    sync.Mutex // guards the reads slice. See addReadKey.\n\n\tpendingWrites   map[string]*Entry // cache stores any writes done by txn.\n\tduplicateWrites []*Entry          // Used in managed mode to store duplicate entries.\n\n\tnumIterators atomic.Int32\n\tdiscarded    bool\n\tdoneRead     bool\n\tupdate       bool // update is used to conditionally keep track of reads.\n}\n\ntype pendingWritesIterator struct {\n\tentries  []*Entry\n\tnextIdx  int\n\treadTs   uint64\n\treversed bool\n}\n\nfunc (pi *pendingWritesIterator) Next() {\n\tpi.nextIdx++\n}\n\nfunc (pi *pendingWritesIterator) Rewind() {\n\tpi.nextIdx = 0\n}\n\nfunc (pi *pendingWritesIterator) Seek(key []byte) {\n\tkey = y.ParseKey(key)\n\tpi.nextIdx = sort.Search(len(pi.entries), func(idx int) bool {\n\t\tcmp := bytes.Compare(pi.entries[idx].Key, key)\n\t\tif !pi.reversed {\n\t\t\treturn cmp >= 0\n\t\t}\n\t\treturn cmp <= 0\n\t})\n}\n\nfunc (pi *pendingWritesIterator) Key() []byte {\n\ty.AssertTrue(pi.Valid())\n\tentry := pi.entries[pi.nextIdx]\n\treturn y.KeyWithTs(entry.Key, pi.readTs)\n}\n\nfunc (pi 
*pendingWritesIterator) Value() y.ValueStruct {\n\ty.AssertTrue(pi.Valid())\n\tentry := pi.entries[pi.nextIdx]\n\treturn y.ValueStruct{\n\t\tValue:     entry.Value,\n\t\tMeta:      entry.meta,\n\t\tUserMeta:  entry.UserMeta,\n\t\tExpiresAt: entry.ExpiresAt,\n\t\tVersion:   pi.readTs,\n\t}\n}\n\nfunc (pi *pendingWritesIterator) Valid() bool {\n\treturn pi.nextIdx < len(pi.entries)\n}\n\nfunc (pi *pendingWritesIterator) Close() error {\n\treturn nil\n}\n\nfunc (txn *Txn) newPendingWritesIterator(reversed bool) *pendingWritesIterator {\n\tif !txn.update || len(txn.pendingWrites) == 0 {\n\t\treturn nil\n\t}\n\tentries := make([]*Entry, 0, len(txn.pendingWrites))\n\tfor _, e := range txn.pendingWrites {\n\t\tentries = append(entries, e)\n\t}\n\t// Number of pending writes per transaction shouldn't be too big in general.\n\tsort.Slice(entries, func(i, j int) bool {\n\t\tcmp := bytes.Compare(entries[i].Key, entries[j].Key)\n\t\tif !reversed {\n\t\t\treturn cmp < 0\n\t\t}\n\t\treturn cmp > 0\n\t})\n\treturn &pendingWritesIterator{\n\t\treadTs:   txn.readTs,\n\t\tentries:  entries,\n\t\treversed: reversed,\n\t}\n}\n\nfunc (txn *Txn) checkSize(e *Entry) error {\n\tcount := txn.count + 1\n\t// Extra bytes for the version in key.\n\tsize := txn.size + e.estimateSizeAndSetThreshold(txn.db.valueThreshold()) + 10\n\tif count >= txn.db.opt.maxBatchCount || size >= txn.db.opt.maxBatchSize {\n\t\treturn ErrTxnTooBig\n\t}\n\ttxn.count, txn.size = count, size\n\treturn nil\n}\n\nfunc exceedsSize(prefix string, max int64, key []byte) error {\n\treturn fmt.Errorf(\"%s with size %d exceeded %d limit. 
%s:\\n%s\",\n\t\tprefix, len(key), max, prefix, hex.Dump(key[:min(len(key), 1<<10)]))\n}\n\nfunc (txn *Txn) modify(e *Entry) error {\n\tconst maxKeySize = 65000\n\n\tswitch {\n\tcase !txn.update:\n\t\treturn ErrReadOnlyTxn\n\tcase txn.discarded:\n\t\treturn ErrDiscardedTxn\n\tcase len(e.Key) == 0:\n\t\treturn ErrEmptyKey\n\tcase bytes.HasPrefix(e.Key, badgerPrefix):\n\t\treturn ErrInvalidKey\n\tcase len(e.Key) > maxKeySize:\n\t\t// Key length must fit in a uint16, as determined by table::header. To\n\t\t// keep things safe and leave room for the badger move prefix and a\n\t\t// timestamp suffix, cap it at 65000 instead of 65536.\n\t\treturn exceedsSize(\"Key\", maxKeySize, e.Key)\n\tcase int64(len(e.Value)) > txn.db.opt.ValueLogFileSize:\n\t\treturn exceedsSize(\"Value\", txn.db.opt.ValueLogFileSize, e.Value)\n\tcase txn.db.opt.InMemory && int64(len(e.Value)) > txn.db.valueThreshold():\n\t\treturn exceedsSize(\"Value\", txn.db.valueThreshold(), e.Value)\n\t}\n\n\tif err := txn.db.isBanned(e.Key); err != nil {\n\t\treturn err\n\t}\n\n\tif err := txn.checkSize(e); err != nil {\n\t\treturn err\n\t}\n\n\t// The txn.conflictKeys is used for conflict detection. If conflict detection\n\t// is disabled, we don't need to store key hashes in this map.\n\tif txn.db.opt.DetectConflicts {\n\t\tfp := z.MemHash(e.Key) // Avoid dealing with byte arrays.\n\t\ttxn.conflictKeys[fp] = struct{}{}\n\t}\n\t// If a duplicate entry was inserted in managed mode, move it to the duplicate writes slice.\n\t// Add the entry to duplicateWrites only if both the entries have different versions. 
For\n\t// same versions, we will overwrite the existing entry.\n\tif oldEntry, ok := txn.pendingWrites[string(e.Key)]; ok && oldEntry.version != e.version {\n\t\ttxn.duplicateWrites = append(txn.duplicateWrites, oldEntry)\n\t}\n\ttxn.pendingWrites[string(e.Key)] = e\n\treturn nil\n}\n\n// Set adds a key-value pair to the database.\n// It will return ErrReadOnlyTxn if update flag was set to false when creating the transaction.\n//\n// The current transaction keeps a reference to the key and val byte slice\n// arguments. Users must not modify key and val until the end of the transaction.\nfunc (txn *Txn) Set(key, val []byte) error {\n\treturn txn.SetEntry(NewEntry(key, val))\n}\n\n// SetEntry takes an Entry struct and adds the key-value pair in the struct,\n// along with other metadata to the database.\n//\n// The current transaction keeps a reference to the entry passed in argument.\n// Users must not modify the entry until the end of the transaction.\nfunc (txn *Txn) SetEntry(e *Entry) error {\n\treturn txn.modify(e)\n}\n\n// Delete deletes a key.\n//\n// This is done by adding a delete marker for the key at commit timestamp.  Any\n// reads happening before this timestamp would be unaffected. 
Any reads after\n// this commit would see the deletion.\n//\n// The current transaction keeps a reference to the key byte slice argument.\n// Users must not modify the key until the end of the transaction.\nfunc (txn *Txn) Delete(key []byte) error {\n\te := &Entry{\n\t\tKey:  key,\n\t\tmeta: bitDelete,\n\t}\n\treturn txn.modify(e)\n}\n\n// Get looks for key and returns corresponding Item.\n// If key is not found, ErrKeyNotFound is returned.\nfunc (txn *Txn) Get(key []byte) (item *Item, rerr error) {\n\tif len(key) == 0 {\n\t\treturn nil, ErrEmptyKey\n\t} else if txn.discarded {\n\t\treturn nil, ErrDiscardedTxn\n\t}\n\n\tif err := txn.db.isBanned(key); err != nil {\n\t\treturn nil, err\n\t}\n\n\titem = new(Item)\n\tif txn.update {\n\t\tif e, has := txn.pendingWrites[string(key)]; has && bytes.Equal(key, e.Key) {\n\t\t\tif isDeletedOrExpired(e.meta, e.ExpiresAt) {\n\t\t\t\treturn nil, ErrKeyNotFound\n\t\t\t}\n\t\t\t// Fulfill from cache.\n\t\t\titem.meta = e.meta\n\t\t\titem.val = e.Value\n\t\t\titem.userMeta = e.UserMeta\n\t\t\titem.key = key\n\t\t\titem.status = prefetched\n\t\t\titem.version = txn.readTs\n\t\t\titem.expiresAt = e.ExpiresAt\n\t\t\t// We probably don't need to set db on item here.\n\t\t\treturn item, nil\n\t\t}\n\t\t// Only track reads if this is update txn. 
No need to track read if txn serviced it\n\t\t// internally.\n\t\ttxn.addReadKey(key)\n\t}\n\n\tseek := y.KeyWithTs(key, txn.readTs)\n\tvs, err := txn.db.get(seek)\n\tif err != nil {\n\t\treturn nil, y.Wrapf(err, \"DB::Get key: %q\", key)\n\t}\n\tif vs.Value == nil && vs.Meta == 0 {\n\t\treturn nil, ErrKeyNotFound\n\t}\n\tif isDeletedOrExpired(vs.Meta, vs.ExpiresAt) {\n\t\treturn nil, ErrKeyNotFound\n\t}\n\n\titem.key = key\n\titem.version = vs.Version\n\titem.meta = vs.Meta\n\titem.userMeta = vs.UserMeta\n\titem.vptr = y.SafeCopy(item.vptr, vs.Value)\n\titem.txn = txn\n\titem.expiresAt = vs.ExpiresAt\n\treturn item, nil\n}\n\nfunc (txn *Txn) addReadKey(key []byte) {\n\tif txn.update {\n\t\tfp := z.MemHash(key)\n\n\t\t// Because of the possibility of multiple iterators it is now possible\n\t\t// for multiple threads within a read-write transaction to read keys at\n\t\t// the same time. The reads slice is not currently thread-safe and\n\t\t// needs to be locked whenever we mark a key as read.\n\t\ttxn.readsLock.Lock()\n\t\ttxn.reads = append(txn.reads, fp)\n\t\ttxn.readsLock.Unlock()\n\t}\n}\n\n// Discard discards a created transaction. This method is very important and must be called. Commit\n// method calls this internally, however, calling this multiple times doesn't cause any issues. So,\n// this can safely be called via a defer right when transaction is created.\n//\n// NOTE: If any operations are run on a discarded transaction, ErrDiscardedTxn is returned.\nfunc (txn *Txn) Discard() {\n\tif txn.discarded { // Avoid a re-run.\n\t\treturn\n\t}\n\tif txn.numIterators.Load() > 0 {\n\t\tpanic(\"Unclosed iterator at time of Txn.Discard.\")\n\t}\n\ttxn.discarded = true\n\tif !txn.db.orc.isManaged {\n\t\ttxn.db.orc.doneRead(txn)\n\t}\n}\n\nfunc (txn *Txn) commitAndSend() (func() error, error) {\n\torc := txn.db.orc\n\t// Ensure that the order in which we get the commit timestamp is the same as\n\t// the order in which we push these updates to the write channel. 
So, we\n\t// acquire a writeChLock before getting a commit timestamp, and only release\n\t// it after pushing the entries to it.\n\torc.writeChLock.Lock()\n\tdefer orc.writeChLock.Unlock()\n\n\tcommitTs, conflict := orc.newCommitTs(txn)\n\tif conflict {\n\t\treturn nil, ErrConflict\n\t}\n\n\tkeepTogether := true\n\tsetVersion := func(e *Entry) {\n\t\tif e.version == 0 {\n\t\t\te.version = commitTs\n\t\t} else {\n\t\t\tkeepTogether = false\n\t\t}\n\t}\n\tfor _, e := range txn.pendingWrites {\n\t\tsetVersion(e)\n\t}\n\t// The duplicateWrites slice will be non-empty only if there are duplicate\n\t// entries with different versions.\n\tfor _, e := range txn.duplicateWrites {\n\t\tsetVersion(e)\n\t}\n\n\tentries := make([]*Entry, 0, len(txn.pendingWrites)+len(txn.duplicateWrites)+1)\n\n\tprocessEntry := func(e *Entry) {\n\t\t// Suffix the keys with commit ts, so the key versions are sorted in\n\t\t// descending order of commit timestamp.\n\t\te.Key = y.KeyWithTs(e.Key, e.version)\n\t\t// Add bitTxn only if these entries are part of a transaction. We\n\t\t// support SetEntryAt(..) in managed mode which means a single\n\t\t// transaction can have entries with different timestamps. If entries\n\t\t// in a single transaction have different timestamps, we don't add the\n\t\t// transaction markers.\n\t\tif keepTogether {\n\t\t\te.meta |= bitTxn\n\t\t}\n\t\tentries = append(entries, e)\n\t}\n\n\t// The following debug information is what led to determining the cause of\n\t// bank txn violation bug, and it took a whole bunch of effort to narrow it\n\t// down to here. So, keep this around for at least a couple of months.\n\t// var b strings.Builder\n\t// fmt.Fprintf(&b, \"Read: %d. Commit: %d. reads: %v. writes: %v. 
Keys: \",\n\t// \ttxn.readTs, commitTs, txn.reads, txn.conflictKeys)\n\tfor _, e := range txn.pendingWrites {\n\t\tprocessEntry(e)\n\t}\n\tfor _, e := range txn.duplicateWrites {\n\t\tprocessEntry(e)\n\t}\n\n\tif keepTogether {\n\t\t// CommitTs should not be zero if we're inserting transaction markers.\n\t\ty.AssertTrue(commitTs != 0)\n\t\te := &Entry{\n\t\t\tKey:   y.KeyWithTs(txnKey, commitTs),\n\t\t\tValue: []byte(strconv.FormatUint(commitTs, 10)),\n\t\t\tmeta:  bitFinTxn,\n\t\t}\n\t\tentries = append(entries, e)\n\t}\n\n\treq, err := txn.db.sendToWriteCh(entries)\n\tif err != nil {\n\t\torc.doneCommit(commitTs)\n\t\treturn nil, err\n\t}\n\tret := func() error {\n\t\terr := req.Wait()\n\t\t// Wait before marking commitTs as done.\n\t\t// We can't defer doneCommit above, because it is being called from a\n\t\t// callback here.\n\t\torc.doneCommit(commitTs)\n\t\treturn err\n\t}\n\treturn ret, nil\n}\n\nfunc (txn *Txn) commitPrecheck() error {\n\tif txn.discarded {\n\t\treturn errors.New(\"Trying to commit a discarded txn\")\n\t}\n\tkeepTogether := true\n\tfor _, e := range txn.pendingWrites {\n\t\tif e.version != 0 {\n\t\t\tkeepTogether = false\n\t\t}\n\t}\n\n\t// If keepTogether is true, it implies transaction markers will be added.\n\t// In that case, commitTs should never be zero. This might happen if\n\t// someone uses txn.Commit instead of txn.CommitAt in managed mode.  This\n\t// should happen only in managed mode. In normal mode, keepTogether will\n\t// always be true.\n\tif keepTogether && txn.db.opt.managedTxns && txn.commitTs == 0 {\n\t\treturn errors.New(\"CommitTs cannot be zero. Please use commitAt instead\")\n\t}\n\treturn nil\n}\n\n// Commit commits the transaction, following these steps:\n//\n// 1. If there are no writes, return immediately.\n//\n// 2. Check if read rows were updated since txn started. If so, return ErrConflict.\n//\n// 3. If no conflict, generate a commit timestamp and update written rows' commit ts.\n//\n// 4. 
Batch up all writes, write them to value log and LSM tree.\n//\n// 5. If callback is provided, Badger will return immediately after checking\n// for conflicts. Writes to the database will happen in the background.  If\n// there is a conflict, an error will be returned and the callback will not\n// run. If there are no conflicts, the callback will be called in the\n// background upon successful completion of writes or any error during write.\n//\n// If error is nil, the transaction is successfully committed. In case of a non-nil error, the LSM\n// tree won't be updated, so there's no need for any rollback.\nfunc (txn *Txn) Commit() error {\n\t// txn.conflictKeys can be zero if conflict detection is turned off. So we\n\t// should check txn.pendingWrites.\n\tif len(txn.pendingWrites) == 0 {\n\t\t// Discard the transaction so that the read is marked done.\n\t\ttxn.Discard()\n\t\treturn nil\n\t}\n\t// Precheck before discarding txn.\n\tif err := txn.commitPrecheck(); err != nil {\n\t\treturn err\n\t}\n\tdefer txn.Discard()\n\n\ttxnCb, err := txn.commitAndSend()\n\tif err != nil {\n\t\treturn err\n\t}\n\t// If batchSet failed, LSM would not have been updated. So, no need to rollback anything.\n\n\t// TODO: What if some of the txns successfully make it to value log, but others fail.\n\t// Nothing gets updated to LSM, until a restart happens.\n\treturn txnCb()\n}\n\ntype txnCb struct {\n\tcommit func() error\n\tuser   func(error)\n\terr    error\n}\n\nfunc runTxnCallback(cb *txnCb) {\n\tswitch {\n\tcase cb == nil:\n\t\tpanic(\"txn callback is nil\")\n\tcase cb.user == nil:\n\t\tpanic(\"Must have caught a nil callback for txn.CommitWith\")\n\tcase cb.err != nil:\n\t\tcb.user(cb.err)\n\tcase cb.commit != nil:\n\t\terr := cb.commit()\n\t\tcb.user(err)\n\tdefault:\n\t\tcb.user(nil)\n\t}\n}\n\n// CommitWith acts like Commit, but takes a callback, which gets run via a\n// goroutine to avoid blocking this function. 
The callback is guaranteed to run,\n// so it is safe to increment a sync.WaitGroup before calling CommitWith and\n// decrement it in the callback, to block until all callbacks have run.\nfunc (txn *Txn) CommitWith(cb func(error)) {\n\tif cb == nil {\n\t\tpanic(\"Nil callback provided to CommitWith\")\n\t}\n\n\tif len(txn.pendingWrites) == 0 {\n\t\t// Do not run these callbacks from here, because CommitWith and the\n\t\t// callback might be acquiring the same locks. Instead run the callback\n\t\t// from another goroutine.\n\t\tgo runTxnCallback(&txnCb{user: cb, err: nil})\n\t\t// Discard the transaction so that the read is marked done.\n\t\ttxn.Discard()\n\t\treturn\n\t}\n\n\t// Precheck before discarding txn.\n\tif err := txn.commitPrecheck(); err != nil {\n\t\tcb(err)\n\t\treturn\n\t}\n\n\tdefer txn.Discard()\n\n\tcommitCb, err := txn.commitAndSend()\n\tif err != nil {\n\t\tgo runTxnCallback(&txnCb{user: cb, err: err})\n\t\treturn\n\t}\n\n\tgo runTxnCallback(&txnCb{user: cb, commit: commitCb})\n}\n\n// ReadTs returns the read timestamp of the transaction.\nfunc (txn *Txn) ReadTs() uint64 {\n\treturn txn.readTs\n}\n\n// NewTransaction creates a new transaction. Badger supports concurrent execution of transactions,\n// providing serializable snapshot isolation, avoiding write skews. Badger achieves this by tracking\n// the keys read and at Commit time, ensuring that these read keys weren't concurrently modified by\n// another transaction.\n//\n// For read-only transactions, set update to false. In this mode, we don't track the rows read for\n// any changes. Thus, any long running iterations done in this mode wouldn't pay this overhead.\n//\n// Running transactions concurrently is OK. However, a transaction itself isn't thread safe, and\n// should only be run serially. 
It doesn't matter if a transaction is created by one goroutine and\n// passed down to other, as long as the Txn APIs are called serially.\n//\n// When you create a new transaction, it is absolutely essential to call\n// Discard(). This should be done irrespective of what the update param is set\n// to. Commit API internally runs Discard, but running it twice wouldn't cause\n// any issues.\n//\n//\ttxn := db.NewTransaction(false)\n//\tdefer txn.Discard()\n//\t// Call various APIs.\nfunc (db *DB) NewTransaction(update bool) *Txn {\n\treturn db.newTransaction(update, false)\n}\n\nfunc (db *DB) newTransaction(update, isManaged bool) *Txn {\n\tif db.opt.ReadOnly && update {\n\t\t// DB is read-only, force read-only transaction.\n\t\tupdate = false\n\t}\n\n\ttxn := &Txn{\n\t\tupdate: update,\n\t\tdb:     db,\n\t\tcount:  1,                       // One extra entry for BitFin.\n\t\tsize:   int64(len(txnKey) + 10), // Some buffer for the extra entry.\n\t}\n\tif update {\n\t\tif db.opt.DetectConflicts {\n\t\t\ttxn.conflictKeys = make(map[uint64]struct{})\n\t\t}\n\t\ttxn.pendingWrites = make(map[string]*Entry)\n\t}\n\tif !isManaged {\n\t\ttxn.readTs = db.orc.readTs()\n\t}\n\treturn txn\n}\n\n// View executes a function creating and managing a read-only transaction for the user. Error\n// returned by the function is relayed by the View method.\n// If View is used with managed transactions, it would assume a read timestamp of MaxUint64.\nfunc (db *DB) View(fn func(txn *Txn) error) error {\n\tif db.IsClosed() {\n\t\treturn ErrDBClosed\n\t}\n\tvar txn *Txn\n\tif db.opt.managedTxns {\n\t\ttxn = db.NewTransactionAt(math.MaxUint64, false)\n\t} else {\n\t\ttxn = db.NewTransaction(false)\n\t}\n\tdefer txn.Discard()\n\n\treturn fn(txn)\n}\n\n// Update executes a function, creating and managing a read-write transaction\n// for the user. 
Error returned by the function is relayed by the Update method.\n// Update cannot be used with managed transactions.\nfunc (db *DB) Update(fn func(txn *Txn) error) error {\n\tif db.IsClosed() {\n\t\treturn ErrDBClosed\n\t}\n\tif db.opt.managedTxns {\n\t\tpanic(\"Update can only be used with managedDB=false.\")\n\t}\n\ttxn := db.NewTransaction(true)\n\tdefer txn.Discard()\n\n\tif err := fn(txn); err != nil {\n\t\treturn err\n\t}\n\n\treturn txn.Commit()\n}\n"
  },
  {
    "path": "txn_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"fmt\"\n\t\"math/rand\"\n\t\"os\"\n\t\"strconv\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nfunc TestTxnSimple(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\ttxn := db.NewTransaction(true)\n\n\t\tfor i := 0; i < 10; i++ {\n\t\t\tk := []byte(fmt.Sprintf(\"key=%d\", i))\n\t\t\tv := []byte(fmt.Sprintf(\"val=%d\", i))\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(k, v)))\n\t\t}\n\n\t\titem, err := txn.Get([]byte(\"key=8\"))\n\t\trequire.NoError(t, err)\n\n\t\trequire.NoError(t, item.Value(func(val []byte) error {\n\t\t\trequire.Equal(t, []byte(\"val=8\"), val)\n\t\t\treturn nil\n\t\t}))\n\n\t\trequire.Panics(t, func() { _ = txn.CommitAt(100, nil) })\n\t\trequire.NoError(t, txn.Commit())\n\t})\n}\n\nfunc TestTxnReadAfterWrite(t *testing.T) {\n\ttest := func(t *testing.T, db *DB) {\n\t\tvar wg sync.WaitGroup\n\t\tN := 100\n\t\twg.Add(N)\n\t\tfor i := 0; i < N; i++ {\n\t\t\tgo func(i int) {\n\t\t\t\tdefer wg.Done()\n\t\t\t\tkey := []byte(fmt.Sprintf(\"key%d\", i))\n\t\t\t\terr := db.Update(func(tx *Txn) error {\n\t\t\t\t\treturn tx.SetEntry(NewEntry(key, key))\n\t\t\t\t})\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\terr = db.View(func(tx *Txn) error {\n\t\t\t\t\titem, err := tx.Get(key)\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\tval, err := item.ValueCopy(nil)\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\trequire.Equal(t, val, key)\n\t\t\t\t\treturn nil\n\t\t\t\t})\n\t\t\t\trequire.NoError(t, err)\n\t\t\t}(i)\n\t\t}\n\t\twg.Wait()\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\ttest(t, db)\n\t\t})\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\topt := 
getTestOptions(\"\")\n\t\topt.InMemory = true\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\ttest(t, db)\n\t\trequire.NoError(t, db.Close())\n\t})\n}\n\nfunc TestTxnCommitAsync(t *testing.T) {\n\tkey := func(i int) []byte {\n\t\treturn []byte(fmt.Sprintf(\"key=%d\", i))\n\t}\n\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\ttxn := db.NewTransaction(true)\n\t\tfor i := 0; i < 40; i++ {\n\t\t\terr := txn.SetEntry(NewEntry(key(i), []byte(strconv.Itoa(100))))\n\t\t\trequire.NoError(t, err)\n\t\t}\n\t\trequire.NoError(t, txn.Commit())\n\t\ttxn.Discard()\n\n\t\tcloser := z.NewCloser(1)\n\t\tgo func() {\n\t\t\tdefer closer.Done()\n\t\t\tfor {\n\t\t\t\tselect {\n\t\t\t\tcase <-closer.HasBeenClosed():\n\t\t\t\t\treturn\n\t\t\t\tdefault:\n\t\t\t\t}\n\t\t\t\t// Keep checking balance variant\n\t\t\t\ttxn := db.NewTransaction(false)\n\t\t\t\ttotalBalance := 0\n\t\t\t\tfor i := 0; i < 40; i++ {\n\t\t\t\t\titem, err := txn.Get(key(i))\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\tval, err := item.ValueCopy(nil)\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\tbal, err := strconv.Atoi(string(val))\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\ttotalBalance += bal\n\t\t\t\t}\n\t\t\t\trequire.Equal(t, totalBalance, 4000)\n\t\t\t\ttxn.Discard()\n\t\t\t}\n\t\t}()\n\n\t\tvar wg sync.WaitGroup\n\t\twg.Add(100)\n\t\tfor i := 0; i < 100; i++ {\n\t\t\tgo func() {\n\t\t\t\ttxn := db.NewTransaction(true)\n\t\t\t\tdelta := rand.Intn(100)\n\t\t\t\tfor i := 0; i < 20; i++ {\n\t\t\t\t\terr := txn.SetEntry(NewEntry(key(i), []byte(strconv.Itoa(100-delta))))\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t}\n\t\t\t\tfor i := 20; i < 40; i++ {\n\t\t\t\t\terr := txn.SetEntry(NewEntry(key(i), []byte(strconv.Itoa(100+delta))))\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t}\n\t\t\t\t// We are only doing writes, so there won't be any conflicts.\n\t\t\t\ttxn.CommitWith(func(err error) 
{})\n\t\t\t\ttxn.Discard()\n\t\t\t\twg.Done()\n\t\t\t}()\n\t\t}\n\t\twg.Wait()\n\t\tcloser.SignalAndWait()\n\t\ttime.Sleep(time.Millisecond * 10) // allow goroutine to complete.\n\t})\n}\n\nfunc TestTxnVersions(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tk := []byte(\"key\")\n\t\tfor i := 1; i < 10; i++ {\n\t\t\ttxn := db.NewTransaction(true)\n\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(k, []byte(fmt.Sprintf(\"valversion=%d\", i)))))\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t\trequire.Equal(t, uint64(i), db.orc.readTs())\n\t\t}\n\n\t\tcheckIterator := func(itr *Iterator, i int) {\n\t\t\tdefer itr.Close()\n\t\t\tcount := 0\n\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\titem := itr.Item()\n\t\t\t\trequire.Equal(t, k, item.Key())\n\n\t\t\t\tval, err := item.ValueCopy(nil)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\texp := fmt.Sprintf(\"valversion=%d\", i)\n\t\t\t\trequire.Equal(t, exp, string(val), \"i=%d\", i)\n\t\t\t\tcount++\n\t\t\t}\n\t\t\trequire.Equal(t, 1, count, \"i=%d\", i) // Should only loop once.\n\t\t}\n\n\t\tcheckAllVersions := func(itr *Iterator, i int) {\n\t\t\tvar version uint64\n\t\t\tif itr.opt.Reverse {\n\t\t\t\tversion = 1\n\t\t\t} else {\n\t\t\t\tversion = uint64(i)\n\t\t\t}\n\n\t\t\tcount := 0\n\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\titem := itr.Item()\n\t\t\t\trequire.Equal(t, k, item.Key())\n\t\t\t\trequire.Equal(t, version, item.Version())\n\n\t\t\t\tval, err := item.ValueCopy(nil)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\texp := fmt.Sprintf(\"valversion=%d\", version)\n\t\t\t\trequire.Equal(t, exp, string(val), \"v=%d\", version)\n\t\t\t\tcount++\n\n\t\t\t\tif itr.opt.Reverse {\n\t\t\t\t\tversion++\n\t\t\t\t} else {\n\t\t\t\t\tversion--\n\t\t\t\t}\n\t\t\t}\n\t\t\trequire.Equal(t, i, count, \"i=%d\", i) // Should loop as many times as i.\n\t\t}\n\n\t\tfor i := 1; i < 10; i++ {\n\t\t\ttxn := db.NewTransaction(true)\n\t\t\ttxn.readTs = uint64(i) // Read version at 
i.\n\n\t\t\titem, err := txn.Get(k)\n\t\t\trequire.NoError(t, err)\n\n\t\t\tval, err := item.ValueCopy(nil)\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, []byte(fmt.Sprintf(\"valversion=%d\", i)), val,\n\t\t\t\t\"Expected versions to match up at i=%d\", i)\n\n\t\t\t// Try retrieving the latest version forward and reverse.\n\t\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\t\tcheckIterator(itr, i)\n\n\t\t\topt := DefaultIteratorOptions\n\t\t\topt.Reverse = true\n\t\t\titr = txn.NewIterator(opt)\n\t\t\tcheckIterator(itr, i)\n\n\t\t\t// Now try retrieving all versions forward and reverse.\n\t\t\topt = DefaultIteratorOptions\n\t\t\topt.AllVersions = true\n\t\t\titr = txn.NewIterator(opt)\n\t\t\tcheckAllVersions(itr, i)\n\t\t\titr.Close()\n\n\t\t\topt = DefaultIteratorOptions\n\t\t\topt.AllVersions = true\n\t\t\topt.Reverse = true\n\t\t\titr = txn.NewIterator(opt)\n\t\t\tcheckAllVersions(itr, i)\n\t\t\titr.Close()\n\n\t\t\ttxn.Discard()\n\t\t}\n\t\ttxn := db.NewTransaction(true)\n\t\tdefer txn.Discard()\n\t\titem, err := txn.Get(k)\n\t\trequire.NoError(t, err)\n\n\t\tval, err := item.ValueCopy(nil)\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, []byte(\"valversion=9\"), val)\n\t})\n}\n\nfunc TestTxnWriteSkew(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t// Accounts\n\t\tax := []byte(\"x\")\n\t\tay := []byte(\"y\")\n\n\t\t// Set balance to $100 in each account.\n\t\ttxn := db.NewTransaction(true)\n\t\tdefer txn.Discard()\n\t\tval := []byte(strconv.Itoa(100))\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(ax, val)))\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(ay, val)))\n\t\trequire.NoError(t, txn.Commit())\n\t\trequire.Equal(t, uint64(1), db.orc.readTs())\n\n\t\tgetBal := func(txn *Txn, key []byte) (bal int) {\n\t\t\titem, err := txn.Get(key)\n\t\t\trequire.NoError(t, err)\n\n\t\t\tval, err := item.ValueCopy(nil)\n\t\t\trequire.NoError(t, err)\n\t\t\tbal, err = strconv.Atoi(string(val))\n\t\t\trequire.NoError(t, 
err)\n\t\t\treturn bal\n\t\t}\n\n\t\t// Start two transactions, each would read both accounts and deduct from one account.\n\t\ttxn1 := db.NewTransaction(true)\n\n\t\tsum := getBal(txn1, ax)\n\t\tsum += getBal(txn1, ay)\n\t\trequire.Equal(t, 200, sum)\n\t\trequire.NoError(t, txn1.SetEntry(NewEntry(ax, []byte(\"0\")))) // Deduct 100 from ax.\n\n\t\t// Let's read this back.\n\t\tsum = getBal(txn1, ax)\n\t\trequire.Equal(t, 0, sum)\n\t\tsum += getBal(txn1, ay)\n\t\trequire.Equal(t, 100, sum)\n\t\t// Don't commit yet.\n\n\t\ttxn2 := db.NewTransaction(true)\n\n\t\tsum = getBal(txn2, ax)\n\t\tsum += getBal(txn2, ay)\n\t\trequire.Equal(t, 200, sum)\n\t\trequire.NoError(t, txn2.SetEntry(NewEntry(ay, []byte(\"0\")))) // Deduct 100 from ay.\n\n\t\t// Let's read this back.\n\t\tsum = getBal(txn2, ax)\n\t\trequire.Equal(t, 100, sum)\n\t\tsum += getBal(txn2, ay)\n\t\trequire.Equal(t, 100, sum)\n\n\t\t// Commit both now.\n\t\trequire.NoError(t, txn1.Commit())\n\t\trequire.Error(t, txn2.Commit()) // This should fail.\n\n\t\trequire.Equal(t, uint64(2), db.orc.readTs())\n\t})\n}\n\n// a3, a2, b4 (del), b3, c2, c1\n// Read at ts=4 -> a3, c2\n// Read at ts=4(Uncommitted) -> a3, b4\n// Read at ts=3 -> a3, b3, c2\n// Read at ts=2 -> a2, c2\n// Read at ts=1 -> c1\nfunc TestTxnIterationEdgeCase(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tka := []byte(\"a\")\n\t\tkb := []byte(\"b\")\n\t\tkc := []byte(\"c\")\n\n\t\t// c1\n\t\ttxn := db.NewTransaction(true)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(kc, []byte(\"c1\"))))\n\t\trequire.NoError(t, txn.Commit())\n\t\trequire.Equal(t, uint64(1), db.orc.readTs())\n\n\t\t// a2, c2\n\t\ttxn = db.NewTransaction(true)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(ka, []byte(\"a2\"))))\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(kc, []byte(\"c2\"))))\n\t\trequire.NoError(t, txn.Commit())\n\t\trequire.Equal(t, uint64(2), db.orc.readTs())\n\n\t\t// b3\n\t\ttxn = db.NewTransaction(true)\n\t\trequire.NoError(t, 
txn.SetEntry(NewEntry(ka, []byte(\"a3\"))))\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(kb, []byte(\"b3\"))))\n\t\trequire.NoError(t, txn.Commit())\n\t\trequire.Equal(t, uint64(3), db.orc.readTs())\n\n\t\t// b4, c4(del) (Uncommitted)\n\t\ttxn4 := db.NewTransaction(true)\n\t\trequire.NoError(t, txn4.SetEntry(NewEntry(kb, []byte(\"b4\"))))\n\t\trequire.NoError(t, txn4.Delete(kc))\n\t\trequire.Equal(t, uint64(3), db.orc.readTs())\n\n\t\t// b4 (del)\n\t\ttxn = db.NewTransaction(true)\n\t\trequire.NoError(t, txn.Delete(kb))\n\t\trequire.NoError(t, txn.Commit())\n\t\trequire.Equal(t, uint64(4), db.orc.readTs())\n\n\t\tcheckIterator := func(itr *Iterator, expected []string) {\n\t\t\tdefer itr.Close()\n\t\t\tvar i int\n\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\titem := itr.Item()\n\t\t\t\tval, err := item.ValueCopy(nil)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\trequire.Equal(t, expected[i], string(val), \"readts=%d\", itr.readTs)\n\t\t\t\ti++\n\t\t\t}\n\t\t\trequire.Equal(t, len(expected), i)\n\t\t}\n\n\t\ttxn = db.NewTransaction(true)\n\t\tdefer txn.Discard()\n\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\titr5 := txn4.NewIterator(DefaultIteratorOptions)\n\t\tcheckIterator(itr, []string{\"a3\", \"c2\"})\n\t\tcheckIterator(itr5, []string{\"a3\", \"b4\"})\n\n\t\trev := DefaultIteratorOptions\n\t\trev.Reverse = true\n\t\titr = txn.NewIterator(rev)\n\t\titr5 = txn4.NewIterator(rev)\n\t\tcheckIterator(itr, []string{\"c2\", \"a3\"})\n\t\tcheckIterator(itr5, []string{\"b4\", \"a3\"})\n\n\t\ttxn.readTs = 3\n\t\titr = txn.NewIterator(DefaultIteratorOptions)\n\t\tcheckIterator(itr, []string{\"a3\", \"b3\", \"c2\"})\n\t\titr = txn.NewIterator(rev)\n\t\tcheckIterator(itr, []string{\"c2\", \"b3\", \"a3\"})\n\n\t\ttxn.readTs = 2\n\t\titr = txn.NewIterator(DefaultIteratorOptions)\n\t\tcheckIterator(itr, []string{\"a2\", \"c2\"})\n\t\titr = txn.NewIterator(rev)\n\t\tcheckIterator(itr, []string{\"c2\", \"a2\"})\n\n\t\ttxn.readTs = 1\n\t\titr = 
txn.NewIterator(DefaultIteratorOptions)\n\t\tcheckIterator(itr, []string{\"c1\"})\n\t\titr = txn.NewIterator(rev)\n\t\tcheckIterator(itr, []string{\"c1\"})\n\t})\n}\n\n// a2, a3, b4 (del), b3, c2, c1\n// Read at ts=4 -> a3, c2\n// Read at ts=3 -> a3, b3, c2\n// Read at ts=2 -> a2, c2\n// Read at ts=1 -> c1\nfunc TestTxnIterationEdgeCase2(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tka := []byte(\"a\")\n\t\tkb := []byte(\"aa\")\n\t\tkc := []byte(\"aaa\")\n\n\t\t// c1\n\t\ttxn := db.NewTransaction(true)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(kc, []byte(\"c1\"))))\n\t\trequire.NoError(t, txn.Commit())\n\t\trequire.Equal(t, uint64(1), db.orc.readTs())\n\n\t\t// a2, c2\n\t\ttxn = db.NewTransaction(true)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(ka, []byte(\"a2\"))))\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(kc, []byte(\"c2\"))))\n\t\trequire.NoError(t, txn.Commit())\n\t\trequire.Equal(t, uint64(2), db.orc.readTs())\n\n\t\t// b3\n\t\ttxn = db.NewTransaction(true)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(ka, []byte(\"a3\"))))\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(kb, []byte(\"b3\"))))\n\t\trequire.NoError(t, txn.Commit())\n\t\trequire.Equal(t, uint64(3), db.orc.readTs())\n\n\t\t// b4 (del)\n\t\ttxn = db.NewTransaction(true)\n\t\trequire.NoError(t, txn.Delete(kb))\n\t\trequire.NoError(t, txn.Commit())\n\t\trequire.Equal(t, uint64(4), db.orc.readTs())\n\n\t\tcheckIterator := func(itr *Iterator, expected []string) {\n\t\t\tdefer itr.Close()\n\t\t\tvar i int\n\t\t\tfor itr.Rewind(); itr.Valid(); itr.Next() {\n\t\t\t\titem := itr.Item()\n\t\t\t\tval, err := item.ValueCopy(nil)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\trequire.Equal(t, expected[i], string(val), \"readts=%d\", itr.readTs)\n\t\t\t\ti++\n\t\t\t}\n\t\t\trequire.Equal(t, len(expected), i)\n\t\t}\n\t\ttxn = db.NewTransaction(true)\n\t\tdefer txn.Discard()\n\t\trev := DefaultIteratorOptions\n\t\trev.Reverse = true\n\n\t\titr := 
txn.NewIterator(DefaultIteratorOptions)\n\t\tcheckIterator(itr, []string{\"a3\", \"c2\"})\n\t\titr = txn.NewIterator(rev)\n\t\tcheckIterator(itr, []string{\"c2\", \"a3\"})\n\n\t\ttxn.readTs = 5\n\t\titr = txn.NewIterator(DefaultIteratorOptions)\n\t\titr.Seek(ka)\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), ka)\n\t\titr.Seek(kc)\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kc)\n\t\titr.Close()\n\n\t\titr = txn.NewIterator(rev)\n\t\titr.Seek(ka)\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), ka)\n\t\titr.Seek(kc)\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kc)\n\t\titr.Close()\n\n\t\ttxn.readTs = 3\n\t\titr = txn.NewIterator(DefaultIteratorOptions)\n\t\tcheckIterator(itr, []string{\"a3\", \"b3\", \"c2\"})\n\t\titr = txn.NewIterator(rev)\n\t\tcheckIterator(itr, []string{\"c2\", \"b3\", \"a3\"})\n\n\t\ttxn.readTs = 2\n\t\titr = txn.NewIterator(DefaultIteratorOptions)\n\t\tcheckIterator(itr, []string{\"a2\", \"c2\"})\n\t\titr = txn.NewIterator(rev)\n\t\tcheckIterator(itr, []string{\"c2\", \"a2\"})\n\n\t\ttxn.readTs = 1\n\t\titr = txn.NewIterator(DefaultIteratorOptions)\n\t\tcheckIterator(itr, []string{\"c1\"})\n\t\titr = txn.NewIterator(rev)\n\t\tcheckIterator(itr, []string{\"c1\"})\n\t})\n}\n\nfunc TestTxnIterationEdgeCase3(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\tkb := []byte(\"abc\")\n\t\tkc := []byte(\"acd\")\n\t\tkd := []byte(\"ade\")\n\n\t\t// c1\n\t\ttxn := db.NewTransaction(true)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(kc, []byte(\"c1\"))))\n\t\trequire.NoError(t, txn.Commit())\n\t\trequire.Equal(t, uint64(1), db.orc.readTs())\n\n\t\t// b2\n\t\ttxn = db.NewTransaction(true)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(kb, []byte(\"b2\"))))\n\t\trequire.NoError(t, txn.Commit())\n\t\trequire.Equal(t, uint64(2), db.orc.readTs())\n\n\t\ttxn2 := db.NewTransaction(true)\n\t\trequire.NoError(t, txn2.SetEntry(NewEntry(kd, 
[]byte(\"d2\"))))\n\t\trequire.NoError(t, txn2.Delete(kc))\n\n\t\ttxn = db.NewTransaction(true)\n\t\tdefer txn.Discard()\n\t\trev := DefaultIteratorOptions\n\t\trev.Reverse = true\n\n\t\titr := txn.NewIterator(DefaultIteratorOptions)\n\t\titr.Seek([]byte(\"ab\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kb)\n\t\titr.Seek([]byte(\"ac\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kc)\n\t\titr.Seek(nil)\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kb)\n\t\titr.Seek([]byte(\"ac\"))\n\t\titr.Rewind()\n\t\titr.Seek(nil)\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kb)\n\t\titr.Seek([]byte(\"ac\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kc)\n\t\titr.Close()\n\n\t\t//  Keys: \"abc\", \"ade\"\n\t\t// Read pending writes.\n\t\titr = txn2.NewIterator(DefaultIteratorOptions)\n\t\titr.Seek([]byte(\"ab\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kb)\n\t\titr.Seek([]byte(\"ac\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kd)\n\t\titr.Seek(nil)\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kb)\n\t\titr.Seek([]byte(\"ac\"))\n\t\titr.Rewind()\n\t\titr.Seek(nil)\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kb)\n\t\titr.Seek([]byte(\"ad\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kd)\n\t\titr.Close()\n\n\t\titr = txn.NewIterator(rev)\n\t\titr.Seek([]byte(\"ac\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kb)\n\t\titr.Seek([]byte(\"ad\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kc)\n\t\titr.Seek(nil)\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kc)\n\t\titr.Seek([]byte(\"ac\"))\n\t\titr.Rewind()\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), 
kc)\n\t\titr.Seek([]byte(\"ad\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kc)\n\t\titr.Close()\n\n\t\t//  Keys: \"abc\", \"ade\"\n\t\titr = txn2.NewIterator(rev)\n\t\titr.Seek([]byte(\"ad\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kb)\n\t\titr.Seek([]byte(\"ae\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kd)\n\t\titr.Seek(nil)\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kd)\n\t\titr.Seek([]byte(\"ab\"))\n\t\titr.Rewind()\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kd)\n\t\titr.Seek([]byte(\"ac\"))\n\t\trequire.True(t, itr.Valid())\n\t\trequire.Equal(t, itr.item.Key(), kb)\n\t\titr.Close()\n\t})\n}\n\nfunc TestIteratorAllVersionsWithDeleted(t *testing.T) {\n\ttest := func(t *testing.T, db *DB) {\n\t\t// Write two keys\n\t\terr := db.Update(func(txn *Txn) error {\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(\"answer1\"), []byte(\"42\"))))\n\t\t\treturn txn.SetEntry(NewEntry([]byte(\"answer2\"), []byte(\"43\")))\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t\t// Delete the specific key version from underlying db directly\n\t\terr = db.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get([]byte(\"answer1\"))\n\t\t\trequire.NoError(t, err)\n\t\t\terr = txn.db.batchSet([]*Entry{\n\t\t\t\t{\n\t\t\t\t\tKey:  y.KeyWithTs(item.key, item.version),\n\t\t\t\t\tmeta: bitDelete,\n\t\t\t\t},\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t\treturn err\n\t\t})\n\t\trequire.NoError(t, err)\n\n\t\topts := DefaultIteratorOptions\n\t\topts.AllVersions = true\n\t\topts.PrefetchValues = false\n\n\t\t// Verify that deleted shows up when AllVersions is set.\n\t\terr = db.View(func(txn *Txn) error {\n\t\t\tit := txn.NewIterator(opts)\n\t\t\tdefer it.Close()\n\t\t\tvar count int\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\tcount++\n\t\t\t\titem := it.Item()\n\t\t\t\tif count == 1 {\n\t\t\t\t\trequire.Equal(t, 
[]byte(\"answer1\"), item.Key())\n\t\t\t\t\trequire.True(t, item.meta&bitDelete > 0)\n\t\t\t\t} else {\n\t\t\t\t\trequire.Equal(t, []byte(\"answer2\"), item.Key())\n\t\t\t\t}\n\t\t\t}\n\t\t\trequire.Equal(t, 2, count)\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\ttest(t, db)\n\t\t})\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\topt := getTestOptions(\"\")\n\t\topt.InMemory = true\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\ttest(t, db)\n\t\trequire.NoError(t, db.Close())\n\t})\n}\n\nfunc TestIteratorAllVersionsWithDeleted2(t *testing.T) {\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t// Set and delete alternatively\n\t\tfor i := 0; i < 4; i++ {\n\t\t\terr := db.Update(func(txn *Txn) error {\n\t\t\t\tif i%2 == 0 {\n\t\t\t\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(\"key\"), []byte(\"value\"))))\n\t\t\t\t\treturn nil\n\t\t\t\t}\n\t\t\t\treturn txn.Delete([]byte(\"key\"))\n\t\t\t})\n\t\t\trequire.NoError(t, err)\n\t\t}\n\n\t\topts := DefaultIteratorOptions\n\t\topts.AllVersions = true\n\t\topts.PrefetchValues = false\n\n\t\t// Verify that deleted shows up when AllVersions is set.\n\t\terr := db.View(func(txn *Txn) error {\n\t\t\tit := txn.NewIterator(opts)\n\t\t\tdefer it.Close()\n\t\t\tvar count int\n\t\t\tfor it.Rewind(); it.Valid(); it.Next() {\n\t\t\t\titem := it.Item()\n\t\t\t\trequire.Equal(t, []byte(\"key\"), item.Key())\n\t\t\t\tif count%2 != 0 {\n\t\t\t\t\tval, err := item.ValueCopy(nil)\n\t\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\trequire.Equal(t, val, []byte(\"value\"))\n\t\t\t\t} else {\n\t\t\t\t\trequire.True(t, item.meta&bitDelete > 0)\n\t\t\t\t}\n\t\t\t\tcount++\n\t\t\t}\n\t\t\trequire.Equal(t, 4, count)\n\t\t\treturn nil\n\t\t})\n\t\trequire.NoError(t, err)\n\t})\n}\n\nfunc TestManagedDB(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, 
err)\n\tdefer removeDir(dir)\n\n\topt := getTestOptions(dir)\n\topt.managedTxns = true\n\n\ttest := func(t *testing.T, db *DB) {\n\t\tkey := func(i int) []byte {\n\t\t\treturn []byte(fmt.Sprintf(\"key-%02d\", i))\n\t\t}\n\n\t\tval := func(i int) []byte {\n\t\t\treturn []byte(fmt.Sprintf(\"val-%d\", i))\n\t\t}\n\n\t\trequire.Panics(t, func() {\n\t\t\t_ = db.Update(func(tx *Txn) error { return nil })\n\t\t})\n\n\t\terr = db.View(func(tx *Txn) error { return nil })\n\t\trequire.NoError(t, err)\n\n\t\t// Write data at t=3.\n\t\ttxn := db.NewTransactionAt(3, true)\n\t\tfor i := 0; i <= 3; i++ {\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(key(i), val(i))))\n\t\t}\n\t\trequire.Error(t, txn.Commit())\n\t\trequire.NoError(t, txn.CommitAt(3, nil))\n\n\t\t// Read data at t=2.\n\t\ttxn = db.NewTransactionAt(2, false)\n\t\tfor i := 0; i <= 3; i++ {\n\t\t\t_, err := txn.Get(key(i))\n\t\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\t}\n\t\ttxn.Discard()\n\n\t\t// Read data at t=3.\n\t\ttxn = db.NewTransactionAt(3, false)\n\t\tfor i := 0; i <= 3; i++ {\n\t\t\titem, err := txn.Get(key(i))\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, uint64(3), item.Version())\n\t\t\tv, err := item.ValueCopy(nil)\n\t\t\trequire.NoError(t, err)\n\t\t\trequire.Equal(t, val(i), v)\n\t\t}\n\t\ttxn.Discard()\n\n\t\t// Write data at t=7.\n\t\ttxn = db.NewTransactionAt(6, true)\n\t\tfor i := 0; i <= 7; i++ {\n\t\t\t_, err := txn.Get(key(i))\n\t\t\tif err == nil {\n\t\t\t\tcontinue // Don't overwrite existing keys.\n\t\t\t}\n\t\t\trequire.NoError(t, txn.SetEntry(NewEntry(key(i), val(i))))\n\t\t}\n\t\trequire.NoError(t, txn.CommitAt(7, nil))\n\n\t\t// Read data at t=9.\n\t\ttxn = db.NewTransactionAt(9, false)\n\t\tfor i := 0; i <= 9; i++ {\n\t\t\titem, err := txn.Get(key(i))\n\t\t\tif i <= 7 {\n\t\t\t\trequire.NoError(t, err)\n\t\t\t} else {\n\t\t\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\t\t}\n\n\t\t\tif i <= 3 {\n\t\t\t\trequire.Equal(t, uint64(3), item.Version())\n\t\t\t} else if i <= 7 
{\n\t\t\t\trequire.Equal(t, uint64(7), item.Version())\n\t\t\t}\n\t\t\tif i <= 7 {\n\t\t\t\tv, err := item.ValueCopy(nil)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\trequire.Equal(t, val(i), v)\n\t\t\t}\n\t\t}\n\t\ttxn.Discard()\n\n\t\t// Write data to same key, causing a conflict\n\t\ttxn = db.NewTransactionAt(10, true)\n\t\ttxnb := db.NewTransactionAt(10, true)\n\t\t_, err := txnb.Get(key(0))\n\t\trequire.NoError(t, err)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(key(0), val(0))))\n\t\trequire.NoError(t, txnb.SetEntry(NewEntry(key(0), val(1))))\n\t\trequire.NoError(t, txn.CommitAt(11, nil))\n\t\trequire.Equal(t, ErrConflict, txnb.CommitAt(11, nil))\n\t}\n\tt.Run(\"disk mode\", func(t *testing.T) {\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\ttest(t, db)\n\t\trequire.NoError(t, db.Close())\n\t})\n\tt.Run(\"InMemory mode\", func(t *testing.T) {\n\t\topt.InMemory = true\n\t\topt.Dir = \"\"\n\t\topt.ValueDir = \"\"\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(t, err)\n\t\ttest(t, db)\n\t\trequire.NoError(t, db.Close())\n\t})\n\n}\n\nfunc TestArmV7Issue311Fix(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"\")\n\trequire.NoError(t, err)\n\n\tdefer removeDir(dir)\n\n\tdb, err := Open(DefaultOptions(dir).\n\t\tWithValueLogFileSize(16 << 20).\n\t\tWithBaseLevelSize(8 << 20).\n\t\tWithBaseTableSize(2 << 20).\n\t\tWithSyncWrites(false))\n\n\trequire.NoError(t, err)\n\n\terr = db.View(func(txn *Txn) error { return nil })\n\trequire.NoError(t, err)\n\n\terr = db.Update(func(txn *Txn) error {\n\t\treturn txn.SetEntry(NewEntry([]byte{0x11}, []byte{0x22}))\n\t})\n\trequire.NoError(t, err)\n\n\terr = db.Update(func(txn *Txn) error {\n\t\treturn txn.SetEntry(NewEntry([]byte{0x11}, []byte{0x22}))\n\t})\n\n\trequire.NoError(t, err)\n\trequire.NoError(t, db.Close())\n}\n\n// This test tries to perform a GetAndSet operation using multiple concurrent\n// transaction and only one of the transactions should be successful.\n// Regression test for 
https://github.com/dgraph-io/badger/issues/1289\nfunc TestConflict(t *testing.T) {\n\tkey := []byte(\"foo\")\n\tvar setCount atomic.Uint32\n\n\ttestAndSet := func(wg *sync.WaitGroup, db *DB) {\n\t\tdefer wg.Done()\n\t\ttxn := db.NewTransaction(true)\n\t\tdefer txn.Discard()\n\n\t\t_, err := txn.Get(key)\n\t\tif err == ErrKeyNotFound {\n\t\t\t// Unset the error.\n\t\t\terr = nil\n\t\t\trequire.NoError(t, txn.Set(key, []byte(\"AA\")))\n\t\t\ttxn.CommitWith(func(err error) {\n\t\t\t\tif err == nil {\n\t\t\t\t\trequire.LessOrEqual(t, uint32(1), setCount.Add(1))\n\t\t\t\t} else {\n\n\t\t\t\t\trequire.Error(t, err, ErrConflict)\n\t\t\t\t}\n\t\t\t})\n\t\t}\n\t\trequire.NoError(t, err)\n\t}\n\ttestAndSetItr := func(wg *sync.WaitGroup, db *DB) {\n\t\tdefer wg.Done()\n\t\ttxn := db.NewTransaction(true)\n\t\tdefer txn.Discard()\n\n\t\tiopt := DefaultIteratorOptions\n\t\tit := txn.NewIterator(iopt)\n\n\t\tfound := false\n\t\tfor it.Seek(key); it.Valid(); it.Next() {\n\t\t\tfound = true\n\t\t}\n\t\tit.Close()\n\n\t\tif !found {\n\t\t\trequire.NoError(t, txn.Set(key, []byte(\"AA\")))\n\t\t\ttxn.CommitWith(func(err error) {\n\t\t\t\tif err == nil {\n\t\t\t\t\trequire.LessOrEqual(t, setCount.Add(1), uint32(1))\n\t\t\t\t} else {\n\t\t\t\t\trequire.Error(t, err, ErrConflict)\n\t\t\t\t}\n\t\t\t})\n\t\t}\n\t}\n\n\trunTest := func(t *testing.T, fn func(wg *sync.WaitGroup, db *DB)) {\n\t\tloop := 10\n\t\tnumGo := 16 // This many concurrent transactions.\n\t\tfor i := 0; i < loop; i++ {\n\t\t\tvar wg sync.WaitGroup\n\t\t\twg.Add(numGo)\n\t\t\tsetCount.Store(0)\n\t\t\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\t\t\tfor j := 0; j < numGo; j++ {\n\t\t\t\t\tgo fn(&wg, db)\n\t\t\t\t}\n\t\t\t\twg.Wait()\n\t\t\t})\n\t\t\trequire.Equal(t, uint32(1), setCount.Load())\n\t\t}\n\t}\n\tt.Run(\"TxnGet\", func(t *testing.T) {\n\t\trunTest(t, testAndSet)\n\t})\n\tt.Run(\"ItrSeek\", func(t *testing.T) {\n\t\trunTest(t, testAndSetItr)\n\t})\n}\n"
  },
  {
    "path": "util.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"encoding/hex\"\n\t\"fmt\"\n\t\"math/rand\"\n\t\"os\"\n\t\"time\"\n\n\t\"github.com/dgraph-io/badger/v4/table\"\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\nfunc (s *levelsController) validate() error {\n\tfor _, l := range s.levels {\n\t\tif err := l.validate(); err != nil {\n\t\t\treturn y.Wrap(err, \"Levels Controller\")\n\t\t}\n\t}\n\treturn nil\n}\n\n// Check does some sanity check on one level of data or in-memory index.\nfunc (s *levelHandler) validate() error {\n\tif s.level == 0 {\n\t\treturn nil\n\t}\n\n\ts.RLock()\n\tdefer s.RUnlock()\n\tnumTables := len(s.tables)\n\tfor j := 1; j < numTables; j++ {\n\t\tif j >= len(s.tables) {\n\t\t\treturn fmt.Errorf(\"Level %d, j=%d numTables=%d\", s.level, j, numTables)\n\t\t}\n\n\t\tif y.CompareKeys(s.tables[j-1].Biggest(), s.tables[j].Smallest()) >= 0 {\n\t\t\treturn fmt.Errorf(\n\t\t\t\t\"Inter: Biggest(j-1)[%d] \\n%s\\n vs Smallest(j)[%d]: \\n%s\\n: \"+\n\t\t\t\t\t\"level=%d j=%d numTables=%d\",\n\t\t\t\ts.tables[j-1].ID(), hex.Dump(s.tables[j-1].Biggest()), s.tables[j].ID(),\n\t\t\t\thex.Dump(s.tables[j].Smallest()), s.level, j, numTables)\n\t\t}\n\n\t\tif y.CompareKeys(s.tables[j].Smallest(), s.tables[j].Biggest()) > 0 {\n\t\t\treturn fmt.Errorf(\n\t\t\t\t\"Intra: \\n%s\\n vs \\n%s\\n: level=%d j=%d numTables=%d\",\n\t\t\t\thex.Dump(s.tables[j].Smallest()), hex.Dump(s.tables[j].Biggest()), s.level, j, numTables)\n\t\t}\n\t}\n\treturn nil\n}\n\n// func (s *KV) debugPrintMore() { s.lc.debugPrintMore() }\n\n// // debugPrintMore shows key ranges of each level.\n// func (s *levelsController) debugPrintMore() {\n// \ts.Lock()\n// \tdefer s.Unlock()\n// \tfor i := 0; i < s.kv.opt.MaxLevels; i++ {\n// \t\ts.levels[i].debugPrintMore()\n// \t}\n// }\n\n// func (s *levelHandler) debugPrintMore() {\n// \ts.RLock()\n// \tdefer s.RUnlock()\n// \ts.elog.Printf(\"Level 
%d:\", s.level)\n// \tfor _, t := range s.tables {\n// \t\ty.Printf(\" [%s, %s]\", t.Smallest(), t.Biggest())\n// \t}\n// \ty.Printf(\"\\n\")\n// }\n\n// reserveFileID reserves a unique file id.\nfunc (s *levelsController) reserveFileID() uint64 {\n\tid := s.nextFileID.Add(1)\n\treturn id - 1\n}\n\nfunc getIDMap(dir string) map[uint64]struct{} {\n\tfileInfos, err := os.ReadDir(dir)\n\ty.Check(err)\n\tidMap := make(map[uint64]struct{})\n\tfor _, info := range fileInfos {\n\t\tif info.IsDir() {\n\t\t\tcontinue\n\t\t}\n\t\tfileID, ok := table.ParseFileID(info.Name())\n\t\tif !ok {\n\t\t\tcontinue\n\t\t}\n\t\tidMap[fileID] = struct{}{}\n\t}\n\treturn idMap\n}\n\nfunc init() {\n\trand.Seed(time.Now().UnixNano())\n}\n"
  },
  {
    "path": "value.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"context\"\n\t\"errors\"\n\t\"fmt\"\n\t\"hash\"\n\t\"hash/crc32\"\n\t\"io\"\n\t\"math\"\n\t\"os\"\n\t\"sort\"\n\t\"strconv\"\n\t\"strings\"\n\t\"sync\"\n\t\"sync/atomic\"\n\n\t\"go.opentelemetry.io/otel\"\n\t\"go.opentelemetry.io/otel/attribute\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\n// maxVlogFileSize is the maximum size of the vlog file which can be created. Vlog Offset is of\n// uint32, so limiting at max uint32.\nvar maxVlogFileSize uint32 = math.MaxUint32\n\n// Values have their first byte being byteData or byteDelete. This helps us distinguish between\n// a key that has never been seen and a key that has been explicitly deleted.\nconst (\n\tbitDelete                 byte = 1 << 0 // Set if the key has been deleted.\n\tbitValuePointer           byte = 1 << 1 // Set if the value is NOT stored directly next to key.\n\tbitDiscardEarlierVersions byte = 1 << 2 // Set if earlier versions can be discarded.\n\t// Set if item shouldn't be discarded via compactions (used by merge operator)\n\tbitMergeEntry byte = 1 << 3\n\t// The MSB 2 bits are for transactions.\n\tbitTxn    byte = 1 << 6 // Set if the entry is part of a txn.\n\tbitFinTxn byte = 1 << 7 // Set if the entry is to indicate end of txn in value log.\n\n\tmi int64 = 1 << 20 //nolint:unused\n\n\t// size of vlog header.\n\t// +----------------+------------------+\n\t// | keyID(8 bytes) |  baseIV(12 bytes)|\n\t// +----------------+------------------+\n\tvlogHeaderSize = 20\n)\n\nvar errStop = errors.New(\"Stop iteration\")\nvar errTruncate = errors.New(\"Do truncate\")\n\ntype logEntry func(e Entry, vp valuePointer) error\n\ntype safeRead struct {\n\tk []byte\n\tv []byte\n\n\trecordOffset uint32\n\tlf           *logFile\n}\n\n// hashReader implements io.Reader, io.ByteReader interfaces. 
It also keeps track of the number\n// of bytes read. The hashReader writes to h (hash) what it reads from r.\ntype hashReader struct {\n\tr         io.Reader\n\th         hash.Hash32\n\tbytesRead int // Number of bytes read.\n}\n\nfunc newHashReader(r io.Reader) *hashReader {\n\thash := crc32.New(y.CastagnoliCrcTable)\n\treturn &hashReader{\n\t\tr: r,\n\t\th: hash,\n\t}\n}\n\n// Read reads len(p) bytes from the reader. Returns the number of bytes read, error on failure.\nfunc (t *hashReader) Read(p []byte) (int, error) {\n\tn, err := t.r.Read(p)\n\tif err != nil {\n\t\treturn n, err\n\t}\n\tt.bytesRead += n\n\treturn t.h.Write(p[:n])\n}\n\n// ReadByte reads exactly one byte from the reader. Returns error on failure.\nfunc (t *hashReader) ReadByte() (byte, error) {\n\tb := make([]byte, 1)\n\t_, err := t.Read(b)\n\treturn b[0], err\n}\n\n// Sum32 returns the sum32 of the underlying hash.\nfunc (t *hashReader) Sum32() uint32 {\n\treturn t.h.Sum32()\n}\n\n// Entry reads an entry from the provided reader. It also validates the checksum for every entry\n// read. 
Returns error on failure.\nfunc (r *safeRead) Entry(reader io.Reader) (*Entry, error) {\n\ttee := newHashReader(reader)\n\tvar h header\n\thlen, err := h.DecodeFrom(tee)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tif h.klen > uint32(1<<16) { // Key length must be below uint16.\n\t\treturn nil, errTruncate\n\t}\n\tkl := int(h.klen)\n\tif cap(r.k) < kl {\n\t\tr.k = make([]byte, 2*kl)\n\t}\n\tvl := int(h.vlen)\n\tif cap(r.v) < vl {\n\t\tr.v = make([]byte, 2*vl)\n\t}\n\n\te := &Entry{}\n\te.offset = r.recordOffset\n\te.hlen = hlen\n\tbuf := make([]byte, h.klen+h.vlen)\n\tif _, err := io.ReadFull(tee, buf[:]); err != nil {\n\t\tif err == io.EOF {\n\t\t\terr = errTruncate\n\t\t}\n\t\treturn nil, err\n\t}\n\tif r.lf.encryptionEnabled() {\n\t\tif buf, err = r.lf.decryptKV(buf[:], r.recordOffset); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\te.Key = buf[:h.klen]\n\te.Value = buf[h.klen:]\n\tvar crcBuf [crc32.Size]byte\n\tif _, err := io.ReadFull(reader, crcBuf[:]); err != nil {\n\t\tif err == io.EOF {\n\t\t\terr = errTruncate\n\t\t}\n\t\treturn nil, err\n\t}\n\tcrc := y.BytesToU32(crcBuf[:])\n\tif crc != tee.Sum32() {\n\t\treturn nil, errTruncate\n\t}\n\te.meta = h.meta\n\te.UserMeta = h.userMeta\n\te.ExpiresAt = h.expiresAt\n\treturn e, nil\n}\n\nfunc (vlog *valueLog) rewrite(f *logFile) error {\n\tvlog.filesLock.RLock()\n\tfor _, fid := range vlog.filesToBeDeleted {\n\t\tif fid == f.fid {\n\t\t\tvlog.filesLock.RUnlock()\n\t\t\treturn fmt.Errorf(\"value log file already marked for deletion fid: %d\", fid)\n\t\t}\n\t}\n\tmaxFid := vlog.maxFid\n\ty.AssertTruef(f.fid < maxFid, \"fid to move: %d. 
Current max fid: %d\", f.fid, maxFid)\n\tvlog.filesLock.RUnlock()\n\n\tvlog.opt.Infof(\"Rewriting fid: %d\", f.fid)\n\twb := make([]*Entry, 0, 1000)\n\tvar size int64\n\n\ty.AssertTrue(vlog.db != nil)\n\tvar count, moved int\n\tfe := func(e Entry) error {\n\t\tcount++\n\t\tif count%100000 == 0 {\n\t\t\tvlog.opt.Debugf(\"Processing entry %d\", count)\n\t\t}\n\n\t\tvs, err := vlog.db.get(e.Key)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tif discardEntry(e, vs, vlog.db) {\n\t\t\treturn nil\n\t\t}\n\n\t\t// Value is still present in value log.\n\t\tif len(vs.Value) == 0 {\n\t\t\treturn fmt.Errorf(\"Empty value: %+v\", vs)\n\t\t}\n\t\tvar vp valuePointer\n\t\tvp.Decode(vs.Value)\n\n\t\t// If the entry found from the LSM Tree points to a newer vlog file, don't do anything.\n\t\tif vp.Fid > f.fid {\n\t\t\treturn nil\n\t\t}\n\t\t// If the entry found from the LSM Tree points to an offset greater than the one\n\t\t// read from vlog, don't do anything.\n\t\tif vp.Offset > e.offset {\n\t\t\treturn nil\n\t\t}\n\t\t// If the entry read from LSM Tree and vlog file point to the same vlog file and offset,\n\t\t// insert them back into the DB.\n\t\t// NOTE: It might be possible that the entry read from the LSM Tree points to\n\t\t// an older vlog file. See the comments in the else part.\n\t\tif vp.Fid == f.fid && vp.Offset == e.offset {\n\t\t\tmoved++\n\t\t\t// This new entry only contains the key, and a pointer to the value.\n\t\t\tne := new(Entry)\n\t\t\t// Remove only the bitValuePointer and transaction markers. We\n\t\t\t// should keep the other bits.\n\t\t\tne.meta = e.meta &^ (bitValuePointer | bitTxn | bitFinTxn)\n\t\t\tne.UserMeta = e.UserMeta\n\t\t\tne.ExpiresAt = e.ExpiresAt\n\t\t\tne.Key = append([]byte{}, e.Key...)\n\t\t\tne.Value = append([]byte{}, e.Value...)\n\t\t\tes := ne.estimateSizeAndSetThreshold(vlog.db.valueThreshold())\n\t\t\t// Consider size of value as well while considering the total size\n\t\t\t// of the batch. 
There have been reports of high memory usage in\n\t\t\t// rewrite because we don't consider the value size. See #1292.\n\t\t\tes += int64(len(e.Value))\n\n\t\t\t// Ensure length and size of wb is within transaction limits.\n\t\t\tif int64(len(wb)+1) >= vlog.opt.maxBatchCount ||\n\t\t\t\tsize+es >= vlog.opt.maxBatchSize {\n\t\t\t\tif err := vlog.db.batchSet(wb); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t\tsize = 0\n\t\t\t\twb = wb[:0]\n\t\t\t}\n\t\t\twb = append(wb, ne)\n\t\t\tsize += es\n\t\t} else { //nolint:staticcheck\n\t\t\t// It might be possible that the entry read from LSM Tree points to\n\t\t\t// an older vlog file.  This can happen in the following situation.\n\t\t\t// Assume DB is opened with\n\t\t\t// numberOfVersionsToKeep=1\n\t\t\t//\n\t\t\t// Now, if we have ONLY one key in the system \"FOO\" which has been\n\t\t\t// updated 3 times and the same key has been garbage collected 3\n\t\t\t// times, we'll have 3 versions of the movekey\n\t\t\t// for the same key \"FOO\".\n\t\t\t//\n\t\t\t// NOTE: moveKeyi is the gc'ed version of the original key with version i\n\t\t\t// We're calling the gc'ed keys as moveKey to simplify the\n\t\t\t// explanation. We used to add move keys but we no longer do that.\n\t\t\t//\n\t\t\t// Assume we have 3 move keys in L0.\n\t\t\t// - moveKey1 (points to vlog file 10),\n\t\t\t// - moveKey2 (points to vlog file 14) and\n\t\t\t// - moveKey3 (points to vlog file 15).\n\t\t\t//\n\t\t\t// Also, assume there is another move key \"moveKey1\" (points to\n\t\t\t// vlog file 6) (this is also a move Key for key \"FOO\" ) on upper\n\t\t\t// levels (let's say 3). 
The move key \"moveKey1\" on level 0 was\n\t\t\t// inserted because vlog file 6 was GCed.\n\t\t\t//\n\t\t\t// Here's what the arrangement looks like\n\t\t\t// L0 => (moveKey1 => vlog10), (moveKey2 => vlog14), (moveKey3 => vlog15)\n\t\t\t// L1 => ....\n\t\t\t// L2 => ....\n\t\t\t// L3 => (moveKey1 => vlog6)\n\t\t\t//\n\t\t\t// When L0 compaction runs, it keeps only moveKey3 because the number of versions\n\t\t\t// to keep is set to 1. (we've dropped moveKey1's latest version)\n\t\t\t//\n\t\t\t// The new arrangement of keys is\n\t\t\t// L0 => ....\n\t\t\t// L1 => (moveKey3 => vlog15)\n\t\t\t// L2 => ....\n\t\t\t// L3 => (moveKey1 => vlog6)\n\t\t\t//\n\t\t\t// Now if we try to GC vlog file 10, the entry read from vlog file\n\t\t\t// will point to vlog10 but the entry read from LSM Tree will point\n\t\t\t// to vlog6. The move key read from LSM tree will point to vlog6\n\t\t\t// because we've asked for version 1 of the move key.\n\t\t\t//\n\t\t\t// This might seem like an issue but it's not really an issue\n\t\t\t// because the user has set the number of versions to keep to 1 and\n\t\t\t// the latest version of moveKey points to the correct vlog file\n\t\t\t// and offset. 
The stale move key on L3 will be eventually dropped\n\t\t\t// by compaction because there are newer versions in the upper\n\t\t\t// levels.\n\t\t}\n\t\treturn nil\n\t}\n\n\t_, err := f.iterate(vlog.opt.ReadOnly, 0, func(e Entry, vp valuePointer) error {\n\t\treturn fe(e)\n\t})\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tbatchSize := 1024\n\tvar loops int\n\tfor i := 0; i < len(wb); {\n\t\tloops++\n\t\tif batchSize == 0 {\n\t\t\tvlog.db.opt.Warningf(\"We shouldn't reach batch size of zero.\")\n\t\t\treturn ErrNoRewrite\n\t\t}\n\t\tend := i + batchSize\n\t\tif end > len(wb) {\n\t\t\tend = len(wb)\n\t\t}\n\t\tif err := vlog.db.batchSet(wb[i:end]); err != nil {\n\t\t\tif err == ErrTxnTooBig {\n\t\t\t\t// Decrease the batch size to half.\n\t\t\t\tbatchSize = batchSize / 2\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\treturn err\n\t\t}\n\t\ti += batchSize\n\t}\n\tvlog.opt.Infof(\"Processed %d entries in %d loops\", len(wb), loops)\n\tvlog.opt.Infof(\"Total entries: %d. Moved: %d\", count, moved)\n\tvlog.opt.Infof(\"Removing fid: %d\", f.fid)\n\tvar deleteFileNow bool\n\t// Entries written to LSM. 
Remove the older file now.\n\t{\n\t\tvlog.filesLock.Lock()\n\t\t// Just a sanity-check.\n\t\tif _, ok := vlog.filesMap[f.fid]; !ok {\n\t\t\tvlog.filesLock.Unlock()\n\t\t\treturn fmt.Errorf(\"Unable to find fid: %d\", f.fid)\n\t\t}\n\t\tif vlog.iteratorCount() == 0 {\n\t\t\tdelete(vlog.filesMap, f.fid)\n\t\t\tdeleteFileNow = true\n\t\t} else {\n\t\t\tvlog.filesToBeDeleted = append(vlog.filesToBeDeleted, f.fid)\n\t\t}\n\t\tvlog.filesLock.Unlock()\n\t}\n\n\tif deleteFileNow {\n\t\tif err := vlog.deleteLogFile(f); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\treturn nil\n}\n\nfunc (vlog *valueLog) incrIteratorCount() {\n\tvlog.numActiveIterators.Add(1)\n}\n\nfunc (vlog *valueLog) iteratorCount() int {\n\treturn int(vlog.numActiveIterators.Load())\n}\n\nfunc (vlog *valueLog) decrIteratorCount() error {\n\tnum := vlog.numActiveIterators.Add(-1)\n\tif num != 0 {\n\t\treturn nil\n\t}\n\n\tvlog.filesLock.Lock()\n\tlfs := make([]*logFile, 0, len(vlog.filesToBeDeleted))\n\tfor _, id := range vlog.filesToBeDeleted {\n\t\tlfs = append(lfs, vlog.filesMap[id])\n\t\tdelete(vlog.filesMap, id)\n\t}\n\tvlog.filesToBeDeleted = nil\n\tvlog.filesLock.Unlock()\n\n\tfor _, lf := range lfs {\n\t\tif err := vlog.deleteLogFile(lf); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\treturn nil\n}\n\nfunc (vlog *valueLog) deleteLogFile(lf *logFile) error {\n\tif lf == nil {\n\t\treturn nil\n\t}\n\tlf.lock.Lock()\n\tdefer lf.lock.Unlock()\n\t// Delete fid from discard stats as well.\n\tvlog.discardStats.Update(lf.fid, -1)\n\n\treturn lf.Delete()\n}\n\nfunc (vlog *valueLog) dropAll() (int, error) {\n\t// If db is opened in InMemory mode, we don't need to do anything since there are no vlog files.\n\tif vlog.db.opt.InMemory {\n\t\treturn 0, nil\n\t}\n\t// We don't want to block dropAll on any pending transactions. 
So, don't worry about iterator\n\t// count.\n\tvar count int\n\tdeleteAll := func() error {\n\t\tvlog.filesLock.Lock()\n\t\tdefer vlog.filesLock.Unlock()\n\t\tfor _, lf := range vlog.filesMap {\n\t\t\tif err := vlog.deleteLogFile(lf); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tcount++\n\t\t}\n\t\tvlog.filesMap = make(map[uint32]*logFile)\n\t\tvlog.maxFid = 0\n\t\treturn nil\n\t}\n\tif err := deleteAll(); err != nil {\n\t\treturn count, err\n\t}\n\n\tvlog.db.opt.Infof(\"Value logs deleted. Creating value log file: 1\")\n\tif _, err := vlog.createVlogFile(); err != nil { // Called while writes are stopped.\n\t\treturn count, err\n\t}\n\treturn count, nil\n}\n\nfunc (db *DB) valueThreshold() int64 {\n\treturn db.threshold.valueThreshold.Load()\n}\n\ntype valueLog struct {\n\tdirPath string\n\n\t// guards our view of which files exist, which to be deleted, how many active iterators\n\tfilesLock        sync.RWMutex\n\tfilesMap         map[uint32]*logFile\n\tmaxFid           uint32\n\tfilesToBeDeleted []uint32\n\t// A refcount of iterators -- when this hits zero, we can delete the filesToBeDeleted.\n\tnumActiveIterators atomic.Int32\n\n\tdb                *DB\n\twritableLogOffset atomic.Uint32 // read by read, written by write\n\tnumEntriesWritten uint32\n\topt               Options\n\n\tgarbageCh    chan struct{}\n\tdiscardStats *discardStats\n}\n\nfunc vlogFilePath(dirPath string, fid uint32) string {\n\treturn fmt.Sprintf(\"%s%s%06d.vlog\", dirPath, string(os.PathSeparator), fid)\n}\n\nfunc (vlog *valueLog) fpath(fid uint32) string {\n\treturn vlogFilePath(vlog.dirPath, fid)\n}\n\nfunc (vlog *valueLog) populateFilesMap() error {\n\tvlog.filesMap = make(map[uint32]*logFile)\n\n\tfiles, err := os.ReadDir(vlog.dirPath)\n\tif err != nil {\n\t\treturn errFile(err, vlog.dirPath, \"Unable to open log dir.\")\n\t}\n\n\tfound := make(map[uint64]struct{})\n\tfor _, file := range files {\n\t\tif !strings.HasSuffix(file.Name(), \".vlog\") {\n\t\t\tcontinue\n\t\t}\n\t\tfsz 
:= len(file.Name())\n\t\tfid, err := strconv.ParseUint(file.Name()[:fsz-5], 10, 32)\n\t\tif err != nil {\n\t\t\treturn errFile(err, file.Name(), \"Unable to parse log id.\")\n\t\t}\n\t\tif _, ok := found[fid]; ok {\n\t\t\treturn errFile(err, file.Name(), \"Duplicate file found. Please delete one.\")\n\t\t}\n\t\tfound[fid] = struct{}{}\n\n\t\tlf := &logFile{\n\t\t\tfid:      uint32(fid),\n\t\t\tpath:     vlog.fpath(uint32(fid)),\n\t\t\tregistry: vlog.db.registry,\n\t\t}\n\t\tvlog.filesMap[uint32(fid)] = lf\n\t\tif vlog.maxFid < uint32(fid) {\n\t\t\tvlog.maxFid = uint32(fid)\n\t\t}\n\t}\n\treturn nil\n}\n\nfunc (vlog *valueLog) createVlogFile() (*logFile, error) {\n\tfid := vlog.maxFid + 1\n\tpath := vlog.fpath(fid)\n\tlf := &logFile{\n\t\tfid:      fid,\n\t\tpath:     path,\n\t\tregistry: vlog.db.registry,\n\t\twriteAt:  vlogHeaderSize,\n\t\topt:      vlog.opt,\n\t}\n\terr := lf.open(path, os.O_RDWR|os.O_CREATE|os.O_EXCL, 2*vlog.opt.ValueLogFileSize)\n\tif err != z.NewFile && err != nil {\n\t\treturn nil, err\n\t}\n\n\tvlog.filesLock.Lock()\n\tvlog.filesMap[fid] = lf\n\ty.AssertTrue(vlog.maxFid < fid)\n\tvlog.maxFid = fid\n\t// writableLogOffset is only written by write func, but read by Read func.\n\t// To avoid a race condition, all reads and updates to this variable must be\n\t// done via atomics.\n\tvlog.writableLogOffset.Store(vlogHeaderSize)\n\tvlog.numEntriesWritten = 0\n\tvlog.filesLock.Unlock()\n\n\treturn lf, nil\n}\n\nfunc errFile(err error, path string, msg string) error {\n\treturn fmt.Errorf(\"%s. Path=%s. Error=%v\", msg, path, err)\n}\n\n// init initializes the value log struct. This initialization needs to happen\n// before compactions start.\nfunc (vlog *valueLog) init(db *DB) {\n\tvlog.opt = db.opt\n\tvlog.db = db\n\t// We don't need to open any vlog files or collect stats for GC if DB is opened\n\t// in InMemory mode. 
InMemory mode doesn't create any files/directories on disk.\n\tif vlog.opt.InMemory {\n\t\treturn\n\t}\n\tvlog.dirPath = vlog.opt.ValueDir\n\n\tvlog.garbageCh = make(chan struct{}, 1) // Only allow one GC at a time.\n\tlf, err := InitDiscardStats(vlog.opt)\n\ty.Check(err)\n\tvlog.discardStats = lf\n\t// See TestPersistLFDiscardStats for purpose of statement below.\n\tdb.logToSyncChan(endVLogInitMsg)\n}\n\nfunc (vlog *valueLog) open(db *DB) error {\n\t// We don't need to open any vlog files or collect stats for GC if DB is opened\n\t// in InMemory mode. InMemory mode doesn't create any files/directories on disk.\n\tif db.opt.InMemory {\n\t\treturn nil\n\t}\n\n\tif err := vlog.populateFilesMap(); err != nil {\n\t\treturn err\n\t}\n\t// If no files are found, then create a new file.\n\tif len(vlog.filesMap) == 0 {\n\t\tif vlog.opt.ReadOnly {\n\t\t\treturn nil\n\t\t}\n\t\t_, err := vlog.createVlogFile()\n\t\treturn y.Wrapf(err, \"Error while creating log file in valueLog.open\")\n\t}\n\tfids := vlog.sortedFids()\n\tfor _, fid := range fids {\n\t\tlf, ok := vlog.filesMap[fid]\n\t\ty.AssertTrue(ok)\n\n\t\t// Just open in RDWR mode. This should not create a new log file.\n\t\tlf.opt = vlog.opt\n\t\tif err := lf.open(vlog.fpath(fid), os.O_RDWR,\n\t\t\t2*vlog.opt.ValueLogFileSize); err != nil {\n\t\t\treturn y.Wrapf(err, \"Open existing file: %q\", lf.path)\n\t\t}\n\t\t// We shouldn't delete the maxFid file.\n\t\tif lf.size.Load() == vlogHeaderSize && fid != vlog.maxFid {\n\t\t\tvlog.opt.Infof(\"Deleting empty file: %s\", lf.path)\n\t\t\tif err := lf.Delete(); err != nil {\n\t\t\t\treturn y.Wrapf(err, \"while trying to delete empty file: %s\", lf.path)\n\t\t\t}\n\t\t\tdelete(vlog.filesMap, fid)\n\t\t}\n\t}\n\n\tif vlog.opt.ReadOnly {\n\t\treturn nil\n\t}\n\t// Now we can read the latest value log file, and see if it needs truncation. 
We could\n\t// technically do this over all the value log files, but that would mean slowing down the value\n\t// log open.\n\tlast, ok := vlog.filesMap[vlog.maxFid]\n\ty.AssertTrue(ok)\n\tlastOff, err := last.iterate(vlog.opt.ReadOnly, vlogHeaderSize,\n\t\tfunc(_ Entry, vp valuePointer) error {\n\t\t\treturn nil\n\t\t})\n\tif err != nil {\n\t\treturn y.Wrapf(err, \"while iterating over: %s\", last.path)\n\t}\n\tif err := last.Truncate(int64(lastOff)); err != nil {\n\t\treturn y.Wrapf(err, \"while truncating last value log file: %s\", last.path)\n\t}\n\n\t// Don't write to the old log file. Always create a new one.\n\tif _, err := vlog.createVlogFile(); err != nil {\n\t\treturn y.Wrapf(err, \"Error while creating log file in valueLog.open\")\n\t}\n\treturn nil\n}\n\nfunc (vlog *valueLog) Close() error {\n\tif vlog == nil || vlog.db == nil || vlog.db.opt.InMemory {\n\t\treturn nil\n\t}\n\n\tvlog.opt.Debugf(\"Stopping garbage collection of values.\")\n\tvar err error\n\tfor id, lf := range vlog.filesMap {\n\t\tlf.lock.Lock() // We won’t release the lock.\n\t\toffset := int64(-1)\n\n\t\tif !vlog.opt.ReadOnly && id == vlog.maxFid {\n\t\t\toffset = int64(vlog.woffset())\n\t\t}\n\t\tif terr := lf.Close(offset); terr != nil && err == nil {\n\t\t\terr = terr\n\t\t}\n\t}\n\tif vlog.discardStats != nil {\n\t\tvlog.db.captureDiscardStats()\n\t\tif terr := vlog.discardStats.Close(-1); terr != nil && err == nil {\n\t\t\terr = terr\n\t\t}\n\t}\n\treturn err\n}\n\n// sortedFids returns the file id's not pending deletion, sorted.  
Assumes we have shared access to\n// filesMap.\nfunc (vlog *valueLog) sortedFids() []uint32 {\n\ttoBeDeleted := make(map[uint32]struct{})\n\tfor _, fid := range vlog.filesToBeDeleted {\n\t\ttoBeDeleted[fid] = struct{}{}\n\t}\n\tret := make([]uint32, 0, len(vlog.filesMap))\n\tfor fid := range vlog.filesMap {\n\t\tif _, ok := toBeDeleted[fid]; !ok {\n\t\t\tret = append(ret, fid)\n\t\t}\n\t}\n\tsort.Slice(ret, func(i, j int) bool {\n\t\treturn ret[i] < ret[j]\n\t})\n\treturn ret\n}\n\ntype request struct {\n\t// Input values\n\tEntries []*Entry\n\t// Output values and wait group stuff below\n\tPtrs []valuePointer\n\tWg   sync.WaitGroup\n\tErr  error\n\tref  atomic.Int32\n}\n\nfunc (req *request) reset() {\n\treq.Entries = req.Entries[:0]\n\treq.Ptrs = req.Ptrs[:0]\n\treq.Wg = sync.WaitGroup{}\n\treq.Err = nil\n\treq.ref.Store(0)\n}\n\nfunc (req *request) IncrRef() {\n\treq.ref.Add(1)\n}\n\nfunc (req *request) DecrRef() {\n\tnRef := req.ref.Add(-1)\n\tif nRef > 0 {\n\t\treturn\n\t}\n\treq.Entries = nil\n\trequestPool.Put(req)\n}\n\nfunc (req *request) Wait() error {\n\treq.Wg.Wait()\n\terr := req.Err\n\treq.DecrRef() // DecrRef after writing to DB.\n\treturn err\n}\n\ntype requests []*request\n\nfunc (reqs requests) DecrRef() {\n\tfor _, req := range reqs {\n\t\treq.DecrRef()\n\t}\n}\n\nfunc (reqs requests) IncrRef() {\n\tfor _, req := range reqs {\n\t\treq.IncrRef()\n\t}\n}\n\n// sync function syncs content of latest value log file to disk. Syncing of value log directory is\n// not required here as it happens every time a value log file rotation happens(check createVlogFile\n// function). During rotation, previous value log file also gets synced to disk. It only syncs file\n// if fid >= vlog.maxFid. In some cases such as replay(while opening db), it might be called with\n// fid < vlog.maxFid. 
To sync irrespective of file id just call it with math.MaxUint32.\nfunc (vlog *valueLog) sync() error {\n\tif vlog.opt.SyncWrites || vlog.opt.InMemory {\n\t\treturn nil\n\t}\n\n\tvlog.filesLock.RLock()\n\tmaxFid := vlog.maxFid\n\tcurlf := vlog.filesMap[maxFid]\n\t// Sometimes it is possible that vlog.maxFid has been increased but file creation\n\t// with same id is still in progress and this function is called. In those cases\n\t// entry for the file might not be present in vlog.filesMap.\n\tif curlf == nil {\n\t\tvlog.filesLock.RUnlock()\n\t\treturn nil\n\t}\n\tcurlf.lock.RLock()\n\tvlog.filesLock.RUnlock()\n\n\terr := curlf.Sync()\n\tcurlf.lock.RUnlock()\n\treturn err\n}\n\nfunc (vlog *valueLog) woffset() uint32 {\n\treturn vlog.writableLogOffset.Load()\n}\n\n// validateWrites will check whether the given requests can fit into 4GB vlog file.\n// NOTE: 4GB is the maximum size we can create for vlog because value pointer offset is of type\n// uint32. If we create more than 4GB, it will overflow uint32. So, limiting the size to 4GB.\nfunc (vlog *valueLog) validateWrites(reqs []*request) error {\n\tvlogOffset := uint64(vlog.woffset())\n\tfor _, req := range reqs {\n\t\t// calculate size of the request.\n\t\tsize := estimateRequestSize(req)\n\t\testimatedVlogOffset := vlogOffset + size\n\t\tif estimatedVlogOffset > uint64(maxVlogFileSize) {\n\t\t\treturn fmt.Errorf(\"Request size offset %d is bigger than maximum offset %d\",\n\t\t\t\testimatedVlogOffset, maxVlogFileSize)\n\t\t}\n\n\t\tif estimatedVlogOffset >= uint64(vlog.opt.ValueLogFileSize) {\n\t\t\t// We'll create a new vlog file if the estimated offset is greater or equal to\n\t\t\t// max vlog size. 
So, resetting the vlogOffset.\n\t\t\tvlogOffset = 0\n\t\t\tcontinue\n\t\t}\n\t\t// Estimated vlog offset will become current vlog offset if the vlog is not rotated.\n\t\tvlogOffset = estimatedVlogOffset\n\t}\n\treturn nil\n}\n\n// estimateRequestSize returns the size that needed to be written for the given request.\nfunc estimateRequestSize(req *request) uint64 {\n\tsize := uint64(0)\n\tfor _, e := range req.Entries {\n\t\tsize += uint64(maxHeaderSize + len(e.Key) + len(e.Value) + crc32.Size)\n\t}\n\treturn size\n}\n\n// write is thread-unsafe by design and should not be called concurrently.\nfunc (vlog *valueLog) write(reqs []*request) error {\n\tif vlog.db.opt.InMemory {\n\t\treturn nil\n\t}\n\t// Validate writes before writing to vlog. Because, we don't want to partially write and return\n\t// an error.\n\tif err := vlog.validateWrites(reqs); err != nil {\n\t\treturn y.Wrapf(err, \"while validating writes\")\n\t}\n\n\tvlog.filesLock.RLock()\n\tmaxFid := vlog.maxFid\n\tcurlf := vlog.filesMap[maxFid]\n\tvlog.filesLock.RUnlock()\n\n\tdefer func() {\n\t\tif vlog.opt.SyncWrites {\n\t\t\tif err := curlf.Sync(); err != nil {\n\t\t\t\tvlog.opt.Errorf(\"Error while curlf sync: %v\\n\", err)\n\t\t\t}\n\t\t}\n\t}()\n\n\twrite := func(buf *bytes.Buffer) error {\n\t\tif buf.Len() == 0 {\n\t\t\treturn nil\n\t\t}\n\n\t\tn := uint32(buf.Len())\n\t\tendOffset := vlog.writableLogOffset.Add(n)\n\t\t// Increase the file size if we cannot accommodate this entry.\n\t\t// [Aman] Should this be >= or just >? 
Doesn't make sense to extend the file if it is big enough already.\n\t\tif int(endOffset) >= len(curlf.Data) {\n\t\t\tif err := curlf.Truncate(int64(endOffset)); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\n\t\tstart := int(endOffset - n)\n\t\ty.AssertTrue(copy(curlf.Data[start:], buf.Bytes()) == int(n))\n\n\t\tcurlf.size.Store(endOffset)\n\t\treturn nil\n\t}\n\n\ttoDisk := func() error {\n\t\tif vlog.woffset() > uint32(vlog.opt.ValueLogFileSize) ||\n\t\t\tvlog.numEntriesWritten > vlog.opt.ValueLogMaxEntries {\n\t\t\tif err := curlf.doneWriting(vlog.woffset()); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\n\t\t\tnewlf, err := vlog.createVlogFile()\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tcurlf = newlf\n\t\t}\n\t\treturn nil\n\t}\n\n\tbuf := new(bytes.Buffer)\n\tfor i := range reqs {\n\t\tb := reqs[i]\n\t\tb.Ptrs = b.Ptrs[:0]\n\t\tvar written, bytesWritten int\n\t\tvalueSizes := make([]int64, 0, len(b.Entries))\n\t\tfor j := range b.Entries {\n\t\t\tbuf.Reset()\n\n\t\t\te := b.Entries[j]\n\t\t\tvalueSizes = append(valueSizes, int64(len(e.Value)))\n\t\t\tif e.skipVlogAndSetThreshold(vlog.db.valueThreshold()) {\n\t\t\t\tb.Ptrs = append(b.Ptrs, valuePointer{})\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tvar p valuePointer\n\n\t\t\tp.Fid = curlf.fid\n\t\t\tp.Offset = vlog.woffset()\n\n\t\t\t// We should not store transaction marks in the vlog file because it will never have all\n\t\t\t// the entries in a transaction. If we store entries with transaction marks then value\n\t\t\t// GC will not be able to iterate on the entire vlog file.\n\t\t\t// But, we still want the entry to stay intact for the memTable WAL. 
So, store the meta\n\t\t\t// in a temporary variable and reassign it after writing to the value log.\n\t\t\ttmpMeta := e.meta\n\t\t\te.meta = e.meta &^ (bitTxn | bitFinTxn)\n\t\t\tplen, err := curlf.encodeEntry(buf, e, p.Offset) // Now encode the entry into buffer.\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\t// Restore the meta.\n\t\t\te.meta = tmpMeta\n\n\t\t\tp.Len = uint32(plen)\n\t\t\tb.Ptrs = append(b.Ptrs, p)\n\t\t\tif err := write(buf); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\twritten++\n\t\t\tbytesWritten += buf.Len()\n\t\t\t// No need to flush anything, we write to file directly via mmap.\n\t\t}\n\t\ty.NumWritesVlogAdd(vlog.opt.MetricsEnabled, int64(written))\n\t\ty.NumBytesWrittenVlogAdd(vlog.opt.MetricsEnabled, int64(bytesWritten))\n\n\t\tvlog.numEntriesWritten += uint32(written)\n\t\tvlog.db.threshold.update(valueSizes)\n\t\t// We write to disk here so that all entries that are part of the same transaction are\n\t\t// written to the same vlog file.\n\t\tif err := toDisk(); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\treturn toDisk()\n}\n\n// Gets the logFile and acquires an RLock() for the mmap. You must call RUnlock on the file\n// (if non-nil)\nfunc (vlog *valueLog) getFileRLocked(vp valuePointer) (*logFile, error) {\n\tvlog.filesLock.RLock()\n\tdefer vlog.filesLock.RUnlock()\n\tret, ok := vlog.filesMap[vp.Fid]\n\tif !ok {\n\t\t// log file has gone away, we can't do anything. 
Return.\n\t\treturn nil, fmt.Errorf(\"file with ID: %d not found\", vp.Fid)\n\t}\n\n\t// Check for valid offset if we are reading from writable log.\n\tmaxFid := vlog.maxFid\n\t// In read-only mode we don't need to check for writable offset as we are not writing anything.\n\t// Moreover, this offset is not set in readonly mode.\n\tif !vlog.opt.ReadOnly && vp.Fid == maxFid {\n\t\tcurrentOffset := vlog.woffset()\n\t\tif vp.Offset >= currentOffset {\n\t\t\treturn nil, fmt.Errorf(\n\t\t\t\t\"Invalid value pointer offset: %d greater than current offset: %d\",\n\t\t\t\tvp.Offset, currentOffset)\n\t\t}\n\t}\n\n\tret.lock.RLock()\n\treturn ret, nil\n}\n\n// Read reads the value log at a given location.\n// TODO: Make this read private.\nfunc (vlog *valueLog) Read(vp valuePointer, _ *y.Slice) ([]byte, func(), error) {\n\tbuf, lf, err := vlog.readValueBytes(vp)\n\t// log file is locked so, decide whether to lock immediately or let the caller to\n\t// unlock it, after caller uses it.\n\tcb := vlog.getUnlockCallback(lf)\n\tif err != nil {\n\t\treturn nil, cb, err\n\t}\n\n\tif vlog.opt.VerifyValueChecksum {\n\t\thash := crc32.New(y.CastagnoliCrcTable)\n\t\tif _, err := hash.Write(buf[:len(buf)-crc32.Size]); err != nil {\n\t\t\trunCallback(cb)\n\t\t\treturn nil, nil, y.Wrapf(err, \"failed to write hash for vp %+v\", vp)\n\t\t}\n\t\t// Fetch checksum from the end of the buffer.\n\t\tchecksum := buf[len(buf)-crc32.Size:]\n\t\tif hash.Sum32() != y.BytesToU32(checksum) {\n\t\t\trunCallback(cb)\n\t\t\treturn nil, nil, y.Wrapf(y.ErrChecksumMismatch, \"value corrupted for vp: %+v\", vp)\n\t\t}\n\t}\n\tvar h header\n\theaderLen := h.Decode(buf)\n\tkv := buf[headerLen:]\n\tif lf.encryptionEnabled() {\n\t\tkv, err = lf.decryptKV(kv, vp.Offset)\n\t\tif err != nil {\n\t\t\treturn nil, cb, err\n\t\t}\n\t}\n\tif uint32(len(kv)) < h.klen+h.vlen {\n\t\tvlog.db.opt.Errorf(\"Invalid read: vp: %+v\", vp)\n\t\treturn nil, nil, fmt.Errorf(\"Invalid read: Len: %d read at:[%d:%d]\",\n\t\t\tlen(kv), 
h.klen, h.klen+h.vlen)\n\t}\n\treturn kv[h.klen : h.klen+h.vlen], cb, nil\n}\n\n// getUnlockCallback returns a function that releases the read lock on the given logfile.\n// It returns nil if the logfile is nil.\nfunc (vlog *valueLog) getUnlockCallback(lf *logFile) func() {\n\tif lf == nil {\n\t\treturn nil\n\t}\n\treturn lf.lock.RUnlock\n}\n\n// readValueBytes returns the vlog entry slice and the read-locked log file. Caller should take\n// care of logFile unlocking.\nfunc (vlog *valueLog) readValueBytes(vp valuePointer) ([]byte, *logFile, error) {\n\tlf, err := vlog.getFileRLocked(vp)\n\tif err != nil {\n\t\treturn nil, nil, err\n\t}\n\n\tbuf, err := lf.read(vp)\n\ty.NumReadsVlogAdd(vlog.db.opt.MetricsEnabled, 1)\n\ty.NumBytesReadsVlogAdd(vlog.db.opt.MetricsEnabled, int64(len(buf)))\n\treturn buf, lf, err\n}\n\nfunc (vlog *valueLog) pickLog(discardRatio float64) *logFile {\n\tvlog.filesLock.RLock()\n\tdefer vlog.filesLock.RUnlock()\n\nLOOP:\n\t// Pick a candidate that contains the largest amount of discardable data\n\tfid, discard := vlog.discardStats.MaxDiscard()\n\n\t// MaxDiscard will return fid=0 if it doesn't have any discard data. The\n\t// vlog files start from 1.\n\tif fid == 0 {\n\t\tvlog.opt.Debugf(\"No file with discard stats\")\n\t\treturn nil\n\t}\n\tlf, ok := vlog.filesMap[fid]\n\t// This file was deleted but its discard stats increased because of compactions. The file\n\t// doesn't exist so we don't need to do anything. 
Skip it and retry.\n\tif !ok {\n\t\tvlog.discardStats.Update(fid, -1)\n\t\tgoto LOOP\n\t}\n\t// We have a valid file.\n\tfi, err := lf.Fd.Stat()\n\tif err != nil {\n\t\tvlog.opt.Errorf(\"Unable to get stats for value log fid: %d err: %+v\", fid, err)\n\t\treturn nil\n\t}\n\tif thr := discardRatio * float64(fi.Size()); float64(discard) < thr {\n\t\tvlog.opt.Debugf(\"Discard: %d less than threshold: %.0f for file: %s\",\n\t\t\tdiscard, thr, fi.Name())\n\t\treturn nil\n\t}\n\tif fid < vlog.maxFid {\n\t\tvlog.opt.Infof(\"Found value log max discard fid: %d discard: %d\\n\", fid, discard)\n\t\tlf, ok := vlog.filesMap[fid]\n\t\ty.AssertTrue(ok)\n\t\treturn lf\n\t}\n\n\t// Don't randomly pick any value log file.\n\treturn nil\n}\n\nfunc discardEntry(e Entry, vs y.ValueStruct, db *DB) bool {\n\tif vs.Version != y.ParseTs(e.Key) {\n\t\t// Version not found. Discard.\n\t\treturn true\n\t}\n\tif isDeletedOrExpired(vs.Meta, vs.ExpiresAt) {\n\t\treturn true\n\t}\n\tif (vs.Meta & bitValuePointer) == 0 {\n\t\t// Key also stores the value in LSM. Discard.\n\t\treturn true\n\t}\n\tif (vs.Meta & bitFinTxn) > 0 {\n\t\t// Just a txn finish entry. 
Discard.\n\t\treturn true\n\t}\n\treturn false\n}\n\nfunc (vlog *valueLog) doRunGC(lf *logFile) error {\n\t_, span := otel.Tracer(\"\").Start(context.TODO(), \"Badger.GC\")\n\tspan.SetAttributes(attribute.String(\"GC rewrite for\", lf.path))\n\tdefer span.End()\n\tif err := vlog.rewrite(lf); err != nil {\n\t\treturn err\n\t}\n\t// Remove the file from discardStats.\n\tvlog.discardStats.Update(lf.fid, -1)\n\treturn nil\n}\n\nfunc (vlog *valueLog) waitOnGC(lc *z.Closer) {\n\tdefer lc.Done()\n\n\t<-lc.HasBeenClosed() // Wait for lc to be closed.\n\n\t// Block any GC in progress to finish, and don't allow any more writes to runGC by filling up\n\t// the channel of size 1.\n\tvlog.garbageCh <- struct{}{}\n}\n\nfunc (vlog *valueLog) runGC(discardRatio float64) error {\n\tselect {\n\tcase vlog.garbageCh <- struct{}{}:\n\t\t// Pick a log file for GC.\n\t\tdefer func() {\n\t\t\t<-vlog.garbageCh\n\t\t}()\n\n\t\tlf := vlog.pickLog(discardRatio)\n\t\tif lf == nil {\n\t\t\treturn ErrNoRewrite\n\t\t}\n\t\treturn vlog.doRunGC(lf)\n\tdefault:\n\t\treturn ErrRejected\n\t}\n}\n\nfunc (vlog *valueLog) updateDiscardStats(stats map[uint32]int64) {\n\tif vlog.opt.InMemory {\n\t\treturn\n\t}\n\tfor fid, discard := range stats {\n\t\tvlog.discardStats.Update(fid, discard)\n\t}\n\t// The following is to coordinate with some test cases where we want to\n\t// verify that at least one iteration of updateDiscardStats has been completed.\n\tvlog.db.logToSyncChan(updateDiscardStatsMsg)\n}\n\ntype vlogThreshold struct {\n\tlogger         Logger\n\tpercentile     float64\n\tvalueThreshold atomic.Int64\n\tvalueCh        chan []int64\n\tclearCh        chan bool\n\tcloser         *z.Closer\n\t// Metrics contains a running log of statistics like amount of data stored etc.\n\tvlMetrics *z.HistogramData\n}\n\nfunc initVlogThreshold(opt *Options) *vlogThreshold {\n\tgetBounds := func() []float64 {\n\t\tmxbd := opt.maxValueThreshold\n\t\tmnbd := float64(opt.ValueThreshold)\n\t\ty.AssertTruef(mxbd >= mnbd, 
\"maximum threshold bound is less than the min threshold\")\n\t\tsize := math.Min(mxbd-mnbd+1, 1024.0)\n\t\tbdstp := (mxbd - mnbd) / size\n\t\tbounds := make([]float64, int64(size))\n\t\tfor i := range bounds {\n\t\t\tif i == 0 {\n\t\t\t\tbounds[0] = mnbd\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tif i == int(size-1) {\n\t\t\t\tbounds[i] = mxbd\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tbounds[i] = bounds[i-1] + bdstp\n\t\t}\n\t\treturn bounds\n\t}\n\tlt := &vlogThreshold{\n\t\tlogger:     opt.Logger,\n\t\tpercentile: opt.VLogPercentile,\n\t\tvalueCh:    make(chan []int64, 1000),\n\t\tclearCh:    make(chan bool, 1),\n\t\tcloser:     z.NewCloser(1),\n\t\tvlMetrics:  z.NewHistogramData(getBounds()),\n\t}\n\tlt.valueThreshold.Store(opt.ValueThreshold)\n\treturn lt\n}\n\nfunc (v *vlogThreshold) Clear(opt Options) {\n\tv.valueThreshold.Store(opt.ValueThreshold)\n\tv.clearCh <- true\n}\n\nfunc (v *vlogThreshold) update(sizes []int64) {\n\tv.valueCh <- sizes\n}\n\nfunc (v *vlogThreshold) close() {\n\tv.closer.SignalAndWait()\n}\n\nfunc (v *vlogThreshold) listenForValueThresholdUpdate() {\n\tdefer v.closer.Done()\n\tfor {\n\t\tselect {\n\t\tcase <-v.closer.HasBeenClosed():\n\t\t\treturn\n\t\tcase val := <-v.valueCh:\n\t\t\tfor _, e := range val {\n\t\t\t\tv.vlMetrics.Update(e)\n\t\t\t}\n\t\t\t// we are making it to get Options.VlogPercentile so that values with sizes\n\t\t\t// in range of Options.VlogPercentile will make it to the LSM tree and rest to the\n\t\t\t// value log file.\n\t\t\tp := int64(v.vlMetrics.Percentile(v.percentile))\n\t\t\tif v.valueThreshold.Load() != p {\n\t\t\t\tif v.logger != nil {\n\t\t\t\t\tv.logger.Infof(\"updating value of threshold to: %d\", p)\n\t\t\t\t}\n\t\t\t\tv.valueThreshold.Store(p)\n\t\t\t}\n\t\tcase <-v.clearCh:\n\t\t\tv.vlMetrics.Clear()\n\t\t}\n\t}\n}\n"
  },
  {
    "path": "value_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"bytes\"\n\t\"errors\"\n\t\"fmt\"\n\t\"math\"\n\t\"math/rand\"\n\t\"os\"\n\t\"reflect\"\n\t\"sync\"\n\t\"testing\"\n\n\thumanize \"github.com/dustin/go-humanize\"\n\t\"github.com/stretchr/testify/require\"\n\n\t\"github.com/dgraph-io/badger/v4/y\"\n)\n\nfunc TestDynamicValueThreshold(t *testing.T) {\n\tt.Skip()\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\ty.Check(err)\n\tdefer removeDir(dir)\n\tkv, _ := Open(getTestOptions(dir).WithValueThreshold(32).WithVLogPercentile(0.99))\n\tdefer kv.Close()\n\tlog := &kv.vlog\n\tfor vl := 32; vl <= 1024; vl = vl + 4 {\n\t\tfor i := 0; i < 1000; i++ {\n\t\t\tval := make([]byte, vl)\n\t\t\ty.Check2(rand.Read(val))\n\t\t\te1 := &Entry{\n\t\t\t\tKey:   []byte(fmt.Sprintf(\"samplekey_%d_%d\", vl, i)),\n\t\t\t\tValue: val,\n\t\t\t\tmeta:  bitValuePointer,\n\t\t\t}\n\t\t\tb := new(request)\n\t\t\tb.Entries = []*Entry{e1}\n\t\t\trequire.NoError(t, log.write([]*request{b}))\n\t\t}\n\t\tt.Logf(\"value threshold is %d \\n\", log.db.valueThreshold())\n\t}\n\n\tfor vl := 511; vl >= 31; vl = vl - 4 {\n\t\tfor i := 0; i < 5000; i++ {\n\t\t\tval := make([]byte, vl)\n\t\t\ty.Check2(rand.Read(val))\n\t\t\te1 := &Entry{\n\t\t\t\tKey:   []byte(fmt.Sprintf(\"samplekey_%d_%d\", vl, i)),\n\t\t\t\tValue: val,\n\t\t\t\tmeta:  bitValuePointer,\n\t\t\t}\n\t\t\tb := new(request)\n\t\t\tb.Entries = []*Entry{e1}\n\t\t\trequire.NoError(t, log.write([]*request{b}))\n\t\t}\n\t\tt.Logf(\"value threshold is %d \\n\", log.db.valueThreshold())\n\t}\n\trequire.Equal(t, log.db.valueThreshold(), int64(995))\n}\n\nfunc TestValueBasic(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\ty.Check(err)\n\tdefer removeDir(dir)\n\n\tkv, _ := Open(getTestOptions(dir).WithValueThreshold(32))\n\tdefer kv.Close()\n\tlog := &kv.vlog\n\n\t// Use value big enough that the value log writes them even 
if SyncWrites is false.\n\tconst val1 = \"sampleval012345678901234567890123\"\n\tconst val2 = \"samplevalb012345678901234567890123\"\n\trequire.True(t, int64(len(val1)) >= kv.vlog.db.valueThreshold())\n\n\te1 := &Entry{\n\t\tKey:   []byte(\"samplekey\"),\n\t\tValue: []byte(val1),\n\t\tmeta:  bitValuePointer,\n\t}\n\te2 := &Entry{\n\t\tKey:   []byte(\"samplekeyb\"),\n\t\tValue: []byte(val2),\n\t\tmeta:  bitValuePointer,\n\t}\n\n\tb := new(request)\n\tb.Entries = []*Entry{e1, e2}\n\n\trequire.NoError(t, log.write([]*request{b}))\n\trequire.Len(t, b.Ptrs, 2)\n\tt.Logf(\"Pointer written: %+v %+v\\n\", b.Ptrs[0], b.Ptrs[1])\n\n\tbuf1, lf1, err1 := log.readValueBytes(b.Ptrs[0])\n\tbuf2, lf2, err2 := log.readValueBytes(b.Ptrs[1])\n\trequire.NoError(t, err1)\n\trequire.NoError(t, err2)\n\tdefer runCallback(log.getUnlockCallback(lf1))\n\tdefer runCallback(log.getUnlockCallback(lf2))\n\te1, err = lf1.decodeEntry(buf1, b.Ptrs[0].Offset)\n\trequire.NoError(t, err)\n\te2, err = lf1.decodeEntry(buf2, b.Ptrs[1].Offset)\n\trequire.NoError(t, err)\n\treadEntries := []Entry{*e1, *e2}\n\trequire.EqualValues(t, []Entry{\n\t\t{\n\t\t\tKey:    []byte(\"samplekey\"),\n\t\t\tValue:  []byte(val1),\n\t\t\tmeta:   bitValuePointer,\n\t\t\toffset: b.Ptrs[0].Offset,\n\t\t},\n\t\t{\n\t\t\tKey:    []byte(\"samplekeyb\"),\n\t\t\tValue:  []byte(val2),\n\t\t\tmeta:   bitValuePointer,\n\t\t\toffset: b.Ptrs[1].Offset,\n\t\t},\n\t}, readEntries)\n\n}\n\nfunc TestValueGCManaged(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\tN := 10000\n\n\topt := getTestOptions(dir)\n\topt.ValueLogMaxEntries = uint32(N / 10)\n\topt.managedTxns = true\n\topt.BaseTableSize = 1 << 15\n\topt.ValueThreshold = 1 << 10\n\topt.MemTableSize = 1 << 15\n\n\tdb, err := Open(opt)\n\trequire.NoError(t, err)\n\tdefer db.Close()\n\n\tvar ts uint64\n\tnewTs := func() uint64 {\n\t\tts++\n\t\treturn ts\n\t}\n\n\tsz := 64 << 10\n\tvar wg sync.WaitGroup\n\tfor i := 
0; i < N; i++ {\n\t\tv := make([]byte, sz)\n\t\trand.Read(v[:rand.Intn(sz)])\n\n\t\twg.Add(1)\n\t\ttxn := db.NewTransactionAt(newTs(), true)\n\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), v)))\n\t\trequire.NoError(t, txn.CommitAt(newTs(), func(err error) {\n\t\t\twg.Done()\n\t\t\trequire.NoError(t, err)\n\t\t}))\n\t}\n\n\tfor i := 0; i < N; i++ {\n\t\twg.Add(1)\n\t\ttxn := db.NewTransactionAt(newTs(), true)\n\t\trequire.NoError(t, txn.Delete([]byte(fmt.Sprintf(\"key%d\", i))))\n\t\trequire.NoError(t, txn.CommitAt(newTs(), func(err error) {\n\t\t\twg.Done()\n\t\t\trequire.NoError(t, err)\n\t\t}))\n\t}\n\twg.Wait()\n\tentries, err := os.ReadDir(dir)\n\trequire.NoError(t, err)\n\tfor _, e := range entries {\n\t\tfi, err := e.Info()\n\t\trequire.NoError(t, err)\n\t\tt.Logf(\"File: %s. Size: %s\\n\", fi.Name(), humanize.IBytes(uint64(fi.Size())))\n\t}\n\n\tdb.SetDiscardTs(math.MaxUint32)\n\trequire.NoError(t, db.Flatten(3))\n\n\tfor i := 0; i < 100; i++ {\n\t\t// Try at max 100 times to GC even a single value log file.\n\t\tif err := db.RunValueLogGC(0.0001); err == nil {\n\t\t\treturn // Done\n\t\t}\n\t}\n\trequire.Fail(t, \"Unable to GC even a single value log file.\")\n}\n\nfunc TestValueGC(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topt := getTestOptions(dir)\n\topt.ValueLogFileSize = 1 << 20\n\topt.BaseTableSize = 1 << 15\n\topt.ValueThreshold = 1 << 10\n\n\tkv, _ := Open(opt)\n\tdefer kv.Close()\n\n\tsz := 32 << 10\n\ttxn := kv.NewTransaction(true)\n\tfor i := 0; i < 100; i++ {\n\t\tv := make([]byte, sz)\n\t\trand.Read(v[:rand.Intn(sz)])\n\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), v)))\n\t\tif i%20 == 0 {\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t\ttxn = kv.NewTransaction(true)\n\t\t}\n\t}\n\trequire.NoError(t, txn.Commit())\n\n\tfor i := 0; i < 45; i++ {\n\t\ttxnDelete(t, kv, []byte(fmt.Sprintf(\"key%d\", 
i)))\n\t}\n\n\tkv.vlog.filesLock.RLock()\n\tlf := kv.vlog.filesMap[kv.vlog.sortedFids()[0]]\n\tkv.vlog.filesLock.RUnlock()\n\n\t//\tlf.iterate(0, func(e Entry) bool {\n\t//\t\te.print(\"lf\")\n\t//\t\treturn true\n\t//\t})\n\n\trequire.NoError(t, kv.vlog.rewrite(lf))\n\tfor i := 45; i < 100; i++ {\n\t\tkey := []byte(fmt.Sprintf(\"key%d\", i))\n\n\t\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get(key)\n\t\t\trequire.NoError(t, err)\n\t\t\tval := getItemValue(t, item)\n\t\t\trequire.NotNil(t, val)\n\t\t\trequire.True(t, len(val) == sz, \"Size found: %d\", len(val))\n\t\t\treturn nil\n\t\t}))\n\t}\n}\n\nfunc TestValueGC2(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topt := getTestOptions(dir)\n\topt.ValueLogFileSize = 1 << 20\n\topt.BaseTableSize = 1 << 15\n\topt.ValueThreshold = 1 << 10\n\n\tkv, _ := Open(opt)\n\tdefer kv.Close()\n\n\tsz := 32 << 10\n\ttxn := kv.NewTransaction(true)\n\tfor i := 0; i < 100; i++ {\n\t\tv := make([]byte, sz)\n\t\trand.Read(v[:rand.Intn(sz)])\n\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), v)))\n\t\tif i%20 == 0 {\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t\ttxn = kv.NewTransaction(true)\n\t\t}\n\t}\n\trequire.NoError(t, txn.Commit())\n\n\tfor i := 0; i < 5; i++ {\n\t\ttxnDelete(t, kv, []byte(fmt.Sprintf(\"key%d\", i)))\n\t}\n\n\tfor i := 5; i < 10; i++ {\n\t\tv := []byte(fmt.Sprintf(\"value%d\", i))\n\t\ttxnSet(t, kv, []byte(fmt.Sprintf(\"key%d\", i)), v, 0)\n\t}\n\n\tkv.vlog.filesLock.RLock()\n\tlf := kv.vlog.filesMap[kv.vlog.sortedFids()[0]]\n\tkv.vlog.filesLock.RUnlock()\n\n\t//\tlf.iterate(0, func(e Entry) bool {\n\t//\t\te.print(\"lf\")\n\t//\t\treturn true\n\t//\t})\n\n\trequire.NoError(t, kv.vlog.rewrite(lf))\n\tfor i := 0; i < 5; i++ {\n\t\tkey := []byte(fmt.Sprintf(\"key%d\", i))\n\t\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\t\t_, err := txn.Get(key)\n\t\t\trequire.Equal(t, 
ErrKeyNotFound, err)\n\t\t\treturn nil\n\t\t}))\n\t}\n\tfor i := 5; i < 10; i++ {\n\t\tkey := []byte(fmt.Sprintf(\"key%d\", i))\n\t\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get(key)\n\t\t\trequire.NoError(t, err)\n\t\t\tval := getItemValue(t, item)\n\t\t\trequire.NotNil(t, val)\n\t\t\trequire.Equal(t, string(val), fmt.Sprintf(\"value%d\", i))\n\t\t\treturn nil\n\t\t}))\n\t}\n\t// Moved entries.\n\tfor i := 10; i < 100; i++ {\n\t\tkey := []byte(fmt.Sprintf(\"key%d\", i))\n\t\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get(key)\n\t\t\trequire.NoError(t, err)\n\t\t\tval := getItemValue(t, item)\n\t\t\trequire.NotNil(t, val)\n\t\t\trequire.True(t, len(val) == sz, \"Size found: %d\", len(val))\n\t\t\treturn nil\n\t\t}))\n\t}\n}\n\nfunc TestValueGC3(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topt := getTestOptions(dir)\n\topt.ValueLogFileSize = 1 << 20\n\topt.BaseTableSize = 1 << 15\n\topt.ValueThreshold = 1 << 10\n\n\tkv, err := Open(opt)\n\trequire.NoError(t, err)\n\tdefer kv.Close()\n\n\t// We want to test whether an iterator can continue through a value log GC.\n\n\tvalueSize := 32 << 10\n\n\tvar value3 []byte\n\ttxn := kv.NewTransaction(true)\n\tfor i := 0; i < 100; i++ {\n\t\tv := make([]byte, valueSize) // 32K * 100 will take >=3'276'800 B.\n\t\tif i == 3 {\n\t\t\tvalue3 = v\n\t\t}\n\t\trand.Read(v[:])\n\t\t// Keys key000, key001, key002, such that sorted order matches insertion order\n\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(fmt.Sprintf(\"key%03d\", i)), v)))\n\t\tif i%20 == 0 {\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t\ttxn = kv.NewTransaction(true)\n\t\t}\n\t}\n\trequire.NoError(t, txn.Commit())\n\n\t// Start an iterator to keys in the first value log file\n\titOpt := IteratorOptions{\n\t\tPrefetchValues: false,\n\t\tPrefetchSize:   0,\n\t\tReverse:        false,\n\t}\n\n\ttxn = 
kv.NewTransaction(true)\n\tit := txn.NewIterator(itOpt)\n\tdefer it.Close()\n\t// Walk a few keys\n\tit.Rewind()\n\trequire.True(t, it.Valid())\n\titem := it.Item()\n\trequire.Equal(t, []byte(\"key000\"), item.Key())\n\tit.Next()\n\trequire.True(t, it.Valid())\n\titem = it.Item()\n\trequire.Equal(t, []byte(\"key001\"), item.Key())\n\tit.Next()\n\trequire.True(t, it.Valid())\n\titem = it.Item()\n\trequire.Equal(t, []byte(\"key002\"), item.Key())\n\n\t// Like other tests, we pull out a logFile to rewrite it directly\n\n\tkv.vlog.filesLock.RLock()\n\tlogFile := kv.vlog.filesMap[kv.vlog.sortedFids()[0]]\n\tkv.vlog.filesLock.RUnlock()\n\n\trequire.NoError(t, kv.vlog.rewrite(logFile))\n\tit.Next()\n\trequire.True(t, it.Valid())\n\titem = it.Item()\n\trequire.Equal(t, []byte(\"key003\"), item.Key())\n\n\tv3, err := item.ValueCopy(nil)\n\trequire.NoError(t, err)\n\trequire.Equal(t, value3, v3)\n}\n\nfunc TestValueGC4(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topt := getTestOptions(dir)\n\topt.ValueLogFileSize = 1 << 20\n\topt.BaseTableSize = 1 << 15\n\topt.ValueThreshold = 1 << 10\n\n\tkv, err := Open(opt)\n\trequire.NoError(t, err)\n\n\tsz := 128 << 10 // 5 entries per value log file.\n\ttxn := kv.NewTransaction(true)\n\tfor i := 0; i < 24; i++ {\n\t\tv := make([]byte, sz)\n\t\trand.Read(v[:rand.Intn(sz)])\n\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), v)))\n\t\tif i%3 == 0 {\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t\ttxn = kv.NewTransaction(true)\n\t\t}\n\t}\n\trequire.NoError(t, txn.Commit())\n\n\tfor i := 0; i < 8; i++ {\n\t\ttxnDelete(t, kv, []byte(fmt.Sprintf(\"key%d\", i)))\n\t}\n\n\tfor i := 8; i < 16; i++ {\n\t\tv := []byte(fmt.Sprintf(\"value%d\", i))\n\t\ttxnSet(t, kv, []byte(fmt.Sprintf(\"key%d\", i)), v, 0)\n\t}\n\n\tkv.vlog.filesLock.RLock()\n\tlf0 := kv.vlog.filesMap[kv.vlog.sortedFids()[0]]\n\tlf1 := 
kv.vlog.filesMap[kv.vlog.sortedFids()[1]]\n\tkv.vlog.filesLock.RUnlock()\n\n\t//\tlf.iterate(0, func(e Entry) bool {\n\t//\t\te.print(\"lf\")\n\t//\t\treturn true\n\t//\t})\n\n\trequire.NoError(t, kv.vlog.rewrite(lf0))\n\trequire.NoError(t, kv.vlog.rewrite(lf1))\n\n\trequire.NoError(t, kv.Close())\n\n\tkv, err = Open(opt)\n\trequire.NoError(t, err)\n\n\tfor i := 0; i < 8; i++ {\n\t\tkey := []byte(fmt.Sprintf(\"key%d\", i))\n\t\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\t\t_, err := txn.Get(key)\n\t\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\t\treturn nil\n\t\t}))\n\t}\n\tfor i := 8; i < 16; i++ {\n\t\tkey := []byte(fmt.Sprintf(\"key%d\", i))\n\t\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\t\titem, err := txn.Get(key)\n\t\t\trequire.NoError(t, err)\n\t\t\tval := getItemValue(t, item)\n\t\t\trequire.NotNil(t, val)\n\t\t\trequire.Equal(t, string(val), fmt.Sprintf(\"value%d\", i))\n\t\t\treturn nil\n\t\t}))\n\t}\n\trequire.NoError(t, kv.Close())\n}\n\nfunc TestPersistLFDiscardStats(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topt := getTestOptions(dir)\n\t// Force more compaction by reducing the number of L0 tables.\n\topt.NumLevelZeroTables = 1\n\topt.ValueLogFileSize = 1 << 20\n\t// Avoid compaction on close so that the discard map remains the same.\n\topt.CompactL0OnClose = false\n\topt.MemTableSize = 1 << 15\n\topt.ValueThreshold = 1 << 10\n\ttChan := make(chan string, 100)\n\tdefer close(tChan)\n\topt.syncChan = tChan\n\n\tdb, err := Open(opt)\n\trequire.NoError(t, err)\n\tcapturedDiscardStats := make(map[uint64]uint64)\n\tdb.onCloseDiscardCapture = capturedDiscardStats\n\n\tsz := 128 << 10 // 5 entries per value log file.\n\tv := make([]byte, sz)\n\trand.Read(v[:rand.Intn(sz)])\n\ttxn := db.NewTransaction(true)\n\tfor i := 0; i < 500; i++ {\n\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), v)))\n\t\tif i%3 == 0 
{\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t\ttxn = db.NewTransaction(true)\n\t\t}\n\t}\n\trequire.NoError(t, txn.Commit(), \"error while committing txn\")\n\n\tfor i := 0; i < 500; i++ {\n\t\t// use Entry.WithDiscard() to delete entries, because this causes data to be flushed on\n\t\t// disk, creating SSTs. Simple Delete was having data in Memtables only.\n\t\terr = db.Update(func(txn *Txn) error {\n\t\t\treturn txn.SetEntry(NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), v).WithDiscard())\n\t\t})\n\t\trequire.NoError(t, err)\n\t}\n\n\t// Wait for invocation of updateDiscardStats at least once -- timeout after 60 seconds.\n\twaitForMessage(tChan, updateDiscardStatsMsg, 1, 60, t)\n\n\tdb.vlog.discardStats.Lock()\n\trequire.True(t, db.vlog.discardStats.Len() > 1, \"some discardStats should be generated\")\n\n\tdb.vlog.discardStats.Unlock()\n\trequire.NoError(t, db.Close())\n\n\t// Avoid running compactors on reopening badger.\n\topt.NumCompactors = 0\n\tdb, err = Open(opt)\n\trequire.NoError(t, err)\n\tdefer db.Close()\n\twaitForMessage(tChan, endVLogInitMsg, 1, 60, t)\n\tdb.vlog.discardStats.Lock()\n\tstatsMap := make(map[uint64]uint64)\n\tdb.vlog.discardStats.Iterate(func(fid, val uint64) {\n\t\tstatsMap[fid] = val\n\t})\n\trequire.Truef(t, reflect.DeepEqual(capturedDiscardStats, statsMap),\n\t\t\"Discard maps are not equal. 
On Close: %+v, After Reopen: %+v\",\n\t\tcapturedDiscardStats, statsMap)\n\tdb.vlog.discardStats.Unlock()\n}\n\nfunc TestValueChecksums(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\t// Set up SST with K1=V1\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 100 * 1024 * 1024 // 100Mb\n\topts.VerifyValueChecksum = true\n\tkv, err := Open(opts)\n\trequire.NoError(t, err)\n\trequire.NoError(t, kv.Close())\n\n\tvar (\n\t\tk0 = []byte(\"k0\")\n\t\tk1 = []byte(\"k1\")\n\t\tk2 = []byte(\"k2\")\n\t\tk3 = []byte(\"k3\")\n\t\tv0 = []byte(\"value0-012345678901234567890123012345678901234567890123\")\n\t\tv1 = []byte(\"value1-012345678901234567890123012345678901234567890123\")\n\t\tv2 = []byte(\"value2-012345678901234567890123012345678901234567890123\")\n\t\tv3 = []byte(\"value3-012345678901234567890123012345678901234567890123\")\n\t)\n\n\t// Use a vlog with K0=V0 and a (corrupted) second transaction(k1,k2)\n\tbuf, offset := createMemFile(t, []*Entry{\n\t\t{Key: k0, Value: v0},\n\t\t{Key: k1, Value: v1},\n\t\t{Key: k2, Value: v2},\n\t})\n\tbuf[offset-1]++ // Corrupt last byte\n\trequire.NoError(t, os.WriteFile(kv.mtFilePath(1), buf, 0777))\n\n\t// K1 should exist, but K2 shouldn't.\n\tkv, err = Open(opts)\n\trequire.NoError(t, err)\n\n\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\t// Replay should have added K0.\n\t\titem, err := txn.Get(k0)\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, getItemValue(t, item), v0)\n\n\t\t_, err = txn.Get(k1)\n\t\trequire.Equal(t, ErrKeyNotFound, err)\n\n\t\t_, err = txn.Get(k2)\n\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\treturn nil\n\t}))\n\n\t// Write K3 at the end of the vlog.\n\ttxnSet(t, kv, k3, v3, 0)\n\trequire.NoError(t, kv.Close())\n\n\t// The DB should contain K0 and K3 (K1 and k2 was lost when Badger started up\n\t// last due to checksum failure).\n\tkv, err = Open(opts)\n\trequire.NoError(t, err)\n\n\t{\n\t\ttxn := 
kv.NewTransaction(false)\n\n\t\titer := txn.NewIterator(DefaultIteratorOptions)\n\t\titer.Seek(k0)\n\t\trequire.True(t, iter.Valid())\n\t\tit := iter.Item()\n\t\trequire.Equal(t, it.Key(), k0)\n\t\trequire.Equal(t, getItemValue(t, it), v0)\n\t\titer.Next()\n\t\trequire.True(t, iter.Valid())\n\t\tit = iter.Item()\n\t\trequire.Equal(t, it.Key(), k3)\n\t\trequire.Equal(t, getItemValue(t, it), v3)\n\n\t\titer.Close()\n\t\ttxn.Discard()\n\t}\n\n\trequire.NoError(t, kv.Close())\n}\n\n// TODO: Do we need this test?\nfunc TestPartialAppendToWAL(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\t// Create skeleton files.\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 100 * 1024 * 1024 // 100Mb\n\topts.ValueThreshold = 32\n\tkv, err := Open(opts)\n\trequire.NoError(t, err)\n\trequire.NoError(t, kv.Close())\n\n\tvar (\n\t\tk0 = []byte(\"k0\")\n\t\tk1 = []byte(\"k1\")\n\t\tk2 = []byte(\"k2\")\n\t\tk3 = []byte(\"k3\")\n\t\tv0 = []byte(\"value0-01234567890123456789012012345678901234567890123\")\n\t\tv1 = []byte(\"value1-01234567890123456789012012345678901234567890123\")\n\t\tv2 = []byte(\"value2-01234567890123456789012012345678901234567890123\")\n\t\tv3 = []byte(\"value3-01234567890123456789012012345678901234567890123\")\n\t)\n\t// Values need to be long enough to actually get written to value log.\n\trequire.True(t, int64(len(v3)) >= kv.vlog.db.valueThreshold())\n\n\t// Create truncated vlog to simulate a partial append.\n\t// k0 - single transaction, k1 and k2 in another transaction\n\tbuf, offset := createMemFile(t, []*Entry{\n\t\t{Key: k0, Value: v0},\n\t\t{Key: k1, Value: v1},\n\t\t{Key: k2, Value: v2},\n\t})\n\tbuf = buf[:offset-6]\n\trequire.NoError(t, os.WriteFile(kv.mtFilePath(1), buf, 0777))\n\n\t// Badger should now start up\n\tkv, err = Open(opts)\n\trequire.NoError(t, err)\n\n\trequire.NoError(t, kv.View(func(txn *Txn) error {\n\t\titem, err := txn.Get(k0)\n\t\trequire.NoError(t, 
err)\n\t\trequire.Equal(t, v0, getItemValue(t, item))\n\n\t\t_, err = txn.Get(k1)\n\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\t_, err = txn.Get(k2)\n\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\treturn nil\n\t}))\n\n\t// When K3 is set, it should be persisted after a restart.\n\ttxnSet(t, kv, k3, v3, 0)\n\trequire.NoError(t, kv.Close())\n\tkv, err = Open(opts)\n\trequire.NoError(t, err)\n\tcheckKeys(t, kv, [][]byte{k3})\n\t// Replay value log from beginning, badger head is past k2.\n\trequire.NoError(t, kv.vlog.Close())\n}\n\nfunc TestReadOnlyOpenWithPartialAppendToWAL(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\t// Create skeleton files.\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 100 * 1024 * 1024 // 100Mb\n\tkv, err := Open(opts)\n\trequire.NoError(t, err)\n\trequire.NoError(t, kv.Close())\n\n\tvar (\n\t\tk0 = []byte(\"k0\")\n\t\tk1 = []byte(\"k1\")\n\t\tk2 = []byte(\"k2\")\n\t\tv0 = []byte(\"value0-012345678901234567890123\")\n\t\tv1 = []byte(\"value1-012345678901234567890123\")\n\t\tv2 = []byte(\"value2-012345678901234567890123\")\n\t)\n\n\t// Create truncated vlog to simulate a partial append.\n\t// k0 - single transaction, k1 and k2 in another transaction\n\tbuf, offset := createMemFile(t, []*Entry{\n\t\t{Key: k0, Value: v0},\n\t\t{Key: k1, Value: v1},\n\t\t{Key: k2, Value: v2},\n\t})\n\tbuf = buf[:offset-6]\n\trequire.NoError(t, os.WriteFile(kv.mtFilePath(1), buf, 0777))\n\n\topts.ReadOnly = true\n\t// Badger should fail a read-only open with values to replay\n\t_, err = Open(opts)\n\trequire.Error(t, err)\n\trequire.Regexp(t, \"Log truncate required\", err.Error())\n}\n\nfunc TestValueLogTrigger(t *testing.T) {\n\tt.Skip(\"Difficult to trigger compaction, so skipping. 
Re-enable after fixing #226\")\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topt := getTestOptions(dir)\n\topt.ValueLogFileSize = 1 << 20\n\tkv, err := Open(opt)\n\trequire.NoError(t, err)\n\n\t// Write a lot of data, so it creates some work for value log GC.\n\tsz := 32 << 10\n\ttxn := kv.NewTransaction(true)\n\tfor i := 0; i < 100; i++ {\n\t\tv := make([]byte, sz)\n\t\trand.Read(v[:rand.Intn(sz)])\n\t\trequire.NoError(t, txn.SetEntry(NewEntry([]byte(fmt.Sprintf(\"key%d\", i)), v)))\n\t\tif i%20 == 0 {\n\t\t\trequire.NoError(t, txn.Commit())\n\t\t\ttxn = kv.NewTransaction(true)\n\t\t}\n\t}\n\trequire.NoError(t, txn.Commit())\n\n\tfor i := 0; i < 45; i++ {\n\t\ttxnDelete(t, kv, []byte(fmt.Sprintf(\"key%d\", i)))\n\t}\n\n\trequire.NoError(t, kv.RunValueLogGC(0.5))\n\n\trequire.NoError(t, kv.Close())\n\n\terr = kv.RunValueLogGC(0.5)\n\trequire.Equal(t, ErrRejected, err, \"Error should be returned after closing DB.\")\n}\n\n// createMemFile creates a new memFile and returns the last valid offset.\nfunc createMemFile(t *testing.T, entries []*Entry) ([]byte, uint32) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topts := getTestOptions(dir)\n\topts.ValueLogFileSize = 100 * 1024 * 1024 // 100Mb\n\tkv, err := Open(opts)\n\trequire.NoError(t, err)\n\tdefer kv.Close()\n\n\ttxnSet(t, kv, entries[0].Key, entries[0].Value, entries[0].meta)\n\n\tentries = entries[1:]\n\ttxn := kv.NewTransaction(true)\n\tfor _, entry := range entries {\n\t\trequire.NoError(t, txn.SetEntry(NewEntry(entry.Key, entry.Value).WithMeta(entry.meta)))\n\t}\n\trequire.NoError(t, txn.Commit())\n\n\tfilename := kv.mtFilePath(1)\n\tbuf, err := os.ReadFile(filename)\n\trequire.NoError(t, err)\n\treturn buf, kv.mt.wal.writeAt\n}\n\n// This test creates two mem files and corrupts the last bit of the first file.\nfunc TestPenultimateMemCorruption(t *testing.T) {\n\tdir, err := 
os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\topt := getTestOptions(dir)\n\n\tdb0, err := Open(opt)\n\trequire.NoError(t, err)\n\tdefer func() { require.NoError(t, db0.Close()) }()\n\n\th := testHelper{db: db0, t: t}\n\th.writeRange(0, 2) // 00001.mem\n\n\t// Move the current memtable to the db.imm and create a new memtable so\n\t// that we can have more than one mem files.\n\trequire.Zero(t, len(db0.imm))\n\tdb0.imm = append(db0.imm, db0.mt)\n\tdb0.mt, err = db0.newMemTable()\n\trequire.NoError(t, err)\n\n\th.writeRange(3, 7) // 00002.mem\n\n\t// Verify we have all the data we wrote.\n\th.readRange(0, 7)\n\n\tfor i := 2; i >= 1; i-- {\n\t\tfpath := db0.mtFilePath(i)\n\t\tfi, err := os.Stat(fpath)\n\t\trequire.NoError(t, err)\n\t\trequire.True(t, fi.Size() > 0, \"Empty file at log=%d\", i)\n\t\tif i == 1 {\n\t\t\t// This should corrupt the last entry in the first memtable (that is entry number 2)\n\t\t\twal := db0.imm[0].wal\n\t\t\t_, err = wal.Fd.WriteAt([]byte{0}, int64(wal.writeAt-1))\n\t\t\trequire.NoError(t, err)\n\t\t\t// We have corrupted the file. We can remove it. 
If we don't remove\n\t\t\t// the imm here, the db.close in defer will crash since db0.mt !=\n\t\t\t// db0.imm[0]\n\t\t\tdb0.imm = db0.imm[:0]\n\t\t}\n\t}\n\t// Simulate a crash by not closing db0, but releasing the locks.\n\tif db0.dirLockGuard != nil {\n\t\trequire.NoError(t, db0.dirLockGuard.release())\n\t\tdb0.dirLockGuard = nil\n\t}\n\tif db0.valueDirGuard != nil {\n\t\trequire.NoError(t, db0.valueDirGuard.release())\n\t\tdb0.valueDirGuard = nil\n\t}\n\n\tdb1, err := Open(opt)\n\trequire.NoError(t, err)\n\th.db = db1\n\t// Only 2 should be gone because it is at the end of 00001.mem (first memfile).\n\th.readRange(0, 1)\n\th.readRange(3, 7)\n\terr = db1.View(func(txn *Txn) error {\n\t\t_, err := txn.Get(h.key(2)) // Verify that 2 is gone.\n\t\trequire.Equal(t, ErrKeyNotFound, err)\n\t\treturn nil\n\t})\n\trequire.NoError(t, err)\n\trequire.NoError(t, db1.Close())\n}\n\nfunc checkKeys(t *testing.T, kv *DB, keys [][]byte) {\n\ti := 0\n\ttxn := kv.NewTransaction(false)\n\tdefer txn.Discard()\n\titer := txn.NewIterator(IteratorOptions{})\n\tdefer iter.Close()\n\tfor iter.Seek(keys[0]); iter.Valid(); iter.Next() {\n\t\trequire.Equal(t, iter.Item().Key(), keys[i])\n\t\ti++\n\t}\n\trequire.Equal(t, i, len(keys))\n}\n\ntype testHelper struct {\n\tdb  *DB\n\tt   *testing.T\n\tval []byte\n}\n\nfunc (th *testHelper) key(i int) []byte {\n\treturn []byte(fmt.Sprintf(\"%010d\", i))\n}\nfunc (th *testHelper) value() []byte {\n\tif len(th.val) > 0 {\n\t\treturn th.val\n\t}\n\tth.val = make([]byte, 100)\n\ty.Check2(rand.Read(th.val))\n\treturn th.val\n}\n\n// writeRange [from, to].\nfunc (th *testHelper) writeRange(from, to int) {\n\tfor i := from; i <= to; i++ {\n\t\terr := th.db.Update(func(txn *Txn) error {\n\t\t\treturn txn.SetEntry(NewEntry(th.key(i), th.value()))\n\t\t})\n\t\trequire.NoError(th.t, err)\n\t}\n}\n\nfunc (th *testHelper) readRange(from, to int) {\n\tfor i := from; i <= to; i++ {\n\t\terr := th.db.View(func(txn *Txn) error {\n\t\t\titem, err := 
txn.Get(th.key(i))\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\treturn item.Value(func(val []byte) error {\n\t\t\t\trequire.Equal(th.t, val, th.value(), \"key=%q\", th.key(i))\n\t\t\t\treturn nil\n\n\t\t\t})\n\t\t})\n\t\trequire.NoError(th.t, err, \"key=%q\", th.key(i))\n\t}\n}\n\n// Test Bug #578, which showed that if a value is moved during value log GC, an\n// older version can end up at a higher level in the LSM tree than a newer\n// version, causing the data to not be returned.\nfunc TestBug578(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\ty.Check(err)\n\tdefer removeDir(dir)\n\n\tdb, err := Open(DefaultOptions(dir).\n\t\tWithValueLogMaxEntries(64).\n\t\tWithBaseTableSize(1 << 13))\n\trequire.NoError(t, err)\n\n\th := testHelper{db: db, t: t}\n\n\t// Let's run this whole thing a few times.\n\tfor j := 0; j < 10; j++ {\n\t\tt.Logf(\"Cycle: %d\\n\", j)\n\t\th.writeRange(0, 32)\n\t\th.writeRange(0, 10)\n\t\th.writeRange(50, 72)\n\t\th.writeRange(40, 72)\n\t\th.writeRange(40, 72)\n\n\t\t// Run value log GC a few times.\n\t\tfor i := 0; i < 5; i++ {\n\t\t\tif err := db.RunValueLogGC(0.5); err != nil && !errors.Is(err, ErrNoRewrite) {\n\t\t\t\trequire.NoError(t, err)\n\t\t\t}\n\t\t}\n\t\th.readRange(0, 10)\n\t}\n\trequire.NoError(t, db.Close())\n}\n\nfunc BenchmarkReadWrite(b *testing.B) {\n\trwRatio := []float32{\n\t\t0.1, 0.2, 0.5, 1.0,\n\t}\n\tvalueSize := []int{\n\t\t64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384,\n\t}\n\n\tfor _, vsz := range valueSize {\n\t\tfor _, rw := range rwRatio {\n\t\t\tb.Run(fmt.Sprintf(\"%3.1f,%04d\", rw, vsz), func(b *testing.B) {\n\t\t\t\tdir, err := os.MkdirTemp(\"\", \"vlog-benchmark\")\n\t\t\t\ty.Check(err)\n\t\t\t\tdefer removeDir(dir)\n\t\t\t\topts := getTestOptions(dir)\n\t\t\t\topts.ValueThreshold = 0\n\t\t\t\tdb, err := Open(opts)\n\t\t\t\ty.Check(err)\n\t\t\t\tdefer db.Close()\n\t\t\t\tvl := &db.vlog\n\t\t\t\tb.ResetTimer()\n\n\t\t\t\tfor i := 0; i < b.N; i++ {\n\t\t\t\t\te := 
new(Entry)\n\t\t\t\t\te.Key = make([]byte, 16)\n\t\t\t\t\te.Value = make([]byte, vsz)\n\t\t\t\t\tbl := new(request)\n\t\t\t\t\tbl.Entries = []*Entry{e}\n\n\t\t\t\t\tvar ptrs []valuePointer\n\n\t\t\t\t\t_ = vl.write([]*request{bl})\n\t\t\t\t\tptrs = append(ptrs, bl.Ptrs...)\n\n\t\t\t\t\tf := rand.Float32()\n\t\t\t\t\tif f < rw {\n\t\t\t\t\t\t_ = vl.write([]*request{bl})\n\n\t\t\t\t\t} else {\n\t\t\t\t\t\tln := len(ptrs)\n\t\t\t\t\t\tif ln == 0 {\n\t\t\t\t\t\t\tb.Fatalf(\"Zero length of ptrs\")\n\t\t\t\t\t\t}\n\t\t\t\t\t\tidx := rand.Intn(ln)\n\t\t\t\t\t\tbuf, lf, err := vl.readValueBytes(ptrs[idx])\n\t\t\t\t\t\tif err != nil {\n\t\t\t\t\t\t\tb.Fatalf(\"Benchmark Read: %v\", err)\n\t\t\t\t\t\t}\n\n\t\t\t\t\t\te, err := lf.decodeEntry(buf, ptrs[idx].Offset)\n\t\t\t\t\t\trequire.NoError(b, err)\n\t\t\t\t\t\tif len(e.Key) != 16 {\n\t\t\t\t\t\t\tb.Fatalf(\"Key is invalid\")\n\t\t\t\t\t\t}\n\t\t\t\t\t\tif len(e.Value) != vsz {\n\t\t\t\t\t\t\tb.Fatalf(\"Value is invalid\")\n\t\t\t\t\t\t}\n\t\t\t\t\t\trunCallback(db.vlog.getUnlockCallback(lf))\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t})\n\t\t}\n\t}\n}\n\n// Regression test for https://github.com/dgraph-io/badger/issues/817\n// This test verifies if fully corrupted memtables are deleted on reopen.\nfunc TestValueLogTruncate(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\t// Initialize the data directory.\n\tdb, err := Open(DefaultOptions(dir))\n\trequire.NoError(t, err)\n\t// Insert 1 entry so that we have valid data in first mem file\n\trequire.NoError(t, db.Update(func(txn *Txn) error {\n\t\treturn txn.Set([]byte(\"foo\"), nil)\n\t}))\n\n\tfileCountBeforeCorruption := 1\n\trequire.NoError(t, db.Close())\n\n\t// Create two mem files with corrupted data. 
These will be truncated when DB starts next time\n\trequire.NoError(t, os.WriteFile(db.mtFilePath(2), []byte(\"foo\"), 0664))\n\trequire.NoError(t, os.WriteFile(db.mtFilePath(3), []byte(\"foo\"), 0664))\n\n\tdb, err = Open(DefaultOptions(dir))\n\trequire.NoError(t, err)\n\n\t// Ensure we have only one SST file.\n\trequire.Equal(t, 1, len(db.Tables()))\n\n\t// Ensure mem file with ID 4 is zero.\n\trequire.Equal(t, 4, int(db.mt.wal.fid))\n\tfileStat, err := db.mt.wal.Fd.Stat()\n\trequire.NoError(t, err)\n\trequire.Equal(t, 2*db.opt.MemTableSize, fileStat.Size())\n\n\tfileCountAfterCorruption := len(db.Tables()) + len(db.imm) + 1 // +1 for db.mt\n\t// We should have one memtable and one sst file.\n\trequire.Equal(t, fileCountBeforeCorruption+1, fileCountAfterCorruption)\n\t// maxFid will be 2 because we increment the max fid on DB open every time.\n\trequire.Equal(t, 2, int(db.vlog.maxFid))\n\trequire.NoError(t, db.Close())\n}\n\nfunc TestSafeEntry(t *testing.T) {\n\tvar s safeRead\n\ts.lf = &logFile{}\n\te := NewEntry([]byte(\"foo\"), []byte(\"bar\"))\n\tbuf := bytes.NewBuffer(nil)\n\t_, err := s.lf.encodeEntry(buf, e, 0)\n\trequire.NoError(t, err)\n\n\tne, err := s.Entry(buf)\n\trequire.NoError(t, err)\n\trequire.Equal(t, e.Key, ne.Key, \"key mismatch\")\n\trequire.Equal(t, e.Value, ne.Value, \"value mismatch\")\n\trequire.Equal(t, e.meta, ne.meta, \"meta mismatch\")\n\trequire.Equal(t, e.UserMeta, ne.UserMeta, \"usermeta mismatch\")\n\trequire.Equal(t, e.ExpiresAt, ne.ExpiresAt, \"expiresAt mismatch\")\n}\n\nfunc TestValueEntryChecksum(t *testing.T) {\n\tk := []byte(\"KEY\")\n\tv := []byte(fmt.Sprintf(\"val%100d\", 10))\n\tt.Run(\"ok\", func(t *testing.T) {\n\t\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\t\tdefer removeDir(dir)\n\n\t\topt := getTestOptions(dir)\n\t\topt.VerifyValueChecksum = true\n\t\topt.ValueThreshold = 32\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(t, err)\n\n\t\trequire.Greater(t, int64(len(v)), 
db.vlog.db.valueThreshold())\n\t\ttxnSet(t, db, k, v, 0)\n\t\trequire.NoError(t, db.Close())\n\n\t\tdb, err = Open(opt)\n\t\trequire.NoError(t, err)\n\n\t\ttxn := db.NewTransaction(false)\n\t\tentry, err := txn.Get(k)\n\t\trequire.NoError(t, err)\n\n\t\tx, err := entry.ValueCopy(nil)\n\t\trequire.NoError(t, err)\n\t\trequire.Equal(t, v, x)\n\n\t\trequire.NoError(t, db.Close())\n\t})\n\t// Regression test for https://github.com/dgraph-io/badger/issues/1049\n\tt.Run(\"Corruption\", func(t *testing.T) {\n\t\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\t\trequire.NoError(t, err)\n\t\tdefer removeDir(dir)\n\n\t\topt := getTestOptions(dir)\n\t\topt.VerifyValueChecksum = true\n\t\topt.ValueThreshold = 32\n\t\tdb, err := Open(opt)\n\t\trequire.NoError(t, err)\n\n\t\trequire.Greater(t, int64(len(v)), db.vlog.db.valueThreshold())\n\t\ttxnSet(t, db, k, v, 0)\n\n\t\tpath := db.vlog.fpath(1)\n\t\trequire.NoError(t, db.Close())\n\n\t\tfile, err := os.OpenFile(path, os.O_RDWR, 0644)\n\t\trequire.NoError(t, err)\n\t\toffset := 50\n\t\torig := make([]byte, 1)\n\t\t_, err = file.ReadAt(orig, int64(offset))\n\t\trequire.NoError(t, err)\n\t\t// Corrupt a single bit.\n\t\t_, err = file.WriteAt([]byte{7}, int64(offset))\n\t\trequire.NoError(t, err)\n\t\trequire.NoError(t, file.Close())\n\n\t\tdb, err = Open(opt)\n\t\trequire.NoError(t, err)\n\n\t\ttxn := db.NewTransaction(false)\n\t\tentry, err := txn.Get(k)\n\t\trequire.NoError(t, err)\n\n\t\t// TODO(ibrahim): This test is broken since we're not returning errors\n\t\t// in case we cannot read the values. 
This is incorrect behavior but\n\t\t// we're doing this to debug an issue where the values are being read\n\t\t// from old vlog files.\n\t\t_, _ = entry.ValueCopy(nil)\n\t\t// require.Error(t, err)\n\t\t// require.Contains(t, err.Error(), \"ErrEOF\")\n\t\t// require.Nil(t, x)\n\n\t\trequire.NoError(t, db.Close())\n\t})\n}\n\nfunc TestValidateWrite(t *testing.T) {\n\t// Mocking the file size, so that we don't allocate big memory while running test.\n\tmaxVlogFileSize = 400\n\tdefer func() {\n\t\tmaxVlogFileSize = math.MaxUint32\n\t}()\n\n\tbigBuf := make([]byte, maxVlogFileSize+1)\n\tlog := &valueLog{\n\t\topt: DefaultOptions(\".\"),\n\t}\n\n\t// Sending a request with big values which will overflow uint32.\n\tkey := []byte(\"HelloKey\")\n\treq := &request{\n\t\tEntries: []*Entry{\n\t\t\t{\n\t\t\t\tKey:   key,\n\t\t\t\tValue: bigBuf,\n\t\t\t},\n\t\t\t{\n\t\t\t\tKey:   key,\n\t\t\t\tValue: bigBuf,\n\t\t\t},\n\t\t\t{\n\t\t\t\tKey:   key,\n\t\t\t\tValue: bigBuf,\n\t\t\t},\n\t\t},\n\t}\n\n\terr := log.validateWrites([]*request{req})\n\trequire.Error(t, err)\n\n\t// Testing with small values.\n\tsmallBuf := make([]byte, 4)\n\treq1 := &request{\n\t\tEntries: []*Entry{\n\t\t\t{\n\t\t\t\tKey:   key,\n\t\t\t\tValue: smallBuf,\n\t\t\t},\n\t\t\t{\n\t\t\t\tKey:   key,\n\t\t\t\tValue: smallBuf,\n\t\t\t},\n\t\t\t{\n\t\t\t\tKey:   key,\n\t\t\t\tValue: smallBuf,\n\t\t\t},\n\t\t},\n\t}\n\n\terr = log.validateWrites([]*request{req1})\n\trequire.NoError(t, err)\n\n\t// Batching small and big request.\n\terr = log.validateWrites([]*request{req1, req})\n\trequire.Error(t, err)\n}\n\nfunc TestValueLogMeta(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\ty.Check(err)\n\tdefer removeDir(dir)\n\n\topt := getTestOptions(dir).WithValueThreshold(16)\n\tdb, _ := Open(opt)\n\tdefer db.Close()\n\ttxn := db.NewTransaction(true)\n\tfor i := 0; i < 10; i++ {\n\t\tk := []byte(fmt.Sprintf(\"key=%d\", i))\n\t\tv := []byte(fmt.Sprintf(\"val=%020d\", i))\n\t\trequire.NoError(t, 
txn.SetEntry(NewEntry(k, v)))\n\t}\n\trequire.NoError(t, txn.Commit())\n\tfids := db.vlog.sortedFids()\n\trequire.Equal(t, 1, len(fids))\n\n\t// vlog entries must not have txn meta.\n\t_, err = db.vlog.filesMap[fids[0]].iterate(true, 0, func(e Entry, vp valuePointer) error {\n\t\trequire.Zero(t, e.meta&(bitTxn|bitFinTxn))\n\t\treturn nil\n\t})\n\trequire.NoError(t, err)\n\n\t// Entries in LSM tree must have txn bit of meta set\n\ttxn = db.NewTransaction(false)\n\tdefer txn.Discard()\n\tiopt := DefaultIteratorOptions\n\tkey := []byte(\"key\")\n\tiopt.Prefix = key\n\titr := txn.NewIterator(iopt)\n\tdefer itr.Close()\n\tvar count int\n\tfor itr.Seek(key); itr.ValidForPrefix(key); itr.Next() {\n\t\titem := itr.Item()\n\t\trequire.Equal(t, bitTxn, item.meta&(bitTxn|bitFinTxn))\n\t\tcount++\n\t}\n\trequire.Equal(t, 10, count)\n}\n\n// This test asserts the condition that vlog fids start from 1.\n// TODO(naman): should this be changed to assert instead?\nfunc TestFirstVlogFile(t *testing.T) {\n\tdir, err := os.MkdirTemp(\"\", \"badger-test\")\n\trequire.NoError(t, err)\n\tdefer removeDir(dir)\n\n\topt := DefaultOptions(dir)\n\tdb, err := Open(opt)\n\trequire.NoError(t, err)\n\tdefer db.Close()\n\n\tfids := db.vlog.sortedFids()\n\trequire.NotZero(t, len(fids))\n\trequire.Equal(t, uint32(1), fids[0])\n}\n"
  },
  {
    "path": "watermark_edge_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage badger\n\nimport (\n\t\"crypto/rand\"\n\t\"errors\"\n\t\"fmt\"\n\t\"math/big\"\n\t\"sync\"\n\t\"testing\"\n\t\"time\"\n)\n\nfunc TestWaterMarkEdgeCase(t *testing.T) {\n\tconst N = 1_000\n\trunBadgerTest(t, nil, func(t *testing.T, db *DB) {\n\t\teg := make(chan error, N)\n\t\tdefer close(eg)\n\n\t\tvar wg sync.WaitGroup\n\t\twg.Add(N)\n\t\tfor i := 0; i < N; i++ {\n\t\t\tgo func(j int) {\n\t\t\t\tdefer wg.Done()\n\t\t\t\tif err := doWork(db, j); errors.Is(err, ErrConflict) {\n\t\t\t\t\teg <- nil\n\t\t\t\t} else {\n\t\t\t\t\teg <- fmt.Errorf(\"expected conflict not found, err: %v, i = %v\", err, j)\n\t\t\t\t}\n\t\t\t}(i)\n\t\t}\n\t\twg.Wait()\n\n\t\tfor i := 0; i < N; i++ {\n\t\t\tif err := <-eg; err != nil {\n\t\t\t\tt.Fatal(err)\n\t\t\t}\n\t\t}\n\t})\n}\n\nfunc doWork(db *DB, i int) error {\n\tdelay()\n\n\tkey1 := fmt.Sprintf(\"v:%d:%s\", i, generateRandomBytes())\n\tkey2 := fmt.Sprintf(\"v:%d:%s\", i, generateRandomBytes())\n\n\ttx1 := db.NewTransaction(true)\n\tdefer tx1.Discard()\n\ttx2 := db.NewTransaction(true)\n\tdefer tx2.Discard()\n\n\tgetValue(tx2, key1)\n\tgetValue(tx2, key2)\n\tgetValue(tx1, key1)\n\tgetValue(tx2, key1)\n\tsetValue(tx2, key1, \"value1\")\n\tsetValue(tx2, key2, \"value2\")\n\n\tif err := tx2.Commit(); err != nil {\n\t\treturn fmt.Errorf(\"tx2 failed: %w (key1 = %s, key2 = %s)\", err, key1, key2)\n\t}\n\n\tsetValue(tx1, key1, \"value1-second\")\n\tgetValue(tx1, key1)\n\tsetValue(tx1, key1, \"value1-third\")\n\n\tdelay()\n\tif err := tx1.Commit(); err != nil {\n\t\treturn fmt.Errorf(\"tx1 failed: %w (key1 = %s, key2 = %s)\", err, key1, key2)\n\t}\n\n\treturn nil\n}\n\nfunc generateRandomBytes() []byte {\n\tb := make([]byte, 20)\n\tif _, err := rand.Read(b); err != nil {\n\t\tpanic(err)\n\t}\n\treturn b\n}\n\nfunc getValue(txn *Txn, key string) {\n\tif _, err := txn.Get([]byte(key)); err != nil {\n\t\tif 
!errors.Is(err, ErrKeyNotFound) {\n\t\t\tpanic(err)\n\t\t}\n\t}\n}\n\nfunc setValue(txn *Txn, key, value string) {\n\tif err := txn.Set([]byte(key), []byte(value)); err != nil {\n\t\tpanic(err)\n\t}\n}\n\nfunc delay() {\n\tjitter, err := rand.Int(rand.Reader, big.NewInt(100))\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\t<-time.After(time.Duration(jitter.Int64()) * time.Millisecond)\n}\n"
  },
  {
    "path": "y/bloom.go",
    "content": "// Copyright 2013 The LevelDB-Go Authors. All rights reserved.\n// Use of this source code is governed by a BSD-style\n// license that can be found in the LICENSE file.\n\npackage y\n\nimport \"math\"\n\n// Filter is an encoded set of []byte keys.\ntype Filter []byte\n\nfunc (f Filter) MayContainKey(k []byte) bool {\n\treturn f.MayContain(Hash(k))\n}\n\n// MayContain returns whether the filter may contain given key. False positives\n// are possible, where it returns true for keys not in the original set.\nfunc (f Filter) MayContain(h uint32) bool {\n\tif len(f) < 2 {\n\t\treturn false\n\t}\n\tk := f[len(f)-1]\n\tif k > 30 {\n\t\t// This is reserved for potentially new encodings for short Bloom filters.\n\t\t// Consider it a match.\n\t\treturn true\n\t}\n\tnBits := uint32(8 * (len(f) - 1))\n\tdelta := h>>17 | h<<15\n\tfor j := uint8(0); j < k; j++ {\n\t\tbitPos := h % nBits\n\t\tif f[bitPos/8]&(1<<(bitPos%8)) == 0 {\n\t\t\treturn false\n\t\t}\n\t\th += delta\n\t}\n\treturn true\n}\n\n// NewFilter returns a new Bloom filter that encodes a set of []byte keys with\n// the given number of bits per key, approximately.\n//\n// A good bitsPerKey value is 10, which yields a filter with ~ 1% false\n// positive rate.\nfunc NewFilter(keys []uint32, bitsPerKey int) Filter {\n\treturn Filter(appendFilter(nil, keys, bitsPerKey))\n}\n\n// BloomBitsPerKey returns the bits per key required by bloomfilter based on\n// the false positive rate.\nfunc BloomBitsPerKey(numEntries int, fp float64) int {\n\tsize := -1 * float64(numEntries) * math.Log(fp) / math.Pow(float64(0.69314718056), 2)\n\tlocs := math.Ceil(float64(0.69314718056) * size / float64(numEntries))\n\treturn int(locs)\n}\n\nfunc appendFilter(buf []byte, keys []uint32, bitsPerKey int) []byte {\n\tif bitsPerKey < 0 {\n\t\tbitsPerKey = 0\n\t}\n\t// 0.69 is approximately ln(2).\n\tk := uint32(float64(bitsPerKey) * 0.69)\n\tif k < 1 {\n\t\tk = 1\n\t}\n\tif k > 30 {\n\t\tk = 30\n\t}\n\n\tnBits := len(keys) * 
bitsPerKey\n\t// For small len(keys), we can see a very high false positive rate. Fix it\n\t// by enforcing a minimum bloom filter length.\n\tif nBits < 64 {\n\t\tnBits = 64\n\t}\n\tnBytes := (nBits + 7) / 8\n\tnBits = nBytes * 8\n\tbuf, filter := extend(buf, nBytes+1)\n\n\tfor _, h := range keys {\n\t\tdelta := h>>17 | h<<15\n\t\tfor j := uint32(0); j < k; j++ {\n\t\t\tbitPos := h % uint32(nBits)\n\t\t\tfilter[bitPos/8] |= 1 << (bitPos % 8)\n\t\t\th += delta\n\t\t}\n\t}\n\tfilter[nBytes] = uint8(k)\n\n\treturn buf\n}\n\n// extend appends n zero bytes to b. It returns the overall slice (of length\n// n+len(originalB)) and the slice of n trailing zeroes.\nfunc extend(b []byte, n int) (overall, trailer []byte) {\n\twant := n + len(b)\n\tif want <= cap(b) {\n\t\toverall = b[:want]\n\t\ttrailer = overall[len(b):]\n\t\tfor i := range trailer {\n\t\t\ttrailer[i] = 0\n\t\t}\n\t} else {\n\t\t// Grow the capacity exponentially, with a 1KiB minimum.\n\t\tc := 1024\n\t\tfor c < want {\n\t\t\tc += c / 4\n\t\t}\n\t\toverall = make([]byte, want, c)\n\t\ttrailer = overall[len(b):]\n\t\tcopy(overall, b)\n\t}\n\treturn overall, trailer\n}\n\n// Hash implements a hashing algorithm similar to the Murmur hash.\nfunc Hash(b []byte) uint32 {\n\tconst (\n\t\tseed = 0xbc9f1d34\n\t\tm    = 0xc6a4a793\n\t)\n\th := uint32(seed) ^ uint32(len(b))*m\n\tfor ; len(b) >= 4; b = b[4:] {\n\t\th += uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24\n\t\th *= m\n\t\th ^= h >> 16\n\t}\n\tswitch len(b) {\n\tcase 3:\n\t\th += uint32(b[2]) << 16\n\t\tfallthrough\n\tcase 2:\n\t\th += uint32(b[1]) << 8\n\t\tfallthrough\n\tcase 1:\n\t\th += uint32(b[0])\n\t\th *= m\n\t\th ^= h >> 24\n\t}\n\treturn h\n}\n\n// FilterPolicy implements the db.FilterPolicy interface from the leveldb/db\n// package.\n//\n// The integer value is the approximate number of bits used per key. 
A good\n// value is 10, which yields a filter with ~ 1% false positive rate.\n//\n// It is valid to use the other API in this package (leveldb/bloom) without\n// using this type or the leveldb/db package.\n\n// type FilterPolicy int\n\n// // Name implements the db.FilterPolicy interface.\n// func (p FilterPolicy) Name() string {\n// \t// This string looks arbitrary, but its value is written to LevelDB .ldb\n// \t// files, and should be this exact value to be compatible with those files\n// \t// and with the C++ LevelDB code.\n// \treturn \"leveldb.BuiltinBloomFilter2\"\n// }\n\n// // AppendFilter implements the db.FilterPolicy interface.\n// func (p FilterPolicy) AppendFilter(dst []byte, keys [][]byte) []byte {\n// \treturn appendFilter(dst, keys, int(p))\n// }\n\n// // MayContain implements the db.FilterPolicy interface.\n// func (p FilterPolicy) MayContain(filter, key []byte) bool {\n// \treturn Filter(filter).MayContain(key)\n// }\n"
  },
  {
    "path": "y/bloom_test.go",
    "content": "// Copyright 2013 The LevelDB-Go Authors. All rights reserved.\n// Use of this source code is governed by a BSD-style\n// license that can be found in the LICENSE file.\n\npackage y\n\nimport (\n\t\"testing\"\n)\n\nfunc (f Filter) String() string {\n\ts := make([]byte, 8*len(f))\n\tfor i, x := range f {\n\t\tfor j := 0; j < 8; j++ {\n\t\t\tif x&(1<<uint(j)) != 0 {\n\t\t\t\ts[8*i+j] = '1'\n\t\t\t} else {\n\t\t\t\ts[8*i+j] = '.'\n\t\t\t}\n\t\t}\n\t}\n\treturn string(s)\n}\n\nfunc TestSmallBloomFilter(t *testing.T) {\n\tvar hash []uint32\n\tfor _, word := range [][]byte{\n\t\t[]byte(\"hello\"),\n\t\t[]byte(\"world\"),\n\t} {\n\t\thash = append(hash, Hash(word))\n\t}\n\n\tf := NewFilter(hash, 10)\n\tgot := f.String()\n\t// The magic want string comes from running the C++ leveldb code's bloom_test.cc.\n\twant := \"1...1.........1.........1.....1...1...1.....1.........1.....1....11.....\"\n\tif got != want {\n\t\tt.Fatalf(\"bits:\\ngot  %q\\nwant %q\", got, want)\n\t}\n\n\tm := map[string]bool{\n\t\t\"hello\": true,\n\t\t\"world\": true,\n\t\t\"x\":     false,\n\t\t\"foo\":   false,\n\t}\n\tfor k, want := range m {\n\t\tgot := f.MayContainKey([]byte(k))\n\t\tif got != want {\n\t\t\tt.Errorf(\"MayContain: k=%q: got %v, want %v\", k, got, want)\n\t\t}\n\t}\n}\n\nfunc TestBloomFilter(t *testing.T) {\n\tnextLength := func(x int) int {\n\t\tif x < 10 {\n\t\t\treturn x + 1\n\t\t}\n\t\tif x < 100 {\n\t\t\treturn x + 10\n\t\t}\n\t\tif x < 1000 {\n\t\t\treturn x + 100\n\t\t}\n\t\treturn x + 1000\n\t}\n\tle32 := func(i int) []byte {\n\t\tb := make([]byte, 4)\n\t\tb[0] = uint8(uint32(i) >> 0)\n\t\tb[1] = uint8(uint32(i) >> 8)\n\t\tb[2] = uint8(uint32(i) >> 16)\n\t\tb[3] = uint8(uint32(i) >> 24)\n\t\treturn b\n\t}\n\n\tnMediocreFilters, nGoodFilters := 0, 0\nloop:\n\tfor length := 1; length <= 10000; length = nextLength(length) {\n\t\tkeys := make([][]byte, 0, length)\n\t\tfor i := 0; i < length; i++ {\n\t\t\tkeys = append(keys, le32(i))\n\t\t}\n\t\tvar hashes 
[]uint32\n\t\tfor _, key := range keys {\n\t\t\thashes = append(hashes, Hash(key))\n\t\t}\n\t\tf := NewFilter(hashes, 10)\n\n\t\tif len(f) > (length*10/8)+40 {\n\t\t\tt.Errorf(\"length=%d: len(f)=%d is too large\", length, len(f))\n\t\t\tcontinue\n\t\t}\n\n\t\t// All added keys must match.\n\t\tfor _, key := range keys {\n\t\t\tif !f.MayContainKey(key) {\n\t\t\t\tt.Errorf(\"length=%d: did not contain key %q\", length, key)\n\t\t\t\tcontinue loop\n\t\t\t}\n\t\t}\n\n\t\t// Check false positive rate.\n\t\tnFalsePositive := 0\n\t\tfor i := 0; i < 10000; i++ {\n\t\t\tif f.MayContainKey(le32(1e9 + i)) {\n\t\t\t\tnFalsePositive++\n\t\t\t}\n\t\t}\n\t\tif nFalsePositive > 0.02*10000 {\n\t\t\tt.Errorf(\"length=%d: %d false positives in 10000\", length, nFalsePositive)\n\t\t\tcontinue\n\t\t}\n\t\tif nFalsePositive > 0.0125*10000 {\n\t\t\tnMediocreFilters++\n\t\t} else {\n\t\t\tnGoodFilters++\n\t\t}\n\t}\n\n\tif nMediocreFilters > nGoodFilters/5 {\n\t\tt.Errorf(\"%d mediocre filters but only %d good filters\", nMediocreFilters, nGoodFilters)\n\t}\n}\n\nfunc TestHash(t *testing.T) {\n\t// The magic want numbers come from running the C++ leveldb code in hash.cc.\n\ttestCases := []struct {\n\t\ts    string\n\t\twant uint32\n\t}{\n\t\t{\"\", 0xbc9f1d34},\n\t\t{\"g\", 0xd04a8bda},\n\t\t{\"go\", 0x3e0b0745},\n\t\t{\"gop\", 0x0c326610},\n\t\t{\"goph\", 0x8c9d6390},\n\t\t{\"gophe\", 0x9bfd4b0a},\n\t\t{\"gopher\", 0xa78edc7c},\n\t\t{\"I had a dream it would end this way.\", 0xe14a9db9},\n\t}\n\tfor _, tc := range testCases {\n\t\tif got := Hash([]byte(tc.s)); got != tc.want {\n\t\t\tt.Errorf(\"s=%q: got 0x%08x, want 0x%08x\", tc.s, got, tc.want)\n\t\t}\n\t}\n}\n"
  },
  {
    "path": "y/checksum.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport (\n\tstderrors \"errors\"\n\t\"hash/crc32\"\n\n\t\"github.com/cespare/xxhash/v2\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n)\n\n// ErrChecksumMismatch is returned at checksum mismatch.\nvar ErrChecksumMismatch = stderrors.New(\"checksum mismatch\")\n\n// CalculateChecksum calculates checksum for data using ct checksum type.\nfunc CalculateChecksum(data []byte, ct pb.Checksum_Algorithm) uint64 {\n\tswitch ct {\n\tcase pb.Checksum_CRC32C:\n\t\treturn uint64(crc32.Checksum(data, CastagnoliCrcTable))\n\tcase pb.Checksum_XXHash64:\n\t\treturn xxhash.Sum64(data)\n\tdefault:\n\t\tpanic(\"checksum type not supported\")\n\t}\n}\n\n// VerifyChecksum validates the checksum for the data against the given expected checksum.\nfunc VerifyChecksum(data []byte, expected *pb.Checksum) error {\n\tactual := CalculateChecksum(data, expected.Algo)\n\tif actual != expected.Sum {\n\t\treturn Wrapf(ErrChecksumMismatch, \"actual: %d, expected: %d\", actual, expected.Sum)\n\t}\n\treturn nil\n}\n"
  },
  {
    "path": "y/checksum_test.go",
    "content": "package y\n\nimport (\n\t\"hash/crc32\"\n\t\"testing\"\n\n\t\"github.com/cespare/xxhash/v2\"\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/stretchr/testify/require\"\n)\n\nfunc TestCalculateChecksum_CRC32C(t *testing.T) {\n\tdata := []byte(\"hello world\")\n\texpected := uint64(crc32.Checksum(data, CastagnoliCrcTable))\n\tgot := CalculateChecksum(data, pb.Checksum_CRC32C)\n\trequire.Equal(t, expected, got)\n\n\t// empty input\n\texpectedEmpty := uint64(crc32.Checksum([]byte{}, CastagnoliCrcTable))\n\tgotEmpty := CalculateChecksum([]byte{}, pb.Checksum_CRC32C)\n\trequire.Equal(t, expectedEmpty, gotEmpty)\n}\n\nfunc TestCalculateChecksum_XXHash64(t *testing.T) {\n\tdata := []byte(\"hello world\")\n\texpected := xxhash.Sum64(data)\n\tgot := CalculateChecksum(data, pb.Checksum_XXHash64)\n\trequire.Equal(t, expected, got)\n}\n\nfunc TestVerifyChecksum_Success(t *testing.T) {\n\tdata := []byte(\"hello world\")\n\tc1 := &pb.Checksum{Algo: pb.Checksum_CRC32C, Sum: CalculateChecksum(data, pb.Checksum_CRC32C)}\n\trequire.NoError(t, VerifyChecksum(data, c1))\n\n\tc2 := &pb.Checksum{Algo: pb.Checksum_XXHash64, Sum: CalculateChecksum(data, pb.Checksum_XXHash64)}\n\trequire.NoError(t, VerifyChecksum(data, c2))\n}\n\nfunc TestVerifyChecksum_Mismatch(t *testing.T) {\n\tdata := []byte(\"x\")\n\tc := &pb.Checksum{Algo: pb.Checksum_CRC32C, Sum: 0}\n\terr := VerifyChecksum(data, c)\n\trequire.Error(t, err)\n\trequire.Contains(t, err.Error(), \"checksum mismatch\")\n}\n\nfunc TestCalculateChecksum_UnsupportedAlgoPanics(t *testing.T) {\n\tdefer func() {\n\t\tif r := recover(); r == nil {\n\t\t\tt.Fatalf(\"expected panic for unsupported algorithm\")\n\t\t}\n\t}()\n\n\t_ = CalculateChecksum([]byte(\"x\"), pb.Checksum_Algorithm(999))\n}\n"
  },
  {
    "path": "y/encrypt.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport (\n\t\"bytes\"\n\t\"crypto/aes\"\n\t\"crypto/cipher\"\n\t\"crypto/rand\"\n\t\"io\"\n)\n\n// XORBlock encrypts the given data with AES and XOR's with IV.\n// Can be used for both encryption and decryption. IV is of\n// AES block size.\nfunc XORBlock(dst, src, key, iv []byte) error {\n\tblock, err := aes.NewCipher(key)\n\tif err != nil {\n\t\treturn err\n\t}\n\tstream := cipher.NewCTR(block, iv)\n\tstream.XORKeyStream(dst, src)\n\treturn nil\n}\n\nfunc XORBlockAllocate(src, key, iv []byte) ([]byte, error) {\n\tblock, err := aes.NewCipher(key)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tstream := cipher.NewCTR(block, iv)\n\tdst := make([]byte, len(src))\n\tstream.XORKeyStream(dst, src)\n\treturn dst, nil\n}\n\nfunc XORBlockStream(w io.Writer, src, key, iv []byte) error {\n\tblock, err := aes.NewCipher(key)\n\tif err != nil {\n\t\treturn err\n\t}\n\tstream := cipher.NewCTR(block, iv)\n\tsw := cipher.StreamWriter{S: stream, W: w}\n\t_, err = io.Copy(sw, bytes.NewReader(src))\n\treturn Wrapf(err, \"XORBlockStream\")\n}\n\n// GenerateIV generates IV.\nfunc GenerateIV() ([]byte, error) {\n\tiv := make([]byte, aes.BlockSize)\n\t_, err := rand.Read(iv)\n\treturn iv, err\n}\n"
  },
  {
    "path": "y/encrypt_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport (\n\t\"crypto/aes\"\n\t\"crypto/rand\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n)\n\nfunc TestXORBlock(t *testing.T) {\n\tkey := make([]byte, 32)\n\t_, _ = rand.Read(key)\n\n\tvar iv []byte\n\t{\n\t\tb, err := aes.NewCipher(key)\n\t\trequire.NoError(t, err)\n\t\tiv = make([]byte, b.BlockSize())\n\t\t_, _ = rand.Read(iv)\n\t\tt.Logf(\"Using %d size IV\\n\", len(iv))\n\t}\n\n\tsrc := make([]byte, 1024)\n\t_, _ = rand.Read(src)\n\n\tdst := make([]byte, 1024)\n\terr := XORBlock(dst, src, key, iv)\n\trequire.NoError(t, err)\n\n\tact := make([]byte, 1024)\n\terr = XORBlock(act, dst, key, iv)\n\trequire.NoError(t, err)\n\trequire.Equal(t, src, act)\n\n\t// Now check if we can use the same byte slice as src and dst. While this is useful to know that\n\t// we can use src and dst as the same slice, this isn't applicable to Badger because we're\n\t// reading data right off mmap. We should not modify that data, so we have to use a different\n\t// slice for dst anyway.\n\tcp := append([]byte{}, src...)\n\terr = XORBlock(cp, cp, key, iv)\n\trequire.NoError(t, err)\n\trequire.Equal(t, dst, cp)\n\n\terr = XORBlock(cp, cp, key, iv)\n\trequire.NoError(t, err)\n\trequire.Equal(t, src, cp)\n}\n"
  },
  {
    "path": "y/error.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\n// This file contains some functions for error handling. Note that we are moving\n// towards using x.Trace, i.e., rpc tracing using net/tracer. But for now, these\n// functions are useful for simple checks logged on one machine.\n// Some common use cases are:\n// (1) You receive an error from external lib, and would like to check/log fatal.\n//     For this, use x.Check, x.Checkf. These will check for err != nil, which is\n//     more common in Go. If you want to check for boolean being true, use\n//\t\t   x.Assert, x.Assertf.\n// (2) You receive an error from external lib, and would like to pass on with some\n//     stack trace information. In this case, use x.Wrap or x.Wrapf.\n// (3) You want to generate a new error with stack trace info. Use x.Errorf.\n\nimport (\n\t\"errors\"\n\t\"fmt\"\n\t\"log\"\n)\n\nvar debugMode = false\n\n// Check logs fatal if err != nil.\nfunc Check(err error) {\n\tif err != nil {\n\t\tlog.Fatalf(\"%+v\", Wrap(err, \"\"))\n\t}\n}\n\n// Check2 acts as convenience wrapper around Check, using the 2nd argument as error.\nfunc Check2(_ interface{}, err error) {\n\tCheck(err)\n}\n\n// AssertTrue asserts that b is true. 
Otherwise, it would log fatal.\nfunc AssertTrue(b bool) {\n\tif !b {\n\t\tlog.Fatalf(\"%+v\", errors.New(\"Assert failed\"))\n\t}\n}\n\n// AssertTruef is AssertTrue with extra info.\nfunc AssertTruef(b bool, format string, args ...interface{}) {\n\tif !b {\n\t\tlog.Fatalf(\"%+v\", fmt.Errorf(format, args...))\n\t}\n}\n\n// Wrap wraps errors from external lib.\nfunc Wrap(err error, msg string) error {\n\tif !debugMode {\n\t\tif err == nil {\n\t\t\treturn nil\n\t\t}\n\t\treturn fmt.Errorf(\"%s err: %+v\", msg, err)\n\t}\n\treturn fmt.Errorf(\"%s: %w\", msg, err)\n}\n\n// Wrapf is Wrap with extra info.\nfunc Wrapf(err error, format string, args ...interface{}) error {\n\treturn Wrap(err, fmt.Sprintf(format, args...))\n}\n\nfunc CombineErrors(one, other error) error {\n\tif one != nil && other != nil {\n\t\treturn fmt.Errorf(\"%v; %v\", one, other)\n\t}\n\tif one != nil && other == nil {\n\t\treturn fmt.Errorf(\"%v\", one)\n\t}\n\tif one == nil && other != nil {\n\t\treturn fmt.Errorf(\"%v\", other)\n\t}\n\treturn nil\n}\n"
  },
  {
    "path": "y/error_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport (\n\t\"errors\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/require\"\n)\n\nfunc TestCombineWithBothErrorsPresent(t *testing.T) {\n\tcombinedError := CombineErrors(errors.New(\"one\"), errors.New(\"two\"))\n\trequire.Equal(t, \"one; two\", combinedError.Error())\n}\n\nfunc TestCombineErrorsWithOneErrorPresent(t *testing.T) {\n\tcombinedError := CombineErrors(errors.New(\"one\"), nil)\n\trequire.Equal(t, \"one\", combinedError.Error())\n}\n\nfunc TestCombineErrorsWithOtherErrorPresent(t *testing.T) {\n\tcombinedError := CombineErrors(nil, errors.New(\"other\"))\n\trequire.Equal(t, \"other\", combinedError.Error())\n}\n\nfunc TestCombineErrorsWithBothErrorsAsNil(t *testing.T) {\n\tcombinedError := CombineErrors(nil, nil)\n\trequire.NoError(t, combinedError)\n}\n"
  },
  {
    "path": "y/file_dsync.go",
    "content": "//go:build !dragonfly && !freebsd && !windows && !plan9 && !js && !wasip1\n// +build !dragonfly,!freebsd,!windows,!plan9,!js,!wasip1\n\n/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport \"golang.org/x/sys/unix\"\n\nfunc init() {\n\tdatasyncFileFlag = unix.O_DSYNC\n}\n"
  },
  {
    "path": "y/file_nodsync.go",
    "content": "//go:build dragonfly || freebsd || windows || plan9\n// +build dragonfly freebsd windows plan9\n\n/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport \"syscall\"\n\nfunc init() {\n\tdatasyncFileFlag = syscall.O_SYNC\n}\n"
  },
  {
    "path": "y/iterator.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport (\n\t\"bytes\"\n\t\"encoding/binary\"\n)\n\n// ValueStruct represents the value info that can be associated with a key, but also the internal\n// Meta field.\ntype ValueStruct struct {\n\tMeta      byte\n\tUserMeta  byte\n\tExpiresAt uint64\n\tValue     []byte\n\n\tVersion uint64 // This field is not serialized. Only for internal usage.\n}\n\nfunc sizeVarint(x uint64) (n int) {\n\tfor {\n\t\tn++\n\t\tx >>= 7\n\t\tif x == 0 {\n\t\t\tbreak\n\t\t}\n\t}\n\treturn n\n}\n\n// EncodedSize is the size of the ValueStruct when encoded\nfunc (v *ValueStruct) EncodedSize() uint32 {\n\tsz := len(v.Value) + 2 // meta, usermeta.\n\tenc := sizeVarint(v.ExpiresAt)\n\treturn uint32(sz + enc)\n}\n\n// Decode uses the length of the slice to infer the length of the Value field.\nfunc (v *ValueStruct) Decode(b []byte) {\n\tv.Meta = b[0]\n\tv.UserMeta = b[1]\n\tvar sz int\n\tv.ExpiresAt, sz = binary.Uvarint(b[2:])\n\tv.Value = b[2+sz:]\n}\n\n// Encode expects a slice of length at least v.EncodedSize().\nfunc (v *ValueStruct) Encode(b []byte) uint32 {\n\tb[0] = v.Meta\n\tb[1] = v.UserMeta\n\tsz := binary.PutUvarint(b[2:], v.ExpiresAt)\n\tn := copy(b[2+sz:], v.Value)\n\treturn uint32(2 + sz + n)\n}\n\n// EncodeTo should be kept in sync with the Encode function above. 
The reason\n// this function exists is to avoid creating byte arrays per key-value pair in\n// table/builder.go.\nfunc (v *ValueStruct) EncodeTo(buf *bytes.Buffer) {\n\tbuf.WriteByte(v.Meta)\n\tbuf.WriteByte(v.UserMeta)\n\tvar enc [binary.MaxVarintLen64]byte\n\tsz := binary.PutUvarint(enc[:], v.ExpiresAt)\n\n\tbuf.Write(enc[:sz])\n\tbuf.Write(v.Value)\n}\n\n// Iterator is an interface for a basic iterator.\ntype Iterator interface {\n\tNext()\n\tRewind()\n\tSeek(key []byte)\n\tKey() []byte\n\tValue() ValueStruct\n\tValid() bool\n\n\t// All iterators should be closed so that file garbage collection works.\n\tClose() error\n}\n"
  },
  {
    "path": "y/metrics.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport (\n\t\"expvar\"\n)\n\nconst (\n\tBADGER_METRIC_PREFIX = \"badger_\"\n)\n\nvar (\n\t// lsmSize has size of the LSM in bytes\n\tlsmSize *expvar.Map\n\t// vlogSize has size of the value log in bytes\n\tvlogSize *expvar.Map\n\t// pendingWrites tracks the number of pending writes.\n\tpendingWrites *expvar.Map\n\n\t// These are cumulative\n\n\t// VLOG METRICS\n\t// numReads has cumulative number of reads from vlog\n\tnumReadsVlog *expvar.Int\n\t// numWrites has cumulative number of writes into vlog\n\tnumWritesVlog *expvar.Int\n\t// numBytesRead has cumulative number of bytes read from VLOG\n\tnumBytesReadVlog *expvar.Int\n\t// numBytesVlogWritten has cumulative number of bytes written into VLOG\n\tnumBytesVlogWritten *expvar.Int\n\n\t// LSM METRICS\n\t// numBytesRead has cumulative number of bytes read from LSM tree\n\tnumBytesReadLSM *expvar.Int\n\t// numBytesWrittenToL0 has cumulative number of bytes written into LSM Tree\n\tnumBytesWrittenToL0 *expvar.Int\n\t// numLSMGets is number of LSM gets\n\tnumLSMGets *expvar.Map\n\t// numBytesCompactionWritten is the number of bytes written in the lsm tree due to compaction\n\tnumBytesCompactionWritten *expvar.Map\n\t// numLSMBloomHits is number of LMS bloom hits\n\tnumLSMBloomHits *expvar.Map\n\n\t// DB METRICS\n\t// numGets is number of gets -> Number of get requests made\n\tnumGets *expvar.Int\n\t// number of get queries in which we actually get a result\n\tnumGetsWithResults *expvar.Int\n\t// number of iterators created, these would be the number of range queries\n\tnumIteratorsCreated *expvar.Int\n\t// numPuts is number of puts -> Number of puts requests made\n\tnumPuts *expvar.Int\n\t// numMemtableGets is number of memtable gets -> Number of get requests made on memtable\n\tnumMemtableGets *expvar.Int\n\t// numCompactionTables is the number of tables being 
compacted\n\tnumCompactionTables *expvar.Int\n\t// Total writes by a user in bytes\n\tnumBytesWrittenUser *expvar.Int\n)\n\n// These variables are global and have cumulative values for all kv stores.\n// Naming convention of metrics: {badger_version}_{singular operation}_{granularity}_{component}\nfunc init() {\n\tnumReadsVlog = expvar.NewInt(BADGER_METRIC_PREFIX + \"read_num_vlog\")\n\tnumBytesReadVlog = expvar.NewInt(BADGER_METRIC_PREFIX + \"read_bytes_vlog\")\n\tnumWritesVlog = expvar.NewInt(BADGER_METRIC_PREFIX + \"write_num_vlog\")\n\tnumBytesVlogWritten = expvar.NewInt(BADGER_METRIC_PREFIX + \"write_bytes_vlog\")\n\n\tnumBytesReadLSM = expvar.NewInt(BADGER_METRIC_PREFIX + \"read_bytes_lsm\")\n\tnumBytesWrittenToL0 = expvar.NewInt(BADGER_METRIC_PREFIX + \"write_bytes_l0\")\n\tnumBytesCompactionWritten = expvar.NewMap(BADGER_METRIC_PREFIX + \"write_bytes_compaction\")\n\n\tnumLSMGets = expvar.NewMap(BADGER_METRIC_PREFIX + \"get_num_lsm\")\n\tnumLSMBloomHits = expvar.NewMap(BADGER_METRIC_PREFIX + \"hit_num_lsm_bloom_filter\")\n\tnumMemtableGets = expvar.NewInt(BADGER_METRIC_PREFIX + \"get_num_memtable\")\n\n\t// User operations\n\tnumGets = expvar.NewInt(BADGER_METRIC_PREFIX + \"get_num_user\")\n\tnumPuts = expvar.NewInt(BADGER_METRIC_PREFIX + \"put_num_user\")\n\tnumBytesWrittenUser = expvar.NewInt(BADGER_METRIC_PREFIX + \"write_bytes_user\")\n\n\t// Required for Enabled\n\tnumGetsWithResults = expvar.NewInt(BADGER_METRIC_PREFIX + \"get_with_result_num_user\")\n\tnumIteratorsCreated = expvar.NewInt(BADGER_METRIC_PREFIX + \"iterator_num_user\")\n\n\t// Sizes\n\tlsmSize = expvar.NewMap(BADGER_METRIC_PREFIX + \"size_bytes_lsm\")\n\tvlogSize = expvar.NewMap(BADGER_METRIC_PREFIX + \"size_bytes_vlog\")\n\n\tpendingWrites = expvar.NewMap(BADGER_METRIC_PREFIX + \"write_pending_num_memtable\")\n\tnumCompactionTables = expvar.NewInt(BADGER_METRIC_PREFIX + \"compaction_current_num_lsm\")\n}\n\nfunc NumIteratorsCreatedAdd(enabled bool, val int64) {\n\taddInt(enabled, 
numIteratorsCreated, val)\n}\n\nfunc NumGetsWithResultsAdd(enabled bool, val int64) {\n\taddInt(enabled, numGetsWithResults, val)\n}\n\nfunc NumReadsVlogAdd(enabled bool, val int64) {\n\taddInt(enabled, numReadsVlog, val)\n}\n\nfunc NumBytesWrittenUserAdd(enabled bool, val int64) {\n\taddInt(enabled, numBytesWrittenUser, val)\n}\n\nfunc NumWritesVlogAdd(enabled bool, val int64) {\n\taddInt(enabled, numWritesVlog, val)\n}\n\nfunc NumBytesReadsVlogAdd(enabled bool, val int64) {\n\taddInt(enabled, numBytesReadVlog, val)\n}\n\nfunc NumBytesReadsLSMAdd(enabled bool, val int64) {\n\taddInt(enabled, numBytesReadLSM, val)\n}\n\nfunc NumBytesWrittenVlogAdd(enabled bool, val int64) {\n\taddInt(enabled, numBytesVlogWritten, val)\n}\n\nfunc NumBytesWrittenToL0Add(enabled bool, val int64) {\n\taddInt(enabled, numBytesWrittenToL0, val)\n}\n\nfunc NumBytesCompactionWrittenAdd(enabled bool, key string, val int64) {\n\taddToMap(enabled, numBytesCompactionWritten, key, val)\n}\n\nfunc NumGetsAdd(enabled bool, val int64) {\n\taddInt(enabled, numGets, val)\n}\n\nfunc NumPutsAdd(enabled bool, val int64) {\n\taddInt(enabled, numPuts, val)\n}\n\nfunc NumMemtableGetsAdd(enabled bool, val int64) {\n\taddInt(enabled, numMemtableGets, val)\n}\n\nfunc NumCompactionTablesAdd(enabled bool, val int64) {\n\taddInt(enabled, numCompactionTables, val)\n}\n\nfunc LSMSizeSet(enabled bool, key string, val expvar.Var) {\n\tstoreToMap(enabled, lsmSize, key, val)\n}\n\nfunc VlogSizeSet(enabled bool, key string, val expvar.Var) {\n\tstoreToMap(enabled, vlogSize, key, val)\n}\n\nfunc PendingWritesSet(enabled bool, key string, val expvar.Var) {\n\tstoreToMap(enabled, pendingWrites, key, val)\n}\n\nfunc NumLSMBloomHitsAdd(enabled bool, key string, val int64) {\n\taddToMap(enabled, numLSMBloomHits, key, val)\n}\n\nfunc NumLSMGetsAdd(enabled bool, key string, val int64) {\n\taddToMap(enabled, numLSMGets, key, val)\n}\n\nfunc LSMSizeGet(enabled bool, key string) expvar.Var {\n\treturn getFromMap(enabled, 
lsmSize, key)\n}\n\nfunc VlogSizeGet(enabled bool, key string) expvar.Var {\n\treturn getFromMap(enabled, vlogSize, key)\n}\n\nfunc addInt(enabled bool, metric *expvar.Int, val int64) {\n\tif !enabled {\n\t\treturn\n\t}\n\n\tmetric.Add(val)\n}\n\nfunc addToMap(enabled bool, metric *expvar.Map, key string, val int64) {\n\tif !enabled {\n\t\treturn\n\t}\n\n\tmetric.Add(key, val)\n}\n\nfunc storeToMap(enabled bool, metric *expvar.Map, key string, val expvar.Var) {\n\tif !enabled {\n\t\treturn\n\t}\n\n\tmetric.Set(key, val)\n}\n\nfunc getFromMap(enabled bool, metric *expvar.Map, key string) expvar.Var {\n\tif !enabled {\n\t\treturn nil\n\t}\n\n\treturn metric.Get(key)\n}\n"
  },
  {
    "path": "y/watermark.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport (\n\t\"container/heap\"\n\t\"context\"\n\t\"sync/atomic\"\n\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\ntype uint64Heap []uint64\n\nfunc (u uint64Heap) Len() int            { return len(u) }\nfunc (u uint64Heap) Less(i, j int) bool  { return u[i] < u[j] }\nfunc (u uint64Heap) Swap(i, j int)       { u[i], u[j] = u[j], u[i] }\nfunc (u *uint64Heap) Push(x interface{}) { *u = append(*u, x.(uint64)) }\nfunc (u *uint64Heap) Pop() interface{} {\n\told := *u\n\tn := len(old)\n\tx := old[n-1]\n\t*u = old[0 : n-1]\n\treturn x\n}\n\n// mark contains one of more indices, along with a done boolean to indicate the\n// status of the index: begin or done. It also contains waiters, who could be\n// waiting for the watermark to reach >= a certain index.\ntype mark struct {\n\t// Either this is an (index, waiter) pair or (index, done) or (indices, done).\n\tindex   uint64\n\twaiter  chan struct{}\n\tindices []uint64\n\tdone    bool // Set to true if the index is done.\n}\n\n// WaterMark is used to keep track of the minimum un-finished index.  Typically, an index k becomes\n// finished or \"done\" according to a WaterMark once Done(k) has been called\n//  1. as many times as Begin(k) has, AND\n//  2. a positive number of times.\n//\n// An index may also become \"done\" by calling SetDoneUntil at a time such that it is not\n// inter-mingled with Begin/Done calls.\n//\n// Since doneUntil and lastIndex addresses are passed to sync/atomic packages, we ensure that they\n// are 64-bit aligned by putting them at the beginning of the structure.\ntype WaterMark struct {\n\tdoneUntil atomic.Uint64\n\tlastIndex atomic.Uint64\n\tName      string\n\tmarkCh    chan mark\n}\n\n// Init initializes a WaterMark struct. 
MUST be called before using it.\nfunc (w *WaterMark) Init(closer *z.Closer) {\n\tw.markCh = make(chan mark, 100)\n\tgo w.process(closer)\n}\n\n// Begin sets the last index to the given value.\nfunc (w *WaterMark) Begin(index uint64) {\n\tw.lastIndex.Store(index)\n\tw.markCh <- mark{index: index, done: false}\n}\n\n// BeginMany works like Begin but accepts multiple indices.\nfunc (w *WaterMark) BeginMany(indices []uint64) {\n\tw.lastIndex.Store(indices[len(indices)-1])\n\tw.markCh <- mark{index: 0, indices: indices, done: false}\n}\n\n// Done sets a single index as done.\nfunc (w *WaterMark) Done(index uint64) {\n\tw.markCh <- mark{index: index, done: true}\n}\n\n// DoneMany works like Done but accepts multiple indices.\nfunc (w *WaterMark) DoneMany(indices []uint64) {\n\tw.markCh <- mark{index: 0, indices: indices, done: true}\n}\n\n// DoneUntil returns the maximum index that has the property that all indices\n// less than or equal to it are done.\nfunc (w *WaterMark) DoneUntil() uint64 {\n\treturn w.doneUntil.Load()\n}\n\n// SetDoneUntil sets the maximum index that has the property that all indices\n// less than or equal to it are done.\nfunc (w *WaterMark) SetDoneUntil(val uint64) {\n\tw.doneUntil.Store(val)\n}\n\n// LastIndex returns the last index for which Begin has been called.\nfunc (w *WaterMark) LastIndex() uint64 {\n\treturn w.lastIndex.Load()\n}\n\n// WaitForMark waits until the given index is marked as done.\nfunc (w *WaterMark) WaitForMark(ctx context.Context, index uint64) error {\n\tif w.DoneUntil() >= index {\n\t\treturn nil\n\t}\n\twaitCh := make(chan struct{})\n\tw.markCh <- mark{index: index, waiter: waitCh}\n\n\tselect {\n\tcase <-ctx.Done():\n\t\treturn ctx.Err()\n\tcase <-waitCh:\n\t\treturn nil\n\t}\n}\n\n// process is used to process the Mark channel. This is not thread-safe,\n// so only run one goroutine for process. 
One is sufficient, because\n// all goroutine ops use purely memory and cpu.\n// Each index has to emit at least one begin watermark in serial order otherwise waiters\n// can get blocked indefinitely. Example: We had a watermark at 100 and a waiter at 101,\n// if no watermark is emitted at index 101 then waiter would get stuck indefinitely as it\n// can't decide whether the task at 101 has decided not to emit watermark or it didn't get\n// scheduled yet.\nfunc (w *WaterMark) process(closer *z.Closer) {\n\tdefer closer.Done()\n\n\tvar indices uint64Heap\n\t// pending maps raft proposal index to the number of pending mutations for this proposal.\n\tpending := make(map[uint64]int)\n\twaiters := make(map[uint64][]chan struct{})\n\n\theap.Init(&indices)\n\n\tprocessOne := func(index uint64, done bool) {\n\t\t// If not already done, then set. Otherwise, don't undo a done entry.\n\t\tprev, present := pending[index]\n\t\tif !present {\n\t\t\theap.Push(&indices, index)\n\t\t}\n\n\t\tdelta := 1\n\t\tif done {\n\t\t\tdelta = -1\n\t\t}\n\t\tpending[index] = prev + delta\n\n\t\t// Update mark by going through all indices in order; and checking if they have\n\t\t// been done. Stop at the first index, which isn't done.\n\t\tdoneUntil := w.DoneUntil()\n\t\tif doneUntil > index {\n\t\t\tAssertTruef(false, \"Name: %s doneUntil: %d. 
Index: %d\", w.Name, doneUntil, index)\n\t\t}\n\n\t\tuntil := doneUntil\n\t\tloops := 0\n\n\t\tfor len(indices) > 0 {\n\t\t\tmin := indices[0]\n\t\t\tif done := pending[min]; done > 0 {\n\t\t\t\tbreak // len(indices) will be > 0.\n\t\t\t}\n\t\t\t// Even if done is called multiple times causing it to become\n\t\t\t// negative, we should still pop the index.\n\t\t\theap.Pop(&indices)\n\t\t\tdelete(pending, min)\n\t\t\tuntil = min\n\t\t\tloops++\n\t\t}\n\n\t\tif until != doneUntil {\n\t\t\tAssertTrue(w.doneUntil.CompareAndSwap(doneUntil, until))\n\t\t}\n\n\t\tnotifyAndRemove := func(idx uint64, toNotify []chan struct{}) {\n\t\t\tfor _, ch := range toNotify {\n\t\t\t\tclose(ch)\n\t\t\t}\n\t\t\tdelete(waiters, idx) // Release the memory back.\n\t\t}\n\n\t\tif until-doneUntil <= uint64(len(waiters)) {\n\t\t\t// Issue #908 showed that if doneUntil is close to 2^60, while until is zero, this loop\n\t\t\t// can hog up CPU just iterating over integers creating a busy-wait loop. So, only do\n\t\t\t// this path if until - doneUntil is less than the number of waiters.\n\t\t\tfor idx := doneUntil + 1; idx <= until; idx++ {\n\t\t\t\tif toNotify, ok := waiters[idx]; ok {\n\t\t\t\t\tnotifyAndRemove(idx, toNotify)\n\t\t\t\t}\n\t\t\t}\n\t\t} else {\n\t\t\tfor idx, toNotify := range waiters {\n\t\t\t\tif idx <= until {\n\t\t\t\t\tnotifyAndRemove(idx, toNotify)\n\t\t\t\t}\n\t\t\t}\n\t\t} // end of notifying waiters.\n\t}\n\n\tfor {\n\t\tselect {\n\t\tcase <-closer.HasBeenClosed():\n\t\t\treturn\n\t\tcase mark := <-w.markCh:\n\t\t\tif mark.waiter != nil {\n\t\t\t\tdoneUntil := w.doneUntil.Load()\n\t\t\t\tif doneUntil >= mark.index {\n\t\t\t\t\tclose(mark.waiter)\n\t\t\t\t} else {\n\t\t\t\t\tws, ok := waiters[mark.index]\n\t\t\t\t\tif !ok {\n\t\t\t\t\t\twaiters[mark.index] = []chan struct{}{mark.waiter}\n\t\t\t\t\t} else {\n\t\t\t\t\t\twaiters[mark.index] = append(ws, mark.waiter)\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// it is possible that mark.index is zero. 
We need to handle that case as well.\n\t\t\t\tif mark.index > 0 || (mark.index == 0 && len(mark.indices) == 0) {\n\t\t\t\t\tprocessOne(mark.index, mark.done)\n\t\t\t\t}\n\t\t\t\tfor _, index := range mark.indices {\n\t\t\t\t\tprocessOne(index, mark.done)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n}\n"
  },
  {
    "path": "y/y.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport (\n\t\"bytes\"\n\t\"encoding/binary\"\n\tstderrors \"errors\"\n\t\"fmt\"\n\t\"hash/crc32\"\n\t\"io\"\n\t\"math\"\n\t\"os\"\n\t\"reflect\"\n\t\"strconv\"\n\t\"sync\"\n\t\"time\"\n\t\"unsafe\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nvar (\n\t// ErrEOF indicates an end of file when trying to read from a memory mapped file\n\t// and encountering the end of slice.\n\tErrEOF = stderrors.New(\"ErrEOF: End of file\")\n\n\t// ErrCommitAfterFinish indicates that write batch commit was called after\n\t// finish\n\tErrCommitAfterFinish = stderrors.New(\"Batch commit not permitted after finish\")\n)\n\ntype Flags int\n\nconst (\n\t// Sync indicates that O_DSYNC should be set on the underlying file,\n\t// ensuring that data writes do not return until the data is flushed\n\t// to disk.\n\tSync Flags = 1 << iota\n\t// ReadOnly opens the underlying file on a read-only basis.\n\tReadOnly\n)\n\nvar (\n\t// This is O_DSYNC (datasync) on platforms that support it -- see file_unix.go\n\tdatasyncFileFlag = 0x0\n\n\t// CastagnoliCrcTable is a CRC32 polynomial table\n\tCastagnoliCrcTable = crc32.MakeTable(crc32.Castagnoli)\n)\n\n// OpenExistingFile opens an existing file, errors if it doesn't exist.\nfunc OpenExistingFile(filename string, flags Flags) (*os.File, error) {\n\topenFlags := os.O_RDWR\n\tif flags&ReadOnly != 0 {\n\t\topenFlags = os.O_RDONLY\n\t}\n\n\tif flags&Sync != 0 {\n\t\topenFlags |= datasyncFileFlag\n\t}\n\treturn os.OpenFile(filename, openFlags, 0)\n}\n\n// CreateSyncedFile creates a new file (using O_EXCL), errors if it already existed.\nfunc CreateSyncedFile(filename string, sync bool) (*os.File, error) {\n\tflags := os.O_RDWR | os.O_CREATE | os.O_EXCL\n\tif sync {\n\t\tflags |= datasyncFileFlag\n\t}\n\treturn os.OpenFile(filename, flags, 0600)\n}\n\n// OpenSyncedFile 
creates the file if one doesn't exist.\nfunc OpenSyncedFile(filename string, sync bool) (*os.File, error) {\n\tflags := os.O_RDWR | os.O_CREATE\n\tif sync {\n\t\tflags |= datasyncFileFlag\n\t}\n\treturn os.OpenFile(filename, flags, 0600)\n}\n\n// OpenTruncFile opens the file with O_RDWR | O_CREATE | O_TRUNC\nfunc OpenTruncFile(filename string, sync bool) (*os.File, error) {\n\tflags := os.O_RDWR | os.O_CREATE | os.O_TRUNC\n\tif sync {\n\t\tflags |= datasyncFileFlag\n\t}\n\treturn os.OpenFile(filename, flags, 0600)\n}\n\n// SafeCopy does append(a[:0], src...).\nfunc SafeCopy(a, src []byte) []byte {\n\tb := append(a[:0], src...)\n\tif b == nil {\n\t\treturn []byte{}\n\t}\n\treturn b\n}\n\n// Copy copies a byte slice and returns the copied slice.\nfunc Copy(a []byte) []byte {\n\tb := make([]byte, len(a))\n\tcopy(b, a)\n\treturn b\n}\n\n// KeyWithTs generates a new key by appending ts to key.\nfunc KeyWithTs(key []byte, ts uint64) []byte {\n\tout := make([]byte, len(key)+8)\n\tcopy(out, key)\n\tbinary.BigEndian.PutUint64(out[len(key):], math.MaxUint64-ts)\n\treturn out\n}\n\n// ParseTs parses the timestamp from the key bytes.\nfunc ParseTs(key []byte) uint64 {\n\tif len(key) <= 8 {\n\t\treturn 0\n\t}\n\treturn math.MaxUint64 - binary.BigEndian.Uint64(key[len(key)-8:])\n}\n\n// CompareKeys checks the key without timestamp and checks the timestamp if keyNoTs\n// is same.\n// a<timestamp> would be sorted higher than aa<timestamp> if we use bytes.compare\n// All keys should have timestamp.\nfunc CompareKeys(key1, key2 []byte) int {\n\tif cmp := bytes.Compare(key1[:len(key1)-8], key2[:len(key2)-8]); cmp != 0 {\n\t\treturn cmp\n\t}\n\treturn bytes.Compare(key1[len(key1)-8:], key2[len(key2)-8:])\n}\n\n// ParseKey parses the actual key from the key bytes.\nfunc ParseKey(key []byte) []byte {\n\tif len(key) < 8 {\n\t\treturn nil\n\t}\n\n\treturn key[:len(key)-8]\n}\n\n// SameKey checks for key equality ignoring the version timestamp suffix.\nfunc SameKey(src, dst []byte) bool 
{\n\tif len(src) != len(dst) {\n\t\treturn false\n\t}\n\treturn bytes.Equal(ParseKey(src), ParseKey(dst))\n}\n\n// Slice holds a reusable buf, will reallocate if you request a larger size than ever before.\n// One problem is with n distinct sizes in random order it'll reallocate log(n) times.\ntype Slice struct {\n\tbuf []byte\n}\n\n// Resize reuses the Slice's buffer (or makes a new one) and returns a slice in that buffer of\n// length sz.\nfunc (s *Slice) Resize(sz int) []byte {\n\tif cap(s.buf) < sz {\n\t\ts.buf = make([]byte, sz)\n\t}\n\treturn s.buf[0:sz]\n}\n\n// FixedDuration returns a string representation of the given duration with the\n// hours, minutes, and seconds.\nfunc FixedDuration(d time.Duration) string {\n\tstr := fmt.Sprintf(\"%02ds\", int(d.Seconds())%60)\n\tif d >= time.Minute {\n\t\tstr = fmt.Sprintf(\"%02dm\", int(d.Minutes())%60) + str\n\t}\n\tif d >= time.Hour {\n\t\tstr = fmt.Sprintf(\"%02dh\", int(d.Hours())) + str\n\t}\n\treturn str\n}\n\n// Throttle allows a limited number of workers to run at a time. It also\n// provides a mechanism to check for errors encountered by workers and wait for\n// them to finish.\ntype Throttle struct {\n\tonce      sync.Once\n\twg        sync.WaitGroup\n\tch        chan struct{}\n\terrCh     chan error\n\tfinishErr error\n}\n\n// NewThrottle creates a new throttle with a max number of workers.\nfunc NewThrottle(max int) *Throttle {\n\treturn &Throttle{\n\t\tch:    make(chan struct{}, max),\n\t\terrCh: make(chan error, max),\n\t}\n}\n\n// Do should be called by workers before they start working. It blocks if there\n// are already maximum number of workers working. 
If it detects an error from\n// previously Done workers, it would return it.\nfunc (t *Throttle) Do() error {\n\tfor {\n\t\tselect {\n\t\tcase t.ch <- struct{}{}:\n\t\t\tt.wg.Add(1)\n\t\t\treturn nil\n\t\tcase err := <-t.errCh:\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\t}\n}\n\n// Done should be called by workers when they finish working. They can also\n// pass the error status of work done.\nfunc (t *Throttle) Done(err error) {\n\tif err != nil {\n\t\tt.errCh <- err\n\t}\n\tselect {\n\tcase <-t.ch:\n\tdefault:\n\t\tpanic(\"Throttle Do Done mismatch\")\n\t}\n\tt.wg.Done()\n}\n\n// Finish waits until all workers have finished working. It returns any error passed by Done.\n// If Finish is called multiple times, it waits for workers to finish only once (the first time).\n// Subsequent calls return the same error as the first call.\nfunc (t *Throttle) Finish() error {\n\tt.once.Do(func() {\n\t\tt.wg.Wait()\n\t\tclose(t.ch)\n\t\tclose(t.errCh)\n\t\tfor err := range t.errCh {\n\t\t\tif err != nil {\n\t\t\t\tt.finishErr = err\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t})\n\n\treturn t.finishErr\n}\n\n// U16ToBytes converts the given Uint16 to bytes\nfunc U16ToBytes(v uint16) []byte {\n\tvar uBuf [2]byte\n\tbinary.BigEndian.PutUint16(uBuf[:], v)\n\treturn uBuf[:]\n}\n\n// BytesToU16 converts the given byte slice to uint16\nfunc BytesToU16(b []byte) uint16 {\n\treturn binary.BigEndian.Uint16(b)\n}\n\n// U32ToBytes converts the given Uint32 to bytes\nfunc U32ToBytes(v uint32) []byte {\n\tvar uBuf [4]byte\n\tbinary.BigEndian.PutUint32(uBuf[:], v)\n\treturn uBuf[:]\n}\n\n// BytesToU32 converts the given byte slice to uint32\nfunc BytesToU32(b []byte) uint32 {\n\treturn binary.BigEndian.Uint32(b)\n}\n\n// U32SliceToBytes converts the given Uint32 slice to byte slice\nfunc U32SliceToBytes(u32s []uint32) []byte {\n\tif len(u32s) == 0 {\n\t\treturn nil\n\t}\n\tvar b []byte\n\thdr := (*reflect.SliceHeader)(unsafe.Pointer(&b))\n\thdr.Len = len(u32s) * 
4\n\thdr.Cap = hdr.Len\n\thdr.Data = uintptr(unsafe.Pointer(&u32s[0]))\n\treturn b\n}\n\n// BytesToU32Slice converts the given byte slice to uint32 slice\nfunc BytesToU32Slice(b []byte) []uint32 {\n\tif len(b) == 0 {\n\t\treturn nil\n\t}\n\tvar u32s []uint32\n\thdr := (*reflect.SliceHeader)(unsafe.Pointer(&u32s))\n\thdr.Len = len(b) / 4\n\thdr.Cap = hdr.Len\n\thdr.Data = uintptr(unsafe.Pointer(&b[0]))\n\treturn u32s\n}\n\n// U64ToBytes converts the given Uint64 to bytes\nfunc U64ToBytes(v uint64) []byte {\n\tvar uBuf [8]byte\n\tbinary.BigEndian.PutUint64(uBuf[:], v)\n\treturn uBuf[:]\n}\n\n// BytesToU64 converts the given byte slice to uint64\nfunc BytesToU64(b []byte) uint64 {\n\treturn binary.BigEndian.Uint64(b)\n}\n\n// U64SliceToBytes converts the given Uint64 slice to byte slice\nfunc U64SliceToBytes(u64s []uint64) []byte {\n\tif len(u64s) == 0 {\n\t\treturn nil\n\t}\n\tvar b []byte\n\thdr := (*reflect.SliceHeader)(unsafe.Pointer(&b))\n\thdr.Len = len(u64s) * 8\n\thdr.Cap = hdr.Len\n\thdr.Data = uintptr(unsafe.Pointer(&u64s[0]))\n\treturn b\n}\n\n// BytesToU64Slice converts the given byte slice to uint64 slice\nfunc BytesToU64Slice(b []byte) []uint64 {\n\tif len(b) == 0 {\n\t\treturn nil\n\t}\n\tvar u64s []uint64\n\thdr := (*reflect.SliceHeader)(unsafe.Pointer(&u64s))\n\thdr.Len = len(b) / 8\n\thdr.Cap = hdr.Len\n\thdr.Data = uintptr(unsafe.Pointer(&b[0]))\n\treturn u64s\n}\n\n// page struct contains one underlying buffer.\ntype page struct {\n\tbuf []byte\n}\n\n// PageBuffer consists of many pages. A page is a wrapper over []byte. PageBuffer can act as a\n// replacement of bytes.Buffer. Instead of having single underlying buffer, it has multiple\n// underlying buffers. Hence it avoids any copy during relocation(as happens in bytes.Buffer).\n// PageBuffer allocates memory in pages. Once a page is full, it will allocate page with double the\n// size of previous page. 
Its functions are not thread-safe.\ntype PageBuffer struct {\n\tpages []*page\n\n\tlength       int // Length of PageBuffer.\n\tnextPageSize int // Size of next page to be allocated.\n}\n\n// NewPageBuffer returns a new PageBuffer with first page having size pageSize.\nfunc NewPageBuffer(pageSize int) *PageBuffer {\n\tb := &PageBuffer{}\n\tb.pages = append(b.pages, &page{buf: make([]byte, 0, pageSize)})\n\tb.nextPageSize = pageSize * 2\n\treturn b\n}\n\n// Write writes data to PageBuffer b. It returns number of bytes written and any error encountered.\nfunc (b *PageBuffer) Write(data []byte) (int, error) {\n\tdataLen := len(data)\n\tfor {\n\t\tcp := b.pages[len(b.pages)-1] // Current page.\n\n\t\tn := copy(cp.buf[len(cp.buf):cap(cp.buf)], data)\n\t\tcp.buf = cp.buf[:len(cp.buf)+n]\n\t\tb.length += n\n\n\t\tif len(data) == n {\n\t\t\tbreak\n\t\t}\n\t\tdata = data[n:]\n\n\t\tb.pages = append(b.pages, &page{buf: make([]byte, 0, b.nextPageSize)})\n\t\tb.nextPageSize *= 2\n\t}\n\n\treturn dataLen, nil\n}\n\n// WriteByte writes data byte to PageBuffer and returns any encountered error.\nfunc (b *PageBuffer) WriteByte(data byte) error {\n\t_, err := b.Write([]byte{data})\n\treturn err\n}\n\n// Len returns length of PageBuffer.\nfunc (b *PageBuffer) Len() int {\n\treturn b.length\n}\n\n// pageForOffset returns pageIdx and startIdx for the offset.\nfunc (b *PageBuffer) pageForOffset(offset int) (int, int) {\n\tAssertTrue(offset < b.length)\n\n\tvar pageIdx, startIdx, sizeNow int\n\tfor i := 0; i < len(b.pages); i++ {\n\t\tcp := b.pages[i]\n\n\t\tif sizeNow+len(cp.buf)-1 < offset {\n\t\t\tsizeNow += len(cp.buf)\n\t\t} else {\n\t\t\tpageIdx = i\n\t\t\tstartIdx = offset - sizeNow\n\t\t\tbreak\n\t\t}\n\t}\n\n\treturn pageIdx, startIdx\n}\n\n// Truncate truncates PageBuffer to length n.\nfunc (b *PageBuffer) Truncate(n int) {\n\tpageIdx, startIdx := b.pageForOffset(n)\n\t// For simplicity of the code reject extra pages. 
These pages can be kept.\n\tb.pages = b.pages[:pageIdx+1]\n\tcp := b.pages[len(b.pages)-1]\n\tcp.buf = cp.buf[:startIdx]\n\tb.length = n\n}\n\n// Bytes returns whole Buffer data as single []byte.\nfunc (b *PageBuffer) Bytes() []byte {\n\tbuf := make([]byte, b.length)\n\twritten := 0\n\tfor i := 0; i < len(b.pages); i++ {\n\t\twritten += copy(buf[written:], b.pages[i].buf)\n\t}\n\n\treturn buf\n}\n\n// WriteTo writes whole buffer to w. It returns number of bytes written and any error encountered.\nfunc (b *PageBuffer) WriteTo(w io.Writer) (int64, error) {\n\twritten := int64(0)\n\tfor i := 0; i < len(b.pages); i++ {\n\t\tn, err := w.Write(b.pages[i].buf)\n\t\twritten += int64(n)\n\t\tif err != nil {\n\t\t\treturn written, err\n\t\t}\n\t}\n\n\treturn written, nil\n}\n\n// NewReaderAt returns a reader which starts reading from offset in page buffer.\nfunc (b *PageBuffer) NewReaderAt(offset int) *PageBufferReader {\n\tpageIdx, startIdx := b.pageForOffset(offset)\n\n\treturn &PageBufferReader{\n\t\tbuf:      b,\n\t\tpageIdx:  pageIdx,\n\t\tstartIdx: startIdx,\n\t}\n}\n\n// PageBufferReader is a reader for PageBuffer.\ntype PageBufferReader struct {\n\tbuf      *PageBuffer // Underlying page buffer.\n\tpageIdx  int         // Idx of page from where it will start reading.\n\tstartIdx int         // Idx inside page - buf.pages[pageIdx] from where it will start reading.\n}\n\n// Read reads up to len(p) bytes. It returns number of bytes read and any error encountered.\nfunc (r *PageBufferReader) Read(p []byte) (int, error) {\n\t// Check if there is enough to Read.\n\tpc := len(r.buf.pages)\n\n\tread := 0\n\tfor r.pageIdx < pc && read < len(p) {\n\t\tcp := r.buf.pages[r.pageIdx] // Current Page.\n\t\tendIdx := len(cp.buf)        // Last Idx up to which we can read from this page.\n\n\t\tn := copy(p[read:], cp.buf[r.startIdx:endIdx])\n\t\tread += n\n\t\tr.startIdx += n\n\n\t\t// Instead of len(cp.buf), we compare with cap(cp.buf). 
This ensures that we move to next\n\t\t// page only when we have read all data. Reading from last page is an edge case. We don't\n\t\t// want to move to next page until last page is full to its capacity.\n\t\tif r.startIdx >= cap(cp.buf) {\n\t\t\t// We should move to next page.\n\t\t\tr.pageIdx++\n\t\t\tr.startIdx = 0\n\t\t\tcontinue\n\t\t}\n\n\t\t// When last page is not full to its capacity and we have read all data up to its\n\t\t// length, just break out of the loop.\n\t\tif r.pageIdx == pc-1 {\n\t\t\tbreak\n\t\t}\n\t}\n\n\tif read == 0 && len(p) > 0 {\n\t\treturn read, io.EOF\n\t}\n\n\treturn read, nil\n}\n\nconst kvsz = int(unsafe.Sizeof(pb.KV{}))\n\n// NewKV returns a pb.KV backed by memory from alloc, or a heap-allocated pb.KV if alloc is nil.\nfunc NewKV(alloc *z.Allocator) *pb.KV {\n\tif alloc == nil {\n\t\treturn &pb.KV{}\n\t}\n\tb := alloc.AllocateAligned(kvsz)\n\treturn (*pb.KV)(unsafe.Pointer(&b[0]))\n}\n\n// IBytesToString converts size in bytes to human readable format.\n// The code is taken from humanize library and changed to provide\n// value up to custom decimal precision.\n// IBytesToString(12312412, 1) -> 11.7 MiB\nfunc IBytesToString(size uint64, precision int) string {\n\tsizes := []string{\"B\", \"KiB\", \"MiB\", \"GiB\", \"TiB\", \"PiB\", \"EiB\"}\n\tbase := float64(1024)\n\tif size < 10 {\n\t\treturn fmt.Sprintf(\"%d B\", size)\n\t}\n\te := math.Floor(math.Log(float64(size)) / math.Log(base))\n\tsuffix := sizes[int(e)]\n\tval := float64(size) / math.Pow(base, e)\n\tf := \"%.\" + strconv.Itoa(precision) + \"f %s\"\n\n\treturn fmt.Sprintf(f, val, suffix)\n}\n\n// RateMonitor keeps a ring of recent transmission-rate samples.\ntype RateMonitor struct {\n\tstart       time.Time\n\tlastSent    uint64\n\tlastCapture time.Time\n\trates       []float64\n\tidx         int\n}\n\n// NewRateMonitor returns a RateMonitor that holds up to numSamples rate samples.\nfunc NewRateMonitor(numSamples int) *RateMonitor {\n\treturn &RateMonitor{\n\t\tstart: time.Now(),\n\t\trates: make([]float64, numSamples),\n\t}\n}\n\nconst minRate = 0.0001\n\n// Capture captures the current number of sent bytes. 
This number should be monotonically\n// increasing.\nfunc (rm *RateMonitor) Capture(sent uint64) {\n\tdiff := sent - rm.lastSent\n\tdur := time.Since(rm.lastCapture)\n\trm.lastCapture, rm.lastSent = time.Now(), sent\n\n\trate := float64(diff) / dur.Seconds()\n\tif rate < minRate {\n\t\trate = minRate\n\t}\n\trm.rates[rm.idx] = rate\n\trm.idx = (rm.idx + 1) % len(rm.rates)\n}\n\n// Rate returns the average rate of transmission smoothed out by the number of samples.\nfunc (rm *RateMonitor) Rate() uint64 {\n\tvar total float64\n\tvar den float64\n\tfor _, r := range rm.rates {\n\t\tif r < minRate {\n\t\t\t// Ignore this. We always set minRate, so this is a zero.\n\t\t\t// Typically at the start of the rate monitor, we'd have zeros.\n\t\t\tcontinue\n\t\t}\n\t\ttotal += r\n\t\tden += 1.0\n\t}\n\tif den < minRate {\n\t\treturn 0\n\t}\n\treturn uint64(total / den)\n}\n"
  },
  {
    "path": "y/y_test.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport (\n\t\"bytes\"\n\t\"encoding/binary\"\n\t\"fmt\"\n\t\"io\"\n\t\"math/rand\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/stretchr/testify/require\"\n\t\"google.golang.org/protobuf/proto\"\n\n\t\"github.com/dgraph-io/badger/v4/pb\"\n\t\"github.com/dgraph-io/ristretto/v2/z\"\n)\n\nfunc BenchmarkBuffer(b *testing.B) {\n\tvar btw [1024]byte\n\trand.Read(btw[:])\n\n\tpageSize := 1024\n\n\tb.Run(\"bytes-buffer\", func(b *testing.B) {\n\t\tbuf := new(bytes.Buffer)\n\t\tbuf.Grow(pageSize)\n\n\t\tfor i := 0; i < b.N; i++ {\n\t\t\tbuf.Write(btw[:])\n\t\t}\n\t})\n\n\tb.Run(\"page-buffer\", func(b *testing.B) {\n\t\tb.Run(fmt.Sprintf(\"page-size-%d\", pageSize), func(b *testing.B) {\n\t\t\tpageBuffer := NewPageBuffer(pageSize)\n\t\t\tfor i := 0; i < b.N; i++ {\n\t\t\t\t_, _ = pageBuffer.Write(btw[:])\n\t\t\t}\n\t\t})\n\t})\n}\n\nfunc TestPageBuffer(t *testing.T) {\n\trand.Seed(time.Now().Unix())\n\n\tvar bytesBuffer bytes.Buffer // This is just for verifying result.\n\tbytesBuffer.Grow(512)\n\n\tpageBuffer := NewPageBuffer(512)\n\n\t// Writer small []byte\n\tvar smallBytes [256]byte\n\trand.Read(smallBytes[:])\n\tvar bigBytes [1024]byte\n\trand.Read(bigBytes[:])\n\n\t_, err := pageBuffer.Write(smallBytes[:])\n\trequire.NoError(t, err, \"unable to write data to page buffer\")\n\t_, err = pageBuffer.Write(bigBytes[:])\n\trequire.NoError(t, err, \"unable to write data to page buffer\")\n\n\t// Write data to bytesBuffer also, just to match result.\n\tbytesBuffer.Write(smallBytes[:])\n\tbytesBuffer.Write(bigBytes[:])\n\n\trequire.True(t, bytes.Equal(pageBuffer.Bytes(), bytesBuffer.Bytes()))\n}\n\nfunc TestBufferWrite(t *testing.T) {\n\trand.Seed(time.Now().Unix())\n\n\tvar wb [128]byte\n\trand.Read(wb[:])\n\n\tpb := NewPageBuffer(32)\n\tbb := new(bytes.Buffer)\n\n\tend := 32\n\tfor i := 0; i < 3; i++ {\n\t\tn, err := 
pb.Write(wb[:end])\n\t\trequire.NoError(t, err, \"unable to write bytes to buffer\")\n\t\trequire.Equal(t, n, end, \"length of buffer and length written should be equal\")\n\n\t\t// append to bb also for testing.\n\t\tbb.Write(wb[:end])\n\n\t\trequire.True(t, bytes.Equal(pb.Bytes(), bb.Bytes()), \"Both bytes should match\")\n\t\tend = end * 2\n\t}\n}\n\nfunc TestPagebufferTruncate(t *testing.T) {\n\trand.Seed(time.Now().Unix())\n\n\tvar wb [1024]byte\n\trand.Read(wb[:])\n\n\tb := NewPageBuffer(32)\n\tn, err := b.Write(wb[:])\n\trequire.Equal(t, n, len(wb), \"length of buffer and length written should be equal\")\n\trequire.NoError(t, err, \"unable to write bytes to buffer\")\n\n\trequire.True(t, bytes.Equal(wb[:], b.Bytes()), \"bytes written and read should be equal\")\n\n\t// Truncate to 512.\n\tb.Truncate(512)\n\trequire.True(t, bytes.Equal(b.Bytes(), wb[:512]))\n\n\t// Again write wb.\n\tn, err = b.Write(wb[:])\n\trequire.Equal(t, n, len(wb), \"length of buffer and length written should be equal\")\n\trequire.NoError(t, err, \"unable to write bytes to buffer\")\n\n\t// Truncate to 1000.\n\tb.Truncate(1000)\n\trequire.True(t, bytes.Equal(b.Bytes(), append(wb[:512], wb[:]...)[:1000]))\n}\n\n// Test PageBufferReader using large buffers.\nfunc TestPagebufferReader(t *testing.T) {\n\trand.Seed(time.Now().Unix())\n\n\tvar wb [1024]byte\n\trand.Read(wb[:])\n\n\tb := NewPageBuffer(32)\n\tn, err := b.Write(wb[:])\n\trequire.Equal(t, n, len(wb), \"length of buffer and length written should be equal\")\n\trequire.NoError(t, err, \"unable to write bytes to buffer\")\n\t// Also append some bytes so that last page is not full.\n\tn, err = b.Write(wb[:10])\n\trequire.Equal(t, n, 10, \"length of buffer and length written should be equal\")\n\trequire.NoError(t, err, \"unable to write bytes to buffer\")\n\n\treader := b.NewReaderAt(0)\n\t// Read first 512 bytes.\n\tvar rb [512]byte\n\tn, err = reader.Read(rb[:])\n\trequire.NoError(t, err, \"unable to read 
error\")\n\trequire.True(t, n == len(rb), \"length read should be equal\")\n\t// Match if read bytes are correct or not.\n\trb2 := b.Bytes()[:512]\n\trequire.True(t, bytes.Equal(rb[:], rb2))\n\n\t// Next read using reader.\n\tn, err = reader.Read(rb[:])\n\trequire.NoError(t, err, \"unable to read error\")\n\trequire.True(t, n == len(rb), \"length read should be equal\")\n\t// Read same number of bytes using Bytes method.\n\trb2 = b.Bytes()[512:1024]\n\trequire.True(t, bytes.Equal(rb[:], rb2))\n\n\t// Next read using reader for reading last 10 bytes.\n\tn, err = reader.Read(rb[:10])\n\trequire.NoError(t, err, \"unable to read error\")\n\trequire.True(t, n == 10, \"length read should be equal\")\n\t// Read same number of bytes using Bytes method.\n\trb2 = b.Bytes()[1024 : 1024+10]\n\trequire.True(t, bytes.Equal(rb[:10], rb2))\n\n\t// Check if EOF is returned at end or not.\n\tn, err = reader.Read(rb[:10])\n\trequire.Equal(t, err, io.EOF, \"EOF should be returned at end\")\n\trequire.Zero(t, n, \"read length should be 0\")\n}\n\n// Test PageBuffer by reading at random offset, random length.\nfunc TestPagebufferReader2(t *testing.T) {\n\trand.Seed(time.Now().Unix())\n\n\tvar wb [1024]byte\n\trand.Read(wb[:])\n\n\tb := NewPageBuffer(32)\n\tn, err := b.Write(wb[:])\n\trequire.Equal(t, n, len(wb), \"length of buffer and length written should be equal\")\n\trequire.NoError(t, err, \"unable to write bytes to buffer\")\n\t// Also append some bytes so that last page is not full.\n\tn, err = b.Write(wb[:10])\n\trequire.Equal(t, n, 10, \"length of buffer and length written should be equal\")\n\trequire.NoError(t, err, \"unable to write bytes to buffer\")\n\n\trandOffset := int(rand.Int31n(int32(b.length) - 1))\n\trandLength := int(rand.Int31n(int32(b.length - randOffset)))\n\treader := b.NewReaderAt(randOffset)\n\t// Read randLength bytes.\n\trb := make([]byte, randLength)\n\tn, err = reader.Read(rb[:])\n\trequire.NoError(t, err, \"unable to read error\")\n\trequire.True(t, n 
== len(rb), \"length read should be equal\")\n\t// Read same number of bytes using Bytes method.\n\trb2 := b.Bytes()[randOffset : randOffset+randLength]\n\trequire.True(t, bytes.Equal(rb[:], rb2))\n}\n\n// Test PageBuffer while reading multiple chunks. Chunks are smaller than pages of PageBuffer.\nfunc TestPagebufferReader3(t *testing.T) {\n\trand.Seed(time.Now().Unix())\n\n\tvar wb [1000]byte\n\trand.Read(wb[:])\n\n\tb := NewPageBuffer(32)\n\tn, err := b.Write(wb[:])\n\trequire.Equal(t, n, len(wb), \"length of buffer and length written should be equal\")\n\trequire.NoError(t, err, \"unable to write bytes to buffer\")\n\n\treader := b.NewReaderAt(0)\n\n\tchunk := 10 // Read 10 bytes in loop.\n\treadBuf := make([]byte, chunk)\n\tcurrentOffset := 0\n\n\tfor i := 0; i < len(wb)/chunk; i++ {\n\t\tn, err = reader.Read(readBuf)\n\t\trequire.NoError(t, err, \"unable to read from reader\")\n\t\trequire.Equal(t, chunk, n, \"length read should be equal to chunk\")\n\t\trequire.True(t, bytes.Equal(readBuf, wb[currentOffset:currentOffset+chunk]))\n\n\t\trb := b.Bytes()[currentOffset : currentOffset+chunk]\n\t\trequire.True(t, bytes.Equal(wb[currentOffset:currentOffset+chunk], rb))\n\n\t\tcurrentOffset += chunk\n\t}\n\n\t// Read EOF.\n\tn, err = reader.Read(readBuf)\n\trequire.Equal(t, err, io.EOF, \"should return EOF\")\n\trequire.Equal(t, n, 0)\n\n\t// Read EOF again.\n\tn, err = reader.Read(readBuf)\n\trequire.Equal(t, err, io.EOF, \"should return EOF\")\n\trequire.Equal(t, n, 0)\n}\n\n// Test when read buffer is larger than PageBuffer.\nfunc TestPagebufferReader4(t *testing.T) {\n\trand.Seed(time.Now().Unix())\n\n\tvar wb [20]byte\n\trand.Read(wb[:])\n\n\tb := NewPageBuffer(32)\n\tn, err := b.Write(wb[:])\n\trequire.Equal(t, n, len(wb), \"length of buffer and length written should be equal\")\n\trequire.NoError(t, err, \"unable to write bytes to buffer\")\n\n\treader := b.NewReaderAt(0)\n\treadBuf := make([]byte, 100)\n\n\tn, err = reader.Read(readBuf)\n\trequire.NoError(t, 
err, \"unable to read from reader\")\n\trequire.Equal(t, 20, n, \"length read should be equal to chunk\")\n\n\t// Read EOF.\n\tn, err = reader.Read(readBuf)\n\trequire.Equal(t, err, io.EOF, \"should return EOF\")\n\trequire.Equal(t, n, 0)\n}\n\n// Test when reading into 0 length readBuffer\nfunc TestPagebufferReader5(t *testing.T) {\n\tb := NewPageBuffer(32)\n\tvar wb [20]byte\n\trand.Read(wb[:])\n\tn, err := b.Write(wb[:])\n\trequire.Equal(t, n, len(wb), \"length of buffer and length written should be equal\")\n\trequire.NoError(t, err, \"unable to write bytes to buffer\")\n\n\treader := b.NewReaderAt(0)\n\n\treadBuffer := []byte{} // Intentionally empty readBuffer.\n\tn, err = reader.Read(readBuffer)\n\trequire.NoError(t, err, \"reading into empty buffer should return no error\")\n\trequire.Equal(t, 0, n, \"read into empty buffer should return 0 bytes\")\n}\n\nfunc TestSizeVarintForZero(t *testing.T) {\n\tsiz := sizeVarint(0)\n\trequire.Equal(t, 1, siz)\n}\n\nfunc TestEncodedSize(t *testing.T) {\n\tvalBufSize := uint32(rand.Int31n(1e5))\n\texpiry := rand.Uint64()\n\texpiryVarintBuf := make([]byte, 64)\n\texpVarintSize := uint32(binary.PutUvarint(expiryVarintBuf, expiry))\n\tvalBuf := make([]byte, valBufSize)\n\t_, _ = rand.Read(valBuf)\n\n\tvalStruct := &ValueStruct{\n\t\tValue:     valBuf,\n\t\tExpiresAt: expiry,\n\t}\n\n\trequire.Equal(t, valBufSize+uint32(2)+expVarintSize, valStruct.EncodedSize())\n}\n\nfunc TestAllocatorReuse(t *testing.T) {\n\ta := z.NewAllocator(1024, \"test\")\n\tdefer a.Release()\n\n\tN := 1024\n\tbuf := make([]byte, 4096)\n\trand.Read(buf)\n\n\tfor i := 0; i < N; i++ {\n\t\ta.Reset()\n\t\tvar list pb.KVList\n\t\tfor j := 0; j < N; j++ {\n\t\t\tkv := NewKV(a)\n\t\t\tsz := rand.Intn(1024)\n\t\t\tkv.Key = a.Copy(buf[:sz])\n\t\t\tkv.Value = a.Copy(buf[:4*sz])\n\t\t\tkv.Meta = a.Copy([]byte{1})\n\t\t\tkv.Version = uint64(sz)\n\t\t\tlist.Kv = append(list.Kv, kv)\n\t\t}\n\t\t_, err := proto.Marshal(&list)\n\t\trequire.NoError(t, 
err)\n\t}\n\tt.Logf(\"Allocator: %s\\n\", a)\n}\n\nfunc TestSafeCopy_Issue2067(t *testing.T) {\n\ttype args struct {\n\t\ta   []byte\n\t\tsrc []byte\n\t}\n\ttests := []struct {\n\t\tname string\n\t\targs args\n\t\twant []byte\n\t}{\n\t\t{\n\t\t\tname: \"Nil src should return empty slice not nil\",\n\t\t\targs: args{a: nil, src: nil},\n\t\t\twant: []byte{},\n\t\t},\n\t\t{\n\t\t\tname: \"Empty src should return empty slice not nil\",\n\t\t\targs: args{a: nil, src: []byte{}},\n\t\t\twant: []byte{},\n\t\t},\n\t\t{\n\t\t\tname: \"Normal src should return src content\",\n\t\t\targs: args{a: nil, src: []byte(\"hello\")},\n\t\t\twant: []byte(\"hello\"),\n\t\t},\n\t\t{\n\t\t\tname: \"Buffer reuse with nil src should return empty slice\",\n\t\t\targs: args{a: make([]byte, 10), src: nil},\n\t\t\twant: []byte{},\n\t\t},\n\t}\n\tfor _, tt := range tests {\n\t\tt.Run(tt.name, func(t *testing.T) {\n\t\t\tgot := SafeCopy(tt.args.a, tt.args.src)\n\t\t\trequire.Equal(t, tt.want, got)\n\n\t\t\t// Explicit check for nil vs empty slice distinction\n\t\t\tif len(tt.want) == 0 {\n\t\t\t\trequire.NotNil(t, got, \"SafeCopy returned nil, but we expected an empty slice []byte{}\")\n\t\t\t}\n\t\t})\n\t}\n}\n"
  },
  {
    "path": "y/zstd.go",
    "content": "/*\n * SPDX-FileCopyrightText: © 2017-2025 Istari Digital, Inc.\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage y\n\nimport (\n\t\"sync\"\n\n\t\"github.com/klauspost/compress/zstd\"\n)\n\nvar (\n\tdecoder *zstd.Decoder\n\tencoder *zstd.Encoder\n\n\tencOnce, decOnce sync.Once\n)\n\n// ZSTDDecompress decompresses a block using ZSTD algorithm.\nfunc ZSTDDecompress(dst, src []byte) ([]byte, error) {\n\tdecOnce.Do(func() {\n\t\tvar err error\n\t\tdecoder, err = zstd.NewReader(nil)\n\t\tCheck(err)\n\t})\n\treturn decoder.DecodeAll(src, dst[:0])\n}\n\n// ZSTDCompress compresses a block using ZSTD algorithm.\nfunc ZSTDCompress(dst, src []byte, compressionLevel int) ([]byte, error) {\n\tencOnce.Do(func() {\n\t\tvar err error\n\t\tlevel := zstd.EncoderLevelFromZstd(compressionLevel)\n\t\tencoder, err = zstd.NewWriter(nil, zstd.WithEncoderLevel(level))\n\t\tCheck(err)\n\t})\n\treturn encoder.EncodeAll(src, dst[:0]), nil\n}\n\n// ZSTDCompressBound returns the worst case size needed for a destination buffer.\n// Klauspost ZSTD library does not provide any API for Compression Bound. This\n// calculation is based on the DataDog ZSTD library.\n// See https://pkg.go.dev/github.com/DataDog/zstd#CompressBound\nfunc ZSTDCompressBound(srcSize int) int {\n\tlowLimit := 128 << 10 // 128 kB\n\tvar margin int\n\tif srcSize < lowLimit {\n\t\tmargin = (lowLimit - srcSize) >> 11\n\t}\n\treturn srcSize + (srcSize >> 8) + margin\n}\n"
  }
]