[
  {
    "path": ".cargo/config.toml",
    "content": "[env]\nRUST_TEST_THREADS = \"1\"\n"
  },
  {
    "path": ".editorconfig",
    "content": "root=true\n\n[*]\ncharset = utf-8\nindent_style = space\ninsert_final_newline = true\ntrim_trailing_whitespace = true\nmax_line_length = 80\n\n[*.{rs, py}]\nindent_size = 4\n\n[*.{yml, html, css, js, ts, md}]\nindent_size = 2\n"
  },
  {
    "path": ".flake8",
    "content": "[flake8]\nexclude = .venv, target\n"
  },
  {
    "path": ".github/CODE_OF_CONDUCT.md",
    "content": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make participation in our\ncommunity a harassment-free experience for everyone, regardless of age, body\nsize, visible or invisible disability, ethnicity, sex characteristics, gender\nidentity and expression, level of experience, education, socio-economic status,\nnationality, personal appearance, race, caste, color, religion, or sexual\nidentity and orientation.\n\nWe pledge to act and interact in ways that contribute to an open, welcoming,\ndiverse, inclusive, and healthy community.\n\n## Our Standards\n\nExamples of behavior that contributes to a positive environment for our\ncommunity include:\n\n- Demonstrating empathy and kindness toward other people\n- Being respectful of differing opinions, viewpoints, and experiences\n- Giving and gracefully accepting constructive feedback\n- Accepting responsibility and apologizing to those affected by our mistakes,\n  and learning from the experience\n- Focusing on what is best not just for us as individuals, but for the overall\n  community\n\nExamples of unacceptable behavior include:\n\n- The use of sexualized language or imagery, and sexual attention or advances of\n  any kind\n- Trolling, insulting or derogatory comments, and personal or political attacks\n- Public or private harassment\n- Publishing others' private information, such as a physical or email address,\n  without their explicit permission\n- Other conduct which could reasonably be considered inappropriate in a\n  professional setting\n\n## Enforcement Responsibilities\n\nCommunity leaders are responsible for clarifying and enforcing our standards of\nacceptable behavior and will take appropriate and fair corrective action in\nresponse to any behavior that they deem inappropriate, threatening, offensive,\nor harmful.\n\nCommunity leaders have the right and responsibility to remove, edit, or reject\ncomments, commits, code, wiki 
edits, issues, and other contributions that are\nnot aligned to this Code of Conduct, and will communicate reasons for moderation\ndecisions when appropriate.\n\n## Scope\n\nThis Code of Conduct applies within all community spaces, and also applies when\nan individual is officially representing the community in public spaces.\nExamples of representing our community include using an official email address,\nposting via an official social media account, or acting as an appointed\nrepresentative at an online or offline event.\n\n## Enforcement\n\nInstances of abusive, harassing, or otherwise unacceptable behavior may be\nreported to the community leaders responsible for enforcement at\nedwin@oasysai.com. All complaints will be reviewed and investigated promptly and\nfairly.\n\nAll community leaders are obligated to respect the privacy and security of the\nreporter of any incident.\n\n## Enforcement Guidelines\n\nCommunity leaders will follow these Community Impact Guidelines in determining\nthe consequences for any action they deem in violation of this Code of Conduct:\n\n### 1. Correction\n\n**Community Impact**: Use of inappropriate language or other behavior deemed\nunprofessional or unwelcome in the community.\n\n**Consequence**: A private, written warning from community leaders, providing\nclarity around the nature of the violation and an explanation of why the\nbehavior was inappropriate. A public apology may be requested.\n\n### 2. Warning\n\n**Community Impact**: A violation through a single incident or series of\nactions.\n\n**Consequence**: A warning with consequences for continued behavior. No\ninteraction with the people involved, including unsolicited interaction with\nthose enforcing the Code of Conduct, for a specified period of time. This\nincludes avoiding interactions in community spaces as well as external channels\nlike social media. Violating these terms may lead to a temporary or permanent\nban.\n\n### 3. 
Temporary Ban\n\n**Community Impact**: A serious violation of community standards, including\nsustained inappropriate behavior.\n\n**Consequence**: A temporary ban from any sort of interaction or public\ncommunication with the community for a specified period of time. No public or\nprivate interaction with the people involved, including unsolicited interaction\nwith those enforcing the Code of Conduct, is allowed during this period.\nViolating these terms may lead to a permanent ban.\n\n### 4. Permanent Ban\n\n**Community Impact**: Demonstrating a pattern of violation of community\nstandards, including sustained inappropriate behavior, harassment of an\nindividual, or aggression toward or disparagement of classes of individuals.\n\n**Consequence**: A permanent ban from any sort of public interaction within the\ncommunity.\n\n## Attribution\n\nThis Code of Conduct is adapted from the [Contributor Covenant][homepage],\nversion 2.1. The Community Impact Guidelines were inspired by [Mozilla's Code of\nConduct Enforcement Ladder][mozilla_coc].\n\n[homepage]: https://www.contributor-covenant.org\n[mozilla_coc]: https://github.com/mozilla/diversity\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "content": "---\nname: 🐞 Report Bug\nabout: Report an unexpected behavior or a malfunctioning feature.\ntitle: \"BUG: \"\nlabels: bug\nassignees: \"\"\n---\n\n### Short Description\n\nPlease describe the issue you are experiencing in a few sentences.\n\n### Error Message\n\nIf you received an error message, please paste some parts of it here.\n\n```txt\n\n```\n\n### Steps to Reproduce\n\nWhat are the minimal steps to reproduce the behavior?\n\nExample:\n\n1. Import the library in ...\n2. Initialize the object with ...\n3. Call the function ...\n\n### Expected Behavior\n\nWhat do you expect to happen?\n\n### Additional Context\n\nAdd any other context about the problem here like error traces, etc.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/config.yml",
    "content": "blank_issues_enabled: false\n\ncontact_links:\n  - name: ❓ Ask Question\n    url: https://github.com/oasysai/oasysdb/discussions\n    about: Ask general questions or share ideas on Discussions.\n\n  - name: 💬 Join Discord\n    url: https://discord.gg/bDhQrkqNP4\n    about: Join the Discord server to help shape the future of OasysDB.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/do_chore.md",
    "content": "---\nname: 🧹 Do Chore\nabout: Documentation updates, code refactoring, or other chores.\ntitle: \"CHORE: \"\nlabels: chore\nassignees: \"\"\n---\n\n### Description\n\nPlease describe the chore you suggest in a few sentences.\n\nChore examples:\n\n- Updating documentation\n- Adding tests or examples\n- Refactoring parts of the codebase\n\n### Context\n\nWhy is this chore beneficial for the project and its community?\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "content": "---\nname: 🛠️ Feature Request\nabout: Request a new feature or an improvement to an existing feature.\ntitle: \"FEAT: \"\nlabels: enhancement\nassignees: \"\"\n---\n\n### Use Case\n\nWhat's the use case for this feature? How would you use it?\n\n### Potential Solution\n\nOn the high level, how would you like the feature to be implemented?\n\n### Additional Context\n\nAdd context about the feature like links to similar implementations.\n\nFor example:\n\n- Link to a similar feature in another project\n- Screenshot of the feature functionality\n- Research papers or articles about the feature\n"
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE.md",
    "content": "### Purpose\n\nDescribe the problem solved or feature added by this PR.\n\n### Approach\n\nHow does this PR solve the problem or add the feature?\n\n### Testing\n\n- [ ] I have tested this PR locally.\n- [ ] If applicable, I added tests to cover my changes.\n\nHow did you test this PR? How should the reviewer test this PR?\n\n### Chore Checklist\n\n- [ ] I formatted my code according to the style and linter guidelines.\n- [ ] If applicable, I updated the documentation accordingly.\n"
  },
  {
    "path": ".github/SECURITY.md",
    "content": "# Security Policy\n\nThank you for taking the time to report a security issue. We are trying our best\nto make this project safe for everyone. We appreciate your efforts to disclose\nthe issue responsibly and will make every effort to acknowledge your\ncontributions.\n\n## Reporting a vulnerability\n\n**Please do not report security vulnerabilities through public GitHub issues.**\n\nIf you believe you have found a security vulnerability, please send an email to\nedwin@oasysai.com. Please include as many details as possible, these may\ninclude:\n\n- Impact of the vulnerability.\n- Steps to reproduce.\n- Possible solutions.\n- Location of the vulnerability like file or line number.\n- If applicable, proof-of-concept or exploit code.\n"
  },
  {
    "path": ".github/workflows/publish-docs.yml",
    "content": "name: Publish Docs\n\non:\n  workflow_dispatch:\n\n  push:\n    branches:\n      - main\n\n    paths:\n      - \"docs/**\"\n      - \"mkdocs.yml\"\n\npermissions:\n  id-token: write\n  pages: write\n  contents: write\n\njobs:\n  build-docs:\n    name: Build documentation\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout the code\n        uses: actions/checkout@v4\n\n      - name: Install Python\n        uses: actions/setup-python@v5\n        with:\n          python-version: 3.x\n\n      - name: Install dependencies\n        run: pip install mkdocs-material\n\n      - name: Publish the documentation\n        run: |\n          mkdocs gh-deploy --force --message \"cd: deploy docs from {sha}\"\n\n  publish-docs:\n    name: Publish documentation\n    runs-on: ubuntu-latest\n    needs: build-docs\n    environment:\n      name: Docs\n      url: ${{ steps.deployment.outputs.page_url }}\n    steps:\n      - name: Checkout\n        uses: actions/checkout@v4\n        with:\n          ref: gh-pages\n\n      - name: Setup pages\n        uses: actions/configure-pages@v5\n\n      - name: Upload artifact\n        uses: actions/upload-pages-artifact@v3\n        with:\n          path: \".\"\n\n      - name: Deploy to GitHub Pages\n        id: deployment\n        uses: actions/deploy-pages@v4\n"
  },
  {
    "path": ".github/workflows/quality-check.yml",
    "content": "name: Quality Check\n\non:\n  workflow_dispatch:\n\n  pull_request:\n    paths-ignore:\n      - \"docs/**\"\n      - \"clients/**\"\n\n  push:\n    branches:\n      - main\n    paths-ignore:\n      - \"docs/**\"\n      - \"clients/**\"\n\njobs:\n  quality-check:\n    name: Run All Checks\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout Code\n        uses: actions/checkout@v4\n\n      - name: Install Rust Toolchain\n        uses: dtolnay/rust-toolchain@stable\n        with:\n          components: rustfmt, clippy\n\n      - name: Install Protobuf Compiler\n        run: |\n          sudo apt update && sudo apt upgrade -y\n          sudo apt install -y protobuf-compiler libprotobuf-dev\n\n      - name: Run Formatter\n        run: cargo fmt -- --check\n\n      - name: Run Linter\n        run: cargo clippy -- -D warnings\n\n      - name: Run Tests\n        run: cargo test --all-features -- --test-threads 1\n"
  },
  {
    "path": ".gitignore",
    "content": "# OasysDB tests.\nodb*\noasysdb*\n\n# Rust stuff.\ndebug\ntarget\n\n# Python stuff.\n__pycache__\n.pytest_cache\n.venv\n*.so\n*.py[cod]\n\n# Benchmarking.\n*.ivecs\n*.fvecs\n\n# Misc.\n.vscode\n.ds_store\n\n# Environment variables.\n.env\n.env.*\n!.env.example\n"
  },
  {
    "path": ".prettierrc.yml",
    "content": "bracketSpacing: true\nsingleQuote: false\ntrailingComma: \"none\"\nsemi: false\ntabWidth: 2\nprintWidth: 80\nproseWrap: \"always\"\n"
  },
  {
    "path": "Cargo.toml",
    "content": "[package]\nname = \"oasysdb\"\nversion = \"0.8.0\"\nedition = \"2021\"\nauthors = [\"Edwin Kys\"]\n\n[dependencies]\ntokio = { version = \"1.39.3\", features = [\"rt-multi-thread\", \"macros\"] }\nhashbrown = { version = \"0.15.0\", features = [\"serde\", \"rayon\"] }\nuuid = { version = \"1.10.0\", features = [\"v4\", \"serde\"] }\nclap = \"4.5.16\"\n\n# gRPC-related dependencies\ntonic = \"0.12.1\"\nprost = \"0.13.1\"\n\n# Serialization-related dependencies\nserde = { version = \"1.0.208\", features = [\"derive\"] }\nbincode = \"1.3.3\"\n\n# Parallelism-related dependencies\nsimsimd = \"5.0.1\"\nrayon = \"1.10.0\"\n\n# Logging-related dependencies\ntracing = \"0.1.40\"\ntracing-subscriber = \"0.3.18\"\n\n# Utility dependencies\nrand = \"0.8.5\"\ndotenv = \"0.15.0\"\n\n[build-dependencies]\ntonic-build = \"0.12\"\n"
  },
  {
    "path": "LICENSE",
    "content": "                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      
form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. 
Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. 
You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. 
You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n"
  },
  {
    "path": "README.md",
    "content": "![OasysDB Use Case](https://odb-assets.s3.amazonaws.com/banners/0.7.0.png)\n\n[![GitHub Stars](https://img.shields.io/github/stars/oasysai/oasysdb?style=for-the-badge&logo=github&logoColor=%23000000&labelColor=%23fcd34d&color=%236b7280)](https://github.com/oasysai/oasysdb)\n[![Crates.io](https://img.shields.io/crates/d/oasysdb?style=for-the-badge&logo=rust&logoColor=%23000&label=crates.io&labelColor=%23fdba74&color=%236b7280)](https://crates.io/crates/oasysdb)\n\n## Notice\n\nThis repository is not currently maintained. I initially created this project to\nlearn more about databases and Rust. As times goes on, I actually learned from\nthis project and the people who used it. Unfortunately, most open-source\nprojects doesn't generate enough revenue to sustain itself.\n\nI'm currently looking for a new opportunity to work as a **Software Engineer in\nAI Infrastructure**. If you have or know someone who has an open position,\nplease let me know. I'm open to work remotely or anywhere in the United States.\n\nYou can reach me via [LinkedIn](https://www.linkedin.com/in/edwinkys).\n\nIf you're interested in taking over this project, please let me know. I'll be\nhappy to discuss the details with you. Other than that, I'll just leave this\nproject as is for historical purposes.\n\nThank you all for your support and understanding. It's been a great journey!\n"
  },
  {
    "path": "build.rs",
    "content": "use std::error::Error;\nuse tonic_build::compile_protos;\n\nfn main() -> Result<(), Box<dyn Error>> {\n    compile_protos(\"protos/database.proto\")?;\n    Ok(())\n}\n"
  },
  {
    "path": "docs/CNAME",
    "content": "docs.oasysdb.com\n"
  },
  {
    "path": "docs/blog/.authors.yml",
    "content": "authors:\n  edwinkys:\n    name: Edwin Kys\n    description: Author of OasysDB\n    avatar: https://avatars.githubusercontent.com/u/51223060?v=4\n"
  },
  {
    "path": "docs/blog/index.md",
    "content": "# Latest Posts\n\nBite-sized blog posts about generative AI, machine learning, and more.\n"
  },
  {
    "path": "docs/changelog.md",
    "content": "# Changelog\n\n## v0.7.2\n\n### What's Changed\n\nThis release includes a fix for the file system issue happening on Windows which\nhappen when the default temporary directory in in a different drive than the\ncurrent working directory. This issue is fixed by creating a temporary directory\nin the root of the database directory.\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.7.1...v0.7.2](https://github.com/oasysai/oasysdb/compare/v0.7.1...v0.7.2)\n\n## v0.7.1\n\n### What's Changed\n\nThis release includes a low-level CRUD API for the index implementation from the\nDatabase layer. Once the index is built, when necessary, you can use the CRUD\nAPI to manage the index data directly. This API allows you to perform the\nfollowing operations:\n\n- Insert new records into the index.\n- Update existing records in the index.\n- Delete records from the index.\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.7.0...v0.7.1](https://github.com/oasysai/oasysdb/compare/v0.7.0...v0.7.1)\n\n## v0.7.0\n\n### What's Changed\n\nOasysDB v0.7.0 is a major release that includes a complete overhaul of the\nsystem. Instead of being a dedicated vector database, OasysDB is now a hybrid\nvector database that integrates with SQL databases such as SQLite and PostgreSQL\nwhich you can configure to store the vector records. 
This approach offers several\nadvantages:\n\n- Reliability and durability of the data due to SQL database ACID properties.\n- Separation of vector storage and computation, allowing you to scale the\n  system independently.\n\nThese are some of the key changes in this release:\n\n- **SQL Storage Layer**: OasysDB can be configured to source vector records from\n  a SQL database such as SQLite or PostgreSQL.\n- **Multi-index Support**: OasysDB can support multiple indices for the same SQL\n  table, allowing users to improve search performance.\n- **Pre-filtering**: OasysDB can pre-filter the vector records from SQL tables\n  based on the metadata before inserting them into the index.\n- **Configurable Algorithm**: Each index in OasysDB can be configured with\n  different algorithms and parameters to fit the performance requirements.\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.6.1...v0.7.0](https://github.com/oasysai/oasysdb/compare/v0.6.1...v0.7.0)\n\n## v0.6.1\n\n### What's Changed\n\n- Add support for the boolean metadata type. This allows full compatibility with\n  JSON-like object or dictionary metadata when storing vector records in the\n  collection.\n- We improve the performance of the database's save and get collection\n  operations by 10-20% by reducing the number of IO operations. Also, the save\n  collection operation is now atomic, which means that the collection is saved\n  to the disk only when the operation is completed successfully.\n- We launch our own documentation website at\n  [docs.oasysdb.com](https://docs.oasysdb.com) to provide a better user\n  experience and more comprehensive documentation for the OasysDB library.\n
It's\n  still a work in progress and we will continue to improve the documentation\n  over time.\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.6.0...v0.6.1](https://github.com/oasysai/oasysdb/compare/v0.6.0...v0.6.1)\n\n## v0.6.0\n\n### What's Changed\n\n- **CONDITIONAL BREAKING CHANGE**: We remove support for the dot distance metric\n  and replace cosine similarity with the cosine distance metric. This change\n  makes the distance metrics consistent with one another.\n- The default configuration for the collection (EF Construction and EF Search)\n  is increased to more sensible values for common real-world use cases. The\n  default EF Construction is set to 128 and the default EF Search is set to 64.\n- We add a new script to measure the recall rate of the collection search\n  functionality. With this, we improve the search recall rate of OasysDB to\n  match the recall rate of HNSWLib with the same configuration.\n\n```sh\ncargo run --example measure-recall\n```\n\n- We add a new benchmark to measure the performance of saving and getting the\n  collection. The benchmark can be run with the command below.\n\n```sh\ncargo bench\n```\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.5.1...v0.6.0](https://github.com/oasysai/oasysdb/compare/v0.5.1...v0.6.0)\n\n## v0.5.1\n\n### What's Changed\n\nWe add a new method `Collection.filter` to filter the vector records based on\nthe metadata. This method returns a HashMap of the filtered vector records and\ntheir corresponding vector IDs.\n
This implementation performs a linear search\nthrough the collection and thus might be slow for large datasets.\n\nThis implementation includes support for the following metadata types as\nfilters:\n\n- `String`: Stored value must include the filter string.\n- `Float`: Stored value must be equal to the filter float.\n- `Integer`: Stored value must be equal to the filter integer.\n- `Object`: Stored value must match all the key-value pairs in the filter\n  object.\n\nWe currently don't support filtering based on array-type metadata because I am\nnot sure of the best way to implement it. If you have any suggestions, please\nlet me know.\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.5.0...v0.5.1](https://github.com/oasysai/oasysdb/compare/v0.5.0...v0.5.1)\n\n## v0.5.0\n\n### What's Changed\n\n- **BREAKING CHANGE**: Although there is no change in the database API, the\n  underlying storage format has been changed to save the collection data to\n  dedicated files directly. The details of the new persistence system and how to\n  migrate from v0.4.x to v0.5.0 can be found in this migration guide.\n\n- By adding the feature `gen`, you can now use the `EmbeddingModel` trait and\n  OpenAI's embedding models to generate vectors or records from text without\n  external dependencies. This feature is optional and can be enabled by adding\n  the feature to the `Cargo.toml` file.\n\n```toml\n[dependencies]\noasysdb = { version = \"0.5.0\", features = [\"gen\"] }\n```\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.4.5...v0.5.0](https://github.com/oasysai/oasysdb/compare/v0.4.5...v0.5.0)\n\n## v0.4.5\n\n### What's Changed\n\n- Add an insert benchmark to measure the performance of inserting vectors into\n  the collection. The benchmark can be run using the `cargo bench` command.\n- Fix the issue with large-size dirty IO buffers caused by the database\n  operation. This issue is fixed by flushing the dirty IO buffers after the\n  operation is completed.\n
This operation can be done synchronously or\n  asynchronously based on the user's preference since this operation might take\n  some time to complete.\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.4.4...v0.4.5](https://github.com/oasysai/oasysdb/compare/v0.4.4...v0.4.5)\n\n## v0.4.4\n\n### What's Changed\n\n- Maximize compatibility with the standard library error types to allow users\n  to convert OasysDB errors to the most commonly used error-handling libraries\n  such as `anyhow`, `thiserror`, etc.\n- Add conversion methods to convert metadata to a JSON value via `serde_json`\n  and vice versa. This allows users to store JSON-formatted metadata easily.\n- Add normalized cosine distance metric to the collection search functionality.\n  Read more about the normalized cosine distance metric here.\n- Fix the search distance calculation to use the correct distance metric and\n  sort the results accordingly based on the collection configuration.\n- Add vector ID utility methods to the `VectorID` struct to make it easier to\n  work with the vector ID.\n\n### Additional Notes\n\n- Add a new benchmark to measure the true search (brute-force) performance of\n  the collection. When dealing with a small dataset, the true search method is\n  recommended for better accuracy. The benchmark can be run using the\n  `cargo bench` command.\n- Improve the documentation to include more examples and explanations on how to\n  use the library: Comprehensive Guide.\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.4.3...v0.4.4](https://github.com/oasysai/oasysdb/compare/v0.4.3...v0.4.4)\n\n## v0.4.3\n\n### What's Changed\n\n- Add SIMD acceleration to calculate the distance between vectors. This improves\n  the performance of inserting and searching vectors in the collection.\n- Improve OasysDB native error type implementation to include the type/kind of\n  error that occurred in addition to the error message. 
For example,\n  `ErrorKind::CollectionError` is used to represent errors that occur during\n  collection operations.\n- Fix the `Config.ml` default value from 0.3 to 0.2885, which is the optimal\n  value for HNSW with an M of 32. The formula for the optimal ml value is\n  `1/ln(M)`.\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.4.2...v0.4.3](https://github.com/oasysai/oasysdb/compare/v0.4.2...v0.4.3)\n\n## v0.4.2\n\n### What's Changed\n\nDue to an issue (#62) with the Python release of v0.4.1, this patch version is\nreleased to fix the build wheels for Python users. The issue was caused by the\nnew optional PyO3 feature in the v0.4.1 Rust crate release, which excluded\nPyO3 dependencies from the build process. To solve this, the Python package\nbuild and deploy script now includes the `--features py` argument.\n\nFor Rust users, this version doesn't offer any additional features or\nfunctionality compared to the v0.4.1 release.\n\n### Full Changelog\n\n[v0.4.1...v0.4.2](https://github.com/oasysai/oasysdb/compare/v0.4.1...v0.4.2)\n\n## v0.4.1\n\n### What's Changed\n\n- Added quality-of-life improvements to the `VectorID` type interoperability.\n- Improved the `README.md` file with additional data points on the database\n  performance.\n- Changed the `Collection.insert` method to return the new `VectorID` after\n  inserting a new vector record.\n- PyO3 dependencies are now hidden behind the `py` feature. This allows users\n  to build the library without the Python bindings if they don't need them,\n  which is probably all of them.\n\n### Contributors\n\n- @dteare\n- @edwinkys\n- @noneback\n\n### Full Changelog\n\n[v0.4.0...v0.4.1](https://github.com/oasysai/oasysdb/compare/v0.4.0...v0.4.1)\n\n## v0.4.0\n\n### What's Changed\n\n- **CONDITIONAL BREAKING CHANGE**: Add an option to configure the distance\n  metric for the vector collection via the `Config` struct. The new field\n  `distance` can be set using the `Distance` enum. 
This includes Euclidean, Cosine, and Dot distance\n  metrics. The default distance metric is Euclidean. This change is backward\n  compatible if you are creating a config using the `Config::default()` method.\n  Otherwise, you need to update the config to include the distance metric.\n\n```rs\nlet config = Config {\n  ...\n  distance: Distance::Cosine,\n};\n```\n\n- With the new distance metric feature, you can now set a `relevancy` threshold\n  for the search results. This will filter out the results that are below or\n  above the threshold depending on the distance metric used. This feature is\n  disabled by default, with the threshold set to -1.0. To enable this feature,\n  you can set the `relevancy` field in the `Collection` struct.\n\n```rs\n...\nlet mut collection = Collection::new(&config)?;\ncollection.relevancy = 3.0;\n```\n\n- Add a new method `Collection::insert_many` to insert multiple vector records\n  into the collection at once. This method is more efficient than calling the\n  `Collection::insert` method in a loop.\n\n### Contributors\n\n- @noneback\n- @edwinkys\n\n### Full Changelog\n\n[v0.3.0...v0.4.0](https://github.com/oasysai/oasysdb/compare/v0.3.0...v0.4.0)\n\n## v0.3.0\n\nThis release introduces a BREAKING CHANGE to one of the methods of the\n`Database` struct. The `Database::create_collection` method has been removed\nfrom the library due to redundancy. The `Database::save_collection` method can\nbe used to create a new collection or update an existing one. This change is\nmade to simplify the API and to make it more consistent with the other methods\nin the `Database` struct.\n\n### What's Changed\n\n- **BREAKING CHANGE**: Removed the `Database::create_collection` method from\n  the library. 
To replace this, you can use the code snippet below:\n\n```rs\n// Before: create a new collection with the given records.\ndb.create_collection(\"vectors\", None, Some(records))?;\n\n// After: create a new or build a collection, then save it.\n// let collection = Collection::new(&config)?;\nlet collection = Collection::build(&config, &records)?;\ndb.save_collection(\"vectors\", &collection)?;\n```\n\n- Added the `Collection::list` method to list all the vector records in the\n  collection.\n- Created a full Python binding for OasysDB which is available on PyPI. This\n  allows you to use OasysDB directly from Python. The Python binding is\n  available at https://pypi.org/project/oasysdb.\n\n### Contributors\n\n- @edwinkys\n- @Zelaren\n- @FebianFebian1\n\n### Full Changelog\n\n[v0.2.1...v0.3.0](https://github.com/oasysai/oasysdb/compare/v0.2.1...v0.3.0)\n\n## v0.2.1\n\n### What's Changed\n\n- The `Metadata` enum can now be accessed publicly using\n  `oasysdb::metadata::Metadata`. This allows users to use `match` statements to\n  extract the data from it.\n- Added a `prelude` module that re-exports the most commonly used types and\n  traits. This makes it easier to use the library by importing the prelude\n  module with `use oasysdb::prelude::*`.\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.2.0...v0.2.1](https://github.com/oasysai/oasysdb/compare/v0.2.0...v0.2.1)\n\n## v0.2.0\n\n### What's Changed\n\n- For the `Collection` struct, the generic parameter `D` has been replaced with\n  the `Metadata` enum which allows one collection to store different types of\n  data as needed.\n- The `Vector` struct now uses `Vec<f32>` instead of `[f32; N]` which removes\n  the `N` generic parameter. 
Since this change makes it\n  possible to use different vector dimensions in the same collection, an\n  additional check is added to the `Collection` to make sure that the vector\n  dimension is uniform.\n- The `M` generic parameter in the `Collection` struct has been replaced with a\n  constant of 32. This removes the flexibility to tweak the indexing\n  configuration for this value, but for most use cases, this value should be\n  sufficient.\n- Added multiple utility functions to structs such as `Record`, `Vector`, and\n  `Collection` to make it easier to work with the data.\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.1.0...v0.2.0](https://github.com/oasysai/oasysdb/compare/v0.1.0...v0.2.0)\n\n## v0.1.0\n\n### What's Changed\n\n- OasysDB is released as an embedded vector database available directly via the\n  `cargo add oasysdb` command.\n- Uses an HNSW algorithm implementation for the collection indexing along with\n  the Euclidean distance metric.\n- Supports incremental updates on the vector collections, allowing inserts,\n  deletes, and modifications without rebuilding the index.\n- Adds a benchmark on the collection search functionality using the SIFT\n  dataset that can be run using the `cargo bench` command.\n\n### Contributors\n\n- @edwinkys\n\n### Full Changelog\n\n[v0.1.0](https://github.com/oasysai/oasysdb/commits/v0.1.0)\n"
  },
  {
    "path": "docs/contributing.md",
    "content": "# Contributing to OasysDB\n\nFirst of all, thank you for considering to contribute to OasysDB! We welcome\ncontributions from the community, and this documentation outlines the process to\nstart contributing to our project.\n\n## Code of Conduct\n\nWe are committed to building an inclusive and welcoming community because we\nbelieve that it will lead to a more successful project and a better experience\nfor everyone involved. To achieve that, any participant in our project is\nexpected to act respectfully and to follow the Code of Conduct.\n\n## Have questions or suggestions?\n\n[![Discord](https://img.shields.io/discord/1182432298382131200?logo=discord&logoColor=%23ffffff&label=Discord&labelColor=%235865F2&style=for-the-badge)][discord]\n\nThere is no such thing as a stupid question. If you have a question, chances\nare, someone else does too. So, please feel free to ask questions whether it's\non our [Discord][discord] server or by opening a new discussion on [GitHub\nDiscussions][gh_discussions].\n\n## Encounter a bug? Have a feature request?\n\nIf you encounter a bug or have a feature request, please open an issue on\n[GitHub Issues][gh_issues]. Please include enough information for us to\nunderstand the issue or the feature request. For this reason, we recommend you\nto follow the issue templates we have provided when creating a new issue.\n\n## Want to contribute code?\n\n**TLDR: Check or open an issue first before working on a PR.**\n\nBefore you start working on a pull request, we encourage you to check out the\nexisting issues and pull requests to make sure that the feature you want to work\non is in our roadmap and is aligned with the project's vision. After all, we\ndon't want you to waste your time working on something that might not be merged.\n\nWe try to prioritize features and bug fixes that are on our roadmap or requested\na lot by the community. 
If you want to work on a feature or a fix that isn't\nalready in the issue tracker, please open an issue first to discuss it with the\nproject maintainers and the community.\n\nFor features, we prioritize those that are backed by real-world use cases. If\nyou have a use case for a feature, please include it in the issue. We'd love to\nhear about it!\n\n## Getting started\n\nOasysDB is written in Rust, so you need to have Rust installed on your local\nmachine. If you haven't installed Rust yet, you can install it by following the\ninstructions in the [Rust Installation Guide][rustup].\n\nAfter you have installed Rust, you can clone the repository to your local\nmachine. Before you start making changes in the codebase, you should run the\ntests to make sure that everything is working as expected:\n\n```sh\ncargo test\n```\n\nOasysDB uses a few third-party dependencies that might be useful for you to\nget familiar with. These are the most important ones along with their\ndocumentation:\n\n- [gRPC](https://grpc.io/)\n- [Tonic](https://github.com/hyperium/tonic)\n- [Tokio](https://tokio.rs/)\n\n## Style guide\n\nWe mostly use the default linting and style guide for Rust except for some\nformatting changes listed in the rustfmt.toml file. For more information about\nthe code style, see the [Rust Style Guide][style_guide].\n\nFor commit messages, we use the [Conventional Commits][conventional_commits]\nformat. This allows us to maintain consistency and readability in our Git\ncommit history, making it easier to understand the changes made to the\ncodebase at a high level.\n\nWhen commenting your code, please try your best to write comments that are clear\nand concise with proper English sentence capitalization and punctuation. This\nwill help us and the community understand your code better and keep the codebase\nmaintainable.\n\n## Submitting a pull request\n\nOnce you have made your changes, you can submit a pull request. 
We will review\nyour pull request and provide feedback. If your pull request is accepted, we\nwill merge it into the main branch.\n\nFor organization purposes, we ask that you use the [Conventional\nCommits][conventional_commits] format for your pull request title, in\nlowercase:\n\n```\n<type>: <description>\n```\n\nFor example:\n\n```\nfeat: add support ...\nfix: fix issue ...\n```\n\n## Conclusion\n\nThank you for taking the time to read this documentation. We look forward to\nyour contributions! You can also support this project by starring the\nrepository, sharing it with your circles, and joining us on [Discord][discord].\n\nBest regards,<br /> Edwin Kys\n\n[discord]: https://discord.gg/bDhQrkqNP4\n[gh_issues]: https://github.com/oasysai/oasysdb/issues\n[gh_discussions]: https://github.com/oasysai/oasysdb/discussions\n[rustup]: https://www.rust-lang.org/tools/install\n[style_guide]: https://doc.rust-lang.org/beta/style-guide/index.html\n[conventional_commits]: https://www.conventionalcommits.org/en/v1.0.0/\n"
  },
  {
    "path": "docs/css/style.css",
    "content": "h1,\nh2,\nh3 {\n  font-weight: bold !important;\n}\n\n.odb-button {\n  text-align: center;\n  width: 100%;\n}\n\n.odb-button.disabled {\n  opacity: 0.5;\n  cursor: not-allowed;\n}\n\n/* Tables will be displayed at full width. */\n\n.md-typeset__table {\n  width: 100%;\n}\n\n.md-typeset__table table:not([class]) {\n  display: table;\n}\n"
  },
  {
    "path": "docs/index.md",
    "content": "# Welcome to OasysDB 🎉\n"
  },
  {
    "path": "mkdocs.yml",
    "content": "site_name: OasysDB\n\nrepo_name: oasysai/oasysdb\nrepo_url: https://github.com/oasysai/oasysdb\n\ntheme:\n  name: material\n  logo: assets/wordmark.png\n  favicon: assets/favicon64.png\n\n  icon:\n    repo: fontawesome/brands/github\n\n  palette:\n    - media: \"(prefers-color-scheme: light)\"\n      scheme: default\n      primary: black\n      toggle:\n        name: Light Mode\n        icon: material/brightness-7\n\n    - media: \"(prefers-color-scheme: dark)\"\n      scheme: slate\n      primary: black\n      toggle:\n        name: Dark Mode\n        icon: material/brightness-4\n\n  font:\n    text: Space Grotesk\n    code: Space Mono\n\n  features:\n    - header.autohide\n    - navigation.tabs\n    - navigation.tabs.sticky\n    - navigation.expand\n    - navigation.footer\n    - content.code.copy\n\ncopyright: Copyright &copy; 2024 OasysDB\n\nextra:\n  generator: false\n\n  social:\n    - icon: fontawesome/brands/x-twitter\n      link: https://x.com/oasysai\n\n    - icon: fontawesome/brands/linkedin\n      link: https://www.linkedin.com/company/oasysai\n\n    - icon: fontawesome/brands/discord\n      link: https://discord.gg/bDhQrkqNP4\n\nextra_css:\n  - css/style.css\n\nnav:\n  - Documentation:\n      - Introduction: index.md\n\n  - Other:\n      - Changelog: changelog.md\n      - Contributing: contributing.md\n\n  - Blog:\n      - blog/index.md\n\nmarkdown_extensions:\n  - admonition\n  - attr_list\n  - md_in_html\n  - pymdownx.details\n  - pymdownx.inlinehilite\n  - pymdownx.snippets\n  - pymdownx.superfences\n\n  - pymdownx.tabbed:\n      alternate_style: true\n\n  - pymdownx.emoji:\n      emoji_index: !!python/name:material.extensions.emoji.twemoji\n      emoji_generator: !!python/name:material.extensions.emoji.to_svg\n\n  - toc:\n      permalink: \"#\"\n\nplugins:\n  - blog:\n      post_readtime: true\n      post_excerpt: required\n      authors: true\n      categories_allowed:\n        - Log\n        - Rust\n"
  },
  {
    "path": "protos/database.proto",
    "content": "syntax = \"proto3\";\npackage database;\n\nimport \"google/protobuf/empty.proto\";\n\n// OasysDB gRPC service definition.\nservice Database {\n    // Check if the connection to the database is alive.\n    rpc Heartbeat(google.protobuf.Empty) returns (HeartbeatResponse);\n\n    // Manually create a snapshot of the database.\n    rpc Snapshot(google.protobuf.Empty) returns (SnapshotResponse);\n\n    // Insert a new record into the database.\n    rpc Insert(InsertRequest) returns (InsertResponse);\n\n    // Retrieve an existing record from the database.\n    rpc Get(GetRequest) returns (GetResponse);\n\n    // Delete a record from the database.\n    rpc Delete(DeleteRequest) returns (google.protobuf.Empty);\n\n    // Update a record metadata in the database.\n    rpc Update(UpdateRequest) returns (google.protobuf.Empty);\n\n    // Query the database for nearest neighbors.\n    rpc Query(QueryRequest) returns (QueryResponse);\n}\n\nmessage HeartbeatResponse {\n    string version = 1;\n}\n\nmessage SnapshotResponse {\n    int32 count = 1;\n}\n\nmessage InsertRequest {\n    Record record = 1;\n}\n\nmessage InsertResponse {\n    string id = 1;\n}\n\nmessage GetRequest {\n    string id = 1;\n}\n\nmessage GetResponse {\n    Record record = 1;\n}\n\nmessage DeleteRequest {\n    string id = 1;\n}\n\nmessage UpdateRequest {\n    string id = 1;\n    map<string, Value> metadata = 2;\n}\n\nmessage QueryRequest {\n    Vector vector = 1;\n    int32 k = 2;\n    string filter = 3;\n    QueryParameters params = 4;\n}\n\nmessage QueryParameters {\n    int32 probes = 1;\n    float radius = 2;\n}\n\nmessage QueryResponse {\n    repeated QueryResult results = 1;\n}\n\nmessage QueryResult {\n    string id = 1;\n    map<string, Value> metadata = 2;\n    float distance = 3;\n}\n\n// List shared types below.\n\nmessage Record {\n    Vector vector = 1;\n    map<string, Value> metadata = 2;\n}\n\nmessage Vector {\n    repeated float data = 1;\n}\n\nmessage Value {\n    oneof 
value {\n        string text = 1;\n        double number = 2;\n        bool boolean = 4;\n    }\n}\n"
  },
  {
    "path": "requirements.txt",
    "content": "# Documentation website.\nmkdocs-material==9.5.26\n"
  },
  {
    "path": "rustfmt.toml",
    "content": "tab_spaces = 4\nreorder_imports = true\nmax_width = 80\nuse_small_heuristics = \"Max\"\nmerge_derives = false\n"
  },
  {
    "path": "src/cores/database.rs",
    "content": "use super::*;\nuse protos::database_server::Database as DatabaseService;\nuse std::io::{BufReader, BufWriter};\nuse tonic::{Request, Response};\n\nconst TMP_DIR: &str = \"tmp\";\nconst PARAMS_FILE: &str = \"odb_params\";\nconst STORAGE_FILE: &str = \"odb_storage\";\nconst INDEX_FILE: &str = \"odb_index\";\n\n/// Database parameters.\n///\n/// Fields:\n/// - dimension: Vector dimension.\n/// - metric: Metric to calculate distance.\n/// - density: Max number of records per IVF cluster.\n#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq, Eq)]\npub struct Parameters {\n    pub dimension: usize,\n    pub metric: Metric,\n    pub density: usize,\n}\n\n/// Dynamic query-time parameters.\n///\n/// Fields:\n/// - probes: Suggested number of clusters to visit.\n/// - radius: Maximum distance to include in the result.\n#[derive(Debug, Clone, Copy, PartialEq)]\npub struct QueryParameters {\n    pub probes: usize,\n    pub radius: f32,\n}\n\nimpl Default for QueryParameters {\n    /// Default query parameters:\n    /// - probes: 32\n    /// - radius: ∞\n    fn default() -> Self {\n        QueryParameters { probes: 32, radius: f32::INFINITY }\n    }\n}\n\nimpl TryFrom<protos::QueryParameters> for QueryParameters {\n    type Error = Status;\n    fn try_from(value: protos::QueryParameters) -> Result<Self, Self::Error> {\n        Ok(QueryParameters {\n            probes: value.probes as usize,\n            radius: value.radius,\n        })\n    }\n}\n\n/// Database snapshot statistics.\n///\n/// The snapshot statistics include the information that might be useful\n/// for monitoring the state of the database. 
These stats will be returned\n/// by the `create_snapshot` method.\n#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]\npub struct SnapshotStats {\n    pub count: usize,\n}\n\nimpl From<SnapshotStats> for protos::SnapshotResponse {\n    fn from(value: SnapshotStats) -> Self {\n        protos::SnapshotResponse { count: value.count as i32 }\n    }\n}\n\n#[derive(Debug)]\npub struct Database {\n    dir: PathBuf,\n    params: Parameters,\n    index: RwLock<Index>,\n    storage: RwLock<Storage>,\n}\n\nimpl Database {\n    pub fn configure(params: &Parameters) {\n        let index = Index::new()\n            .with_metric(params.metric)\n            .with_density(params.density);\n\n        let db = Database {\n            dir: Self::dir(),\n            params: *params,\n            index: RwLock::new(index),\n            storage: RwLock::new(Storage::new()),\n        };\n\n        if db.dir.join(PARAMS_FILE).exists() {\n            let stdin = std::io::stdin();\n            let overwrite = {\n                eprint!(\"Database is already configured. Overwrite? 
(y/n): \");\n                let mut input = String::new();\n                stdin.read_line(&mut input).unwrap();\n                matches!(input.to_lowercase().trim(), \"y\")\n            };\n\n            if !overwrite {\n                return;\n            }\n\n            fs::remove_dir_all(&db.dir).expect(\"Failed to reset the database\");\n            println!(\"The database has been reset successfully\");\n        }\n\n        db.setup_dir().expect(\"Failed to setup database directory\");\n    }\n\n    pub fn open() -> Result<Self, Box<dyn Error>> {\n        let dir = Self::dir();\n        let params = Self::load_binary(dir.join(PARAMS_FILE))?;\n        let index = Self::load_binary(dir.join(INDEX_FILE))?;\n        let storage: Storage = Self::load_binary(dir.join(STORAGE_FILE))?;\n\n        let count = storage.count();\n        tracing::info!(\"Restored {count} record(s) from the disk\");\n\n        Ok(Database {\n            dir,\n            params,\n            index: RwLock::new(index),\n            storage: RwLock::new(storage),\n        })\n    }\n\n    fn dir() -> PathBuf {\n        match env::var(\"ODB_DIR\") {\n            Ok(dir) => PathBuf::from(dir),\n            Err(_) => PathBuf::from(\"oasysdb\"),\n        }\n    }\n\n    fn setup_dir(&self) -> Result<(), Box<dyn Error>> {\n        if self.dir.try_exists()? 
{\n            return Ok(());\n        }\n\n        fs::create_dir_all(&self.dir)?;\n        fs::create_dir_all(self.dir.join(TMP_DIR))?;\n\n        self.create_snapshot()?;\n        Ok(())\n    }\n\n    fn load_binary<T: DeserializeOwned>(\n        path: impl AsRef<Path>,\n    ) -> Result<T, Box<dyn Error>> {\n        let file = OpenOptions::new().read(true).open(path)?;\n        let reader = BufReader::new(file);\n        Ok(bincode::deserialize_from(reader)?)\n    }\n\n    fn persist_as_binary<T: Serialize>(\n        &self,\n        path: impl AsRef<Path>,\n        data: T,\n    ) -> Result<(), Box<dyn Error>> {\n        let file_name = path.as_ref().file_name().unwrap();\n        let tmp_file = self.dir.join(TMP_DIR).join(file_name);\n        let file = OpenOptions::new()\n            .write(true)\n            .create(true)\n            .truncate(true)\n            .open(&tmp_file)?;\n\n        let writer = BufWriter::new(file);\n        bincode::serialize_into(writer, &data)?;\n        fs::rename(&tmp_file, &path)?;\n        Ok(())\n    }\n\n    pub fn create_snapshot(&self) -> Result<SnapshotStats, Box<dyn Error>> {\n        self.persist_as_binary(self.dir.join(PARAMS_FILE), self.params)?;\n\n        let index = self.index.read().unwrap();\n        self.persist_as_binary(self.dir.join(INDEX_FILE), &*index)?;\n\n        let storage = self.storage.read().unwrap();\n        self.persist_as_binary(self.dir.join(STORAGE_FILE), &*storage)?;\n\n        let count = storage.count();\n        tracing::info!(\"Created a snapshot with {count} record(s)\");\n\n        Ok(SnapshotStats { count })\n    }\n\n    fn validate_dimension(&self, vector: &Vector) -> Result<(), Status> {\n        if vector.len() != self.params.dimension {\n            return Err(Status::invalid_argument(format!(\n                \"Invalid vector dimension: expected {}, got {}\",\n                self.params.dimension,\n                vector.len()\n            )));\n        }\n\n        Ok(())\n    
}\n}\n\n#[tonic::async_trait]\nimpl DatabaseService for Arc<Database> {\n    async fn heartbeat(\n        &self,\n        _request: Request<()>,\n    ) -> Result<Response<protos::HeartbeatResponse>, Status> {\n        let response = protos::HeartbeatResponse {\n            version: env!(\"CARGO_PKG_VERSION\").to_string(),\n        };\n\n        Ok(Response::new(response))\n    }\n\n    async fn snapshot(\n        &self,\n        _request: Request<()>,\n    ) -> Result<Response<protos::SnapshotResponse>, Status> {\n        let stats = self.create_snapshot().map_err(|e| {\n            let message = format!(\"Failed to create a snapshot: {e}\");\n            Status::internal(message)\n        })?;\n\n        Ok(Response::new(stats.into()))\n    }\n\n    async fn insert(\n        &self,\n        request: Request<protos::InsertRequest>,\n    ) -> Result<Response<protos::InsertResponse>, Status> {\n        let record = match request.into_inner().record {\n            Some(record) => Record::try_from(record)?,\n            None => {\n                let message = \"Record data is required for insertion\";\n                return Err(Status::invalid_argument(message));\n            }\n        };\n\n        self.validate_dimension(&record.vector)?;\n\n        let id = RecordID::new();\n\n        // Insert the record into the storage.\n        // This operation must be done before updating the index. 
Otherwise,\n        // the index won't have access to the record data.\n        let mut storage = self.storage.write().unwrap();\n        storage.insert(&id, &record)?;\n\n        let mut index = self.index.write().unwrap();\n        index.insert(&id, &record, storage.records())?;\n\n        tracing::info!(\"Inserted a new record with ID: {id}\");\n        Ok(Response::new(protos::InsertResponse { id: id.to_string() }))\n    }\n\n    async fn get(\n        &self,\n        request: Request<protos::GetRequest>,\n    ) -> Result<Response<protos::GetResponse>, Status> {\n        let request = request.into_inner();\n        let id = request.id.parse::<RecordID>()?;\n\n        let storage = self.storage.read().unwrap();\n        let record = storage.get(&id)?.to_owned();\n\n        let response = protos::GetResponse { record: Some(record.into()) };\n        Ok(Response::new(response))\n    }\n\n    async fn delete(\n        &self,\n        request: Request<protos::DeleteRequest>,\n    ) -> Result<Response<()>, Status> {\n        let request = request.into_inner();\n        let id = request.id.parse::<RecordID>()?;\n\n        let mut index = self.index.write().unwrap();\n        index.delete(&id)?;\n\n        let mut storage = self.storage.write().unwrap();\n        storage.delete(&id)?;\n\n        tracing::info!(\"Deleted a record with ID: {id}\");\n        Ok(Response::new(()))\n    }\n\n    async fn update(\n        &self,\n        request: Request<protos::UpdateRequest>,\n    ) -> Result<Response<()>, Status> {\n        let request = request.into_inner();\n        let id = request.id.parse::<RecordID>()?;\n\n        let mut metadata = HashMap::new();\n        for (key, value) in request.metadata {\n            metadata.insert(key, value.try_into()?);\n        }\n\n        let mut storage = self.storage.write().unwrap();\n        storage.update(&id, &metadata)?;\n\n        tracing::info!(\"Updated metadata for a record: {id}\");\n        Ok(Response::new(()))\n    
}\n\n    async fn query(\n        &self,\n        request: Request<protos::QueryRequest>,\n    ) -> Result<Response<protos::QueryResponse>, Status> {\n        let request = request.into_inner();\n        let vector = match request.vector {\n            Some(vector) => Vector::try_from(vector)?,\n            None => {\n                let message = \"Vector is required for query operation\";\n                return Err(Status::invalid_argument(message));\n            }\n        };\n\n        self.validate_dimension(&vector)?;\n\n        let k = request.k as usize;\n        if k == 0 {\n            let message = \"Invalid k value, k must be greater than 0\";\n            return Err(Status::invalid_argument(message));\n        }\n\n        let filter = Filters::try_from(request.filter.as_str())?;\n\n        let params = match request.params {\n            Some(params) => QueryParameters::try_from(params)?,\n            None => QueryParameters::default(),\n        };\n\n        let storage = self.storage.read().unwrap();\n        let records = storage.records();\n\n        let index = self.index.read().unwrap();\n        let results = index\n            .query(&vector, k, &filter, &params, records)?\n            .into_iter()\n            .map(Into::into)\n            .collect();\n\n        Ok(Response::new(protos::QueryResponse { results }))\n    }\n}\n\n#[cfg(test)]\nmod tests {\n    use super::*;\n    use uuid::Uuid;\n\n    #[test]\n    fn test_open() {\n        let db = setup_db();\n        assert_eq!(db.params, Parameters::default());\n    }\n\n    #[tokio::test]\n    async fn test_heartbeat() {\n        let db = setup_db();\n        let request = Request::new(());\n        let response = db.heartbeat(request).await.unwrap();\n        assert_eq!(response.get_ref().version, env!(\"CARGO_PKG_VERSION\"));\n    }\n\n    #[tokio::test]\n    async fn test_insert() {\n        let params = Parameters::default();\n        let db = setup_db();\n\n        let vector = 
Vector::random(params.dimension);\n        let request = Request::new(protos::InsertRequest {\n            record: Some(protos::Record {\n                vector: Some(vector.into()),\n                metadata: std::collections::HashMap::new(),\n            }),\n        });\n\n        let response = db.insert(request).await.unwrap();\n        assert!(response.get_ref().id.parse::<Uuid>().is_ok());\n        assert_eq!(db.storage.read().unwrap().records().len(), 1);\n    }\n\n    fn setup_db() -> Arc<Database> {\n        if Database::dir().exists() {\n            fs::remove_dir_all(Database::dir()).unwrap();\n        }\n\n        let params = Parameters::default();\n        Database::configure(&params);\n        Arc::new(Database::open().unwrap())\n    }\n\n    impl Default for Parameters {\n        fn default() -> Self {\n            Parameters {\n                dimension: 128,\n                metric: Metric::Euclidean,\n                density: 64,\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/cores/index.rs",
    "content": "use super::*;\nuse std::cmp::{min, Ordering};\nuse std::collections::BinaryHeap;\nuse std::rc::Rc;\n\ntype ClusterIndex = usize;\n\n/// ANNS search result containing the metadata of the record.\n///\n/// We exclude the vector data from the result because it doesn't provide\n/// any additional value on the search result. If users are interested in\n/// the vector data, they can use the get method to retrieve the record.\n#[derive(Debug, Clone)]\npub struct QueryResult {\n    pub id: RecordID,\n    pub metadata: HashMap<String, Value>,\n    pub distance: f32,\n}\n\nimpl Eq for QueryResult {}\n\nimpl PartialEq for QueryResult {\n    /// Compare two query results based on their IDs.\n    fn eq(&self, other: &Self) -> bool {\n        self.id == other.id\n    }\n}\n\nimpl Ord for QueryResult {\n    fn cmp(&self, other: &Self) -> Ordering {\n        self.distance.partial_cmp(&other.distance).unwrap_or(Ordering::Equal)\n    }\n}\n\nimpl PartialOrd for QueryResult {\n    /// Allow the query results to be sorted based on their distance.\n    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {\n        Some(self.cmp(other))\n    }\n}\n\nimpl From<QueryResult> for protos::QueryResult {\n    fn from(value: QueryResult) -> Self {\n        let metadata = value\n            .metadata\n            .into_iter()\n            .map(|(key, value)| (key, value.into()))\n            .collect();\n\n        protos::QueryResult {\n            id: value.id.to_string(),\n            metadata,\n            distance: value.distance,\n        }\n    }\n}\n\n/// ANNS Index interface.\n///\n/// OasysDB uses a modified version of IVF index algorithm. 
This custom index\n/// implementation allows OasysDB to maintain a balanced index structure,\n/// letting clusters grow to accommodate data growth.\n#[repr(C)]\n#[derive(Debug, Serialize, Deserialize)]\npub struct Index {\n    centroids: Vec<Vector>,\n    clusters: Vec<Vec<RecordID>>,\n\n    // Index parameters.\n    metric: Metric,\n    density: usize,\n}\n\nimpl Index {\n    /// Create a new index instance with default parameters.\n    ///\n    /// Default parameters:\n    /// - metric: Euclidean\n    /// - density: 256\n    pub fn new() -> Self {\n        Index {\n            centroids: vec![],\n            clusters: vec![],\n            metric: Metric::Euclidean,\n            density: 256,\n        }\n    }\n\n    /// Configure the metric used for distance calculations.\n    pub fn with_metric(mut self, metric: Metric) -> Self {\n        self.metric = metric;\n        self\n    }\n\n    /// Configure the density of the index.\n    pub fn with_density(mut self, density: usize) -> Self {\n        self.density = density;\n        self\n    }\n\n    /// Insert a new record into the index.\n    ///\n    /// This method requires a reference to all the records because\n    /// the record assignments will be recalculated during the cluster\n    /// splitting process.\n    pub fn insert(\n        &mut self,\n        id: &RecordID,\n        record: &Record,\n        records: &HashMap<RecordID, Record>,\n    ) -> Result<(), Status> {\n        let vector = &record.vector;\n        let nearest_centroid = self.find_nearest_centroid(vector);\n\n        // If the index is empty, the record's vector will be\n        // the first centroid.\n        if nearest_centroid.is_none() {\n            let cluster_id = self.insert_centroid(vector);\n            self.clusters[cluster_id].push(*id);\n            return Ok(());\n        }\n\n        let nearest_centroid = nearest_centroid.unwrap();\n        if self.clusters[nearest_centroid].len() < self.density {\n            
self.update_centroid(&nearest_centroid, vector);\n            self.clusters[nearest_centroid].push(*id);\n        } else {\n            // If the cluster is full, insert the record into the cluster\n            // and split the cluster with KMeans algorithm.\n            self.clusters[nearest_centroid].push(*id);\n            self.split_cluster(&nearest_centroid, records);\n        }\n\n        Ok(())\n    }\n\n    /// Delete a record from the index by its ID.\n    ///\n    /// This method will iterate over all the clusters and remove the record\n    /// from the cluster if it exists. This method doesn't update the value of\n    /// the cluster's centroid.\n    pub fn delete(&mut self, id: &RecordID) -> Result<(), Status> {\n        // Find the cluster and record indices where the record is stored.\n        let cluster_record_index =\n            self.clusters.iter().enumerate().find_map(|(i, cluster)| {\n                cluster.par_iter().position_first(|x| x == id).map(|x| (i, x))\n            });\n\n        if let Some((cluster_ix, record_ix)) = cluster_record_index {\n            // If the cluster has only one record, remove the cluster and\n            // centroid from the index. This won't happen often.\n            if self.clusters[cluster_ix].len() == 1 {\n                self.clusters.remove(cluster_ix);\n                self.centroids.remove(cluster_ix);\n            } else {\n                self.clusters[cluster_ix].remove(record_ix);\n            }\n        }\n\n        Ok(())\n    }\n\n    /// Search for the nearest neighbors of a given vector.\n    ///\n    /// This method uses the IVF search algorithm to find the nearest neighbors\n    /// of the query vector. 
The filtering process of the search is done within\n    /// the boundaries of the nearest clusters to the query vector.\n    pub fn query(\n        &self,\n        vector: &Vector,\n        k: usize,\n        filters: &Filters,\n        params: &QueryParameters,\n        records: &HashMap<RecordID, Record>,\n    ) -> Result<Vec<QueryResult>, Status> {\n        let QueryParameters { probes, radius } = params.to_owned();\n        let probes = min(probes, self.centroids.len());\n\n        let nearest_clusters = self.sort_nearest_centroids(vector);\n        let mut results = BinaryHeap::new();\n\n        for cluster_id in nearest_clusters.iter().take(probes) {\n            for record_id in &self.clusters[*cluster_id] {\n                let record = match records.get(record_id) {\n                    Some(record) => record,\n                    None => continue,\n                };\n\n                let distance = self.metric.distance(&record.vector, vector);\n                let distance = match distance {\n                    Some(distance) => distance as f32,\n                    None => continue,\n                };\n\n                // Check if the record is within the search radius and\n                // the record's metadata passes the filters.\n                if distance > radius || !filters.apply(&record.metadata) {\n                    continue;\n                }\n\n                results.push(QueryResult {\n                    id: *record_id,\n                    metadata: record.metadata.clone(),\n                    distance,\n                });\n\n                if results.len() > k {\n                    results.pop();\n                }\n            }\n        }\n\n        Ok(results.into_sorted_vec())\n    }\n\n    /// Insert a new centroid and cluster into the index.\n    /// - vector: Centroid vector.\n    fn insert_centroid(&mut self, vector: &Vector) -> ClusterIndex {\n        self.centroids.push(vector.to_owned());\n        
self.clusters.push(vec![]);\n        self.centroids.len() - 1\n    }\n\n    /// Recalculate the centroid of a cluster with the new vector.\n    ///\n    /// This method must be called before inserting the new vector into the\n    /// cluster because this method calculates the new centroid by taking the\n    /// weighted average of the current centroid and adding the new vector\n    /// before normalizing the result with the new cluster size.\n    fn update_centroid(&mut self, cluster_id: &ClusterIndex, vector: &Vector) {\n        let count = self.clusters[*cluster_id].len() as f32;\n        self.centroids[*cluster_id] = self.centroids[*cluster_id]\n            .as_slice()\n            .iter()\n            .zip(vector.as_slice())\n            // Scale the centroid back up by the old count, add the new\n            // vector, then normalize by the new cluster size.\n            .map(|(a, b)| ((a * count) + b) / (count + 1.0))\n            .collect::<Vec<f32>>()\n            .into();\n    }\n\n    /// Find the nearest centroid to a given vector.\n    ///\n    /// If the index is empty, this method will return None. Otherwise, it will\n    /// calculate the distance between the given vector and all centroids and\n    /// return the index of the centroid with the smallest distance.\n    fn find_nearest_centroid(&self, vector: &Vector) -> Option<ClusterIndex> {\n        self.centroids\n            .par_iter()\n            .map(|centroid| self.metric.distance(centroid, vector))\n            .enumerate()\n            // Treat incomparable distances as greater so they never win.\n            .min_by(|(_, a), (_, b)| {\n                a.partial_cmp(b).unwrap_or(Ordering::Greater)\n            })\n            .map(|(index, _)| index)\n    }\n\n    /// Sort the centroids by their distance to a given vector.\n    ///\n    /// This method returns an array of cluster indices sorted by their\n    /// distance to the vector. 
The first element will be the index of the\n    /// nearest centroid.\n    fn sort_nearest_centroids(&self, vector: &Vector) -> Vec<ClusterIndex> {\n        let mut distances = self\n            .centroids\n            .par_iter()\n            .enumerate()\n            .map(|(i, centroid)| (i, self.metric.distance(centroid, vector)))\n            .collect::<Vec<(usize, Option<f64>)>>();\n\n        // Sort the distances in ascending order. If the distance is NaN or\n        // something else, it will be placed at the end.\n        distances.sort_by(|(_, a), (_, b)| {\n            a.partial_cmp(b).unwrap_or(Ordering::Greater)\n        });\n\n        distances.iter().map(|(i, _)| *i).collect()\n    }\n\n    /// Split a cluster into two new clusters.\n    ///\n    /// The current cluster will be halved. The first half will be assigned to\n    /// the current cluster, and the second half will be assigned to a new\n    /// cluster with a new centroid.\n    fn split_cluster(\n        &mut self,\n        cluster_id: &ClusterIndex,\n        records: &HashMap<RecordID, Record>,\n    ) {\n        let record_ids = &self.clusters[*cluster_id];\n        let vectors = record_ids\n            .iter()\n            .map(|id| &records.get(id).unwrap().vector)\n            .collect::<Vec<&Vector>>();\n\n        let mut kmeans = KMeans::new(2).with_metric(self.metric);\n        kmeans.fit(Rc::from(vectors)).unwrap();\n\n        let centroids = kmeans.centroids();\n        self.centroids[*cluster_id] = centroids[0].to_owned();\n        self.centroids.push(centroids[1].to_owned());\n\n        let mut clusters = [vec![], vec![]];\n        let assignments = kmeans.assignments();\n        for (i, cluster_id) in assignments.iter().enumerate() {\n            clusters[*cluster_id].push(record_ids[i]);\n        }\n\n        self.clusters[*cluster_id] = clusters[0].to_vec();\n        self.clusters.push(clusters[1].to_vec());\n    }\n}\n\n#[cfg(test)]\nmod tests {\n    use super::*;\n\n    
#[test]\n    fn test_insert_many() {\n        let params = Parameters::default();\n        let mut index = setup_index(&params);\n\n        let mut records = HashMap::new();\n        for _ in 0..1000 {\n            let id = RecordID::new();\n            let record = Record::random(params.dimension);\n            records.insert(id, record);\n        }\n\n        for (id, record) in records.iter() {\n            index.insert(id, record, &records).unwrap();\n        }\n\n        assert!(index.centroids.len() > 20);\n    }\n\n    #[test]\n    fn test_delete() {\n        let params = Parameters::default();\n        let mut index = setup_index(&params);\n\n        let mut ids = vec![];\n        for _ in 0..10 {\n            let centroid = Vector::random(params.dimension);\n            let mut cluster = vec![];\n            for _ in 0..10 {\n                let id = RecordID::new();\n                cluster.push(id);\n                ids.push(id);\n            }\n\n            index.centroids.push(centroid);\n            index.clusters.push(cluster);\n        }\n\n        assert_eq!(ids.len(), 100);\n        assert_eq!(index.centroids.len(), 10);\n\n        index.delete(&ids[0]).unwrap();\n        for cluster in index.clusters.iter() {\n            assert!(!cluster.contains(&ids[0]));\n        }\n\n        for i in 1..10 {\n            index.delete(&ids[i]).unwrap();\n        }\n\n        assert_eq!(index.centroids.len(), 9);\n    }\n\n    #[test]\n    fn test_query() {\n        let params = Parameters::default();\n        let mut index = setup_index(&params);\n\n        // Populate the index with 1000 sequential records.\n        // This allows us to predict the order of the results.\n        let mut ids = vec![];\n        let mut records = HashMap::new();\n        for i in 0..1000 {\n            let id = RecordID::new();\n            let vector = Vector::from(vec![i as f32; params.dimension]);\n\n            let mut metadata = HashMap::new();\n            let value = 
Value::Number((1000 + i) as f64);\n            metadata.insert(\"number\".to_string(), value);\n\n            let record = Record { vector, metadata };\n            records.insert(id, record);\n            ids.push(id);\n        }\n\n        for (id, record) in records.iter() {\n            index.insert(id, record, &records).unwrap();\n        }\n\n        let query = Vector::from(vec![1.0; params.dimension]);\n        let query_params = QueryParameters::default();\n        let result = index\n            .query(&query, 10, &Filters::None, &query_params, &records)\n            .unwrap();\n\n        assert_eq!(result.len(), 10);\n        assert!(result.iter().any(|r| r.id == ids[0]));\n\n        let metadata_filters = Filters::try_from(\"number > 1050\").unwrap();\n        let result = index\n            .query(&query, 10, &metadata_filters, &query_params, &records)\n            .unwrap();\n\n        assert_eq!(result.len(), 10);\n        assert!(result.iter().any(|r| r.id == ids[51]));\n    }\n\n    #[test]\n    fn test_insert_centroid() {\n        let params = Parameters::default();\n        let mut index = setup_index(&params);\n\n        let vector = Vector::random(params.dimension);\n        let cluster_id = index.insert_centroid(&vector);\n\n        assert_eq!(index.centroids.len(), 1);\n        assert_eq!(index.clusters.len(), 1);\n\n        assert_eq!(index.centroids[0], vector);\n        assert_eq!(cluster_id, 0);\n    }\n\n    #[test]\n    fn test_update_centroid() {\n        let params = Parameters::default();\n        let mut index = setup_index(&params);\n\n        let initial_centroid = Vector::from(vec![0.0; params.dimension]);\n        let cluster_id = index.insert_centroid(&initial_centroid);\n        index.clusters[cluster_id].push(RecordID::new());\n\n        let vector = Vector::from(vec![1.0; params.dimension]);\n        index.update_centroid(&cluster_id, &vector);\n\n        let centroid = Vector::from(vec![0.5; params.dimension]);\n        
assert_eq!(index.centroids[cluster_id], centroid);\n    }\n\n    #[test]\n    fn test_find_nearest_centroid_empty() {\n        let params = Parameters::default();\n        let index = setup_index(&params);\n\n        let query = Vector::random(params.dimension);\n        assert_eq!(index.find_nearest_centroid(&query), None);\n    }\n\n    #[test]\n    fn test_find_nearest_centroid() {\n        let params = Parameters::default();\n        let mut index = setup_index(&params);\n\n        for i in 1..5 {\n            let centroid = Vector::from(vec![i as f32; params.dimension]);\n            index.centroids.push(centroid);\n        }\n\n        let query = Vector::from(vec![0.0; params.dimension]);\n        assert_eq!(index.find_nearest_centroid(&query), Some(0));\n    }\n\n    #[test]\n    fn test_split_cluster() {\n        let params = Parameters::default();\n        let mut index = setup_index(&params);\n\n        let mut ids = vec![];\n        let mut records = HashMap::new();\n        for i in 1..5 {\n            let id = RecordID::new();\n            let vector = Vector::from(vec![i as f32; params.dimension]);\n            let record = Record { vector, metadata: HashMap::new() };\n\n            ids.push(id);\n            records.insert(id, record);\n        }\n\n        let centroid = Vector::from(vec![2.5; params.dimension]);\n        index.centroids.push(centroid);\n        index.clusters.push(ids);\n\n        index.split_cluster(&0, &records);\n        assert_eq!(index.centroids.len(), 2);\n    }\n\n    #[test]\n    fn test_sort_nearest_centroids() {\n        let params = Parameters::default();\n        let mut index = setup_index(&params);\n\n        for i in 1..5 {\n            let centroid = Vector::from(vec![i as f32; params.dimension]);\n            index.centroids.push(centroid);\n        }\n\n        let query = Vector::from(vec![5.0; params.dimension]);\n        let nearest = index.sort_nearest_centroids(&query);\n        assert_eq!(nearest, vec![3, 
2, 1, 0]);\n    }\n\n    fn setup_index(params: &Parameters) -> Index {\n        let index = Index::new()\n            .with_metric(params.metric)\n            .with_density(params.density);\n\n        index\n    }\n}\n"
  },
  {
    "path": "src/cores/mod.rs",
    "content": "// Initialize the modules without making them public.\nmod database;\nmod index;\nmod storage;\n\n// Re-export types from the modules.\npub use database::*;\npub use index::*;\npub use storage::*;\n\n// Import common dependencies below.\nuse crate::protos;\nuse crate::types::*;\nuse crate::utils::kmeans::KMeans;\nuse hashbrown::HashMap;\nuse rayon::prelude::*;\nuse serde::de::DeserializeOwned;\nuse serde::{Deserialize, Serialize};\nuse std::error::Error;\nuse std::fs::OpenOptions;\nuse std::path::{Path, PathBuf};\nuse std::sync::{Arc, RwLock};\nuse std::{env, fs};\nuse tonic::Status;\n"
  },
  {
    "path": "src/cores/storage.rs",
"content": "use super::*;\n\n/// Record storage interface.\n///\n/// This interface wraps around Hashbrown's HashMap implementation to store\n/// the records. In the future, if needed, we can modify the storage\n/// implementation without changing the rest of the code.\n#[repr(C)]\n#[derive(Debug, Serialize, Deserialize)]\npub struct Storage {\n    count: usize,\n    records: HashMap<RecordID, Record>,\n}\n\nimpl Storage {\n    /// Create a new empty storage instance.\n    pub fn new() -> Self {\n        Storage { count: 0, records: HashMap::new() }\n    }\n\n    /// Insert a new record into the record storage.\n    pub fn insert(\n        &mut self,\n        id: &RecordID,\n        record: &Record,\n    ) -> Result<(), Status> {\n        // Only increment the count for a new ID. Inserting an existing\n        // ID replaces the record in place.\n        if self.records.insert(*id, record.to_owned()).is_none() {\n            self.count += 1;\n        }\n\n        Ok(())\n    }\n\n    /// Retrieve a record from the storage given its ID.\n    pub fn get(&self, id: &RecordID) -> Result<&Record, Status> {\n        let record = self.records.get(id);\n        if record.is_none() {\n            let message = \"The specified record is not found\";\n            return Err(Status::not_found(message));\n        }\n\n        Ok(record.unwrap())\n    }\n\n    /// Delete a record from the storage given its ID.\n    pub fn delete(&mut self, id: &RecordID) -> Result<(), Status> {\n        // Only decrement the count when a record is actually removed\n        // to avoid an underflow on missing IDs.\n        if self.records.remove(id).is_some() {\n            self.count -= 1;\n        }\n\n        Ok(())\n    }\n\n    /// Update a record's metadata given its ID.\n    ///\n    /// Vector data should be immutable as it is tightly coupled with the\n    /// semantic meaning of the record. 
If the vector data changes, users\n    /// should create a new record instead.\n    pub fn update(\n        &mut self,\n        id: &RecordID,\n        metadata: &HashMap<String, Value>,\n    ) -> Result<(), Status> {\n        let record = match self.records.get_mut(id) {\n            Some(record) => record,\n            None => {\n                let message = \"The specified record is not found\";\n                return Err(Status::not_found(message));\n            }\n        };\n\n        record.metadata = metadata.to_owned();\n        Ok(())\n    }\n\n    /// Return a reference to the records in the storage.\n    pub fn records(&self) -> &HashMap<RecordID, Record> {\n        &self.records\n    }\n\n    /// Return the number of records in the storage.\n    pub fn count(&self) -> usize {\n        self.count\n    }\n}\n\n#[cfg(test)]\nmod tests {\n    use super::*;\n\n    #[test]\n    fn test_insert() {\n        let mut storage = Storage::new();\n\n        let record = Record::random(128);\n        let id = RecordID::new();\n        storage.insert(&id, &record).unwrap();\n\n        assert_eq!(storage.count, 1);\n        assert_eq!(storage.count, storage.records.len());\n    }\n\n    #[test]\n    fn test_delete() {\n        let mut storage = Storage::new();\n\n        let record = Record::random(128);\n        let id = RecordID::new();\n        storage.insert(&id, &record).unwrap();\n\n        storage.delete(&id).unwrap();\n        assert_eq!(storage.count, 0);\n        assert_eq!(storage.count, storage.records.len());\n    }\n\n    #[test]\n    fn test_update() {\n        let mut storage = Storage::new();\n\n        let record = Record::random(128);\n        let id = RecordID::new();\n        storage.insert(&id, &record).unwrap();\n\n        let mut metadata = HashMap::new();\n        metadata.insert(\"key\".to_string(), Value::random());\n        storage.update(&id, &metadata).unwrap();\n\n        let updated_record = storage.records.get(&id).unwrap();\n        
assert_eq!(updated_record.metadata, metadata);\n    }\n}\n"
  },
  {
    "path": "src/main.rs",
    "content": "mod cores;\nmod protos;\nmod types;\nmod utils;\n\nuse clap::{arg, ArgMatches, Command};\nuse cores::{Database, Parameters};\nuse dotenv::dotenv;\nuse protos::database_server::DatabaseServer;\nuse std::sync::Arc;\nuse std::thread;\nuse std::time::Duration;\nuse tonic::transport::Server;\nuse types::Metric;\n\nconst SNAPSHOT_INTERVAL: Duration = Duration::from_secs(600);\n\n#[tokio::main]\nasync fn main() {\n    dotenv().ok();\n    tracing_subscriber::fmt::init();\n\n    let command = Command::new(env!(\"CARGO_PKG_NAME\"))\n        .version(env!(\"CARGO_PKG_VERSION\"))\n        .about(\"Interface to setup and manage OasysDB server\")\n        .arg_required_else_help(true)\n        .subcommand(start())\n        .subcommand(configure())\n        .get_matches();\n\n    match command.subcommand() {\n        Some((\"start\", args)) => start_handler(args).await,\n        Some((\"configure\", args)) => configure_handler(args).await,\n        _ => unreachable!(),\n    }\n}\n\nfn start() -> Command {\n    let arg_port = arg!(--port <port> \"Port to listen on\")\n        .default_value(\"2505\")\n        .value_parser(clap::value_parser!(u16))\n        .allow_negative_numbers(false);\n\n    Command::new(\"start\")\n        .alias(\"run\")\n        .about(\"Start the database server\")\n        .arg(arg_port)\n}\n\nasync fn start_handler(args: &ArgMatches) {\n    // Unwrap is safe because Clap validates the arguments.\n    let port = args.get_one::<u16>(\"port\").unwrap();\n    let addr = format!(\"[::]:{port}\").parse().unwrap();\n\n    let db = Arc::new(Database::open().expect(\"Failed to open the database\"));\n\n    let db_clone = db.clone();\n    thread::spawn(move || loop {\n        thread::sleep(SNAPSHOT_INTERVAL);\n        db_clone.create_snapshot().expect(\"Failed to create a snapshot\");\n    });\n\n    tracing::info!(\"Database server is ready on port {port}\");\n\n    Server::builder()\n        .add_service(DatabaseServer::new(db))\n        
.serve(addr)\n        .await\n        .expect(\"Failed to start the database\");\n}\n\nfn configure() -> Command {\n    let arg_dimension = arg!(--dim <dimension> \"Vector dimension\")\n        .required(true)\n        .value_parser(clap::value_parser!(usize))\n        .allow_negative_numbers(false);\n\n    // List optional arguments below.\n    let arg_metric = arg!(--metric <metric> \"Metric to calculate distance\")\n        .default_value(Metric::Euclidean.as_str())\n        .value_parser(clap::value_parser!(Metric));\n\n    let arg_density = arg!(--density <density> \"Density of the cluster\")\n        .default_value(\"256\")\n        .value_parser(clap::value_parser!(usize))\n        .allow_negative_numbers(false);\n\n    Command::new(\"configure\")\n        .about(\"Configure the initial database parameters\")\n        .arg(arg_dimension)\n        .arg(arg_metric)\n        .arg(arg_density)\n}\n\nasync fn configure_handler(args: &ArgMatches) {\n    let dim = *args.get_one::<usize>(\"dim\").unwrap();\n    let metric = *args.get_one::<Metric>(\"metric\").unwrap();\n    let density = *args.get_one::<usize>(\"density\").unwrap();\n\n    let params = Parameters { dimension: dim, metric, density };\n    Database::configure(&params);\n}\n"
  },
  {
    "path": "src/protos.rs",
    "content": "#![allow(clippy::all)]\n#![allow(non_snake_case)]\ntonic::include_proto!(\"database\");\n"
  },
  {
    "path": "src/types/filter.rs",
"content": "use super::*;\n\n/// Multiple filters joined with either AND or OR.\n///\n/// At the moment, OasysDB only supports single-type join operations. This\n/// means that we can't use both AND and OR operations in the same filter.\n#[derive(Debug, Clone, PartialEq, PartialOrd)]\npub enum Filters {\n    None,\n    And(Vec<Filter>),\n    Or(Vec<Filter>),\n}\n\nimpl Filters {\n    /// Returns true if the record passes the filters.\n    /// - metadata: Record metadata to check against the filters.\n    ///\n    /// Filters of the None type will always return true. This is useful when\n    /// no filters are provided and we want to include all records.\n    pub fn apply(&self, metadata: &HashMap<String, Value>) -> bool {\n        match self {\n            Filters::None => true,\n            Filters::And(filters) => filters.iter().all(|f| f.apply(metadata)),\n            Filters::Or(filters) => filters.iter().any(|f| f.apply(metadata)),\n        }\n    }\n}\n\nimpl TryFrom<&str> for Filters {\n    type Error = Status;\n    fn try_from(value: &str) -> Result<Self, Self::Error> {\n        if value.is_empty() {\n            return Ok(Filters::None);\n        }\n\n        const OR: &str = \" OR \";\n        const AND: &str = \" AND \";\n\n        // Check which join operator is used.\n        let or_count = value.matches(OR).count();\n        let and_count = value.matches(AND).count();\n\n        if or_count > 0 && and_count > 0 {\n            let message = \"Mixing AND and OR join operators is not supported\";\n            return Err(Status::invalid_argument(message));\n        }\n\n        let join = if or_count > 0 { OR } else { AND };\n        let filters = value\n            .split(join)\n            .map(TryInto::try_into)\n            .collect::<Result<_, _>>()?;\n\n        let filters = match join {\n            OR => Filters::Or(filters),\n            _ => Filters::And(filters),\n        };\n\n        Ok(filters)\n    }\n}\n\n/// Record metadata 
filter.\n///\n/// Using the filter operator, the record metadata can be compared against\n/// a specific value to determine if it should be included in the results.\n#[derive(Debug, Clone, PartialEq, PartialOrd)]\npub struct Filter {\n    key: String,\n    value: Value,\n    operator: Operator,\n}\n\nimpl Filter {\n    fn apply(&self, metadata: &HashMap<String, Value>) -> bool {\n        let value = match metadata.get(&self.key) {\n            Some(value) => value,\n            None => return false,\n        };\n\n        match (value, &self.value) {\n            (Value::Text(a), Value::Text(b)) => self.filter_text(a, b),\n            (Value::Number(a), Value::Number(b)) => self.filter_number(a, b),\n            (Value::Boolean(a), Value::Boolean(b)) => self.filter_boolean(a, b),\n            _ => false,\n        }\n    }\n\n    fn filter_text(&self, a: impl AsRef<str>, b: impl AsRef<str>) -> bool {\n        let (a, b) = (a.as_ref(), b.as_ref());\n        match self.operator {\n            Operator::Equal => a == b,\n            Operator::NotEqual => a != b,\n            Operator::Contains => a.contains(b),\n            _ => false,\n        }\n    }\n\n    fn filter_number(&self, a: &f64, b: &f64) -> bool {\n        match self.operator {\n            Operator::Equal => a == b,\n            Operator::NotEqual => a != b,\n            Operator::GreaterThan => a > b,\n            Operator::GreaterThanOrEqual => a >= b,\n            Operator::LessThan => a < b,\n            Operator::LessThanOrEqual => a <= b,\n            _ => false,\n        }\n    }\n\n    fn filter_boolean(&self, a: &bool, b: &bool) -> bool {\n        match self.operator {\n            Operator::Equal => a == b,\n            Operator::NotEqual => a != b,\n            _ => false,\n        }\n    }\n}\n\nimpl TryFrom<&str> for Filter {\n    type Error = Status;\n    fn try_from(value: &str) -> Result<Self, Self::Error> {\n        if value.is_empty() {\n            let message = \"Filter string cannot 
be empty\";\n            return Err(Status::invalid_argument(message));\n        }\n\n        // Split the filter string into EXACTLY 3 parts.\n        let parts = value\n            .splitn(3, ' ')\n            .map(|token| token.trim())\n            .collect::<Vec<&str>>();\n\n        // Reject malformed filters early instead of panicking on the\n        // indexing below.\n        if parts.len() != 3 {\n            let message = \"Filter must be in the form: key operator value\";\n            return Err(Status::invalid_argument(message));\n        }\n\n        let key = parts[0].to_string();\n        let operator = Operator::try_from(parts[1])?;\n        let value = Value::from(parts[2]);\n\n        let filter = Filter { key, value, operator };\n        Ok(filter)\n    }\n}\n\n#[derive(Debug, Clone, Copy, Eq, PartialEq, PartialOrd)]\npub enum Operator {\n    Equal,\n    NotEqual,\n    GreaterThan,\n    GreaterThanOrEqual,\n    LessThan,\n    LessThanOrEqual,\n    Contains,\n}\n\nimpl TryFrom<&str> for Operator {\n    type Error = Status;\n    fn try_from(value: &str) -> Result<Self, Self::Error> {\n        let operator = match value {\n            \"CONTAINS\" => Operator::Contains,\n            \"=\" => Operator::Equal,\n            \"!=\" => Operator::NotEqual,\n            \">\" => Operator::GreaterThan,\n            \">=\" => Operator::GreaterThanOrEqual,\n            \"<\" => Operator::LessThan,\n            \"<=\" => Operator::LessThanOrEqual,\n            _ => {\n                let message = format!(\"Invalid filter operator: {value}\");\n                return Err(Status::invalid_argument(message));\n            }\n        };\n\n        Ok(operator)\n    }\n}\n\n#[cfg(test)]\nmod tests {\n    use super::*;\n    use std::error::Error;\n\n    #[test]\n    fn test_filters_from_string() {\n        let filters = Filters::try_from(\"name CONTAINS Ada\").unwrap();\n        let expected = Filters::And(vec![Filter {\n            key: \"name\".into(),\n            value: \"Ada\".into(),\n            operator: Operator::Contains,\n        }]);\n\n        assert_eq!(filters, expected);\n\n        let filters = Filters::try_from(\"gpa >= 3.0 OR age < 21\").unwrap();\n        let expected = {\n            let filter_gpa = Filter {\n            
    key: \"gpa\".into(),\n                value: Value::Number(3.0),\n                operator: Operator::GreaterThanOrEqual,\n            };\n\n            let filter_age = Filter {\n                key: \"age\".into(),\n                value: Value::Number(21.0),\n                operator: Operator::LessThan,\n            };\n\n            Filters::Or(vec![filter_gpa, filter_age])\n        };\n\n        assert_eq!(filters, expected);\n    }\n\n    #[test]\n    fn test_filters_apply() -> Result<(), Box<dyn Error>> {\n        let data = setup_metadata();\n\n        let filters = Filters::try_from(\"name CONTAINS Alice\")?;\n        assert!(filters.apply(&data));\n\n        let filters = Filters::try_from(\"name = Bob\")?;\n        assert!(!filters.apply(&data));\n\n        let filters = Filters::try_from(\"age >= 20 AND gpa < 4.0\")?;\n        assert!(filters.apply(&data));\n\n        let filters = Filters::try_from(\"age >= 20 AND gpa < 3.0\")?;\n        assert!(!filters.apply(&data));\n\n        let filters = Filters::try_from(\"active = true\")?;\n        assert!(filters.apply(&data));\n\n        Ok(())\n    }\n\n    fn setup_metadata() -> HashMap<String, Value> {\n        let keys = vec![\"name\", \"age\", \"gpa\", \"active\"];\n        let values: Vec<Value> = vec![\n            \"Alice\".into(),\n            Value::Number(20.0),\n            Value::Number(3.5),\n            Value::Boolean(true),\n        ];\n\n        let mut data = HashMap::new();\n        for (key, value) in keys.into_iter().zip(values.into_iter()) {\n            data.insert(key.into(), value);\n        }\n\n        data\n    }\n}\n"
  },
  {
    "path": "src/types/metric.rs",
    "content": "use super::*;\nuse simsimd::SpatialSimilarity;\n\n// Distance name constants.\nconst EUCLIDEAN: &str = \"euclidean\";\nconst COSINE: &str = \"cosine\";\n\n/// Distance formula for vector similarity calculations.\n///\n/// ### Euclidean\n/// We use the squared Euclidean distance instead for a slight performance\n/// boost since we only use the distance for comparison.\n///\n/// ### Cosine\n/// We use cosine distance instead of cosine similarity to be consistent with\n/// other distance metrics where a lower value indicates a closer match.\n#[allow(missing_docs)]\n#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq, Eq)]\npub enum Metric {\n    Euclidean,\n    Cosine,\n}\n\nimpl Metric {\n    /// Calculate the distance between two vectors.\n    pub fn distance(&self, a: &Vector, b: &Vector) -> Option<f64> {\n        let (a, b) = (a.as_slice(), b.as_slice());\n        match self {\n            Metric::Euclidean => f32::sqeuclidean(a, b),\n            Metric::Cosine => f32::cosine(a, b),\n        }\n    }\n\n    /// Return the metric name as a string slice.\n    pub fn as_str(&self) -> &str {\n        match self {\n            Metric::Euclidean => EUCLIDEAN,\n            Metric::Cosine => COSINE,\n        }\n    }\n}\n\nimpl From<&str> for Metric {\n    fn from(value: &str) -> Self {\n        let value = value.to_lowercase();\n        match value.as_str() {\n            COSINE => Metric::Cosine,\n            EUCLIDEAN => Metric::Euclidean,\n            _ => panic!(\"Metric should be cosine or euclidean\"),\n        }\n    }\n}\n\nimpl From<String> for Metric {\n    fn from(value: String) -> Self {\n        Metric::from(value.as_str())\n    }\n}\n\n#[cfg(test)]\nmod tests {\n    use super::*;\n\n    #[test]\n    fn test_distance() {\n        let a = Vector::from(vec![1.0, 2.0, 3.0]);\n        let b = Vector::from(vec![4.0, 5.0, 6.0]);\n\n        let euclidean = Metric::Euclidean.distance(&a, &b).unwrap();\n        let cosine = 
Metric::Cosine.distance(&a, &b).unwrap();\n\n        assert_eq!(euclidean, 27.0);\n        assert_eq!(cosine.round(), 0.0);\n    }\n}\n"
  },
  {
    "path": "src/types/mod.rs",
    "content": "// Initialize modules without publicizing them.\nmod filter;\nmod metric;\nmod record;\nmod vector;\n\n// Re-export types from the modules.\npub use filter::*;\npub use metric::*;\npub use record::*;\npub use vector::*;\n\n// Import common dependencies below.\nuse crate::protos;\nuse hashbrown::HashMap;\nuse serde::{Deserialize, Serialize};\nuse tonic::Status;\n"
  },
  {
    "path": "src/types/record.rs",
    "content": "use super::*;\nuse std::fmt;\nuse std::str::FromStr;\nuse uuid::Uuid;\n\n/// Record identifier.\n///\n/// OasysDB should be able to deal with a lot of writes and deletes. Using UUID\n/// version 4 to allow us to generate a lot of IDs with very low probability\n/// of collision.\n#[derive(Debug, Serialize, Deserialize, Clone, Copy)]\n#[derive(PartialOrd, Ord, PartialEq, Eq, Hash)]\npub struct RecordID(Uuid);\n\nimpl RecordID {\n    /// Generate a new random record ID using UUID v4.\n    pub fn new() -> Self {\n        RecordID(Uuid::new_v4())\n    }\n}\n\nimpl fmt::Display for RecordID {\n    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {\n        write!(f, \"{}\", self.0)\n    }\n}\n\nimpl FromStr for RecordID {\n    type Err = Status;\n    fn from_str(s: &str) -> Result<Self, Self::Err> {\n        Ok(RecordID(Uuid::try_parse(s).map_err(|_| {\n            let message = \"Record ID should be a string-encoded UUID\";\n            Status::invalid_argument(message)\n        })?))\n    }\n}\n\n/// Metadata value.\n///\n/// OasysDB doesn't support nested objects in metadata for performance reasons.\n/// We only need to support primitive types for metadata.\n#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, PartialOrd)]\npub enum Value {\n    Text(String),\n    Number(f64),\n    Boolean(bool),\n}\n\nimpl From<String> for Value {\n    fn from(value: String) -> Self {\n        Value::from(value.as_str())\n    }\n}\n\nimpl From<&str> for Value {\n    fn from(value: &str) -> Self {\n        // Try to parse the value as a number.\n        // This is must be prioritized over boolean parsing.\n        if let Ok(float) = value.parse::<f64>() {\n            return Value::Number(float);\n        }\n\n        if let Ok(boolean) = value.parse::<bool>() {\n            return Value::Boolean(boolean);\n        }\n\n        // Remove quotes from the start and end of the string.\n        // This ensures that we won't have to deal with quotes.\n        
let match_quotes = |c: char| c == '\\\"' || c == '\\'';\n        let value = value\n            .trim_start_matches(match_quotes)\n            .trim_end_matches(match_quotes)\n            .to_string();\n\n        Value::Text(value)\n    }\n}\n\nimpl From<Value> for protos::Value {\n    fn from(value: Value) -> Self {\n        type ProtoValue = protos::value::Value;\n        let value = match value {\n            Value::Text(text) => ProtoValue::Text(text),\n            Value::Number(number) => ProtoValue::Number(number),\n            Value::Boolean(boolean) => ProtoValue::Boolean(boolean),\n        };\n\n        protos::Value { value: Some(value) }\n    }\n}\n\nimpl TryFrom<protos::Value> for Value {\n    type Error = Status;\n    fn try_from(value: protos::Value) -> Result<Self, Self::Error> {\n        type ProtoValue = protos::value::Value;\n        match value.value {\n            Some(ProtoValue::Text(text)) => Ok(Value::Text(text)),\n            Some(ProtoValue::Number(number)) => Ok(Value::Number(number)),\n            Some(ProtoValue::Boolean(boolean)) => Ok(Value::Boolean(boolean)),\n            None => Err(Status::invalid_argument(\"Metadata value is required\")),\n        }\n    }\n}\n\n/// OasysDB vector record.\n///\n/// This is the main data structure for OasysDB. It contains the vector data\n/// and metadata of the record. 
Metadata is a key-value store that can be used\n/// to store additional information about the vector.\n#[derive(Debug, Clone, Serialize, Deserialize)]\npub struct Record {\n    pub vector: Vector,\n    pub metadata: HashMap<String, Value>,\n}\n\nimpl From<Record> for protos::Record {\n    fn from(value: Record) -> Self {\n        let vector = value.vector.into();\n        let metadata = value\n            .metadata\n            .into_iter()\n            .map(|(key, value)| (key, value.into()))\n            .collect();\n\n        protos::Record { vector: Some(vector), metadata }\n    }\n}\n\nimpl TryFrom<protos::Record> for Record {\n    type Error = Status;\n    fn try_from(value: protos::Record) -> Result<Self, Self::Error> {\n        let vector = match value.vector {\n            Some(vector) => Vector::try_from(vector)?,\n            None => {\n                let message = \"Vector data should not be empty\";\n                return Err(Status::invalid_argument(message));\n            }\n        };\n\n        let metadata = value\n            .metadata\n            .into_iter()\n            .map(|(k, v)| Ok((k, v.try_into()?)))\n            .collect::<Result<HashMap<String, Value>, Self::Error>>()?;\n\n        Ok(Record { vector, metadata })\n    }\n}\n\n#[cfg(test)]\nmod tests {\n    use super::*;\n    use rand::random;\n\n    impl Value {\n        pub fn random() -> Self {\n            Value::Number(random::<f64>())\n        }\n    }\n\n    impl Record {\n        pub fn random(dimension: usize) -> Self {\n            let mut metadata = HashMap::new();\n            metadata.insert(\"key\".to_string(), Value::random());\n            Record { vector: Vector::random(dimension), metadata }\n        }\n    }\n}\n"
  },
  {
    "path": "src/types/vector.rs",
    "content": "use super::*;\n\n/// Vector data structure.\n///\n/// We use a boxed slice to store the vector data for a slight memory\n/// efficiency boost. The length of the vector is not checked, so a length\n/// validation should be performed before most operations.\n#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, PartialOrd)]\npub struct Vector(Box<[f32]>);\n\nimpl Vector {\n    /// Return the vector as a slice of floating-point numbers.\n    pub fn as_slice(&self) -> &[f32] {\n        self.0.as_ref()\n    }\n\n    /// Return as a vector of floating-point numbers.\n    pub fn to_vec(&self) -> Vec<f32> {\n        self.0.to_vec()\n    }\n\n    /// Return the length of the vector.\n    pub fn len(&self) -> usize {\n        self.0.len()\n    }\n}\n\n// Vector conversion implementations.\n\nimpl From<Vec<f32>> for Vector {\n    fn from(value: Vec<f32>) -> Self {\n        Vector(value.into_boxed_slice())\n    }\n}\n\nimpl From<Vector> for protos::Vector {\n    fn from(value: Vector) -> Self {\n        protos::Vector { data: value.to_vec() }\n    }\n}\n\nimpl TryFrom<protos::Vector> for Vector {\n    type Error = Status;\n    fn try_from(value: protos::Vector) -> Result<Self, Self::Error> {\n        Ok(Vector(value.data.into_boxed_slice()))\n    }\n}\n\n#[cfg(test)]\nmod tests {\n    use super::*;\n\n    #[test]\n    fn test_random_vector() {\n        let dim = 128;\n        let vector = Vector::random(dim);\n        assert_eq!(vector.len(), dim);\n    }\n\n    impl Vector {\n        pub fn random(dimension: usize) -> Self {\n            let vector = vec![0.0; dimension]\n                .iter()\n                .map(|_| rand::random::<f32>())\n                .collect::<Vec<f32>>();\n\n            Vector(vector.into_boxed_slice())\n        }\n    }\n}\n"
  },
  {
    "path": "src/utils/kmeans.rs",
    "content": "use super::*;\nuse rand::seq::SliceRandom;\nuse rand::Rng;\nuse std::cmp::min;\nuse std::rc::Rc;\n\ntype ClusterIndex = usize;\n\n/// A list of vectors.\n///\n/// We use a reference-counted slice to store the vectors. This allows us to\n/// share the vectors around without having to actually clone the vectors.\ntype Vectors<'v> = Rc<[&'v Vector]>;\n\n/// K-means clustering algorithm.\n///\n/// The K-means algorithm is a clustering algorithm that partitions a dataset\n/// into K clusters by iteratively assigning data points to the nearest cluster\n/// centroids and recalculating these centroids until they are stable.\n#[derive(Debug)]\npub struct KMeans {\n    assignments: Vec<ClusterIndex>,\n    centroids: Vec<Vector>,\n\n    // Algorithm parameters.\n    metric: Metric,\n    n_clusters: usize,\n    max_iter: usize,\n}\n\nimpl KMeans {\n    /// Initialize the K-means algorithm with default parameters.\n    ///\n    /// Default parameters:\n    /// - metric: Euclidean\n    /// - max_iter: 100\n    pub fn new(n_clusters: usize) -> Self {\n        Self {\n            n_clusters,\n            metric: Metric::Euclidean,\n            max_iter: 100,\n            assignments: Vec::new(),\n            centroids: Vec::with_capacity(n_clusters),\n        }\n    }\n\n    /// Configure the metric used for distance calculations.\n    pub fn with_metric(mut self, metric: Metric) -> Self {\n        self.metric = metric;\n        self\n    }\n\n    /// Configure the maximum number of iterations to run the algorithm.\n    #[allow(dead_code)]\n    pub fn with_max_iter(mut self, max_iter: usize) -> Self {\n        self.max_iter = max_iter;\n        self\n    }\n\n    /// Train the K-means algorithm with the given vectors.\n    pub fn fit(&mut self, vectors: Vectors) -> Result<(), Box<dyn Error>> {\n        if self.n_clusters > vectors.len() {\n            let message = \"Dataset is smaller than cluster configuration.\";\n            return Err(message.into());\n        
}\n\n        self.centroids = self.initialize_centroids(vectors.clone());\n        self.assignments = vec![0; vectors.len()];\n\n        let mut no_improvement_count = 0;\n        for _ in 0..self.max_iter {\n            if no_improvement_count > 3 {\n                break;\n            }\n\n            let assignments = self.assign_clusters(vectors.clone());\n\n            // Check at most 1000 assignments for convergence.\n            // This prevents checking the entire dataset for large datasets.\n            let end = min(1000, assignments.len());\n            match assignments[0..end] == self.assignments[0..end] {\n                true => no_improvement_count += 1,\n                false => no_improvement_count = 0,\n            }\n\n            self.assignments = assignments;\n            self.centroids = self.update_centroids(vectors.clone());\n        }\n\n        Ok(())\n    }\n\n    fn initialize_centroids(&self, vectors: Vectors) -> Vec<Vector> {\n        let mut rng = rand::thread_rng();\n        let mut centroids = Vec::with_capacity(self.n_clusters);\n\n        // Pick the first centroid randomly.\n        let first_centroid = vectors.choose(&mut rng).cloned().unwrap();\n        centroids.push(first_centroid.to_owned());\n\n        for _ in 1..self.n_clusters {\n            let nearest_centroid_distance = |vector: &&Vector| {\n                centroids\n                    .iter()\n                    .map(|centroid| self.metric.distance(vector, centroid))\n                    .min_by(|a, b| a.partial_cmp(b).unwrap())\n                    .unwrap()\n                    .unwrap()\n            };\n\n            let distances = vectors\n                .par_iter()\n                .map(nearest_centroid_distance)\n                .collect::<Vec<f64>>();\n\n            // Choose the next centroid with probability proportional\n            // to the squared distance.\n            let threshold = rng.gen::<f64>() * distances.iter().sum::<f64>();\n           
 let mut cumulative_sum = 0.0;\n\n            for (i, distance) in distances.iter().enumerate() {\n                cumulative_sum += distance;\n                if cumulative_sum >= threshold {\n                    centroids.push(vectors[i].clone());\n                    break;\n                }\n            }\n        }\n\n        centroids\n    }\n\n    fn update_centroids(&self, vectors: Vectors) -> Vec<Vector> {\n        let dimension = vectors[0].len();\n        let mut centroids = vec![vec![0.0; dimension]; self.n_clusters];\n        let mut cluster_count = vec![0; self.n_clusters];\n\n        // Sum up vectors assigned to the cluster into the centroid.\n        for (i, cluster_id) in self.assignments.iter().enumerate() {\n            let cluster_id = *cluster_id;\n            cluster_count[cluster_id] += 1;\n            centroids[cluster_id] = centroids[cluster_id]\n                .iter()\n                .zip(vectors[i].as_slice().iter())\n                .map(|(a, b)| a + b)\n                .collect();\n        }\n\n        // Divide the sum by the number of vectors in the cluster.\n        for i in 0..self.n_clusters {\n            // If the cluster is empty, reinitialize the centroid.\n            if cluster_count[i] == 0 {\n                let mut rng = rand::thread_rng();\n                centroids[i] = vectors.choose(&mut rng).unwrap().to_vec();\n                continue;\n            }\n\n            centroids[i] = centroids[i]\n                .iter()\n                .map(|x| x / cluster_count[i] as f32)\n                .collect();\n        }\n\n        centroids.into_par_iter().map(|centroid| centroid.into()).collect()\n    }\n\n    /// Create cluster assignments for the vectors.\n    fn assign_clusters(&self, vectors: Vectors) -> Vec<ClusterIndex> {\n        vectors\n            .par_iter()\n            .map(|vector| self.find_nearest_centroid(vector))\n            .collect()\n    }\n\n    /// Find the index of the nearest centroid from a 
vector.\n    pub fn find_nearest_centroid(&self, vector: &Vector) -> ClusterIndex {\n        self.centroids\n            .par_iter()\n            .enumerate()\n            .map(|(i, centroid)| (i, self.metric.distance(vector, centroid)))\n            .min_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())\n            .map(|(id, _)| id)\n            .unwrap()\n    }\n\n    /// Returns the index-mapped cluster assignment for each data point.\n    ///\n    /// The index corresponds to the data point index and the value corresponds\n    /// to the cluster index. For example, given the following assignments:\n    ///\n    /// ```text\n    /// [0, 1, 0, 1, 2]\n    /// ```\n    ///\n    /// This means:\n    /// - Points 0 and 2 are assigned to cluster 0.\n    /// - Points 1 and 3 are assigned to cluster 1.\n    /// - Point 4 is assigned to cluster 2.\n    pub fn assignments(&self) -> &[ClusterIndex] {\n        &self.assignments\n    }\n\n    /// Returns the centroids of each cluster.\n    pub fn centroids(&self) -> &[Vector] {\n        &self.centroids\n    }\n}\n\n#[cfg(test)]\nmod tests {\n    use super::*;\n\n    #[test]\n    fn test_kmeans_fit_1_to_1() {\n        evaluate_kmeans(1, generate_vectors(1));\n    }\n\n    #[test]\n    fn test_kmeans_fit_10_to_5() {\n        evaluate_kmeans(5, generate_vectors(10));\n    }\n\n    #[test]\n    fn test_kmeans_fit_100_to_10() {\n        evaluate_kmeans(10, generate_vectors(100));\n    }\n\n    fn evaluate_kmeans(n_cluster: usize, vectors: Vec<Vector>) {\n        let vectors: Vectors = {\n            let vectors_ref: Vec<&Vector> = vectors.iter().collect();\n            Rc::from(vectors_ref.as_slice())\n        };\n\n        let mut kmeans = KMeans::new(n_cluster);\n        kmeans.fit(vectors.clone()).unwrap();\n        assert_eq!(kmeans.centroids().len(), n_cluster);\n\n        let mut correct_count = 0;\n        for (i, cluster_id) in kmeans.assignments().iter().enumerate() {\n            let vector = vectors[i];\n            let nearest_centroid = kmeans.find_nearest_centroid(vector);\n            if cluster_id == &nearest_centroid {\n                correct_count += 1;\n            }\n        }\n\n        let accuracy = correct_count as f32 / vectors.len() as f32;\n        assert!(accuracy > 0.99);\n    }\n\n    fn generate_vectors(n: usize) -> Vec<Vector> {\n        (0..n).map(|i| Vector::from(vec![i as f32; 3])).collect()\n    }\n}\n"
  },
  {
    "path": "src/utils/mod.rs",
    "content": "pub mod kmeans;\n\n// Import common dependencies below.\nuse crate::types::{Metric, Vector};\nuse rayon::prelude::*;\nuse std::error::Error;\n"
  }
]