Repository: nanoporetech/pod5-file-format Branch: master Commit: 6356e5c97b36 Files: 297 Total size: 2.3 MB Directory structure: gitextract_n9lck0re/ ├── .clang-format ├── .codespellrc ├── .flake8 ├── .gitattributes ├── .github/ │ └── ISSUE_TEMPLATE/ │ └── bug_report.md ├── .gitignore ├── .gitlab-ci.yml ├── .gitmodules ├── .pre-commit-config.yaml ├── .readthedocs.yaml ├── CHANGELOG.md ├── CMakeLists.txt ├── CMakePresets.json ├── DEV.md ├── LICENSE.md ├── README.md ├── benchmarks/ │ ├── .gitignore │ ├── README.md │ ├── build.sh │ ├── convert/ │ │ ├── run_blow5.sh │ │ └── run_pod5.sh │ ├── find_all_read_ids/ │ │ ├── run_blow5.sh │ │ ├── run_fast5.sh │ │ └── run_pod5.sh │ ├── find_all_samples/ │ │ ├── run_blow5.sh │ │ ├── run_fast5.sh │ │ └── run_pod5.sh │ ├── find_selected_read_ids_read_number/ │ │ ├── run_blow5.sh │ │ ├── run_fast5.sh │ │ └── run_pod5.sh │ ├── find_selected_read_ids_sample_count/ │ │ ├── run_blow5.sh │ │ ├── run_fast5.sh │ │ └── run_pod5.sh │ ├── find_selected_read_ids_samples/ │ │ ├── run_blow5.sh │ │ ├── run_fast5.sh │ │ └── run_pod5.sh │ ├── image/ │ │ ├── Dockerfile.base │ │ ├── install_slow5.sh │ │ └── requirements-benchmarks.txt │ ├── run_benchmarks.py │ ├── run_benchmarks_in_docker.sh │ └── tools/ │ ├── check_csvs_consistent.py │ ├── fast5_to_single_blow5.sh │ ├── find_and_get_fast5.py │ ├── find_and_get_pod5.py │ ├── pyslow5_tests.py │ ├── run_benchmarks_docker_entry.sh │ └── select-random-ids.py ├── c++/ │ ├── CMakeLists.txt │ ├── examples/ │ │ ├── CMakeLists.txt │ │ ├── README.md │ │ ├── find_all_read_data.cpp │ │ ├── find_all_read_ids.cpp │ │ ├── find_specific_read_ids.cpp │ │ └── find_specific_read_ids_with_signal.cpp │ ├── pod5_format/ │ │ ├── async_signal_loader.cpp │ │ ├── async_signal_loader.h │ │ ├── c_api.cpp │ │ ├── c_api.h │ │ ├── dictionary_writer.h │ │ ├── expandable_buffer.h │ │ ├── file_output_stream.h │ │ ├── file_reader.cpp │ │ ├── file_reader.h │ │ ├── file_recovery.h │ │ ├── file_updater.cpp │ │ ├── file_updater.h │ │ ├── 
file_writer.cpp │ │ ├── file_writer.h │ │ ├── flatbuffers/ │ │ │ └── footer.fbs │ │ ├── internal/ │ │ │ ├── async_output_stream.h │ │ │ ├── combined_file_utils.h │ │ │ ├── linux_output_stream.h │ │ │ └── tracing/ │ │ │ └── tracing.h │ │ ├── io_manager.cpp │ │ ├── io_manager.h │ │ ├── memory_pool.cpp │ │ ├── memory_pool.h │ │ ├── migration/ │ │ │ ├── migration.cpp │ │ │ ├── migration.h │ │ │ ├── migration_utils.h │ │ │ ├── v0_to_v1.cpp │ │ │ ├── v1_to_v2.cpp │ │ │ ├── v2_to_v3.cpp │ │ │ └── v3_to_v4.cpp │ │ ├── read_table_reader.cpp │ │ ├── read_table_reader.h │ │ ├── read_table_schema.cpp │ │ ├── read_table_schema.h │ │ ├── read_table_utils.cpp │ │ ├── read_table_utils.h │ │ ├── read_table_writer.cpp │ │ ├── read_table_writer.h │ │ ├── read_table_writer_utils.cpp │ │ ├── read_table_writer_utils.h │ │ ├── result.h │ │ ├── run_info_table_reader.cpp │ │ ├── run_info_table_reader.h │ │ ├── run_info_table_schema.cpp │ │ ├── run_info_table_schema.h │ │ ├── run_info_table_writer.cpp │ │ ├── run_info_table_writer.h │ │ ├── schema_field_builder.h │ │ ├── schema_metadata.cpp │ │ ├── schema_metadata.h │ │ ├── schema_utils.cpp │ │ ├── schema_utils.h │ │ ├── signal_builder.h │ │ ├── signal_compression.cpp │ │ ├── signal_compression.h │ │ ├── signal_table_reader.cpp │ │ ├── signal_table_reader.h │ │ ├── signal_table_schema.cpp │ │ ├── signal_table_schema.h │ │ ├── signal_table_utils.h │ │ ├── signal_table_writer.cpp │ │ ├── signal_table_writer.h │ │ ├── svb16/ │ │ │ ├── common.hpp │ │ │ ├── decode.hpp │ │ │ ├── decode_scalar.hpp │ │ │ ├── decode_x64.hpp │ │ │ ├── encode.hpp │ │ │ ├── encode_scalar.hpp │ │ │ ├── encode_x64.hpp │ │ │ ├── generate_shuffle_tables.py │ │ │ ├── intrinsics.hpp │ │ │ ├── shuffle_tables.hpp │ │ │ ├── simd_detect_x64.hpp │ │ │ ├── streamvbytedelta_decode_16.c │ │ │ ├── streamvbytedelta_encode_16.c │ │ │ ├── streamvbytedelta_x64_decode_16.c │ │ │ ├── streamvbytedelta_x64_encode_16.c │ │ │ ├── svb16.c │ │ │ └── svb16.h │ │ ├── table_reader.cpp │ │ ├── 
table_reader.h │ │ ├── thread_pool.cpp │ │ ├── thread_pool.h │ │ ├── tuple_utils.h │ │ ├── types.cpp │ │ ├── types.h │ │ ├── uuid.h │ │ └── version.h.in │ ├── pod5_format_pybind/ │ │ ├── CMakeLists.txt │ │ ├── _version.py.in │ │ ├── api.h │ │ ├── bindings.cpp │ │ ├── build_wheel.cmake │ │ ├── repack/ │ │ │ ├── repack_functions.h │ │ │ ├── repack_output.cpp │ │ │ ├── repack_output.h │ │ │ ├── repack_states.h │ │ │ ├── repack_utils.h │ │ │ ├── repacker.cpp │ │ │ └── repacker.h │ │ ├── subset.cpp │ │ ├── subset.h │ │ └── utils.h │ └── test/ │ ├── CMakeLists.txt │ ├── TemporaryDirectory.h │ ├── c_api_build_test.c │ ├── c_api_null_input.cpp │ ├── c_api_test_utils.h │ ├── c_api_tests.cpp │ ├── file_reader_writer_tests.cpp │ ├── main.cpp │ ├── output_stream_tests.cpp │ ├── read_table_tests.cpp │ ├── read_table_writer_utils_tests.cpp │ ├── run_info_table_tests.cpp │ ├── schema_tests.cpp │ ├── signal_compression_tests.cpp │ ├── signal_table_tests.cpp │ ├── svb16_scalar_tests.cpp │ ├── svb16_x64_tests.cpp │ ├── test_utils.h │ ├── thread_pool_tests.cpp │ ├── utils.h │ └── uuid_tests.cpp ├── ci/ │ ├── docker/ │ │ ├── Dockerfile.conda │ │ ├── Dockerfile.py39.arm64 │ │ └── Dockerfile.py39.x64 │ ├── generate_coverage_report.sh │ ├── get_tag_version.cmake │ ├── gitlab-ci-common.yml │ ├── install.sh │ ├── package.sh │ └── unpack_libs_for_python.sh ├── cmake/ │ ├── BuildFlatBuffers.cmake │ ├── Findzstd.cmake │ ├── conan_provider.cmake │ ├── pod5_fuzz.cmake │ ├── pod5_packaging.cmake │ └── presets/ │ ├── conan-build-options.json │ ├── conan-profiles.json │ └── conan-provider.json ├── conanfile.py ├── docs/ │ ├── DESIGN.md │ ├── README.md │ ├── SPECIFICATION.md │ └── tables/ │ ├── reads.toml │ ├── run_info.toml │ └── signal.toml ├── fuzz/ │ ├── .gitattributes │ ├── CMakeLists.txt │ ├── fuzz_compress.cpp │ ├── fuzz_file.cpp │ └── runner.cpp ├── pod5_make_version.py ├── pyproject.toml ├── pytest.ini ├── python/ │ ├── .gitignore │ ├── lib_pod5/ │ │ ├── Makefile │ │ ├── README.md │ │ ├── 
pyproject.toml │ │ ├── setup.py │ │ └── src/ │ │ ├── lib_pod5/ │ │ │ ├── __init__.py │ │ │ ├── pod5_format_pybind.pyi │ │ │ └── py.typed │ │ └── test/ │ │ └── test_lib_pod5.py │ └── pod5/ │ ├── Makefile │ ├── README.md │ ├── examples/ │ │ ├── find_all_reads.py │ │ └── find_specific_reads.py │ ├── pyproject.toml │ ├── setup.py │ ├── src/ │ │ ├── pod5/ │ │ │ ├── __init__.py │ │ │ ├── api_utils.py │ │ │ ├── dataset.py │ │ │ ├── pod5_types.py │ │ │ ├── reader.py │ │ │ ├── repack.py │ │ │ ├── signal_tools.py │ │ │ ├── tools/ │ │ │ │ ├── __init__.py │ │ │ │ ├── main.py │ │ │ │ ├── parsers.py │ │ │ │ ├── pod5_convert_from_fast5.py │ │ │ │ ├── pod5_convert_to_fast5.py │ │ │ │ ├── pod5_filter.py │ │ │ │ ├── pod5_inspect.py │ │ │ │ ├── pod5_merge.py │ │ │ │ ├── pod5_recover.py │ │ │ │ ├── pod5_repack.py │ │ │ │ ├── pod5_subset.py │ │ │ │ ├── pod5_update.py │ │ │ │ ├── pod5_view.py │ │ │ │ ├── polars_utils.py │ │ │ │ └── utils.py │ │ │ └── writer.py │ │ └── tests/ │ │ ├── __init__.py │ │ ├── conftest.py │ │ ├── test_api.py │ │ ├── test_convert_from_fast5.py │ │ ├── test_convert_to_fast5.py │ │ ├── test_dataset.py │ │ ├── test_filter.py │ │ ├── test_inspect.py │ │ ├── test_merge.py │ │ ├── test_reader.py │ │ ├── test_recover.py │ │ ├── test_repack.py │ │ ├── test_signal_tools.py │ │ ├── test_subset.py │ │ ├── test_tools.py │ │ ├── test_update.py │ │ ├── test_view.py │ │ └── test_writer.py │ └── test_utils/ │ └── check_pod5_files_equal.py ├── test_data/ │ ├── multi_fast5_zip.fast5 │ ├── multi_fast5_zip_v0.pod5 │ ├── multi_fast5_zip_v1.pod5 │ ├── multi_fast5_zip_v2.pod5 │ ├── multi_fast5_zip_v3.pod5 │ ├── multi_fast5_zip_v4.pod5 │ ├── single_read_fast5/ │ │ └── fe85b517-62ee-4a33-8767-41cab5d5ab39.fast5.single-read │ ├── split_1_v4.pod5 │ ├── split_2_v4.pod5 │ └── subset_mapping_examples/ │ ├── read_ids.txt │ ├── subset.csv │ └── subset.summary ├── test_package/ │ ├── CMakeLists.txt │ ├── conanfile.py │ ├── test_cpp_api.cpp │ └── test_package.cpp └── third_party/ ├── 
build_instructions.txt ├── gsl-disable-gsl-suppress.patch ├── include/ │ ├── .editorconfig │ ├── catch2/ │ │ └── catch.hpp │ ├── gsl/ │ │ ├── gsl │ │ ├── gsl-lite-vc6.hpp │ │ ├── gsl-lite.h │ │ └── gsl-lite.hpp │ └── gsl.h ├── jsoncons-0.166-icc-fix.patch ├── licenses/ │ ├── catch2.txt │ └── gsl-lite.txt └── software_versions.yaml ================================================ FILE CONTENTS ================================================ ================================================ FILE: .clang-format ================================================ --- # See https://releases.llvm.org/14.0.0/tools/clang/docs/ClangFormatStyleOptions.html BasedOnStyle: Chromium AccessModifierOffset: -4 AlignAfterOpenBracket: AlwaysBreak # AlignArrayOfStructures can cause crashes, see https://github.com/llvm/llvm-project/issues/55269 #AlignArrayOfStructures: Left AllowAllParametersOfDeclarationOnNextLine: false AllowShortBlocksOnASingleLine: Empty AllowShortFunctionsOnASingleLine: All BinPackArguments: false BinPackParameters: false BreakBeforeBinaryOperators: NonAssignment BreakBeforeBraces: Custom BraceWrapping: # NB: due to https://github.com/llvm/llvm-project/issues/55582 the Multiline setting will not # always work (should be fixed in clang-format 15, but that is not available as a python wheel yet # due to https://github.com/ssciwr/clang-format-wheel/issues/49) AfterControlStatement: MultiLine # makes sure multiline ifs don't run into their bodies AfterFunction: true # makes constructors with initialisers much nicer BreakBeforeConceptDeclarations: true BreakBeforeTernaryOperators: true BreakConstructorInitializers: BeforeComma BreakStringLiterals: true ColumnLimit: 100 CompactNamespaces: true ConstructorInitializerIndentWidth: 0 ContinuationIndentWidth: 4 Cpp11BracedListStyle: true DerivePointerAlignment: false # force use of the PointerAlignment setting FixNamespaceComments: true IncludeBlocks : Regroup IncludeCategories: # Aim is: # 0. 
the "main" header file (#include "foo.h" in foo.cpp) automatically gets priority 0 # 1. internal headers (#include "util/helpers.h"): quotation marks, with a '/' # 2. third-party headers (#include ): angle brackets, '/' or .h/.hpp/.h++ # file ext # 3. standard library headers (#include ): angle brackets, no file ext, no '/' - Regex: '^"' Priority: 1 - Regex: '^<.*/' Priority: 2 - Regex: '\.h>' Priority: 2 - Regex: '\.hpp>' Priority: 2 - Regex: '\.h\+\+>' Priority: 2 IncludeIsMainRegex: '(_test|_tests|Tests|Test)?$' # foo.h will be considered the "main" header (and sorted to the top) for all of the following: # - foo.cpp # - foo_test.cpp # - foo_tests.cpp # - fooTests.cpp (although this is intended for Foo.h and FooTests.cpp) # - fooTest.cpp (although this is intended for Foo.h and FooTest.cpp) IndentCaseLabels: false IndentWidth: 4 InsertBraces: true PackConstructorInitializers: CurrentLine PointerAlignment: Middle QualifierAlignment: Right # const east # clang 14 *should* know about QualifierOrder (according to its docs) but claims it doesn't #QualifierOrder: ['static', 'constexpr', 'inline', 'type', 'const', 'volatile', 'restrict'] ReflowComments: false SeparateDefinitionBlocks: Always SortIncludes: CaseInsensitive SortUsingDeclarations: true SpaceAroundPointerQualifiers: Before Standard: c++20 ================================================ FILE: .codespellrc ================================================ # Waiting for pyproject.toml support: https://github.com/codespell-project/codespell/issues/2055 [codespell] # "write-changes" doesn't work with "ignore-regex" # https://github.com/codespell-project/codespell/issues/2056 # comma-separated list of built-in dictionaries (default is "clear,rare") builtin = clear,rare,code # show the line in which the error occurred context = 0 # these options are turned on by specifying them here check-filenames = check-hidden = enable-colors = # split words on underscores # e.g. 
"foo_bar" is split into two words ("foo", "bar") instead of one word ("foo_bar") ignore-regex = _ # comma-separated list of false positives ignore-words-list = iff,inout,befores,deque,stdio,O_WRONLY,wronly,sv_lite,lite,creat,arange # comma-separated list of globs of files not to check skip = .gitignore,.codespellrc ================================================ FILE: .flake8 ================================================ # Waiting for pyproject.toml support: https://gitlab.com/pycqa/flake8/-/issues/428 [flake8] extend-ignore = E203, W503 max-line-length = 120 per-file-ignores = __init__.py:F401, __init__.py:F403 ================================================ FILE: .gitattributes ================================================ # Based on https://github.com/alexkaratarakis/gitattributes # Auto detect text files and force linux-style line endings * text=auto eol=lf # Documents *.bibtex text diff=bibtex *.doc diff=astextplain *.DOC diff=astextplain *.docx diff=astextplain *.DOCX diff=astextplain *.dot diff=astextplain *.DOT diff=astextplain *.pdf diff=astextplain *.PDF diff=astextplain *.rtf diff=astextplain *.RTF diff=astextplain *.md text diff=markdown *.mdx text diff=markdown *.tex text diff=tex *.adoc text *.textile text *.mustache text *.csv text *.tab text *.tsv text *.txt text *.sql text *.epub diff=astextplain # Graphics *.png binary *.jpg binary *.jpeg binary *.gif binary *.tif binary *.tiff binary *.ico binary *.svg binary *.eps binary *.bash text *.fish text *.sh text *.zsh text # These are explicitly windows files and should use crlf *.bat text eol=crlf *.cmd text eol=crlf *.ps1 text eol=crlf # Serialisation *.json text *.toml text *.xml text *.yaml text *.yml text # Archives *.7z binary *.gz binary *.tar binary *.tgz binary *.zip binary # Text files where line endings should be preserved *.diff -text *.patch -text # Exclude git(lab)-specific files when making an archive of the source tree .gitattributes export-ignore .gitignore export-ignore 
.gitkeep export-ignore .git-blame-ignore-revs export-ignore .gitlab-ci.yml export-ignore /ci export-ignore # C++ Sources *.c text diff=cpp *.cc text diff=cpp *.cxx text diff=cpp *.cpp text diff=cpp *.c++ text diff=cpp *.hpp text diff=cpp *.h text diff=cpp *.h++ text diff=cpp *.hh text diff=cpp # Read formats *.pod5 filter=lfs diff=lfs merge=lfs -text *.fast5 binary ================================================ FILE: .github/ISSUE_TEMPLATE/bug_report.md ================================================ --- name: Bug report about: Create a report to help us improve title: '' labels: '' assignees: '' --- ## Issue Description > Please provide a description of your issue and include any commands used to reproduce the issue. ## Logs > Please provide any log files. These can be generated by setting the `POD5_DEBUG` environment variable e.g. `POD5_DEBUG=1 pod5 view my.pod5` ## Specifications - Pod5 Version: - Python Version: - Platform: ================================================ FILE: .gitignore ================================================ build*/ .conan/ cmake-build*/ CMakeUserPresets.json _build/ .conan/ .cache/ dist/ .DS_Store .pod5 venv/ *.venv/ uv.lock docs/public/ .tmp_pod5* _version.py *egg-info/ POD5Version.cmake *.swp test_package/CMakeUserPresets.json .vscode/ .devcontainer/ __pycache__ python/Python.framework/ /fuzz/corpus_* ================================================ FILE: .gitlab-ci.yml ================================================ stages: - .pre - build - test - build-conan - archive - deploy include: - local: '/ci/gitlab-ci-common.yml' variables: GIT_SUBMODULE_STRATEGY: recursive STABLE_BRANCH_NAME: master DO_UPLOAD: "yes" # Always upload in conan upload jobs (only run on tags) CONAN_PROFILE_BUILD_TYPE: Release CONAN_VENV_PYTHON: "3.13" CMAKE_VERSION: "4.2.3" before_script: - "" # The versions that we build and test. 
.parallel-py-versions: parallel: matrix: - PYTHON_VERSION: ["3.10", "3.11", "3.12", "3.13", "3.14"] # ====================================== # # Docker # # ====================================== .build-docker-image: stage: .pre image: ${CI_REGISTRY}/traque/ont-docker-base/ont-base-docker:latest before_script: - docker login --username ${CI_REGISTRY_USER} --password ${CI_REGISTRY_PASSWORD} ${CI_REGISTRY} when: manual retry: max: 2 when: runner_system_failure script: - tag="${CI_REGISTRY_IMAGE}/${IMAGE_TAG}" - docker image build --pull --target "${PLATFORM}" --tag "${tag}" --file ${DOCKERFILE} ci/docker - docker image push ${tag} docker base aarch64: tags: - docker-builder-arm extends: - .build-docker-image variables: IMAGE_TAG: "build-arm64" DOCKERFILE: "ci/docker/Dockerfile.py39.arm64" docker base x86-64: tags: - docker-builder extends: - .build-docker-image variables: IMAGE_TAG: "build-x64" DOCKERFILE: "ci/docker/Dockerfile.py39.x64" docker conda: tags: - docker-builder extends: - .build-docker-image variables: IMAGE_TAG: "conda" DOCKERFILE: "ci/docker/Dockerfile.conda" .docker template: stage: docker image: ${CI_REGISTRY}/traque/ont-docker-base/ont-base-docker:latest before_script: - docker login --username ${CI_REGISTRY_USER} --password ${CI_REGISTRY_PASSWORD} ${CI_REGISTRY} retry: max: 2 when: runner_system_failure # ====================================== # # Versioning # # ====================================== prepare_version: stage: .pre image: ${CI_REGISTRY}/traque/ont-docker-base/ont-base-python:3.10 script: - git tag -d $(git tag -l "*a*") - git tag -d $(git tag -l "*b*") - git tag -d $(git tag -l "*r*") - git tag -d $(git tag -l "*c*") - git tag -d $(git tag -l "*dev*") - if [[ ${CI_COMMIT_TAG/#v/} && -z $( git tag -l "${CI_COMMIT_TAG/#v/}" ) ]]; then git tag ${CI_COMMIT_TAG/#v/}; fi - pip install --upgrade pip setuptools_scm~=7.1 - apt update && apt install -y git-lfs - git status --porcelain - python -m setuptools_scm - cat _version.py # Show the 
version that will be used in the pod5/pyproject.toml - VERSION=$(grep "__version__" _version.py | awk '{print $5}' | tr -d "'" | cut -d'+' -f1) - echo $VERSION - python -m pod5_make_version - cat cmake/POD5Version.cmake - cat _version.py python/lib_pod5/src/lib_pod5/_version.py - cat _version.py python/pod5/src/pod5/_version.py artifacts: name: "${CI_JOB_NAME}-artifacts" paths: - "cmake/POD5Version.cmake" - "_version.py" - "python/lib_pod5/src/lib_pod5/_version.py" - "python/pod5/src/pod5/_version.py" # ====================================== # # Pre-Flight Setup / Checks # # ====================================== tag_version_check: stage: .pre needs: - "prepare_version" only: - tags image: ${CI_REGISTRY}/minknow/images/build-x86_64-gcc13:latest script: - uv venv .venv - source .venv/bin/activate - uv pip install "cmake==${CMAKE_VERSION}" - pod5_version="$(cmake -P ci/get_tag_version.cmake 2>&1)" - tag_version="${CI_COMMIT_TAG/#v/}" - if [[ "${pod5_version}" != "${tag_version}" ]]; then echo "Tag is for release ${tag_version}, but POD5 version is $pod5_version"; exit 1; fi api_lib_version_check: stage: .pre needs: - "prepare_version" image: ${CI_REGISTRY}/minknow/images/build-x86_64-gcc13:latest script: - cat _version.py - NO_DEV_VERSION=$(grep "__version__" _version.py | awk '{print $5}' | tr -d "'" | cut -d'+' -f1 | sed 's/\([0-9]\+\.[0-9]\+\.[0-9]\+\).*$/\1/') - echo $NO_DEV_VERSION - cat python/pod5/pyproject.toml - echo "If this job fails then we have forgotten to match the api and lib versions in python/pod5/pyproject.toml" - grep "lib_pod5\s*==\s*$NO_DEV_VERSION" python/pod5/pyproject.toml pre-commit checks: image: ${CI_REGISTRY}/traque/ont-docker-base/ont-base-python:3.10 stage: .pre tags: - linux_x86 - docker script: - pip install pre-commit - if ! 
pre-commit run --all-files; then - cat "${PRE_COMMIT_HOME}/pre-commit.log" - >- if grep -F -q \ -e "InvalidManifestError" \ -e "error: [Errno 17] File exists: 'build/temp.linux-x86_64-cpython-" \ "${PRE_COMMIT_HOME}/pre-commit.log"; then - echo "Bad cache state detected, deleting cache and re-running" - rm -rf "${PRE_COMMIT_HOME}/" - pre-commit run --all-files - else - exit 1 - fi - fi after_script: - cat "${PRE_COMMIT_HOME}/pre-commit.log" || true variables: PRE_COMMIT_HOME: ${CI_PROJECT_DIR}/.cache/pre-commit cache: paths: - ${PRE_COMMIT_HOME} # ====================================== # # Build Lib Standalone # # ====================================== build-standalone-ubu22: stage: build image: external-docker.artifactory.oxfordnanolabs.local/ubuntu:22.04 needs: - "prepare_version" script: - export DEBIAN_FRONTEND=noninteractive - apt-get update - apt-get install -y -V ca-certificates lsb-release wget - wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb - apt-get install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb - apt-get update - apt-get install -y cmake build-essential libzstd-dev libzstd-dev libflatbuffers-dev libarrow-dev=18.0.0-1 - mkdir -p build - cd build - cmake -D POD5_DISABLE_TESTS=OFF -D POD5_BUILD_EXAMPLES=ON -D BUILD_PYTHON_WHEEL=OFF .. - cmake --build . --parallel - ctest -C Release -VV # ====================================== # # Build helpers # # ====================================== # Takes CMAKE_ARGS, AUDITWHEEL_PLATFORM, and PYTHON_VERSION. .conan-build-and-test: - | - export TOOLCHAIN_FILE=build/generators/conan_toolchain.cmake - pod5_version="$(cmake -P ci/get_tag_version.cmake 2>&1)" - mkdir -p build - cd build - ${conan_exe} install --profile ${CONAN_PROFILE} ${EXTRA_INSTALL_ARGS} .. 
- cmake ${CMAKE_ARGS} -D BUILD_SHARED_LIB=ON -D CMAKE_BUILD_TYPE=Release -D POD5_DISABLE_TESTS=OFF -D POD5_BUILD_EXAMPLES=ON -D BUILD_PYTHON_WHEEL=OFF -D CMAKE_TOOLCHAIN_FILE=${TOOLCHAIN_FILE} .. - cmake --build . --config Release --parallel - ctest -C Release -VV - ../ci/install.sh - cmake ${CMAKE_ARGS} -D BUILD_SHARED_LIB=OFF -D CMAKE_BUILD_TYPE=Release -D POD5_DISABLE_TESTS=OFF -D POD5_BUILD_EXAMPLES=ON -D BUILD_PYTHON_WHEEL=ON -D PYTHON_VERSION=${PYTHON_VERSION} -D CMAKE_TOOLCHAIN_FILE=${TOOLCHAIN_FILE} .. - cmake --build . --config Release --parallel - ctest -C Release -VV - ../ci/install.sh STATIC_BUILD - ../ci/package.sh ${OUTPUT_SKU} ${AUDITWHEEL_PLATFORM} # ====================================== # # Build Lib Linux # # ====================================== .build-linux: stage: build needs: - "prepare_version" variables: EXTRA_INSTALL_ARGS: "-o arrow:with_boost=False -o arrow:with_thrift=False -o arrow:parquet=False" before_script: - /opt/python/cp310-cp310/bin/pip install -U pip 'conan<2' auditwheel build "cmake==${CMAKE_VERSION}" - ln -n /opt/python/cp310-cp310/bin/auditwheel /usr/bin/auditwheel - ln -n /opt/python/cp310-cp310/bin/conan /usr/bin/conan - conan config install --verify-ssl=no ${CONAN_CONFIG_URL} - conan_exe=$(which conan) script: - !reference [".conan-build-and-test"] artifacts: name: "${CI_JOB_NAME}-artifacts" paths: - "lib_pod5*.tar.gz" - "lib_pod5*.whl" linux-x64-gcc9-release-build: image: external-quay.artifactory.oxfordnanolabs.local/pypa/manylinux2014_x86_64 extends: - .build-linux - .parallel-py-versions tags: - linux variables: CONAN_PROFILE: "linux-x86_64-gcc9.jinja" CONAN_PROFILE_CPPSTD: 17 OUTPUT_SKU: "linux-x64" AUDITWHEEL_PLATFORM: manylinux2014_x86_64 linux-aarch64-gcc9-release-build: image: external-quay.artifactory.oxfordnanolabs.local/pypa/manylinux2014_aarch64 extends: - .build-linux - .parallel-py-versions tags: - linux_aarch64 - high-cpu variables: CONAN_PROFILE: "linux-aarch64-gcc9.jinja" CONAN_PROFILE_CPPSTD: 17 
OUTPUT_SKU: "linux-arm64" AUDITWHEEL_PLATFORM: manylinux2014_aarch64 # ====================================== # # Build Lib OSX # # ====================================== .build-osx-common: stage: build needs: - "prepare_version" variables: EXTRA_INSTALL_ARGS: "-o arrow:with_boost=False -o arrow:with_thrift=False -o arrow:parquet=False" before_script: - uv venv .venv_conan --python ${CONAN_VENV_PYTHON} --seed - source .venv_conan/bin/activate # Note that cmake 3.31+ does not work properly on macOS 14 (and earlier) # Pinning to 3.30 avoids SSL issues when connecting to internal servers - uv pip install -U pip 'conan<2' 'cmake==3.30.9' - conan config install --verify-ssl=no "${CONAN_CONFIG_URL}" - conan_exe=$(which conan) - uv python install ${PYTHON_VERSION} - uv venv --python "python${PYTHON_VERSION}" .venv --seed - source .venv/bin/activate - which python - python --version script: - python3 -c "import sysconfig; print(sysconfig.get_platform())" - !reference [".conan-build-and-test"] artifacts: name: "${CI_JOB_NAME}-artifacts" paths: - "lib_pod5*.tar.gz" - "lib_pod5*.whl" osx-arm64-clang15-release-build: extends: - .build-osx-common - .parallel-py-versions tags: - osx_arm64 - xcode-15.3 - conan variables: CONAN_PROFILE: "macos-aarch64-appleclang-15.0.jinja" CONAN_PROFILE_CPPSTD: 20 CMAKE_ARGS: "-DCMAKE_OSX_ARCHITECTURES=arm64" MACOSX_DEPLOYMENT_TARGET: "14.0" OUTPUT_SKU: "osx-14.0-arm64" FORCE_PYTHON_PLATFORM: macosx_14_0_arm64 # ====================================== # # Build Lib Windows # # ====================================== .build-win-common: stage: build needs: - "prepare_version" retry: 1 variables: # We need to override arrow's boost 1.85.0 requirement to match the version we use internally. 
EXTRA_INSTALL_ARGS: "-o arrow:with_thrift=False -o arrow:parquet=False --require=boost/1.86.0@ -o boost:without_locale=True" before_script: - uv venv .venv_conan --python ${CONAN_VENV_PYTHON} --seed - source .venv_conan/Scripts/activate - uv pip install 'conan<2' "cmake==${CMAKE_VERSION}" - conan config install --verify-ssl=no "${CONAN_CONFIG_URL}" - conan_exe=$(which conan) - uv python install ${PYTHON_VERSION} - uv venv --python "python${PYTHON_VERSION}" .venv --seed - source .venv/Scripts/activate script: - uv pip install build - !reference [".conan-build-and-test"] after_script: # HACK: for some reason, pod5_unit_tests.exe is sticking around; deleting it works, but it # doesn't go away immediately (as though something had it open with FILE_SHARE_DELETE, although # the Handle utility from SysInternals couldn't find anything). # This also appears to be happening for the fuzz targets, so remove and wait for every exe. - rm -v build/Release/bin/*.exe - date - while true; do - ls build/Release/bin/*.exe || break - sleep 1 - done - date win-x64-msvc2019-release-build: extends: - .build-win-common - .parallel-py-versions tags: - windows - VS2019 - conan variables: CONAN_PROFILE: "windows-x86_64-vs2019.jinja" CONAN_PROFILE_CPPSTD: 17 OUTPUT_SKU: "win-x64" CMAKE_ARGS: "-A x64" CMAKE_GENERATOR: "Visual Studio 16 2019" artifacts: name: "${CI_JOB_NAME}-artifacts" paths: - "lib_pod5*.tar.gz" - "lib_pod5*.whl" # ====================================== # # Build Python API # # ====================================== build-python-api: stage: build needs: - "prepare_version" image: ${CI_REGISTRY}/traque/ont-docker-base/ont-base-python:3.10 tags: - linux script: - git tag -d $(git tag -l "*a*") - git tag -d $(git tag -l "*b*") - git tag -d $(git tag -l "*r*") - git tag -d $(git tag -l "*c*") - git tag -d $(git tag -l "*dev*") - if [[ ${CI_COMMIT_TAG/#v/} && -z $( git tag -l "${CI_COMMIT_TAG/#v/}" ) ]]; then git tag ${CI_COMMIT_TAG/#v/}; fi - cat _version.py - VERSION=$(grep 
"__version__" _version.py | awk '{print $5}' | tr -d "'" | cut -d'+' -f1) - echo $VERSION - cd python/pod5/ # update the lib_pod5 dependency in pod5/pyproject.toml to match - sed -i "s/.*lib_pod5.*/\ \ \ \ \'lib_pod5 == ${VERSION}\',/" pyproject.toml - cat pyproject.toml - pip install -U pip build - python -m build --outdir ../../ - cd ../.. - ls *.whl *.tar.gz artifacts: name: "${CI_JOB_NAME}-artifacts" paths: - "pod5*.whl" - "pod5*.tar.gz" # ====================================== # # Test Tools # # ====================================== tools-linux-x64: extends: - .parallel-py-versions stage: test image: ${CI_REGISTRY}/traque/ont-docker-base/ont-base-python:${PYTHON_VERSION} tags: - linux before_script: - python${PYTHON_VERSION} -m venv .venv/ - source .venv/bin/activate needs: - linux-x64-gcc9-release-build - build-python-api script: - pip install ./lib_pod5*cp${PYTHON_VERSION/./}*.whl pod5-*.whl - pod5 convert fast5 ./test_data/ --output ./output_files --one-to-one ./test_data - python${PYTHON_VERSION} python/pod5/test_utils/check_pod5_files_equal.py ./output_files/multi_fast5_zip.pod5 ./test_data/multi_fast5_zip_v4.pod5 - python${PYTHON_VERSION} python/pod5/test_utils/check_pod5_files_equal.py ./output_files/multi_fast5_zip.pod5 ./test_data/multi_fast5_zip_v3.pod5 - python${PYTHON_VERSION} python/pod5/test_utils/check_pod5_files_equal.py ./output_files/multi_fast5_zip.pod5 ./test_data/multi_fast5_zip_v2.pod5 - python${PYTHON_VERSION} python/pod5/test_utils/check_pod5_files_equal.py ./output_files/multi_fast5_zip.pod5 ./test_data/multi_fast5_zip_v1.pod5 - python${PYTHON_VERSION} python/pod5/test_utils/check_pod5_files_equal.py ./output_files/multi_fast5_zip.pod5 ./test_data/multi_fast5_zip_v0.pod5 - pod5 convert to_fast5 ./output_files/ --output ./output_files - pod5 convert fast5 ./output_files/*.fast5 --output ./output_files_2 --one-to-one ./output_files/ - python${PYTHON_VERSION} python/pod5/test_utils/check_pod5_files_equal.py 
./output_files/multi_fast5_zip.pod5 ./output_files_2/*.pod5 # ====================================== # # Pytest # # ====================================== .pytest: stage: test before_script: - python${PYTHON_VERSION} -m venv .venv/ - source .venv/*/activate - python --version - python -m pip install --upgrade pip script: - pip install ./lib_pod5*cp${PYTHON_VERSION/./}*.whl pod5-*.whl - pip install pytest pytest-cov pytest-mock psutil - pytest - POD5_DISABLE_MMAP_OPEN=1 pytest .pytest-with-uv: extends: - .pytest before_script: - uv python install ${PYTHON_VERSION} - uv venv --python "python${PYTHON_VERSION}" .venv --seed - source .venv/*/activate pytest-linux-x64: extends: - .pytest - .parallel-py-versions image: ${CI_REGISTRY}/traque/ont-docker-base/ont-base-python:${PYTHON_VERSION} tags: - linux needs: - linux-x64-gcc9-release-build - build-python-api pytest-linux-aarch64: extends: - .pytest - .parallel-py-versions image: ${CI_REGISTRY}/traque/ont-docker-base/ont-base-python:${PYTHON_VERSION} tags: - linux_aarch64 - high-cpu needs: - linux-aarch64-gcc9-release-build - build-python-api pytest-osx-arm64: extends: - .pytest-with-uv - .parallel-py-versions tags: - osx_arm64 needs: - osx-arm64-clang15-release-build - build-python-api pytest-win-x64: retry: 1 extends: - .pytest-with-uv - .parallel-py-versions tags: - windows needs: - win-x64-msvc2019-release-build - build-python-api # ====================================== # # Conda Testing # # ====================================== conda_pytest: extends: - .pytest - .parallel-py-versions image: ${CI_REGISTRY}/minknow/pod5-file-format/conda:latest tags: - linux needs: - linux-x64-gcc9-release-build - build-python-api before_script: - | cat > environment.yml << EOF name: pod5_conda_test channels: - conda-forge - bioconda dependencies: - python=${PYTHON_VERSION} - cmake - pyarrow - pip EOF - cat environment.yml - mamba --version - mamba env create -f environment.yml - conda env list # This is a work around for conda init 
in gitlab - eval "$(conda shell.bash hook)" - conda activate pod5_conda_test # ====================================== # # Benchmarks # # ====================================== .benchmark: stage: test before_script: - python3 -m venv .venv/ - source .venv/bin/activate script: - pip install ./${LIB_WHEEL_GLOB} pod5-*.whl setuptools - pip install -r ./benchmarks/image/requirements-benchmarks.txt - ./benchmarks/image/install_slow5.sh - export PATH="$(pwd)/slow5tools-v1.3.0/:$PATH" - ./benchmarks/run_benchmarks.py ./test_data/ ./benchmark-outputs benchmark-linux-x64: extends: [".benchmark"] image: ${CI_REGISTRY}/traque/ont-docker-base/ont-base-python:3.14 tags: - linux needs: - linux-x64-gcc9-release-build - build-python-api variables: LIB_WHEEL_GLOB: "lib_pod5*cp314*.whl" benchmark-linux-aarch64: extends: [".benchmark"] image: ${CI_REGISTRY}/traque/ont-docker-base/ont-base-python:3.14 tags: - linux_aarch64 - high-cpu needs: - linux-aarch64-gcc9-release-build - build-python-api variables: LIB_WHEEL_GLOB: "lib_pod5*cp314*.whl" # ====================================== # # Fuzz tests and coverage reports # # ====================================== .generic-linux-x64-gcc11-build: stage: build image: external-docker.artifactory.oxfordnanolabs.local/ubuntu:jammy tags: - linux variables: CONAN_PROFILE: "linux-x86_64-gcc11.jinja" CONAN_PROFILE_CPPSTD: 20 CMAKE_BUILD_TYPE: Release needs: - "prepare_version" script: # Install requirements. - apt-get update - apt-get install -y pip - pip install -U pip 'conan<2' auditwheel build "cmake==${CMAKE_VERSION}" - conan config install --verify-ssl=no ${CONAN_CONFIG_URL} # Setup build. - pod5_version="$(cmake -P ci/get_tag_version.cmake 2>&1)" - mkdir -p build - pushd build # Tell conan that it's OK to use libstdc++ settings. 
- conan install --profile ${CONAN_PROFILE} -s compiler.libcxx=libstdc++11 -s compiler.cppstd=${CONAN_PROFILE_CPPSTD} -s build_type=${CMAKE_BUILD_TYPE} -o arrow:with_boost=False -o arrow:with_thrift=False -o arrow:parquet=False .. - cmake -D CMAKE_TOOLCHAIN_FILE=build/generators/conan_toolchain.cmake -D CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} -D BUILD_PYTHON_WHEEL=OFF ${CMAKE_EXTRA_ARGS} .. # Do the build - cmake --build . --config ${CMAKE_BUILD_TYPE} --parallel - popd linux-x64-gcc11-fuzz: extends: .generic-linux-x64-gcc11-build allow_failure: true variables: CC: clang CXX: clang++ CMAKE_EXTRA_ARGS: "-D ENABLE_FUZZERS=ON -D FUZZER_RUN_TIME=600" script: # We need clang for libFuzzer. - apt-get update - apt-get install -y clang # Do the build. - !reference [".generic-linux-x64-gcc11-build", "script"] # Remove the zipped corpora now that we've extracted it, since we # don't need it artifacted. - rm fuzz/*.zip # Run the tests. - ctest -C Release --test-dir build -VV -R ${FUZZER_TEST} parallel: matrix: - FUZZER_TEST: - "compress" - "file" artifacts: # Artifact everything in /fuzz so that we can get to any new/failing corpora. when: always paths: - ./fuzz linux-x64-gcc11-coverage: extends: .generic-linux-x64-gcc11-build variables: CMAKE_BUILD_TYPE: "Debug" CMAKE_EXTRA_ARGS: "-D POD5_DISABLE_TESTS=OFF -D ENABLE_COVERAGE_REPORT=ON" script: # We need a venv. - apt-get update - apt-get install -y python3-venv # Do the build. - !reference [".generic-linux-x64-gcc11-build", "script"] # Run the coverage report. - ./ci/generate_coverage_report.sh build coverage: '/^TOTAL\s+\d+\s+\d+\s+(\d+(?:\.\d+)?\%)$/' artifacts: reports: coverage_report: coverage_format: cobertura path: coverage-report-*.xml paths: # Artifact the human readable ones too. - coverage-report-*.html # ====================================== # # Conan # # ====================================== .setup-venv: - KERNEL=$(uname -s) - if [[ ! 
${KERNEL} =~ "Linux" ]]; then # Must use an explicit version here otherwise we get the windows store one. # Can be any version since it's only for installing conan. - python3.13 -m venv .venv - source .venv/*/activate - fi .reset-line-endings: # This is needed to enforce LF line-endings in the pybind submodule # otherwise conan generates different revisions for windows and unix - re='^(MINGW|CYGWIN|MSYS).*' - if [[ $(uname -s) =~ $re ]]; then - git rm -rf :/ - git checkout HEAD -- :/ - fi .conan-setup-common: - !reference [".reset-line-endings"] - !reference [".setup-venv"] - pip install 'conan<2' - conan --version - VERSIONS="$(cmake -P ci/get_tag_version.cmake 2>&1)" .conan-build-common: stage: build-conan dependencies: - "prepare_version" before_script: - !reference [".conan-setup-common"] - conan remove "*" -f - conan config install --verify-ssl=no "${CONAN_CONFIG_URL}" .conan2-common: before_script: - !reference [".reset-line-endings"] - !reference [".setup-venv"] - pip install --upgrade conan - conan --version - conan remove "*" --confirm - conan config install --verify-ssl=no "${CONAN2_CONFIG_URL}" .conan2-build: extends: .conan2-common stage: build-conan dependencies: - "prepare_version" script: - version=$(cmake -P ci/get_tag_version.cmake 2>&1 | cut -d. -f1-3) # set up the correct ref - opts=("--version=${version}" --user=nanopore --channel=stable) # fail if we can't find dependencies - opts+=("--build=pod5_file_format/*") # select the build profile - opts+=(-pr:a "${PROFILE_BASE}") # use the arrow packages we have built - opts+=('-o:a=arrow/*:with_thrift=False' '-o:a=arrow/*:parquet=False' '-o:a=arrow/*:with_zstd=True' '-o:a=arrow/*:with_boost=False') - echo "Running conan create . ${opts[@]}" - conan create . 
"${opts[@]}" - conan cache save "*/*:*" --file=conan-${CI_JOB_ID}.tgz variables: # use an arrow package that doesn't use Boost, even on Windows CONAN_MANUAL_OVERRIDES: "arrow/*:arrow/18.0.0@nanopore/noboost" CONAN_PROFILE_CPPSTD: "20" artifacts: paths: - 'conan-*.tgz' parallel: matrix: - CONAN_PROFILE_BUILD_TYPE: ["Debug", "Release"] .conan2-upload: extends: .conan2-common stage: deploy #only: ["tags"] script: - for f in conan-*.tgz; do conan cache restore "$f"; done - conan remote auth ONT-Conan-V2 --force - conan upload "*:*" --check --confirm --remote=ONT-Conan-V2 --dry-run .conan-upload: extends: .upload-package # from informatics/conan-config stage: deploy only: ["tags"] before_script: - pip install "cmake==${CMAKE_VERSION}" - !reference [".conan-setup-common"] variables: EXPECTED_PACKAGE_COUNT: "4" # Expect shared and static packages # Conan: build and upload packages: build-conan:windows-x86_64-vs2019: extends: - .profile-windows-x86_64-vs2019 - .build-package-win - .conan-build-common - .build-shared-and-static needs: ["prepare_version", "win-x64-msvc2019-release-build"] upload-conan:windows-x86_64-vs2019: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:windows-x86_64-vs2019" ] build-conan2:windows-x86_64-vs2019: extends: - .conan2-build - .profile-windows-x86_64-vs2019-conan2 needs: ["prepare_version", "win-x64-msvc2019-release-build"] upload-conan2:windows-x86_64-vs2019: extends: - .conan2-upload - .profile-windows-x86_64-vs2019-conan2 dependencies: [ "prepare_version", "build-conan2:windows-x86_64-vs2019" ] build-conan:macos-aarch64-appleclang-15.0: extends: - .profile-macos-aarch64-appleclang-15.0 - .build-package - .conan-build-common - .build-shared-and-static needs: ["prepare_version", "osx-arm64-clang15-release-build"] upload-conan:macos-aarch64-appleclang-15.0: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:macos-aarch64-appleclang-15.0" ] build-conan:macos-aarch64-appleclang-16.0: extends: - 
.profile-macos-aarch64-appleclang-16.0 - .build-package - .conan-build-common - .build-shared-and-static needs: ["prepare_version"] upload-conan:macos-aarch64-appleclang-16.0: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:macos-aarch64-appleclang-16.0" ] build-conan2:macos-aarch64-appleclang-15.0: extends: - .profile-macos-aarch64-appleclang-15.0-conan2 - .conan2-build needs: ["prepare_version", "osx-arm64-clang15-release-build"] upload-conan2:macos-aarch64-appleclang-15.0: extends: - .profile-macos-aarch64-appleclang-15.0-conan2 - .conan2-upload dependencies: [ "prepare_version", "build-conan2:macos-aarch64-appleclang-15.0" ] build-conan2:macos-aarch64-appleclang-16.0: extends: - .profile-macos-aarch64-appleclang-16.0-conan2 - .conan2-build needs: ["prepare_version"] upload-conan2:macos-aarch64-appleclang-16.0: extends: - .profile-macos-aarch64-appleclang-16.0-conan2 - .conan2-upload dependencies: [ "prepare_version", "build-conan2:macos-aarch64-appleclang-16.0" ] build-conan:linux-x86_64-gcc11: extends: - .profile-linux-x86_64-gcc11 - .build-package - .conan-build-common - .build-shared-and-static needs: ["prepare_version"] upload-conan:linux-x86_64-gcc11: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:linux-x86_64-gcc11" ] build-conan2:linux-x86_64-gcc11: extends: - .profile-linux-x86_64-gcc11-conan2 - .conan2-build needs: ["prepare_version"] upload-conan2:linux-x86_64-gcc11: extends: - .profile-linux-x86_64-gcc11-conan2 - .conan2-upload dependencies: [ "prepare_version", "build-conan2:linux-x86_64-gcc11" ] build-conan2:linux-x86_64-gcc11-asan-static: extends: - .profile-linux-x86_64-gcc11-asan-static-conan2 - .conan2-build needs: ["prepare_version"] upload-conan2:linux-x86_64-gcc11-asan-static: extends: - .profile-linux-x86_64-gcc11-asan-static-conan2 - .conan2-upload dependencies: [ "prepare_version", "build-conan2:linux-x86_64-gcc11-asan-static" ] build-conan2:linux-x86_64-gcc11-usan-static: extends: - 
.profile-linux-x86_64-gcc11-usan-static-conan2 - .conan2-build needs: ["prepare_version"] upload-conan2:linux-x86_64-gcc11-usan-static: extends: - .profile-linux-x86_64-gcc11-usan-static-conan2 - .conan2-upload dependencies: [ "prepare_version", "build-conan2:linux-x86_64-gcc11-usan-static" ] build-conan2:linux-x86_64-gcc11-tsan-static: extends: - .profile-linux-x86_64-gcc11-tsan-static-conan2 - .conan2-build needs: ["prepare_version"] upload-conan2:linux-x86_64-gcc11-tsan-static: extends: - .profile-linux-x86_64-gcc11-tsan-static-conan2 - .conan2-upload dependencies: [ "prepare_version", "build-conan2:linux-x86_64-gcc11-tsan-static" ] build-conan:linux-x86_64-gcc13: extends: - .profile-linux-x86_64-gcc13 - .build-package - .conan-build-common - .build-shared-and-static needs: ["prepare_version"] upload-conan:linux-x86_64-gcc13: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:linux-x86_64-gcc13" ] build-conan2:linux-x86_64-gcc13: extends: - .profile-linux-x86_64-gcc13-conan2 - .conan2-build needs: ["prepare_version"] upload-conan2:linux-x86_64-gcc13: extends: - .profile-linux-x86_64-gcc13-conan2 - .conan2-upload dependencies: [ "prepare_version", "build-conan2:linux-x86_64-gcc13" ] build-conan:linux-x86_64-gcc11-asan-static: extends: - .profile-linux-x86_64-gcc11-asan-static - .build-package - .conan-build-common - .build-shared-and-static needs: ["prepare_version"] upload-conan:linux-x86_64-gcc11-asan-static: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:linux-x86_64-gcc11-asan-static" ] build-conan:linux-x86_64-gcc13-asan-static: extends: - .profile-linux-x86_64-gcc13-asan-static - .build-package - .conan-build-common - .build-shared-and-static needs: ["prepare_version"] upload-conan:linux-x86_64-gcc13-asan-static: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:linux-x86_64-gcc13-asan-static" ] build-conan:linux-x86_64-gcc11-tsan-static: extends: - .profile-linux-x86_64-gcc11-tsan-static - 
.build-package - .conan-build-common - .build-shared-and-static needs: ["prepare_version"] upload-conan:linux-x86_64-gcc11-tsan-static: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:linux-x86_64-gcc11-tsan-static" ] build-conan:linux-x86_64-gcc13-tsan-static: extends: - .profile-linux-x86_64-gcc13-tsan-static - .build-package - .conan-build-common - .build-shared-and-static needs: ["prepare_version"] upload-conan:linux-x86_64-gcc13-tsan-static: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:linux-x86_64-gcc13-tsan-static" ] build-conan:linux-x86_64-gcc11-usan-static: extends: - .profile-linux-x86_64-gcc11-usan-static - .build-package - .conan-build-common - .build-shared-and-static needs: ["prepare_version"] upload-conan:linux-x86_64-gcc11-usan-static: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:linux-x86_64-gcc11-usan-static" ] build-conan:linux-x86_64-gcc13-usan-static: extends: - .profile-linux-x86_64-gcc13-usan-static - .build-package - .conan-build-common - .build-shared-and-static needs: ["prepare_version"] upload-conan:linux-x86_64-gcc13-usan-static: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:linux-x86_64-gcc13-usan-static" ] build-conan:linux-aarch64-gcc11: extends: - .profile-linux-aarch64-gcc11 - .build-package - .conan-build-common - .build-shared-and-static needs: ["prepare_version"] upload-conan:linux-aarch64-gcc11: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:linux-aarch64-gcc11" ] build-conan2:linux-aarch64-gcc11: extends: - .profile-linux-aarch64-gcc11-conan2 - .conan2-build needs: ["prepare_version"] upload-conan2:linux-aarch64-gcc11: extends: - .profile-linux-aarch64-gcc11-conan2 - .conan2-upload dependencies: [ "prepare_version", "build-conan2:linux-aarch64-gcc11" ] build-conan:linux-aarch64-gcc13: extends: - .profile-linux-aarch64-gcc13 - .build-package - .conan-build-common - .build-shared-and-static needs: 
["prepare_version"] upload-conan:linux-aarch64-gcc13: extends: .conan-upload dependencies: [ "prepare_version", "build-conan:linux-aarch64-gcc13" ] build-conan2:linux-aarch64-gcc13: extends: - .profile-linux-aarch64-gcc13-conan2 - .conan2-build needs: ["prepare_version"] upload-conan2:linux-aarch64-gcc13: extends: - .profile-linux-aarch64-gcc13-conan2 - .conan2-upload dependencies: [ "prepare_version", "build-conan2:linux-aarch64-gcc13" ] # ====================================== # # Archive # # ====================================== build-archive: stage: archive needs: - linux-x64-gcc9-release-build - linux-aarch64-gcc9-release-build - osx-arm64-clang15-release-build - win-x64-msvc2019-release-build - build-python-api script: - find . artifacts: name: "${CI_JOB_NAME}-artifacts" paths: - ./*.tar.gz - ./*.whl # ====================================== # # Deploy # # ====================================== internal_wheel_upload: stage: deploy image: ${UPLOAD_PYTHON_IMAGE} needs: - build-archive script: - ls -lh . - pip install twine - twine upload *.whl pod5*.tar.gz only: ["tags"] when: manual external_wheel_upload: stage: deploy image: ${UPLOAD_PYTHON_IMAGE} needs: - build-archive script: - ls -lh . 
- pip install twine - unset TWINE_REPOSITORY_URL - unset TWINE_CERT - twine upload lib*.whl -u __token__ -p"${EXTERNAL_LIB_POD5_PYPI_KEY}" - twine upload pod5*.whl pod5*.tar.gz -u __token__ -p"${EXTERNAL_POD5_PYPI_KEY}" only: ["tags"] when: manual # ====================================== # # MLHub Testing # # ====================================== mlhub: stage: deploy image: ${MLHUB_TRIGGER_IMAGE} needs: ["build-archive"] variables: GIT_STRATEGY: none script: - | curl -i --header "Content-Type: application/json" \ --request POST \ --data '{ "key": "'${MLHUB_TRIGGER_KEY}'", "job_name": "POD5-CI '${CI_COMMIT_REF_NAME}' - '"$CI_COMMIT_TITLE"' ", "script_parameters": { "mode":"artifact", "source":"'${CI_COMMIT_SHA}'", "python_ver":"'${PYTHON_VERSION}'" } }' \ ${MLHUB_TRIGGER_URL} when: manual extends: - .parallel-py-versions ================================================ FILE: .gitmodules ================================================ [submodule "third_party/pybind11"] path = third_party/pybind11 url = https://github.com/pybind/pybind11.git branch = stable ================================================ FILE: .pre-commit-config.yaml ================================================ # See https://pre-commit.com for more information # See https://pre-commit.com/hooks.html for more hooks repos: - repo: https://github.com/pre-commit/pre-commit-hooks rev: v5.0.0 hooks: - id: trailing-whitespace - id: end-of-file-fixer - id: check-case-conflict - id: check-merge-conflict - id: check-added-large-files - repo: https://github.com/psf/black rev: 25.1.0 hooks: - id: black - repo: https://github.com/codespell-project/codespell rev: v2.4.1 hooks: - id: codespell exclude: 'third_party/' - repo: https://github.com/PyCQA/flake8 rev: 7.2.0 hooks: - id: flake8 exclude: docs/conf.py - repo: https://github.com/shellcheck-py/shellcheck-py rev: v0.10.0.1 hooks: - id: shellcheck - repo: https://github.com/pre-commit/mirrors-clang-format rev: 'v20.1.4' hooks: - id: clang-format exclude:
'third_party/' - repo: https://github.com/pre-commit/mirrors-mypy rev: 'v1.15.0' hooks: - id: mypy files: 'python/pod5/src/' args: [ --check-untyped-defs, --ignore-missing-imports ] additional_dependencies: - types-Deprecated - types-setuptools - types-pytz # NB: by default, pre-commit only installs the pre-commit hook ("commit" stage), # but you can tell `pre-commit install` to install other hooks. # This set of default stages ensures we don't slow down or break other git operations # even if you install hooks for them. default_stages: - pre-commit - pre-merge-commit - manual # vi:et:sw=2:sts=2: ================================================ FILE: .readthedocs.yaml ================================================ # Read the Docs configuration file # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details # Required version: 2 # Build all formats formats: all # Set the version of Python and other tools you might need build: os: ubuntu-20.04 tools: python: "3.10" jobs: pre_build: - python -c "import pod5; print(pod5.__version__)" # Build documentation in the docs/ directory with Sphinx sphinx: configuration: docs/conf.py # If using Sphinx, optionally build your docs in additional formats such as PDF # formats: # - pdf # Optionally declare the Python requirements required to build your docs python: install: - requirements: docs/requirements.txt ================================================ FILE: CHANGELOG.md ================================================ # Changelog All notable changes, updates, and fixes to pod5 will be documented here The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). ## [0.3.39] ### Fixed - Fix python bindings build without Conan. 
### Changed - CI Stabilisation ## [0.3.38] ### Changed - CI Stabilisation ## [0.3.37] ### Changed - Use standard file IO to read POD5 header and footer metadata before memory mapping (if not disabled e.g. `POD5_DISABLE_MMAP_OPEN=1`). This should avoid SIGBUS errors caused by memory mapping file stubs (archive artefacts). - Improve file_reader_writer unit-tests robustness - Scale number of open input file handles during pod5 subset / filter by the system limits and number of output files. ### Fixed - Fixed bug where invalid read ids could be passed into pod5 subset via summary table. - Fixed bug where invalid read ids in `DatasetReader.reads` selection could return valid read records. - Fixed bug in CI where the python venv was not activated resulting in incorrect conan version being used. ## [0.3.36] ### Added - Added missing licence files. ### Removed - Removed Python 3.9 and macOS 10.15 support since they're EOL. ## [0.3.35] ### Added - Python 3.14 support ### Removed - Removed sphinx-style python docstrings references - Removed most documentation as part of migration to - Deprecated `--duplicate-ok` argument from pod5 tools - duplicating reads is now always invalid. ### Changed - Moved filter + subset implementation into C++ for improved performance. - Performance improvements to `pod5 view` especially when reading read ids from large files. - Updated polars version from "~=1.20,<1.32" to "~= 1.30" - Switch to uv for managing CI python environments - Updated to pyarrow 22.0.0 ## [0.3.34] ### Added - `open_pore_level` to `pod5 inspect read` ### Changed - Fixed migration behaviour on nfs systems, where migrated tables could be left orphaned on disk. - Limited polars install version to "~=1.20,<1.32" following breaking changes - Tidied up how tmp files are named, used a larger set of numbers for naming. ## [0.3.33] ### Added - Added Conan 2 CI ### Changed - Reduced virtual memory usage when opening POD5 files by 75%.
- Python API now memory maps inner tables using the `mmap.mmap` `offset` and `length` arguments directly instead of taking a slice of the whole file. ## [0.3.32] ### Added - Option to allow users of C++ API to not keep file handles open if required. ### Changed - Order of `pod5 view` is backwards compatible with 0.3.30, new `open_pore_level` field is at the end of the list. ## [0.3.31] ### Added - Added new field `open_pore_level`, containing the level of the open pore as tracked by MinKNOW for this channel/well. ### Removed - Deprecated support for unused read scaling values "tracked_scaling_scale", "tracked_scaling_shift", "predicted_scaling_scale", "predicted_scaling_shift", "num_reads_since_mux_change" and "time_since_mux_change". These will be removed from stored data and writer API in 0.4.0, with accessing API remaining in place until 0.5.0. ## [0.3.30] ### Changed - Build with sanitization on GCC13 ### Removed - Dropped incorrect sanitized conan jobs. ## [0.3.29] ### Removed - Dropped support for macOS x86 ## [0.3.28] ### Changed - Additional testing for Linux file access. ## [0.3.27] ### Fixed - Fixed some crashes when parsing corrupt POD5 files. - Fixed missing error handling when the C API is called incorrectly. - Fixed and clarified C API thread safety. - Fall back to regular IO if direct IO is requested, but file opening fails. ### Removed - Dropped automated ARM+GCC8 builds. ### Changed - Bumped polars to next major version (`~= 1.20`). ## [0.3.26] ### Changed - The read end reason now includes paused - for reads that ended because acquisition was paused. ## [0.3.25] ### Changed - Python 3.8 wheels are no longer built for Windows or macOS (Python 3.8 is end-of-life). - Better error messages and testing of file recovery. ### Added - Conan pod5 builds with address, thread and undefined behaviour sanitizer support. - Added fuzz testing. - New option to clean up temporary files after file recovery.
## [0.3.24] ### Changed - Update to arrow 18 for the cpp library. ### Fixed - Flush `pod5 view` header to prevent issue on Windows systems where header would not be on top. ## [0.3.23] ### Changed - Removed use of python `build` when building wheel in cmake. ## [0.3.22] ### Added - `ArrowTableHandle` `stream` member to store the `BatchFileReader` backend - `ArrowTableHandle` `options` argument to pass in `IpcReadOptions` - `pod5::default_memory_pool` function which selects an appropriate memory pool even on large page systems. ### Changed - Refactored Multi-threading in `DatasetReader` to prevent too many open files errors - Updated dependency to `pyarrow~=18.0.0` for `python>=3.9` - Relaxed h5py python dependency ## [0.3.21] ### Added - Support for python 3.13. ### Changed - Removed use of Boost. This does not affect the C interface, but may require changes to consumers of the C++ headers. ## [0.3.20] ### Changed - Refactored directio writing engine to open up async io support. - Fixed Boost version compatibility checking in Conan packages. ## [0.3.19] ### Added - New end reason for reads terminated due to an analysis configuration change. ### Changed - Reduced allocations when compressing signal. ### Fixed - Crash when searching empty file for reads. ## [0.3.18] ### Added - Ability to disable flushing on batch complete - Use new LinuxOutputStream to cache allocations and reduce memory when writing many files. ## [0.3.17] ### Changed - Move svb headers to correct subdirectory in ## [0.3.16] ### Added - svb16 headers packaged with pod5 ### Changed - Directio output now writes on batch complete without flushing explicitly. ## [0.3.15] ### Added - Added new end reasons "api_request" and "device_data_error" to allow for new read end reasons future minknow versions will generate. - Allow directio to specify the chunk size directly. ## [0.3.14] ### Added - gcc8 builds ## [0.3.13] ### Fixed - Instability when creating a pod5 writer fails. 
- Issue with directio mode where space is over reserved. ## [0.3.12] ### Fixed - Fixed issues reading signal from uncompressed pod5 files. ## [0.3.11] ### Added - Typechecking on `Writer.add_reads` to inform users incorrectly passing `ReadRecords` - Compatibility with numpy 2.0. ### Fixed - `DatasetReader` correctly handles string paths ## [0.3.10] ### Added - Required pypa project metadata. ### Removed - Dropped support OSX builds for XCode < 14.2. ## [0.3.9] ### Fixed - `ReadRecord.to_read()` missing fields ## [0.3.8] ### Fixed - Conan windows upload jobs failure due to using different line endings. ## [0.3.7] ### Fixed - CI package uploading to PyPi following [API token migration](https://pypi.org/help/#apitoken). - Documentation for some functions. - Explicitly sized type in `pod5_vbz_decompress_signal()`. - CI execution of tests. ### Changed - Updated `pre-commit` to `clang-format-17`. - Updated Arrow to 12.0.0. ## [0.3.6] ### Fixed - Polars `ColumnNotFoundError: not_set` introduced by `polars==0.20.0` ## [0.3.5] ### Fixed - Arrow build flags in conanfile are now configured in the configure() fnc rather than being default options. ## [0.3.4] ### Added - boost_internal_build flag in conanfile. - CI now builds with the above flag turned on. ## [0.3.3] ### Added - CI for appleclang 14 - cppstd builds ## [0.3.2] ### Added - Support for Python 3.12 ## [0.3.1] 2023-11-10 ### Fixed - Logging no longer calls `basicConfig` which may unintentionally edit users logging configuration ## [0.3.0] 2023-11-07 ### Changed - Transfers dataframes used in subsetting / filter use categorical fields to reduce memory consumption - Polars version increased to `~=0.19` - Documentation regarding positional arguments - Renamed deprecated `polars.groupby` to `polars.group_by` ### Fixed - Fixed a bug in the build scripts that prevented iOS and Windows Conan packages from being uploaded. - Remove exposed artifactory URL env var from gitlab ci config. 
- `convert to_fast5` writes byte encoded read_ids to match MinKNOW (was `str`) ### Removed - Removed python3.7 support ## [0.2.9] 2023-11-02 ### Fixed - Corrected the visibility of dependencies when building pod5 as a shared library. ## [0.2.8] 2023-11-01 ### Added - Added compression status to `pod5 inspect summary` - Added environment override "POD5_DISABLE_MMAP_OPEN" to force non-mmapped opening of files. ### Fixed - Remove exposed artifactory URL env var from gitlab ci config. - `convert to_fast5` writes byte encoded read_ids to match MinKNOW (was `str`) ## [0.2.7] 2023-09-11 ### Added - `DatasetReader` class for reading collections of pod5 files - Return index errors when querying invalid errors from API's ### Changed - Recursive search for files now traverses symbolic links and ignores hidden files - Tweak block size of directio writes to 1MB. ## [0.2.6] 2023-09-04 ### Changed - Write pod5 files using DirectIO on Linux platforms (performance) ## [0.2.5] 2023-08-01 ### Added - Shared builds to conan ### Fixed - `num_minknow_events` field description from `int8` to `uint64` - `ReadRecord.num_minknow_events` return type-hint from `float` to `int` ## [0.2.4] 2023-07-13 ### Changed - Increased `numpy` minimum version to `>= 1.21.0` - Improved performance of `subset`, `filter` and `merge` tools. - `Repacker.wait` and `Repacker.waiter` parameters ### Deprecated - `Repacker.wait` and `Repacker.waiter` some parameters are deprecated and issue `FutureWarning` ### Fixed - `Repacker.is_complete` returning `True` when work is queued. ## [0.2.3] 2023-06-26 ### Added - Add API (pod5_open_file_options) to prevent pod5 from opening a file using mmap, instead using direct file IO.
- Default field values (empty string) when converting fast5 files with missing fields ### Changed - Corrected Oxford Nanopore Technologies company name in package metadata to use Public Limited Company (Plc) instead of Limited (Ltd) - Limited the number of processes created when specifying `--threads` to the number of cpu cores available `os.cpu_count()` - Reduced the default value for `--threads` from 8 to 4 to improve stability on resource constrained systems ## [0.2.2] 2023-06-06 ### Fixed - Add API error when adding reads with invalid end reason, pore type or run info. ## [0.2.1] 2023-05-25 ### Changed - Update internal arrow lib to not export flatbuffers symbols. ## [0.2.0] 2023-05-18 ### Added - `pod5 view` tool to view / inspect pod5 files as tables. Gives a >200x speed improvement compared to `pod5 inspect reads` - `pod5 recover` tool to recover data from corrupted / truncated pod5 files - `pod5 update` documentation - source distributions to pypi ### Changed - `pod5 subset` and `pod5 filter` uses `polars` to parse inputs - `pod5 subset` and `pod5 filter` csv formatting requirements tightened - `pod5` tools which use multiple pod5 file inputs now accept directories which can be searched recursively with `-r/--recursive` - `pod5 subset` `--read-id-column` argument abbreviation `-r` changed to `-R` to allow `-r/--recursive` to be consistent for all tools - `pod5` tools use hyphens in all arguments (e.g. `--force-overwrite` and `--read-id-column`) - `pod5 merge` and `pod5 update` uses named `-o/--output` argument instead of positional `output` argument to standardise tools - `pod5 update` progress bar and better detection of name conflicts - Minimised number of open file handles in tools to prevent `Too many open files` error - Logging added to `merge`, `filter` and `subset`.
Enabled with `POD5_DEBUG=1` ### Deprecated - `pod5 inspect reads` deprecated in-favour of `pod5 view` ### Fixed - Exception raised when calling `pod5` without any arguments - Exception raised in `pod5 convert fast5` where closed writers were reopened after being closed by a caught exception - Fixed Gitlab 38, pod5_get_end_reason and pod5_get_pore_type ignoring input string length checks. ### Removed - `pod5 subset` `--json` mapping arguments - `pod5 merge` `--chunk-size` argument - `ReadTableVersion` replaced with an integer value ## [0.1.21] 2023-04-27 ### Fixed - Repacker `reads_completed` value while copying a selection of reads. - Fixed crash when trying to load files with a bad footer. ## [0.1.20] 2023-04-20 ### Fixed - Fixed merging many files running out the size limit of dictionary indices. ## [0.1.19] 2023-04-14 ### Changed - `pod5 convert fast5` now creates logs when `POD5_DEBUG=1` set - `pod5 convert fast5` checks multi-read fast5s at conversion time ### Fixed - Fixed memory usage growth over time as signal was loaded with large pod5 files. - Fixed crash loading malicious files (found via fuzz testing) - Fixed leaks and UB when running unit tests. - Fixed run-away memory consumption during fast5 conversion ## [0.1.17] 2023-04-06 ### Changed - Updated internal arrow version to 8.0.0.3 ## [0.1.16] 2023-04-06 ### Fixed - Fixed issue where pod5 would read out of bounds memory when decompressing some reads. ## [0.1.15] 2023-03-31 ### Changed - Refactored `pod5 convert fast5` to use `concurrent.futures` only. - Add further info to error message when signal cannot be decompressed by zstd - Make merge operation not generate multiple identical run infos. ### Fixed - Fixed closing uninitialised file handles. - Fixed `pod5 inspect reads` repeating header - Fixed a crash with certain pod5 search operations. ## [0.1.13] 2023-03-23 ### Fixed - Fix loading large pod5 files on virtual-memory limited systems. 
## [0.1.12] 2023-03-20 ### Added - Added `--output` argument to `pod5 convert fast5` and `to_fast5` replacing positional argument of the same name - Added `--strict` argument to `pod5 convert fast5` to promptly stop on exceptions - Added readthedocs documentation links in README.md ### Changed - Updated developer installation instructions to use `conan<2` - Reworked `pod5 convert fast5` to tolerate runtime exceptions - Use same type `run_info_index_t` for `pod5_get_file_run_info_count` and `pod5_get_file_run_info`. ### Fixed - Fixed file handle leak in repacker ## [0.1.11] 2023-03-13 ### Added - Python API supports python 3.11 - Added missing python API wheels on windows ### Changed - Changed python API dependency version `pyarrow~=11.0.0` from `8.0.0` to support python 3.11 - Changed python API dependency version `hdf5~=8.0.0` from `v7.0.0` to support python 3.11 ## [0.1.10] 2023-03-09 ### Added - Added `pod5_get_read_count` to find the count of all reads in file - Added `pod5_get_read_ids` to retrieve all read id's in file - Added `pod5_get_file_run_info` to retrieve a run info at an absolute index in the file - Added `pod5_free_run_info` to free run info's (replaces `pod5_release_run_info`) - Added `pod5_get_file_run_info_count` to find the number of run info's in a file - Added `pod5 filter` tool to subset pod5 files with simple list of read ids - Added `tqdm` progress bar to `pod5 subset` (disable with `POD5_PBAR=0`) ### Changed - Reworked `pod5 subset` to give better control over resources used - `pod5 subset` can now parse csv and tsv tables / summaries - `pod5 repack` now repacks all inputs one-to-one ### Deprecated - Deprecated `pod5_release_run_info` (see `pod5_free_run_info`) ### Removed - Removed filepath header line from `pod5 inspect reads` ## [0.1.9] 2023-03-07 ### Added - Added version attributes to `lib-pod5` ### Changed - Versioning now controlled by VCS inspection using `setuptools_scm` ## [0.1.8] 2023-02-23 ### Added - Added more `read_id` 
getter methods to `Reader` - Added support for python 3.8 + 3.10 on windows - Added gcc7 linux build of pod5 ### Changed - Update to zlib 1.2.13 - Update to zstd 1.5.4 - Pinned `pre-commit=v2.21.0` while supporting `python3.7` - Reworked `pod5 convert to_fast5` output filenames to allow for `1-1` mapping ### Fixed - Fixed `pod5 inspect read` - Fixed `pod5 convert to_fast5` creating an empty fast5 output - Fixed `pod5 convert to_fast5` ignoring the `--force_overwrite` argument - Fixed issue where thread_pool.h wasn't shipped. ## [0.1.5] - 2023-01-20 ### Added - Explicitly re-exported `lib-pod5` public symbols and added `py.typed` marker file to support type-checking. ### Fixed - Fixed issue where closing many pod5 files in sequence is slow. - Fixed incorrect python types and adopted python type-checking. ## [0.1.4] - 2022-12-22 ### Added - Linux python 3.11 wheels - ReadTheDocs documentation support ### Fixed - OSX arm64 wheel naming corrections - works with wider set of python executables ## [0.1.3] - 2022-12-16 ### Added - Added `Reader.__iter__` method. ### Changed - Renamed `EndReason.name` to `EndReason.reason` to access the inner enum and added `EndReason.name` as a property to return the string representation of this enum value. - `BaseRead`, `Read`, `CompressedRead`, `Calibration` and `Pore` dataclasses are now mutable. ### Removed - Removed deprecated `Writer` functions. ### Fixed - Fixed osx arm64 wheel compatibility for older python versions. - Fixed EndReason type errors. - Fixed EndReason in pod5 to fast5 conversion. ## [0.1.2] - 2022-12-06 ### Changed - Optimised the file writing utilities ## [0.1.1] - 2022-12-06 ### Changed - Restricted exported boost dependencies of conan package to just the boost::headers component. ## [0.1] - 2022-12-02 ### Changed - Documentation edits - `Writer.add_reads` now handles both `Read` and `CompressedRead`. 
### Deprecated - Deprecated `Writer` methods `add_read_object` and `add_read_objects` in favour of `add_read` and `add_reads` respectively. ### Removed - Removed direct pod5 tool scripts. ### Fixed - Fixed name of internal utils - "pad_file". - Fixed spelling of various internal variables. - Fixed `pod5 convert to_fast5` ## [0.0.43] ### Changed - Reformat c++ code with more consistent format file. ## [0.0.42] ### Added - Added `pod5` tools entry-point - Added api to query file version information as written on disk. ### Changed - Fixed signal_chunk_size type error in convert-from-fast5 - Replaced `ont_fast5_api` dependency with `vbz_h5py_plugin` - Restructured Python packaging to include `lib_pod5_format` which contains the native bindings built with pybind11. - `pod5_format` and `pod5_format_tools` are now pure python packages which depend on `lib_pod5_format` - Python packages `pod5_format` and `pod5_format_tools` have been merged into single `pod5` pure-python package. - `pod5-convert-from-fast5` `--output-one-to-one` reworked so that output files maintain the input structure, making this argument more flexible and avoiding filename clobbering. - Added missing `lib_pod5.update_file` function to pyi. - `pod5-convert-from-fast5` `output` now takes existing directories and writes `output.pod5` (current behaviour) or creates a new file with the given name if it doesn't exist. - Renamed arguments in tools relating to multi-processing / multi-threading from `-p/--processes` to the more common `-t/--threads`. ## [0.0.41] - 2022-10-27 ### Changed - Fixed pod5-inspect erroring when loading data. - Fixed issue where some files in between 0.34 - 0.38 wouldn't load correctly. ## [0.0.40] - 2022-10-27 ### Changed - Fixed migrating of large files from older versions. ## [0.0.39] - 2022-10-18 ### Changed - Fixed building against the c++ api - previously missing include files.
## [0.0.38] - 2022-10-18 ### Changed - All data in the read table that was previously contained in dictionaries of structs is now stored in the read table, or a new "run info" table. This change simplifies data access into the pod5 files, and helps users who want to convert the pod5 data to pandas or other arrow-compatible reader formats. Old data is migrated on load and will continue to work; data can be permanently migrated using the tool `pod5-migrate`. ### Removed - Support for opening and writing "split" pod5 files. All APIs now expect and return combined pod5 files. ## [0.0.37] - 2022-10-18 ### Changed - Updated Conan recipe to support building without specifying C++ standard version. ## [0.0.36] - 2022-10-07 ### Changed - Bump the Boost and Arrow versions to pick up latest changes. ## [0.0.35] - 2022-10-07 ### Changed - Support C++17 + C++20 with the conan package pod5 generates. ## [0.0.34] - 2022-10-05 ### Changed - Modified `pod5_format_tools/pod5_convert_to_fast5.py` to separate `pod5_convert_to_fast5_argparser()` and `convert_from_fast5()` out from `pod5_convert_from_fast5.main()`. ## [0.0.33] - 2022-10-05 ### Added - Added `num_samples` field to read table, containing the total number of samples a read contains. The field is filled in by the API if it doesn't exist. ### Changed - File version is now V2, due to the addition of `num_samples`. ## [0.0.32] - 2022-10-03 ### Fixed - Fixed an issue where multi-threaded access to a single batch could cause a crash (discovered by dorado testing). - Fixed help text in convert to fast5 script.
================================================ FILE: CMakeLists.txt ================================================ cmake_minimum_required(VERSION 3.18.0) project(POD5) include(${PROJECT_SOURCE_DIR}/cmake/POD5Version.cmake) set(CMAKE_PROJECT_VERSION ${POD5_NUMERIC_VERSION}) set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${PROJECT_SOURCE_DIR}/cmake") # use compiler cache if available option(DISABLE_CCACHE "Do not try to use ccache to speed compilation" NO) if (NOT DISABLE_CCACHE) find_program(CCACHE_EXECUTABLE ccache HINTS "C:/Program\ Files/ccache/" ) if (CCACHE_EXECUTABLE) message(STATUS "Using ccache: ${CCACHE_EXECUTABLE}") set(CMAKE_CXX_COMPILER_LAUNCHER "${CCACHE_EXECUTABLE}") set(CMAKE_C_COMPILER_LAUNCHER "${CCACHE_EXECUTABLE}") endif() endif() if (NOT DEFINED ENABLE_CONAN) option(ENABLE_CONAN "Enable conan for dependency installation" OFF) endif() if (NOT DEFINED CONAN2) option(CONAN2 "Temp flag until we fully migrate to conan2" OFF) endif() option(BUILD_SHARED_LIB "Build a shared library" OFF) option(POD5_DISABLE_TESTS "Disable building all tests" ON) option(POD5_BUILD_EXAMPLES "Enable building all examples" OFF) option(ENABLE_ADDRESS_SANITIZER "Enable address sanitizer" OFF) if (NOT DEFINED ENABLE_POD5_PACKAGING) option(ENABLE_POD5_PACKAGING "Enable packaging support" ON) endif() option(BUILD_PYTHON_WHEEL "Build a python wheel for pod5" OFF) # debug symbols don't depend on the build type, only on this option option(DISABLE_DEBUG_SYMBOLS "Force debug symbols to be disabled" OFF) if (NOT DISABLE_DEBUG_SYMBOLS) if (MSVC) # Z7 embeds debugging info into .obj files, which is easier to manage for # build accelerators (note that a .pdb will still be generated for libs) # https://docs.microsoft.com/en-us/cpp/build/reference/z7-zi-zi-debug-information-format add_compile_options(/Z7) # this will use fastlink in the IDE and full link from the command line # https://docs.microsoft.com/en-us/cpp/build/reference/debug-generate-debug-info set(CMAKE_EXE_LINKER_FLAGS
"${CMAKE_EXE_LINKER_FLAGS} /DEBUG") set(CMAKE_MODULE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} /DEBUG") set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} /DEBUG") # /DEBUG option is not recognised for STATIC lib linking elseif (CMAKE_COMPILER_IS_GNUCXX OR CMAKE_CXX_COMPILER_ID MATCHES "Clang") add_compile_options(-g) endif() endif() option(ENABLE_FUZZERS "Build the fuzzers that can be used to catch new issues" OFF) if (ENABLE_FUZZERS) include(pod5_fuzz) endif() add_compile_definitions(POD5_ENABLE_FUZZERS=$<BOOL:${ENABLE_FUZZERS}>) option(ENABLE_COVERAGE_REPORT "Executables emit coverage reports" OFF) if (ENABLE_COVERAGE_REPORT) if (DISABLE_DEBUG_SYMBOLS) message(FATAL_ERROR "Debug symbols are required for coverage reports to work") elseif (NOT CMAKE_BUILD_TYPE STREQUAL "Debug") message(FATAL_ERROR "Only unoptimised builds give reliable coverage reports") elseif (CMAKE_CXX_COMPILER_ID MATCHES "(GNU|Clang)") add_compile_options(--coverage) add_link_options(--coverage) else() message(FATAL_ERROR "Cannot enable coverage on unknown compiler") endif() endif() # FIXME: DISABLE CONDITIONAL TO WORK ON BIONIC if (ENABLE_CONAN AND CMAKE_COMPILER_IS_GNUCXX AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL "9.0" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS "10.0") # We build POD5 on CentOS 7 in CI, where we have GCC 9 but only the pre-C++11 ABI # See https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html # This forces GCC 9 on other platforms (eg: Ubuntu Focal) to use the same ABI. # The main gain here is being able to use the same conan packages.
add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0) endif() if(ENABLE_ADDRESS_SANITIZER) add_compile_options("-fsanitize=address") add_link_options("-fsanitize=address") endif() include_directories("third_party/include") foreach (config "Release" "Debug") string(TOUPPER "${config}" config_upper) set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY_${config_upper} ${CMAKE_BINARY_DIR}/${config}/lib) set(CMAKE_LIBRARY_OUTPUT_DIRECTORY_${config_upper} ${CMAKE_BINARY_DIR}/${config}/lib) set(CMAKE_RUNTIME_OUTPUT_DIRECTORY_${config_upper} ${CMAKE_BINARY_DIR}/${config}/bin) endforeach() set(CMAKE_INSTALL_DEFAULT_COMPONENT_NAME "archive") include(GenerateExportHeader) enable_testing() if (BUILD_PYTHON_WHEEL) find_package(Python ${PYTHON_VERSION} EXACT COMPONENTS Interpreter Development) set(PYBIND11_FINDPYTHON ON) add_subdirectory(third_party/pybind11) install( FILES third_party/pybind11/LICENSE DESTINATION licenses RENAME pybind11.txt ) endif() add_subdirectory(c++) # The fuzz directory contains both the fuzzers and the regression runners, # the latter of which can be built as ordinary tests. if (ENABLE_FUZZERS OR NOT POD5_DISABLE_TESTS) add_subdirectory(fuzz) endif() # Install licenses. 
install( DIRECTORY ${CMAKE_BINARY_DIR}/pod5_conan_licenses/ DESTINATION licenses ) install( FILES LICENSE.md third_party/licenses/gsl-lite.txt DESTINATION licenses ) if (ENABLE_POD5_PACKAGING) include(pod5_packaging) endif() ================================================ FILE: CMakePresets.json ================================================ { "version": 4, "include": [ "cmake/presets/conan-provider.json", "cmake/presets/conan-build-options.json", "cmake/presets/conan-profiles.json" ], "configurePresets": [ { "name": "conan2-linux-gcc9-x86_64-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc9-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc9-x86_64-cppstd20-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc9-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc9-x86_64-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc9-x86_64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc9-x86_64-cppstd17-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc9-x86_64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc13-x86_64-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc13-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc13-x86_64-cppstd20-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc13-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc13-x86_64-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc13-x86_64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc13-x86_64-cppstd17-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc13-x86_64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc11-x86_64-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc11-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc11-x86_64-cppstd20-release", 
"hidden": false, "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc11-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc11-x86_64-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc11-x86_64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc11-x86_64-cppstd17-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc11-x86_64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc11-asan-static-x86_64-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc11-asan-static-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc11-asan-static-x86_64-cppstd20-release", "hidden": false, "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc11-asan-static-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc11-usan-static-x86_64-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc11-usan-static-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc11-usan-static-x86_64-cppstd20-release", "hidden": false, "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc11-usan-static-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc11-tsan-static-x86_64-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc11-tsan-static-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc11-tsan-static-x86_64-cppstd20-release", "hidden": false, "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc11-tsan-static-x86_64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc11-asan-static-x86_64-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc11-asan-static-x86_64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc11-asan-static-x86_64-cppstd17-release", "hidden": false, "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc11-asan-static-x86_64-profile", "conan2-cppstd17" ] }, { "name":
"conan2-linux-gcc11-usan-static-x86_64-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc11-usan-static-x86_64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc11-usan-static-x86_64-cppstd17-release", "hidden": false, "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc11-usan-static-x86_64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc11-tsan-static-x86_64-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc11-tsan-static-x86_64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc11-tsan-static-x86_64-cppstd17-release", "hidden": false, "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc11-tsan-static-x86_64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc13-aarch64-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc13-aarch64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc13-aarch64-cppstd20-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc13-aarch64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc13-aarch64-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc13-aarch64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc13-aarch64-cppstd17-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc13-aarch64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc11-aarch64-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc11-aarch64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc11-aarch64-cppstd20-release", "hidden": false, "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc11-aarch64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc11-aarch64-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc11-aarch64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc11-aarch64-cppstd17-release", "hidden": false, "inherits": [ 
"conan2-provider", "conan2-release", "conan2-gcc11-aarch64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc9-aarch64-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc9-aarch64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc9-aarch64-cppstd17-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc9-aarch64-profile", "conan2-cppstd17" ] }, { "name": "conan2-linux-gcc9-aarch64-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-gcc9-aarch64-profile", "conan2-cppstd20" ] }, { "name": "conan2-linux-gcc9-aarch64-cppstd20-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-gcc9-aarch64-profile", "conan2-cppstd20" ] }, { "name": "conan2-macos-appleclang-15.0-aarch64-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-appleclang-15.0-aarch64-profile", "conan2-cppstd17" ] }, { "name": "conan2-macos-appleclang-15.0-aarch64-cppstd17-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-appleclang-15.0-aarch64-profile", "conan2-cppstd17" ] }, { "name": "conan2-macos-appleclang-15.0-aarch64-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-appleclang-15.0-aarch64-profile", "conan2-cppstd20" ] }, { "name": "conan2-macos-appleclang-15.0-aarch64-cppstd20-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-appleclang-15.0-aarch64-profile", "conan2-cppstd20" ] }, { "name": "conan2-macos-appleclang-16.0-aarch64-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-appleclang-16.0-aarch64-profile", "conan2-cppstd17" ] }, { "name": "conan2-macos-appleclang-16.0-aarch64-cppstd17-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-appleclang-16.0-aarch64-profile", "conan2-cppstd17" ] }, { "name": "conan2-macos-appleclang-16.0-aarch64-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-appleclang-16.0-aarch64-profile", "conan2-cppstd20" ] }, { 
"name": "conan2-macos-appleclang-16.0-aarch64-cppstd20-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-appleclang-16.0-aarch64-profile", "conan2-cppstd20" ] }, { "name": "conan2-windows-x86_64-vs2019-cppstd17-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-windows-x86_64-vs2019-profile", "conan2-cppstd17" ] }, { "name": "conan2-windows-x86_64-vs2019-cppstd17-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-windows-x86_64-vs2019-profile", "conan2-cppstd17" ] }, { "name": "conan2-windows-x86_64-vs2019-cppstd20-debug", "inherits": [ "conan2-provider", "conan2-debug", "conan2-windows-x86_64-vs2019-profile", "conan2-cppstd20" ] }, { "name": "conan2-windows-x86_64-vs2019-cppstd20-release", "inherits": [ "conan2-provider", "conan2-release", "conan2-windows-x86_64-vs2019-profile", "conan2-cppstd20" ] } ] } ================================================ FILE: DEV.md ================================================ Development =========== If you want to contribute to pod5_file_format, or our pre-built binaries do not meet your platform requirements, you can build pod5 from source using the instructions in `docs/install.rst`. 
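The preset names in the repository's `CMakePresets.json` appear to follow a `...-cppstd<NN>-<debug|release>` pattern, with matching `conan2-cppstd<NN>` and `conan2-<debug|release>` entries under `inherits`. When picking or adding a preset, a quick consistency check can catch a name that disagrees with what it inherits. The sketch below is an illustration only: `find_mismatched_presets` and the sample data are hypothetical and are not part of the project.

```python
import re

# Assumed naming convention: preset names end in "-cppstd<NN>-<debug|release>".
NAME_SUFFIX = re.compile(r"-cppstd(?P<std>\d+)-(?P<build>debug|release)$")


def find_mismatched_presets(presets_doc):
    """Return (preset name, problem) pairs where the name disagrees with `inherits`."""
    problems = []
    for preset in presets_doc.get("configurePresets", []):
        name = preset.get("name", "")
        match = NAME_SUFFIX.search(name)
        if match is None:
            continue  # Name does not follow the assumed pattern; skip it.
        inherits = preset.get("inherits", [])
        if isinstance(inherits, str):  # CMake accepts a single string or a list.
            inherits = [inherits]
        expected_parents = (
            f"conan2-cppstd{match['std']}",
            f"conan2-{match['build']}",
        )
        for expected in expected_parents:
            if expected not in inherits:
                problems.append((name, f"missing {expected}"))
    return problems


# Hypothetical sample data, shaped like the entries in CMakePresets.json.
SAMPLE = {
    "configurePresets": [
        {
            "name": "example-gcc11-cppstd20-debug",
            "inherits": ["conan2-provider", "conan2-debug", "conan2-cppstd20"],
        },
        {
            # Deliberately inconsistent: the name says cppstd17, `inherits` says cppstd20.
            "name": "example-gcc11-cppstd17-release",
            "inherits": ["conan2-provider", "conan2-release", "conan2-cppstd20"],
        },
    ]
}

for name, problem in find_mismatched_presets(SAMPLE):
    print(f"{name}: {problem}")
```

To check the real file, `SAMPLE` could be replaced with the result of `json.load()` on `CMakePresets.json`.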
### Developing Building the project requires that several tools and libraries be available: - CMake - Arrow - Zstd - Flatbuffers - Python - setuptools_scm ```bash # Docs on installing arrow from here: https://arrow.apache.org/install/ > sudo apt install -y -V ca-certificates lsb-release wget > wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb > sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb > sudo apt update # Now install the rest of the dependencies: > sudo apt install cmake libzstd-dev libflatbuffers-dev libarrow-dev=12.0.1-1 > pip install setuptools_scm~=7.1 # Finally start build of POD5: > git clone https://github.com/nanoporetech/pod5-file-format.git > cd pod5-file-format > git submodule update --init --recursive > python -m setuptools_scm > python ./pod5_make_version.py > mkdir build > cd build > cmake .. > make -j ``` ### Pre commit The project uses pre-commit to ensure code is consistently formatted; you can set this up using pip: ```bash > pip install pre-commit==v2.21.0 # Install pre-commit hooks in your pod5-file-format repo: > cd pod5-file-format > pre-commit install # Run hooks on all files: > pre-commit run --all-files ``` Python Development ================== After completing the required build stages above, follow the instructions below to create a Python virtual environment for development. ```bash > cd python > make install ``` ================================================ FILE: LICENSE.md ================================================ This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/. ©2021 Oxford Nanopore Technologies PLC. Mozilla Public License Version 2.0 ================================== ### 1. Definitions **1.1.
“Contributor”** means each individual or legal entity that creates, contributes to the creation of, or owns Covered Software. **1.2. “Contributor Version”** means the combination of the Contributions of others (if any) used by a Contributor and that particular Contributor's Contribution. **1.3. “Contribution”** means Covered Software of a particular Contributor. **1.4. “Covered Software”** means Source Code Form to which the initial Contributor has attached the notice in Exhibit A, the Executable Form of such Source Code Form, and Modifications of such Source Code Form, in each case including portions thereof. **1.5. “Incompatible With Secondary Licenses”** means * **(a)** that the initial Contributor has attached the notice described in Exhibit B to the Covered Software; or * **(b)** that the Covered Software was made available under the terms of version 1.1 or earlier of the License, but not also under the terms of a Secondary License. **1.6. “Executable Form”** means any form of the work other than Source Code Form. **1.7. “Larger Work”** means a work that combines Covered Software with other material, in a separate file or files, that is not Covered Software. **1.8. “License”** means this document. **1.9. “Licensable”** means having the right to grant, to the maximum extent possible, whether at the time of the initial grant or subsequently, any and all of the rights conveyed by this License. **1.10. “Modifications”** means any of the following: * **(a)** any file in Source Code Form that results from an addition to, deletion from, or modification of the contents of Covered Software; or * **(b)** any new file in Source Code Form that contains any Covered Software. **1.11. 
“Patent Claims” of a Contributor** means any patent claim(s), including without limitation, method, process, and apparatus claims, in any patent Licensable by such Contributor that would be infringed, but for the grant of the License, by the making, using, selling, offering for sale, having made, import, or transfer of either its Contributions or its Contributor Version. **1.12. “Secondary License”** means either the GNU General Public License, Version 2.0, the GNU Lesser General Public License, Version 2.1, the GNU Affero General Public License, Version 3.0, or any later versions of those licenses. **1.13. “Source Code Form”** means the form of the work preferred for making modifications. **1.14. “You” (or “Your”)** means an individual or a legal entity exercising rights under this License. For legal entities, “You” includes any entity that controls, is controlled by, or is under common control with You. For purposes of this definition, “control” means **(a)** the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or **(b)** ownership of more than fifty percent (50%) of the outstanding shares or beneficial ownership of such entity. ### 2. License Grants and Conditions #### 2.1. Grants Each Contributor hereby grants You a world-wide, royalty-free, non-exclusive license: * **(a)** under intellectual property rights (other than patent or trademark) Licensable by such Contributor to use, reproduce, make available, modify, display, perform, distribute, and otherwise exploit its Contributions, either on an unmodified basis, with Modifications, or as part of a Larger Work; and * **(b)** under Patent Claims of such Contributor to make, use, sell, offer for sale, have made, import, and otherwise transfer either its Contributions or its Contributor Version. #### 2.2. 
Effective Date The licenses granted in Section 2.1 with respect to any Contribution become effective for each Contribution on the date the Contributor first distributes such Contribution. #### 2.3. Limitations on Grant Scope The licenses granted in this Section 2 are the only rights granted under this License. No additional rights or licenses will be implied from the distribution or licensing of Covered Software under this License. Notwithstanding Section 2.1(b) above, no patent license is granted by a Contributor: * **(a)** for any code that a Contributor has removed from Covered Software; or * **(b)** for infringements caused by: **(i)** Your and any other third party's modifications of Covered Software, or **(ii)** the combination of its Contributions with other software (except as part of its Contributor Version); or * **(c)** under Patent Claims infringed by Covered Software in the absence of its Contributions. This License does not grant any rights in the trademarks, service marks, or logos of any Contributor (except as may be necessary to comply with the notice requirements in Section 3.4). #### 2.4. Subsequent Licenses No Contributor makes additional grants as a result of Your choice to distribute the Covered Software under a subsequent version of this License (see Section 10.2) or under the terms of a Secondary License (if permitted under the terms of Section 3.3). #### 2.5. Representation Each Contributor represents that the Contributor believes its Contributions are its original creation(s) or it has sufficient rights to grant the rights to its Contributions conveyed by this License. #### 2.6. Fair Use This License is not intended to limit any rights You have under applicable copyright doctrines of fair use, fair dealing, or other equivalents. #### 2.7. Conditions Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted in Section 2.1. ### 3. Responsibilities #### 3.1. 
Distribution of Source Form All distribution of Covered Software in Source Code Form, including any Modifications that You create or to which You contribute, must be under the terms of this License. You must inform recipients that the Source Code Form of the Covered Software is governed by the terms of this License, and how they can obtain a copy of this License. You may not attempt to alter or restrict the recipients' rights in the Source Code Form. #### 3.2. Distribution of Executable Form If You distribute Covered Software in Executable Form then: * **(a)** such Covered Software must also be made available in Source Code Form, as described in Section 3.1, and You must inform recipients of the Executable Form how they can obtain a copy of such Source Code Form by reasonable means in a timely manner, at a charge no more than the cost of distribution to the recipient; and * **(b)** You may distribute such Executable Form under the terms of this License, or sublicense it under different terms, provided that the license for the Executable Form does not attempt to limit or alter the recipients' rights in the Source Code Form under this License. #### 3.3. Distribution of a Larger Work You may create and distribute a Larger Work under terms of Your choice, provided that You also comply with the requirements of this License for the Covered Software. If the Larger Work is a combination of Covered Software with a work governed by one or more Secondary Licenses, and the Covered Software is not Incompatible With Secondary Licenses, this License permits You to additionally distribute such Covered Software under the terms of such Secondary License(s), so that the recipient of the Larger Work may, at their option, further distribute the Covered Software under the terms of either this License or such Secondary License(s). #### 3.4. 
Notices You may not remove or alter the substance of any license notices (including copyright notices, patent notices, disclaimers of warranty, or limitations of liability) contained within the Source Code Form of the Covered Software, except that You may alter any license notices to the extent required to remedy known factual inaccuracies. #### 3.5. Application of Additional Terms You may choose to offer, and to charge a fee for, warranty, support, indemnity or liability obligations to one or more recipients of Covered Software. However, You may do so only on Your own behalf, and not on behalf of any Contributor. You must make it absolutely clear that any such warranty, support, indemnity, or liability obligation is offered by You alone, and You hereby agree to indemnify every Contributor for any liability incurred by such Contributor as a result of warranty, support, indemnity or liability terms You offer. You may include additional disclaimers of warranty and limitations of liability specific to any jurisdiction. ### 4. Inability to Comply Due to Statute or Regulation If it is impossible for You to comply with any of the terms of this License with respect to some or all of the Covered Software due to statute, judicial order, or regulation then You must: **(a)** comply with the terms of this License to the maximum extent possible; and **(b)** describe the limitations and the code they affect. Such description must be placed in a text file included with all distributions of the Covered Software under this License. Except to the extent prohibited by statute or regulation, such description must be sufficiently detailed for a recipient of ordinary skill to be able to understand it. ### 5. Termination **5.1.** The rights granted under this License will terminate automatically if You fail to comply with any of its terms. 
However, if You become compliant, then the rights granted under this License from a particular Contributor are reinstated **(a)** provisionally, unless and until such Contributor explicitly and finally terminates Your grants, and **(b)** on an ongoing basis, if such Contributor fails to notify You of the non-compliance by some reasonable means prior to 60 days after You have come back into compliance. Moreover, Your grants from a particular Contributor are reinstated on an ongoing basis if such Contributor notifies You of the non-compliance by some reasonable means, this is the first time You have received notice of non-compliance with this License from such Contributor, and You become compliant prior to 30 days after Your receipt of the notice. **5.2.** If You initiate litigation against any entity by asserting a patent infringement claim (excluding declaratory judgment actions, counter-claims, and cross-claims) alleging that a Contributor Version directly or indirectly infringes any patent, then the rights granted to You by any and all Contributors for the Covered Software under Section 2.1 of this License shall terminate. **5.3.** In the event of termination under Sections 5.1 or 5.2 above, all end user license agreements (excluding distributors and resellers) which have been validly granted by You or Your distributors under this License prior to termination shall survive termination. ### 6. Disclaimer of Warranty > Covered Software is provided under this License on an “as is” > basis, without warranty of any kind, either expressed, implied, or > statutory, including, without limitation, warranties that the > Covered Software is free of defects, merchantable, fit for a > particular purpose or non-infringing. The entire risk as to the > quality and performance of the Covered Software is with You. > Should any Covered Software prove defective in any respect, You > (not any Contributor) assume the cost of any necessary servicing, > repair, or correction. 
This disclaimer of warranty constitutes an > essential part of this License. No use of any Covered Software is > authorized under this License except under this disclaimer. ### 7. Limitation of Liability > Under no circumstances and under no legal theory, whether tort > (including negligence), contract, or otherwise, shall any > Contributor, or anyone who distributes Covered Software as > permitted above, be liable to You for any direct, indirect, > special, incidental, or consequential damages of any character > including, without limitation, damages for lost profits, loss of > goodwill, work stoppage, computer failure or malfunction, or any > and all other commercial damages or losses, even if such party > shall have been informed of the possibility of such damages. This > limitation of liability shall not apply to liability for death or > personal injury resulting from such party's negligence to the > extent applicable law prohibits such limitation. Some > jurisdictions do not allow the exclusion or limitation of > incidental or consequential damages, so this exclusion and > limitation may not apply to You. ### 8. Litigation Any litigation relating to this License may be brought only in the courts of a jurisdiction where the defendant maintains its principal place of business and such litigation shall be governed by laws of that jurisdiction, without reference to its conflict-of-law provisions. Nothing in this Section shall prevent a party's ability to bring cross-claims or counter-claims. ### 9. Miscellaneous This License represents the complete agreement concerning the subject matter hereof. If any provision of this License is held to be unenforceable, such provision shall be reformed only to the extent necessary to make it enforceable. Any law or regulation which provides that the language of a contract shall be construed against the drafter shall not be used to construe this License against a Contributor. ### 10. Versions of the License #### 10.1. 
New Versions Mozilla Foundation is the license steward. Except as provided in Section 10.3, no one other than the license steward has the right to modify or publish new versions of this License. Each version will be given a distinguishing version number. #### 10.2. Effect of New Versions You may distribute the Covered Software under the terms of the version of the License under which You originally received the Covered Software, or under the terms of any subsequent version published by the license steward. #### 10.3. Modified Versions If you create software not governed by this License, and you want to create a new license for such software, you may create and use a modified version of this License if you rename the license and remove any references to the name of the license steward (except to note that such modified license differs from this License). #### 10.4. Distributing Source Code Form that is Incompatible With Secondary Licenses If You choose to distribute Source Code Form that is Incompatible With Secondary Licenses under the terms of this version of the License, the notice described in Exhibit B of this License must be attached. ## Exhibit A - Source Code Form License Notice This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/. If it is not possible or desirable to put the notice in a particular file, then You may include the notice in a location (such as a LICENSE file in a relevant directory) where a recipient would be likely to look for such a notice. You may add additional accurate notices of copyright ownership. ## Exhibit B - “Incompatible With Secondary Licenses” Notice This Source Code Form is "Incompatible With Secondary Licenses", as defined by the Mozilla Public License, v. 2.0. 
================================================
FILE: README.md
================================================
[![Documentation Status](https://readthedocs.org/projects/pod5-file-format/badge/?version=latest)](https://pod5-file-format.readthedocs.io/)

POD5 File Format
================

POD5 is a file format for storing nanopore DNA data in an easily accessible way.
The format can be written in a streaming manner, which allows a sequencing instrument to write it directly.

Data in POD5 is stored using [Apache Arrow](https://github.com/apache/arrow), allowing users to consume data in many languages using standard tools.

What does this project contain
------------------------------

This project contains a core library for reading and writing POD5 data, and a toolkit for accessing this data in other languages.

Documentation
-------------

Full documentation is found at https://pod5-file-format.readthedocs.io/

Usage
-----

POD5 is also bundled as a Python module for easy use in scripts. Install it using:

```bash
> pip install pod5
```

This Python module provides the library to write custom scripts against.

Please see [examples](./python/pod5/examples) for documentation on using the library.

The `pod5` package also provides [a selection of tools](./python/pod5/README.md).

Design
------

For information about the design of POD5, see the [docs](./docs/README.md).

Development
-----------

If you want to contribute to pod5_file_format, or our pre-built binaries do not meet your platform requirements, you can build pod5 from source using the instructions in [DEV.md](DEV.md).

================================================
FILE: benchmarks/.gitignore
================================================
*/outputs/
image/*.whl

================================================
FILE: benchmarks/README.md
================================================
POD5 Benchmarks
===============

Building the benchmark environment
----------------------------------

To run benchmarks you first have to build the docker environment to run them:

```bash
> ./build.sh
```

Running a benchmark
-------------------

To run a benchmark, use the helper script to start the docker image:

```bash
> ./run_benchmarks_in_docker.sh ./path-to-source-files/
```

Benchmarking Results
--------------------

Note: these results are preliminary.

Results run on:

- POD5 0.0.16
- pyslow5 dev branch (commit 2643310a)

Benchmark numbers are produced using a GridION.

Note the benchmarks are run using the Python APIs; more work is required on C benchmarks.

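The timings reported in this file are wall-clock durations: `run_benchmarks.py` records `time.time()` before and after invoking each `run_<format>.sh` script with `subprocess.run`. A minimal sketch of that measurement is shown here; the helper name `time_command` and the stand-in command are illustrative assumptions, not part of the benchmark suite:

```python
import subprocess
import sys
import time


def time_command(cmd, cwd=None):
    """Run `cmd` to completion and return its wall-clock duration in seconds.

    Hypothetical helper mirroring the start/end timing in run_benchmarks.py.
    """
    start = time.time()
    # check=True raises CalledProcessError if the command exits non-zero,
    # so a failed run is never reported as a timing.
    subprocess.run(cmd, check=True, cwd=cwd)
    end = time.time()
    return end - start


if __name__ == "__main__":
    # Stand-in command; a real run would invoke one of the run_<format>.sh scripts.
    duration_secs = time_command([sys.executable, "-c", "pass"])
    print(f"## Took {duration_secs:.2f} seconds")
```

Because the whole script is timed, each figure includes process start-up and Python import time as well as the file access itself.
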
## PCR Dataset

On a PCR Zymo dataset, PAM50264, on 10.4.1 e8.2 data (`pcr_zymo/20220419_1706_2E_PAM50264_3c6f33f1`):

### File sizes

| pod5   | blow5   | fast5   |
|--------|---------|---------|
| 37G    | 37G     | 52G     |

### Timings

|                                     | pod5       | blow5      | fast5      |
|-------------------------------------|------------|------------|------------|
| convert                             | 197.5 secs | 241.4 secs | Not Run    |
| find all read ids                   | 10.1 secs  | 1.8 secs   | 5.2 secs   |
| find all samples                    | 22.3 secs  | 82.5 secs  | 520.6 secs |
| find selected read ids read number  | 1.1 secs   | 5.8 secs   | 387.1 secs |
| find selected read ids sample count | 1.5 secs   | 5.7 secs   | 417.8 secs |
| find selected read ids samples      | 5.3 secs   | 6.4 secs   | 465.6 secs |

`* Note blow5 convert times include the index + merge operation`

## InterARTIC Dataset

Dataset available at: https://github.com/Psy-Fer/interARTIC

### File sizes

| pod5   | blow5   | fast5   |
|--------|---------|---------|
| 3.3G   | 3.4G    | 6.9G    |

### Timings

|                                     | pod5      | blow5     | fast5     |
|-------------------------------------|-----------|-----------|-----------|
| convert                             | 28.6 secs | 21.0 secs | Not Run   |
| find all read ids                   | 0.5 secs  | 0.5 secs  | 0.7 secs  |
| find all samples                    | 3.0 secs  | 8.0 secs  | 73.5 secs |
| find selected read ids read number  | 0.4 secs  | 1.3 secs  | 29.3 secs |
| find selected read ids sample count | 0.6 secs  | 1.3 secs  | 30.4 secs |
| find selected read ids samples      | 1.4 secs  | 1.3 secs  | 37.8 secs |

`* Note blow5 convert times include the index + merge operation`

================================================
FILE: benchmarks/build.sh
================================================
#!/bin/bash

set -o errexit
set -o pipefail
set -o nounset
# set -o xtrace

script_dir=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

cd "${script_dir}"

cd image/
docker build -t pod5-benchmark-base -f Dockerfile.base .

================================================
FILE: benchmarks/convert/run_blow5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2

./tools/fast5_to_single_blow5.sh "$input_dir" "$output_dir"

================================================
FILE: benchmarks/convert/run_pod5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2

pod5 convert fast5 "$input_dir" --output "$output_dir"

================================================
FILE: benchmarks/find_all_read_ids/run_blow5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2

tools/pyslow5_tests.py "${input_dir}"/blow5/*.blow5 "${output_dir}" get_all_read_ids

================================================
FILE: benchmarks/find_all_read_ids/run_fast5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2

tools/find_and_get_fast5.py "${input_dir}/fast5" "${output_dir}"

================================================
FILE: benchmarks/find_all_read_ids/run_pod5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2

./tools/find_and_get_pod5.py "${input_dir}/pod5" "${output_dir}"

================================================
FILE: benchmarks/find_all_samples/run_blow5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2

tools/pyslow5_tests.py "${input_dir}"/blow5/*.blow5 "${output_dir}" all_values --get-column samples

================================================
FILE: benchmarks/find_all_samples/run_fast5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2

tools/find_and_get_fast5.py "${input_dir}/fast5" "${output_dir}" --get-column samples

================================================
FILE: benchmarks/find_all_samples/run_pod5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2

./tools/find_and_get_pod5.py "${input_dir}/pod5" "${output_dir}" --get-column samples

================================================
FILE: benchmarks/find_selected_read_ids_read_number/run_blow5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2
full_output_dir=$3

tools/pyslow5_tests.py "${input_dir}"/blow5/*.blow5 "${output_dir}" sample_values --get-column read_number --select-ids "${full_output_dir}/selected_read_ids.csv"

================================================
FILE: benchmarks/find_selected_read_ids_read_number/run_fast5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2
full_output_dir=$3

tools/find_and_get_fast5.py "${input_dir}/fast5" "${output_dir}" --get-column read_number --select-ids "${full_output_dir}/selected_read_ids.csv"

================================================
FILE: benchmarks/find_selected_read_ids_read_number/run_pod5.sh
================================================
#!/bin/bash

input_dir=$1
type_output_dir=$2
full_output_dir=$3

./tools/find_and_get_pod5.py "${input_dir}/pod5" "${type_output_dir}" --get-column read_number --select-ids "${full_output_dir}/selected_read_ids.csv"

================================================
FILE: benchmarks/find_selected_read_ids_sample_count/run_blow5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2
full_output_dir=$3

tools/pyslow5_tests.py "${input_dir}"/blow5/*.blow5 "${output_dir}" sample_values --get-column sample_count --select-ids "${full_output_dir}/selected_read_ids.csv"

================================================
FILE: benchmarks/find_selected_read_ids_sample_count/run_fast5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2
full_output_dir=$3

tools/find_and_get_fast5.py "${input_dir}/fast5" "${output_dir}" --get-column sample_count --select-ids "${full_output_dir}/selected_read_ids.csv"

================================================
FILE: benchmarks/find_selected_read_ids_sample_count/run_pod5.sh
================================================
#!/bin/bash

input_dir=$1
type_output_dir=$2
full_output_dir=$3

./tools/find_and_get_pod5.py "${input_dir}/pod5" "${type_output_dir}" --get-column sample_count --select-ids "${full_output_dir}/selected_read_ids.csv"

================================================
FILE: benchmarks/find_selected_read_ids_samples/run_blow5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2
full_output_dir=$3

tools/pyslow5_tests.py "${input_dir}"/blow5/*.blow5 "${output_dir}" sample_values --get-column samples --select-ids "${full_output_dir}/selected_read_ids.csv"

================================================
FILE: benchmarks/find_selected_read_ids_samples/run_fast5.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2
full_output_dir=$3

tools/find_and_get_fast5.py "${input_dir}/fast5" "${output_dir}" --get-column samples --select-ids "${full_output_dir}/selected_read_ids.csv"

================================================
FILE: benchmarks/find_selected_read_ids_samples/run_pod5.sh
================================================
#!/bin/bash

input_dir=$1
type_output_dir=$2
full_output_dir=$3

./tools/find_and_get_pod5.py "${input_dir}/pod5" "${type_output_dir}" --get-column samples --select-ids "${full_output_dir}/selected_read_ids.csv"

================================================
FILE: benchmarks/image/Dockerfile.base
================================================
FROM ubuntu:20.04

RUN apt update && apt install -y wget python3 python3-pip git libzstd-dev

RUN wget https://github.com/nanoporetech/vbz_compression/releases/download/v1.0.1/ont-vbz-hdf-plugin_1.0.1-1.focal_amd64.deb && apt install -y ./ont-vbz-hdf-plugin_1.0.1-1.focal_amd64.deb && rm ont-vbz-hdf-plugin_1.0.1-1.focal_amd64.deb

COPY ./requirements-benchmarks.txt /
RUN pip install -r /requirements-benchmarks.txt

COPY ./install_slow5.sh /
RUN /install_slow5.sh
# Must match the default SLOW_5_TOOLS_VERSION in install_slow5.sh
ENV PATH="/slow5tools-v1.3.0/:$PATH"

RUN pip install numpy

COPY ./pod5*.whl /
RUN pip install *.whl && rm *.whl

================================================
FILE: benchmarks/image/install_slow5.sh
================================================
#!/bin/bash
set -e

: "${SLOW_5_TOOLS_VERSION:=v1.3.0}"
: "${SLOW_5_LIB_VERSION:=v1.3.1}"

apt update
apt install -y libzstd-dev libhdf5-dev

wget "https://github.com/hasindu2008/slow5tools/releases/download/${SLOW_5_TOOLS_VERSION}/slow5tools-${SLOW_5_TOOLS_VERSION}-release.tar.gz"
tar xvf "slow5tools-${SLOW_5_TOOLS_VERSION}-release.tar.gz"
rm "slow5tools-${SLOW_5_TOOLS_VERSION}-release.tar.gz"

(
    cd "slow5tools-${SLOW_5_TOOLS_VERSION}"
    ./configure
    make zstd=1 -j "$(nproc)"
)

# pyslow5 must be built with zstd support for fair comparison (otherwise default zlib is slower than zstd)
git clone -b "${SLOW_5_LIB_VERSION}" https://github.com/hasindu2008/slow5lib
(
    cd slow5lib/
    echo "Installing numpy"
    pip install numpy
    # On failure, show the build log and exit non-zero so the image build stops
    make pyslow5 -j "$(nproc)" 2> build_log.txt || (cat build_log.txt; exit 1)
    echo "Installing pyslow5"
    PYSLOW5_ZSTD=1 pip install dist/*.tar.gz
    # adding slow5 C API benchmarks
    make zstd=1 -j "$(nproc)" && test/bench/build.sh
)

================================================
FILE: benchmarks/image/requirements-benchmarks.txt
================================================
h5py
numpy
pandas
tabulate

================================================
FILE: benchmarks/run_benchmarks.py
================================================
#!/usr/bin/env python3
"""
Example usage:

```
> taskset -c 0-10 ./benchmarks/run_benchmarks.py ./input_files/ \
    ./benchmark-outputs/ --skip-to-benchmark find_all_samples
```
"""

import argparse
import json
import shutil
import subprocess
import time
from collections import namedtuple
from pathlib import Path

import tabulate

Benchmark = namedtuple(
    "Benchmark",
    ["name", "file_types", "checks", "input_benchmark", "pre_run", "post_run_fixup"],
)

BENCHMARK_ROOT = Path(__file__).resolve().parent

POD5_FILE_TYPE = "pod5"
BLOW5_FILE_TYPE = "blow5"
FAST5_FILE_TYPE = "fast5"
ALL_FILE_TYPES = [POD5_FILE_TYPE, BLOW5_FILE_TYPE, FAST5_FILE_TYPE]


def du(path):
    """disk usage in human readable format (e.g. '2,1GB')"""
    return subprocess.check_output(["du", "-sh", path]).split()[0].decode("utf-8")


def generate_report(input_dir, output_dir, timing_results):
    report = ""

    skipped_benchmarks = len(ALL_BENCHMARKS) - len(timing_results)
    skipped = ""
    if skipped_benchmarks != 0:
        skipped = f", skipped {skipped_benchmarks}"
    report += f"Ran {len(timing_results)} benchmarks{skipped}\n\n"
    report += f"Input data was {input_dir}\n\n"

    report += "File sizes\n"
    report += "----------\n\n"
    convert_output_dir = output_dir / "convert"
    sizes = []
    for file_type in ALL_FILE_TYPES:
        file_type_dir = convert_output_dir / file_type
        # exists() must be called: the bound method itself is always truthy
        if file_type_dir.exists():
            sizes.append(du(file_type_dir))
        else:
            sizes.append("Not Run")
    report += (
        tabulate.tabulate([sizes], headers=ALL_FILE_TYPES, tablefmt="github") + "\n\n"
    )

    report += "Timings\n"
    report += "-------\n\n"
    results = []
    for benchmark in ALL_BENCHMARKS:
        row = [benchmark.name.replace("_", " ")]
        results.append(row)
        if benchmark.name in timing_results:
            timings = timing_results[benchmark.name]
            for file_type in ALL_FILE_TYPES:
                if file_type in timings:
                    row.append(f"{timings[file_type]:.1f} secs")
                else:
                    row.append("Not Run")
        else:
            for file_type in ALL_FILE_TYPES:
                row.append("Not Run")

    results_headers = [""] + ALL_FILE_TYPES
    report += (
        tabulate.tabulate(results, headers=results_headers, tablefmt="github") + "\n"
    )

    return report


def check_read_ids(benchmark, file_types, output_dir, only_format):
    if only_format is not None:
        print("Not checking read ids - only one format executed")
        return

    csv_check_files = []
    for file_type in file_types:
        csv_check_files.append(output_dir / file_type / "read_ids.csv")

    # Compare each output CSV against the one for the previous file type
    for a, b in zip(csv_check_files[1:], csv_check_files):
        subprocess.run(
            [BENCHMARK_ROOT / "tools" / "check_csvs_consistent.py", a, b], check=True
        )


def check_file_sizes(benchmark, file_types, output_dir, only_format):
    print("File sizes for output dir")
    subprocess.run(["du", "-sh"] + list(output_dir.glob("*")), check=True)


def copy_fast5_files(benchmark, input_dir, output_dir):
    shutil.copytree(input_dir, output_dir / "fast5")


def randomly_select_read_ids(benchmark, input_dir, output_dir):
    print("Randomly selecting read ids for benchmark")
    subprocess.run(
        [
            BENCHMARK_ROOT / "tools" / "find_and_get_pod5.py",
            input_dir / "pod5",
            output_dir,
        ],
        check=True,
    )
    subprocess.run(
        [
            BENCHMARK_ROOT / "tools" / "select-random-ids.py",
            output_dir / "read_ids.csv",
            output_dir / "selected_read_ids.csv",
            "--select-ratio",
            "0.1",
        ],
        check=True,
    )


ALL_BENCHMARKS = [
    Benchmark(
        "convert",
        [POD5_FILE_TYPE, BLOW5_FILE_TYPE],
        [check_file_sizes],
        input_benchmark=None,
        post_run_fixup=copy_fast5_files,
        pre_run=None,
    ),
    Benchmark(
        "find_all_read_ids",
        ALL_FILE_TYPES,
        [check_read_ids],
        input_benchmark="convert",
        post_run_fixup=None,
        pre_run=None,
    ),
    Benchmark(
        "find_all_samples",
        ALL_FILE_TYPES,
        [check_read_ids],
        input_benchmark="convert",
        post_run_fixup=None,
        pre_run=None,
    ),
    Benchmark(
        "find_selected_read_ids_read_number",
        ALL_FILE_TYPES,
        [check_read_ids],
        input_benchmark="convert",
        post_run_fixup=None,
        pre_run=randomly_select_read_ids,
    ),
    Benchmark(
        "find_selected_read_ids_sample_count",
        ALL_FILE_TYPES,
        [check_read_ids],
        input_benchmark="convert",
        post_run_fixup=None,
        pre_run=randomly_select_read_ids,
    ),
    Benchmark(
        "find_selected_read_ids_samples",
        ALL_FILE_TYPES,
        [check_read_ids],
        input_benchmark="convert",
        post_run_fixup=None,
        pre_run=randomly_select_read_ids,
    ),
]


def run_benchmark(benchmark, input_dir, output_dir, only_format=None):
    if output_dir.exists():
        print("Removing old output dir")
        shutil.rmtree(output_dir)

    file_types = benchmark.file_types if benchmark.file_types else ALL_FILE_TYPES

    time_results = {}

    if benchmark.pre_run:
        benchmark.pre_run(benchmark, input_dir, output_dir)

    for file_type in file_types:
        if only_format is not None and only_format != file_type:
            print(f"## Skipping for file type {file_type}:")
            continue

        print(f"## Running for file type {file_type}:")
        file_type_output_dir = output_dir / file_type
        file_type_output_dir.mkdir(exist_ok=True, parents=True)

        # Wall-clock time of the whole per-format script
        start = time.time()
        subprocess.run(
            [
                BENCHMARK_ROOT / benchmark.name / f"run_{file_type}.sh",
                input_dir,
                file_type_output_dir,
                output_dir,
            ],
            check=True,
            cwd=BENCHMARK_ROOT,
        )
        end = time.time()
        duration_secs = end - start
        time_results[file_type] = duration_secs
        print(f"## Took {duration_secs:.2f} seconds")

    if benchmark.post_run_fixup:
        benchmark.post_run_fixup(benchmark, input_dir, output_dir)

    if benchmark.checks:
        print("## Running checks")
        for check in benchmark.checks:
            check(benchmark, file_types, output_dir, only_format)

    return time_results


def run_benchmarks(args):
    timing_results = {}
    input_dir = args.input_dir.resolve()
    output_dir = args.output_dir.resolve()

    # Skip every benchmark listed before the one named by --skip-to-benchmark
    skip_list = []
    if args.skip_to_benchmark:
        found = False
        for benchmark in ALL_BENCHMARKS:
            if benchmark.name == args.skip_to_benchmark:
                found = True

            if not found:
                skip_list.append(benchmark.name)

    for benchmark in ALL_BENCHMARKS:
        if benchmark.name in skip_list:
            print(f"# Skipping benchmark {benchmark.name}")
            continue

        print(f"# Running benchmark {benchmark.name}:")
        benchmark_input_dir = input_dir
        if benchmark.input_benchmark:
            benchmark_input_dir = output_dir / benchmark.input_benchmark
        timing_results[benchmark.name] = run_benchmark(
            benchmark,
            benchmark_input_dir,
            output_dir / benchmark.name,
            args.only_format,
        )

    report = generate_report(input_dir, output_dir, timing_results)
    print(report)
    with open(args.output_dir / "timings.json", "w") as f:
        f.write(json.dumps(timing_results, indent=2))
    with open(args.output_dir / "report.md", "w") as f:
        f.write(report)


def main():
    parser = argparse.ArgumentParser("Run Benchmarks for POD5 format")

    parser.add_argument("input_dir", type=Path)
    parser.add_argument("output_dir", type=Path)
    parser.add_argument(
        "--skip-to-benchmark",
        type=str,
        help="Start benchmarking from a named benchmark",
    )
    parser.add_argument(
        "--only-format",
        type=str,
        help="Only run benchmarks for a single format",
    )

    args = parser.parse_args()

    run_benchmarks(args)


if __name__ == "__main__":
    main()

================================================
FILE: benchmarks/run_benchmarks_in_docker.sh
================================================
#!/bin/bash

set -e

input_dir=$(readlink -f "$1")
output_dir="$(pwd)/pod5-benchmark-outputs"
mkdir -p "${output_dir}"

script_dir=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

echo "Running benchmark on input '${input_dir}'"

docker run --rm -it -v"${input_dir}":/input -v"${output_dir}":/outputs -v"${script_dir}"/:/benchmarks pod5-benchmark-base /benchmarks/tools/run_benchmarks_docker_entry.sh

================================================
FILE: benchmarks/tools/check_csvs_consistent.py
================================================
#!/usr/bin/env python3

import argparse
import sys

import pandas as pd
from pandas.testing import assert_frame_equal


def check_consistency(df1, df2):
    df1 = df1.sort_values("read_id", ignore_index=True)
    df2 = df2.sort_values("read_id", ignore_index=True)

    assert_frame_equal(df1, df2)

    print("Data frames are consistent")
    sys.exit(0)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("input_a")
    parser.add_argument("input_b")

    args = parser.parse_args()

    a = pd.read_csv(args.input_a)
    b = pd.read_csv(args.input_b)

    print(f"Check consistency of files {args.input_a} and {args.input_b}")
    check_consistency(a, b)


if __name__ == "__main__":
    main()

================================================
FILE: benchmarks/tools/fast5_to_single_blow5.sh
================================================
#!/bin/bash

input_path=$1
output_path=$2

mkdir -p "$output_path"

temp_dir="${output_path}/tmp"
mkdir -p "$temp_dir"

# specific options (-c zstd -s svb-zd) must be provided to slow5tools to create
# compression comparable to vbz
# also number of processes/threads must be set to 10 to match with default value in pod5_convert
# however, the svb-zd stream variable byte + zig-zag delta implementation in slow5 mirrors
# ONT's previous 32 bit zigzag delta, whereas pod5 is using a newer 16 bit zigzag delta with SIMD optimisations
# so pod5 has the added performance benefit of using the newer zigzag delta
# slow5 compression methods are modular, so we can easily add the new one if necessary
slow5tools f2s "$input_path" -d "$temp_dir" -p 10 -c zstd -s svb-zd

# Most comparable to have one file for both formats:
# (each argument is quoted separately; quoting them as one string makes cat always fail)
slow5tools cat "$temp_dir" -o "$output_path/file.blow5" || slow5tools merge "$temp_dir" -o "$output_path/file.blow5" -t 10 -c zstd -s svb-zd
#if the files are from the same run ID, slow5tools cat can be used, which is significantly faster
#slow5tools cat $temp_dir -o $output_path/file.blow5
rm -r "$temp_dir"

# Index will get generated on first test anyway, we should do it now to give best results later:
# current slow5tools implementation decompresses the whole record for indexing and is not efficient
# the specification supports partial decompress of the record (also signal chunking if necessary)
slow5tools index "$output_path/file.blow5"

================================================
FILE: benchmarks/tools/find_and_get_fast5.py
================================================
#!/usr/bin/env python3

import argparse
from pathlib import Path

import h5py
import numpy
import pandas as pd


def select_reads(file, selection):
    """Yield (read_id, hdf5_path) pairs for the selected reads, or for all reads."""
    if selection is not None:
        for read in selection:
            path = f"/read_{read}"
            if path not in file:
                continue
            yield read, path
    else:
        for key in file.keys():
            if key.startswith("read_"):
                yield key[5:], key


def run(input_dir, output, select_read_ids=None, get_columns=()):
    output.mkdir(parents=True, exist_ok=True)

    if select_read_ids is not None:
        print(f"Selecting {len(select_read_ids)} specific read ids")
    if get_columns is not None:
        print(f"Selecting columns: {get_columns}")

    read_ids = []
    extracted_columns = {"read_id": read_ids}
    print(f"Search for input files in {input_dir}")
    for file in input_dir.glob("*.fast5"):
        print(f"Searching for reads in {file}")

        file = h5py.File(file, "r")

        for read_id, read_path in select_reads(file, select_read_ids):
            read_ids.append(read_id)

            for c in get_columns:
                if c not in extracted_columns:
                    extracted_columns[c] = []
                col = extracted_columns[c]
                if c == "read_number":
                    col.append(file[f"{read_path}/Raw"].attrs["read_number"])
                elif c == "sample_count":
                    col.append(len(file[f"{read_path}/Raw"]["Signal"]))
                elif c == "samples":
                    # The "samples" column is the sum of the raw signal
                    col.append(numpy.sum(file[f"{read_path}/Raw"]["Signal"]))

    df = pd.DataFrame(extracted_columns)
    print(f"Selected {len(read_ids)} items")
    df.to_csv(output / "read_ids.csv", index=False)


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument("input", type=Path)
    parser.add_argument("output", type=Path)
    parser.add_argument(
        "--select-ids",
        type=str,
        help="CSV file with a read_id column, listing ids to find in input files",
    )
    parser.add_argument(
        "--get-column",
        default=[],
        nargs="+",
        type=str,
        help="Add columns that should be extracted",
    )

    args = parser.parse_args()

    select_read_ids = None
    if args.select_ids:
        select_read_ids = pd.read_csv(args.select_ids)["read_id"]

    run(
        args.input,
        args.output,
        select_read_ids=select_read_ids,
        get_columns=args.get_column,
    )


if __name__ == "__main__":
    main()

================================================
FILE: benchmarks/tools/find_and_get_pod5.py
================================================
#!/usr/bin/env python3

import argparse
import multiprocessing as mp
import tempfile
from collections import namedtuple
from pathlib import Path
from queue import Empty

import numpy
import pandas as pd

import pod5 as p5

SelectReadIdsData = namedtuple(
    "SelectReadIdsData", ["path", "slice_start", "slice_end", "shape"]
)


def load_mapped_ids(select_read_ids_data):
    """Load a set of read ids from a mmapped file on disk"""
    select_read_ids_all = numpy.memmap(
        select_read_ids_data.path,
        dtype=numpy.uint8,
        mode="r+",
        shape=select_read_ids_data.shape,
    )
    return select_read_ids_all[
        select_read_ids_data.slice_start : select_read_ids_data.slice_end
    ]


def do_batch_work(filename, batches, column, mode, result_q):
    """
    Per process worker to do loading of data from a set of batches
    """
    read_ids = []
    vals = []
    extracted_columns = {"read_id": read_ids, column: vals}

    if column == "samples":
        file = p5.Reader(filename)
        for batch in file.read_batches(batch_selection=batches, preload={"samples"}):
            read_ids.extend(p5.format_read_ids(batch.read_id_column))
            for read in batch.reads():
                vals.append(numpy.sum(read.signal))
    else:
        print(f"Unknown column {column}")

    result_q.put(pd.DataFrame(extracted_columns))


def do_search_work(files, select_read_ids_data, column, mode, result_q):
    """
    Per process worker to do loading of data from a number of read ids
    """
    select_read_ids = load_mapped_ids(select_read_ids_data)

    read_ids = []
    vals = []
    extracted_columns = {"read_id": read_ids, column: vals}

    if column == "samples":
        for filename in files:
            file = p5.Reader(filename)
            for batch in file.read_batches(select_read_ids, preload={"samples"}):
                read_ids.extend(p5.format_read_ids(batch.read_id_column))
                vals.extend([numpy.sum(s) for s in batch.cached_samples_column])
    else:
        print(f"Unknown column {column}")

    result_q.put(pd.DataFrame(extracted_columns))


def run_multiprocess(files, output, select_read_ids=None, column=None, mode=None):
    """
    Do work across a number of python multiprocesses
    """
    mp.set_start_method("spawn")

    if select_read_ids is not None:
        print("Placing select read id data on disk for mmapping:")
        numpy_select_read_ids = p5.pack_read_ids(select_read_ids)

        # Copy data to memory-map
        fp = tempfile.NamedTemporaryFile()
        fp.close()
        mapped_select_read_ids = numpy.memmap(
            fp.name, dtype=numpy.uint8, mode="w+", shape=numpy_select_read_ids.shape
        )
        numpy.copyto(mapped_select_read_ids, numpy_select_read_ids)
        select_read_ids_mmap_path = Path(fp.name)

    result_queue = mp.Queue()
    runners = 10
    processes = []

    if select_read_ids is not None:
        # One worker per chunk of the selected read ids
        approx_chunk_size = max(1, len(select_read_ids) // runners)
        start_index = 0
        while start_index < len(select_read_ids):
            select_read_ids_data = SelectReadIdsData(
                select_read_ids_mmap_path,
                start_index,
                start_index + approx_chunk_size,
                numpy_select_read_ids.shape,
            )

            p = mp.Process(
                target=do_search_work,
                args=(files, select_read_ids_data, column, mode, result_queue),
            )
            p.start()
            processes.append(p)
            start_index += approx_chunk_size
    else:
        # One worker per chunk of record batches in each file
        for filename in files:
            file = p5.Reader(filename)
            batches = list(range(file.batch_count))
            approx_chunk_size = max(1, len(batches) // runners)
            start_index = 0
            while start_index < len(batches):
                select_batches = batches[start_index : start_index + approx_chunk_size]
                p = mp.Process(
                    target=do_batch_work,
                    args=(filename, select_batches, column, mode, result_queue),
                )
                p.start()
                processes.append(p)
                start_index += len(select_batches)

    print("Wait for processes...")
    items = []
    while len(items) < len(processes):
        try:
            item = result_queue.get(timeout=0.5)
            items.append(item)
        except Empty:
            continue

    for p in processes:
        p.join()

    # Remove the temporary memory-map file before returning;
    # placed after the return statement this cleanup would never run.
    if select_read_ids is not None:
        select_read_ids_mmap_path.unlink()

    return pd.concat(items)


def run_get_read_ids(files):
    """
    Load all read ids from the file.
    """
    read_ids = []
    for filename in files:
        file = p5.Reader(filename)
        for batch in file.read_batches():
            read_ids.extend(p5.format_read_ids(batch.read_id_column))

    return pd.DataFrame({"read_id": read_ids})


def run_select(files, select_read_ids, column):
    """
    Load column from a specific set of read ids
    """
    read_ids = []
    vals = []
    extracted_columns = {"read_id": read_ids, column: vals}

    for filename in files:
        file = p5.Reader(filename)
        if column == "sample_count":
            for batch in file.read_batches(select_read_ids, preload={"sample_count"}):
                read_id_selection = batch.read_id_column
                read_ids.extend(p5.format_read_ids(read_id_selection))
                vals.extend(batch.cached_sample_count_column)
        else:
            col_name = f"{column}_column"
            for batch in file.read_batches(select_read_ids):
                read_id_selection = batch.read_id_column
                read_ids.extend(p5.format_read_ids(read_id_selection))
                read_number_selection = getattr(batch, col_name)
                vals.extend(read_number_selection)

    return pd.DataFrame(extracted_columns)


def run_batched(files, column):
    """
    Load column from all reads
    """
    read_ids = []
    vals = []
    extracted_columns = {"read_id": read_ids, column: vals}

    for filename in files:
        file = p5.Reader(filename)
        if column == "sample_count":
            for batch in file.read_batches(preload={"sample_count"}):
                read_ids.extend(p5.format_read_ids(batch.read_id_column))
                vals.extend(batch.cached_sample_count_column)
        else:
            col_name = f"{column}_column"
            for batch in file.read_batches():
                read_ids.extend(p5.format_read_ids(batch.read_id_column))
                vals.extend(getattr(batch, col_name).to_numpy())

    return pd.DataFrame(extracted_columns)


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument("input", type=Path)
    parser.add_argument("output", type=Path)
    parser.add_argument(
        "--select-ids",
        type=str,
        help="CSV file with a read_id column, listing ids to find in input files",
    )
    parser.add_argument(
        "--get-column",
        default=None,
        type=str,
        help="Add column that should be extracted",
    )

    args = parser.parse_args()

    select_read_ids = None
    if args.select_ids:
        select_read_ids = pd.read_csv(args.select_ids)["read_id"]

    if select_read_ids is not None:
        print(f"Selecting {len(select_read_ids)} specific read ids")
    if args.get_column is not None:
        print(f"Selecting column: {args.get_column}")

    mode = None

    print(f"Search for input files in {args.input}")
    files = list(args.input.glob("*.pod5"))
    print(f"Searching in {[str(f) for f in files]}")

    # Run benchmark using most appropriate method:
    if args.get_column is None:
        df = run_get_read_ids(files)
    elif args.get_column == "samples":
        # Because we define the "samples" column to be the sum
        # of all samples in input data, it is quicker to use
        # python multiprocessing to split the summing work:
        df = run_multiprocess(
            files,
            args.output,
            select_read_ids=select_read_ids,
            column=args.get_column,
            mode=mode,
        )
    elif args.select_ids:
        df = run_select(
            files,
            select_read_ids=select_read_ids,
            column=args.get_column,
        )
    else:
        df = run_batched(
            files,
            column=args.get_column,
        )

    print(f"Selected {len(df)} items")
    args.output.mkdir(parents=True, exist_ok=True)
    df.to_csv(args.output / "read_ids.csv", index=False)


if __name__ == "__main__":
    main()

================================================
FILE: benchmarks/tools/pyslow5_tests.py
================================================
#!/usr/bin/env python3

import argparse
import multiprocessing as mp
from pathlib import Path
from queue import Empty

import numpy
import pandas as pd
import pyslow5


def random_access(s5_file, read_list, col, result_q):
    file = pyslow5.Open(str(s5_file), "r")
    print("processing ", s5_file)
    read_ids = []
    extracted_columns = {"read_id": read_ids}
    extracted_columns[col] = []
    vals = extracted_columns[col]
    if col == "samples":
        for read in file.get_read_list_multi(read_list, threads=10, batchsize=5000):
            read_ids.append(read["read_id"])
            vals.append(numpy.sum(read["signal"]))
    elif col == "sample_count":
        for read in file.get_read_list_multi(read_list, threads=10, batchsize=5000):
            read_ids.append(read["read_id"])
            vals.append(read["len_raw_signal"])
    else:
        for read in file.get_read_list_multi(
            read_list, threads=10, batchsize=5000, pA=False, aux=col
        ):
            read_ids.append(read["read_id"])
            vals.append(read[col])
    result_q.put(pd.DataFrame(extracted_columns))


def run(s5_file, benchmark, select_read_ids, col):
    if benchmark == "get_all_read_ids":
        read_ids = []
        extracted_columns = {"read_id": read_ids}
        file = pyslow5.Open(str(s5_file), "r")
        print("processing ", s5_file)
        read_ids, num_reads = file.get_read_ids()
        extracted_columns = {"read_id": read_ids}
    elif benchmark == "sample_values":
        mp.set_start_method("spawn")
        result_queue = mp.Queue()
        runners = 10
        processes = []
        approx_chunk_size = max(1, len(select_read_ids) // runners)
        select_ids = []
        for i in range(0, len(select_read_ids), approx_chunk_size):
            for j in range(i, min(len(select_read_ids), i + approx_chunk_size)):
                select_ids.append(select_read_ids[j])
            p = mp.Process(
                target=random_access, args=(s5_file, select_ids, col, result_queue)
            )
            p.start()
            processes.append(p)
            select_ids = []

        print("Wait for processes...")
        items = []
        while len(items) < len(processes):
            try:
                item = result_queue.get(timeout=0.5)
                items.append(item)
            except Empty:
                continue
        for p in processes:
            p.join()
        df = pd.concat(items)
        return df
    elif benchmark == "all_values":
        read_ids = []
        extracted_columns = {"read_id": read_ids}
        file = pyslow5.Open(str(s5_file), "r")
        print("processing ", s5_file)
        extracted_columns[col] = []
        vals = extracted_columns[col]
        if col == "samples":
            for read in file.seq_reads_multi(threads=10, batchsize=5000):
                read_ids.append(read["read_id"])
                vals.append(numpy.sum(read["signal"]))
        elif col == "sample_count":
            for read in file.seq_reads_multi(threads=10, batchsize=5000):
                read_ids.append(read["read_id"])
                vals.append(read["len_raw_signal"])
        else:
            for read in file.seq_reads_multi(
                threads=10, batchsize=5000, pA=False, aux=col
            ):
                read_ids.append(read["read_id"])
                vals.append(read[col])

    return pd.DataFrame(extracted_columns)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("input", type=Path)
    parser.add_argument("output", type=Path)
    parser.add_argument(
        "benchmark",
        type=str,
        choices=["get_all_read_ids", "sample_values", "all_values"],
        help="which benchmark to run",
    )
    parser.add_argument(
        "--select-ids",
        type=str,
        help="CSV file with a read_id column, listing ids to find in input files",
    )
    parser.add_argument(
        "--get-column",
        default=None,
        type=str,
        help="Add columns that should be extracted",
    )

    args = parser.parse_args()

    args.output.mkdir(parents=True, exist_ok=True)

    select_read_ids = None
    select_reads = []
    if args.select_ids:
        select_read_ids = pd.read_csv(args.select_ids)["read_id"]
        for i in select_read_ids:
            select_reads.append(i)
    print(f"Num of select_reads: {len(select_reads)}")
    df = run(
        args.input,
        args.benchmark,
        select_read_ids=select_reads,
        col=args.get_column,
    )
    print(f"Selected {len(df)} items")
    df.to_csv(args.output / "read_ids.csv", index=False)


if __name__ == "__main__":
    main()

================================================
FILE: benchmarks/tools/run_benchmarks_docker_entry.sh
================================================
#!/bin/bash

# Use taskset to limit benchmarks to specific cores, ensuring a fair test of limited resources:
taskset -c 0-10 /benchmarks/run_benchmarks.py /input /outputs

================================================
FILE: benchmarks/tools/select-random-ids.py
================================================
#!/usr/bin/env python3

import argparse
from pathlib import Path

import pandas as pd


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("input_csv", type=Path)
    parser.add_argument("output_csv", type=Path)
    parser.add_argument("--select-ratio", type=float)

    args = parser.parse_args()

    df = pd.read_csv(args.input_csv)
    selected_rows_df = df.sample(frac=args.select_ratio)

    args.output_csv.parent.mkdir(parents=True, exist_ok=True)
    selected_rows_df.to_csv(args.output_csv)


if __name__ == "__main__":
    main()
================================================ FILE: c++/CMakeLists.txt ================================================ if (ENABLE_CONAN) find_package(Arrow REQUIRED CONFIG) find_package(Flatbuffers REQUIRED CONFIG) find_package(zstd REQUIRED CONFIG) find_package(ZLIB REQUIRED CONFIG) if (${CMAKE_SYSTEM_NAME} STREQUAL "Linux") find_package(jemalloc REQUIRED CONFIG) endif() else() find_package(Arrow REQUIRED) find_package(Flatbuffers REQUIRED) find_package(zstd REQUIRED) find_package(ZLIB REQUIRED) # Our non-conan ubuntu CI build has a different name for this target if (NOT CONAN2) add_library(arrow::arrow INTERFACE IMPORTED) target_link_libraries(arrow::arrow INTERFACE Arrow::arrow_shared) endif() endif() find_package(Threads REQUIRED) find_program( FLATBUFFERS_FLATC_EXECUTABLE flatc ) include(BuildFlatBuffers) configure_file( pod5_format/version.h.in pod5_format/version.h ) set(pod5_library_type STATIC) if (BUILD_SHARED_LIB) set(pod5_library_type SHARED) endif() add_library(pod5_format ${pod5_library_type} pod5_format/file_recovery.h pod5_format/file_writer.cpp pod5_format/file_writer.h pod5_format/file_reader.cpp pod5_format/file_reader.h pod5_format/file_updater.cpp pod5_format/file_updater.h pod5_format/async_signal_loader.cpp pod5_format/async_signal_loader.h pod5_format/schema_metadata.cpp pod5_format/table_reader.h pod5_format/schema_field_builder.h pod5_format/read_table_reader.cpp pod5_format/read_table_reader.h pod5_format/read_table_schema.cpp pod5_format/read_table_schema.h pod5_format/read_table_writer.cpp pod5_format/read_table_writer.h pod5_format/read_table_writer_utils.cpp pod5_format/read_table_writer_utils.h pod5_format/read_table_utils.cpp pod5_format/read_table_utils.h pod5_format/run_info_table_reader.cpp pod5_format/run_info_table_reader.h pod5_format/run_info_table_schema.cpp pod5_format/run_info_table_schema.h pod5_format/run_info_table_writer.cpp pod5_format/run_info_table_writer.h pod5_format/signal_compression.cpp 
pod5_format/signal_compression.h pod5_format/signal_table_reader.cpp pod5_format/signal_table_reader.h pod5_format/signal_table_schema.cpp pod5_format/signal_table_schema.h pod5_format/signal_table_writer.cpp pod5_format/signal_table_writer.h pod5_format/signal_table_utils.h pod5_format/signal_builder.h pod5_format/c_api.cpp pod5_format/c_api.h pod5_format/expandable_buffer.h pod5_format/io_manager.cpp pod5_format/io_manager.h pod5_format/memory_pool.cpp pod5_format/memory_pool.h pod5_format/result.h pod5_format/schema_utils.cpp pod5_format/schema_utils.h pod5_format/table_reader.cpp pod5_format/table_reader.h pod5_format/thread_pool.cpp pod5_format/thread_pool.h pod5_format/tuple_utils.h pod5_format/types.cpp pod5_format/types.h pod5_format/uuid.h pod5_format/migration/migration.cpp pod5_format/migration/migration.h pod5_format/migration/migration_utils.h pod5_format/migration/v0_to_v1.cpp pod5_format/migration/v1_to_v2.cpp pod5_format/migration/v2_to_v3.cpp pod5_format/migration/v3_to_v4.cpp pod5_format/internal/async_output_stream.h pod5_format/internal/combined_file_utils.h pod5_format/internal/linux_output_stream.h pod5_format/svb16/common.hpp pod5_format/svb16/decode.hpp pod5_format/svb16/decode_scalar.hpp pod5_format/svb16/decode_x64.hpp pod5_format/svb16/encode.hpp pod5_format/svb16/encode_scalar.hpp pod5_format/svb16/encode_x64.hpp pod5_format/svb16/intrinsics.hpp pod5_format/svb16/shuffle_tables.hpp pod5_format/svb16/simd_detect_x64.hpp ) set(public_headers) list(APPEND public_headers pod5_format/file_writer.h pod5_format/file_reader.h pod5_format/schema_metadata.h pod5_format/read_table_reader.h pod5_format/read_table_schema.h pod5_format/read_table_writer.h pod5_format/read_table_writer_utils.h pod5_format/read_table_utils.h pod5_format/run_info_table_writer.h pod5_format/run_info_table_reader.h pod5_format/run_info_table_schema.h pod5_format/signal_compression.h pod5_format/signal_table_reader.h pod5_format/signal_table_schema.h 
pod5_format/signal_table_writer.h pod5_format/signal_table_utils.h pod5_format/signal_builder.h pod5_format/uuid.h pod5_format/c_api.h pod5_format/expandable_buffer.h pod5_format/file_output_stream.h pod5_format/io_manager.h pod5_format/memory_pool.h pod5_format/result.h pod5_format/dictionary_writer.h pod5_format/schema_field_builder.h pod5_format/schema_utils.h pod5_format/table_reader.h pod5_format/thread_pool.h pod5_format/tuple_utils.h pod5_format/types.h ${CMAKE_CURRENT_BINARY_DIR}/pod5_format/pod5_format_export.h ) set(svb16_headers pod5_format/svb16/svb16.h pod5_format/svb16/common.hpp pod5_format/svb16/decode.hpp pod5_format/svb16/decode_scalar.hpp pod5_format/svb16/decode_x64.hpp pod5_format/svb16/encode.hpp pod5_format/svb16/encode_scalar.hpp pod5_format/svb16/encode_x64.hpp pod5_format/svb16/intrinsics.hpp pod5_format/svb16/shuffle_tables.hpp pod5_format/svb16/simd_detect_x64.hpp ) set_target_properties(pod5_format PROPERTIES POSITION_INDEPENDENT_CODE 1 CXX_STANDARD 20 PUBLIC_HEADER "${public_headers}" ) # Link these libraries publicly when doing a static lib build set(maybe_public_libs arrow::arrow flatbuffers::flatbuffers ) if (BUILD_SHARED_LIB) target_link_libraries(pod5_format PRIVATE ${maybe_public_libs}) else() target_link_libraries(pod5_format PUBLIC ${maybe_public_libs}) endif() target_link_libraries(pod5_format PRIVATE pod5_flatbuffers zstd::zstd ZLIB::ZLIB Threads::Threads ) if(APPLE) find_library(CORE_FOUNDATION CoreFoundation) target_link_libraries(pod5_format PRIVATE ${CORE_FOUNDATION}) endif() target_include_directories(pod5_format PUBLIC ${CMAKE_CURRENT_SOURCE_DIR} ${CMAKE_CURRENT_BINARY_DIR} ) flatbuffers_generate_headers( TARGET pod5_flatbuffers SCHEMAS pod5_format/flatbuffers/footer.fbs INCLUDE_PREFIX "" FLAGS --cpp ) if (NOT MSVC) set(pod5_warning_options -Werror -Wall -Wno-comment -Wno-error=deprecated-declarations -Wno-deprecated-declarations) target_compile_options(pod5_format PRIVATE ${pod5_warning_options}) endif() 
generate_export_header(pod5_format EXPORT_FILE_NAME pod5_format/pod5_format_export.h)

install(
    TARGETS pod5_format
    PUBLIC_HEADER DESTINATION "include/pod5_format"
)
install(
    FILES ${svb16_headers}
    DESTINATION "include/pod5_format/svb16"
)

if (POD5_BUILD_EXAMPLES)
    add_subdirectory(examples)
endif()

if (NOT POD5_DISABLE_TESTS)
    add_subdirectory(test)
endif()

if (BUILD_PYTHON_WHEEL)
    add_subdirectory(pod5_format_pybind)
endif()

================================================
FILE: c++/examples/CMakeLists.txt
================================================
add_executable(find_all_read_ids
    find_all_read_ids.cpp
)

target_link_libraries(find_all_read_ids
    pod5_format
)

# Needs C++17 to use pod5_format/uuid.h
set_target_properties(find_all_read_ids PROPERTIES CXX_STANDARD 17)

add_executable(find_specific_read_ids
    find_specific_read_ids.cpp
)

target_link_libraries(find_specific_read_ids
    pod5_format
)

# Needs C++17 to use pod5_format/uuid.h
set_target_properties(find_specific_read_ids PROPERTIES CXX_STANDARD 17)

add_executable(find_all_read_data
    find_all_read_data.cpp
)

target_link_libraries(find_all_read_data
    pod5_format
)

# Needs C++17 to use pod5_format/uuid.h
set_target_properties(find_all_read_data PROPERTIES CXX_STANDARD 17)

add_executable(find_specific_read_ids_with_signal
    find_specific_read_ids_with_signal.cpp
)

target_link_libraries(find_specific_read_ids_with_signal
    pod5_format
)

# Needs C++17 to use pod5_format/uuid.h
set_target_properties(find_specific_read_ids_with_signal PROPERTIES CXX_STANDARD 17)

================================================
FILE: c++/examples/README.md
================================================
C++ Examples
============

These examples use the POD5 C API to read file data; they are written in C++.

find_all_read_ids
-----------------

Find all the read ids in a given pod5 file, and save them to a text file.
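
Before writing each read id, the example turns the raw id bytes into text (it calls `pod5_format_read_id` for this). As a rough illustration of what that text looks like, here is a minimal, self-contained sketch. `format_read_id_sketch` is a hypothetical helper written for this note, not the library's implementation, and it assumes a read id is 16 raw bytes rendered in the canonical lowercase 8-4-4-4-12 UUID layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the POD5 API): render 16 raw bytes as
// canonical UUID text, two lowercase hex digits per byte.
std::string format_read_id_sketch(std::array<std::uint8_t, 16> const & id)
{
    static char const hex[] = "0123456789abcdef";
    std::string out;
    out.reserve(36);
    for (std::size_t i = 0; i < id.size(); ++i) {
        // Dashes precede bytes 4, 6, 8 and 10 in the 8-4-4-4-12 layout.
        if (i == 4 || i == 6 || i == 8 || i == 10) {
            out.push_back('-');
        }
        out.push_back(hex[id[i] >> 4]);    // high nibble
        out.push_back(hex[id[i] & 0x0f]);  // low nibble
    }
    // 32 hex digits plus 4 dashes.
    assert(out.size() == 36);
    return out;
}
```

Under these assumptions, the bytes `0x00, 0x11, 0x22, ..., 0xff` come out as `00112233-4455-6677-8899-aabbccddeeff`, which is the one-id-per-line form the example writes to `read_ids.txt`.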
find_specific_read_ids ---------------------- Find specific read ids in a given pod5 file, and save their read number to a text file. ================================================ FILE: c++/examples/find_all_read_data.cpp ================================================ #include "pod5_format/c_api.h" #include "pod5_format/uuid.h" #include #include #include #include int main(int argc, char ** argv) { if (argc != 2) { std::cerr << "Expected one argument - an pod5 file to search\n"; return EXIT_FAILURE; } // Initialise the POD5 library: pod5_init(); // Open the file ready for walking: Pod5FileReader_t * file = pod5_open_file(argv[1]); if (!file) { std::cerr << "Failed to open file " << argv[1] << ": " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::size_t batch_count = 0; if (pod5_get_read_batch_count(&batch_count, file) != POD5_OK) { std::cerr << "Failed to query batch count: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::size_t read_count = 0; for (std::size_t batch_index = 0; batch_index < batch_count; ++batch_index) { std::cout << "batch_index: " << batch_index + 1 << "/" << batch_count << "\n"; Pod5ReadRecordBatch_t * batch = nullptr; if (pod5_get_read_batch(&batch, file, batch_index) != POD5_OK) { std::cerr << "Failed to get batch: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::size_t batch_row_count = 0; if (pod5_get_read_batch_row_count(&batch_row_count, batch) != POD5_OK) { std::cerr << "Failed to get batch row count\n"; return EXIT_FAILURE; } for (std::size_t row = 0; row < batch_row_count; ++row) { uint16_t read_table_version = 0; ReadBatchRowInfo_t read_data; if (pod5_get_read_batch_row_info_data( batch, row, READ_BATCH_ROW_INFO_VERSION, &read_data, &read_table_version) != POD5_OK) { std::cerr << "Failed to get read " << row << ": " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } read_count += 1; std::size_t sample_count = 0; pod5_get_read_complete_sample_count(file, batch, row, 
&sample_count); std::vector samples; samples.resize(sample_count); pod5_get_read_complete_signal(file, batch, row, samples.size(), samples.data()); // Run info RunInfoDictData_t * run_info = nullptr; if (pod5_get_run_info(batch, read_data.run_info, &run_info) != POD5_OK) { std::cerr << "Failed to get run info " + std::to_string(read_data.run_info) + " : " + pod5_get_error_string() << "\n"; return EXIT_FAILURE; } pod5_free_run_info(run_info); } if (pod5_free_read_batch(batch) != POD5_OK) { std::cerr << "Failed to release batch\n"; return EXIT_FAILURE; } } std::cout << "Extracted " << read_count << " reads " << "\n"; // Close the reader if (pod5_close_and_free_reader(file) != POD5_OK) { std::cerr << "Failed to close reader: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } // Cleanup the library pod5_terminate(); } ================================================ FILE: c++/examples/find_all_read_ids.cpp ================================================ #include "pod5_format/c_api.h" #include "pod5_format/uuid.h" #include #include #include #include int main(int argc, char ** argv) { if (argc != 2) { std::cerr << "Expected one argument - an pod5 file to search\n"; return EXIT_FAILURE; } // Initialise the POD5 library: pod5_init(); // Open the file ready for walking: Pod5FileReader_t * file = pod5_open_file(argv[1]); if (!file) { std::cerr << "Failed to open file " << argv[1] << ": " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::size_t batch_count = 0; if (pod5_get_read_batch_count(&batch_count, file) != POD5_OK) { std::cerr << "Failed to query batch count: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::string output_path("read_ids.txt"); std::cout << "Writing read ids to " << output_path << "\n"; std::ofstream output_stream(output_path); std::size_t read_count = 0; for (std::size_t batch_index = 0; batch_index < batch_count; ++batch_index) { Pod5ReadRecordBatch_t * batch = nullptr; if (pod5_get_read_batch(&batch, file, 
batch_index) != POD5_OK) { std::cerr << "Failed to get batch: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::size_t batch_row_count = 0; if (pod5_get_read_batch_row_count(&batch_row_count, batch) != POD5_OK) { std::cerr << "Failed to get batch row count\n"; return EXIT_FAILURE; } for (std::size_t row = 0; row < batch_row_count; ++row) { uint16_t read_table_version = 0; ReadBatchRowInfo_t read_data; if (pod5_get_read_batch_row_info_data( batch, row, READ_BATCH_ROW_INFO_VERSION, &read_data, &read_table_version) != POD5_OK) { std::cerr << "Failed to get read " << row << "\n"; return EXIT_FAILURE; } std::array formatted_read_id; pod5_format_read_id(read_data.read_id, formatted_read_id.data()); output_stream << formatted_read_id.data() << "\n"; read_count += 1; std::size_t sample_count = 0; pod5_get_read_complete_sample_count(file, batch, row, &sample_count); std::vector samples; samples.resize(sample_count); pod5_get_read_complete_signal(file, batch, row, samples.size(), samples.data()); } if (pod5_free_read_batch(batch) != POD5_OK) { std::cerr << "Failed to release batch\n"; return EXIT_FAILURE; } } std::cout << "Extracted " << read_count << " read ids into " << output_path << "\n"; // Close the reader if (pod5_close_and_free_reader(file) != POD5_OK) { std::cerr << "Failed to close reader: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } // Cleanup the library pod5_terminate(); } ================================================ FILE: c++/examples/find_specific_read_ids.cpp ================================================ #include "pod5_format/c_api.h" #include "pod5_format/uuid.h" #include #include #include int main(int argc, char ** argv) { if (argc != 3) { std::cerr << "Expected two arguments:\n" << " - an pod5 file to search\n" << " - a file containing newline separated of read ids\n"; return EXIT_FAILURE; } // Initialise the POD5 library: pod5_init(); // Open the file ready for walking: Pod5FileReader_t * file = 
pod5_open_file(argv[1]); if (!file) { std::cerr << "Failed to open file " << argv[1] << ": " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::size_t batch_count = 0; if (pod5_get_read_batch_count(&batch_count, file) != POD5_OK) { std::cerr << "Failed to query batch count: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::vector search_uuids; std::string input_path(argv[2]); try { std::cout << "Reading input read ids from " << input_path << "\n"; std::string line; std::ifstream input_stream(input_path); while (std::getline(input_stream, line)) { auto const uuid = pod5::Uuid::from_string(line); if (!uuid) { std::cerr << '"' << line << "\" is not a valid UUID, ignoring it\n"; } else { search_uuids.push_back(*uuid); } } std::cout << " Read " << search_uuids.size() << " ids from the text file\n"; } catch (std::exception const & e) { std::cerr << "Failed to parse UUID values from " << input_path << ": " << e.what() << "\n"; } std::string output_path("read_ids.txt"); std::cout << "Writing selected read numbers to " << output_path << "\n"; std::ofstream output_stream(output_path); // Plan the most efficient route through the file for the required read ids: std::vector traversal_batch_counts(batch_count); std::vector traversal_row_indices(search_uuids.size()); std::size_t find_success_count = 0; if (pod5_plan_traversal( file, (uint8_t *)search_uuids.data(), search_uuids.size(), traversal_batch_counts.data(), traversal_row_indices.data(), &find_success_count) != POD5_OK) { std::cerr << "Failed to plan traversal of file: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } if (find_success_count != search_uuids.size()) { std::cerr << "Failed to find " << (search_uuids.size() - find_success_count) << " reads\n"; } std::size_t read_count = 0; std::size_t row_offset = 0; // Walk the suggested traversal route, storing read data. 
for (std::size_t batch_index = 0; batch_index < batch_count; ++batch_index) { Pod5ReadRecordBatch_t * batch = nullptr; if (pod5_get_read_batch(&batch, file, batch_index) != POD5_OK) { std::cerr << "Failed to get batch: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::cout << "Processing batch " << (batch_index + 1) << " of " << batch_count << "\n"; for (std::size_t row_index = 0; row_index < traversal_batch_counts[batch_index]; ++row_index) { std::uint32_t batch_row = traversal_row_indices[row_index + row_offset]; uint16_t read_table_version = 0; ReadBatchRowInfo_t read_data; if (pod5_get_read_batch_row_info_data( batch, batch_row, READ_BATCH_ROW_INFO_VERSION, &read_data, &read_table_version) != POD5_OK) { std::cerr << "Failed to get read " << batch_row << "\n"; return EXIT_FAILURE; } output_stream << read_data.read_number << "\n"; read_count += 1; } row_offset += traversal_batch_counts[batch_index]; if (pod5_free_read_batch(batch) != POD5_OK) { std::cerr << "Failed to release batch\n"; return EXIT_FAILURE; } } std::cout << "Extracted " << read_count << " read numbers into " << output_path << "\n"; // Close the reader if (pod5_close_and_free_reader(file) != POD5_OK) { std::cerr << "Failed to close reader: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } // Cleanup the library pod5_terminate(); } ================================================ FILE: c++/examples/find_specific_read_ids_with_signal.cpp ================================================ #include "pod5_format/c_api.h" #include "pod5_format/uuid.h" #include #include #include int main(int argc, char ** argv) { if (argc != 3) { std::cerr << "Expected two arguments:\n" << " - an pod5 file to search\n" << " - a file containing newline separated of read ids\n"; return EXIT_FAILURE; } // Initialise the POD5 library: pod5_init(); // Open the file ready for walking: Pod5FileReader_t * file = pod5_open_file(argv[1]); if (!file) { std::cerr << "Failed to open file " << argv[1] << ": " 
<< pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::size_t batch_count = 0; if (pod5_get_read_batch_count(&batch_count, file) != POD5_OK) { std::cerr << "Failed to query batch count: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::vector search_uuids; std::string input_path(argv[2]); try { std::cout << "Reading input read ids from " << input_path << "\n"; std::string line; std::ifstream input_stream(input_path); while (std::getline(input_stream, line)) { auto const uuid = pod5::Uuid::from_string(line); if (!uuid) { std::cerr << '"' << line << "\" is not a valid UUID, ignoring it\n"; } else { search_uuids.push_back(*uuid); } } std::cout << " Read " << search_uuids.size() << " ids from the text file\n"; } catch (std::exception const & e) { std::cerr << "Failed to parse UUID values from " << input_path << ": " << e.what() << "\n"; } std::string output_path("read_ids.txt"); std::cout << "Writing selected read numbers to " << output_path << "\n"; std::ofstream output_stream(output_path); // Plan the most efficient route through the file for the required read ids: std::vector traversal_batch_counts(batch_count); std::vector traversal_row_indices(search_uuids.size()); std::size_t find_success_count = 0; if (pod5_plan_traversal( file, (uint8_t *)search_uuids.data(), search_uuids.size(), traversal_batch_counts.data(), traversal_row_indices.data(), &find_success_count) != POD5_OK) { std::cerr << "Failed to plan traversal of file: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } if (find_success_count != search_uuids.size()) { std::cerr << "Failed to find " << (search_uuids.size() - find_success_count) << " reads\n"; } std::size_t read_count = 0; std::size_t samples_read = 0; std::size_t row_offset = 0; // Walk the suggested traversal route, storing read data. 
for (std::size_t batch_index = 0; batch_index < batch_count; ++batch_index) { Pod5ReadRecordBatch_t * batch = nullptr; if (pod5_get_read_batch(&batch, file, batch_index) != POD5_OK) { std::cerr << "Failed to get batch: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } std::cout << "Processing batch " << (batch_index + 1) << " of " << batch_count << "\n"; for (std::size_t row_index = 0; row_index < traversal_batch_counts[batch_index]; ++row_index) { std::uint32_t batch_row = traversal_row_indices[row_index + row_offset]; uint16_t read_table_version = 0; ReadBatchRowInfo_t read_data; if (pod5_get_read_batch_row_info_data( batch, batch_row, READ_BATCH_ROW_INFO_VERSION, &read_data, &read_table_version) != POD5_OK) { std::cerr << "Failed to get read " << batch_row << "\n"; return EXIT_FAILURE; } std::size_t sample_count = 0; pod5_get_read_complete_sample_count(file, batch, batch_row, &sample_count); std::vector samples; samples.resize(sample_count); pod5_get_read_complete_signal(file, batch, batch_row, samples.size(), samples.data()); std::int64_t samples_sum = 0; for (std::size_t i = 0; i < samples.size(); ++i) { samples_sum += samples[i]; } output_stream << read_data.calibration_offset << " " << read_data.calibration_scale << " " << samples_sum << "\n"; read_count += 1; samples_read += samples.size(); } row_offset += traversal_batch_counts[batch_index]; if (pod5_free_read_batch(batch) != POD5_OK) { std::cerr << "Failed to release batch\n"; return EXIT_FAILURE; } } std::cout << "Extracted " << read_count << " reads and " << samples_read << " samples into " << output_path << "\n"; // Close the reader if (pod5_close_and_free_reader(file) != POD5_OK) { std::cerr << "Failed to close reader: " << pod5_get_error_string() << "\n"; return EXIT_FAILURE; } // Cleanup the library pod5_terminate(); } ================================================ FILE: c++/pod5_format/async_signal_loader.cpp ================================================ #include 
"pod5_format/async_signal_loader.h" namespace pod5 { std::size_t const AsyncSignalLoader::MINIMUM_JOB_SIZE = 50; AsyncSignalLoader::AsyncSignalLoader( std::shared_ptr const & reader, SamplesMode samples_mode, gsl::span const & batch_counts, gsl::span const & batch_rows, std::size_t worker_count, std::size_t max_pending_batches) : m_reader(reader) , m_samples_mode(samples_mode) , m_max_pending_batches(max_pending_batches) , m_reads_batch_count(m_reader->num_read_record_batches()) , m_batch_counts(batch_counts) , m_total_batch_count_so_far(0) , m_batch_rows(batch_rows) , m_worker_job_size( std::max( MINIMUM_JOB_SIZE, m_batch_rows.size() / (m_reads_batch_count * worker_count * 2))) , m_current_batch(0) , m_finished(false) , m_has_error(false) , m_batches_size(0) { // Setup first batch: { std::unique_lock l(m_worker_sync); auto setup_result = setup_next_in_progress_batch(l); if (!setup_result.ok()) { set_error(setup_result); } } // Kick off workers on jobs: for (std::size_t i = 0; i < worker_count; ++i) { m_workers.emplace_back([&] { run_worker(); }); } } AsyncSignalLoader::~AsyncSignalLoader() { m_finished = true; // Wait for all workers to complete: for (std::size_t i = 0; i < m_workers.size(); ++i) { m_workers[i].join(); } } Result> AsyncSignalLoader::release_next_batch( std::optional timeout) { std::shared_ptr batch; // Return any error, if one has occurred: if (m_has_error) { return error(); } // First wait until there is a batch available: do { std::unique_lock l(m_batches_sync); // Wait until there is a batch available: m_batch_done.wait_until( l, timeout.value_or(std::chrono::steady_clock::now() + std::chrono::seconds(5)), [&] { return m_batches.size() || m_finished || m_has_error; }); // Grab a batch if one exists (note error or user destroying us might have happened instead): if (!m_batches.empty()) { batch = std::move(m_batches.front()); assert(batch); m_batches.pop_front(); m_batches_size -= 1; break; } if (timeout && std::chrono::steady_clock::now() > 
*timeout) { return nullptr; } } while (!m_finished && !m_has_error); // Return any error, if one has occurred during our wait: if (m_has_error) { return error(); } // If we got a batch, wait for all work to be finished, then return it: if (batch) { // Wait if we are ahead of the loader: while (!batch->is_complete()) { std::this_thread::sleep_for(std::chrono::milliseconds(1)); } return batch->release_data(); } // No more data - return null. return nullptr; } void AsyncSignalLoader::set_error(pod5::Status status) { assert(!status.ok()); { std::lock_guard l{m_error_mutex}; m_error = std::move(status); } m_has_error = true; } pod5::Status AsyncSignalLoader::error() const { std::lock_guard l{m_error_mutex}; return m_error; } void AsyncSignalLoader::run_worker() { // Continue to work while there is work to do, and no error has occurred while (!m_finished && !m_has_error) { std::shared_ptr batch; std::uint32_t row_start = 0; // Try to secure some new work: { std::unique_lock l(m_worker_sync); // If we have run out of batches to process, release anything in progress and return: if (m_current_batch >= m_reads_batch_count) { release_in_progress_batch(); break; } // If we have more batches than asked for complete that have // not been queried, wait for it to get taken: if (m_batches_size > m_max_pending_batches) { l.unlock(); std::this_thread::sleep_for(std::chrono::milliseconds(10)); continue; } // Now, if we have no work left in the current batch, release that: if (!m_in_progress_batch->has_work_left()) { if (!m_batch_counts.empty()) { m_total_batch_count_so_far += m_batch_counts[m_current_batch]; } // Release the current batch: release_in_progress_batch(); // Then try to setup the next batch, if one exists: m_current_batch += 1; if (m_current_batch >= m_reads_batch_count) { // No more work to do. 
m_finished = true; break; } auto setup_result = setup_next_in_progress_batch(l); if (!setup_result.ok()) { set_error(setup_result); return; } } // Finally, tell the work package we have secured we are starting to do some work: batch = m_in_progress_batch; row_start = m_in_progress_batch->start_rows(l, m_worker_job_size); } // Now execute the work, for all the rows we said we would: std::uint32_t const row_end = std::min(row_start + m_worker_job_size, batch->job_row_count()); do_work(batch, row_start, row_end); // And report the work completed for anyone waiting: batch->complete_rows(m_worker_job_size); } } void AsyncSignalLoader::do_work( std::shared_ptr const & batch, std::uint32_t row_start, std::uint32_t row_end) { // First secure the sample counts column for the batch we are processing: auto signal_column = batch->read_batch().signal_column(); // And record where we are starting in the batch rows array, if it exists: for (std::uint32_t i = row_start; i < row_end; ++i) { // Find the actual batch row to query - we may be working on a subset of batch data: auto const actual_batch_row = batch->get_batch_row_to_query(i); // Get the signal row data for the read: auto const signal_rows = std::static_pointer_cast( signal_column->value_slice(actual_batch_row)); auto const signal_rows_span = gsl::make_span(signal_rows->raw_values(), signal_rows->length()); // Find the sample count for these rows: auto sample_count_result = m_reader->extract_sample_count(signal_rows_span); if (!sample_count_result.ok()) { set_error(sample_count_result.status()); return; } std::uint64_t sample_count = *sample_count_result; // And query the samples if that has been requested: std::vector samples; if (m_samples_mode == SamplesMode::Samples) { samples.resize(sample_count); auto samples_result = m_reader->extract_samples(signal_rows_span, gsl::make_span(samples)); if (!samples_result.ok()) { set_error(std::move(samples_result)); return; } sample_count = samples.size(); } // Store the queried 
data into the batch: batch->set_samples(i, sample_count, std::move(samples)); } } Status AsyncSignalLoader::setup_next_in_progress_batch(std::unique_lock & lock) { assert(!m_in_progress_batch); ARROW_ASSIGN_OR_RAISE(auto read_batch, m_reader->read_read_record_batch(m_current_batch)); std::size_t row_count = read_batch.num_rows(); gsl::span next_specific_batch_rows; if (!m_batch_counts.empty()) { row_count = m_batch_counts[m_current_batch]; if (!m_batch_rows.empty()) { next_specific_batch_rows = m_batch_rows.subspan(m_total_batch_count_so_far, row_count); } } m_in_progress_batch = std::make_shared( m_current_batch, row_count, next_specific_batch_rows, std::move(read_batch)); return Status::OK(); } void AsyncSignalLoader::release_in_progress_batch() { if (m_in_progress_batch) { assert(!m_in_progress_batch->has_work_left()); std::lock_guard l(m_batches_sync); m_batches.emplace_back(std::move(m_in_progress_batch)); m_batches_size += 1; m_batch_done.notify_all(); } } } // namespace pod5 ================================================ FILE: c++/pod5_format/async_signal_loader.h ================================================ #pragma once #include "pod5_format/file_reader.h" #include "pod5_format/pod5_format_export.h" #include "pod5_format/read_table_reader.h" #include "pod5_format/signal_table_reader.h" #include #include #include #include #include #include #include #include namespace pod5 { class POD5_FORMAT_EXPORT CachedBatchSignalData { public: CachedBatchSignalData(std::uint32_t batch_index, std::size_t entry_count) : m_batch_index(batch_index) , m_sample_counts(entry_count) , m_samples(entry_count) { } std::uint32_t batch_index() const { return m_batch_index; } /// Find a list of sample counts for all requested batch rows. std::vector const & sample_count() const { return m_sample_counts; } /// Find a list of signal samples counts for all requested batch rows. 
std::vector> const & samples() const { return m_samples; } void set_samples(std::size_t row, std::uint64_t sample_count, std::vector && samples) { m_sample_counts[row] = sample_count; m_samples[row] = std::move(samples); } private: std::uint32_t m_batch_index; std::vector m_sample_counts; std::vector> m_samples; }; class POD5_FORMAT_EXPORT SignalCacheWorkPackage { public: SignalCacheWorkPackage( std::uint32_t batch_index, std::size_t job_row_count, gsl::span const & specific_job_rows, pod5::ReadTableRecordBatch && read_batch) : m_job_row_count(job_row_count) , m_specific_job_rows(specific_job_rows) , m_next_row_to_start(0) , m_completed_rows(0) , m_cached_data(std::make_unique(batch_index, m_job_row_count)) , m_read_batch(std::move(read_batch)) { } std::uint32_t job_row_count() const { return m_job_row_count; } void set_samples(std::size_t row, std::uint64_t sample_count, std::vector && samples) { m_cached_data->set_samples(row, sample_count, std::move(samples)); } std::unique_ptr release_data() { return std::move(m_cached_data); } pod5::ReadTableRecordBatch const & read_batch() const { return m_read_batch; } // Find the actual batch row to query, for a given job row index. 
std::uint32_t get_batch_row_to_query(std::uint32_t job_row_index) const { // We allow the caller to specify a subset of batch rows to iterate: if (!m_specific_job_rows.empty()) { return m_specific_job_rows[job_row_index]; } return job_row_index; } std::uint32_t start_rows(std::unique_lock & l, std::size_t row_count) { auto row = m_next_row_to_start; m_next_row_to_start += row_count; return row; } void complete_rows(std::uint32_t row_count) { m_completed_rows += row_count; } bool has_work_left() const { return m_next_row_to_start < m_job_row_count; } bool is_complete() const { return m_completed_rows.load() >= m_job_row_count; } private: std::size_t m_job_row_count; gsl::span m_specific_job_rows; std::uint32_t m_next_row_to_start; std::atomic m_completed_rows; std::unique_ptr m_cached_data; pod5::ReadTableRecordBatch m_read_batch; }; class POD5_FORMAT_EXPORT AsyncSignalLoader { public: // Minimum number of tasks one thread will do in a batch. static std::size_t const MINIMUM_JOB_SIZE; enum class SamplesMode { NoSamples, Samples, }; AsyncSignalLoader( std::shared_ptr const & reader, SamplesMode samples_mode, gsl::span const & batch_counts, gsl::span const & batch_rows, std::size_t worker_count = std::thread::hardware_concurrency(), std::size_t max_pending_batches = 10); ~AsyncSignalLoader(); /// Find if all work is complete in the loader. bool is_finished() const { return m_finished; } /// Get the next batch of loaded signal; always returns the next consecutive signal batch. /// \note Returns nullptr when a timeout occurs, or if all data is exhausted. Result> release_next_batch( std::optional timeout = std::nullopt); private: /// Set an error code that will stop all async loading and return an error to the caller. void set_error(pod5::Status status); pod5::Status error() const; void run_worker(); void do_work( std::shared_ptr const & batch, std::uint32_t row_start, std::uint32_t row_end); /// Set up a new batch to contain in-progress work. 
/// \param lock A lock held on m_worker_sync. /// \note There must not be a batch already in progress. /// \note m_current_batch is used as the index of the next batch to begin. Status setup_next_in_progress_batch(std::unique_lock & lock); /// Release the currently in progress batch to readers, if it exists. /// \note This call locks m_batches_sync internally. /// \note The batch must not have any work remaining to start, but can be completing already started work. /// \note This call notifies the condition variable to alert readers that new data is available. void release_in_progress_batch(); std::shared_ptr m_reader; SamplesMode m_samples_mode; std::size_t m_max_pending_batches; std::size_t m_reads_batch_count; gsl::span m_batch_counts; std::size_t m_total_batch_count_so_far; gsl::span m_batch_rows; std::uint32_t const m_worker_job_size; std::mutex m_worker_sync; std::condition_variable m_batch_done; std::uint32_t m_current_batch; std::atomic m_finished; std::atomic m_has_error; mutable std::mutex m_error_mutex; pod5::Status m_error; std::shared_ptr m_in_progress_batch; std::mutex m_batches_sync; std::atomic m_batches_size; std::deque> m_batches; std::vector m_workers; }; } // namespace pod5 ================================================ FILE: c++/pod5_format/c_api.cpp ================================================ #include "pod5_format/c_api.h" #include "pod5_format/file_reader.h" #include "pod5_format/file_writer.h" #include "pod5_format/read_table_reader.h" #include "pod5_format/signal_compression.h" #include "pod5_format/signal_table_reader.h" #include "pod5_format/uuid.h" #include #include #include #include #include #include #include #include //--------------------------------------------------------------------------------------------------------------------- struct Pod5FileReader { Pod5FileReader(std::shared_ptr && reader_) : reader(std::move(reader_)) {} std::shared_ptr reader; }; struct Pod5FileWriter { Pod5FileWriter(std::unique_ptr && writer_) : 
writer(std::move(writer_)) {} std::unique_ptr writer; }; struct Pod5ReadRecordBatch { Pod5ReadRecordBatch( pod5::ReadTableRecordBatch && batch_, std::shared_ptr reader) : batch(std::move(batch_)) , reader(std::move(reader)) { } pod5::ReadTableRecordBatch const batch; std::shared_ptr reader; }; namespace { //--------------------------------------------------------------------------------------------------------------------- thread_local pod5_error_t g_pod5_error_no; thread_local std::string g_pod5_error_string; void pod5_set_error(arrow::Status status) { g_pod5_error_no = (pod5_error_t)status.code(); g_pod5_error_string = status.ToString(); } void pod5_reset_error() { g_pod5_error_no = pod5_error_t::POD5_OK; g_pod5_error_string.clear(); } #define POD5_C_RETURN_NOT_OK(result) \ do { \ ::arrow::Status __s = (result); \ if (!__s.ok()) { \ pod5_set_error(__s); \ return g_pod5_error_no; \ } \ } while (0) #define POD5_C_ASSIGN_OR_RAISE_IMPL(result_name, lhs, rexpr) \ auto && result_name = (rexpr); \ if (!(result_name).ok()) { \ pod5_set_error((result_name).status()); \ return g_pod5_error_no; \ } \ lhs = std::move(result_name).ValueUnsafe(); #define POD5_C_ASSIGN_OR_RAISE(lhs, rexpr) \ POD5_C_ASSIGN_OR_RAISE_IMPL( \ ARROW_ASSIGN_OR_RAISE_NAME(_error_or_value, __COUNTER__), lhs, rexpr); //--------------------------------------------------------------------------------------------------------------------- bool check_string_not_empty(char const * str) { if (!str) { pod5_set_error(arrow::Status::Invalid("null string passed to C API")); return false; } if (strlen(str) == 0) { pod5_set_error(arrow::Status::Invalid("empty string passed to C API")); return false; } return true; } bool check_not_null(void const * ptr) { if (!ptr) { pod5_set_error(arrow::Status::Invalid("null passed to C API")); return false; } return true; } bool check_file_not_null(void const * file) { if (!file) { pod5_set_error(arrow::Status::Invalid("null file passed to C API")); return false; } return true; } 
bool check_output_pointer_not_null(void const * output) { if (!output) { pod5_set_error(arrow::Status::Invalid("null output parameter passed to C API")); return false; } return true; } //--------------------------------------------------------------------------------------------------------------------- pod5::FileWriterOptions make_internal_writer_options(Pod5WriterOptions const * options) { pod5::FileWriterOptions internal_options; if (options) { if (options->max_signal_chunk_size != 0) { internal_options.set_max_signal_chunk_size(options->max_signal_chunk_size); } if (options->signal_compression_type == UNCOMPRESSED_SIGNAL) { internal_options.set_signal_type(pod5::SignalType::UncompressedSignal); } if (options->signal_table_batch_size != 0) { internal_options.set_signal_table_batch_size(options->signal_table_batch_size); } if (options->read_table_batch_size != 0) { internal_options.set_read_table_batch_size(options->read_table_batch_size); } } return internal_options; } pod5::FileReaderOptions make_internal_reader_options(Pod5ReaderOptions const & options) { pod5::FileReaderOptions internal_options; internal_options.set_force_disable_file_mapping(options.force_disable_file_mapping); return internal_options; } } // namespace extern "C" { //--------------------------------------------------------------------------------------------------------------------- pod5_error_t pod5_init() { pod5_reset_error(); POD5_C_RETURN_NOT_OK(pod5::register_extension_types()); return POD5_OK; } pod5_error_t pod5_terminate() { pod5_reset_error(); POD5_C_RETURN_NOT_OK(pod5::unregister_extension_types()); return POD5_OK; } pod5_error_t pod5_get_error_no() { return g_pod5_error_no; } char const * pod5_get_error_string() { return g_pod5_error_string.c_str(); } //--------------------------------------------------------------------------------------------------------------------- Pod5FileReader * pod5_open_file(char const * filename) { Pod5ReaderOptions_t options{}; return 
pod5_open_file_options(filename, &options); } Pod5FileReader * pod5_open_file_options(char const * filename, Pod5ReaderOptions_t const * options) { pod5_reset_error(); if (!check_string_not_empty(filename) || !check_not_null(options)) { return nullptr; } auto internal_reader = pod5::open_file_reader(filename, make_internal_reader_options(*options)); if (!internal_reader.ok()) { pod5_set_error(internal_reader.status()); return nullptr; } auto reader = std::make_unique(std::move(*internal_reader)); return reader.release(); } pod5_error_t pod5_close_and_free_reader(Pod5FileReader * file) { pod5_reset_error(); std::unique_ptr ptr{file}; ptr.reset(); return POD5_OK; } pod5_error_t pod5_get_file_info(Pod5FileReader_t const * reader, FileInfo * file_info) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_output_pointer_not_null(file_info)) { return g_pod5_error_no; } auto const metadata = reader->reader->schema_metadata(); metadata.file_identifier.to_c_array(file_info->file_identifier); file_info->version.major = metadata.writing_pod5_version.major_version(); file_info->version.minor = metadata.writing_pod5_version.minor_version(); file_info->version.revision = metadata.writing_pod5_version.revision_version(); return POD5_OK; } pod5_error_t pod5_get_file_read_table_location( Pod5FileReader_t const * reader, EmbeddedFileData_t * file_data) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_output_pointer_not_null(file_data)) { return g_pod5_error_no; } auto const & read_table_location = reader->reader->read_table_location(); file_data->file_name = read_table_location.file_path.c_str(); file_data->offset = read_table_location.offset; file_data->length = read_table_location.size; return POD5_OK; } pod5_error_t pod5_get_file_signal_table_location( Pod5FileReader_t const * reader, EmbeddedFileData_t * file_data) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_output_pointer_not_null(file_data)) { return g_pod5_error_no; } auto 
const & signal_table_location = reader->reader->signal_table_location(); file_data->file_name = signal_table_location.file_path.c_str(); file_data->offset = signal_table_location.offset; file_data->length = signal_table_location.size; return POD5_OK; } pod5_error_t pod5_get_file_run_info_table_location( Pod5FileReader_t const * reader, EmbeddedFileData_t * file_data) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_output_pointer_not_null(file_data)) { return g_pod5_error_no; } auto const & run_info_table_location = reader->reader->run_info_table_location(); file_data->file_name = run_info_table_location.file_path.c_str(); file_data->offset = run_info_table_location.offset; file_data->length = run_info_table_location.size; return POD5_OK; } pod5_error_t pod5_get_read_count(Pod5FileReader_t const * reader, size_t * count) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_output_pointer_not_null(count)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(*count, reader->reader->read_count()); return POD5_OK; } pod5_error_t pod5_get_read_ids(Pod5FileReader_t const * reader, size_t count, read_id_t * read_ids) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_output_pointer_not_null(read_ids)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(auto read_count, reader->reader->read_count()); if (count < read_count) { pod5_set_error(arrow::Status::Invalid("array too short to receive read ids")); return g_pod5_error_no; } std::size_t count_so_far = 0; for (std::size_t i = 0; i < reader->reader->num_read_record_batches(); ++i) { POD5_C_ASSIGN_OR_RAISE(auto const batch, reader->reader->read_read_record_batch(i)); auto const read_id_column = batch.read_id_column(); auto raw_data = reinterpret_cast(read_id_column->raw_values()); std::copy( raw_data, raw_data + (read_id_column->length() * sizeof(read_id_t)), reinterpret_cast(read_ids + count_so_far)); count_so_far += read_id_column->length(); } return POD5_OK; } pod5_error_t 
pod5_plan_traversal( Pod5FileReader_t const * reader, uint8_t const * read_id_array, size_t read_id_count, uint32_t * batch_counts, uint32_t * batch_rows, size_t * find_success_count_out) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_not_null(read_id_array) || !check_output_pointer_not_null(batch_counts) || !check_output_pointer_not_null(batch_rows) || !check_output_pointer_not_null(find_success_count_out)) { return g_pod5_error_no; } auto search_input = pod5::ReadIdSearchInput( gsl::make_span(reinterpret_cast(read_id_array), read_id_count)); POD5_C_ASSIGN_OR_RAISE( auto find_success_count, reader->reader->search_for_read_ids( search_input, gsl::make_span(batch_counts, reader->reader->num_read_record_batches()), gsl::make_span(batch_rows, read_id_count))); // TODO: on MAJOR_VERSION bump drop this out param and do the check internally. *find_success_count_out = find_success_count; return POD5_OK; } pod5_error_t pod5_get_read_batch_count(size_t * count, Pod5FileReader const * reader) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_output_pointer_not_null(count)) { return g_pod5_error_no; } *count = reader->reader->num_read_record_batches(); return POD5_OK; } pod5_error_t pod5_get_read_batch(Pod5ReadRecordBatch ** batch, Pod5FileReader const * reader, size_t index) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_output_pointer_not_null(batch)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(auto internal_batch, reader->reader->read_read_record_batch(index)); auto wrapped_batch = std::make_unique(std::move(internal_batch), reader->reader); *batch = wrapped_batch.release(); return POD5_OK; } pod5_error_t pod5_free_read_batch(Pod5ReadRecordBatch * batch) { pod5_reset_error(); std::unique_ptr ptr{batch}; ptr.reset(); return POD5_OK; } pod5_error_t pod5_get_read_batch_row_count(size_t * count, Pod5ReadRecordBatch const * batch) { pod5_reset_error(); if (!check_not_null(batch) || !check_output_pointer_not_null(count)) 
{ return g_pod5_error_no; } *count = batch->batch.num_rows(); return POD5_OK; } static pod5_error_t check_row_index_and_set_error(size_t row, int64_t batch_size) { if (row > static_cast(std::numeric_limits::max()) || static_cast(row) >= batch_size) { pod5_set_error( arrow::Status::IndexError( "Invalid index into batch. Index ", row, " with batch size ", batch_size)); return g_pod5_error_no; } return POD5_OK; } pod5_error_t pod5_get_read_batch_row_info_data( Pod5ReadRecordBatch_t const * batch, size_t row, uint16_t struct_version, void * row_data, uint16_t * read_table_version) { pod5_reset_error(); if (!check_not_null(batch) || !check_output_pointer_not_null(row_data) || !check_output_pointer_not_null(read_table_version)) { return g_pod5_error_no; } static_assert( READ_BATCH_ROW_INFO_VERSION == READ_BATCH_ROW_INFO_VERSION_4, "New versions must be explicitly loaded"); auto load_common_v3_v4_fields = [](pod5::ReadTableRecordColumns const & cols, std::size_t row, auto * typed_row_data) { // Inform the caller of the version of the input table. 
if (check_row_index_and_set_error(row, cols.read_id->length()) != POD5_OK) { return g_pod5_error_no; } auto read_id_val = cols.read_id->Value(row); read_id_val.to_c_array(typed_row_data->read_id); typed_row_data->read_number = cols.read_number->Value(row); typed_row_data->start_sample = cols.start_sample->Value(row); typed_row_data->median_before = cols.median_before->Value(row); typed_row_data->channel = cols.channel->Value(row); typed_row_data->well = cols.well->Value(row); auto const & pore_type_col = cols.pore_type->indices(); typed_row_data->pore_type = static_cast(*pore_type_col).Value(row); typed_row_data->calibration_offset = cols.calibration_offset->Value(row); typed_row_data->calibration_scale = cols.calibration_scale->Value(row); auto const & end_reason_col = cols.end_reason->indices(); typed_row_data->end_reason = static_cast(*end_reason_col).Value(row); typed_row_data->end_reason_forced = cols.end_reason_forced->Value(row); auto const & run_info_col = cols.run_info->indices(); typed_row_data->run_info = static_cast(*run_info_col).Value(row); typed_row_data->num_minknow_events = cols.num_minknow_events->Value(row); typed_row_data->tracked_scaling_scale = cols.tracked_scaling_scale->Value(row); typed_row_data->tracked_scaling_shift = cols.tracked_scaling_shift->Value(row); typed_row_data->predicted_scaling_scale = cols.predicted_scaling_scale->Value(row); typed_row_data->predicted_scaling_shift = cols.predicted_scaling_shift->Value(row); typed_row_data->num_reads_since_mux_change = cols.num_reads_since_mux_change->Value(row); typed_row_data->time_since_mux_change = cols.time_since_mux_change->Value(row); typed_row_data->signal_row_count = cols.signal->value_length(row); typed_row_data->num_samples = cols.num_samples->Value(row); return POD5_OK; }; if (struct_version == READ_BATCH_ROW_INFO_VERSION_3) { auto typed_row_data = static_cast(row_data); POD5_C_ASSIGN_OR_RAISE(auto cols, batch->batch.columns()); *read_table_version = cols.table_version.as_int(); 
auto result = load_common_v3_v4_fields(cols, row, typed_row_data); if (result != POD5_OK) { return result; } } else if (struct_version == READ_BATCH_ROW_INFO_VERSION_4) { auto typed_row_data = static_cast(row_data); POD5_C_ASSIGN_OR_RAISE(auto cols, batch->batch.columns()); *read_table_version = cols.table_version.as_int(); auto result = load_common_v3_v4_fields(cols, row, typed_row_data); if (result != POD5_OK) { return result; } // This is the only difference between v3 and v4. typed_row_data->open_pore_level = cols.open_pore_level->Value(row); } else { pod5_set_error( arrow::Status::Invalid("Invalid struct version '", struct_version, "' passed")); return g_pod5_error_no; } return POD5_OK; } pod5_error_t pod5_get_signal_row_indices( Pod5ReadRecordBatch const * batch, size_t row, int64_t signal_row_indices_count, uint64_t * signal_row_indices) { pod5_reset_error(); if (!check_not_null(batch) || !check_output_pointer_not_null(signal_row_indices)) { return g_pod5_error_no; } auto const signal_col = batch->batch.signal_column(); if (check_row_index_and_set_error(row, signal_col->length()) != POD5_OK) { return g_pod5_error_no; } auto const row_data = std::static_pointer_cast(signal_col->value_slice(row)); if (signal_row_indices_count != row_data->length()) { pod5_set_error( pod5::Status::Invalid( "Incorrect number of signal indices, expected ", row_data->length(), " received ", signal_row_indices_count)); return g_pod5_error_no; } for (std::int64_t i = 0; i < signal_row_indices_count; ++i) { signal_row_indices[i] = row_data->Value(i); } return POD5_OK; } pod5_error_t pod5_get_calibration_extra_info( Pod5ReadRecordBatch_t const * batch, size_t row, CalibrationExtraData_t * calibration_extra_data) { pod5_reset_error(); if (!check_not_null(batch) || !check_output_pointer_not_null(calibration_extra_data)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(auto cols, batch->batch.columns()); if (check_row_index_and_set_error(row, cols.calibration_scale->length()) != 
POD5_OK) { return g_pod5_error_no; } auto scale = cols.calibration_scale->Value(row); auto const & run_info_indices = cols.run_info->indices(); auto const run_info_dict_index = static_cast(*run_info_indices).Value(row); POD5_C_ASSIGN_OR_RAISE( auto const acquisition_id, batch->batch.get_run_info(run_info_dict_index)); POD5_C_ASSIGN_OR_RAISE(auto run_info_data, batch->reader->find_run_info(acquisition_id)); calibration_extra_data->digitisation = run_info_data->adc_max - run_info_data->adc_min + 1; calibration_extra_data->range = scale * calibration_extra_data->digitisation; return POD5_OK; } namespace { struct RunInfoDataCHelper : public RunInfoDictData { struct InternalMapHelper { std::vector keys; std::vector values; }; RunInfoDataCHelper(std::shared_ptr && internal_data_) : internal_data(std::move(internal_data_)) { acquisition_id = internal_data->acquisition_id.c_str(); acquisition_start_time_ms = internal_data->acquisition_start_time; adc_max = internal_data->adc_max; adc_min = internal_data->adc_min; context_tags = map_to_c(internal_data->context_tags, context_tags_helper); experiment_name = internal_data->experiment_name.c_str(); flow_cell_id = internal_data->flow_cell_id.c_str(); flow_cell_product_code = internal_data->flow_cell_product_code.c_str(); protocol_name = internal_data->protocol_name.c_str(); protocol_run_id = internal_data->protocol_run_id.c_str(); protocol_start_time_ms = internal_data->protocol_start_time; sample_id = internal_data->sample_id.c_str(); sample_rate = internal_data->sample_rate; sequencing_kit = internal_data->sequencing_kit.c_str(); sequencer_position = internal_data->sequencer_position.c_str(); sequencer_position_type = internal_data->sequencer_position_type.c_str(); software = internal_data->software.c_str(); system_name = internal_data->system_name.c_str(); system_type = internal_data->system_type.c_str(); tracking_id = map_to_c(internal_data->tracking_id, tracking_id_helper); } KeyValueData map_to_c(pod5::RunInfoData::MapType 
const & map, InternalMapHelper & helper) { helper.keys.reserve(map.size()); helper.values.reserve(map.size()); for (auto const & item : map) { helper.keys.push_back(item.first.c_str()); helper.values.push_back(item.second.c_str()); } KeyValueData result; result.size = helper.keys.size(); result.keys = helper.keys.data(); result.values = helper.values.data(); return result; } std::shared_ptr internal_data; InternalMapHelper context_tags_helper; InternalMapHelper tracking_id_helper; }; } // namespace pod5_error_t pod5_get_run_info( Pod5ReadRecordBatch const * batch, int16_t run_info, RunInfoDictData ** run_info_data) { pod5_reset_error(); if (!check_not_null(batch) || !check_output_pointer_not_null(run_info_data)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(auto const acquisition_id, batch->batch.get_run_info(run_info)); POD5_C_ASSIGN_OR_RAISE(auto internal_data, batch->reader->find_run_info(acquisition_id)); auto data = std::make_unique(std::move(internal_data)); *run_info_data = data.release(); return POD5_OK; } pod5_error_t pod5_get_file_run_info( Pod5FileReader_t const * file, run_info_index_t run_info_index, RunInfoDictData_t ** run_info_data) { pod5_reset_error(); if (!check_file_not_null(file) || !check_output_pointer_not_null(run_info_data)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(auto internal_data, file->reader->get_run_info(run_info_index)); auto data = std::make_unique(std::move(internal_data)); *run_info_data = data.release(); return POD5_OK; } pod5_error_t pod5_free_run_info(RunInfoDictData_t * run_info_data) { pod5_reset_error(); std::unique_ptr helper(static_cast(run_info_data)); helper.reset(); return POD5_OK; } pod5_error_t pod5_release_run_info(RunInfoDictData * run_info_data) { return pod5_free_run_info(run_info_data); } pod5_error_t pod5_get_file_run_info_count( Pod5FileReader_t const * file, run_info_index_t * run_info_count) { pod5_reset_error(); if (!check_file_not_null(file) || !check_output_pointer_not_null(run_info_count)) 
{ return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(*run_info_count, file->reader->get_run_info_count()); return POD5_OK; } pod5_error_t pod5_get_end_reason( Pod5ReadRecordBatch_t const * batch, int16_t end_reason, pod5_end_reason * end_reason_value, char * end_reason_string_value, size_t * end_reason_string_value_size) { pod5_reset_error(); if (!check_not_null(batch) || !check_output_pointer_not_null(end_reason_value) || !check_output_pointer_not_null(end_reason_string_value) || !check_output_pointer_not_null(end_reason_string_value_size)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(auto const end_reason_val, batch->batch.get_end_reason(end_reason)); auto const input_buffer_len = *end_reason_string_value_size; *end_reason_string_value_size = end_reason_val.second.size() + 1; if (end_reason_val.second.size() >= input_buffer_len) { return POD5_ERROR_STRING_NOT_LONG_ENOUGH; } *end_reason_value = POD5_END_REASON_UNKNOWN; switch (end_reason_val.first) { case pod5::ReadEndReason::mux_change: *end_reason_value = POD5_END_REASON_MUX_CHANGE; break; case pod5::ReadEndReason::unblock_mux_change: *end_reason_value = POD5_END_REASON_UNBLOCK_MUX_CHANGE; break; case pod5::ReadEndReason::data_service_unblock_mux_change: *end_reason_value = POD5_END_REASON_DATA_SERVICE_UNBLOCK_MUX_CHANGE; break; case pod5::ReadEndReason::signal_positive: *end_reason_value = POD5_END_REASON_SIGNAL_POSITIVE; break; case pod5::ReadEndReason::signal_negative: *end_reason_value = POD5_END_REASON_SIGNAL_NEGATIVE; break; case pod5::ReadEndReason::api_request: *end_reason_value = POD5_END_REASON_API_REQUEST; break; case pod5::ReadEndReason::device_data_error: *end_reason_value = POD5_END_REASON_DEVICE_DATA_ERROR; break; case pod5::ReadEndReason::analysis_config_change: *end_reason_value = POD5_END_REASON_ANALYSIS_CONFIG_CHANGE; break; case pod5::ReadEndReason::paused: *end_reason_value = POD5_END_REASON_PAUSED; break; case pod5::ReadEndReason::unknown: *end_reason_value = POD5_END_REASON_UNKNOWN; 
break; } std::copy(end_reason_val.second.begin(), end_reason_val.second.end(), end_reason_string_value); // Terminate directly after the copied characters; the reported size already includes the terminator. end_reason_string_value[end_reason_val.second.size()] = '\0'; return POD5_OK; } pod5_error_t pod5_get_pore_type( Pod5ReadRecordBatch_t const * batch, int16_t pore_type, char * pore_type_string_value, size_t * pore_type_string_value_size) { pod5_reset_error(); if (!check_not_null(batch) || !check_output_pointer_not_null(pore_type_string_value) || !check_output_pointer_not_null(pore_type_string_value_size)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(auto const pore_type_str, batch->batch.get_pore_type(pore_type)); auto const input_buffer_len = *pore_type_string_value_size; *pore_type_string_value_size = pore_type_str.size() + 1; if (pore_type_str.size() >= input_buffer_len) { return POD5_ERROR_STRING_NOT_LONG_ENOUGH; } std::copy(pore_type_str.begin(), pore_type_str.end(), pore_type_string_value); // Terminate directly after the copied characters; the reported size already includes the terminator. pore_type_string_value[pore_type_str.size()] = '\0'; return POD5_OK; } namespace { class SignalRowInfoCHelper : public SignalRowInfo { public: SignalRowInfoCHelper(pod5::SignalTableRecordBatch && b) : batch(std::move(b)) {} pod5::SignalTableRecordBatch const batch; }; } // namespace pod5_error_t pod5_get_signal_row_info( Pod5FileReader const * reader, size_t signal_rows_count, uint64_t const * signal_rows, SignalRowInfo ** signal_row_info) { pod5_reset_error(); // Check for a valid reader. if (!check_file_not_null(reader)) { return g_pod5_error_no; } // Check for valid inputs. if (signal_rows_count == 0) { // Nothing to do. return POD5_OK; } else if (!check_not_null(signal_rows) || !check_output_pointer_not_null(signal_row_info)) { return g_pod5_error_no; } // Sort all rows first, in order to make searching faster. std::vector signal_rows_sorted{signal_rows, signal_rows + signal_rows_count}; std::sort(signal_rows_sorted.begin(), signal_rows_sorted.end()); // Store allocations to a temporary buffer so that we don't leak them on failure. 
std::vector> row_infos(signal_rows_count); // Then loop all rows, forward. for (std::size_t completed_rows = 0; completed_rows < signal_rows_sorted.size(); completed_rows++) { auto const start_row = signal_rows_sorted[completed_rows]; std::size_t batch_row = 0; POD5_C_ASSIGN_OR_RAISE( std::size_t row_batch, (reader->reader->signal_batch_for_row_id(start_row, &batch_row))); POD5_C_ASSIGN_OR_RAISE(auto batch, reader->reader->read_signal_record_batch(row_batch)); auto output = std::make_unique(std::move(batch)); output->batch_index = start_row; output->batch_row_index = batch_row; auto samples = output->batch.samples_column(); output->stored_sample_count = samples->Value(batch_row); POD5_C_ASSIGN_OR_RAISE( output->stored_byte_count, output->batch.samples_byte_count(batch_row)); row_infos[completed_rows] = std::move(output); } // Pass ownership of the info back to the caller. for (std::size_t row_idx = 0; row_idx < signal_rows_count; row_idx++) { signal_row_info[row_idx] = row_infos[row_idx].release(); } return POD5_OK; } pod5_error_t pod5_free_signal_row_info(size_t signal_rows_count, SignalRowInfo_t ** signal_row_info) { pod5_reset_error(); if (signal_rows_count > 0 && !check_not_null(signal_row_info)) { return g_pod5_error_no; } for (std::size_t i = 0; i < signal_rows_count; ++i) { std::unique_ptr helper( static_cast(signal_row_info[i])); helper.reset(); } return POD5_OK; } pod5_error_t pod5_get_signal( Pod5FileReader const * reader, SignalRowInfo_t const * row_info, size_t sample_count, int16_t * sample_data) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_not_null(row_info) || !check_output_pointer_not_null(sample_data)) { return g_pod5_error_no; } auto * row_info_data = static_cast(row_info); POD5_C_RETURN_NOT_OK(row_info_data->batch.extract_signal_row( row_info->batch_row_index, gsl::make_span(sample_data, sample_count))); return POD5_OK; } pod5_error_t pod5_get_read_complete_sample_count( Pod5FileReader_t const * reader, Pod5ReadRecordBatch_t 
const * batch, size_t batch_row, size_t * sample_count) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_not_null(batch) || !check_output_pointer_not_null(sample_count)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(auto const & signal_rows, batch->batch.get_signal_rows(batch_row)); POD5_C_ASSIGN_OR_RAISE( *sample_count, reader->reader->extract_sample_count( gsl::make_span(signal_rows->raw_values(), signal_rows->length()))); return POD5_OK; } pod5_error_t pod5_get_read_complete_signal( Pod5FileReader_t const * reader, Pod5ReadRecordBatch_t const * batch, size_t batch_row, size_t sample_count, int16_t * signal) { pod5_reset_error(); if (!check_file_not_null(reader) || !check_not_null(batch) || !check_output_pointer_not_null(signal)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(auto const & signal_rows, batch->batch.get_signal_rows(batch_row)); POD5_C_RETURN_NOT_OK(reader->reader->extract_samples( gsl::make_span(signal_rows->raw_values(), signal_rows->length()), gsl::make_span(signal, sample_count))); return POD5_OK; } //--------------------------------------------------------------------------------------------------------------------- Pod5FileWriter * pod5_create_file(char const * filename, char const * writer_name, Pod5WriterOptions const * options) { pod5_reset_error(); if (!check_string_not_empty(filename) || !check_string_not_empty(writer_name)) { return nullptr; } auto internal_writer = pod5::create_file_writer(filename, writer_name, make_internal_writer_options(options)); if (!internal_writer.ok()) { pod5_set_error(internal_writer.status()); return nullptr; } auto writer = std::make_unique(std::move(*internal_writer)); return writer.release(); } pod5_error_t pod5_close_and_free_writer(Pod5FileWriter * file) { pod5_reset_error(); std::unique_ptr ptr{file}; if (ptr) { POD5_C_RETURN_NOT_OK(ptr->writer->close()); } ptr.reset(); return POD5_OK; } pod5_error_t pod5_add_pore(int16_t * pore_index, Pod5FileWriter * file, char const * 
pore_type) { pod5_reset_error(); if (!check_string_not_empty(pore_type) || !check_file_not_null(file) || !check_output_pointer_not_null(pore_index)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE(*pore_index, file->writer->add_pore_type(pore_type)); return POD5_OK; } pod5_error_t pod5_add_run_info( int16_t * run_info_index, Pod5FileWriter * file, char const * acquisition_id, int64_t acquisition_start_time_ms, int16_t adc_max, int16_t adc_min, size_t context_tags_count, char const ** context_tags_keys, char const ** context_tags_values, char const * experiment_name, char const * flow_cell_id, char const * flow_cell_product_code, char const * protocol_name, char const * protocol_run_id, int64_t protocol_start_time_ms, char const * sample_id, uint16_t sample_rate, char const * sequencing_kit, char const * sequencer_position, char const * sequencer_position_type, char const * software, char const * system_name, char const * system_type, size_t tracking_id_count, char const ** tracking_id_keys, char const ** tracking_id_values) { pod5_reset_error(); if (!check_file_not_null(file) || !check_not_null(run_info_index) || !check_string_not_empty(acquisition_id) || !check_string_not_empty(experiment_name) || !check_string_not_empty(flow_cell_id) || !check_string_not_empty(flow_cell_product_code) || !check_string_not_empty(protocol_name) || !check_string_not_empty(protocol_run_id) || !check_string_not_empty(sample_id) || !check_string_not_empty(sequencing_kit) || !check_string_not_empty(sequencer_position) || !check_string_not_empty(sequencer_position_type) || !check_string_not_empty(software) || !check_string_not_empty(system_name) || !check_string_not_empty(system_type)) { return g_pod5_error_no; } auto const parse_map = [](std::size_t tracking_id_count, char const ** tracking_id_keys, char const ** tracking_id_values) -> pod5::Result { if (!check_not_null(tracking_id_keys) || !check_not_null(tracking_id_values)) { return arrow::Status::Invalid(g_pod5_error_string); } 
pod5::RunInfoData::MapType result; for (std::size_t i = 0; i < tracking_id_count; ++i) { auto key = tracking_id_keys[i]; auto value = tracking_id_values[i]; if (!check_string_not_empty(key) || !check_string_not_empty(value)) { return arrow::Status::Invalid(g_pod5_error_string); } result.emplace_back(key, value); } return result; }; POD5_C_ASSIGN_OR_RAISE( auto const context_tags, parse_map(context_tags_count, context_tags_keys, context_tags_values)); POD5_C_ASSIGN_OR_RAISE( auto const tracking_id, parse_map(tracking_id_count, tracking_id_keys, tracking_id_values)); POD5_C_ASSIGN_OR_RAISE( *run_info_index, file->writer->add_run_info( pod5::RunInfoData( acquisition_id, acquisition_start_time_ms, adc_max, adc_min, context_tags, experiment_name, flow_cell_id, flow_cell_product_code, protocol_name, protocol_run_id, protocol_start_time_ms, sample_id, sample_rate, sequencing_kit, sequencer_position, sequencer_position_type, software, system_name, system_type, tracking_id))); return POD5_OK; } static bool check_read_data_struct(std::uint16_t struct_version, void const * row_data) { static_assert( READ_BATCH_ROW_INFO_VERSION == READ_BATCH_ROW_INFO_VERSION_4, "New versions must be explicitly loaded"); if (!check_not_null(row_data)) { return false; } if (struct_version < READ_BATCH_ROW_INFO_VERSION_3) { pod5_set_error(arrow::Status::Invalid("Unable to write V1 + V2 reads, update to V3 API.")); return false; } auto check_common_v3_v4_fields = [](auto typed_row_data) -> bool { return check_not_null(typed_row_data->read_id) && check_not_null(typed_row_data->read_number) && check_not_null(typed_row_data->start_sample) && check_not_null(typed_row_data->median_before) && check_not_null(typed_row_data->channel) && check_not_null(typed_row_data->well) && check_not_null(typed_row_data->pore_type) && check_not_null(typed_row_data->calibration_offset) && check_not_null(typed_row_data->calibration_scale) && check_not_null(typed_row_data->end_reason) && 
check_not_null(typed_row_data->end_reason_forced) && check_not_null(typed_row_data->run_info_id) && check_not_null(typed_row_data->num_minknow_events) && check_not_null(typed_row_data->tracked_scaling_scale) && check_not_null(typed_row_data->tracked_scaling_shift) && check_not_null(typed_row_data->predicted_scaling_scale) && check_not_null(typed_row_data->predicted_scaling_shift) && check_not_null(typed_row_data->num_reads_since_mux_change) && check_not_null(typed_row_data->time_since_mux_change); }; if (struct_version == READ_BATCH_ROW_INFO_VERSION_3) { auto const * typed_row_data = static_cast(row_data); if (!check_common_v3_v4_fields(typed_row_data)) { return false; } } if (struct_version == READ_BATCH_ROW_INFO_VERSION_4) { auto const * typed_row_data = static_cast(row_data); if (!check_common_v3_v4_fields(typed_row_data) || !check_not_null(typed_row_data->open_pore_level)) { return false; } } return true; } static bool load_struct_row_into_read_data( std::unique_ptr const & writer, pod5::ReadData & read_data, std::uint16_t struct_version, void const * row_data, std::uint32_t row_id) { static_assert( READ_BATCH_ROW_INFO_VERSION == READ_BATCH_ROW_INFO_VERSION_4, "New versions must be explicitly loaded"); auto load_common_v3_v4_fields = [](std::unique_ptr const & writer, auto const * typed_row_data, std::uint32_t row_id, pod5::ReadData & read_data) { pod5::Uuid read_id_uuid{typed_row_data->read_id[row_id]}; std::optional end_reason_internal; switch (typed_row_data->end_reason[row_id]) { case POD5_END_REASON_UNKNOWN: end_reason_internal = pod5::ReadEndReason::unknown; break; case POD5_END_REASON_MUX_CHANGE: end_reason_internal = pod5::ReadEndReason::mux_change; break; case POD5_END_REASON_UNBLOCK_MUX_CHANGE: end_reason_internal = pod5::ReadEndReason::unblock_mux_change; break; case POD5_END_REASON_DATA_SERVICE_UNBLOCK_MUX_CHANGE: end_reason_internal = pod5::ReadEndReason::data_service_unblock_mux_change; break; case POD5_END_REASON_SIGNAL_POSITIVE: 
end_reason_internal = pod5::ReadEndReason::signal_positive; break; case POD5_END_REASON_SIGNAL_NEGATIVE: end_reason_internal = pod5::ReadEndReason::signal_negative; break; case POD5_END_REASON_API_REQUEST: end_reason_internal = pod5::ReadEndReason::api_request; break; case POD5_END_REASON_DEVICE_DATA_ERROR: end_reason_internal = pod5::ReadEndReason::device_data_error; break; case POD5_END_REASON_ANALYSIS_CONFIG_CHANGE: end_reason_internal = pod5::ReadEndReason::analysis_config_change; break; case POD5_END_REASON_PAUSED: end_reason_internal = pod5::ReadEndReason::paused; break; } if (!end_reason_internal.has_value()) { pod5_set_error( arrow::Status::Invalid( "out of range end reason ", typed_row_data->end_reason[row_id], " passed to add read")); return false; } auto const end_reason_index = writer->lookup_end_reason(*end_reason_internal); if (!end_reason_index.ok()) { pod5_set_error(end_reason_index.status()); return false; } read_data = pod5::ReadData{ read_id_uuid, typed_row_data->read_number[row_id], typed_row_data->start_sample[row_id], typed_row_data->channel[row_id], typed_row_data->well[row_id], typed_row_data->pore_type[row_id], typed_row_data->calibration_offset[row_id], typed_row_data->calibration_scale[row_id], typed_row_data->median_before[row_id], *end_reason_index, typed_row_data->end_reason_forced[row_id] != 0, typed_row_data->run_info_id[row_id], typed_row_data->num_minknow_events[row_id], typed_row_data->tracked_scaling_scale[row_id], typed_row_data->tracked_scaling_shift[row_id], typed_row_data->predicted_scaling_scale[row_id], typed_row_data->predicted_scaling_shift[row_id], typed_row_data->num_reads_since_mux_change[row_id], typed_row_data->time_since_mux_change[row_id], // open_pore_level is only present in v4. std::numeric_limits::quiet_NaN()}; return true; }; // Version 0-2 are no longer supported for writing. 
if (struct_version == READ_BATCH_ROW_INFO_VERSION_4) { auto const * typed_row_data = static_cast(row_data); if (!load_common_v3_v4_fields(writer, typed_row_data, row_id, read_data)) { return false; } read_data.open_pore_level = typed_row_data->open_pore_level[row_id]; } else if (struct_version == READ_BATCH_ROW_INFO_VERSION_3) { auto const * typed_row_data = static_cast(row_data); if (!load_common_v3_v4_fields(writer, typed_row_data, row_id, read_data)) { return false; } } else { pod5_set_error( arrow::Status::Invalid("Invalid writer struct version '", struct_version, "' passed")); return false; } return true; }; pod5_error_t pod5_add_reads_data( Pod5FileWriter_t * file, uint32_t read_count, uint16_t struct_version, void const * row_data, int16_t const ** signal, uint32_t const * signal_size) { pod5_reset_error(); if (!check_file_not_null(file)) { return g_pod5_error_no; } if (read_count == 0) { return POD5_OK; } if (!check_read_data_struct(struct_version, row_data) || !check_not_null(signal) || !check_not_null(signal_size)) { return g_pod5_error_no; } for (std::uint32_t read = 0; read < read_count; ++read) { if (!check_not_null(signal[read])) { return g_pod5_error_no; } } for (std::uint32_t read = 0; read < read_count; ++read) { pod5::ReadData read_data; if (!load_struct_row_into_read_data( file->writer, read_data, struct_version, row_data, read)) { return g_pod5_error_no; } POD5_C_RETURN_NOT_OK(file->writer->add_complete_read( read_data, gsl::make_span(signal[read], signal_size[read]))); } return POD5_OK; } pod5_error_t pod5_add_reads_data_pre_compressed( Pod5FileWriter_t * file, uint32_t read_count, uint16_t struct_version, void const * row_data, char const *** compressed_signal, size_t const ** compressed_signal_size, uint32_t const ** sample_counts, size_t const * signal_chunk_count) { pod5_reset_error(); if (!check_file_not_null(file)) { return g_pod5_error_no; } if (read_count == 0) { return POD5_OK; } if (!check_read_data_struct(struct_version, row_data) || 
!check_not_null(compressed_signal) || !check_not_null(compressed_signal_size) || !check_not_null(sample_counts) || !check_not_null(signal_chunk_count)) { return g_pod5_error_no; } for (std::uint32_t read = 0; read < read_count; ++read) { if (!check_not_null(compressed_signal[read]) || !check_not_null(compressed_signal_size[read]) || !check_not_null(sample_counts[read])) { return g_pod5_error_no; } } for (std::uint32_t read = 0; read < read_count; ++read) { pod5::ReadData read_data; if (!load_struct_row_into_read_data( file->writer, read_data, struct_version, row_data, read)) { return g_pod5_error_no; } std::uint64_t total_sample_count = 0; std::vector signal_rows; for (std::size_t i = 0; i < signal_chunk_count[read]; ++i) { auto signal = compressed_signal[read][i]; auto signal_size = compressed_signal_size[read][i]; auto sample_count = sample_counts[read][i]; total_sample_count += sample_count; POD5_C_ASSIGN_OR_RAISE( auto row_id, file->writer->add_pre_compressed_signal( read_data.read_id, gsl::make_span(signal, signal_size).as_span(), sample_count)); signal_rows.push_back(row_id); } POD5_C_RETURN_NOT_OK(file->writer->add_complete_read( read_data, gsl::make_span(signal_rows), total_sample_count)); } return POD5_OK; } size_t pod5_vbz_compressed_signal_max_size(size_t sample_count) { pod5_reset_error(); auto const compressed_size = pod5::compressed_signal_max_size(sample_count); if (!compressed_size.ok()) { // TODO: on MAJOR_VERSION bump change this to return an error code. 
pod5_set_error(compressed_size.status()); return 0; } else { return compressed_size.ValueUnsafe(); } } pod5_error_t pod5_vbz_compress_signal( int16_t const * signal, size_t signal_size, char * compressed_signal_out, size_t * compressed_signal_size) { pod5_reset_error(); if (!check_not_null(signal) || !check_output_pointer_not_null(compressed_signal_out) || !check_output_pointer_not_null(compressed_signal_size)) { return g_pod5_error_no; } POD5_C_ASSIGN_OR_RAISE( auto buffer, pod5::compress_signal(gsl::make_span(signal, signal_size), arrow::system_memory_pool())); if ((std::size_t)buffer->size() > *compressed_signal_size) { pod5_set_error( pod5::Status::Invalid( "Compressed signal size (", buffer->size(), ") is greater than provided buffer size (", *compressed_signal_size, ")")); return g_pod5_error_no; } std::copy(buffer->data(), buffer->data() + buffer->size(), compressed_signal_out); *compressed_signal_size = buffer->size(); return POD5_OK; } pod5_error_t pod5_vbz_decompress_signal( char const * compressed_signal, size_t compressed_signal_size, size_t sample_count, int16_t * signal_out) { pod5_reset_error(); if (!check_not_null(compressed_signal) || !check_output_pointer_not_null(signal_out)) { return g_pod5_error_no; } auto const in_span = gsl::make_span(compressed_signal, compressed_signal_size).as_span(); auto out_span = gsl::make_span(signal_out, sample_count); POD5_C_RETURN_NOT_OK(pod5::decompress_signal(in_span, arrow::system_memory_pool(), out_span)); return POD5_OK; } pod5_error_t pod5_format_read_id(read_id_t const read_id, char * read_id_string) { pod5_reset_error(); if (!check_not_null(read_id) || !check_output_pointer_not_null(read_id_string)) { return g_pod5_error_no; } auto * uuid_data = reinterpret_cast(read_id); uuid_data->write_to(read_id_string); read_id_string[36] = '\0'; return POD5_OK; } } ================================================ FILE: c++/pod5_format/c_api.h ================================================ #pragma once #include 
"pod5_format/pod5_format_export.h" #include #include #ifdef __cplusplus extern "C" { #endif #ifndef _WIN32 #define POD5_DEPRECATED __attribute__((deprecated)) #elif (__STDC_VERSION__ >= 202000) #define POD5_DEPRECATED [[deprecated]] #else #define POD5_DEPRECATED #endif /// All functions are thread safe unless otherwise stated. Types may be used by multiple /// threads as long as the functions being called only take them by const pointer. struct Pod5FileReader; typedef struct Pod5FileReader Pod5FileReader_t; struct Pod5FileWriter; typedef struct Pod5FileWriter Pod5FileWriter_t; struct Pod5ReadRecordBatch; typedef struct Pod5ReadRecordBatch Pod5ReadRecordBatch_t; //--------------------------------------------------------------------------------------------------------------------- // Error management //--------------------------------------------------------------------------------------------------------------------- /// \brief Integer error codes. /// \note Taken from the arrow status enum. enum pod5_error { POD5_OK = 0, POD5_ERROR_OUTOFMEMORY = 1, POD5_ERROR_KEYERROR = 2, POD5_ERROR_TYPEERROR = 3, POD5_ERROR_INVALID = 4, POD5_ERROR_IOERROR = 5, POD5_ERROR_CAPACITYERROR = 6, POD5_ERROR_INDEXERROR = 7, POD5_ERROR_CANCELLED = 8, POD5_ERROR_UNKNOWNERROR = 9, POD5_ERROR_NOTIMPLEMENTED = 10, POD5_ERROR_SERIALIZATIONERROR = 11, POD5_ERROR_STRING_NOT_LONG_ENOUGH = 12, }; typedef enum pod5_error pod5_error_t; /// \brief Get the most recent error number from all pod5 api's on the current thread. POD5_FORMAT_EXPORT pod5_error_t pod5_get_error_no(); /// \brief Get the most recent error description string from all pod5 api's on the current thread. /// \note The string's lifetime is internally managed, a caller should not free it. 
POD5_FORMAT_EXPORT char const * pod5_get_error_string(); //--------------------------------------------------------------------------------------------------------------------- // Global state //--------------------------------------------------------------------------------------------------------------------- /// \brief Initialise and register global pod5 types POD5_FORMAT_EXPORT pod5_error_t pod5_init(); /// \brief Terminate global pod5 types POD5_FORMAT_EXPORT pod5_error_t pod5_terminate(); //--------------------------------------------------------------------------------------------------------------------- // Shared Structures //--------------------------------------------------------------------------------------------------------------------- enum pod5_end_reason { POD5_END_REASON_UNKNOWN = 0, POD5_END_REASON_MUX_CHANGE = 1, POD5_END_REASON_UNBLOCK_MUX_CHANGE = 2, POD5_END_REASON_DATA_SERVICE_UNBLOCK_MUX_CHANGE = 3, POD5_END_REASON_SIGNAL_POSITIVE = 4, POD5_END_REASON_SIGNAL_NEGATIVE = 5, POD5_END_REASON_API_REQUEST = 6, POD5_END_REASON_DEVICE_DATA_ERROR = 7, POD5_END_REASON_ANALYSIS_CONFIG_CHANGE = 8, POD5_END_REASON_PAUSED = 9 }; typedef enum pod5_end_reason pod5_end_reason_t; typedef uint16_t run_info_index_t; typedef uint8_t read_id_t[16]; typedef uint8_t run_id_t[16]; // Single entry of read data: struct ReadBatchRowInfoV3 { // The read id data, in binary form. read_id_t read_id; // Read number for the read. uint32_t read_number; // Start sample for the read. uint64_t start_sample; // Median before level. float median_before; // Channel for the read. uint16_t channel; // Well for the read. uint8_t well; // Dictionary index for the pore type. int16_t pore_type; // Calibration offset type for the read. float calibration_offset; // Calibration scale for the read. float calibration_scale; // End reason index for the read. int16_t end_reason; // Was the end reason for the read forced (0 for false, 1 for true). 
uint8_t end_reason_forced; // Dictionary index for run id for the read, can be used to look up run info. int16_t run_info; // Number of minknow events that the read contains uint64_t num_minknow_events; // Scale/Shift for tracked read scaling values (based on previous reads) // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float tracked_scaling_scale; POD5_DEPRECATED float tracked_scaling_shift; // Scale/Shift for predicted read scaling values (based on this read's raw signal) // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float predicted_scaling_scale; POD5_DEPRECATED float predicted_scaling_shift; // How many reads have been selected prior to this read on the channel-well since it was made active. // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED uint32_t num_reads_since_mux_change; // How many seconds have passed since the channel-well was made active // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float time_since_mux_change; // Number of signal row entries for the read. int64_t signal_row_count; // The length of the read in samples. uint64_t num_samples; }; // Single entry of read data: struct ReadBatchRowInfoV4 { // The read id data, in binary form. read_id_t read_id; // Read number for the read. uint32_t read_number; // Start sample for the read. uint64_t start_sample; // Median before level. float median_before; // Channel for the read. uint16_t channel; // Well for the read. uint8_t well; // Dictionary index for the pore type. int16_t pore_type; // Calibration offset type for the read. float calibration_offset; // Calibration scale for the read. float calibration_scale; // End reason index for the read. int16_t end_reason; // Was the end reason for the read forced (0 for false, 1 for true). uint8_t end_reason_forced; // Dictionary index for run id for the read, can be used to look up run info. 
int16_t run_info; // Number of minknow events that the read contains uint64_t num_minknow_events; // Scale/Shift for tracked read scaling values (based on previous reads) // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float tracked_scaling_scale; POD5_DEPRECATED float tracked_scaling_shift; // Scale/Shift for predicted read scaling values (based on this read's raw signal) // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float predicted_scaling_scale; POD5_DEPRECATED float predicted_scaling_shift; // How many reads have been selected prior to this read on the channel-well since it was made active. // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED uint32_t num_reads_since_mux_change; // How many seconds have passed since the channel-well was made active // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float time_since_mux_change; // Number of signal row entries for the read. int64_t signal_row_count; // The length of the read in samples. uint64_t num_samples; // The level of the pore. float open_pore_level; }; // Typedef for latest batch row info structure. typedef struct ReadBatchRowInfoV4 ReadBatchRowInfo_t; struct POD5_DEPRECATED ReadBatchRowInfoArrayV3 { // The read id data, in binary form. read_id_t const * read_id; // Read number for the read. uint32_t const * read_number; // Start sample for the read. uint64_t const * start_sample; // Median before level. float const * median_before; // Channel for the read. uint16_t const * channel; // Well for the read. uint8_t const * well; // Pore type for the read. int16_t const * pore_type; // Calibration offset type for the read. float const * calibration_offset; // Calibration scale for the read. float const * calibration_scale; // End reason type for the read. pod5_end_reason_t const * end_reason; // Was the end reason for the read forced (0 for false, 1 for true). uint8_t const * end_reason_forced; // Run info type for the read. 
int16_t const * run_info_id; // Number of minknow events that the read contains uint64_t const * num_minknow_events; // Scale/Shift for tracked read scaling values (based on previous reads) // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float const * tracked_scaling_scale; POD5_DEPRECATED float const * tracked_scaling_shift; // Scale/Shift for predicted read scaling values (based on this read's raw signal) // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float const * predicted_scaling_scale; POD5_DEPRECATED float const * predicted_scaling_shift; // How many reads have been selected prior to this read on the channel-well since it was made active. // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED uint32_t const * num_reads_since_mux_change; // How many seconds have passed since the channel-well was made active // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float const * time_since_mux_change; }; // Array of read data: struct ReadBatchRowInfoArrayV4 { // The read id data, in binary form. read_id_t const * read_id; // Read number for the read. uint32_t const * read_number; // Start sample for the read. uint64_t const * start_sample; // Median before level. float const * median_before; // Channel for the read. uint16_t const * channel; // Well for the read. uint8_t const * well; // Pore type for the read. int16_t const * pore_type; // Calibration offset type for the read. float const * calibration_offset; // Calibration scale for the read. float const * calibration_scale; // End reason type for the read. pod5_end_reason_t const * end_reason; // Was the end reason for the read forced (0 for false, 1 for true). uint8_t const * end_reason_forced; // Run info type for the read. 
int16_t const * run_info_id; // Number of minknow events that the read contains uint64_t const * num_minknow_events; // Scale/Shift for tracked read scaling values (based on previous reads) // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float const * tracked_scaling_scale; POD5_DEPRECATED float const * tracked_scaling_shift; // Scale/Shift for predicted read scaling values (based on this read's raw signal) // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float const * predicted_scaling_scale; POD5_DEPRECATED float const * predicted_scaling_shift; // How many reads have been selected prior to this read on the channel-well since it was made active. // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED uint32_t const * num_reads_since_mux_change; // How many seconds have passed since the channel-well was made active // DEPRECATED: will be removed in 0.4.0 POD5_DEPRECATED float const * time_since_mux_change; // The level of the pore. float const * open_pore_level; }; // Typedef for latest batch row info structure. typedef struct ReadBatchRowInfoArrayV4 ReadBatchRowInfoArray_t; #define READ_BATCH_ROW_INFO_VERSION_0 0 // Addition of num_minknow_events fields, scaling fields. #define READ_BATCH_ROW_INFO_VERSION_1 1 // Addition of num_samples fields. #define READ_BATCH_ROW_INFO_VERSION_2 2 // Flattening of read structures. #define READ_BATCH_ROW_INFO_VERSION_3 3 // Introduction of new open_pore_level field. #define READ_BATCH_ROW_INFO_VERSION_4 4 // Latest available version. #define READ_BATCH_ROW_INFO_VERSION READ_BATCH_ROW_INFO_VERSION_4 //--------------------------------------------------------------------------------------------------------------------- // Reading files //--------------------------------------------------------------------------------------------------------------------- // Options to control how a file is read. struct Pod5ReaderOptions { /// \brief Disable file mapping into memory. 
Reduces memory usage of pod5 files, at the expense /// of the underlying arrow file loading into memory on demand. char force_disable_file_mapping; }; typedef struct Pod5ReaderOptions Pod5ReaderOptions_t; /// \brief Open a file reader with default options. /// \param filename The filename of the pod5 file. /// \see pod5_open_file_options POD5_FORMAT_EXPORT Pod5FileReader_t * pod5_open_file(char const * filename); /// \brief Open a file reader /// \param filename The filename of the pod5 file. /// \param options The options to use when opening the file. /// \return A reader, or null on error. The reason for the error can be queried with /// pod5_get_error_no() and pod5_get_error_string(). POD5_FORMAT_EXPORT Pod5FileReader_t * pod5_open_file_options( char const * filename, Pod5ReaderOptions_t const * options); /// \brief Close a file reader, releasing all memory held by the reader. /// \param file A previously opened reader. /// \note Any references to \a file or its components are no longer valid after this call. /// \note It is safe to call this with a null \a file. POD5_FORMAT_EXPORT pod5_error_t pod5_close_and_free_reader(Pod5FileReader_t * file); struct FileInfo { read_id_t file_identifier; struct Version { uint16_t major; uint16_t minor; uint16_t revision; } version; }; typedef struct FileInfo FileInfo_t; /// \brief Find info about a file. /// \param[in] file The file to be queried. /// \param[out] file_info The info read from the file. POD5_FORMAT_EXPORT pod5_error_t pod5_get_file_info(Pod5FileReader_t const * file, FileInfo_t * file_info); struct EmbeddedFileData { // The embedded file name - note this may not be the original file name, if the file has been migrated. // This pointer will remain valid until the next pod5 api call on the associated reader. char const * file_name; size_t offset; size_t length; }; typedef struct EmbeddedFileData EmbeddedFileData_t; /// \brief Find the location of the read table data /// \param[in] file The file to be queried. 
/// \param[out] file_data The output read table file data. POD5_FORMAT_EXPORT pod5_error_t pod5_get_file_read_table_location(Pod5FileReader_t const * file, EmbeddedFileData_t * file_data); /// \brief Find the location of the signal table data /// \param[in] file The file to be queried. /// \param[out] file_data The output signal table file data. POD5_FORMAT_EXPORT pod5_error_t pod5_get_file_signal_table_location(Pod5FileReader_t const * file, EmbeddedFileData_t * file_data); /// \brief Find the location of the run info table data /// \param[in] file The file to be queried. /// \param[out] file_data The output run info table file data. POD5_FORMAT_EXPORT pod5_error_t pod5_get_file_run_info_table_location( Pod5FileReader_t const * file, EmbeddedFileData_t * file_data); /// \brief Find the number of reads in the file. /// \param[in] reader The file reader to read from /// \param[out] count The number of reads in the file POD5_FORMAT_EXPORT pod5_error_t pod5_get_read_count(Pod5FileReader_t const * reader, size_t * count); /// \brief Grab the read_id's from the file. /// \param[in] reader The file reader to read from. /// \param count The number of read_id's allocated in [read_ids], an error is raised if the count is not greater than or equal to the read count returned by pod5_get_read_count. /// \param[out] read_ids The read id's written in a contiguous array. POD5_FORMAT_EXPORT pod5_error_t pod5_get_read_ids(Pod5FileReader_t const * reader, size_t count, read_id_t * read_ids); /// \brief Plan the most efficient route through the data for the given read ids /// \param[in] file The file to be queried. /// \param[in] read_id_array The read id array (contiguous array, 16 bytes per id). /// \param read_id_count The number of read ids. /// \param[out] batch_counts The number of rows per batch that need to be visited (rows listed in batch_rows), /// input array length should be the number of read table batches. /// \param[out] batch_rows Rows to visit per batch, packed into one array. 
Offsets into this array from /// [batch_counts] provide the per-batch row data. Input array length should /// equal read_id_count. /// \param[out] find_success_count The number of requested read ids that were found. /// \note The output arrays are sorted in file storage order, to improve read efficiency. POD5_FORMAT_EXPORT pod5_error_t pod5_plan_traversal( Pod5FileReader_t const * file, uint8_t const * read_id_array, size_t read_id_count, uint32_t * batch_counts, uint32_t * batch_rows, size_t * find_success_count); /// \brief Find the number of read batches in the file. /// \param[out] count The number of read batches in the file /// \param[in] reader The file reader to read from POD5_FORMAT_EXPORT pod5_error_t pod5_get_read_batch_count(size_t * count, Pod5FileReader_t const * reader); /// \brief Get a read batch from the file. /// \param[out] batch The extracted batch. /// \param[in] reader The file reader to read from /// \param index The index of the batch to read. /// \note Batches returned from this API must be freed using #pod5_free_read_batch POD5_FORMAT_EXPORT pod5_error_t pod5_get_read_batch(Pod5ReadRecordBatch_t ** batch, Pod5FileReader_t const * reader, size_t index); /// \brief Release a read batch when it is no longer used. /// \param batch The batch to release. /// \note Any references to \a batch or its components are no longer valid after this call. /// \note It is safe to call this with a null \a batch. POD5_FORMAT_EXPORT pod5_error_t pod5_free_read_batch(Pod5ReadRecordBatch_t * batch); /// \brief Find the number of rows in a batch. /// \param[out] count The number of rows in the batch. /// \param[in] batch The batch to query the number of rows for. POD5_FORMAT_EXPORT pod5_error_t pod5_get_read_batch_row_count(size_t * count, Pod5ReadRecordBatch_t const * batch); /// \brief Find the info for a row in a read batch. /// \param[in] batch The read batch to query. /// \param row The row index to query. 
/// \param struct_version The version of the struct being passed in, calling code /// should use [READ_BATCH_ROW_INFO_VERSION]. /// \param[out] row_data The data for reading into, should be a pointer to ReadBatchRowInfo_t. /// \param[out] read_table_version The table version read from the file, will indicate which fields should be available. /// See READ_BATCH_ROW_INFO_VERSION and ReadBatchRowInfo_t above for corresponding fields. POD5_FORMAT_EXPORT pod5_error_t pod5_get_read_batch_row_info_data( Pod5ReadRecordBatch_t const * batch, size_t row, uint16_t struct_version, void * row_data, uint16_t * read_table_version); /// \brief Find the signal indices for a row in a read batch. /// \param[in] batch The read batch to query. /// \param row The row index to query. /// \param signal_row_indices_count Number of entries in the signal_row_indices array. /// \param[out] signal_row_indices The signal row indices read out of the read row. /// \note signal_row_indices_count Must equal signal_row_count returned from pod5_get_read_batch_row_info_data /// or an error is generated. POD5_FORMAT_EXPORT pod5_error_t pod5_get_signal_row_indices( Pod5ReadRecordBatch_t const * batch, size_t row, int64_t signal_row_indices_count, uint64_t * signal_row_indices); struct CalibrationExtraData { // The digitisation value used by the sequencer, equal to: // // adc_max - adc_min + 1 uint16_t digitisation; // The range of the calibrated channel in pA. float range; }; typedef struct CalibrationExtraData CalibrationExtraData_t; /// \brief Find the extra calibration info for a row in a read batch. /// \param[in] batch The read batch to query. /// \param row The read row index. /// \param[out] calibration_extra_data Output location for the calibration data. /// \note The values are computed from data held in the file, and written directly to the address provided, there is no need to release any data. 
POD5_FORMAT_EXPORT pod5_error_t pod5_get_calibration_extra_info( Pod5ReadRecordBatch_t const * batch, size_t row, CalibrationExtraData_t * calibration_extra_data); struct KeyValueData { size_t size; char const ** keys; char const ** values; }; struct RunInfoDictData { char const * acquisition_id; int64_t acquisition_start_time_ms; int16_t adc_max; int16_t adc_min; struct KeyValueData context_tags; char const * experiment_name; char const * flow_cell_id; char const * flow_cell_product_code; char const * protocol_name; char const * protocol_run_id; int64_t protocol_start_time_ms; char const * sample_id; uint16_t sample_rate; char const * sequencing_kit; char const * sequencer_position; char const * sequencer_position_type; char const * software; char const * system_name; char const * system_type; struct KeyValueData tracking_id; }; typedef struct RunInfoDictData RunInfoDictData_t; /// \brief Find the run info for a row in a read batch. /// \param[in] batch The read batch to query. /// \param run_info The run info index to query from the passed batch. /// \param[out] run_info_data Output location for the run info data. /// \note The returned run_info value should be released using pod5_free_run_info when it is no longer used. POD5_FORMAT_EXPORT pod5_error_t pod5_get_run_info( Pod5ReadRecordBatch_t const * batch, int16_t run_info, RunInfoDictData_t ** run_info_data); /// \brief Find the run info for a row in a file. /// \param[in] file The file to query. /// \param run_info_index The run info index to query from the passed file. /// \param[out] run_info_data Output location for the run info data. /// \note The returned run_info value should be released using pod5_free_run_info when it is no longer used. POD5_FORMAT_EXPORT pod5_error_t pod5_get_file_run_info( Pod5FileReader_t const * file, run_info_index_t run_info_index, RunInfoDictData_t ** run_info_data); /// \brief Release a RunInfoDictData struct after use. /// \param run_info_data The run info to release. 
/// \note Any references to \a run_info_data or its components are no longer valid after this call. /// \note It is safe to call this with a null \a run_info_data. POD5_FORMAT_EXPORT pod5_error_t pod5_free_run_info(RunInfoDictData_t * run_info_data); /// \brief Release a RunInfoDictData struct after use. /// \deprecated POD5_FORMAT_EXPORT POD5_DEPRECATED pod5_error_t pod5_release_run_info(RunInfoDictData_t * run_info_data); /// \brief Find the number of run infos in a file. /// \param[in] file The file to query. /// \param[out] run_info_count The number of run infos that are present in the queried file. POD5_FORMAT_EXPORT pod5_error_t pod5_get_file_run_info_count(Pod5FileReader_t const * file, run_info_index_t * run_info_count); /// \brief Find the end reason for a row in a read batch. /// \param[in] batch The read batch to query. /// \param end_reason The end reason index to query from the passed batch. /// \param[out] end_reason_value The enum value for end reason. /// \param[out] end_reason_string_value Output location for the string value for the end reason. /// \param[in,out] end_reason_string_value_size Size of [end_reason_string_value], the number of characters written (including 1 for null character) is placed in this value on return. /// \note If the string input is not long enough POD5_ERROR_STRING_NOT_LONG_ENOUGH is returned. POD5_FORMAT_EXPORT pod5_error_t pod5_get_end_reason( Pod5ReadRecordBatch_t const * batch, int16_t end_reason, pod5_end_reason_t * end_reason_value, char * end_reason_string_value, size_t * end_reason_string_value_size); /// \brief Find the pore type for a row in a read batch. /// \param[in] batch The read batch to query. /// \param pore_type The pore type index to query from the passed batch. /// \param[out] pore_type_string_value Output location for the string value for the pore type. 
/// \param[in,out] pore_type_string_value_size Size of [pore_type_string_value], the number of characters written (including 1 for null character) is placed in this value on return. /// \note If the string input is not long enough POD5_ERROR_STRING_NOT_LONG_ENOUGH is returned. POD5_FORMAT_EXPORT pod5_error_t pod5_get_pore_type( Pod5ReadRecordBatch_t const * batch, int16_t pore_type, char * pore_type_string_value, size_t * pore_type_string_value_size); struct SignalRowInfo { size_t batch_index; size_t batch_row_index; uint32_t stored_sample_count; size_t stored_byte_count; }; typedef struct SignalRowInfo SignalRowInfo_t; /// \brief Find the info for a signal row in a reader. /// \param[in] reader The reader to query. /// \param signal_rows_count The number of signal rows to query. /// \param[in] signal_rows The signal rows to query. /// \param[out] signal_row_info Pointers to the output signal row information (must be an array of size signal_rows_count) POD5_FORMAT_EXPORT pod5_error_t pod5_get_signal_row_info( Pod5FileReader_t const * reader, size_t signal_rows_count, uint64_t const * signal_rows, SignalRowInfo_t ** signal_row_info); /// \brief Release a list of signal row infos allocated by [pod5_get_signal_row_info]. /// \param signal_rows_count The number of signal rows to release. /// \param signal_row_info The signal row infos to release. /// \note Calls to pod5_free_signal_row_info must be 1:1 with [pod5_get_signal_row_info], you cannot free part of the returned data. POD5_FORMAT_EXPORT pod5_error_t pod5_free_signal_row_info(size_t signal_rows_count, SignalRowInfo_t ** signal_row_info); /// \brief Find the signal samples for a signal row in a reader. /// \param[in] reader The reader to query. /// \param[in] row_info The signal row info batch index to query data for. /// \param sample_count The number of samples allocated in [sample_data] (must equal the length of signal data in the row). /// \param[out] sample_data The output location for the queried samples. 
/// \note The signal data is allocated by the caller and should be released as appropriate by the caller. /// \todo MAJOR_VERSION Rename to include "chunk" or "row" or similar to indicate this gets only part of read signal. POD5_FORMAT_EXPORT pod5_error_t pod5_get_signal( Pod5FileReader_t const * reader, SignalRowInfo_t const * row_info, size_t sample_count, int16_t * sample_data); /// \brief Find the sample count for a full read. /// \param[in] reader The reader to query. /// \param[in] batch The read batch to query. /// \param batch_row The read row to query data for. /// \param[out] sample_count The number of samples in the read - including all chunks of raw data. POD5_FORMAT_EXPORT pod5_error_t pod5_get_read_complete_sample_count( Pod5FileReader_t const * reader, Pod5ReadRecordBatch_t const * batch, size_t batch_row, size_t * sample_count); /// \brief Find the signal for a full read. /// \param[in] reader The reader to query. /// \param[in] batch The read batch to query. /// \param batch_row The read row to query data for. /// \param sample_count The number of samples allocated in [signal] (must equal the length of signal data in the queried read row). /// \param[out] signal The output location for the queried samples. /// \note The signal data is allocated by the caller and should be released as appropriate by the caller. POD5_FORMAT_EXPORT pod5_error_t pod5_get_read_complete_signal( Pod5FileReader_t const * reader, Pod5ReadRecordBatch_t const * batch, size_t batch_row, size_t sample_count, int16_t * signal); //--------------------------------------------------------------------------------------------------------------------- // Writing files //--------------------------------------------------------------------------------------------------------------------- // Signal compression options. enum CompressionOption { /// \brief Use the default signal compression option. DEFAULT_SIGNAL_COMPRESSION = 0, /// \brief Use vbz to compress read signals in tables. 
VBZ_SIGNAL_COMPRESSION = 1, /// \brief Write signals uncompressed to tables. UNCOMPRESSED_SIGNAL = 2, }; // Options to control how a file is written. struct Pod5WriterOptions { /// \brief Maximum number of samples to place in one signal record in the signals table. /// \note Use zero to use default value. uint32_t max_signal_chunk_size; /// \brief Signal type to write to the signals table. /// \note Use 'DEFAULT_SIGNAL_COMPRESSION' to use default value. int8_t signal_compression_type; /// \brief The size of each batch written for the signal table (zero for default). size_t signal_table_batch_size; /// \brief The size of each batch written for the reads table (zero for default). size_t read_table_batch_size; }; typedef struct Pod5WriterOptions Pod5WriterOptions_t; /// \brief Create a new pod5 file using specified filenames and options. /// \param filename The filename of the pod5 file. /// \param writer_name A descriptive string for the user software writing this file. /// \param options Options controlling how the file will be written. /// \return A writer, or null on error. The reason for the error can be queried with /// pod5_get_error_no() and pod5_get_error_string(). POD5_FORMAT_EXPORT Pod5FileWriter_t * pod5_create_file( char const * filename, char const * writer_name, Pod5WriterOptions_t const * options); /// \brief Close a file writer, releasing all memory held by the writer. /// \param file A previously opened writer. /// \note Any references to \a file or its components are no longer valid after this call. /// \note It is safe to call this with a null \a file. POD5_FORMAT_EXPORT pod5_error_t pod5_close_and_free_writer(Pod5FileWriter_t * file); /// \brief Add a new pore type to the file. /// \param[out] pore_index The index of the added pore. /// \param file The file to add the new pore type to. /// \param pore_type The pore type string for the pore. 
POD5_FORMAT_EXPORT pod5_error_t pod5_add_pore(int16_t * pore_index, Pod5FileWriter_t * file, char const * pore_type); /// \brief Add a new run info to the file, containing tracking information about a sequencing run. /// \param[out] run_info_index The index of the added run_info. /// \param file The file to add the new run info to. /// \param acquisition_id The acquisition id for the run. /// \param acquisition_start_time_ms Milliseconds after unix epoch when the acquisition was started. /// \param adc_max Maximum ADC value supported by this hardware. /// \param adc_min Minimum ADC value supported by this hardware. /// \param context_tags_count Number of entries in the context tags map. /// \param context_tags_keys Array of strings used as keys into the context tags map (must have context_tags_count entries). /// \param context_tags_values Array of strings used as values in the context tags map (must have context_tags_count entries). /// \param experiment_name Name given by the user to the group including this protocol. /// \param flow_cell_id Id for the flow cell used in the run. /// \param flow_cell_product_code Product code for the flow cell used in the run. /// \param protocol_name Name given by the user to the protocol. /// \param protocol_run_id Run id for the protocol. /// \param protocol_start_time_ms Milliseconds after unix epoch when the protocol was started. /// \param sample_id Name given by the user for sample id. /// \param sample_rate Sample rate of the run. /// \param sequencing_kit Sequencing kit used in the run. /// \param sequencer_position Sequencer position used in the run. /// \param sequencer_position_type Sequencer position type used in the run. /// \param software Name of the software used to produce the run. /// \param system_name Name of the system used to produce the run. /// \param system_type Type of the system used to produce the run. /// \param tracking_id_count Number of entries in the tracking id map. 
/// \param tracking_id_keys Array of strings used as keys into the tracking id map (must have tracking_id_count entries). /// \param tracking_id_values Array of strings used as values in the tracking id map (must have tracking_id_count entries). POD5_FORMAT_EXPORT pod5_error_t pod5_add_run_info( int16_t * run_info_index, Pod5FileWriter_t * file, char const * acquisition_id, int64_t acquisition_start_time_ms, int16_t adc_max, int16_t adc_min, size_t context_tags_count, char const ** context_tags_keys, char const ** context_tags_values, char const * experiment_name, char const * flow_cell_id, char const * flow_cell_product_code, char const * protocol_name, char const * protocol_run_id, int64_t protocol_start_time_ms, char const * sample_id, uint16_t sample_rate, char const * sequencing_kit, char const * sequencer_position, char const * sequencer_position_type, char const * software, char const * system_name, char const * system_type, size_t tracking_id_count, char const ** tracking_id_keys, char const ** tracking_id_values); /// \brief Add a read to the file. /// /// For each read `r`, where `0 <= r < read_count`: /// - `row_data->field[r]` describes a field of the read metadata /// - `signal[r]` is the raw signal data for the read /// - `signal_size[r]` is the length of `signal[r]` (in samples, not in bytes) /// /// \param file The file to add the reads to. /// \param read_count The number of reads to add with this call. /// \param struct_version The version of the struct of [row_data] being filled, use READ_BATCH_ROW_INFO_VERSION. /// \param row_data The array data for injecting into the file, should be ReadBatchRowInfoArray_t. /// All fields of the array must have length [read_count]. /// \param signal The signal data for the reads. /// \param signal_size The number of samples in the signal data. /// This must be an array of length [read_count]. 
POD5_FORMAT_EXPORT pod5_error_t pod5_add_reads_data( Pod5FileWriter_t * file, uint32_t read_count, uint16_t struct_version, void const * row_data, int16_t const ** signal, uint32_t const * signal_size); /// \brief Add a read to the file, with pre compressed signal chunk sections. /// /// Consider using the simpler [pod5_add_reads_data] unless you have performance requirements that demand /// more control over compression and chunking. /// /// Data should be compressed using [pod5_vbz_compress_signal]. /// /// For each read `r`, where `0 <= r < read_count`: /// - `row_data->field[r]` describes a field of the read metadata /// - `signal_chunk_count[r]` is the number of signal chunks /// - for each signal chunk `i` where `0 <= i < signal_chunk_count[r]`: /// - `sample_counts[r][i]` is the number of samples in the chunk (ie: the size of the uncompressed data in /// samples, not in bytes) /// - `compressed_signal[r][i]` is the compressed data /// - `compressed_signal_size[r][i]` is the length of the compressed data at `compressed_signal[r][i]` /// /// \param file The file to add the read to. /// \param read_count The number of reads to add with this call. /// \param struct_version The version of the struct of [row_data] being filled, use READ_BATCH_ROW_INFO_VERSION. /// \param row_data The array data for injecting into the file, should be ReadBatchRowInfoArray_t. /// All fields of the array must have length [read_count]. /// \param compressed_signal The signal chunks data for the read. /// \param compressed_signal_size The sizes (in bytes) of each signal chunk. /// \param sample_counts The number of samples of each signal chunk. In other words, it is the *uncompressed* size of the /// corresponding [compressed_signal] array, in samples (not bytes!). /// \param signal_chunk_count The number of sections of compressed signal. /// This must be an array of length [read_count]. 
POD5_FORMAT_EXPORT pod5_error_t pod5_add_reads_data_pre_compressed( Pod5FileWriter_t * file, uint32_t read_count, uint16_t struct_version, void const * row_data, char const *** compressed_signal, size_t const ** compressed_signal_size, uint32_t const ** sample_counts, size_t const * signal_chunk_count); /// \brief Find the max size of a compressed array of samples. /// \param sample_count The number of samples in the source signal. /// \return The max number of bytes required for the compressed signal, or 0 on error. /// The reason for the error can be queried with pod5_get_error_no() and /// pod5_get_error_string(). POD5_FORMAT_EXPORT size_t pod5_vbz_compressed_signal_max_size(size_t sample_count); /// \brief VBZ compress an array of samples. /// \param signal The signal to compress. /// \param signal_size The number of samples to compress. /// \param[out] compressed_signal_out The compressed signal. /// \param[in,out] compressed_signal_size The number of compressed bytes, should be set to the size of compressed_signal_out on call. POD5_FORMAT_EXPORT pod5_error_t pod5_vbz_compress_signal( int16_t const * signal, size_t signal_size, char * compressed_signal_out, size_t * compressed_signal_size); /// \brief VBZ decompress an array of samples. /// \param compressed_signal The signal to decompress. /// \param compressed_signal_size The number of compressed bytes, ie the size of compressed_signal in bytes. /// \param sample_count The number of samples to decompress, ie the size of signal_out in samples. /// \param[out] signal_out The decompressed signal. 
POD5_FORMAT_EXPORT pod5_error_t pod5_vbz_decompress_signal( char const * compressed_signal, size_t compressed_signal_size, size_t sample_count, int16_t * signal_out); //--------------------------------------------------------------------------------------------------------------------- // Global state //--------------------------------------------------------------------------------------------------------------------- /// \brief Format a packed binary read id as a readable read id string: /// \param read_id A 16 byte binary formatted UUID. /// \param[out] read_id_string Output string containing the string formatted UUID (expects a string of at least 37 bytes, one null byte is written.) POD5_FORMAT_EXPORT pod5_error_t pod5_format_read_id(read_id_t const read_id, char * read_id_string); #ifdef __cplusplus } #endif ================================================ FILE: c++/pod5_format/dictionary_writer.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/result.h" namespace arrow { class Array; } namespace pod5 { class POD5_FORMAT_EXPORT DictionaryWriter { public: virtual ~DictionaryWriter() = default; pod5::Result> build_dictionary_array( std::shared_ptr const & indices); virtual pod5::Result> get_value_array() = 0; virtual std::size_t item_count() = 0; bool is_valid(std::size_t value) { return value < item_count(); } }; } // namespace pod5 ================================================ FILE: c++/pod5_format/expandable_buffer.h ================================================ #pragma once #include #include #include #include namespace pod5 { template class ExpandableBuffer { public: static constexpr int EXPANSION_FACTOR = 2; ExpandableBuffer(arrow::MemoryPool * pool = nullptr) { m_pool = pool; } arrow::Status init_buffer(arrow::MemoryPool * pool) { m_pool = pool; return clear(); } std::size_t size() const { if (!m_buffer) { return 0; } return m_buffer->size() / sizeof(T); } std::uint8_t * 
mutable_data() { return m_buffer->mutable_data(); } std::shared_ptr get_buffer() const { return m_buffer; } arrow::Status clear() { if (!m_buffer || m_buffer.use_count() > 1) { ARROW_ASSIGN_OR_RAISE(m_buffer, arrow::AllocateResizableBuffer(0, m_pool)); return arrow::Status::OK(); } else { return m_buffer->Resize(0, false); } } gsl::span get_data_span() const { if (!m_buffer) { return {}; } return gsl::make_span(m_buffer->data(), m_buffer->size()).template as_span(); } /// \brief Append an object where you don't know the size up front. /// \param max_size The maximum possible size of the object to append. /// \param append_fn A function that appends the object to the buffer. template arrow::Status append(std::size_t max_size, Callable append_fn) { auto const old_size = m_buffer->size(); ARROW_RETURN_NOT_OK(reserve(old_size + max_size)); auto const potential_buffer = gsl::make_span(m_buffer->mutable_data() + old_size, max_size); ARROW_ASSIGN_OR_RAISE(auto final_size, append_fn(potential_buffer)); assert(final_size < max_size); return resize(old_size + final_size); } arrow::Status append(T const & new_value) { auto const bytes_span = gsl::make_span(&new_value, 1).template as_span(); return append_bytes(bytes_span); } arrow::Status append_array(gsl::span const & new_value_span) { auto const bytes_span = new_value_span.template as_span(); return append_bytes(bytes_span); } arrow::Status resize(std::int64_t new_size) { ARROW_RETURN_NOT_OK(reserve(new_size)); return m_buffer->Resize(new_size, false); } arrow::Status reserve(std::int64_t new_capacity) { assert(m_buffer); auto const old_size = m_buffer->size(); if (m_buffer.use_count() > 1) { std::shared_ptr buffer; ARROW_ASSIGN_OR_RAISE(buffer, arrow::AllocateResizableBuffer(old_size, m_pool)); std::copy(m_buffer->data(), m_buffer->data() + old_size, buffer->mutable_data()); std::swap(m_buffer, buffer); } if (new_capacity > m_buffer->capacity()) { ARROW_RETURN_NOT_OK(m_buffer->Reserve(new_capacity * EXPANSION_FACTOR)); } 
return arrow::Status::OK(); } private: arrow::Status append_bytes(gsl::span const & bytes_span) { auto old_size = 0; if (!m_buffer) { ARROW_ASSIGN_OR_RAISE( m_buffer, arrow::AllocateResizableBuffer(bytes_span.size(), m_pool)); } else { old_size = m_buffer->size(); } auto const new_size = old_size + bytes_span.size(); ARROW_RETURN_NOT_OK(reserve(new_size)); ARROW_RETURN_NOT_OK(m_buffer->Resize(new_size, false)); std::copy(bytes_span.begin(), bytes_span.end(), m_buffer->mutable_data() + old_size); return arrow::Status::OK(); } std::shared_ptr m_buffer; arrow::MemoryPool * m_pool = nullptr; }; } // namespace pod5 ================================================ FILE: c++/pod5_format/file_output_stream.h ================================================ #pragma once #include #include namespace pod5 { class FileOutputStream : public arrow::io::OutputStream { public: virtual arrow::Status batch_complete() { return arrow::Status::OK(); } virtual void set_file_start_offset(std::size_t val) {} }; } // namespace pod5 ================================================ FILE: c++/pod5_format/file_reader.cpp ================================================ #include "pod5_format/file_reader.h" #include "pod5_format/internal/combined_file_utils.h" #include "pod5_format/memory_pool.h" #include "pod5_format/migration/migration.h" #include "pod5_format/read_table_reader.h" #include "pod5_format/run_info_table_reader.h" #include "pod5_format/signal_table_reader.h" #include #include #include #include #include #if defined(__APPLE__) || defined(__linux__) || defined(__unix__) #include #endif namespace pod5 { namespace { #if defined(__APPLE__) || defined(__linux__) || defined(__unix__) constexpr std::uint64_t kStatBlockBytes = S_BLKSIZE; // Issue warning if the majority of the file is missing to avoid false positives constexpr double kMinMissingFractionThreshold = 0.8; void warn_if_stat_size_and_blocks_differ_significantly(std::string const & path) { struct stat file_stat = {}; if 
(::stat(path.c_str(), &file_stat) != 0 || file_stat.st_size <= 0 || file_stat.st_blocks < 0) { return; } auto const logical_size_bytes = static_cast(file_stat.st_size); auto const allocated_size_bytes = static_cast(file_stat.st_blocks) * kStatBlockBytes; if (allocated_size_bytes >= logical_size_bytes) { return; } auto const missing_bytes = logical_size_bytes - allocated_size_bytes; auto const missing_fraction = static_cast(missing_bytes) / static_cast(logical_size_bytes); if (missing_fraction < kMinMissingFractionThreshold) { return; } std::cerr << "Warning: POD5 file '" << path << "' has st_size=" << logical_size_bytes << " bytes but only approximately " << allocated_size_bytes << " bytes allocated via st_blocks. The file may be sparse or offloaded and " "open/read operations may fail." << std::endl; } #else void warn_if_stat_size_and_blocks_differ_significantly(std::string const &) {} #endif } // namespace FileReaderOptions::FileReaderOptions() : m_memory_pool(pod5::default_memory_pool()) , m_max_cached_signal_table_batches(DEFAULT_MAX_CACHED_SIGNAL_TABLE_BATCHES) { } void FileReaderOptions::set_max_cached_signal_table_batches( std::size_t max_cached_signal_table_batches) { m_max_cached_signal_table_batches = max_cached_signal_table_batches; } inline FileLocation make_file_locaton(combined_file_utils::ParsedFileInfo const & parsed_file_info) { return FileLocation{ parsed_file_info.file_path, std::size_t(parsed_file_info.file_start_offset), std::size_t(parsed_file_info.file_length)}; } class FileReaderImpl : public FileReader { public: FileReaderImpl( Version const & file_version_pre_migration, MigrationResult && migration_result, RunInfoTableReader && run_info_table_reader, ReadTableReader && read_table_reader, SignalTableReader && signal_table_reader) : m_file_version_pre_migration(file_version_pre_migration) , m_migration_result(std::move(migration_result)) , m_run_info_table_location(make_file_locaton(m_migration_result.footer().run_info_table)) , 
m_read_table_location(make_file_locaton(m_migration_result.footer().reads_table)) , m_signal_table_location(make_file_locaton(m_migration_result.footer().signal_table)) , m_run_info_table_reader(std::move(run_info_table_reader)) , m_read_table_reader(std::move(read_table_reader)) , m_signal_table_reader(std::move(signal_table_reader)) { } SchemaMetadataDescription schema_metadata() const override { return m_read_table_reader.schema_metadata(); } Result run_info_count() const override { return m_run_info_table_reader.CountRows(); } virtual Result read_count() const override { auto const batch_count = num_read_record_batches(); if (batch_count == 0) { return 0; } ARROW_ASSIGN_OR_RAISE(auto const first_batch, read_read_record_batch(0)); ARROW_ASSIGN_OR_RAISE(auto const last_batch, read_read_record_batch(batch_count - 1)); return (batch_count - 1) * first_batch.num_rows() + last_batch.num_rows(); } Result read_read_record_batch(std::size_t i) const override { return m_read_table_reader.read_record_batch(i); } std::size_t num_read_record_batches() const override { return m_read_table_reader.num_record_batches(); } Result search_for_read_ids( ReadIdSearchInput const & search_input, gsl::span const & batch_counts, gsl::span const & batch_rows) const override { return m_read_table_reader.search_for_read_ids(search_input, batch_counts, batch_rows); } Result read_signal_record_batch(std::size_t i) const override { return m_signal_table_reader.read_record_batch(i); } std::size_t num_signal_record_batches() const override { return m_signal_table_reader.num_record_batches(); } Result signal_batch_for_row_id(std::size_t row, std::size_t * batch_row) const override { return m_signal_table_reader.signal_batch_for_row_id(row, batch_row); } Result extract_sample_count( gsl::span const & row_indices) const override { return m_signal_table_reader.extract_sample_count(row_indices); } Status extract_samples( gsl::span const & row_indices, gsl::span const & output_samples) const override 
{ return m_signal_table_reader.extract_samples(row_indices, output_samples); } Result>> extract_samples_inplace( gsl::span const & row_indices, std::vector & sample_count) const override { return m_signal_table_reader.extract_samples_inplace(row_indices, sample_count); } FileLocation const & run_info_table_location() const override { return m_run_info_table_location; } FileLocation const & read_table_location() const override { return m_read_table_location; } FileLocation const & signal_table_location() const override { return m_signal_table_location; } Version file_version_pre_migration() const override { return m_file_version_pre_migration; } SignalType signal_type() const override { return m_signal_table_reader.signal_type(); } Result> find_run_info( std::string const & acquisition_id) const override { return m_run_info_table_reader.find_run_info(acquisition_id); } Result> get_run_info(std::size_t index) const override { return m_run_info_table_reader.get_run_info(index); } Result get_run_info_count() const override { return m_run_info_table_reader.get_run_info_count(); } private: Version m_file_version_pre_migration; MigrationResult m_migration_result; FileLocation m_run_info_table_location; FileLocation m_read_table_location; FileLocation m_signal_table_location; RunInfoTableReader m_run_info_table_reader; ReadTableReader m_read_table_reader; SignalTableReader m_signal_table_reader; }; pod5::Result> open_file_reader( std::string const & path, FileReaderOptions const & options) { auto pool = options.memory_pool(); if (!pool) { return Status::Invalid("Invalid memory pool specified for file reader"); } // Issue warning if the file appears to be a stub warn_if_stat_size_and_blocks_differ_significantly(path); // "Preflight" file reads are done via standard file I/O first to prevent SIGBUS errors // if the file is not resident (e.g. stub remains when file is archived). 
// If mmap succeeds afterwards, use it for normal table reads otherwise continue // using this preflight file handle. ARROW_ASSIGN_OR_RAISE(auto preflight_file, arrow::io::ReadableFile::Open(path, pool)); ARROW_ASSIGN_OR_RAISE( auto original_footer_metadata, combined_file_utils::read_footer(path, preflight_file)); std::shared_ptr file = preflight_file; if (!options.force_disable_file_mapping() && getenv("POD5_DISABLE_MMAP_OPEN") == nullptr) { auto file_opt = arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ); if (file_opt.ok()) { file = *file_opt; // Downstream handles are extracted from the `footer.{table}.file`, update them // to use the mmap handle. combined_file_utils::bind_footer_file(original_footer_metadata, file); } } ARROW_ASSIGN_OR_RAISE( auto const original_writer_version, parse_version_number(original_footer_metadata.writer_pod5_version)); ARROW_ASSIGN_OR_RAISE( auto migration_result, migrate_if_required(original_writer_version, original_footer_metadata, file, pool)); // Each table is written as a standalone file, so it needs to be opened with a file offset; the reader seeks around as if each table were a standalone file: ARROW_ASSIGN_OR_RAISE( auto run_info_sub_file, open_sub_file(migration_result.footer().run_info_table)); ARROW_ASSIGN_OR_RAISE( auto run_info_table_reader, make_run_info_table_reader(run_info_sub_file, pool)); ARROW_ASSIGN_OR_RAISE( auto reads_sub_file, open_sub_file(migration_result.footer().reads_table)); ARROW_ASSIGN_OR_RAISE(auto read_table_reader, make_read_table_reader(reads_sub_file, pool)); ARROW_ASSIGN_OR_RAISE( auto signal_sub_file, open_sub_file(migration_result.footer().signal_table)); ARROW_ASSIGN_OR_RAISE( auto signal_table_reader, make_signal_table_reader(signal_sub_file, options.max_cached_signal_table_batches(), pool)); auto signal_metadata = signal_table_reader.schema_metadata(); auto reads_metadata = read_table_reader.schema_metadata(); if (signal_metadata.file_identifier != reads_metadata.file_identifier) { return 
Status::Invalid( "Invalid read and signal file pair signal identifier: ", signal_metadata.file_identifier, ", reads identifier: ", reads_metadata.file_identifier); } return std::make_shared( original_writer_version, std::move(migration_result), std::move(run_info_table_reader), std::move(read_table_reader), std::move(signal_table_reader)); } } // namespace pod5 ================================================ FILE: c++/pod5_format/file_reader.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/read_table_utils.h" #include "pod5_format/result.h" #include "pod5_format/signal_table_utils.h" #include #include namespace arrow { class Array; class Buffer; class MemoryPool; } // namespace arrow namespace pod5 { class Version; struct SchemaMetadataDescription; class POD5_FORMAT_EXPORT FileReaderOptions { public: static constexpr std::uint32_t DEFAULT_MAX_CACHED_SIGNAL_TABLE_BATCHES = 5; FileReaderOptions(); void memory_pool(arrow::MemoryPool * memory_pool) { m_memory_pool = memory_pool; } arrow::MemoryPool * memory_pool() const { return m_memory_pool; } std::size_t max_cached_signal_table_batches() const { return m_max_cached_signal_table_batches; } // Set how many signal table batches can be cached in memory, // Note: 0 here implies no limit. 
void set_max_cached_signal_table_batches(std::size_t max_cached_signal_table_batches); void set_force_disable_file_mapping(bool force_disable_file_mapping) { m_force_disable_file_mapping = force_disable_file_mapping; } bool force_disable_file_mapping() const { return m_force_disable_file_mapping; } private: arrow::MemoryPool * m_memory_pool; std::size_t m_max_cached_signal_table_batches; bool m_force_disable_file_mapping = false; }; class POD5_FORMAT_EXPORT FileLocation { public: FileLocation(std::string const & file_path_, std::size_t offset_, std::size_t size_) : file_path(file_path_) , offset(offset_) , size(size_) { } std::string file_path; std::size_t offset; std::size_t size; }; class ReadTableRecordBatch; class SignalTableRecordBatch; class POD5_FORMAT_EXPORT FileReader { public: virtual ~FileReader() = default; /// \brief Find the read schema metadata for this file. virtual SchemaMetadataDescription schema_metadata() const = 0; virtual Result run_info_count() const = 0; virtual Result read_count() const = 0; virtual Result read_read_record_batch(std::size_t i) const = 0; virtual std::size_t num_read_record_batches() const = 0; virtual Result search_for_read_ids( ReadIdSearchInput const & search_input, gsl::span const & batch_counts, gsl::span const & batch_rows) const = 0; virtual Result read_signal_record_batch(std::size_t i) const = 0; virtual std::size_t num_signal_record_batches() const = 0; virtual Result signal_batch_for_row_id(std::size_t row, std::size_t * batch_row) const = 0; /// \brief Find the number of samples in a given list of rows. /// \param row_indices The rows to query for sample count. /// \returns The sum of all sample counts on input rows. virtual Result extract_sample_count( gsl::span const & row_indices) const = 0; /// \brief Extract the samples for a list of rows. /// \param row_indices The rows to query for samples. /// \param output_samples The output samples from the rows. 
virtual Status extract_samples( gsl::span const & row_indices, gsl::span const & output_samples) const = 0; /// \brief Extract the samples as written in the arrow table for a list of rows. /// \param row_indices The rows to query for samples. /// \param sample_count The output sample count for each row. virtual Result>> extract_samples_inplace( gsl::span const & row_indices, std::vector & sample_count) const = 0; virtual FileLocation const & run_info_table_location() const = 0; virtual FileLocation const & read_table_location() const = 0; virtual FileLocation const & signal_table_location() const = 0; virtual Version file_version_pre_migration() const = 0; virtual SignalType signal_type() const = 0; virtual Result> find_run_info( std::string const & acquisition_id) const = 0; virtual Result> get_run_info(std::size_t index) const = 0; virtual Result get_run_info_count() const = 0; }; POD5_FORMAT_EXPORT pod5::Result> open_file_reader( std::string const & path, FileReaderOptions const & options = {}); } // namespace pod5 ================================================ FILE: c++/pod5_format/file_recovery.h ================================================ #pragma once #include "pod5_format/internal/combined_file_utils.h" #include "pod5_format/schema_metadata.h" #include #include #include #include #include namespace pod5 { static constexpr char const * kArrowMagicBytes = "ARROW1"; struct RecoveredData { // Metadata from the original file: SchemaMetadataDescription metadata; std::size_t recovered_batches = 0; arrow::Status failed_batch_status; std::size_t recovered_rows = 0; }; template arrow::Result recover_arrow_file( std::shared_ptr const & file_to_recover, DestFileType const & destination_file) { // Check for arrow start file: int32_t const magic_size = static_cast(::strlen(kArrowMagicBytes)); ARROW_ASSIGN_OR_RAISE(auto buffer, file_to_recover->ReadAt(0, magic_size)); if (buffer->size() < magic_size || memcmp(buffer->data(), kArrowMagicBytes, magic_size)) { return 
arrow::Status::Invalid("Not an Arrow file"); } // Open the stream format within the ipc file: ARROW_ASSIGN_OR_RAISE( auto input_stream, combined_file_utils::open_sub_file(file_to_recover, 8)); ARROW_ASSIGN_OR_RAISE( auto opened_stream, arrow::ipc::RecordBatchStreamReader::Open(input_stream)); auto const & expected_schema = destination_file->schema(); auto schema = opened_stream->schema(); if (!schema->Equals(*expected_schema, false)) { return arrow::Status::Invalid( "Recovered file Schema does not match expected schema, version mismatch?"); } RecoveredData recovered_data; ARROW_ASSIGN_OR_RAISE( recovered_data.metadata, read_schema_key_value_metadata(schema->metadata())); while (true) { auto result_opt = opened_stream->Next(); // Check if the batch failed to load: if (!result_opt.ok()) { recovered_data.failed_batch_status = result_opt.status(); return recovered_data; } auto & result = *result_opt; if (!result) { break; } recovered_data.recovered_batches += 1; recovered_data.recovered_rows += result->num_rows(); ARROW_RETURN_NOT_OK(destination_file->write_batch(*result)); } return recovered_data; } } // namespace pod5 ================================================ FILE: c++/pod5_format/file_updater.cpp ================================================ #include "pod5_format/file_updater.h" #include "pod5_format/file_reader.h" #include "pod5_format/internal/combined_file_utils.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/uuid.h" #include namespace pod5 { pod5::Status update_file( arrow::MemoryPool * pool, std::shared_ptr const & source, std::string destination) { ARROW_ASSIGN_OR_RAISE(auto main_file, arrow::io::FileOutputStream::Open(destination, false)); std::random_device gen; auto uuid_gen = BasicUuidRandomGenerator{gen}; auto const section_marker = uuid_gen(); auto metadata = source->schema_metadata(); // Write the initial header to the combined file: ARROW_RETURN_NOT_OK(combined_file_utils::write_combined_header(main_file, section_marker)); 
ARROW_ASSIGN_OR_RAISE( auto signal_info_table, combined_file_utils::write_file_and_marker( pool, main_file, source->signal_table_location(), combined_file_utils::SubFileCleanup::LeaveOrignalFile, section_marker)); ARROW_ASSIGN_OR_RAISE( auto run_info_info_table, combined_file_utils::write_file_and_marker( pool, main_file, source->run_info_table_location(), combined_file_utils::SubFileCleanup::LeaveOrignalFile, section_marker)); ARROW_ASSIGN_OR_RAISE( auto reads_info_table, combined_file_utils::write_file_and_marker( pool, main_file, source->read_table_location(), combined_file_utils::SubFileCleanup::LeaveOrignalFile, section_marker)); // Write full file footer: ARROW_RETURN_NOT_OK( combined_file_utils::write_footer( main_file, section_marker, metadata.file_identifier, metadata.writing_software, signal_info_table, run_info_info_table, reads_info_table)); return main_file->Close(); } } // namespace pod5 ================================================ FILE: c++/pod5_format/file_updater.h ================================================ #pragma once #include "pod5_format/result.h" #include namespace arrow { class MemoryPool; } namespace pod5 { class FileReader; /// \brief Write the path [destination] with any migrated data from [source]. /// \param source The source file data to write updated. /// \param destination The destination path to write the data to. /// \note The destination path should not be the same file that was opened for input. 
pod5::Status update_file( arrow::MemoryPool * pool, std::shared_ptr const & source, std::string destination); } // namespace pod5 ================================================ FILE: c++/pod5_format/file_writer.cpp ================================================ #include "pod5_format/file_writer.h" #include "pod5_format/file_recovery.h" #include "pod5_format/internal/async_output_stream.h" #include "pod5_format/internal/combined_file_utils.h" #include "pod5_format/io_manager.h" #include "pod5_format/memory_pool.h" #include "pod5_format/read_table_reader.h" #include "pod5_format/read_table_writer.h" #include "pod5_format/read_table_writer_utils.h" #include "pod5_format/run_info_table_writer.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/signal_table_reader.h" #include "pod5_format/signal_table_writer.h" #include "pod5_format/thread_pool.h" #include "pod5_format/uuid.h" #include "pod5_format/version.h" #include #include #include #include #include #include #include #ifdef __linux__ #include "pod5_format/internal/linux_output_stream.h" #endif namespace { struct CachedFileValues { std::shared_ptr io_manager; std::shared_ptr thread_pool; }; enum class FlushMode { Default, ForceFlushOnBatchComplete }; arrow::Result> make_file_stream( std::string const & path, pod5::FileWriterOptions const & options, CachedFileValues & cached_values, bool keep_file_open, FlushMode flush_mode = FlushMode::Default) { auto const flush_on_batch_complete = flush_mode == FlushMode::ForceFlushOnBatchComplete || options.flush_on_batch_complete(); #ifdef __linux__ if (options.use_directio() || options.use_sync_io()) { if (!cached_values.io_manager) { if (options.io_manager()) { cached_values.io_manager = options.io_manager(); } else { ARROW_ASSIGN_OR_RAISE( cached_values.io_manager, pod5::make_sync_io_manager(options.memory_pool())); } } auto const ret = pod5::LinuxOutputStream::make( path, cached_values.io_manager, options.write_chunk_size(), options.use_directio(), 
options.use_sync_io(), flush_on_batch_complete, keep_file_open); // Failure could be due to direct IO used by LinuxOutputStream not being // supported. On error drop-through to make a regular AsyncOutputStream. if (ret.ok()) { return ret; } } #endif if (!cached_values.thread_pool) { if (options.thread_pool()) { cached_values.thread_pool = options.thread_pool(); } else { cached_values.thread_pool = pod5::make_thread_pool(1); } } return pod5::AsyncOutputStream::make( path, cached_values.thread_pool, flush_on_batch_complete, options.memory_pool(), keep_file_open); } } // namespace namespace pod5 { FileWriterOptions::FileWriterOptions() : m_max_signal_chunk_size(DEFAULT_SIGNAL_CHUNK_SIZE) , m_memory_pool(pod5::default_memory_pool()) , m_signal_type(DEFAULT_SIGNAL_TYPE) , m_signal_table_batch_size(DEFAULT_SIGNAL_TABLE_BATCH_SIZE) , m_read_table_batch_size(DEFAULT_READ_TABLE_BATCH_SIZE) , m_run_info_table_batch_size(DEFAULT_RUN_INFO_TABLE_BATCH_SIZE) , m_use_directio{DEFAULT_USE_DIRECTIO} , m_write_chunk_size(DEFAULT_WRITE_CHUNK_SIZE) , m_use_sync_io(DEFAULT_USE_SYNC_IO) , m_flush_on_batch_complete(DEFAULT_FLUSH_ON_BATCH_COMPLETE) , m_keep_signal_file_open(DEFAULT_KEEP_FILES_OPEN) , m_keep_run_info_file_open(DEFAULT_KEEP_FILES_OPEN) , m_keep_read_table_file_open(DEFAULT_KEEP_FILES_OPEN) { } class FileWriterImpl { public: class WriterTypeImpl; struct DictionaryWriters { std::shared_ptr end_reason_writer; std::shared_ptr pore_writer; std::shared_ptr run_info_writer; }; FileWriterImpl( DictionaryWriters && read_table_dict_writers, RunInfoTableWriter && run_info_table_writer, ReadTableWriter && read_table_writer, SignalTableWriter && signal_table_writer, std::uint32_t signal_chunk_size, arrow::MemoryPool * pool) : m_read_table_dict_writers(std::move(read_table_dict_writers)) , m_run_info_table_writer(std::move(run_info_table_writer)) , m_read_table_writer(std::move(read_table_writer)) , m_signal_table_writer(std::move(signal_table_writer)) , 
m_signal_chunk_size(signal_chunk_size) , m_pool(pool) { } virtual ~FileWriterImpl() = default; virtual std::string path() const = 0; pod5::Result lookup_end_reason(ReadEndReason end_reason) { return m_read_table_dict_writers.end_reason_writer->lookup(end_reason); } pod5::Result add_pore_type(std::string const & pore_type_data) { return m_read_table_dict_writers.pore_writer->add(pore_type_data); } pod5::Result add_run_info(RunInfoData const & run_info_data) { ARROW_RETURN_NOT_OK(m_run_info_table_writer->add_run_info(run_info_data)); return m_read_table_dict_writers.run_info_writer->add(run_info_data.acquisition_id); } pod5::Status add_complete_read( ReadData const & read_data, gsl::span const & signal) { if (!m_signal_table_writer || !m_read_table_writer) { return arrow::Status::Invalid("File writer closed, cannot write further data"); } ARROW_RETURN_NOT_OK(check_read(read_data)); ARROW_ASSIGN_OR_RAISE( std::vector signal_rows, add_signal(read_data.read_id, signal)); // Write read data and signal row entries: auto read_table_row = m_read_table_writer->add_read( read_data, gsl::make_span(signal_rows.data(), signal_rows.size()), signal.size()); return read_table_row.status(); } pod5::Status add_complete_read( ReadData const & read_data, gsl::span const & signal_rows, std::uint64_t signal_duration) { if (!m_signal_table_writer || !m_read_table_writer) { return arrow::Status::Invalid("File writer closed, cannot write further data"); } ARROW_RETURN_NOT_OK(check_read(read_data)); // Write read data and signal row entries: auto read_table_row = m_read_table_writer->add_read(read_data, signal_rows, signal_duration); return read_table_row.status(); } arrow::Status check_read(ReadData const & read_data) { if (!m_read_table_dict_writers.run_info_writer->is_valid(read_data.run_info)) { return arrow::Status::Invalid("Invalid run info passed to add_read"); } if (!m_read_table_dict_writers.pore_writer->is_valid(read_data.pore_type)) { return arrow::Status::Invalid("Invalid pore 
type passed to add_read"); } if (!m_read_table_dict_writers.end_reason_writer->is_valid(read_data.end_reason)) { return arrow::Status::Invalid("Invalid end reason passed to add_read"); } return arrow::Status::OK(); } pod5::Result> add_signal( Uuid const & read_id, gsl::span const & signal) { if (!m_signal_table_writer || !m_read_table_writer) { return arrow::Status::Invalid("File writer closed, cannot write further data"); } std::vector signal_rows; signal_rows.reserve((signal.size() / m_signal_chunk_size) + 1); // Chunk and write each piece of signal to the file: for (std::size_t chunk_start = 0; chunk_start < signal.size(); chunk_start += m_signal_chunk_size) { std::size_t chunk_size = std::min(signal.size() - chunk_start, m_signal_chunk_size); auto const chunk_span = signal.subspan(chunk_start, chunk_size); ARROW_ASSIGN_OR_RAISE( auto row_index, m_signal_table_writer->add_signal(read_id, chunk_span)); signal_rows.push_back(row_index); } return signal_rows; } pod5::Result add_pre_compressed_signal( Uuid const & read_id, gsl::span const & signal_bytes, std::uint32_t sample_count) { if (!m_signal_table_writer || !m_read_table_writer) { return arrow::Status::Invalid("File writer closed, cannot write further data"); } return m_signal_table_writer->add_pre_compressed_signal( read_id, signal_bytes, sample_count); } pod5::Result> add_signal_batch( std::size_t row_count, std::vector> && columns, bool final_batch) { if (!m_signal_table_writer || !m_read_table_writer) { return arrow::Status::Invalid("File writer closed, cannot write further data"); } return m_signal_table_writer->add_signal_batch(row_count, std::move(columns), final_batch); } SignalType signal_type() const { return m_signal_table_writer->signal_type(); } std::size_t signal_table_batch_size() const { return m_signal_table_writer->table_batch_size(); } pod5::Status close_run_info_table_writer() { if (m_run_info_table_writer) { ARROW_RETURN_NOT_OK(m_run_info_table_writer->close()); m_run_info_table_writer = 
std::nullopt; } return pod5::Status::OK(); } pod5::Status close_read_table_writer() { if (m_read_table_writer) { ARROW_RETURN_NOT_OK(m_read_table_writer->close()); m_read_table_writer = std::nullopt; } return pod5::Status::OK(); } pod5::Status close_signal_table_writer() { if (m_signal_table_writer) { ARROW_RETURN_NOT_OK(m_signal_table_writer->close()); m_signal_table_writer = std::nullopt; } return pod5::Status::OK(); } virtual arrow::Status close() = 0; bool is_closed() const { assert(!!m_read_table_writer == !!m_signal_table_writer); return !m_signal_table_writer; } arrow::MemoryPool * pool() const { return m_pool; } RunInfoTableWriter * run_info_table_writer() { if (is_closed() || !m_run_info_table_writer.has_value()) { return nullptr; } return &m_run_info_table_writer.value(); } ReadTableWriter * read_table_writer() { if (is_closed() || !m_read_table_writer.has_value()) { return nullptr; } return &m_read_table_writer.value(); } SignalTableWriter * signal_table_writer() { if (is_closed() || !m_signal_table_writer.has_value()) { return nullptr; } return &m_signal_table_writer.value(); } private: DictionaryWriters m_read_table_dict_writers; std::optional m_run_info_table_writer; std::optional m_read_table_writer; std::optional m_signal_table_writer; std::uint32_t m_signal_chunk_size; arrow::MemoryPool * m_pool; }; class CombinedFileWriterImpl : public FileWriterImpl { public: CombinedFileWriterImpl( std::string const & path, std::string const & run_info_tmp_path, std::string const & reads_tmp_path, std::int64_t signal_file_start_offset, Uuid const & section_marker, Uuid const & file_identifier, std::string const & software_name, DictionaryWriters && dict_writers, RunInfoTableWriter && run_info_table_writer, ReadTableWriter && read_table_writer, SignalTableWriter && signal_table_writer, std::uint32_t signal_chunk_size, arrow::MemoryPool * pool) : FileWriterImpl( std::move(dict_writers), std::move(run_info_table_writer), std::move(read_table_writer), 
std::move(signal_table_writer), signal_chunk_size, pool) , m_path(path) , m_run_info_tmp_path(run_info_tmp_path) , m_reads_tmp_path(reads_tmp_path) , m_signal_file_start_offset(signal_file_start_offset) , m_section_marker(section_marker) , m_file_identifier(file_identifier) , m_software_name(software_name) { } std::string path() const override { return m_path; } arrow::Status close() override { if (is_closed()) { return arrow::Status::OK(); } ARROW_RETURN_NOT_OK(close_run_info_table_writer()); ARROW_RETURN_NOT_OK(close_read_table_writer()); ARROW_RETURN_NOT_OK(close_signal_table_writer()); // Open main path with append set: ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::FileOutputStream::Open(m_path, true)); // Record signal table length: combined_file_utils::FileInfo signal_table; signal_table.file_start_offset = m_signal_file_start_offset; ARROW_ASSIGN_OR_RAISE(signal_table.file_length, file->Tell()); signal_table.file_length -= signal_table.file_start_offset; // pad file to 8 bytes and mark section: ARROW_RETURN_NOT_OK(combined_file_utils::pad_file(file, 8)); ARROW_RETURN_NOT_OK(combined_file_utils::write_section_marker(file, m_section_marker)); auto file_location_for_full_file = [&](std::string const & filename) -> arrow::Result { ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(filename, pool())); ARROW_ASSIGN_OR_RAISE(auto size, file->GetSize()); return FileLocation{filename, 0, std::size_t(size)}; }; // Write in run_info table: ARROW_ASSIGN_OR_RAISE( auto run_info_location, file_location_for_full_file(m_run_info_tmp_path)); ARROW_ASSIGN_OR_RAISE( auto run_info_info_table, combined_file_utils::write_file_and_marker( pool(), file, run_info_location, combined_file_utils::SubFileCleanup::CleanupOriginalFile, m_section_marker)); // Write in read table: ARROW_ASSIGN_OR_RAISE(auto reads_location, file_location_for_full_file(m_reads_tmp_path)); ARROW_ASSIGN_OR_RAISE( auto reads_info_table, combined_file_utils::write_file_and_marker( pool(), file, 
reads_location, combined_file_utils::SubFileCleanup::CleanupOriginalFile, m_section_marker)); // Write full file footer: ARROW_RETURN_NOT_OK( combined_file_utils::write_footer( file, m_section_marker, m_file_identifier, m_software_name, signal_table, run_info_info_table, reads_info_table)); return arrow::Status::OK(); } private: std::string m_path; std::string m_run_info_tmp_path; std::string m_reads_tmp_path; std::int64_t m_signal_file_start_offset; Uuid m_section_marker; Uuid m_file_identifier; std::string m_software_name; }; FileWriter::FileWriter(std::unique_ptr && impl) : m_impl(std::move(impl)) {} FileWriter::~FileWriter() { (void)close(); } std::string FileWriter::path() const { return m_impl->path(); } arrow::Status FileWriter::close() { return m_impl->close(); } arrow::Status FileWriter::add_complete_read( ReadData const & read_data, gsl::span const & signal) { return m_impl->add_complete_read(read_data, signal); } arrow::Status FileWriter::add_complete_read( ReadData const & read_data, gsl::span const & signal_rows, std::uint64_t signal_duration) { return m_impl->add_complete_read(read_data, signal_rows, signal_duration); } pod5::Result> FileWriter::add_signal( Uuid const & read_id, gsl::span const & signal) { return m_impl->add_signal(read_id, signal); } pod5::Result FileWriter::add_pre_compressed_signal( Uuid const & read_id, gsl::span const & signal_bytes, std::uint32_t sample_count) { return m_impl->add_pre_compressed_signal(read_id, signal_bytes, sample_count); } pod5::Result> FileWriter::add_signal_batch( std::size_t row_count, std::vector> && columns, bool final_batch) { return m_impl->add_signal_batch(row_count, std::move(columns), final_batch); } pod5::Result FileWriter::lookup_end_reason(ReadEndReason end_reason) const { return m_impl->lookup_end_reason(end_reason); } pod5::Result FileWriter::add_pore_type(std::string const & pore_type_data) { return m_impl->add_pore_type(pore_type_data); } pod5::Result FileWriter::add_run_info(RunInfoData const 
& run_info_data) { return m_impl->add_run_info(run_info_data); } SignalType FileWriter::signal_type() const { return m_impl->signal_type(); } std::size_t FileWriter::signal_table_batch_size() const { return m_impl->signal_table_batch_size(); } pod5::Result make_dictionary_writers(arrow::MemoryPool * pool) { FileWriterImpl::DictionaryWriters writers; ARROW_ASSIGN_OR_RAISE(writers.end_reason_writer, pod5::make_end_reason_writer(pool)); ARROW_ASSIGN_OR_RAISE(writers.pore_writer, pod5::make_pore_writer(pool)); ARROW_ASSIGN_OR_RAISE(writers.run_info_writer, pod5::make_run_info_writer(pool)); return writers; } std::string make_reads_tmp_path( ::arrow::internal::PlatformFilename const & arrow_path, Uuid const & file_identifier) { return arrow_path.Parent().ToString() + "/" + ("." + to_string(file_identifier) + ".tmp-reads"); } std::string make_run_info_tmp_path( ::arrow::internal::PlatformFilename const & arrow_path, Uuid const & file_identifier) { return arrow_path.Parent().ToString() + "/" + ("." 
+ to_string(file_identifier) + ".tmp-run-info"); } pod5::Result> create_file_writer( std::string const & path, std::string const & writing_software_name, FileWriterOptions const & options) { auto pool = options.memory_pool(); if (!pool) { return Status::Invalid("Invalid memory pool specified for file writer"); } ARROW_ASSIGN_OR_RAISE(auto arrow_path, ::arrow::internal::PlatformFilename::FromString(path)); ARROW_ASSIGN_OR_RAISE(bool file_exists, arrow::internal::FileExists(arrow_path)); if (file_exists) { return Status::Invalid("Unable to create new file '", path, "', already exists"); } // Open dictionary writers: ARROW_ASSIGN_OR_RAISE(auto dict_writers, make_dictionary_writers(pool)); // Prep file metadata: std::random_device gen; auto uuid_gen = BasicUuidRandomGenerator{gen}; auto const section_marker = uuid_gen(); auto const file_identifier = uuid_gen(); ARROW_ASSIGN_OR_RAISE(auto current_version, parse_version_number(Pod5Version)); ARROW_ASSIGN_OR_RAISE( auto file_schema_metadata, make_schema_key_value_metadata({file_identifier, writing_software_name, current_version})); auto reads_tmp_path = make_reads_tmp_path(arrow_path, file_identifier); auto run_info_tmp_path = make_run_info_tmp_path(arrow_path, file_identifier); CachedFileValues cached_values; // Prepare the temporary reads file: ARROW_ASSIGN_OR_RAISE( auto read_table_file_async, make_file_stream( reads_tmp_path, options, cached_values, options.keep_read_table_file_open())); ARROW_ASSIGN_OR_RAISE( auto read_table_tmp_writer, make_read_table_writer( read_table_file_async, file_schema_metadata, options.read_table_batch_size(), dict_writers.pore_writer, dict_writers.end_reason_writer, dict_writers.run_info_writer, pool)); // Prepare the temporary run_info file: // // Run info is normally global, if we don't flush on batch complete we can // lose a large number of reads in a crash. 
ARROW_ASSIGN_OR_RAISE( auto run_info_table_file_async, make_file_stream( run_info_tmp_path, options, cached_values, options.keep_run_info_file_open(), FlushMode::ForceFlushOnBatchComplete)); ARROW_ASSIGN_OR_RAISE( auto run_info_table_tmp_writer, make_run_info_table_writer( run_info_table_file_async, file_schema_metadata, options.run_info_table_batch_size(), pool)); // Prepare the main file - and set up the signal table to write here: ARROW_ASSIGN_OR_RAISE( auto signal_file, make_file_stream(path, options, cached_values, options.keep_signal_file_open())); // Write the initial header to the combined file: ARROW_RETURN_NOT_OK(combined_file_utils::write_combined_header(signal_file, section_marker)); ARROW_ASSIGN_OR_RAISE(size_t const signal_table_start, signal_file->Tell()); static_cast(signal_file.get())->set_file_start_offset(signal_table_start); // Then place the signal file directly after that: ARROW_ASSIGN_OR_RAISE( auto signal_table_writer, make_signal_table_writer( signal_file, file_schema_metadata, options.signal_table_batch_size(), options.signal_type(), pool)); // Throw it all together into a writer object: return std::make_unique(std::make_unique( path, run_info_tmp_path, reads_tmp_path, signal_table_start, section_marker, file_identifier, writing_software_name, std::move(dict_writers), std::move(run_info_table_tmp_writer), std::move(read_table_tmp_writer), std::move(signal_table_writer), options.max_signal_chunk_size(), pool)); } static Status add_recovery_failure_context( Status status, std::string const & tmp_path, std::string const & description) { assert(!status.ok()); std::string const error_context = "Failed whilst attempting to recover " + description + " from file - " + tmp_path; if (status.detail()) { return status.WithMessage(error_context); } return arrow::Status::FromArgs(status.code(), error_context + ". 
Detail: " + status.message()); } template static Status append_recovered_file( std::string const & tmp_path, writer_type const & destination_writer, std::string const & description, arrow::MemoryPool * const pool) { arrow::Status inner_status = [&] { ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(tmp_path, pool)); ARROW_ASSIGN_OR_RAISE(auto size, file->GetSize()); if (size == 0) { return arrow::Status::Invalid("File is empty/zero bytes long."); } ARROW_ASSIGN_OR_RAISE( RecoveredData const recovered_raw_data, recover_arrow_file(file, destination_writer)); return arrow::Status::OK(); }(); if (!inner_status.ok()) { return add_recovery_failure_context(inner_status, tmp_path, description); } return inner_status; } namespace { struct TemporaryFilePaths { std::string run_info; std::string reads; }; } // namespace static pod5::Status recover_file( std::string const & src_path, std::string const & dest_path, FileWriterOptions const & options, TemporaryFilePaths & temporary_file_paths) { if (!check_extension_types_registered()) { return arrow::Status::Invalid("POD5 library is not correctly initialised."); } // Create a file to push recovered data into: ARROW_ASSIGN_OR_RAISE( auto dest_file, create_file_writer(dest_path, "pod5_file_recovery", options)); auto pool = arrow::default_memory_pool(); ARROW_ASSIGN_OR_RAISE( auto arrow_path, ::arrow::internal::PlatformFilename::FromString(src_path)); ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(src_path, pool)); // Signature should be right at 0: ARROW_RETURN_NOT_OK(combined_file_utils::check_signature(file, 0)); // Recover the signal data into [dest_file]: arrow::Result recovered_raw_data; { ARROW_ASSIGN_OR_RAISE( auto raw_sub_file, combined_file_utils::open_sub_file(file, combined_file_utils::header_size)); recovered_raw_data = recover_arrow_file(raw_sub_file, dest_file->impl()->signal_table_writer()); } if (!recovered_raw_data.ok()) { return add_recovery_failure_context( 
recovered_raw_data.status(), arrow_path.ToString(), "signal data sub file"); } auto file_identifier = recovered_raw_data->metadata.file_identifier; temporary_file_paths.run_info = make_run_info_tmp_path(arrow_path, file_identifier); temporary_file_paths.reads = make_reads_tmp_path(arrow_path, file_identifier); // Recover the run info data into [dest_file]: auto run_info_writer = dest_file->impl()->run_info_table_writer(); ARROW_RETURN_NOT_OK(append_recovered_file( temporary_file_paths.run_info, run_info_writer, "run information", pool)); // Recover the read data into [dest_file]: auto read_writer = dest_file->impl()->read_table_writer(); ARROW_RETURN_NOT_OK( append_recovered_file(temporary_file_paths.reads, read_writer, "reads", pool)); return dest_file->close(); } /// This is a thorough count of all rows. Doing it this way ensures that all rows can be read. static pod5::Result count_recovered_rows( std::filesystem::path const & recovered_path) { ARROW_ASSIGN_OR_RAISE( std::shared_ptr recovered, pod5::open_file_reader(recovered_path.string())); RecoveredRowCounts counts; std::size_t const signal_batches = recovered->num_signal_record_batches(); for (std::size_t index = 0; index < signal_batches; ++index) { ARROW_ASSIGN_OR_RAISE(auto const signal_batch, recovered->read_signal_record_batch(index)); counts.signal += signal_batch.num_rows(); } ARROW_ASSIGN_OR_RAISE(counts.run_info, recovered->run_info_count()); std::size_t const read_batches = recovered->num_read_record_batches(); for (std::size_t index = 0; index < read_batches; ++index) { ARROW_ASSIGN_OR_RAISE(auto const record_batch, recovered->read_read_record_batch(index)); counts.reads += record_batch.num_rows(); } return counts; } /// \brief File is considered useless for recovery if it is 0 bytes long /// or if all bytes have value 0. 
static bool is_useless(std::filesystem::path const & file_path) { if (file_size(file_path) == 0) { return true; } std::ifstream file{file_path, std::ios::in | std::ios::binary}; if (!file.is_open()) { // If we can't open the file, assume there is data, just in case. return false; } while (true) { std::uint8_t byte; file >> std::noskipws >> byte; if (file.eof()) { return true; } if (byte != 0) { return false; } } } static void remove_if_useless(std::filesystem::path const & file_path) { if (exists(file_path) && is_useless(file_path)) { remove(file_path); } } static std::optional try_remove(std::string const & file_path) { try { std::filesystem::remove(file_path); return {}; } catch (std::filesystem::filesystem_error const & error) { return CleanupError{.file_path = file_path, .description = error.what()}; } } static Status add_clean_up_error(Status status, std::filesystem::filesystem_error const & exception) { return arrow::Status::FromArgs(status.code(), status.message(), exception.what()); } pod5::Result recover_file( std::string const & src_path, std::string const & dest_path, RecoverFileOptions const & options) { TemporaryFilePaths temp_file_paths; auto const result = [&]() -> pod5::Result { ARROW_RETURN_NOT_OK( recover_file(src_path, dest_path, options.file_writer_options, temp_file_paths)); auto const row_count_result = count_recovered_rows(dest_path); if (!row_count_result.ok()) { return add_recovery_failure_context(row_count_result.status(), dest_path, "row counts"); } return row_count_result; }(); if (!options.cleanup) { auto const to_recovery_details = [](RecoveredRowCounts counts) { return RecoveryDetails{counts}; }; return result.Map(to_recovery_details); } if (!result.ok()) { try { if (std::filesystem::exists(dest_path)) { std::filesystem::remove(dest_path); } remove_if_useless(temp_file_paths.reads); remove_if_useless(temp_file_paths.run_info); remove_if_useless(src_path); } catch (std::filesystem::filesystem_error const & error) { return 
add_clean_up_error(result.status(), error); } return result.status(); } RecoveryDetails details{.row_counts = *result}; if (auto const error = try_remove(src_path)) { details.cleanup_errors.push_back(*error); } if (auto const error = try_remove(temp_file_paths.reads)) { details.cleanup_errors.push_back(*error); } if (auto const error = try_remove(temp_file_paths.run_info)) { details.cleanup_errors.push_back(*error); } return details; } } // namespace pod5 ================================================ FILE: c++/pod5_format/file_writer.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/read_table_utils.h" #include "pod5_format/result.h" #include "pod5_format/signal_table_utils.h" #include #include namespace arrow { class Array; class MemoryPool; } // namespace arrow namespace pod5 { class IOManager; class ThreadPool; class POD5_FORMAT_EXPORT FileWriterOptions { public: /// \brief Default chunk size for signal table entries static constexpr std::uint32_t DEFAULT_SIGNAL_CHUNK_SIZE = 102'400; static constexpr std::uint32_t DEFAULT_SIGNAL_TABLE_BATCH_SIZE = 100; static constexpr std::uint32_t DEFAULT_READ_TABLE_BATCH_SIZE = 1000; static constexpr std::uint32_t DEFAULT_RUN_INFO_TABLE_BATCH_SIZE = 1; static constexpr SignalType DEFAULT_SIGNAL_TYPE = SignalType::VbzSignal; static constexpr bool DEFAULT_USE_DIRECTIO = false; static constexpr bool DEFAULT_USE_SYNC_IO = false; static constexpr bool DEFAULT_FLUSH_ON_BATCH_COMPLETE = true; static constexpr std::size_t DEFAULT_WRITE_CHUNK_SIZE = 2 * 1024 * 1024; static constexpr std::size_t DEFAULT_KEEP_FILES_OPEN = true; FileWriterOptions(); void set_max_signal_chunk_size(std::uint32_t chunk_size) { m_max_signal_chunk_size = chunk_size; } std::uint32_t max_signal_chunk_size() const { return m_max_signal_chunk_size; } void set_memory_pool(arrow::MemoryPool * memory_pool) { m_memory_pool = memory_pool; } arrow::MemoryPool * memory_pool() const { 
return m_memory_pool; } void set_signal_type(SignalType signal_type) { m_signal_type = signal_type; } SignalType signal_type() const { return m_signal_type; } void set_signal_table_batch_size(std::size_t batch_size) { m_signal_table_batch_size = batch_size; } std::size_t signal_table_batch_size() const { return m_signal_table_batch_size; } void set_read_table_batch_size(std::size_t batch_size) { m_read_table_batch_size = batch_size; } std::size_t read_table_batch_size() const { return m_read_table_batch_size; } void set_run_info_table_batch_size(std::size_t batch_size) { m_run_info_table_batch_size = batch_size; } std::size_t run_info_table_batch_size() const { return m_run_info_table_batch_size; } void set_io_manager(std::shared_ptr const & io_manager) { m_io_manager = io_manager; } std::shared_ptr io_manager() const { return m_io_manager; } void set_thread_pool(std::shared_ptr const & writer_thread_pool) { m_writer_thread_pool = writer_thread_pool; } std::shared_ptr thread_pool() const { return m_writer_thread_pool; } void set_use_directio(bool use_directio) { m_use_directio = use_directio; } bool use_directio() const { return m_use_directio; } void set_write_chunk_size(std::size_t chunk_size) { m_write_chunk_size = chunk_size; } std::size_t write_chunk_size() const { return m_write_chunk_size; } void set_use_sync_io(bool use_sync_io) { m_use_sync_io = use_sync_io; } bool use_sync_io() const { return m_use_sync_io; } void set_flush_on_batch_complete(bool flush_on_batch_complete) { m_flush_on_batch_complete = flush_on_batch_complete; } bool flush_on_batch_complete() const { return m_flush_on_batch_complete; } bool keep_signal_file_open() const { return m_keep_signal_file_open; } void set_keep_signal_file_open(bool keep_signal_file_open) { m_keep_signal_file_open = keep_signal_file_open; } bool keep_run_info_file_open() const { return m_keep_run_info_file_open; } void set_keep_run_info_file_open(bool keep_run_info_file_open) { m_keep_run_info_file_open = 
keep_run_info_file_open; } bool keep_read_table_file_open() const { return m_keep_read_table_file_open; } void set_keep_read_table_file_open(bool keep_read_table_file_open) { m_keep_read_table_file_open = keep_read_table_file_open; } private: std::shared_ptr<ThreadPool> m_writer_thread_pool; std::shared_ptr<IOManager> m_io_manager; std::uint32_t m_max_signal_chunk_size; arrow::MemoryPool * m_memory_pool; SignalType m_signal_type; std::size_t m_signal_table_batch_size; std::size_t m_read_table_batch_size; std::size_t m_run_info_table_batch_size; bool m_use_directio; std::size_t m_write_chunk_size; bool m_use_sync_io; bool m_flush_on_batch_complete; bool m_keep_signal_file_open; bool m_keep_run_info_file_open; bool m_keep_read_table_file_open; }; class FileWriterImpl; class POD5_FORMAT_EXPORT FileWriter { public: FileWriter(std::unique_ptr<FileWriterImpl> && impl); ~FileWriter(); std::string path() const; pod5::Status close(); pod5::Status add_complete_read( ReadData const & read_data, gsl::span const & signal); /// \brief Add a complete read whose signal rows have already been appended. pod5::Status add_complete_read( ReadData const & read_data, gsl::span const & signal_rows, std::uint64_t signal_duration); pod5::Result> add_signal( Uuid const & read_id, gsl::span const & signal); pod5::Result add_pre_compressed_signal( Uuid const & read_id, gsl::span const & signal_bytes, std::uint32_t sample_count); pod5::Result> add_signal_batch( std::size_t row_count, std::vector> && columns, bool final_batch); // Find or create an end reason index representing this read end reason. 
pod5::Result lookup_end_reason(ReadEndReason end_reason) const; pod5::Result add_pore_type(std::string const & pore_type_data); pod5::Result add_run_info(RunInfoData const & run_info_data); SignalType signal_type() const; std::size_t signal_table_batch_size() const; FileWriterImpl * impl() const { return m_impl.get(); }; private: std::unique_ptr m_impl; }; POD5_FORMAT_EXPORT pod5::Result> create_file_writer( std::string const & path, std::string const & writing_software_name, FileWriterOptions const & options = {}); struct POD5_FORMAT_EXPORT RecoverFileOptions { FileWriterOptions file_writer_options = {}; /// If this is set to true, recover_file will remove the following files /// * Temp files which we have successfully recovered data from. /// * Temp files which we have failed to recover data from and which hold no data. /// * Output file created during failed recovery. bool cleanup = false; }; struct POD5_FORMAT_EXPORT RecoveredRowCounts final { std::size_t signal = 0; std::size_t run_info = 0; std::size_t reads = 0; }; struct POD5_FORMAT_EXPORT CleanupError final { std::string file_path; std::string description; }; struct POD5_FORMAT_EXPORT RecoveryDetails final { RecoveredRowCounts row_counts; std::vector cleanup_errors; }; POD5_FORMAT_EXPORT pod5::Result recover_file( std::string const & src_path, std::string const & dest_path, RecoverFileOptions const & options = {}); } // namespace pod5 ================================================ FILE: c++/pod5_format/flatbuffers/footer.fbs ================================================ namespace Minknow.ReadsFormat; enum ContentType:short { // The Reads table (an Arrow table) ReadsTable, // The Signal table (an Arrow table) SignalTable, // An index for looking up data in the ReadsTable by read_id ReadIdIndex, // An index based on other columns and/or tables (it will need to be opened to find out what it indexes) OtherIndex, // The Run Info table (an Arrow table) RunInfoTable, } enum Format:short { // The Apache 
Feather V2 format, also known as the Apache Arrow IPC File format. FeatherV2, } // Describes an embedded file. table EmbeddedFile { // The start of the embedded file offset: int64; // The length of the embedded file (excluding any padding) length: int64; // The format of the file format: Format; // What contents should be expected in the file content_type: ContentType; } table Footer { // Must match the "MINKNOW:file_identifier" custom metadata entry in the schemas of the bundled tables. file_identifier: string; // A free-form description of the software that wrote the file, intended to help pin down the source of files that violate the specification. software: string; // The version of this specification that the table schemas are based on (1.0.0). pod5_version: string; // The Apache Arrow tables stored in the file. contents: [ EmbeddedFile ]; } ================================================ FILE: c++/pod5_format/internal/async_output_stream.h ================================================ #pragma once #include "pod5_format/file_output_stream.h" #include "pod5_format/internal/tracing/tracing.h" #include "pod5_format/thread_pool.h" #include #include #include #include #include #include #include namespace pod5 { class AsyncOutputStream : public FileOutputStream { struct PrivateDummy {}; public: static arrow::Result> make( std::string const & file_path, std::shared_ptr const & thread_pool, bool flush_on_batch_complete, arrow::MemoryPool * memory_pool = arrow::default_memory_pool(), bool keep_file_open = true) { return std::make_shared( file_path, thread_pool, flush_on_batch_complete, memory_pool, keep_file_open, PrivateDummy{}); } ~AsyncOutputStream() { (void)Close(); } arrow::Status Close() override { // flush all output ARROW_RETURN_NOT_OK(Flush()); // and close stream std::lock_guard l{m_file_handle_mutex}; if (m_file_handle) { fclose(m_file_handle); m_file_handle = nullptr; } return arrow::Status::OK(); } arrow::Future<> CloseAsync() override { 
ARROW_RETURN_NOT_OK(Close()); return FileOutputStream::CloseAsync(); } arrow::Status Abort() override { std::lock_guard l{m_file_handle_mutex}; if (m_file_handle) { fclose(m_file_handle); m_file_handle = nullptr; } return arrow::Status::OK(); } arrow::Result Tell() const override { return m_actual_bytes_written - m_file_start_offset; } bool closed() const override { return m_file_handle == nullptr; } arrow::Status Write(void const * data, int64_t nbytes) override { POD5_TRACE_FUNCTION(); ARROW_ASSIGN_OR_RAISE( std::shared_ptr buffer, arrow::AllocateBuffer(nbytes, m_memory_pool)); auto const char_data = static_cast(data); std::copy(char_data, char_data + nbytes, buffer->mutable_data()); return Write(buffer); } arrow::Status Write(std::shared_ptr const & data) override { POD5_TRACE_FUNCTION(); if (m_has_error) { return error(); } std::size_t const BUFFER_SIZE = 10 * 1024 * 1024; // Allow at most 10 MB of pending writes. while ((m_submitted_byte_writes - m_completed_byte_writes) > BUFFER_SIZE) { // Stop waiting if a queued write failed, as the pending bytes will never complete. if (m_has_error) { return error(); } std::this_thread::sleep_for(std::chrono::milliseconds(5)); } m_submitted_byte_writes += data->size(); m_actual_bytes_written += data->size(); m_submitted_writes += 1; m_strand->post([&, data] { POD5_TRACE_FUNCTION(); if (m_has_error) { return; } std::lock_guard l{m_file_handle_mutex}; auto file_handle = get_or_open_file_handle(l); if (!file_handle) { set_error(arrow::Status::IOError("Failed to open file handle for writing")); return; } if (fwrite(data->data(), 1, (std::size_t)data->size(), file_handle) != (std::size_t)data->size()) { set_error(arrow::Status::IOError("Failed to write data to file")); return; } m_completed_byte_writes += data->size(); // Ensure we do this after editing all the other members, in order to prevent `Flush` // returning until we are done. 
m_completed_writes += 1; // Close the file handle if we do not have further writes pending: if (m_submitted_writes == m_completed_writes) { close_file_handle(l); } }); return arrow::Status::OK(); } arrow::Status Flush() override { POD5_TRACE_FUNCTION(); // Wait for our completed writes to match our submitted writes, // this guarantees our async operations are finished. auto wait_for_write_count = m_submitted_writes.load(); while (m_completed_writes.load() < wait_for_write_count && !m_has_error) { std::this_thread::sleep_for(std::chrono::microseconds(10)); } if (m_has_error) { return error(); } // No file handle so nothing to flush std::lock_guard l{m_file_handle_mutex}; if (!m_file_handle) { return arrow::Status::OK(); } if (fflush(m_file_handle) != 0) { return arrow::Status::IOError("Error flushing file"); } return arrow::Status::OK(); } void set_file_start_offset(std::size_t val) override { m_file_start_offset = val; } arrow::Status batch_complete() override { if (m_flush_on_batch_complete) { return Flush(); } return arrow::Status::OK(); } AsyncOutputStream( std::string const & file_path, std::shared_ptr const & thread_pool, bool flush_on_batch_complete, arrow::MemoryPool * memory_pool, bool keep_file_open, PrivateDummy) : m_has_error{false} , m_submitted_writes{0} , m_completed_writes{0} , m_submitted_byte_writes{0} , m_completed_byte_writes{0} , m_actual_bytes_written{0} , m_flush_on_batch_complete(flush_on_batch_complete) , m_file_path(file_path) , m_keep_file_open(keep_file_open) , m_file_start_offset{0} , m_strand{thread_pool->create_strand()} , m_memory_pool(memory_pool) { m_file_handle = fopen(m_file_path.c_str(), "wb"); if (!m_file_handle) { set_error(arrow::Status::IOError("Failed to open file for writing: ", errno)); } else if (!m_keep_file_open) { fclose(m_file_handle); m_file_handle = nullptr; } } private: arrow::MemoryPool * memory_pool() { return m_memory_pool; } FILE * get_or_open_file_handle([[maybe_unused]] std::lock_guard & lock) { if (m_file_handle) 
{ return m_file_handle; } m_file_handle = fopen(m_file_path.c_str(), "ab"); return m_file_handle; } void close_file_handle([[maybe_unused]] std::lock_guard & lock) { if (m_file_handle && !m_keep_file_open) { fclose(m_file_handle); m_file_handle = nullptr; } } void set_error(arrow::Status status) { assert(!status.ok()); { std::lock_guard l{m_error_mutex}; m_error = std::move(status); } m_has_error = true; } arrow::Status error() const { std::lock_guard l{m_error_mutex}; return m_error; } std::atomic m_has_error; std::atomic m_submitted_writes; std::atomic m_completed_writes; std::atomic m_submitted_byte_writes; std::atomic m_completed_byte_writes; // this represents the number of data bytes written (excluding any padding for alignment) // used for truncating the file for instance std::int64_t m_actual_bytes_written; bool m_flush_on_batch_complete; std::string m_file_path; std::mutex m_file_handle_mutex; FILE * m_file_handle{nullptr}; bool m_keep_file_open{false}; mutable std::mutex m_error_mutex; arrow::Status m_error; std::size_t m_file_start_offset; std::shared_ptr m_strand; arrow::MemoryPool * m_memory_pool; }; } // namespace pod5 ================================================ FILE: c++/pod5_format/internal/combined_file_utils.h ================================================ #pragma once #include "pod5_flatbuffers/footer_generated.h" #include "pod5_format/file_reader.h" #include "pod5_format/result.h" #include "pod5_format/uuid.h" #include "pod5_format/version.h" #include #include #include #include #include #include namespace pod5 { namespace combined_file_utils { static constexpr std::array FILE_SIGNATURE{'\213', 'P', 'O', 'D', '\r', '\n', '\032', '\n'}; static constexpr std::size_t header_size = 24; // signature 8 bytes, section marker 16 bytes inline pod5::Status pad_file( std::shared_ptr const & sink, std::uint32_t pad_to_size) { ARROW_ASSIGN_OR_RAISE(auto const current_byte_location, sink->Tell()); auto const bytes_to_write = pad_to_size - 
(current_byte_location % pad_to_size); if (bytes_to_write == pad_to_size) { return pod5::Status::OK(); } std::array zeroes{}; return sink->Write(zeroes.data(), bytes_to_write); } inline pod5::Status write_file_signature(std::shared_ptr const & sink) { return sink->Write(FILE_SIGNATURE.data(), FILE_SIGNATURE.size()); } inline pod5::Status write_section_marker( std::shared_ptr const & sink, Uuid const & section_marker) { return sink->Write(section_marker.data(), section_marker.size()); } inline pod5::Status write_combined_header( std::shared_ptr const & sink, Uuid const & section_marker) { ARROW_RETURN_NOT_OK(write_file_signature(sink)); return write_section_marker(sink, section_marker); } inline pod5::Status write_footer_magic(std::shared_ptr const & sink) { return sink->Write("FOOTER\0\0", 8); } struct FileInfo { std::int64_t file_start_offset = 0; std::int64_t file_length = 0; }; struct ParsedFileInfo : FileInfo { std::string file_path; std::shared_ptr file; arrow::Status from_full_file(std::string in_file_path) { file_path = in_file_path; ARROW_ASSIGN_OR_RAISE( file, arrow::io::MemoryMappedFile::Open(in_file_path, arrow::io::FileMode::READ)); file_start_offset = 0; ARROW_ASSIGN_OR_RAISE(file_length, file->GetSize()); return arrow::Status::OK(); } }; inline pod5::Result write_footer_flatbuffer( std::shared_ptr const & sink, Uuid const & file_identifier, std::string const & software_name, FileInfo const & signal_table, FileInfo const & run_info_table, FileInfo const & reads_table) { flatbuffers::FlatBufferBuilder builder(1024); auto signal_file = Minknow::ReadsFormat::CreateEmbeddedFile( builder, signal_table.file_start_offset, signal_table.file_length, Minknow::ReadsFormat::Format_FeatherV2, Minknow::ReadsFormat::ContentType_SignalTable); auto run_info_file = Minknow::ReadsFormat::CreateEmbeddedFile( builder, run_info_table.file_start_offset, run_info_table.file_length, Minknow::ReadsFormat::Format_FeatherV2, Minknow::ReadsFormat::ContentType_RunInfoTable); auto 
reads_file = Minknow::ReadsFormat::CreateEmbeddedFile( builder, reads_table.file_start_offset, reads_table.file_length, Minknow::ReadsFormat::Format_FeatherV2, Minknow::ReadsFormat::ContentType_ReadsTable); std::vector> const files{ signal_file, run_info_file, reads_file}; auto footer = Minknow::ReadsFormat::CreateFooterDirect( builder, to_string(file_identifier).c_str(), software_name.c_str(), Pod5Version.c_str(), &files); builder.Finish(footer); ARROW_RETURN_NOT_OK(sink->Write(builder.GetBufferPointer(), builder.GetSize())); return builder.GetSize(); } inline pod5::Status write_footer( std::shared_ptr const & sink, Uuid const & section_marker, Uuid const & file_identifier, std::string const & software_name, FileInfo const & signal_table, FileInfo const & run_info_table, FileInfo const & reads_table) { ARROW_RETURN_NOT_OK(write_footer_magic(sink)); ARROW_ASSIGN_OR_RAISE( std::int64_t length, write_footer_flatbuffer( sink, file_identifier, software_name, signal_table, run_info_table, reads_table)); ARROW_RETURN_NOT_OK(pad_file(sink, 8)); std::int64_t paded_flatbuffer_size = arrow::bit_util::ToLittleEndian(length); ARROW_RETURN_NOT_OK(sink->Write(&paded_flatbuffer_size, sizeof(paded_flatbuffer_size))); ARROW_RETURN_NOT_OK(write_section_marker(sink, section_marker)); return write_file_signature(sink); } struct ParsedFooter { Uuid file_identifier; std::string software_name; std::string writer_pod5_version; ParsedFileInfo run_info_table; ParsedFileInfo reads_table; ParsedFileInfo signal_table; }; inline void bind_footer_file( ParsedFooter & footer, std::shared_ptr const & file) { footer.reads_table.file = file; footer.run_info_table.file = file; footer.signal_table.file = file; } inline pod5::Status check_signature( std::shared_ptr const & file, std::int64_t offset_in_file) { std::array read_signature; ARROW_ASSIGN_OR_RAISE( auto read_bytes, file->ReadAt(offset_in_file, read_signature.size(), read_signature.data())); if (read_bytes != 
(std::int16_t)read_signature.size() || read_signature != FILE_SIGNATURE) { return arrow::Status::IOError("Invalid signature in file"); } return arrow::Status::OK(); } inline pod5::Result read_footer_flatbuffer( std::vector const & footer_data) { auto verifier = flatbuffers::Verifier(footer_data.data(), footer_data.size()); if (!verifier.VerifyBuffer()) { return arrow::Status::IOError("Invalid footer found in file"); } return flatbuffers::GetRoot(footer_data.data()); } inline pod5::Result read_footer( std::string const & file_path, std::shared_ptr const & file) { // Verify signature at start and end of file: ARROW_RETURN_NOT_OK(check_signature(file, 0)); ARROW_ASSIGN_OR_RAISE(auto const file_size, file->GetSize()); ARROW_RETURN_NOT_OK(check_signature(file, file_size - FILE_SIGNATURE.size())); auto footer_length_data_end = file_size; footer_length_data_end -= FILE_SIGNATURE.size(); footer_length_data_end -= sizeof(Uuid); std::int64_t footer_length = 0; ARROW_RETURN_NOT_OK(file->ReadAt( footer_length_data_end - sizeof(footer_length), sizeof(footer_length), &footer_length)); footer_length = arrow::bit_util::FromLittleEndian(footer_length); if (footer_length < 0 || static_cast(footer_length) > footer_length_data_end - sizeof(footer_length)) { return arrow::Status::IOError("Invalid footer length"); } std::vector footer_data; footer_data.resize(footer_length); ARROW_ASSIGN_OR_RAISE( auto read_bytes, file->ReadAt( footer_length_data_end - sizeof(footer_length) - footer_length, footer_length, footer_data.data())); if (read_bytes != footer_length) { return arrow::Status::IOError("Failed to read footer data"); } ARROW_ASSIGN_OR_RAISE(auto fb_footer, read_footer_flatbuffer(footer_data)); ParsedFooter footer; if (!fb_footer->file_identifier()) { return arrow::Status::IOError("Invalid footer file_identifier"); } auto const identifier = Uuid::from_string(fb_footer->file_identifier()->str()); if (!identifier) { return Status::IOError( "Invalid file_identifier in file: '", 
fb_footer->file_identifier()->str(), "'"); } footer.file_identifier = *identifier; if (!fb_footer->software()) { return arrow::Status::IOError("Invalid footer software"); } footer.software_name = fb_footer->software()->str(); if (!fb_footer->pod5_version()) { return arrow::Status::IOError("Invalid footer pod5_version"); } footer.writer_pod5_version = fb_footer->pod5_version()->str(); if (!fb_footer->contents()) { return arrow::Status::IOError("Invalid footer contents"); } for (auto const embedded_file : *fb_footer->contents()) { if (embedded_file->format() != Minknow::ReadsFormat::Format_FeatherV2) { return arrow::Status::IOError("Invalid embedded file format"); } switch (embedded_file->content_type()) { case Minknow::ReadsFormat::ContentType_RunInfoTable: footer.run_info_table.file_start_offset = embedded_file->offset(); footer.run_info_table.file_length = embedded_file->length(); footer.run_info_table.file = file; footer.run_info_table.file_path = file_path; break; case Minknow::ReadsFormat::ContentType_ReadsTable: footer.reads_table.file_start_offset = embedded_file->offset(); footer.reads_table.file_length = embedded_file->length(); footer.reads_table.file = file; footer.reads_table.file_path = file_path; break; case Minknow::ReadsFormat::ContentType_SignalTable: footer.signal_table.file_start_offset = embedded_file->offset(); footer.signal_table.file_length = embedded_file->length(); footer.signal_table.file = file; footer.signal_table.file_path = file_path; break; default: return arrow::Status::IOError("Unknown embedded file type"); } } return footer; } class SubFile : public arrow::io::internal::RandomAccessFileConcurrencyWrapper { public: SubFile( std::shared_ptr main_file, std::int64_t sub_file_offset, std::int64_t sub_file_length) : m_file(std::move(main_file)) , m_sub_file_offset(sub_file_offset) , m_sub_file_length(sub_file_length) { } protected: arrow::Status DoClose() { return m_file->Close(); } bool closed() const override { return m_file->closed(); 
} arrow::Result DoTell() const { ARROW_ASSIGN_OR_RAISE(auto t, m_file->Tell()); return t - m_sub_file_offset; } arrow::Status DoSeek(int64_t offset) { if (offset < 0 || offset > m_sub_file_length) { return arrow::Status::IOError("Invalid offset into SubFile"); } offset += m_sub_file_offset; return m_file->Seek(offset); } arrow::Result DoRead(int64_t length, void * data) { ARROW_ASSIGN_OR_RAISE(auto pos, m_file->Tell()); int64_t const remaining = m_sub_file_offset + m_sub_file_length - pos; length = std::min(remaining, length); return m_file->Read(length, data); } arrow::Result> DoRead(int64_t length) { ARROW_ASSIGN_OR_RAISE(auto pos, m_file->Tell()); int64_t const remaining = m_sub_file_offset + m_sub_file_length - pos; length = std::min(remaining, length); return m_file->Read(length); } Result DoReadAt(int64_t position, int64_t nbytes, void * out) { if (position < 0 || position > m_sub_file_length) { return arrow::Status::IOError("Invalid offset into SubFile"); } int64_t const remaining = m_sub_file_length - position; nbytes = std::min(nbytes, remaining); return m_file->ReadAt(position + m_sub_file_offset, nbytes, out); } Result> DoReadAt(int64_t position, int64_t nbytes) { if (position < 0 || position > m_sub_file_length) { return arrow::Status::IOError("Invalid offset into SubFile"); } int64_t const remaining = m_sub_file_length - position; nbytes = std::min(nbytes, remaining); return m_file->ReadAt(position + m_sub_file_offset, nbytes); } arrow::Result DoGetSize() { return m_sub_file_length; } private: friend RandomAccessFileConcurrencyWrapper; std::shared_ptr m_file; std::int64_t m_sub_file_offset; std::int64_t m_sub_file_length; }; inline arrow::Result> open_sub_file(ParsedFileInfo file_info) { if (!file_info.file) { return arrow::Status::Invalid("Failed to open file from footer"); } ARROW_ASSIGN_OR_RAISE(auto file_size, file_info.file->GetSize()); if (file_info.file_length < 0 || file_info.file_length > file_size || file_info.file_start_offset > file_size - 
file_info.file_length) { return arrow::Status::Invalid("Bad footer info"); } // Restrict our open file to just the requested section: auto sub_file = std::make_shared<SubFile>( file_info.file, file_info.file_start_offset, file_info.file_length); ARROW_RETURN_NOT_OK(sub_file->Seek(0)); return sub_file; } inline arrow::Result> open_sub_file( std::shared_ptr const & file, std::size_t offset) { if (!file) { return arrow::Status::Invalid("Failed to open file from footer"); } ARROW_ASSIGN_OR_RAISE(auto file_size, file->GetSize()); // Restrict our open file to everything from the given offset onwards: auto sub_file = std::make_shared<SubFile>(file, offset, file_size - offset); ARROW_RETURN_NOT_OK(sub_file->Seek(0)); return sub_file; } enum class SubFileCleanup { CleanupOriginalFile, LeaveOrignalFile }; inline arrow::Result write_file( arrow::MemoryPool * pool, std::shared_ptr const & file, FileLocation const & file_location, SubFileCleanup cleanup_mode) { combined_file_utils::FileInfo table_data; // Record file start location in bytes within the main file: ARROW_ASSIGN_OR_RAISE(table_data.file_start_offset, file->Tell()); { // Stream the source table file into the main file: ARROW_ASSIGN_OR_RAISE( auto reads_table_file_in, arrow::io::ReadableFile::Open(file_location.file_path, pool)); ARROW_RETURN_NOT_OK(reads_table_file_in->Seek(file_location.offset)); std::int64_t copied_bytes = 0; std::int64_t target_chunk_size = 10 * 1024 * 1024; // Read in 10MB of data at a time while (copied_bytes < std::int64_t(file_location.size)) { std::size_t const to_read = std::min(file_location.size - copied_bytes, target_chunk_size); ARROW_ASSIGN_OR_RAISE(auto const read_buffer, reads_table_file_in->Read(to_read)); // A zero-length read means the source is shorter than expected; fail rather than loop forever. if (read_buffer->size() == 0) { return arrow::Status::IOError("Unexpected end of file while copying table data"); } copied_bytes += read_buffer->size(); ARROW_RETURN_NOT_OK(file->Write(read_buffer)); } // Store the copied table length for later reading: ARROW_ASSIGN_OR_RAISE(table_data.file_length, file->Tell()); table_data.file_length -= table_data.file_start_offset; } if (cleanup_mode == SubFileCleanup::CleanupOriginalFile) { // 
Clean up the tmp read path: ARROW_ASSIGN_OR_RAISE( auto arrow_path, ::arrow::internal::PlatformFilename::FromString(file_location.file_path)); ARROW_RETURN_NOT_OK(arrow::internal::DeleteFile(arrow_path)); } return table_data; } inline arrow::Result write_file_and_marker( arrow::MemoryPool * pool, std::shared_ptr const & file, FileLocation const & file_location, SubFileCleanup cleanup_mode, Uuid const & section_marker) { ARROW_ASSIGN_OR_RAISE(auto file_info, write_file(pool, file, file_location, cleanup_mode)); // Pad file to 8 bytes and mark section: ARROW_RETURN_NOT_OK(combined_file_utils::pad_file(file, 8)); ARROW_RETURN_NOT_OK(combined_file_utils::write_section_marker(file, section_marker)); return file_info; } }} // namespace pod5::combined_file_utils ================================================ FILE: c++/pod5_format/internal/linux_output_stream.h ================================================ #pragma once #include "pod5_format/file_output_stream.h" #include "pod5_format/internal/tracing/tracing.h" #include "pod5_format/io_manager.h" #include #include #include #include #include #ifdef __linux__ #include #include #endif namespace pod5 { namespace { constexpr size_t fallocate_chunk = 50 * 256 * IOManager::Alignment; // 50MB } // namespace #ifdef __linux__ class LinuxOutputStream : public FileOutputStream { struct PrivateDummy {}; public: static arrow::Result> make( std::string const & file_path, std::shared_ptr const & io_manager, std::size_t write_chunk_size, bool use_directio, bool use_syncio, bool flush_on_batch_complete, bool keep_file_open = true) { auto flags = O_RDWR | O_CREAT; if (use_directio) { flags |= O_DIRECT; } if (use_syncio) { flags |= O_SYNC; } auto const initial_file_descriptor = open(file_path.c_str(), flags, 0644); if (initial_file_descriptor < 0) { return arrow::Status::Invalid("Failed to open file"); } return std::make_shared( file_path, initial_file_descriptor, flags, io_manager, write_chunk_size, keep_file_open, 
flush_on_batch_complete, PrivateDummy{}); } ~LinuxOutputStream() { (void)Close(); } arrow::Status Close() override { // flush all output ARROW_RETURN_NOT_OK(Flush()); while (!m_queued_writes.empty()) { ARROW_RETURN_NOT_OK(process_queued_writes()); if (!m_queued_writes.empty()) { ARROW_RETURN_NOT_OK(m_io_manager->wait_for_event(std::chrono::seconds(1))); } } std::lock_guard l{m_file_handle_mutex}; ARROW_ASSIGN_OR_RAISE(auto const file_descriptor, get_or_open_fd(l)); // truncate excess data if (::ftruncate(file_descriptor, m_bytes_written) < 0) { return arrow::Status::IOError("Failed to truncate file"); } // and close stream return close_fd(l, true); } arrow::Future<> CloseAsync() override { return Close(); } arrow::Status Abort() override { std::lock_guard l{m_file_handle_mutex}; return close_fd(l, true); } arrow::Result Tell() const override { return m_bytes_written - m_file_start_offset; } bool closed() const override { return m_file_descriptor == -1; } arrow::Status Write(void const * data, int64_t nbytes) override { ARROW_RETURN_NOT_OK(allocate_file_space(nbytes)); auto remaining_data = gsl::make_span(reinterpret_cast(data), nbytes); while (!remaining_data.empty()) { ARROW_ASSIGN_OR_RAISE( remaining_data, m_aligned_buffer.consume_until_full(remaining_data)); if (m_aligned_buffer.is_full()) { ARROW_RETURN_NOT_OK(flush_writes(FlushMode::AlignedWrites)); } } m_bytes_written += nbytes; return arrow::Status::OK(); } arrow::Status Write(std::shared_ptr const & data) override { ARROW_RETURN_NOT_OK(allocate_file_space(data->size())); auto remaining_data = gsl::make_span(data->data(), data->size()); while (!remaining_data.empty()) { ARROW_ASSIGN_OR_RAISE( remaining_data, m_aligned_buffer.consume_until_full(remaining_data)); if (m_aligned_buffer.is_full()) { ARROW_RETURN_NOT_OK(flush_writes(FlushMode::AlignedWrites)); } } m_bytes_written += data->size(); return arrow::Status::OK(); } arrow::Status batch_complete() override { if (m_flush_on_batch_complete) { return 
flush_writes(FlushMode::AllWrites); } return arrow::Status::OK(); } arrow::Status Flush() override { ARROW_RETURN_NOT_OK(flush_writes(FlushMode::AllWrites)); std::lock_guard l{m_file_handle_mutex}; if (m_file_descriptor < 0) { return arrow::Status::OK(); } if (fsync(m_file_descriptor) < 0) { return arrow::Status::IOError("Error flushing file"); } return arrow::Status::OK(); } void set_file_start_offset(std::size_t val) override { m_file_start_offset = val; } void set_flush_on_batch_complete(bool flush_on_batch_complete) { m_flush_on_batch_complete = flush_on_batch_complete; } LinuxOutputStream( std::string const & file_path, int initial_file_descriptor, int flags, std::shared_ptr const & io_manager, std::size_t write_chunk_size, bool keep_file_open, bool flush_on_batch_complete, PrivateDummy) : m_file_path(file_path) , m_flags(flags) , m_file_descriptor(initial_file_descriptor) , m_keep_file_open(keep_file_open) , m_aligned_buffer(write_chunk_size, io_manager) , m_io_manager(io_manager) , m_flush_on_batch_complete(flush_on_batch_complete) { if (!m_keep_file_open) { close(m_file_descriptor); m_file_descriptor = -1; } } protected: enum class FlushMode { AllWrites, AlignedWrites }; class AlignedBuffer { public: AlignedBuffer(std::size_t capacity, std::shared_ptr const & io_manager) : m_io_manager(io_manager) , m_capacity(capacity) { } // Copy input span to the end of the buffer until this buffer is full. // // Return any remaining buffer. 
arrow::Result> consume_until_full( gsl::span input) { ARROW_RETURN_NOT_OK(ensure_next_write()); auto & buffer = m_next_write->get_buffer(); assert((std::size_t)buffer.capacity() >= m_capacity); auto const remaining_buffer_bytes = buffer.capacity() - buffer.size(); auto const to_copy = std::min(input.size(), (std::size_t)remaining_buffer_bytes); std::copy( input.begin(), input.begin() + to_copy, buffer.mutable_data() + buffer.size()); ARROW_RETURN_NOT_OK(buffer.Resize(buffer.size() + to_copy, false)); return input.subspan(to_copy); } // Find if the buffer is full (m_size == m_capacity) bool is_full() const { if (!m_next_write) { return false; } return (std::size_t)m_next_write->get_buffer().size() == m_capacity; } arrow::Result> release_all_writes_and_align( std::size_t * out_aligned_write_size) { ARROW_RETURN_NOT_OK(ensure_next_write()); *out_aligned_write_size = aligned_write_size(m_next_write->get_buffer().size()); auto result_write = std::move(m_next_write); auto & result_write_buffer = result_write->get_buffer(); ARROW_ASSIGN_OR_RAISE(m_next_write, m_io_manager->allocate_new_write(m_capacity)); auto & next_write_buffer = m_next_write->get_buffer(); std::copy( result_write_buffer.data() + *out_aligned_write_size, result_write_buffer.data() + result_write_buffer.size(), next_write_buffer.mutable_data()); ARROW_RETURN_NOT_OK(next_write_buffer.Resize( result_write_buffer.size() - *out_aligned_write_size, false)); // Ensure the result write buffer is aligned to our write alignment. 
auto const result_unaligned_size = result_write_buffer.size(); auto result_aligned_size = result_unaligned_size + (-result_unaligned_size & (IOManager::Alignment - 1)); ARROW_RETURN_NOT_OK(result_write_buffer.Resize(result_aligned_size, false)); assert(result_write_buffer.size() % IOManager::Alignment == 0); result_write->set_state(QueuedWrite::WriteState::ReadyForWrite); return result_write; } arrow::Result> release_aligned_writes() { ARROW_RETURN_NOT_OK(ensure_next_write()); auto result_write = std::move(m_next_write); auto & result_write_buffer = result_write->get_buffer(); ARROW_ASSIGN_OR_RAISE(m_next_write, m_io_manager->allocate_new_write(m_capacity)); auto & next_write_buffer = m_next_write->get_buffer(); auto aligned_size = aligned_write_size(result_write_buffer.size()); std::copy( result_write_buffer.data() + aligned_size, result_write_buffer.data() + result_write_buffer.size(), next_write_buffer.mutable_data()); /* Size the next buffer before truncating the result buffer: the unaligned tail length is computed from the result buffer's original size. */ ARROW_RETURN_NOT_OK( next_write_buffer.Resize(result_write_buffer.size() - aligned_size, false)); ARROW_RETURN_NOT_OK(result_write_buffer.Resize(aligned_size, false)); result_write->set_state(QueuedWrite::WriteState::ReadyForWrite); return result_write; } private: std::size_t aligned_write_size(std::size_t input_size) const { return (input_size / IOManager::Alignment) * IOManager::Alignment; } arrow::Status ensure_next_write() { if (m_next_write) { return arrow::Status::OK(); } ARROW_ASSIGN_OR_RAISE(m_next_write, m_io_manager->allocate_new_write(m_capacity)); assert((std::size_t)m_next_write->get_buffer().capacity() >= m_capacity); return arrow::Status::OK(); } std::shared_ptr m_next_write; std::shared_ptr m_io_manager; std::size_t m_capacity; }; arrow::Result get_or_open_fd([[maybe_unused]] std::lock_guard & lock) { if (m_file_descriptor >= 0) { return m_file_descriptor; } m_file_descriptor = open(m_file_path.c_str(), m_flags, 0644); if (m_file_descriptor < 0) { return arrow::Status::IOError("Failed to open file for writing"); } return 
m_file_descriptor; } arrow::Status close_fd([[maybe_unused]] std::lock_guard & lock, bool force = false) { if (m_keep_file_open && !force) { return arrow::Status::OK(); } if (close(m_file_descriptor) != 0) { return arrow::Status::IOError("Error closing file"); } m_file_descriptor = -1; return arrow::Status::OK(); } arrow::Status flush_writes(FlushMode flush_mode) { std::size_t write_offset{}; std::shared_ptr released_data; if (flush_mode == FlushMode::AllWrites) { std::size_t aligned_write_size = 0; ARROW_ASSIGN_OR_RAISE( released_data, m_aligned_buffer.release_all_writes_and_align(&aligned_write_size)); write_offset = m_bytes_submitted_to_manager; m_bytes_submitted_to_manager += aligned_write_size; } else if (flush_mode == FlushMode::AlignedWrites) { ARROW_ASSIGN_OR_RAISE(released_data, m_aligned_buffer.release_aligned_writes()); write_offset = m_bytes_submitted_to_manager; m_bytes_submitted_to_manager += released_data->get_buffer().size(); } else { assert(false); return arrow::Status::Invalid("Invalid FlushMode Passed."); } assert(released_data->get_buffer().size() % IOManager::Alignment == 0); if (released_data->get_buffer().size() == 0) { return arrow::Status::OK(); } std::lock_guard lock(m_file_handle_mutex); ARROW_ASSIGN_OR_RAISE(auto const file_descriptor, get_or_open_fd(lock)); released_data->prepare_for_write(file_descriptor, write_offset); m_queued_writes.emplace_back(released_data); ARROW_RETURN_NOT_OK(m_io_manager->write_buffer(std::move(released_data))); ARROW_RETURN_NOT_OK(process_queued_writes()); if (m_queued_writes.empty()) { return close_fd(lock); } else { // If we have queued writes, we keep the file open. 
return arrow::Status::OK(); } } arrow::Status process_queued_writes() { for (auto it = m_queued_writes.begin(); it != m_queued_writes.end();) { if ((*it)->state() == QueuedWrite::WriteState::Completed) { ARROW_RETURN_NOT_OK(m_io_manager->return_used_write(std::move(*it))); it = m_queued_writes.erase(it); } else { ++it; } } return arrow::Status::OK(); } arrow::Status allocate_file_space(std::size_t new_write_size) { auto new_total_size = m_bytes_written + new_write_size; if (new_total_size > m_fallocate_offset) { // reserve more space before continuing m_fallocate_offset += fallocate_chunk; std::lock_guard lock(m_file_handle_mutex); ARROW_ASSIGN_OR_RAISE(auto const file_descriptor, get_or_open_fd(lock)); // If this fails, we will just write less optimally, so we ignore the result. ::fallocate(file_descriptor, 0, m_fallocate_offset, fallocate_chunk); } return arrow::Status::OK(); } std::string m_file_path; int m_flags; std::mutex m_file_handle_mutex; int m_file_descriptor; bool m_keep_file_open{false}; AlignedBuffer m_aligned_buffer; std::vector> m_queued_writes; std::shared_ptr m_io_manager; std::size_t m_fallocate_offset{0}; std::size_t m_file_start_offset{0}; std::size_t m_bytes_written{0}; std::size_t m_bytes_submitted_to_manager{0}; bool m_flush_on_batch_complete; }; #endif } // namespace pod5 ================================================ FILE: c++/pod5_format/internal/tracing/tracing.h ================================================ #pragma once #define POD5_TRACE_FUNCTION() ================================================ FILE: c++/pod5_format/io_manager.cpp ================================================ #include "pod5_format/io_manager.h" #ifdef __linux__ #include #endif namespace pod5 { #ifdef __linux__ class IOManagerSyncImpl : public IOManager { public: IOManagerSyncImpl(arrow::MemoryPool * memory_pool) : m_memory_pool(memory_pool) {} arrow::Result> allocate_new_write(std::size_t capacity) override { if (m_queued_writes.size()) { auto new_write = 
m_queued_writes.back(); m_queued_writes.pop_back(); ARROW_RETURN_NOT_OK(new_write->reset_queued_write()); ARROW_RETURN_NOT_OK(new_write->get_buffer().Reserve(capacity)); assert((std::size_t)new_write->get_buffer().capacity() >= capacity); return new_write; } ARROW_ASSIGN_OR_RAISE( std::unique_ptr buffer, arrow::AllocateResizableBuffer(capacity, IOManager::Alignment, m_memory_pool)); ARROW_RETURN_NOT_OK(buffer->Resize(0, false)); assert((std::size_t)buffer->capacity() >= capacity); assert((std::size_t)buffer->size() == 0); return std::make_shared(std::move(buffer)); } arrow::Status return_used_write(std::shared_ptr && used_write) override { if (m_queued_writes.size() < CachedBufferCount) { m_queued_writes.push_back(std::move(used_write)); } used_write.reset(); return arrow::Status::OK(); } arrow::Status write_buffer(std::shared_ptr && data) override { auto result = lseek(data->file_descriptor(), data->file_offset(), SEEK_SET); if (result < 0) { return arrow::Status::IOError("Error seeking in file"); } result = write(data->file_descriptor(), data->get_buffer().data(), data->get_buffer().size()); if (result < 0) { return arrow::Status::IOError( "Error writing to file: ", errno, " desc: ", data->file_descriptor(), " offset: ", data->file_offset(), " size: ", data->get_buffer().size()); } data->set_state(QueuedWrite::WriteState::Completed); return {}; } private: arrow::MemoryPool * m_memory_pool; std::vector> m_queued_writes; }; arrow::Result> make_sync_io_manager(arrow::MemoryPool * memory_pool) { return std::make_shared(memory_pool); } #endif } // namespace pod5 ================================================ FILE: c++/pod5_format/io_manager.h ================================================ #pragma once #include #include #include #ifdef __linux__ #include #endif #include #include #include #include namespace pod5 { #ifdef __linux__ class QueuedWrite { public: QueuedWrite() = default; QueuedWrite(std::unique_ptr && buffer) : m_buffer(std::move(buffer)) {} 
arrow::Status reset_queued_write() { assert(m_state != WriteState::ReadyForWrite); assert(m_state != WriteState::InFlight); m_iovec = {}; m_state = WriteState::Empty; m_file_offset = -1; m_file_descriptor = -1; return m_buffer->Resize(0, false); } void prepare_for_write(int file_descriptor, std::uint64_t offset) { m_file_descriptor = file_descriptor; m_file_offset = offset; m_iovec = {.iov_base = m_buffer->mutable_data(), .iov_len = (std::size_t)m_buffer->size()}; set_state(WriteState::ReadyForWrite); } arrow::ResizableBuffer & get_buffer() { return *m_buffer; } arrow::Buffer const & get_buffer() const { return *m_buffer; } int file_descriptor() const { return m_file_descriptor; } std::uint64_t file_offset() const { return m_file_offset; } iovec * get_iovec_for_buffer() { return &m_iovec; } enum class WriteState { Empty, ReadyForWrite, InFlight, Completed }; WriteState state() const { return m_state; } void set_state(WriteState state) { m_state = state; } private: std::unique_ptr m_buffer; std::uint64_t m_file_offset{(std::uint64_t)-1}; iovec m_iovec{}; int m_file_descriptor{-1}; WriteState m_state{WriteState::Empty}; }; #endif class IOManager { public: constexpr static size_t Alignment = 4096; // buffer alignment (for block devices) constexpr static size_t CachedBufferCount = 5; virtual ~IOManager() = default; #ifdef __linux__ virtual arrow::Result> allocate_new_write( std::size_t capacity) = 0; virtual arrow::Status return_used_write(std::shared_ptr && used_write) = 0; virtual arrow::Status write_buffer(std::shared_ptr && data) = 0; virtual arrow::Status wait_for_event(std::chrono::nanoseconds timeout) { return {}; } #endif }; #ifdef __linux__ arrow::Result> make_sync_io_manager( arrow::MemoryPool * memory_pool = arrow::default_memory_pool()); #endif } // namespace pod5 ================================================ FILE: c++/pod5_format/memory_pool.cpp ================================================ #include "memory_pool.h" #ifdef _WIN32 #include #elif 
!defined(__FreeBSD__) #include #endif namespace { // Referenced from the jemalloc source: // https://github.com/jemalloc/jemalloc/blob/b82333fdec6e5833f88780fcf1fc50b799268e1b/src/pages.c#L596C1-L616C2 size_t os_page_detect(void) { #ifdef _WIN32 SYSTEM_INFO si; GetSystemInfo(&si); return si.dwPageSize; #elif defined(__FreeBSD__) /* * This returns the value obtained from * the auxv vector, avoiding a syscall. */ return getpagesize(); #else long result = sysconf(_SC_PAGESIZE); if (result == -1) { return 4096 * 16; // Default to safe, large page size } return (size_t)result; #endif } } // namespace namespace pod5 { arrow::MemoryPool * default_memory_pool() { // Default to system memory pool for systems with large pages: if (os_page_detect() > 4096) { return arrow::system_memory_pool(); } return arrow::default_memory_pool(); } } // namespace pod5 ================================================ FILE: c++/pod5_format/memory_pool.h ================================================ #pragma once #include namespace pod5 { /// \brief Find a memory pool that should be used by default when opening or creating a pod5 file. /// \note This function differs from the arrow equivalent by not using jemalloc on systems with large /// pages, which jemalloc does not support. 
arrow::MemoryPool * default_memory_pool(); } // namespace pod5 ================================================ FILE: c++/pod5_format/migration/migration.cpp ================================================ #include "pod5_format/migration/migration.h" #include namespace pod5 { static bool registered_delete_at_exit_called = false; std::vector registered_delete_at_exit_paths; void register_delete_at_exit(arrow::internal::PlatformFilename const & path) { registered_delete_at_exit_paths.push_back(path); if (!registered_delete_at_exit_called) { std::atexit([] { std::size_t delete_failed = 0; for (auto const & path : registered_delete_at_exit_paths) { auto result = ::arrow::internal::DeleteDirTree(path); if (!result.ok()) { delete_failed += 1; } } if (delete_failed > 0) { std::cerr << "Warning: Failed to remove " << delete_failed << " temporary migration directories at exit.\n"; } }); registered_delete_at_exit_called = true; } } Result> MakeTmpDir(char const * suffix) { std::default_random_engine gen( static_cast(arrow::internal::GetRandomSeed())); for (std::uint32_t counter = 0; counter < 5; ++counter) { std::string tmp_path = std::string{".tmp_"} + suffix; tmp_path += "_" + std::to_string(gen()); ARROW_ASSIGN_OR_RAISE( auto filename, arrow::internal::PlatformFilename::FromString(tmp_path)); ARROW_ASSIGN_OR_RAISE(auto created, CreateDir(filename)); if (created) { return std::make_unique(std::move(filename)); } } return arrow::Status::Invalid("Failed to make temporary directory"); } } // namespace pod5 ================================================ FILE: c++/pod5_format/migration/migration.h ================================================ #pragma once #include "pod5_format/internal/combined_file_utils.h" #include "pod5_format/result.h" #include "pod5_format/schema_utils.h" #include #include namespace pod5 { void register_delete_at_exit(arrow::internal::PlatformFilename const & path); class TemporaryDir { public: TemporaryDir(arrow::internal::PlatformFilename && path) 
: m_path(path) {} ~TemporaryDir() { cleanup(); } arrow::internal::PlatformFilename const & path() { return m_path; }; void cleanup() { if (m_path.ToString().empty()) { return; } auto result = ::arrow::internal::DeleteDirTree(m_path); if (!result.ok()) { // Defer deleting this directory until exit, when all open file handles should be closed. register_delete_at_exit(m_path); } else { m_path = {}; } } private: arrow::internal::PlatformFilename m_path; }; Result> MakeTmpDir(char const * suffix); class MigrationResult { public: MigrationResult(combined_file_utils::ParsedFooter const & footer) : m_footer(footer) {} MigrationResult(MigrationResult &&) = default; MigrationResult & operator=(MigrationResult &&) = default; MigrationResult(MigrationResult const &) = delete; MigrationResult & operator=(MigrationResult const &) = delete; combined_file_utils::ParsedFooter & footer() { return m_footer; } combined_file_utils::ParsedFooter const & footer() const { return m_footer; } void add_temp_dir(std::unique_ptr && temp_dir) { m_temp_dirs.emplace_back(std::move(temp_dir)); } private: // This is declared first so it is cleaned up last, after the // footer and any open files it contains are destroyed. 
std::vector> m_temp_dirs; combined_file_utils::ParsedFooter m_footer; }; arrow::Result migrate_v0_to_v1( MigrationResult && v0_input, arrow::MemoryPool * pool); arrow::Result migrate_v1_to_v2( MigrationResult && v1_input, arrow::MemoryPool * pool); arrow::Result migrate_v2_to_v3( MigrationResult && v2_input, arrow::MemoryPool * pool); arrow::Result migrate_v3_to_v4( MigrationResult && v2_input, arrow::MemoryPool * pool); inline arrow::Result migrate_if_required( Version writer_version, combined_file_utils::ParsedFooter const & read_footer, std::shared_ptr const & source, arrow::MemoryPool * pool) { MigrationResult result{read_footer}; if (writer_version < Version(0, 0, 24)) { // Added fields for read scaling ARROW_ASSIGN_OR_RAISE(result, migrate_v0_to_v1(std::move(result), pool)); } if (writer_version < Version(0, 0, 32)) { // Added num samples field ARROW_ASSIGN_OR_RAISE(result, migrate_v1_to_v2(std::move(result), pool)); } if (writer_version < Version(0, 0, 38)) { // Flattening fields ARROW_ASSIGN_OR_RAISE(result, migrate_v2_to_v3(std::move(result), pool)); } if (writer_version < Version(0, 3, 30)) { // Flattening fields ARROW_ASSIGN_OR_RAISE(result, migrate_v3_to_v4(std::move(result), pool)); } return result; } } // namespace pod5 ================================================ FILE: c++/pod5_format/migration/migration_utils.h ================================================ #pragma once #include "pod5_format/internal/combined_file_utils.h" #include "pod5_format/result.h" #include "pod5_format/schema_metadata.h" #include #include #include #include #include namespace pod5 { template arrow::Result> make_filled_array(arrow::MemoryPool * pool, std::size_t row_count, U default_value) { // Minimal iterator to repeat the same value N times. // TODO: replace with std::views::repeat in C++23 struct RepeatIter { // These are necessary to make |std::distance| and |std::copy| fast. 
using iterator_category [[maybe_unused]] = std::random_access_iterator_tag; using value_type = U const; using difference_type = std::int64_t; using pointer [[maybe_unused]] = value_type *; using reference = value_type &; std::size_t m_idx; value_type m_value; constexpr RepeatIter & operator++() { m_idx++; return *this; } constexpr RepeatIter operator++(int) { RepeatIter retval = *this; m_idx++; return retval; } constexpr bool operator==(RepeatIter const & other) const { return m_idx == other.m_idx; } constexpr bool operator!=(RepeatIter const & other) const { return !operator==(other); } constexpr difference_type operator-(RepeatIter const & other) const { return static_cast(m_idx) - static_cast(other.m_idx); } constexpr reference operator*() const { return m_value; } }; RepeatIter iter_begin{0, default_value}; RepeatIter iter_end{row_count, default_value}; T builder(pool); ARROW_RETURN_NOT_OK(builder.AppendValues(iter_begin, iter_end)); return builder.Finish(); } inline arrow::Status set_column( std::shared_ptr const & schema, std::vector> & columns, char const * field_name, arrow::Result> const & array) { auto field_index = schema->GetFieldIndex(field_name); if (field_index == -1) { return arrow::Status::Invalid("Failed to find field '", field_name, "' during migration."); } if (field_index >= (std::int64_t)columns.size()) { columns.resize(field_index + 1); } ARROW_ASSIGN_OR_RAISE(columns[field_index], array); return arrow::Status::OK(); } inline arrow::Status copy_column( std::shared_ptr const & schema_a, std::vector> & columns_a, char const * field_name, std::shared_ptr const & schema_b, std::vector> & columns_b) { auto field_index_a = schema_a->GetFieldIndex(field_name); if (field_index_a == -1 || field_index_a >= (std::int64_t)columns_a.size()) { return arrow::Status::Invalid("Failed to find field '", field_name, "' during migration."); } auto source_column = columns_a[field_index_a]; auto field_index_b = schema_b->GetFieldIndex(field_name); if (field_index_b 
>= (std::int64_t)columns_b.size()) { columns_b.resize(field_index_b + 1); } /* GetFieldIndex returns -1 when the field is absent; never index the vector with it. */ if (field_index_b < 0) { return arrow::Status::Invalid( "Failed to find field '", field_name, "' in destination schema during migration."); } columns_b[field_index_b] = source_column; return arrow::Status::OK(); } struct Pod5BatchRecordReader { std::shared_ptr reader; std::shared_ptr schema; std::shared_ptr metadata; }; struct Pod5BatchRecordWriter { std::shared_ptr writer; std::shared_ptr schema; arrow::Status write_batch( std::size_t num_rows, std::vector> const & columns) { auto const record_batch = arrow::RecordBatch::Make(schema, num_rows, std::move(columns)); return writer->WriteRecordBatch(*record_batch); } }; inline pod5::Result open_record_batch_reader( arrow::MemoryPool * pool, combined_file_utils::ParsedFileInfo file_info) { Pod5BatchRecordReader result; ARROW_ASSIGN_OR_RAISE(auto file, open_sub_file(file_info)); arrow::ipc::IpcReadOptions read_options; read_options.memory_pool = pool; ARROW_ASSIGN_OR_RAISE( result.reader, arrow::ipc::RecordBatchFileReader::Open(file, read_options)); result.schema = result.reader->schema(); result.metadata = result.schema->metadata(); if (!result.metadata) { return Status::IOError("Missing metadata on read table schema"); } return result; } inline pod5::Result> update_metadata( std::shared_ptr original_metadata, Version version_to_write) { auto result = original_metadata->Copy(); // Update the reader for the new version: ARROW_RETURN_NOT_OK(result->Set("MINKNOW:pod5_version", version_to_write.to_string())); return result; } inline pod5::Result make_record_batch_writer( arrow::MemoryPool * pool, std::string path, std::shared_ptr schema, std::shared_ptr metadata) { ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::FileOutputStream::Open(path, false)); arrow::ipc::IpcWriteOptions write_options; write_options.memory_pool = pool; write_options.emit_dictionary_deltas = true; Pod5BatchRecordWriter result; ARROW_ASSIGN_OR_RAISE( result.writer, arrow::ipc::MakeFileWriter(file, schema, write_options, metadata)); result.schema = schema; return result; } inline pod5::Status check_columns( 
std::shared_ptr const & schema, std::vector> const & columns) { for (std::size_t i = 0; i < columns.size(); ++i) { auto const & column = columns[i]; auto const & schema_field = schema->field(i); if (auto list = std::dynamic_pointer_cast(column)) { auto last_value = list->value_offset(0); for (int i = 1; i <= list->length(); ++i) { if (list->value_offset(i) < last_value) { return arrow::Status::Invalid( "Field content for field `", schema_field->name(), "`, list offsets are invalid" " at row index ", i, " (", list->value_offset(i), " < ", last_value, ")"); } last_value = list->value_offset(i); } } else if (auto dict = std::dynamic_pointer_cast(column)) { auto dict_values = dict->dictionary(); auto string_dictionary_values = std::dynamic_pointer_cast(dict_values); if (string_dictionary_values) { auto const value_offsets = string_dictionary_values->value_offsets(); std::int64_t const value_offsets_length = value_offsets->size() / sizeof(arrow::StringArray::offset_type); if (value_offsets_length != (1 + dict_values->length())) { // We expect N+1 offsets for the final element length return arrow::Status::Invalid( "Dictionary length for field `", schema_field->name(), "`, dictionary length is ", dict_values->length(), " but value offsets is length ", value_offsets_length); } } auto indices = std::dynamic_pointer_cast(dict->indices()); if (!indices) { return arrow::Status::Invalid( "Field content for field `", schema_field->name(), "`, dictionary indexes are missing"); } for (int i = 0; i < indices->length(); ++i) { if (indices->Value(i) >= dict_values->length()) { return arrow::Status::Invalid( "Field content for field `", schema_field->name(), "`, dictionary indexes are invalid" " at row index ", i, " (", indices->Value(i), " >= ", dict_values->length(), ")"); } } } } return {}; } } // namespace pod5 ================================================ FILE: c++/pod5_format/migration/v0_to_v1.cpp ================================================ #include 
"pod5_format/migration/migration.h" #include "pod5_format/migration/migration_utils.h" #include "pod5_format/table_reader.h" #include #include #include #include namespace pod5 { arrow::Result migrate_v0_to_v1( MigrationResult && v0_input, arrow::MemoryPool * pool) { ARROW_ASSIGN_OR_RAISE(auto temp_dir, MakeTmpDir("pod5_v0_v1_migration")); ARROW_ASSIGN_OR_RAISE(auto v1_reads_table_path, temp_dir->path().Join("reads_table.arrow")); { ARROW_ASSIGN_OR_RAISE( auto v0_reader, open_record_batch_reader(pool, v0_input.footer().reads_table)); auto v1_new_schama = arrow::schema( {arrow::field("num_minknow_events", arrow::uint64()), arrow::field("tracked_scaling_scale", arrow::float32()), arrow::field("tracked_scaling_shift", arrow::float32()), arrow::field("predicted_scaling_scale", arrow::float32()), arrow::field("predicted_scaling_shift", arrow::float32()), arrow::field("num_reads_since_mux_change", arrow::uint32()), arrow::field("time_since_mux_change", arrow::float32())}); ARROW_ASSIGN_OR_RAISE( auto v1_schema, arrow::UnifySchemas({v0_reader.schema, v1_new_schama})); ARROW_ASSIGN_OR_RAISE( auto new_metadata, update_metadata(v0_reader.metadata, Version(0, 0, 24))); ARROW_ASSIGN_OR_RAISE( auto v1_writer, make_record_batch_writer( pool, v1_reads_table_path.ToString(), v1_schema, new_metadata)); for (std::int64_t batch_idx = 0; batch_idx < v0_reader.reader->num_record_batches(); ++batch_idx) { // Read V0 data: ARROW_ASSIGN_OR_RAISE( auto v0_batch, ReadRecordBatchAndValidate(*v0_reader.reader, batch_idx)); ARROW_RETURN_NOT_OK(v0_batch->ValidateFull()); auto const num_rows = v0_batch->num_rows(); if (num_rows < 0) { return arrow::Status::Invalid("Invalid number of rows"); } else if (POD5_ENABLE_FUZZERS && num_rows > 1'000'000) { return arrow::Status::Invalid("Skipping huge sizes when fuzzing"); } // Extend with V1 data: std::vector> columns = v0_batch->columns(); ARROW_RETURN_NOT_OK(check_columns(v0_reader.schema, columns)); ARROW_RETURN_NOT_OK(set_column( v1_schema, columns, 
"num_minknow_events", make_filled_array(pool, num_rows, 0))); ARROW_RETURN_NOT_OK(set_column( v1_schema, columns, "tracked_scaling_scale", make_filled_array( pool, num_rows, std::numeric_limits::quiet_NaN()))); ARROW_RETURN_NOT_OK(set_column( v1_schema, columns, "tracked_scaling_shift", make_filled_array( pool, num_rows, std::numeric_limits::quiet_NaN()))); ARROW_RETURN_NOT_OK(set_column( v1_schema, columns, "predicted_scaling_scale", make_filled_array( pool, num_rows, std::numeric_limits::quiet_NaN()))); ARROW_RETURN_NOT_OK(set_column( v1_schema, columns, "predicted_scaling_shift", make_filled_array( pool, num_rows, std::numeric_limits::quiet_NaN()))); ARROW_RETURN_NOT_OK(set_column( v1_schema, columns, "num_reads_since_mux_change", make_filled_array(pool, num_rows, 0))); ARROW_RETURN_NOT_OK(set_column( v1_schema, columns, "time_since_mux_change", make_filled_array(pool, num_rows, 0.0f))); ARROW_RETURN_NOT_OK(v1_writer.write_batch(num_rows, std::move(columns))); } ARROW_RETURN_NOT_OK(v1_writer.writer->Close()); } // Set up migrated data to point at our new table: MigrationResult result = std::move(v0_input); ARROW_RETURN_NOT_OK(result.footer().reads_table.from_full_file(v1_reads_table_path.ToString())); result.add_temp_dir(std::move(temp_dir)); return result; } } // namespace pod5 ================================================ FILE: c++/pod5_format/migration/v1_to_v2.cpp ================================================ #include "pod5_format/migration/migration.h" #include "pod5_format/migration/migration_utils.h" #include "pod5_format/table_reader.h" #include #include #include #include #include namespace pod5 { arrow::Result get_num_samples( std::shared_ptr const & signal_col, std::size_t row_idx, std::vector> const & signal_batches) { if (signal_batches.empty()) { return 0; } std::size_t signal_batch_size = signal_batches[0]->num_rows(); std::size_t num_samples = 0; auto values = std::dynamic_pointer_cast(signal_col->values()); if (!values) { return 
arrow::Status::Invalid("Invalid signal column, potentially corrupt file."); } auto offset = signal_col->value_offset(row_idx); for (std::int64_t index = 0; index < signal_col->value_length(row_idx); ++index) { auto const abs_index = offset + index; if (abs_index < 0 || abs_index >= values->length()) { return arrow::Status::Invalid("Invalid signal column, potentially corrupt file."); } auto const abs_row = values->Value(abs_index); auto const batch_idx = abs_row / signal_batch_size; auto const batch_row = abs_row - (batch_idx * signal_batch_size); if (batch_idx >= signal_batches.size()) { return arrow::Status::Invalid( "Invalid signal row ", abs_row, ", cannot find signal batch ", batch_idx); } auto batch = signal_batches[batch_idx]; auto samples_column = std::dynamic_pointer_cast(batch->GetColumnByName("samples")); if (!samples_column) { return arrow::Status::Invalid("`samples` column is missing from file"); } if (batch_row >= (std::size_t)samples_column->length()) { return arrow::Status::Invalid( "Invalid signal batch row ", batch_row, ", length is ", samples_column->length()); } num_samples += samples_column->Value(batch_row); } return num_samples; } arrow::Result migrate_v1_to_v2( MigrationResult && v1_input, arrow::MemoryPool * pool) { ARROW_ASSIGN_OR_RAISE(auto temp_dir, MakeTmpDir("pod5_v1_v2_migration")); ARROW_ASSIGN_OR_RAISE(auto v2_reads_table_path, temp_dir->path().Join("reads_table.arrow")); { ARROW_ASSIGN_OR_RAISE( auto v1_reader, open_record_batch_reader(pool, v1_input.footer().reads_table)); ARROW_ASSIGN_OR_RAISE( auto v1_signal_reader, open_record_batch_reader(pool, v1_input.footer().signal_table)); std::vector> signal_batches( v1_signal_reader.reader->num_record_batches()); for (std::size_t batch_idx = 0; batch_idx < (std::size_t)v1_signal_reader.reader->num_record_batches(); ++batch_idx) { ARROW_ASSIGN_OR_RAISE( signal_batches[batch_idx], ReadRecordBatchAndValidate(*v1_signal_reader.reader, batch_idx)); 
ARROW_RETURN_NOT_OK(signal_batches[batch_idx]->ValidateFull()); } auto v2_new_schama = arrow::schema({arrow::field("num_samples", arrow::uint64())}); ARROW_ASSIGN_OR_RAISE( auto new_metadata, update_metadata(v1_reader.metadata, Version(0, 0, 32))); ARROW_ASSIGN_OR_RAISE( auto v2_schema, arrow::UnifySchemas({v1_reader.schema, v2_new_schama})); ARROW_ASSIGN_OR_RAISE( auto v2_writer, make_record_batch_writer( pool, v2_reads_table_path.ToString(), v2_schema, new_metadata)); for (std::int64_t batch_idx = 0; batch_idx < v1_reader.reader->num_record_batches(); ++batch_idx) { // Read V1 data: ARROW_ASSIGN_OR_RAISE( auto v1_batch, ReadRecordBatchAndValidate(*v1_reader.reader, batch_idx)); ARROW_RETURN_NOT_OK(v1_batch->ValidateFull()); auto const num_rows = v1_batch->num_rows(); // Extend with V2 data: std::vector> columns = v1_batch->columns(); auto signal_column = std::dynamic_pointer_cast(v1_batch->GetColumnByName("signal")); if (!signal_column) { return arrow::Status::Invalid("`signal` column is missing from file"); } ARROW_RETURN_NOT_OK(signal_column->ValidateFull()); arrow::UInt64Builder num_samples_builder; for (std::int64_t row = 0; row < num_rows; ++row) { ARROW_ASSIGN_OR_RAISE( auto num_samples, get_num_samples(signal_column, row, signal_batches)); ARROW_RETURN_NOT_OK(num_samples_builder.Append(num_samples)); } ARROW_RETURN_NOT_OK( set_column(v2_schema, columns, "num_samples", num_samples_builder.Finish())); ARROW_RETURN_NOT_OK(v2_writer.write_batch(num_rows, std::move(columns))); } ARROW_RETURN_NOT_OK(v2_writer.writer->Close()); } // Set up migrated data to point at our new table: MigrationResult result = std::move(v1_input); ARROW_RETURN_NOT_OK(result.footer().reads_table.from_full_file(v2_reads_table_path.ToString())); result.add_temp_dir(std::move(temp_dir)); return result; } } // namespace pod5 ================================================ FILE: c++/pod5_format/migration/v2_to_v3.cpp ================================================ #include 
"pod5_format/migration/migration.h" #include "pod5_format/migration/migration_utils.h" #include "pod5_format/types.h" #include #include #include #include #include #include namespace pod5 { struct StructRow { std::int64_t dict_item_index; std::shared_ptr data; }; struct StringDictBuilder { arrow::Int16Builder indices; arrow::StringBuilder items; arrow::Result> finish() { ARROW_ASSIGN_OR_RAISE(auto finished_indices, indices.Finish()); ARROW_ASSIGN_OR_RAISE(auto finished_items, items.Finish()); auto const & finished_items_val = static_cast(*finished_items); // Re append the finished items to the now blank list for (std::int64_t i = 0; i < finished_items_val.length(); ++i) { ARROW_RETURN_NOT_OK(items.Append(finished_items_val.GetView(i))); } return arrow::DictionaryArray::FromArrays(finished_indices, finished_items); } std::unordered_map lookup; }; arrow::Result get_dict_struct( std::shared_ptr const & batch, std::size_t row, char const * field_name) { auto column = batch->GetColumnByName(field_name); if (!column) { return Status::Invalid("Failed to find column ", field_name); } auto dict_column = std::dynamic_pointer_cast(column); if (!dict_column) { return Status::Invalid("Found column ", field_name, " is not a dictionary as expected"); } auto dict_items = std::dynamic_pointer_cast(dict_column->dictionary()); if (!dict_items) { return Status::Invalid("Dictionary column is not a struct as expected"); } return StructRow{dict_column->GetValueIndex(row), dict_items}; } template arrow::Status append_struct_row(StructRow const & struct_row, char const * field_name, Builder & builder) { auto field_array = struct_row.data->GetFieldByName(field_name); if (!field_array) { return Status::Invalid("Struct is missing ", field_name, " field"); } auto typed_field_array = std::dynamic_pointer_cast(field_array); if (!typed_field_array) { return Status::Invalid(field_name, " field is the wrong type"); } if (struct_row.dict_item_index < 0 || struct_row.dict_item_index >= 
field_array->length()) { return Status::Invalid("Dictionary index is out of range"); } return builder.Append(typed_field_array->Value(struct_row.dict_item_index)); } arrow::Status append_struct_row_to_dict( StructRow const & struct_row, char const * field_name, StringDictBuilder & builder) { auto field_array = struct_row.data->GetFieldByName(field_name); if (!field_array) { return Status::Invalid("Struct is missing ", field_name, " field"); } auto typed_field_array = std::dynamic_pointer_cast(field_array); if (!typed_field_array) { return Status::Invalid(field_name, " field is the wrong type"); } if (struct_row.dict_item_index < 0 || struct_row.dict_item_index >= field_array->length()) { return Status::Invalid("Dictionary index is out of range"); } auto str_value = typed_field_array->GetString(struct_row.dict_item_index); auto it = builder.lookup.find(str_value); if (it != builder.lookup.end()) { return builder.indices.Append(it->second); } auto index = builder.items.length(); ARROW_RETURN_NOT_OK(builder.items.Append(str_value)); builder.lookup[str_value] = index; return builder.indices.Append(index); } arrow::Result migrate_v2_to_v3( MigrationResult && v2_input, arrow::MemoryPool * pool) { ARROW_ASSIGN_OR_RAISE(auto temp_dir, MakeTmpDir("pod5_v2_v3_migration")); ARROW_ASSIGN_OR_RAISE(auto v3_reads_table_path, temp_dir->path().Join("reads_table.arrow")); ARROW_ASSIGN_OR_RAISE( auto v3_run_info_table_path, temp_dir->path().Join("run_info_table.arrow")); { ARROW_ASSIGN_OR_RAISE( auto v2_reader, open_record_batch_reader(pool, v2_input.footer().reads_table)); ARROW_ASSIGN_OR_RAISE( auto new_metadata, update_metadata(v2_reader.metadata, Version(0, 0, 35))); auto const num_record_batches = v2_reader.reader->num_record_batches(); { auto v3_reads_schema = arrow::schema( {arrow::field("read_id", uuid()), arrow::field("signal", arrow::list(arrow::uint64())), arrow::field("read_number", arrow::uint32()), arrow::field("start", arrow::uint64()), arrow::field("median_before", 
arrow::float32()), arrow::field("num_minknow_events", arrow::uint64()), arrow::field("tracked_scaling_scale", arrow::float32()), arrow::field("tracked_scaling_shift", arrow::float32()), arrow::field("predicted_scaling_scale", arrow::float32()), arrow::field("predicted_scaling_shift", arrow::float32()), arrow::field("num_reads_since_mux_change", arrow::uint32()), arrow::field("time_since_mux_change", arrow::float32()), arrow::field("num_samples", arrow::uint64()), arrow::field("channel", arrow::uint16()), arrow::field("well", arrow::uint8()), arrow::field("pore_type", arrow::dictionary(arrow::int16(), arrow::utf8())), arrow::field("calibration_offset", arrow::float32()), arrow::field("calibration_scale", arrow::float32()), arrow::field("end_reason", arrow::dictionary(arrow::int16(), arrow::utf8())), arrow::field("end_reason_forced", arrow::boolean()), arrow::field("run_info", arrow::dictionary(arrow::int16(), arrow::utf8()))}, new_metadata); ARROW_ASSIGN_OR_RAISE( auto v3_reads_writer, make_record_batch_writer( pool, v3_reads_table_path.ToString(), v3_reads_schema, new_metadata)); std::vector const columns_to_copy{ "read_id", "signal", "read_number", "start", "median_before", "num_minknow_events", "tracked_scaling_scale", "tracked_scaling_shift", "predicted_scaling_scale", "predicted_scaling_shift", "num_reads_since_mux_change", "time_since_mux_change", "num_samples"}; // Builders for dict columns StringDictBuilder pore_type; StringDictBuilder end_reason; StringDictBuilder run_info; for (std::int64_t batch_idx = 0; batch_idx < num_record_batches; ++batch_idx) { // Read V2 data: ARROW_ASSIGN_OR_RAISE(auto v2_batch, v2_reader.reader->ReadRecordBatch(batch_idx)); ARROW_RETURN_NOT_OK(v2_batch->ValidateFull()); auto const num_rows = v2_batch->num_rows(); std::vector> v3_columns; // Write V3 data: std::vector> v2_columns = v2_batch->columns(); for (auto const & col_name : columns_to_copy) { ARROW_RETURN_NOT_OK(copy_column( v2_reader.schema, v2_columns, col_name.data(), 
v3_reads_schema, v3_columns)); } arrow::UInt16Builder channel; arrow::UInt8Builder well; arrow::FloatBuilder calibration_offset; arrow::FloatBuilder calibration_scale; arrow::BooleanBuilder end_reason_forced; for (std::int64_t row = 0; row < num_rows; ++row) { ARROW_ASSIGN_OR_RAISE( auto calibration_data, get_dict_struct(v2_batch, row, "calibration")); ARROW_RETURN_NOT_OK( append_struct_row( calibration_data, "offset", calibration_offset)); ARROW_RETURN_NOT_OK( append_struct_row( calibration_data, "scale", calibration_scale)); ARROW_ASSIGN_OR_RAISE(auto pore_data, get_dict_struct(v2_batch, row, "pore")); ARROW_RETURN_NOT_OK( append_struct_row(pore_data, "channel", channel)); ARROW_RETURN_NOT_OK( append_struct_row(pore_data, "well", well)); ARROW_RETURN_NOT_OK( append_struct_row_to_dict(pore_data, "pore_type", pore_type)); ARROW_ASSIGN_OR_RAISE( auto end_reason_data, get_dict_struct(v2_batch, row, "end_reason")); ARROW_RETURN_NOT_OK( append_struct_row_to_dict(end_reason_data, "name", end_reason)); ARROW_RETURN_NOT_OK( append_struct_row( end_reason_data, "forced", end_reason_forced)); ARROW_ASSIGN_OR_RAISE( auto run_info_data, get_dict_struct(v2_batch, row, "run_info")); ARROW_RETURN_NOT_OK( append_struct_row_to_dict(run_info_data, "acquisition_id", run_info)); } ARROW_RETURN_NOT_OK(set_column( v3_reads_schema, v3_columns, "calibration_offset", calibration_offset.Finish())); ARROW_RETURN_NOT_OK(set_column( v3_reads_schema, v3_columns, "calibration_scale", calibration_scale.Finish())); ARROW_RETURN_NOT_OK( set_column(v3_reads_schema, v3_columns, "channel", channel.Finish())); ARROW_RETURN_NOT_OK(set_column(v3_reads_schema, v3_columns, "well", well.Finish())); ARROW_RETURN_NOT_OK( set_column(v3_reads_schema, v3_columns, "pore_type", pore_type.finish())); ARROW_RETURN_NOT_OK( set_column(v3_reads_schema, v3_columns, "end_reason", end_reason.finish())); ARROW_RETURN_NOT_OK(set_column( v3_reads_schema, v3_columns, "end_reason_forced", end_reason_forced.Finish())); 
ARROW_RETURN_NOT_OK( set_column(v3_reads_schema, v3_columns, "run_info", run_info.finish())); ARROW_RETURN_NOT_OK(v3_reads_writer.write_batch(num_rows, std::move(v3_columns))); } ARROW_RETURN_NOT_OK(v3_reads_writer.writer->Close()); } if (num_record_batches > 0) { ARROW_ASSIGN_OR_RAISE( auto v2_last_batch, v2_reader.reader->ReadRecordBatch(num_record_batches - 1)); auto run_info_column = std::dynamic_pointer_cast( v2_last_batch->GetColumnByName("run_info")); if (!run_info_column) { return arrow::Status::Invalid("Failed to find the run info column"); } auto run_info_dict_type = std::dynamic_pointer_cast(run_info_column->type()); if (!run_info_dict_type) { return arrow::Status::Invalid("Failed to find a run info of the right type"); } auto run_info_items = std::dynamic_pointer_cast(run_info_column->dictionary()); if (!run_info_items) { return arrow::Status::Invalid("Failed to find a run info items array"); } auto run_info_items_type = std::dynamic_pointer_cast(run_info_items->type()); if (!run_info_items_type) { return arrow::Status::Invalid( "Failed to find a run info items array of the right type"); } // Append all the run info dict-struct data to the new table: auto v3_run_info_schema = arrow::schema(run_info_items_type->fields(), new_metadata); ARROW_ASSIGN_OR_RAISE( auto v3_run_info_writer, make_record_batch_writer( pool, v3_run_info_table_path.ToString(), v3_run_info_schema, new_metadata)); auto const & fields = run_info_items->fields(); std::vector> v3_columns( v3_run_info_schema->fields().size()); for (std::size_t col = 0; col < v3_columns.size(); ++col) { v3_columns[col] = fields[col]; } ARROW_RETURN_NOT_OK( v3_run_info_writer.write_batch(run_info_items->length(), std::move(v3_columns))); ARROW_RETURN_NOT_OK(v3_run_info_writer.writer->Close()); } } // Set up migrated data to point at our new table: MigrationResult result = std::move(v2_input); ARROW_RETURN_NOT_OK(result.footer().reads_table.from_full_file(v3_reads_table_path.ToString())); 
ARROW_RETURN_NOT_OK( result.footer().run_info_table.from_full_file(v3_run_info_table_path.ToString())); result.add_temp_dir(std::move(temp_dir)); return result; } } // namespace pod5 ================================================ FILE: c++/pod5_format/migration/v3_to_v4.cpp ================================================ #include "pod5_format/migration/migration.h" #include "pod5_format/migration/migration_utils.h" #include "pod5_format/table_reader.h" #include #include #include #include namespace pod5 { arrow::Result migrate_v3_to_v4( MigrationResult && v3_input, arrow::MemoryPool * pool) { ARROW_ASSIGN_OR_RAISE(auto temp_dir, MakeTmpDir("pod5_v3_v4_migration")); ARROW_ASSIGN_OR_RAISE(auto v4_reads_table_path, temp_dir->path().Join("reads_table.arrow")); { ARROW_ASSIGN_OR_RAISE( auto v3_reader, open_record_batch_reader(pool, v3_input.footer().reads_table)); auto v4_new_schema = arrow::schema({arrow::field("open_pore_level", arrow::float32())}); ARROW_ASSIGN_OR_RAISE( auto v4_schema, arrow::UnifySchemas({v3_reader.schema, v4_new_schema})); ARROW_ASSIGN_OR_RAISE( auto new_metadata, update_metadata(v3_reader.metadata, Version(0, 3, 30))); ARROW_ASSIGN_OR_RAISE( auto v4_writer, make_record_batch_writer( pool, v4_reads_table_path.ToString(), v4_schema, new_metadata)); for (std::int64_t batch_idx = 0; batch_idx < v3_reader.reader->num_record_batches(); ++batch_idx) { // Read V3 data: ARROW_ASSIGN_OR_RAISE( auto v3_batch, ReadRecordBatchAndValidate(*v3_reader.reader, batch_idx)); ARROW_RETURN_NOT_OK(v3_batch->ValidateFull()); auto const num_rows = v3_batch->num_rows(); if (num_rows < 0) { return arrow::Status::Invalid("Invalid number of rows"); } else if (POD5_ENABLE_FUZZERS && num_rows > 1'000'000) { return arrow::Status::Invalid("Skipping huge sizes when fuzzing"); } // Extend with V4 data: std::vector> columns = v3_batch->columns(); ARROW_RETURN_NOT_OK(check_columns(v3_reader.schema, columns)); ARROW_RETURN_NOT_OK(set_column( v4_schema, columns, "open_pore_level", 
make_filled_array( pool, num_rows, std::numeric_limits::quiet_NaN()))); ARROW_RETURN_NOT_OK(v4_writer.write_batch(num_rows, std::move(columns))); } ARROW_RETURN_NOT_OK(v4_writer.writer->Close()); } // Set up migrated data to point at our new table: MigrationResult result = std::move(v3_input); ARROW_RETURN_NOT_OK(result.footer().reads_table.from_full_file(v4_reads_table_path.ToString())); result.add_temp_dir(std::move(temp_dir)); return result; } } // namespace pod5 ================================================ FILE: c++/pod5_format/read_table_reader.cpp ================================================ #include "pod5_format/read_table_reader.h" #include "pod5_format/read_table_utils.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/schema_utils.h" #include #include #include #include #include #include namespace pod5 { ReadTableRecordBatch::ReadTableRecordBatch( std::shared_ptr && batch, std::shared_ptr const & field_locations) : TableRecordBatch(std::move(batch)) , m_field_locations(field_locations) { } ReadTableRecordBatch::ReadTableRecordBatch(ReadTableRecordBatch && other) : TableRecordBatch(std::move(other)) , m_field_locations(std::move(other.m_field_locations)) { } std::shared_ptr ReadTableRecordBatch::read_id_column() const { return find_column(batch(), m_field_locations->read_id); } std::shared_ptr ReadTableRecordBatch::signal_column() const { return find_column(batch(), m_field_locations->signal); } Result ReadTableRecordBatch::columns() const { ReadTableRecordColumns result; result.table_version = m_field_locations->table_version(); auto const & bat = batch(); // V0 fields: result.read_id = find_column(bat, m_field_locations->read_id); result.signal = find_column(bat, m_field_locations->signal); result.read_number = find_column(bat, m_field_locations->read_number); result.start_sample = find_column(bat, m_field_locations->start); result.median_before = find_column(bat, m_field_locations->median_before); // V1 fields: if 
(result.table_version >= ReadTableSpecVersion::v1()) { result.num_minknow_events = find_column(bat, m_field_locations->num_minknow_events); result.tracked_scaling_scale = find_column(bat, m_field_locations->tracked_scaling_scale); result.tracked_scaling_shift = find_column(bat, m_field_locations->tracked_scaling_shift); result.predicted_scaling_scale = find_column(bat, m_field_locations->predicted_scaling_scale); result.predicted_scaling_shift = find_column(bat, m_field_locations->predicted_scaling_shift); result.num_reads_since_mux_change = find_column(bat, m_field_locations->num_reads_since_mux_change); result.time_since_mux_change = find_column(bat, m_field_locations->time_since_mux_change); } // V2 fields: if (result.table_version >= ReadTableSpecVersion::v2()) { result.num_samples = find_column(bat, m_field_locations->num_samples); } // V3 fields: if (result.table_version >= ReadTableSpecVersion::v3()) { result.channel = find_column(bat, m_field_locations->channel); result.well = find_column(bat, m_field_locations->well); result.pore_type = find_column(bat, m_field_locations->pore_type); result.calibration_offset = find_column(bat, m_field_locations->calibration_offset); result.calibration_scale = find_column(bat, m_field_locations->calibration_scale); result.end_reason = find_column(bat, m_field_locations->end_reason); result.end_reason_forced = find_column(bat, m_field_locations->end_reason_forced); result.run_info = find_column(bat, m_field_locations->run_info); } if (result.table_version >= ReadTableSpecVersion::v4()) { result.open_pore_level = find_column(bat, m_field_locations->open_pore_level); } return result; } Result> ReadTableRecordBatch::get_signal_rows( std::int64_t batch_row) const { auto signal_col = signal_column(); auto const & values = signal_col->values(); auto const offset = signal_col->value_offset(batch_row); if (offset >= values->length()) { return arrow::Status::Invalid( "Invalid signal row offset '", offset, "' is outside the size of 
the values array."); } auto const length = signal_col->value_length(batch_row); if (length > values->length() - offset) { return arrow::Status::Invalid( "Invalid signal row length '", length, "' is outside the size of the values array."); } return std::static_pointer_cast(values->Slice(offset, length)); } template auto & ReadTableRecordBatch::get_dictionary( std::shared_ptr const & array) const { auto & initialised = m_dictionary_initialised[static_cast(which)]; if (initialised.load(std::memory_order_acquire)) { return array->dictionary(); } std::lock_guard lock(m_dictionary_access_lock); auto & dict = array->dictionary(); initialised.store(true, std::memory_order_release); return dict; } Result ReadTableRecordBatch::get_pore_type(std::int16_t pore_index) const { if (!m_field_locations->pore_type.found_field()) { return arrow::Status::Invalid("pore field is not present in the file"); } auto pore_column = find_column(batch(), m_field_locations->pore_type); auto const & pore_dict = get_dictionary(pore_column); auto const & pore_data = static_cast(*pore_dict); if (pore_index < 0 || pore_index >= pore_data.length()) { return arrow::Status::IndexError( "Invalid index ", pore_index, " for pore array of length ", pore_data.length()); } return pore_data.GetString(pore_index); } Result> ReadTableRecordBatch::get_end_reason( std::int16_t end_reason_index) const { if (!m_field_locations->end_reason.found_field()) { return arrow::Status::Invalid("end_reason field is not present in the file"); } auto end_reason_column = find_column(batch(), m_field_locations->end_reason); auto const & end_reason_dict = get_dictionary(end_reason_column); auto const & end_reason_data = static_cast(*end_reason_dict); /* Reject negative indices too, matching get_pore_type and get_run_info. */ if (end_reason_index < 0 || end_reason_index >= end_reason_data.length()) { return arrow::Status::IndexError( "Invalid index ", end_reason_index, " for end reason array of length ", end_reason_data.length()); } auto str_value = end_reason_data.GetString(end_reason_index); auto reason = 
end_reason_from_string(str_value); return std::make_pair(reason, std::move(str_value)); } Result ReadTableRecordBatch::get_run_info(std::int16_t run_info_index) const { if (!m_field_locations->run_info.found_field()) { return arrow::Status::Invalid("run_info field is not present in the file"); } auto run_info_column = find_column(batch(), m_field_locations->run_info); auto const & run_info_dict = get_dictionary(run_info_column); auto const & run_info_data = static_cast(*run_info_dict); if (run_info_index < 0 || run_info_index >= run_info_data.length()) { return arrow::Status::IndexError( "Invalid index ", run_info_index, " for run info array of length ", run_info_data.length()); } return run_info_data.GetString(run_info_index); } //--------------------------------------------------------------------------------------------------------------------- ReadTableReader::ReadTableReader( std::shared_ptr && input_source, std::shared_ptr && reader, std::shared_ptr const & field_locations, SchemaMetadataDescription && schema_metadata, arrow::MemoryPool * pool) : TableReader(std::move(input_source), std::move(reader), std::move(schema_metadata), pool) , m_field_locations(field_locations) { } ReadTableReader::ReadTableReader(ReadTableReader && other) : TableReader(std::move(other)) , m_sorted_file_read_ids(std::move(other.m_sorted_file_read_ids)) , m_field_locations(std::move(other.m_field_locations)) { } ReadTableReader & ReadTableReader::operator=(ReadTableReader && other) { /* Move the base-class state from other, not from *this. */ static_cast<TableReader &>(*this) = std::move(static_cast<TableReader &>(other)); m_field_locations = std::move(other.m_field_locations); m_sorted_file_read_ids = std::move(other.m_sorted_file_read_ids); return *this; } Result ReadTableReader::read_record_batch(std::size_t i) const { std::lock_guard l(m_batch_get_mutex); ARROW_ASSIGN_OR_RAISE(auto record_batch, TableReader::ReadRecordBatch(i)); return ReadTableRecordBatch{std::move(record_batch), m_field_locations}; } Status ReadTableReader::build_read_id_lookup() const { 
std::lock_guard lock(m_sorted_file_read_ids_mutex); if (!m_sorted_file_read_ids.empty()) { return Status::OK(); } std::vector file_read_ids; auto const batch_count = num_record_batches(); std::size_t abs_row_count = 0; // Loop each batch and copy read ids out into the index: for (std::size_t i = 0; i < batch_count; ++i) { ARROW_ASSIGN_OR_RAISE(auto batch, read_record_batch(i)); if (file_read_ids.empty()) { file_read_ids.reserve(batch.num_rows() * batch_count); } file_read_ids.resize(file_read_ids.size() + batch.num_rows()); auto read_id_col = batch.read_id_column(); auto raw_read_id_values = read_id_col->raw_values(); for (std::size_t row = 0; row < (std::size_t)read_id_col->length(); ++row) { // Record the id, and its location within the file: file_read_ids[abs_row_count].id = raw_read_id_values[row]; file_read_ids[abs_row_count].batch = i; file_read_ids[abs_row_count].batch_row = row; abs_row_count += 1; } } // Sort by read id for searching later: std::sort(file_read_ids.begin(), file_read_ids.end(), [](auto const & a, auto const & b) { return a.id < b.id; }); // Move data out now we successfully build the index: m_sorted_file_read_ids = std::move(file_read_ids); return Status::OK(); } Result ReadTableReader::search_for_read_ids( ReadIdSearchInput const & search_input, gsl::span const & batch_counts, gsl::span const & batch_rows) const { ARROW_RETURN_NOT_OK(build_read_id_lookup()); if (m_sorted_file_read_ids.empty()) { return 0; } std::size_t successes = 0; std::vector> batch_data(batch_counts.size()); auto const initial_reserve_size = search_input.read_id_count() / batch_counts.size(); for (auto & br : batch_data) { br.reserve(initial_reserve_size); } auto file_ids_current_it = m_sorted_file_read_ids.begin(); auto const file_ids_end = m_sorted_file_read_ids.end(); for (std::size_t i = 0; i < search_input.read_id_count(); ++i) { auto const & search_item = search_input[i]; // Increment file pointer while less than the search term: while (file_ids_current_it != 
file_ids_end && file_ids_current_it->id < search_item.id) { ++file_ids_current_it; } // No more ids to search, both lists are sorted and we haven't found this one, we won't find any others. if (file_ids_current_it == file_ids_end) { break; } // If we found it record the location: if (file_ids_current_it->id == search_item.id) { batch_data[file_ids_current_it->batch].push_back(file_ids_current_it->batch_row); successes += 1; } } std::size_t full_size_so_far = 0; for (std::size_t i = 0; i < batch_data.size(); ++i) { auto & data = batch_data[i]; batch_counts[i] = data.size(); // Ensure the batch indices within the batch are sorted: std::sort(data.begin(), data.end()); // Copy the row indices into the packed vector: std::copy(data.begin(), data.end(), batch_rows.begin() + full_size_so_far); full_size_so_far += data.size(); } return successes; } //--------------------------------------------------------------------------------------------------------------------- Result make_read_table_reader( std::shared_ptr const & input, arrow::MemoryPool * pool) { arrow::ipc::IpcReadOptions options; options.memory_pool = pool; ARROW_ASSIGN_OR_RAISE(auto reader, arrow::ipc::RecordBatchFileReader::Open(input, options)); auto read_metadata_key_values = reader->schema()->metadata(); if (!read_metadata_key_values) { return Status::IOError("Missing metadata on read table schema"); } ARROW_ASSIGN_OR_RAISE( auto read_metadata, read_schema_key_value_metadata(read_metadata_key_values)); ARROW_ASSIGN_OR_RAISE( auto field_locations, read_read_table_schema(read_metadata, reader->schema())); return ReadTableReader( {input}, std::move(reader), field_locations, std::move(read_metadata), pool); } } // namespace pod5 ================================================ FILE: c++/pod5_format/read_table_reader.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/read_table_schema.h" #include "pod5_format/read_table_utils.h" 
#include "pod5_format/result.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/table_reader.h" #include "pod5_format/types.h" #include "pod5_format/uuid.h" #include #include #include namespace arrow { class Schema; namespace io { class RandomAccessFile; } namespace ipc { class RecordBatchFileReader; } } // namespace arrow namespace pod5 { class CalibrationData; class EndReasonData; class PoreData; class RunInfoData; class ReadIdSearchInput; struct ReadTableRecordColumns { std::shared_ptr read_id; std::shared_ptr signal; std::shared_ptr read_number; std::shared_ptr start_sample; std::shared_ptr median_before; std::shared_ptr num_minknow_events; [[deprecated]] std::shared_ptr tracked_scaling_scale; [[deprecated]] std::shared_ptr tracked_scaling_shift; [[deprecated]] std::shared_ptr predicted_scaling_scale; [[deprecated]] std::shared_ptr predicted_scaling_shift; [[deprecated]] std::shared_ptr num_reads_since_mux_change; [[deprecated]] std::shared_ptr time_since_mux_change; std::shared_ptr num_samples; std::shared_ptr channel; std::shared_ptr well; std::shared_ptr pore_type; std::shared_ptr calibration_offset; std::shared_ptr calibration_scale; std::shared_ptr end_reason; std::shared_ptr end_reason_forced; std::shared_ptr run_info; std::shared_ptr open_pore_level; TableSpecVersion table_version; }; class POD5_FORMAT_EXPORT ReadTableRecordBatch : public TableRecordBatch { public: ReadTableRecordBatch( std::shared_ptr && batch, std::shared_ptr const & field_locations); ReadTableRecordBatch(ReadTableRecordBatch &&); std::shared_ptr read_id_column() const; std::shared_ptr signal_column() const; Result get_pore_type(std::int16_t pore_dict_index) const; Result> get_end_reason( std::int16_t end_reason_dict_index) const; Result get_run_info(std::int16_t run_info_dict_index) const; Result columns() const; Result> get_signal_rows(std::int64_t batch_row) const; private: std::shared_ptr m_field_locations; enum class Dict : std::size_t { Pore, EndReason, RunInfo, 
Max }; mutable std::atomic_bool m_dictionary_initialised[static_cast(Dict::Max)]{}; mutable std::mutex m_dictionary_access_lock; template auto & get_dictionary(std::shared_ptr const & array) const; }; class POD5_FORMAT_EXPORT ReadTableReader : public TableReader { public: ReadTableReader( std::shared_ptr && input_source, std::shared_ptr && reader, std::shared_ptr const & field_locations, SchemaMetadataDescription && schema_metadata, arrow::MemoryPool * pool); ReadTableReader(ReadTableReader && other); ReadTableReader & operator=(ReadTableReader && other); Result read_record_batch(std::size_t i) const; Result search_for_read_ids( ReadIdSearchInput const & search_input, gsl::span const & batch_counts, gsl::span const & batch_rows) const; private: struct IndexData { Uuid id; std::size_t batch; std::size_t batch_row; }; Status build_read_id_lookup() const; mutable std::vector m_sorted_file_read_ids; mutable std::mutex m_sorted_file_read_ids_mutex; private: std::shared_ptr m_field_locations; mutable std::mutex m_batch_get_mutex; }; POD5_FORMAT_EXPORT Result make_read_table_reader( std::shared_ptr const & sink, arrow::MemoryPool * pool); } // namespace pod5 ================================================ FILE: c++/pod5_format/read_table_schema.cpp ================================================ #include "pod5_format/read_table_schema.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/types.h" namespace pod5 { ReadTableSchemaDescription::ReadTableSchemaDescription() : SchemaDescriptionBase(ReadTableSpecVersion::latest()) // V0 Fields , read_id(this, "read_id", uuid(), ReadTableSpecVersion::v0()) , signal(this, "signal", arrow::list(arrow::uint64()), ReadTableSpecVersion::v0()) , read_number(this, "read_number", arrow::uint32(), ReadTableSpecVersion::v0()) , start(this, "start", arrow::uint64(), ReadTableSpecVersion::v0()) , median_before(this, "median_before", arrow::float32(), ReadTableSpecVersion::v0()) , // V1 Fields num_minknow_events(this, 
"num_minknow_events", arrow::uint64(), ReadTableSpecVersion::v1()) , tracked_scaling_scale(this, "tracked_scaling_scale", arrow::float32(), ReadTableSpecVersion::v1()) , tracked_scaling_shift(this, "tracked_scaling_shift", arrow::float32(), ReadTableSpecVersion::v1()) , predicted_scaling_scale( this, "predicted_scaling_scale", arrow::float32(), ReadTableSpecVersion::v1()) , predicted_scaling_shift( this, "predicted_scaling_shift", arrow::float32(), ReadTableSpecVersion::v1()) , num_reads_since_mux_change( this, "num_reads_since_mux_change", arrow::uint32(), ReadTableSpecVersion::v1()) , time_since_mux_change(this, "time_since_mux_change", arrow::float32(), ReadTableSpecVersion::v1()) , // V2 Fields num_samples(this, "num_samples", arrow::uint64(), ReadTableSpecVersion::v2()) , // V3 Fields channel(this, "channel", arrow::uint16(), ReadTableSpecVersion::v3()) , well(this, "well", arrow::uint8(), ReadTableSpecVersion::v3()) , pore_type( this, "pore_type", arrow::dictionary(arrow::int16(), arrow::utf8()), ReadTableSpecVersion::v3()) , calibration_offset(this, "calibration_offset", arrow::float32(), ReadTableSpecVersion::v3()) , calibration_scale(this, "calibration_scale", arrow::float32(), ReadTableSpecVersion::v3()) , end_reason( this, "end_reason", arrow::dictionary(arrow::int16(), arrow::utf8()), ReadTableSpecVersion::v3()) , end_reason_forced(this, "end_reason_forced", arrow::boolean(), ReadTableSpecVersion::v3()) , run_info( this, "run_info", arrow::dictionary(arrow::int16(), arrow::utf8()), ReadTableSpecVersion::v3()) , open_pore_level(this, "open_pore_level", arrow::float32(), ReadTableSpecVersion::v4()) { } TableSpecVersion ReadTableSchemaDescription::table_version_from_file_version( Version file_version) const { return ReadTableSpecVersion::latest(); } Result> read_read_table_schema( SchemaMetadataDescription const & schema_metadata, std::shared_ptr const & schema) { auto result = std::make_shared(); 
ARROW_RETURN_NOT_OK(ReadTableSchemaDescription::read_schema(result, schema_metadata, schema)); return result; } } // namespace pod5 ================================================ FILE: c++/pod5_format/read_table_schema.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/result.h" #include "pod5_format/schema_utils.h" #include "pod5_format/tuple_utils.h" #include "pod5_format/types.h" #include #include #include namespace arrow { class KeyValueMetadata; class Schema; class DataType; class StructType; } // namespace arrow namespace pod5 { struct SchemaMetadataDescription; class ReadTableSpecVersion { public: static TableSpecVersion v0() { return TableSpecVersion::first_version(); } static TableSpecVersion v1() { // Addition of num_minknow_events and scaling parameters return TableSpecVersion::at_version(1); } static TableSpecVersion v2() { // Addition of num_samples parameter return TableSpecVersion::at_version(2); } static TableSpecVersion v3() { // Flattening of dictionaries into separate table. return TableSpecVersion::at_version(3); } static TableSpecVersion v4() { // Addition of open_pore_level field. 
return TableSpecVersion::at_version(4); } static TableSpecVersion latest() { return v4(); } }; class ReadTableSchemaDescription : public SchemaDescriptionBase { public: ReadTableSchemaDescription(); ReadTableSchemaDescription(ReadTableSchemaDescription const &) = delete; ReadTableSchemaDescription & operator=(ReadTableSchemaDescription const &) = delete; TableSpecVersion table_version_from_file_version(Version file_version) const override; // V0 fields Field<0, UuidArray> read_id; ListField<1, arrow::ListArray, arrow::UInt64Array> signal; Field<2, arrow::UInt32Array> read_number; Field<3, arrow::UInt64Array> start; Field<4, arrow::FloatArray> median_before; // V1 fields Field<5, arrow::UInt64Array> num_minknow_events; [[deprecated]] Field<6, arrow::FloatArray> tracked_scaling_scale; [[deprecated]] Field<7, arrow::FloatArray> tracked_scaling_shift; [[deprecated]] Field<8, arrow::FloatArray> predicted_scaling_scale; [[deprecated]] Field<9, arrow::FloatArray> predicted_scaling_shift; [[deprecated]] Field<10, arrow::UInt32Array> num_reads_since_mux_change; [[deprecated]] Field<11, arrow::FloatArray> time_since_mux_change; // V2 fields Field<12, arrow::UInt64Array> num_samples; // V3 fields Field<13, arrow::UInt16Array> channel; Field<14, arrow::UInt8Array> well; Field<15, arrow::DictionaryArray> pore_type; Field<16, arrow::FloatArray> calibration_offset; Field<17, arrow::FloatArray> calibration_scale; Field<18, arrow::DictionaryArray> end_reason; Field<19, arrow::BooleanArray> end_reason_forced; Field<20, arrow::DictionaryArray> run_info; // V4 fields Field<21, arrow::FloatArray> open_pore_level; // Field Builders only for fields we write in newly generated files. 
// Should not include fields which are removed in the latest version: using FieldBuilders = FieldBuilder< // V0 fields decltype(read_id), decltype(signal), decltype(read_number), decltype(start), decltype(median_before), // V1 fields decltype(num_minknow_events), decltype(tracked_scaling_scale), decltype(tracked_scaling_shift), decltype(predicted_scaling_scale), decltype(predicted_scaling_shift), decltype(num_reads_since_mux_change), decltype(time_since_mux_change), // V2 fields decltype(num_samples), // V3 fields decltype(channel), decltype(well), decltype(pore_type), decltype(calibration_offset), decltype(calibration_scale), decltype(end_reason), decltype(end_reason_forced), decltype(run_info), // V4 fields decltype(open_pore_level)>; }; POD5_FORMAT_EXPORT Result> read_read_table_schema( SchemaMetadataDescription const & schema_metadata, std::shared_ptr const &); } // namespace pod5 ================================================ FILE: c++/pod5_format/read_table_utils.cpp ================================================ #include "pod5_format/read_table_utils.h" #include namespace pod5 { ReadIdSearchInput::ReadIdSearchInput(gsl::span const & input_ids) : m_search_read_ids(input_ids.size()) { // Copy in search input: for (std::size_t i = 0; i < input_ids.size(); ++i) { m_search_read_ids[i].id = input_ids[i]; m_search_read_ids[i].index = i; } // Sort input based on read id: std::sort( m_search_read_ids.begin(), m_search_read_ids.end(), [](auto const & a, auto const & b) { return a.id < b.id; }); } } // namespace pod5 ================================================ FILE: c++/pod5_format/read_table_utils.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/uuid.h" #include #include #include #include #include namespace pod5 { using PoreDictionaryIndex = std::int16_t; using EndReasonDictionaryIndex = std::int16_t; using RunInfoDictionaryIndex = std::int16_t; class ReadData { public: 
ReadData() = default; /// \brief Create a new read data structure to add to a read. /// \param read_id The read id for the read entry. /// \param read_number Read number for this read. /// \param start_sample The sample which this read starts at. /// \param median_before The median of the read chunk prior to the start of this read. /// \param end_reason The dictionary index of the end reason name which caused this read to complete. /// \param end_reason_forced Boolean value indicating if the read end was forced. /// \param run_info The dictionary index of the run info for this read. /// \param num_minknow_events The number of minknow events in the read. ReadData( Uuid const & read_id, std::uint32_t read_number, std::uint64_t start_sample, std::uint16_t channel, std::uint8_t well, PoreDictionaryIndex pore_type, float calibration_offset, float calibration_scale, float median_before, EndReasonDictionaryIndex end_reason, bool end_reason_forced, RunInfoDictionaryIndex run_info, std::uint64_t num_minknow_events, float tracked_scaling_scale, float tracked_scaling_shift, float predicted_scaling_scale, float predicted_scaling_shift, std::uint32_t num_reads_since_mux_change, float time_since_mux_change, float open_pore_level) : read_id(read_id) , read_number(read_number) , start_sample(start_sample) , median_before(median_before) , end_reason(end_reason) , end_reason_forced(end_reason_forced) , run_info(run_info) , num_minknow_events(num_minknow_events) , tracked_scaling_scale(tracked_scaling_scale) , tracked_scaling_shift(tracked_scaling_shift) , predicted_scaling_scale(predicted_scaling_scale) , predicted_scaling_shift(predicted_scaling_shift) , num_reads_since_mux_change(num_reads_since_mux_change) , time_since_mux_change(time_since_mux_change) , channel(channel) , well(well) , pore_type(pore_type) , calibration_offset(calibration_offset) , calibration_scale(calibration_scale) , open_pore_level(open_pore_level) { } // V1 Fields Uuid read_id; std::uint32_t read_number; 
std::uint64_t start_sample; float median_before; EndReasonDictionaryIndex end_reason; bool end_reason_forced; RunInfoDictionaryIndex run_info; // V2 Fields std::uint64_t num_minknow_events; [[deprecated]] float tracked_scaling_scale; [[deprecated]] float tracked_scaling_shift; [[deprecated]] float predicted_scaling_scale; [[deprecated]] float predicted_scaling_shift; [[deprecated]] std::uint32_t num_reads_since_mux_change; [[deprecated]] float time_since_mux_change; // V3 Fields std::uint16_t channel; std::uint8_t well; PoreDictionaryIndex pore_type; float calibration_offset; float calibration_scale; // V4 Fields float open_pore_level; }; inline bool operator==(ReadData const & a, ReadData const & b) { return a.read_id == b.read_id && a.read_number == b.read_number && a.start_sample == b.start_sample && a.median_before == b.median_before && a.end_reason == b.end_reason && a.end_reason_forced == b.end_reason_forced && a.run_info == b.run_info && a.num_minknow_events == b.num_minknow_events && a.tracked_scaling_scale == b.tracked_scaling_scale && a.tracked_scaling_shift == b.tracked_scaling_shift && a.predicted_scaling_scale == b.predicted_scaling_scale && a.predicted_scaling_shift == b.predicted_scaling_shift && a.num_reads_since_mux_change == b.num_reads_since_mux_change && a.time_since_mux_change == b.time_since_mux_change && a.channel == b.channel && a.well == b.well && a.pore_type == b.pore_type && a.calibration_offset == b.calibration_offset && a.calibration_scale == b.calibration_scale && a.open_pore_level == b.open_pore_level; } class RunInfoData { public: using MapType = std::vector>; RunInfoData( std::string acquisition_id, std::int64_t acquisition_start_time, std::int16_t adc_max, std::int16_t adc_min, MapType context_tags, std::string experiment_name, std::string flow_cell_id, std::string flow_cell_product_code, std::string protocol_name, std::string protocol_run_id, std::int64_t protocol_start_time, std::string sample_id, std::uint16_t sample_rate, 
std::string sequencing_kit, std::string sequencer_position, std::string sequencer_position_type, std::string software, std::string system_name, std::string system_type, MapType tracking_id) : acquisition_id(std::move(acquisition_id)) , acquisition_start_time(std::move(acquisition_start_time)) , adc_max(std::move(adc_max)) , adc_min(std::move(adc_min)) , context_tags(std::move(context_tags)) , experiment_name(std::move(experiment_name)) , flow_cell_id(std::move(flow_cell_id)) , flow_cell_product_code(std::move(flow_cell_product_code)) , protocol_name(std::move(protocol_name)) , protocol_run_id(std::move(protocol_run_id)) , protocol_start_time(std::move(protocol_start_time)) , sample_id(std::move(sample_id)) , sample_rate(std::move(sample_rate)) , sequencing_kit(std::move(sequencing_kit)) , sequencer_position(std::move(sequencer_position)) , sequencer_position_type(std::move(sequencer_position_type)) , software(std::move(software)) , system_name(std::move(system_name)) , system_type(std::move(system_type)) , tracking_id(std::move(tracking_id)) { } static std::int64_t convert_from_system_clock(std::chrono::system_clock::time_point value) { return value.time_since_epoch() / std::chrono::milliseconds(1); } static std::chrono::system_clock::time_point convert_to_system_clock( std::int64_t since_epoch_ms) { return std::chrono::system_clock::time_point() + std::chrono::milliseconds(since_epoch_ms); } std::string acquisition_id; std::int64_t acquisition_start_time; std::int16_t adc_max; std::int16_t adc_min; MapType context_tags; std::string experiment_name; std::string flow_cell_id; std::string flow_cell_product_code; std::string protocol_name; std::string protocol_run_id; std::int64_t protocol_start_time; std::string sample_id; std::uint16_t sample_rate; std::string sequencing_kit; std::string sequencer_position; std::string sequencer_position_type; std::string software; std::string system_name; std::string system_type; MapType tracking_id; }; inline bool 
operator==(RunInfoData const & a, RunInfoData const & b) { return a.acquisition_id == b.acquisition_id && a.acquisition_start_time == b.acquisition_start_time && a.adc_max == b.adc_max && a.adc_min == b.adc_min && a.context_tags == b.context_tags && a.experiment_name == b.experiment_name && a.flow_cell_id == b.flow_cell_id && a.flow_cell_product_code == b.flow_cell_product_code && a.protocol_name == b.protocol_name && a.protocol_run_id == b.protocol_run_id && a.protocol_start_time == b.protocol_start_time && a.sample_id == b.sample_id && a.sample_rate == b.sample_rate && a.sequencing_kit == b.sequencing_kit && a.sequencer_position == b.sequencer_position && a.sequencer_position_type == b.sequencer_position_type && a.software == b.software && a.system_name == b.system_name && a.system_type == b.system_type && a.tracking_id == b.tracking_id; } enum class ReadEndReason : std::uint8_t { unknown, mux_change, unblock_mux_change, data_service_unblock_mux_change, signal_positive, signal_negative, api_request, device_data_error, analysis_config_change, paused, last_end_reason = paused }; inline char const * end_reason_as_string(ReadEndReason reason) { static_assert( ReadEndReason::last_end_reason == ReadEndReason::paused, "Need to add new end reason to this function"); switch (reason) { case ReadEndReason::mux_change: return "mux_change"; case ReadEndReason::unblock_mux_change: return "unblock_mux_change"; case ReadEndReason::data_service_unblock_mux_change: return "data_service_unblock_mux_change"; case ReadEndReason::signal_positive: return "signal_positive"; case ReadEndReason::signal_negative: return "signal_negative"; case ReadEndReason::api_request: return "api_request"; case ReadEndReason::device_data_error: return "device_data_error"; case ReadEndReason::analysis_config_change: return "analysis_config_change"; case ReadEndReason::paused: return "paused"; case ReadEndReason::unknown: break; } return "unknown"; } inline ReadEndReason end_reason_from_string(std::string 
const & reason) { static_assert( ReadEndReason::last_end_reason == ReadEndReason::paused, "Need to add new end reason to this function"); if (reason == "unknown") { return ReadEndReason::unknown; } else if (reason == "mux_change") { return ReadEndReason::mux_change; } else if (reason == "unblock_mux_change") { return ReadEndReason::unblock_mux_change; } else if (reason == "data_service_unblock_mux_change") { return ReadEndReason::data_service_unblock_mux_change; } else if (reason == "signal_positive") { return ReadEndReason::signal_positive; } else if (reason == "signal_negative") { return ReadEndReason::signal_negative; } else if (reason == "api_request") { return ReadEndReason::api_request; } else if (reason == "device_data_error") { return ReadEndReason::device_data_error; } else if (reason == "analysis_config_change") { return ReadEndReason::analysis_config_change; } else if (reason == "paused") { return ReadEndReason::paused; } return ReadEndReason::unknown; } /// \brief Input query to a search for a number of read ids in a file: class POD5_FORMAT_EXPORT ReadIdSearchInput { public: struct InputId { Uuid id; std::size_t index; }; ReadIdSearchInput(gsl::span const & input_ids); std::size_t read_id_count() const { return m_search_read_ids.size(); } InputId const & operator[](std::size_t i) const { return m_search_read_ids[i]; } private: std::vector m_search_read_ids; }; } // namespace pod5 ================================================ FILE: c++/pod5_format/read_table_writer.cpp ================================================ #include "pod5_format/read_table_writer.h" #include "pod5_format/file_output_stream.h" #include "pod5_format/internal/tracing/tracing.h" #include #include #include #include #include namespace pod5 { ReadTableWriter::ReadTableWriter( std::shared_ptr && writer, std::shared_ptr && schema, std::shared_ptr const & field_locations, std::size_t table_batch_size, std::shared_ptr const & pore_writer, std::shared_ptr const & end_reason_writer, 
std::shared_ptr const & run_info_writer, std::shared_ptr const & output_stream, arrow::MemoryPool * pool) : m_schema(schema) , m_field_locations(field_locations) , m_table_batch_size(table_batch_size) , m_writer(std::move(writer)) , m_field_builders(m_field_locations, pool) , m_output_stream{output_stream} { m_field_builders.get_builder(m_field_locations->pore_type).set_dict_writer(pore_writer); m_field_builders.get_builder(m_field_locations->end_reason).set_dict_writer(end_reason_writer); m_field_builders.get_builder(m_field_locations->run_info).set_dict_writer(run_info_writer); } ReadTableWriter::ReadTableWriter(ReadTableWriter && other) = default; ReadTableWriter & ReadTableWriter::operator=(ReadTableWriter &&) = default; ReadTableWriter::~ReadTableWriter() { if (m_writer) { (void)close(); } } Result ReadTableWriter::add_read( ReadData const & read_data, gsl::span const & signal, std::uint64_t signal_duration) { POD5_TRACE_FUNCTION(); if (!m_writer) { return Status::IOError("Writer terminated"); } ARROW_RETURN_NOT_OK(reserve_rows()); auto row_id = m_written_batched_row_count + m_current_batch_row_count; ARROW_RETURN_NOT_OK(m_field_builders.append( // V0 Fields read_data.read_id, signal, read_data.read_number, read_data.start_sample, read_data.median_before, // V1 Fields read_data.num_minknow_events, read_data.tracked_scaling_scale, read_data.tracked_scaling_shift, read_data.predicted_scaling_scale, read_data.predicted_scaling_shift, read_data.num_reads_since_mux_change, read_data.time_since_mux_change, // V2 Fields signal_duration, // V3 Fields read_data.channel, read_data.well, read_data.pore_type, read_data.calibration_offset, read_data.calibration_scale, read_data.end_reason, read_data.end_reason_forced, read_data.run_info, // V4 Fields read_data.open_pore_level)); ++m_current_batch_row_count; if (m_current_batch_row_count >= m_table_batch_size) { ARROW_RETURN_NOT_OK(write_batch()); } return row_id; } Status ReadTableWriter::close() { // Check for already 
closed if (!m_writer) { return Status::OK(); } ARROW_RETURN_NOT_OK(write_batch()); ARROW_RETURN_NOT_OK(m_writer->Close()); m_writer = nullptr; return Status::OK(); } Status ReadTableWriter::write_batch(arrow::RecordBatch const & record_batch) { ARROW_RETURN_NOT_OK(m_writer->WriteRecordBatch(record_batch)); return m_output_stream->batch_complete(); } Status ReadTableWriter::write_batch() { POD5_TRACE_FUNCTION(); if (m_current_batch_row_count == 0) { return Status::OK(); } if (!m_writer) { return Status::IOError("Writer terminated"); } ARROW_ASSIGN_OR_RAISE(auto columns, m_field_builders.finish_columns()); auto const record_batch = arrow::RecordBatch::Make(m_schema, m_current_batch_row_count, std::move(columns)); m_written_batched_row_count += m_current_batch_row_count; m_current_batch_row_count = 0; ARROW_RETURN_NOT_OK(m_writer->WriteRecordBatch(*record_batch)); return m_output_stream->batch_complete(); } Status ReadTableWriter::reserve_rows() { // Only reserve if we have not already reserved (at the start of a batch) if (m_current_batch_row_count > 0) { return arrow::Status::OK(); } return m_field_builders.reserve(m_table_batch_size); } Result make_read_table_writer( std::shared_ptr const & sink, std::shared_ptr const & metadata, std::size_t table_batch_size, std::shared_ptr const & pore_writer, std::shared_ptr const & end_reason_writer, std::shared_ptr const & run_info_writer, arrow::MemoryPool * pool) { auto field_locations = std::make_shared(); auto schema = field_locations->make_writer_schema(metadata); arrow::ipc::IpcWriteOptions options; options.memory_pool = pool; options.emit_dictionary_deltas = true; // todo... 
consider: //ARROW_ASSIGN_OR_RAISE(options.codec, arrow::util::Codec::Create(arrow::Compression::LZ4_FRAME)); ARROW_ASSIGN_OR_RAISE(auto writer, arrow::ipc::MakeFileWriter(sink, schema, options, metadata)); auto read_table_writer = ReadTableWriter( std::move(writer), std::move(schema), field_locations, table_batch_size, pore_writer, end_reason_writer, run_info_writer, sink, pool); return read_table_writer; } } // namespace pod5 ================================================ FILE: c++/pod5_format/read_table_writer.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/read_table_schema.h" #include "pod5_format/read_table_writer_utils.h" #include "pod5_format/result.h" #include "pod5_format/schema_field_builder.h" #include "pod5_format/signal_table_utils.h" #include #include namespace arrow { class Schema; namespace ipc { class RecordBatchWriter; } } // namespace arrow namespace pod5 { class FileOutputStream; class POD5_FORMAT_EXPORT ReadTableWriter { public: ReadTableWriter( std::shared_ptr && writer, std::shared_ptr && schema, std::shared_ptr const & field_locations, std::size_t table_batch_size, std::shared_ptr const & pore_writer, std::shared_ptr const & end_reason_writer, std::shared_ptr const & run_info_writer, std::shared_ptr const & output_stream, arrow::MemoryPool * pool); ReadTableWriter(ReadTableWriter &&); ReadTableWriter & operator=(ReadTableWriter &&); ReadTableWriter(ReadTableWriter const &) = delete; ReadTableWriter & operator=(ReadTableWriter const &) = delete; ~ReadTableWriter(); /// \brief Add a read to the read table, adding to the current batch. /// \param read_data The data to add as a read. /// \param signal List of signal table row indices that belong to this read. /// \param signal_duration The length of the read in samples. /// \returns The row index of the inserted read, or a status on failure. 
Result add_read( ReadData const & read_data, gsl::span const & signal, std::uint64_t signal_duration); /// \brief Close this writer, signaling no further data will be written to the writer. Status close(); /// \brief Reserve space for future row writes, called automatically when a flush occurs. Status reserve_rows(); /// \brief Find the schema for the table std::shared_ptr const & schema() const { return m_schema; } /// \brief Flush passed data into the writer as a record batch. Status write_batch(arrow::RecordBatch const &); private: /// \brief Flush buffered data into the writer as a record batch. Status write_batch(); std::shared_ptr m_schema; std::shared_ptr m_field_locations; std::size_t m_table_batch_size; std::shared_ptr m_writer; ReadTableSchemaDescription::FieldBuilders m_field_builders; std::size_t m_written_batched_row_count = 0; std::size_t m_current_batch_row_count = 0; std::shared_ptr m_output_stream; }; /// \brief Make a new writer for a read table. /// \param sink Sink to be used for output of the table. /// \param metadata Metadata to be applied to the table schema. /// \param table_batch_size The size of each batch written for the table. /// \param pool Pool to be used for building table in memory. /// \returns The writer for the new table. 
POD5_FORMAT_EXPORT Result make_read_table_writer( std::shared_ptr const & sink, std::shared_ptr const & metadata, std::size_t table_batch_size, std::shared_ptr const & pore_writer, std::shared_ptr const & end_reason_writer, std::shared_ptr const & run_info_writer, arrow::MemoryPool * pool); } // namespace pod5 ================================================ FILE: c++/pod5_format/read_table_writer_utils.cpp ================================================ #include "pod5_format/read_table_writer_utils.h" #include "pod5_format/read_table_schema.h" #include #include #include #include namespace pod5 { namespace detail { arrow::Result> get_array_data( std::shared_ptr const & type, StringDictionaryKeyBuilder const & builder, std::size_t expected_length) { auto const value_data = builder.get_string_data(); if (!value_data) { return Status::Invalid("Missing array value data for dictionary"); } arrow::TypedBufferBuilder offset_builder; auto const & offset_data = builder.get_typed_offset_data(); if (offset_data.size() != expected_length) { return Status::Invalid("Invalid size for field in struct"); } ARROW_RETURN_NOT_OK(offset_builder.Append(offset_data.data(), offset_data.size())); // Append final offset - size of value data. 
ARROW_RETURN_NOT_OK(offset_builder.Append(value_data->size())); std::shared_ptr offsets; ARROW_RETURN_NOT_OK(offset_builder.Finish(&offsets)); return arrow::ArrayData::Make(type, expected_length, {nullptr, offsets, value_data}, 0, 0); } } // namespace detail arrow::Result> make_pore_writer(arrow::MemoryPool * pool) { return std::make_shared(pool); } arrow::Result> make_end_reason_writer(arrow::MemoryPool * pool) { std::shared_ptr end_reasons; { arrow::StringBuilder builder(pool); for (int end_reason = 0; end_reason <= (int)ReadEndReason::last_end_reason; ++end_reason) { ARROW_RETURN_NOT_OK(builder.Append(end_reason_as_string((ReadEndReason)end_reason))); } ARROW_RETURN_NOT_OK(builder.Finish(&end_reasons)); } return std::make_shared(end_reasons); } arrow::Result> make_run_info_writer(arrow::MemoryPool * pool) { return std::make_shared(pool); } pod5::Result> DictionaryWriter::build_dictionary_array( std::shared_ptr const & indices) { ARROW_ASSIGN_OR_RAISE(auto res, get_value_array()); return arrow::DictionaryArray::FromArrays(indices, res); } PoreWriter::PoreWriter(arrow::MemoryPool * pool) : m_builder(pool) {} pod5::Result> PoreWriter::get_value_array() { ARROW_ASSIGN_OR_RAISE(auto array_data, get_array_data(arrow::utf8(), m_builder, item_count())); return std::make_shared(array_data); } std::size_t PoreWriter::item_count() { return m_builder.length(); } EndReasonWriter::EndReasonWriter(std::shared_ptr const & end_reasons) : m_end_reasons(end_reasons) { } pod5::Result> EndReasonWriter::get_value_array() { return m_end_reasons; } std::size_t EndReasonWriter::item_count() { return m_end_reasons->length(); } RunInfoWriter::RunInfoWriter(arrow::MemoryPool * pool) : m_builder(pool) {} pod5::Result> RunInfoWriter::get_value_array() { ARROW_ASSIGN_OR_RAISE(auto array_data, get_array_data(arrow::utf8(), m_builder, item_count())); return std::make_shared(array_data); } std::size_t RunInfoWriter::item_count() { return m_builder.length(); } } // namespace pod5 
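// Illustrative sketch only, not part of the library: get_array_data() above
// builds a string array from N start offsets plus one final offset equal to the
// total size of the value data. The Arrow-free example below shows that layout
// in isolation. StringKeySketch, finished_offsets and value_at are hypothetical
// names introduced for this sketch, and 32-bit offsets are an assumption based
// on the arrow::utf8() type used by the callers.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for detail::StringDictionaryKeyBuilder: it records the
// start offset of each appended string and concatenates the string bytes.
struct StringKeySketch {
    std::vector<std::int32_t> offsets;  // start offset of each string
    std::string values;                 // all strings, back to back

    void append(std::string const & value)
    {
        offsets.push_back(static_cast<std::int32_t>(values.size()));
        values += value;
    }

    std::size_t length() const { return offsets.size(); }

    // Same idea as get_array_data(): copy the N start offsets, then append one
    // final offset (the size of the value data), giving N + 1 entries.
    std::vector<std::int32_t> finished_offsets() const
    {
        std::vector<std::int32_t> result(offsets);
        result.push_back(static_cast<std::int32_t>(values.size()));
        return result;
    }
};

// Read string i back out of the finished layout: it spans
// [offsets[i], offsets[i + 1]) in the value data.
inline std::string value_at(
    std::vector<std::int32_t> const & offsets,
    std::string const & values,
    std::size_t i)
{
    return values.substr(offsets[i], offsets[i + 1] - offsets[i]);
}
```

// The final offset is what lets the last string be sliced without a separate
// length column, which is why the builder's own length() stays at N while the
// finished offsets buffer holds N + 1 entries.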
================================================ FILE: c++/pod5_format/read_table_writer_utils.h ================================================ #pragma once #include "pod5_format/dictionary_writer.h" #include "pod5_format/expandable_buffer.h" #include "pod5_format/pod5_format_export.h" #include "pod5_format/read_table_utils.h" #include "pod5_format/result.h" #include "pod5_format/tuple_utils.h" #include #include #include #include #include #include #include namespace pod5 { namespace detail { class StringDictionaryKeyBuilder { public: StringDictionaryKeyBuilder(arrow::MemoryPool * pool = nullptr) : m_offset_values(pool) , m_string_values(pool) { } arrow::Status init_buffer(arrow::MemoryPool * pool) { ARROW_RETURN_NOT_OK(m_offset_values.init_buffer(pool)); return m_string_values.init_buffer(pool); } arrow::Status append(std::string const & value) { ARROW_RETURN_NOT_OK(m_offset_values.append(m_string_values.size())); return m_string_values.append_array( gsl::make_span(value.data(), value.size()).as_span()); } std::size_t length() const { return m_offset_values.size(); } std::shared_ptr get_string_data() const { return m_string_values.get_buffer(); } gsl::span get_typed_offset_data() const { return m_offset_values.get_data_span(); } private: ExpandableBuffer m_offset_values; ExpandableBuffer m_string_values; }; } // namespace detail class POD5_FORMAT_EXPORT PoreWriter : public DictionaryWriter { public: PoreWriter(arrow::MemoryPool * pool); pod5::Result add(std::string const & pore_type) { auto const index = item_count(); if (index >= std::size_t(std::numeric_limits::max())) { return arrow::Status::Invalid( "Failed to add pore to dictionary, too many indices in file"); } ARROW_RETURN_NOT_OK(m_builder.append(pore_type)); return index; } pod5::Result> get_value_array() override; std::size_t item_count() override; private: detail::StringDictionaryKeyBuilder m_builder; }; class POD5_FORMAT_EXPORT EndReasonWriter : public DictionaryWriter { public: 
EndReasonWriter(std::shared_ptr const & end_reasons); pod5::Result lookup(ReadEndReason end_reason) const { if (end_reason > ReadEndReason::last_end_reason) { return pod5::Status::Invalid("Invalid read end reason requested"); } return EndReasonDictionaryIndex(end_reason); } pod5::Result> get_value_array() override; std::size_t item_count() override; private: std::shared_ptr m_end_reasons; }; class POD5_FORMAT_EXPORT RunInfoWriter : public DictionaryWriter { public: RunInfoWriter(arrow::MemoryPool * pool); pod5::Result add(std::string const & acquisition_id) { auto const index = item_count(); if (index >= std::size_t(std::numeric_limits::max())) { return arrow::Status::Invalid( "Failed to add run info to dictionary, too many indices in file"); } ARROW_RETURN_NOT_OK(m_builder.append(acquisition_id)); return index; } pod5::Result> get_value_array() override; std::size_t item_count() override; private: detail::StringDictionaryKeyBuilder m_builder; }; POD5_FORMAT_EXPORT arrow::Result> make_pore_writer( arrow::MemoryPool * pool); POD5_FORMAT_EXPORT arrow::Result> make_end_reason_writer( arrow::MemoryPool * pool); POD5_FORMAT_EXPORT arrow::Result> make_run_info_writer( arrow::MemoryPool * pool); } // namespace pod5 ================================================ FILE: c++/pod5_format/result.h ================================================ #pragma once #include namespace pod5 { /// pod5::Result is just an Arrow Result right now. 
template using Result = arrow::Result; using Status = arrow::Status; } // namespace pod5 ================================================ FILE: c++/pod5_format/run_info_table_reader.cpp ================================================ #include "pod5_format/run_info_table_reader.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/schema_utils.h" #include #include #include #include #include #include namespace pod5 { inline std::vector> value_for_map( std::shared_ptr const & map_array, std::size_t row_index) { std::size_t offset = map_array->value_offset(row_index); std::size_t length = map_array->value_length(row_index); auto const & keys = std::dynamic_pointer_cast(map_array->keys()); auto const & items = std::dynamic_pointer_cast(map_array->items()); std::vector> result; for (std::size_t i = offset; i < offset + length; ++i) { result.push_back(std::make_pair(keys->GetString(i), items->GetString(i))); } return result; } RunInfoTableRecordBatch::RunInfoTableRecordBatch( std::shared_ptr && batch, std::shared_ptr const & field_locations) : TableRecordBatch(std::move(batch)) , m_field_locations(field_locations) { } RunInfoTableRecordBatch::RunInfoTableRecordBatch(RunInfoTableRecordBatch && other) : TableRecordBatch(std::move(other)) { m_field_locations = std::move(other.m_field_locations); } RunInfoTableRecordBatch & RunInfoTableRecordBatch::operator=(RunInfoTableRecordBatch && other) { TableRecordBatch & base = *this; base = other; m_field_locations = std::move(other.m_field_locations); return *this; } Result RunInfoTableRecordBatch::columns() const { RunInfoTableRecordColumns result; result.table_version = m_field_locations->table_version(); auto const & bat = batch(); // V0 fields: result.acquisition_id = find_column(bat, m_field_locations->acquisition_id); result.acquisition_start_time = find_column(bat, m_field_locations->acquisition_start_time); result.adc_max = find_column(bat, m_field_locations->adc_max); result.adc_min = find_column(bat, 
m_field_locations->adc_min); result.context_tags = find_column(bat, m_field_locations->context_tags); result.experiment_name = find_column(bat, m_field_locations->experiment_name); result.flow_cell_id = find_column(bat, m_field_locations->flow_cell_id); result.flow_cell_product_code = find_column(bat, m_field_locations->flow_cell_product_code); result.protocol_name = find_column(bat, m_field_locations->protocol_name); result.protocol_run_id = find_column(bat, m_field_locations->protocol_run_id); result.protocol_start_time = find_column(bat, m_field_locations->protocol_start_time); result.sample_id = find_column(bat, m_field_locations->sample_id); result.sample_rate = find_column(bat, m_field_locations->sample_rate); result.sequencing_kit = find_column(bat, m_field_locations->sequencing_kit); result.sequencer_position = find_column(bat, m_field_locations->sequencer_position); result.sequencer_position_type = find_column(bat, m_field_locations->sequencer_position_type); result.software = find_column(bat, m_field_locations->software); result.system_name = find_column(bat, m_field_locations->system_name); result.system_type = find_column(bat, m_field_locations->system_type); result.tracking_id = find_column(bat, m_field_locations->tracking_id); return result; } //--------------------------------------------------------------------------------------------------------------------- RunInfoTableReader::RunInfoTableReader( std::shared_ptr && input_source, std::shared_ptr && reader, std::shared_ptr const & field_locations, SchemaMetadataDescription && schema_metadata, arrow::MemoryPool * pool) : TableReader(std::move(input_source), std::move(reader), std::move(schema_metadata), pool) , m_field_locations(field_locations) { } RunInfoTableReader::RunInfoTableReader(RunInfoTableReader && other) : TableReader(std::move(other)) , m_field_locations(std::move(other.m_field_locations)) { } RunInfoTableReader & RunInfoTableReader::operator=(RunInfoTableReader && other) { 
    // Move the base-class state from `other` (not from `*this`, which would be a self-move).
    static_cast<TableReader &>(*this) = std::move(static_cast<TableReader &>(other));
    m_field_locations = std::move(other.m_field_locations);
    return *this;
}

Result<RunInfoTableRecordBatch> RunInfoTableReader::read_record_batch(std::size_t i) const
{
    std::lock_guard<std::mutex> l(m_batch_get_mutex);
    ARROW_ASSIGN_OR_RAISE(auto record_batch, TableReader::ReadRecordBatch(i));
    return RunInfoTableRecordBatch{std::move(record_batch), m_field_locations};
}

Result<std::shared_ptr<RunInfoData const>> RunInfoTableReader::find_run_info(
    std::string const & acquisition_id) const
{
    std::lock_guard<std::mutex> l(m_run_info_lookup_mutex);
    auto it = m_run_info_lookup.find(acquisition_id);
    if (it != m_run_info_lookup.end()) {
        return it->second;
    }

    ARROW_RETURN_NOT_OK(prepare_run_infos_vector());

    std::shared_ptr<RunInfoData const> run_info = nullptr;
    // Global row index across all batches. It advances for every row visited, so the
    // cached entry lands in the same slot that get_run_info(index) reads from.
    std::size_t glb_run_info_index = 0;
    for (std::size_t i = 0; i < num_record_batches(); ++i) {
        ARROW_ASSIGN_OR_RAISE(auto batch, read_record_batch(i));
        auto acq_id = find_column(batch.batch(), m_field_locations->acquisition_id);
        for (std::size_t j = 0; j < batch.num_rows(); ++j, ++glb_run_info_index) {
            if (acq_id->Value(j) == acquisition_id) {
                ARROW_ASSIGN_OR_RAISE(
                    run_info, load_run_info_from_batch(batch, j, glb_run_info_index));
                break;
            }
        }

        if (run_info) {
            break;
        }
    }

    if (!run_info) {
        return arrow::Status::Invalid(
            "Failed to find acquisition id '", acquisition_id, "' in run info table");
    }

    return run_info;
}

Result<std::shared_ptr<RunInfoData const>> RunInfoTableReader::get_run_info(std::size_t index) const
{
    ARROW_RETURN_NOT_OK(prepare_run_infos_vector());

    // `index` is unsigned, so only the upper bound needs checking.
    if (index >= m_run_infos.size()) {
        return arrow::Status::IndexError(
            "Invalid index into run infos (expected ", index, " < ", m_run_infos.size(), ")");
    }

    if (m_run_infos[index]) {
        return m_run_infos[index];
    }

    // Every batch except the last is assumed to have the same row count as the first.
    ARROW_ASSIGN_OR_RAISE(auto first_batch, read_record_batch(0));
    auto const batch_size = first_batch.num_rows();

    auto const batch_idx = index / batch_size;
    auto const batch_row = index - (batch_idx * batch_size);

    if (batch_idx >= num_record_batches()) {
        return Status::Invalid("Row outside batch bounds");
    }

    ARROW_ASSIGN_OR_RAISE(auto batch,
read_record_batch(batch_idx)); return load_run_info_from_batch(batch, batch_row, index); } Result RunInfoTableReader::get_run_info_count() const { auto batch_count = num_record_batches(); if (batch_count == 0) { return 0; } ARROW_ASSIGN_OR_RAISE(auto first_batch, read_record_batch(0)); ARROW_ASSIGN_OR_RAISE(auto last_batch, read_record_batch(batch_count - 1)); return (batch_count - 1) * first_batch.num_rows() + last_batch.num_rows(); } Result> RunInfoTableReader::load_run_info_from_batch( RunInfoTableRecordBatch const & batch, std::size_t batch_index, std::size_t global_index) const { ARROW_ASSIGN_OR_RAISE(auto columns, batch.columns()); auto acquisition_id = columns.acquisition_id->GetString(batch_index); auto run_info = std::make_shared( acquisition_id, columns.acquisition_start_time->Value(batch_index), columns.adc_max->Value(batch_index), columns.adc_min->Value(batch_index), value_for_map(columns.context_tags, batch_index), columns.experiment_name->GetString(batch_index), columns.flow_cell_id->GetString(batch_index), columns.flow_cell_product_code->GetString(batch_index), columns.protocol_name->GetString(batch_index), columns.protocol_run_id->GetString(batch_index), columns.protocol_start_time->Value(batch_index), columns.sample_id->GetString(batch_index), columns.sample_rate->Value(batch_index), columns.sequencing_kit->GetString(batch_index), columns.sequencer_position->GetString(batch_index), columns.sequencer_position_type->GetString(batch_index), columns.software->GetString(batch_index), columns.system_name->GetString(batch_index), columns.system_type->GetString(batch_index), value_for_map(columns.tracking_id, batch_index)); // Cache run info for later retrieval by index: m_run_infos[global_index] = run_info; m_run_info_lookup[acquisition_id] = run_info; return run_info; } arrow::Status RunInfoTableReader::prepare_run_infos_vector() const { if (m_run_infos.empty()) { ARROW_ASSIGN_OR_RAISE(auto row_count, get_run_info_count()) m_run_infos.resize(row_count); 
} return Status::OK(); } //--------------------------------------------------------------------------------------------------------------------- Result make_run_info_table_reader( std::shared_ptr const & input, arrow::MemoryPool * pool) { arrow::ipc::IpcReadOptions options; options.memory_pool = pool; ARROW_ASSIGN_OR_RAISE(auto reader, arrow::ipc::RecordBatchFileReader::Open(input, options)); auto read_metadata_key_values = reader->schema()->metadata(); if (!read_metadata_key_values) { return Status::IOError("Missing metadata on run info table schema"); } ARROW_ASSIGN_OR_RAISE( auto read_metadata, read_schema_key_value_metadata(read_metadata_key_values)); ARROW_ASSIGN_OR_RAISE( auto field_locations, read_run_info_table_schema(read_metadata, reader->schema())); return RunInfoTableReader( {input}, std::move(reader), field_locations, std::move(read_metadata), pool); } } // namespace pod5 ================================================ FILE: c++/pod5_format/run_info_table_reader.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/read_table_utils.h" #include "pod5_format/result.h" #include "pod5_format/run_info_table_schema.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/table_reader.h" #include "pod5_format/types.h" #include #include #include #include namespace arrow { class Schema; namespace io { class RandomAccessFile; } namespace ipc { class RecordBatchFileReader; } } // namespace arrow namespace pod5 { struct RunInfoTableRecordColumns { // V0 Fields std::shared_ptr acquisition_id; std::shared_ptr acquisition_start_time; std::shared_ptr adc_max; std::shared_ptr adc_min; std::shared_ptr context_tags; std::shared_ptr experiment_name; std::shared_ptr flow_cell_id; std::shared_ptr flow_cell_product_code; std::shared_ptr protocol_name; std::shared_ptr protocol_run_id; std::shared_ptr protocol_start_time; std::shared_ptr sample_id; std::shared_ptr sample_rate; std::shared_ptr 
sequencing_kit; std::shared_ptr sequencer_position; std::shared_ptr sequencer_position_type; std::shared_ptr software; std::shared_ptr system_name; std::shared_ptr system_type; std::shared_ptr tracking_id; TableSpecVersion table_version; }; class POD5_FORMAT_EXPORT RunInfoTableRecordBatch : public TableRecordBatch { public: RunInfoTableRecordBatch( std::shared_ptr && batch, std::shared_ptr const & field_locations); RunInfoTableRecordBatch(RunInfoTableRecordBatch &&); RunInfoTableRecordBatch & operator=(RunInfoTableRecordBatch &&); Result columns() const; private: std::shared_ptr m_field_locations; }; class POD5_FORMAT_EXPORT RunInfoTableReader : public TableReader { public: RunInfoTableReader( std::shared_ptr && input_source, std::shared_ptr && reader, std::shared_ptr const & field_locations, SchemaMetadataDescription && schema_metadata, arrow::MemoryPool * pool); RunInfoTableReader(RunInfoTableReader && other); RunInfoTableReader & operator=(RunInfoTableReader && other); Result read_record_batch(std::size_t i) const; Result> find_run_info( std::string const & acquisition_id) const; Result> get_run_info(std::size_t index) const; Result get_run_info_count() const; private: Result> load_run_info_from_batch( RunInfoTableRecordBatch const & batch, std::size_t batch_index, std::size_t global_index) const; arrow::Status prepare_run_infos_vector() const; std::shared_ptr m_field_locations; mutable std::mutex m_batch_get_mutex; mutable std::unordered_map> m_run_info_lookup; mutable std::vector> m_run_infos; mutable std::mutex m_run_info_lookup_mutex; }; POD5_FORMAT_EXPORT Result make_run_info_table_reader( std::shared_ptr const & sink, arrow::MemoryPool * pool); } // namespace pod5 ================================================ FILE: c++/pod5_format/run_info_table_schema.cpp ================================================ #include "pod5_format/run_info_table_schema.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/types.h" namespace pod5 { 
RunInfoTableSchemaDescription::RunInfoTableSchemaDescription() : SchemaDescriptionBase(RunInfoTableSpecVersion::latest()) // V0 Fields , acquisition_id(this, "acquisition_id", arrow::utf8(), RunInfoTableSpecVersion::v0()) , acquisition_start_time( this, "acquisition_start_time", arrow::timestamp(arrow::TimeUnit::MILLI, "UTC"), RunInfoTableSpecVersion::v0()) , adc_max(this, "adc_max", arrow::int16(), RunInfoTableSpecVersion::v0()) , adc_min(this, "adc_min", arrow::int16(), RunInfoTableSpecVersion::v0()) , context_tags( this, "context_tags", arrow::map(arrow::utf8(), arrow::utf8()), RunInfoTableSpecVersion::v0()) , experiment_name(this, "experiment_name", arrow::utf8(), RunInfoTableSpecVersion::v0()) , flow_cell_id(this, "flow_cell_id", arrow::utf8(), RunInfoTableSpecVersion::v0()) , flow_cell_product_code( this, "flow_cell_product_code", arrow::utf8(), RunInfoTableSpecVersion::v0()) , protocol_name(this, "protocol_name", arrow::utf8(), RunInfoTableSpecVersion::v0()) , protocol_run_id(this, "protocol_run_id", arrow::utf8(), RunInfoTableSpecVersion::v0()) , protocol_start_time( this, "protocol_start_time", arrow::timestamp(arrow::TimeUnit::MILLI, "UTC"), RunInfoTableSpecVersion::v0()) , sample_id(this, "sample_id", arrow::utf8(), RunInfoTableSpecVersion::v0()) , sample_rate(this, "sample_rate", arrow::uint16(), RunInfoTableSpecVersion::v0()) , sequencing_kit(this, "sequencing_kit", arrow::utf8(), RunInfoTableSpecVersion::v0()) , sequencer_position(this, "sequencer_position", arrow::utf8(), RunInfoTableSpecVersion::v0()) , sequencer_position_type( this, "sequencer_position_type", arrow::utf8(), RunInfoTableSpecVersion::v0()) , software(this, "software", arrow::utf8(), RunInfoTableSpecVersion::v0()) , system_name(this, "system_name", arrow::utf8(), RunInfoTableSpecVersion::v0()) , system_type(this, "system_type", arrow::utf8(), RunInfoTableSpecVersion::v0()) , tracking_id( this, "tracking_id", arrow::map(arrow::utf8(), arrow::utf8()), RunInfoTableSpecVersion::v0()) { } 
TableSpecVersion RunInfoTableSchemaDescription::table_version_from_file_version( Version file_version) const { return RunInfoTableSpecVersion::latest(); } Result> read_run_info_table_schema( SchemaMetadataDescription const & schema_metadata, std::shared_ptr const & schema) { auto result = std::make_shared(); ARROW_RETURN_NOT_OK( RunInfoTableSchemaDescription::read_schema(result, schema_metadata, schema)); return result; } } // namespace pod5 ================================================ FILE: c++/pod5_format/run_info_table_schema.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/result.h" #include "pod5_format/schema_utils.h" #include "pod5_format/tuple_utils.h" #include "pod5_format/types.h" #include #include #include namespace arrow { class KeyValueMetadata; class Schema; class DataType; class StructType; } // namespace arrow namespace pod5 { class RunInfoTableSpecVersion { public: static TableSpecVersion v0() { return TableSpecVersion::first_version(); } static TableSpecVersion latest() { return v0(); } }; class RunInfoTableSchemaDescription : public SchemaDescriptionBase { public: RunInfoTableSchemaDescription(); RunInfoTableSchemaDescription(RunInfoTableSchemaDescription const &) = delete; RunInfoTableSchemaDescription & operator=(RunInfoTableSchemaDescription const &) = delete; TableSpecVersion table_version_from_file_version(Version file_version) const override; Field<0, arrow::StringArray> acquisition_id; Field<1, arrow::TimestampArray> acquisition_start_time; Field<2, arrow::Int16Array> adc_max; Field<3, arrow::Int16Array> adc_min; Field<4, arrow::MapArray> context_tags; Field<5, arrow::StringArray> experiment_name; Field<6, arrow::StringArray> flow_cell_id; Field<7, arrow::StringArray> flow_cell_product_code; Field<8, arrow::StringArray> protocol_name; Field<9, arrow::StringArray> protocol_run_id; Field<10, arrow::TimestampArray> protocol_start_time; Field<11, 
arrow::StringArray> sample_id; Field<12, arrow::UInt16Array> sample_rate; Field<13, arrow::StringArray> sequencing_kit; Field<14, arrow::StringArray> sequencer_position; Field<15, arrow::StringArray> sequencer_position_type; Field<16, arrow::StringArray> software; Field<17, arrow::StringArray> system_name; Field<18, arrow::StringArray> system_type; Field<19, arrow::MapArray> tracking_id; // Field Builders only for fields we write in newly generated files. // Should not include fields which are removed in the latest version: using FieldBuilders = FieldBuilder< // V0 fields decltype(acquisition_id), decltype(acquisition_start_time), decltype(adc_max), decltype(adc_min), decltype(context_tags), decltype(experiment_name), decltype(flow_cell_id), decltype(flow_cell_product_code), decltype(protocol_name), decltype(protocol_run_id), decltype(protocol_start_time), decltype(sample_id), decltype(sample_rate), decltype(sequencing_kit), decltype(sequencer_position), decltype(sequencer_position_type), decltype(software), decltype(system_name), decltype(system_type), decltype(tracking_id)>; }; POD5_FORMAT_EXPORT Result> read_run_info_table_schema( SchemaMetadataDescription const & schema_metadata, std::shared_ptr const &); } // namespace pod5 ================================================ FILE: c++/pod5_format/run_info_table_writer.cpp ================================================ #include "pod5_format/run_info_table_writer.h" #include "pod5_format/file_output_stream.h" #include "pod5_format/internal/tracing/tracing.h" #include "pod5_format/read_table_utils.h" #include #include #include #include #include namespace pod5 { RunInfoTableWriter::RunInfoTableWriter( std::shared_ptr && writer, std::shared_ptr && schema, std::shared_ptr const & field_locations, std::shared_ptr const & output_stream, std::size_t table_batch_size, arrow::MemoryPool * pool) : m_schema(schema) , m_field_locations(field_locations) , m_output_stream{output_stream} , m_table_batch_size(table_batch_size) , 
m_writer(std::move(writer)) , m_field_builders(m_field_locations, pool) { } RunInfoTableWriter::RunInfoTableWriter(RunInfoTableWriter && other) = default; RunInfoTableWriter & RunInfoTableWriter::operator=(RunInfoTableWriter &&) = default; RunInfoTableWriter::~RunInfoTableWriter() { if (m_writer) { (void)close(); } } Result RunInfoTableWriter::add_run_info(RunInfoData const & run_info_data) { POD5_TRACE_FUNCTION(); if (!m_writer) { return Status::IOError("Writer terminated"); } ARROW_RETURN_NOT_OK(reserve_rows()); auto row_id = m_written_batched_row_count + m_current_batch_row_count; ARROW_RETURN_NOT_OK(m_field_builders.append( // V0 Fields run_info_data.acquisition_id, run_info_data.acquisition_start_time, run_info_data.adc_max, run_info_data.adc_min, run_info_data.context_tags, run_info_data.experiment_name, run_info_data.flow_cell_id, run_info_data.flow_cell_product_code, run_info_data.protocol_name, run_info_data.protocol_run_id, run_info_data.protocol_start_time, run_info_data.sample_id, run_info_data.sample_rate, run_info_data.sequencing_kit, run_info_data.sequencer_position, run_info_data.sequencer_position_type, run_info_data.software, run_info_data.system_name, run_info_data.system_type, run_info_data.tracking_id)); ++m_current_batch_row_count; if (m_current_batch_row_count >= m_table_batch_size) { ARROW_RETURN_NOT_OK(write_batch()); } return row_id; } Status RunInfoTableWriter::close() { // Check for already closed if (!m_writer) { return Status::OK(); } ARROW_RETURN_NOT_OK(write_batch()); ARROW_RETURN_NOT_OK(m_writer->Close()); m_writer = nullptr; return Status::OK(); } Status RunInfoTableWriter::write_batch(arrow::RecordBatch const & record_batch) { ARROW_RETURN_NOT_OK(m_writer->WriteRecordBatch(record_batch)); return m_output_stream->batch_complete(); } Status RunInfoTableWriter::write_batch() { POD5_TRACE_FUNCTION(); if (m_current_batch_row_count == 0) { return Status::OK(); } if (!m_writer) { return Status::IOError("Writer terminated"); } 
ARROW_ASSIGN_OR_RAISE(auto columns, m_field_builders.finish_columns()); auto const record_batch = arrow::RecordBatch::Make(m_schema, m_current_batch_row_count, std::move(columns)); m_written_batched_row_count += m_current_batch_row_count; m_current_batch_row_count = 0; ARROW_RETURN_NOT_OK(m_writer->WriteRecordBatch(*record_batch)); return m_output_stream->batch_complete(); } Status RunInfoTableWriter::reserve_rows() { // Only reserve if we have not already reserved (at the start of a batch) if (m_current_batch_row_count > 0) { return arrow::Status::OK(); } return m_field_builders.reserve(m_table_batch_size); } Result make_run_info_table_writer( std::shared_ptr const & sink, std::shared_ptr const & metadata, std::size_t table_batch_size, arrow::MemoryPool * pool) { auto field_locations = std::make_shared(); auto schema = field_locations->make_writer_schema(metadata); arrow::ipc::IpcWriteOptions options; options.memory_pool = pool; ARROW_ASSIGN_OR_RAISE(auto writer, arrow::ipc::MakeFileWriter(sink, schema, options, metadata)); auto run_info_table_writer = RunInfoTableWriter( std::move(writer), std::move(schema), field_locations, sink, table_batch_size, pool); return run_info_table_writer; } } // namespace pod5 ================================================ FILE: c++/pod5_format/run_info_table_writer.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/result.h" #include "pod5_format/run_info_table_schema.h" #include "pod5_format/schema_field_builder.h" #include #include namespace arrow { class Schema; namespace ipc { class RecordBatchWriter; } } // namespace arrow namespace pod5 { class FileOutputStream; class RunInfoData; class POD5_FORMAT_EXPORT RunInfoTableWriter { public: RunInfoTableWriter( std::shared_ptr && writer, std::shared_ptr && schema, std::shared_ptr const & field_locations, std::shared_ptr const & output_stream, std::size_t table_batch_size, arrow::MemoryPool * pool); 
RunInfoTableWriter(RunInfoTableWriter &&); RunInfoTableWriter & operator=(RunInfoTableWriter &&); RunInfoTableWriter(RunInfoTableWriter const &) = delete; RunInfoTableWriter & operator=(RunInfoTableWriter const &) = delete; ~RunInfoTableWriter(); /// \brief Add a run info to the table, adding to the current batch. /// \param run_info_data The run info data to add. /// \returns The row index of the inserted run info, or a status on failure. Result add_run_info(RunInfoData const & run_info_data); /// \brief Close this writer, signaling no further data will be written to the writer. Status close(); /// \brief Reserve space for future row writes, called automatically when a new batch is started. Status reserve_rows(); /// \brief Get the schema for the table. std::shared_ptr const & schema() const { return m_schema; } /// \brief Flush passed data into the writer as a record batch. Status write_batch(arrow::RecordBatch const &); private: /// \brief Flush buffered data into the writer as a record batch. Status write_batch(); std::shared_ptr m_schema; std::shared_ptr m_field_locations; std::shared_ptr m_output_stream; std::size_t m_table_batch_size; std::shared_ptr m_writer; RunInfoTableSchemaDescription::FieldBuilders m_field_builders; std::size_t m_written_batched_row_count = 0; std::size_t m_current_batch_row_count = 0; }; /// \brief Make a new writer for a run info table. /// \param sink Sink to be used for output of the table. /// \param metadata Metadata to be applied to the table schema. /// \param table_batch_size The size of each batch written for the table. /// \param pool Pool to be used for building table in memory. /// \returns The writer for the new table. 
POD5_FORMAT_EXPORT Result make_run_info_table_writer( std::shared_ptr const & sink, std::shared_ptr const & metadata, std::size_t table_batch_size, arrow::MemoryPool * pool); } // namespace pod5 ================================================ FILE: c++/pod5_format/schema_field_builder.h ================================================ #pragma once #include "pod5_format/dictionary_writer.h" #include "pod5_format/read_table_schema.h" #include #include #include namespace pod5 { class DictionaryWriter; namespace detail { template class BuilderHelper; template class ListBuilderHelper; template <> class BuilderHelper : public arrow::FixedSizeBinaryBuilder { public: BuilderHelper(std::shared_ptr const & uuid_type, arrow::MemoryPool * pool) : arrow::FixedSizeBinaryBuilder(find_storage_type(uuid_type), pool) { assert(byte_width() == 16); } static std::shared_ptr find_storage_type( std::shared_ptr const & uuid_type) { assert(uuid_type->id() == arrow::Type::EXTENSION); auto const & uuid_extension = static_cast(*uuid_type); return uuid_extension.storage_type(); } arrow::Status Append(Uuid const & uuid) { return static_cast(this)->Append(uuid.data()); } }; template <> class BuilderHelper : public arrow::FloatBuilder { public: BuilderHelper(std::shared_ptr const &, arrow::MemoryPool * pool) : arrow::FloatBuilder(pool) { } }; template <> class BuilderHelper : public arrow::UInt8Builder { public: BuilderHelper(std::shared_ptr const &, arrow::MemoryPool * pool) : arrow::UInt8Builder(pool) { } }; template <> class BuilderHelper : public arrow::UInt16Builder { public: BuilderHelper(std::shared_ptr const &, arrow::MemoryPool * pool) : arrow::UInt16Builder(pool) { } }; template <> class BuilderHelper : public arrow::Int16Builder { public: BuilderHelper(std::shared_ptr const &, arrow::MemoryPool * pool) : arrow::Int16Builder(pool) { } }; template <> class BuilderHelper : public arrow::UInt32Builder { public: BuilderHelper(std::shared_ptr const &, arrow::MemoryPool * pool) : 
arrow::UInt32Builder(pool) { } }; template <> class BuilderHelper : public arrow::UInt64Builder { public: BuilderHelper(std::shared_ptr const &, arrow::MemoryPool * pool) : arrow::UInt64Builder(pool) { } }; template <> class BuilderHelper : public arrow::BooleanBuilder { public: BuilderHelper(std::shared_ptr const &, arrow::MemoryPool * pool) : arrow::BooleanBuilder(pool) { } }; template <> class BuilderHelper> : public arrow::TimestampBuilder { public: BuilderHelper(std::shared_ptr const & type, arrow::MemoryPool * pool) : arrow::TimestampBuilder(type, pool) { } }; template <> class BuilderHelper : public arrow::StringBuilder { public: BuilderHelper(std::shared_ptr const &, arrow::MemoryPool * pool) : arrow::StringBuilder(pool) { } }; template <> class BuilderHelper { public: BuilderHelper(std::shared_ptr const &, arrow::MemoryPool * pool) : m_key_builder(std::make_shared(pool)) , m_item_builder(std::make_shared(pool)) , m_map_builder(pool, m_key_builder, m_item_builder) { } arrow::Status Finish(std::shared_ptr * dest) { return m_map_builder.Finish(dest); } arrow::Status Reserve(std::size_t rows) { return m_map_builder.Reserve(rows); } arrow::Status Append(std::vector> const & items) { ARROW_RETURN_NOT_OK(m_map_builder.Append()); // start new slot for (auto const & pair : items) { ARROW_RETURN_NOT_OK(m_key_builder->Append(pair.first)); ARROW_RETURN_NOT_OK(m_item_builder->Append(pair.second)); } return arrow::Status::OK(); } private: std::shared_ptr m_key_builder; std::shared_ptr m_item_builder; arrow::MapBuilder m_map_builder; }; template <> class BuilderHelper : public arrow::Int16Builder { public: BuilderHelper(std::shared_ptr const &, arrow::MemoryPool * pool) : arrow::Int16Builder(pool) { } void set_dict_writer(std::shared_ptr const & writer) { m_dict_writer = writer; } arrow::Status Finish(std::shared_ptr * dest) { arrow::Int16Builder * index_builder = this; ARROW_ASSIGN_OR_RAISE(auto indices, index_builder->Finish()); ARROW_ASSIGN_OR_RAISE(*dest, 
m_dict_writer->build_dictionary_array(indices)); return arrow::Status::OK(); } private: std::shared_ptr m_dict_writer; }; template class ListBuilderHelper { public: ListBuilderHelper(std::shared_ptr const &, arrow::MemoryPool * pool) : m_array_builder(std::make_shared>(nullptr, pool)) , m_builder(std::make_unique(pool, m_array_builder)) { } arrow::Status Reserve(std::size_t rows) { ARROW_RETURN_NOT_OK(m_builder->Reserve(rows)); return m_array_builder->Reserve(rows); } arrow::Status Finish(std::shared_ptr * dest) { return m_builder->Finish(dest); } template arrow::Status Append(Items const & items) { ARROW_RETURN_NOT_OK(m_builder->Append()); // start new slot return m_array_builder->AppendValues(items.data(), items.size()); } private: std::shared_ptr> m_array_builder; std::unique_ptr m_builder; }; } // namespace detail template class FieldBuilder { public: using BuilderTuple = std::tuple; template FieldBuilder(std::shared_ptr const & desc_base, arrow::MemoryPool * pool) : m_builders( typename Args::BuilderType( desc_base->fields()[Args::WriteIndex::value]->datatype(), pool)...) { } template std::tuple_element_t & get_builder(FieldType) { return std::get(m_builders); } arrow::Result>> finish_columns() { arrow::Status result; std::vector> columns; columns.resize(std::tuple_size::value); detail::for_each_in_tuple(m_builders, [&](auto & element, std::size_t index) { if (result.ok()) { result = element.Finish(&columns[index]); assert(columns[index] || !result.ok()); } }); if (!result.ok()) { return result; } return columns; } arrow::Status reserve(std::size_t row_count) { arrow::Status result; detail::for_each_in_tuple(m_builders, [&](auto & element, std::size_t _) { if (result.ok()) { result = element.Reserve(row_count); } }); return result; } template arrow::Status append(AppendArgs const &... 
args) { auto args_list = std::forward_as_tuple(args...); arrow::Status result; for_each_in_tuple_zipped( m_builders, args_list, [&](auto & builder, auto & item, std::size_t _) { if (result.ok()) { result = builder.Append(item); } }); return result; } private: BuilderTuple m_builders; }; } // namespace pod5 ================================================ FILE: c++/pod5_format/schema_metadata.cpp ================================================ #include "pod5_format/schema_metadata.h" #include "pod5_format/uuid.h" #include "pod5_format/version.h" #include namespace pod5 { Result parse_version_number(std::string const & ver) { std::uint16_t components[3]; std::size_t component_index = 0; std::size_t last_char_index = 0; std::size_t char_index = 0; auto parse_component = [&](std::size_t last_char_index, std::size_t char_index) { auto const component_str = std::string(ver.data() + last_char_index, ver.data() + char_index); std::size_t pos = 0; int val = std::stoi(component_str, &pos); if (pos != (char_index - last_char_index)) { throw std::runtime_error("Invalid remaining characters after version number"); } return val; }; try { while (char_index < ver.size()) { if (ver[char_index] == '.') { if (component_index >= 2) { return Status::Invalid("Invalid component count"); } components[component_index] = parse_component(last_char_index, char_index); last_char_index = char_index + 1; component_index += 1; } char_index += 1; } // extract the final component if (component_index != 2) { return Status::Invalid("Invalid component count"); } components[2] = parse_component(last_char_index, char_index); } catch (std::exception const & e) { return Status::Invalid(e.what()); } return Version{components[0], components[1], components[2]}; } Version current_build_version_number() { return Version(Pod5MajorVersion, Pod5MinorVersion, Pod5RevVersion); } Result> make_schema_key_value_metadata( SchemaMetadataDescription const & schema_metadata) { if (schema_metadata.writing_software.empty()) 
{ return Status::Invalid("Expected writing_software to be specified for metadata"); } if (schema_metadata.writing_pod5_version == Version{}) { return Status::Invalid("Expected writing_pod5_version to be specified for metadata"); } if (schema_metadata.file_identifier == Uuid{}) { return Status::Invalid("Expected file_identifier to be specified for metadata"); } return arrow::KeyValueMetadata::Make( {"MINKNOW:file_identifier", "MINKNOW:software", "MINKNOW:pod5_version"}, {to_string(schema_metadata.file_identifier), schema_metadata.writing_software, schema_metadata.writing_pod5_version.to_string()}); } Result read_schema_key_value_metadata( std::shared_ptr const & key_value_metadata) { ARROW_ASSIGN_OR_RAISE( auto file_identifier_str, key_value_metadata->Get("MINKNOW:file_identifier")); ARROW_ASSIGN_OR_RAISE(auto software_str, key_value_metadata->Get("MINKNOW:software")); ARROW_ASSIGN_OR_RAISE(auto pod5_version_str, key_value_metadata->Get("MINKNOW:pod5_version")); ARROW_ASSIGN_OR_RAISE(auto pod5_version, parse_version_number(pod5_version_str)); auto const file_identifier = Uuid::from_string(file_identifier_str); if (!file_identifier) { return Status::IOError( "Schema file_identifier metadata not uuid form: '", file_identifier_str, "'"); } return SchemaMetadataDescription{*file_identifier, software_str, pod5_version}; } } // namespace pod5 ================================================ FILE: c++/pod5_format/schema_metadata.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/result.h" #include "pod5_format/uuid.h" #include #include #include namespace arrow { class KeyValueMetadata; } namespace pod5 { class Version { public: Version() : m_version(0, 0, 0) {} Version(std::uint16_t major, std::uint16_t minor, std::uint16_t revision) : m_version(major, minor, revision) { } bool operator<(Version const & in) const { return m_version < in.m_version; } bool operator>(Version const & in) const { 
return m_version > in.m_version; } bool operator==(Version const & in) const { return m_version == in.m_version; } bool operator!=(Version const & in) const { return m_version != in.m_version; } std::string to_string() const { return std::to_string(std::get<0>(m_version)) + "." + std::to_string(std::get<1>(m_version)) + "." + std::to_string(std::get<2>(m_version)); } std::uint16_t major_version() const { return std::get<0>(m_version); } std::uint16_t minor_version() const { return std::get<1>(m_version); } std::uint16_t revision_version() const { return std::get<2>(m_version); } private: std::tuple m_version; }; POD5_FORMAT_EXPORT Result parse_version_number(std::string const & ver); POD5_FORMAT_EXPORT Version current_build_version_number(); struct SchemaMetadataDescription { Uuid file_identifier; std::string writing_software; Version writing_pod5_version; }; POD5_FORMAT_EXPORT Result> make_schema_key_value_metadata(SchemaMetadataDescription const & schema_metadata); POD5_FORMAT_EXPORT Result read_schema_key_value_metadata( std::shared_ptr const & key_value_metadata); } // namespace pod5 ================================================ FILE: c++/pod5_format/schema_utils.cpp ================================================ #include "pod5_format/schema_utils.h" namespace pod5 { /// \brief Make a new schema for a table to be written (will only contain fields which are written in the latest version). /// \param metadata Metadata to be applied to the schema. /// \returns The schema for the table. 
std::shared_ptr SchemaDescriptionBase::make_writer_schema( std::shared_ptr const & metadata) const { auto const latest_version = latest_table_version(); arrow::FieldVector writer_fields; for (auto & field : fields()) { if (field->removed_table_spec_version() > latest_version) { writer_fields.emplace_back(arrow::field(field->name(), field->datatype())); } } return arrow::schema(writer_fields, metadata); } Status SchemaDescriptionBase::read_schema( std::shared_ptr dest_schema, SchemaMetadataDescription const & schema_metadata, std::shared_ptr const & schema) { dest_schema->m_table_spec_version = dest_schema->table_version_from_file_version(schema_metadata.writing_pod5_version); for (auto & field : dest_schema->fields()) { if (dest_schema->table_version() < field->added_table_spec_version() || dest_schema->table_version() >= field->removed_table_spec_version()) { continue; } auto const & datatype = field->datatype(); int field_index = 0; if (datatype->id() == arrow::Type::DICTIONARY) { auto const & dict_type = static_cast(*datatype); if (dict_type.value_type()->id() == arrow::Type::STRUCT) { std::shared_ptr value_type; ARROW_ASSIGN_OR_RAISE( field_index, find_dict_field(schema, field->name().c_str(), arrow::int16(), &value_type)); } else { std::shared_ptr value_type; ARROW_ASSIGN_OR_RAISE( field_index, find_dict_field(schema, field->name().c_str(), arrow::int16(), &value_type)); } } else { ARROW_ASSIGN_OR_RAISE(field_index, find_field(schema, field->name().c_str(), datatype)); } field->set_field_index(field_index); } return arrow::Status::OK(); } } // namespace pod5 ================================================ FILE: c++/pod5_format/schema_utils.h ================================================ #pragma once #include "pod5_format/schema_metadata.h" #include #include namespace pod5 { inline arrow::Result find_field_untyped( std::shared_ptr const & schema, char const * name) { auto const field_idx = schema->GetFieldIndex(name); if (field_idx == -1) { return 
Status::TypeError("Schema missing field '", name, "'"); } return field_idx; } inline arrow::Result find_field( std::shared_ptr const & schema, char const * name, std::shared_ptr const & expected_data_type) { ARROW_ASSIGN_OR_RAISE(auto field_idx, find_field_untyped(schema, name)); auto const field = schema->field(field_idx); auto const type = field->type(); if (!type->Equals(expected_data_type)) { return Status::TypeError( "Schema field '", name, "' is incorrect type: '", type->name(), "'"); } return field_idx; } template inline arrow::Result find_dict_field( std::shared_ptr const & schema, char const * name, std::shared_ptr const & index_type, std::shared_ptr * value_type) { ARROW_ASSIGN_OR_RAISE(auto field_idx, find_field_untyped(schema, name)); auto const field = schema->field(field_idx); auto const type = std::dynamic_pointer_cast(field->type()); if (!type) { return Status::TypeError("Dictionary field was unexpected type: ", field->type()->name()); } if (!type->index_type()->Equals(index_type)) { return Status::TypeError( "Schema field '", name, "' is incorrect type: '", type->name(), "'"); } *value_type = std::dynamic_pointer_cast(type->value_type()); if (!*value_type) { return Status::TypeError( "Dictionary value was unexpected type: ", type->value_type()->name()); } return field_idx; } template std::shared_ptr find_column( std::shared_ptr const & batch, FieldType const & field) { auto field_base = batch->column(field.field_index()); return std::static_pointer_cast(std::move(field_base)); } class FieldBase; enum class SpecialFieldValues : int { InvalidField = -1, }; class TableSpecVersion { public: using UnderlyingType = std::uint8_t; TableSpecVersion() : m_version(std::numeric_limits::max()) {} static TableSpecVersion first_version() { return TableSpecVersion(0); } static TableSpecVersion unknown_version() { return TableSpecVersion(); } static TableSpecVersion at_version(UnderlyingType version) { return TableSpecVersion(version); } UnderlyingType as_int() 
const { return m_version; } bool operator<(TableSpecVersion const & other) const { return m_version < other.m_version; } bool operator>(TableSpecVersion const & other) const { return m_version > other.m_version; } bool operator<=(TableSpecVersion const & other) const { return m_version <= other.m_version; } bool operator>=(TableSpecVersion const & other) const { return m_version >= other.m_version; } private: TableSpecVersion(UnderlyingType version) : m_version(version) {} UnderlyingType m_version; }; class SchemaDescriptionBase { public: SchemaDescriptionBase(TableSpecVersion version) : m_table_spec_version(version) {} virtual ~SchemaDescriptionBase() = default; void add_field(FieldBase * field) { m_fields.push_back(field); } std::vector const & fields() { return m_fields; } std::vector const & fields() const { return reinterpret_cast const &>(m_fields); } TableSpecVersion latest_table_version() const { return table_version_from_file_version(current_build_version_number()); } virtual TableSpecVersion table_version_from_file_version(Version file_version) const = 0; TableSpecVersion table_version() const { return m_table_spec_version; } /// \brief Make a new schema for a table to be written (will only contain fields which are written in the latest version). /// \param metadata Metadata to be applied to the schema. /// \returns The schema for the table. 
std::shared_ptr make_writer_schema( std::shared_ptr const & metadata) const; static Status read_schema( std::shared_ptr dest_schema, SchemaMetadataDescription const & schema_metadata, std::shared_ptr const & schema); private: std::vector m_fields; TableSpecVersion m_table_spec_version; }; namespace detail { template class BuilderHelper; template class ListBuilderHelper; } // namespace detail class FieldBase { public: FieldBase( SchemaDescriptionBase * owner, int field_index, std::string name, std::shared_ptr const & datatype, TableSpecVersion added_table_spec_version = TableSpecVersion::first_version(), TableSpecVersion removed_table_spec_version = TableSpecVersion::unknown_version()) : m_name(name) , m_datatype(datatype) , m_field_index(field_index) , m_added_table_spec_version(added_table_spec_version) , m_removed_table_spec_version(removed_table_spec_version) { owner->add_field(this); } std::string const & name() const { return m_name; } std::shared_ptr const & datatype() const { return m_datatype; } int field_index() const { return m_field_index; } TableSpecVersion added_table_spec_version() const { return m_added_table_spec_version; } TableSpecVersion removed_table_spec_version() const { return m_removed_table_spec_version; } void set_field_index(int index) { m_field_index = index; } bool found_field() const { return m_field_index != (int)SpecialFieldValues::InvalidField; } private: std::string m_name; std::shared_ptr m_datatype; int m_field_index = (int)SpecialFieldValues::InvalidField; TableSpecVersion m_added_table_spec_version; TableSpecVersion m_removed_table_spec_version; }; template struct Field : public FieldBase { using WriteIndex = std::integral_constant; using ArrayType = ArrayType_; using BuilderType = detail::BuilderHelper; Field( SchemaDescriptionBase * owner, std::string name, std::shared_ptr const & datatype, TableSpecVersion added_table_spec_version = TableSpecVersion::first_version(), TableSpecVersion removed_table_spec_version = 
TableSpecVersion::unknown_version()) : FieldBase( owner, WriteIndex::value, name, datatype, added_table_spec_version, removed_table_spec_version) { } }; template struct ListField : public Field { using ElementType = ElementType_; using BuilderType = detail::ListBuilderHelper; ListField( SchemaDescriptionBase * owner, std::string name, std::shared_ptr const & datatype, TableSpecVersion added_table_spec_version = TableSpecVersion::first_version(), TableSpecVersion removed_table_spec_version = TableSpecVersion::unknown_version()) : Field( owner, name, datatype, added_table_spec_version, removed_table_spec_version) { } }; template class FieldBuilder; } // namespace pod5 ================================================ FILE: c++/pod5_format/signal_builder.h ================================================ #pragma once #include "pod5_format/expandable_buffer.h" #include "pod5_format/signal_compression.h" #include "pod5_format/signal_table_utils.h" #include "pod5_format/types.h" #include #include #include #include namespace pod5 { struct UncompressedSignalBuilder { std::shared_ptr signal_data_builder; std::unique_ptr signal_builder; }; struct VbzSignalBuilder { ExpandableBuffer offset_values; ExpandableBuffer data_values; }; using SignalBuilderVariant = std::variant; inline arrow::Result make_signal_builder( SignalType compression_type, arrow::MemoryPool * pool) { if (compression_type == SignalType::UncompressedSignal) { auto signal_array_builder = std::make_shared(pool); return UncompressedSignalBuilder{ signal_array_builder, std::make_unique(pool, signal_array_builder), }; } else { VbzSignalBuilder vbz_builder; ARROW_RETURN_NOT_OK(vbz_builder.offset_values.init_buffer(pool)); ARROW_RETURN_NOT_OK(vbz_builder.data_values.init_buffer(pool)); return vbz_builder; } } namespace visitors { class reserve_rows { public: reserve_rows(std::size_t row_count, std::size_t approx_read_samples) : m_row_count(row_count) , m_approx_read_samples(approx_read_samples) { } Status 
operator()(UncompressedSignalBuilder & builder) const { ARROW_RETURN_NOT_OK(builder.signal_builder->Reserve(m_row_count)); return builder.signal_data_builder->Reserve(m_row_count * m_approx_read_samples); } Status operator()(VbzSignalBuilder & builder) const { ARROW_RETURN_NOT_OK(builder.offset_values.reserve(m_row_count + 1)); return builder.data_values.reserve(m_row_count * m_approx_read_samples); } std::size_t m_row_count; std::size_t m_approx_read_samples; }; class append_pre_compressed_signal { public: append_pre_compressed_signal(gsl::span const & signal) : m_signal(signal) {} Status operator()(UncompressedSignalBuilder & builder) const { ARROW_RETURN_NOT_OK(builder.signal_builder->Append()); // start new slot auto as_uncompressed = m_signal.as_span(); return builder.signal_data_builder->AppendValues( as_uncompressed.data(), as_uncompressed.size()); } Status operator()(VbzSignalBuilder & builder) const { ARROW_RETURN_NOT_OK(builder.offset_values.append(builder.data_values.size())); return builder.data_values.append_array(m_signal); } gsl::span m_signal; }; class append_signal { public: append_signal(gsl::span const & signal, arrow::MemoryPool * pool) : m_signal(signal) , m_pool(pool) { } Status operator()(UncompressedSignalBuilder & builder) const { ARROW_RETURN_NOT_OK(builder.signal_builder->Append()); // start new slot return builder.signal_data_builder->AppendValues(m_signal.data(), m_signal.size()); } Status operator()(VbzSignalBuilder & builder) const { ARROW_RETURN_NOT_OK(builder.offset_values.append(builder.data_values.size())); ARROW_ASSIGN_OR_RAISE(auto const max_size, compressed_signal_max_size(m_signal.size())); // Compress the signal in place into our buffer. 
return builder.data_values.append( max_size, [&](gsl::span buffer) -> arrow::Result { return compress_signal(m_signal, m_pool, buffer); }); } gsl::span m_signal; arrow::MemoryPool * m_pool; }; class finish_column { public: finish_column(std::shared_ptr * dest) : m_dest(dest) {} Status operator()(UncompressedSignalBuilder & builder) const { return builder.signal_builder->Finish(m_dest); } Status operator()(VbzSignalBuilder & builder) const { auto offsets_copy = builder.offset_values; ARROW_RETURN_NOT_OK(builder.offset_values.clear()); auto const value_data = builder.data_values.get_buffer(); ARROW_RETURN_NOT_OK(builder.data_values.clear()); auto const length = offsets_copy.size(); // Write final offset (values length) ARROW_RETURN_NOT_OK(offsets_copy.append(value_data->size())); auto const offsets = offsets_copy.get_buffer(); std::shared_ptr null_bitmap; *m_dest = arrow::MakeArray( arrow::ArrayData::Make(vbz_signal(), length, {null_bitmap, offsets, value_data}, 0, 0)); return arrow::Status::OK(); } std::shared_ptr * m_dest; }; } // namespace visitors } // namespace pod5 ================================================ FILE: c++/pod5_format/signal_compression.cpp ================================================ #include "pod5_format/signal_compression.h" #include "pod5_format/svb16/decode.hpp" #include "pod5_format/svb16/encode.hpp" #include #include #include #include #include namespace pod5 { namespace { // SVB is designed around 32 bit sizes, so that's the maximum uncompressed samples allowed. 
constexpr std::size_t max_uncompressed_samples = std::numeric_limits::max(); class DecompressContext { struct DCtxDeleter { void operator()(ZSTD_DCtx * ctx) { ZSTD_freeDCtx(ctx); } }; std::unique_ptr m_context; public: DecompressContext() { m_context.reset(ZSTD_createDCtx()); } ZSTD_DCtx * get() { return m_context.get(); } explicit operator bool() const { return static_cast(m_context); } }; } // namespace arrow::Result compressed_signal_max_size(std::size_t sample_count) { if (sample_count > max_uncompressed_samples) { return arrow::Status::Invalid( sample_count, " samples exceeds max of ", max_uncompressed_samples); } auto const max_svb_size = svb16_max_encoded_length(sample_count); auto const zstd_compressed_max_size = ZSTD_compressBound(max_svb_size); if (ZSTD_isError(zstd_compressed_max_size)) { return pod5::Status::Invalid( sample_count, " samples exceeds zstd limit: (", zstd_compressed_max_size, " ", ZSTD_getErrorName(zstd_compressed_max_size), ")"); } return zstd_compressed_max_size; } arrow::Result compress_signal( gsl::span samples, arrow::MemoryPool * pool, gsl::span destination) { std::size_t const sample_count = samples.size(); if (sample_count > max_uncompressed_samples) { return arrow::Status::Invalid( sample_count, " samples exceeds max of ", max_uncompressed_samples); } // First compress the data using svb: auto const max_size = svb16_max_encoded_length(sample_count); ARROW_ASSIGN_OR_RAISE(auto intermediate, arrow::AllocateResizableBuffer(max_size, pool)); static constexpr bool UseDelta = true; static constexpr bool UseZigzag = true; auto const encoded_count = svb16::encode( samples.data(), intermediate->mutable_data(), sample_count); ARROW_RETURN_NOT_OK(intermediate->Resize(encoded_count)); // Now compress the svb data using zstd: size_t const zstd_compressed_max_size = ZSTD_compressBound(intermediate->size()); if (ZSTD_isError(zstd_compressed_max_size)) { return pod5::Status::Invalid( "Failed to find zstd max size for data: (", 
zstd_compressed_max_size, " ", ZSTD_getErrorName(zstd_compressed_max_size), ")"); } /* Compress. * If you are doing many compressions, you may want to reuse the context. * See the multiple_simple_compression.c example. */ size_t const compressed_size = ZSTD_compress( destination.data(), destination.size(), intermediate->data(), intermediate->size(), 1); if (ZSTD_isError(compressed_size)) { return pod5::Status::Invalid( "Failed to compress data: (", compressed_size, " ", ZSTD_getErrorName(compressed_size), ")"); } return compressed_size; } arrow::Result> compress_signal( gsl::span samples, arrow::MemoryPool * pool) { ARROW_ASSIGN_OR_RAISE( std::size_t const max_compressed_size, compressed_signal_max_size(samples.size())); ARROW_ASSIGN_OR_RAISE( std::shared_ptr out, arrow::AllocateResizableBuffer(max_compressed_size, pool)); ARROW_ASSIGN_OR_RAISE( auto final_size, compress_signal(samples, pool, gsl::make_span(out->mutable_data(), out->size()))); ARROW_RETURN_NOT_OK(out->Resize(final_size)); return out; } arrow::Status decompress_signal( gsl::span compressed_bytes, arrow::MemoryPool * pool, gsl::span destination) { // Check that we could have compressed this size. ARROW_ASSIGN_OR_RAISE( std::size_t const max_compressed_size, compressed_signal_max_size(destination.size())); if (compressed_bytes.size() > max_compressed_size) { return pod5::Status::Invalid( "Input data corrupt: compressed input size (", compressed_bytes.size(), ") exceeds max compressed output size (", max_compressed_size, ")"); } // Find out how big zstd thinks the data is. 
unsigned long long const decompressed_zstd_size = ZSTD_getFrameContentSize(compressed_bytes.data(), compressed_bytes.size()); if (ZSTD_isError(decompressed_zstd_size)) { return pod5::Status::Invalid( "Input data not compressed by zstd: (", decompressed_zstd_size, " ", ZSTD_getErrorName(decompressed_zstd_size), ")"); } // Documentation of |ZSTD_getFrameContentSize| explicitly states that we should bounds check this: // * note 5 : If source is untrusted, decompressed size could be wrong or intentionally modified. // * Always ensure return value fits within application's authorized limits. // * Each application can set its own limits. std::size_t const max_svb16_compressed_size = svb16_max_encoded_length(destination.size()); if (decompressed_zstd_size > max_svb16_compressed_size) { return arrow::Status::Invalid( "Input data corrupt: claimed size (", decompressed_zstd_size, ") exceeds max compressed output size (", max_svb16_compressed_size, ")"); } // Check that we have enough memory to decompress. // Note: this will return 0 on unsupported platforms, so we skip it there. std::int64_t const system_memory = arrow::internal::GetTotalMemoryBytes(); assert(system_memory > 0); if (system_memory > 0 && decompressed_zstd_size >= static_cast(system_memory)) { return arrow::Status::OutOfMemory( "Not enough system memory (", system_memory, ") to decompress file (", decompressed_zstd_size, ")"); } if (POD5_ENABLE_FUZZERS && decompressed_zstd_size > 1'000'000) { return arrow::Status::Invalid("Skipping huge sizes when fuzzing"); } thread_local DecompressContext decompress_context; if (!decompress_context) { return arrow::Status::OutOfMemory("Failed to create zstd decompress context"); } // Decompress the data using zstd. 
auto const allocation_padding = svb16::decode_input_buffer_padding_byte_count(); ARROW_ASSIGN_OR_RAISE( auto intermediate, arrow::AllocateResizableBuffer(decompressed_zstd_size + allocation_padding, pool)); size_t const decompress_res = ZSTD_decompressDCtx( decompress_context.get(), intermediate->mutable_data(), intermediate->size(), compressed_bytes.data(), compressed_bytes.size()); if (ZSTD_isError(decompress_res)) { return pod5::Status::Invalid( "Input data failed to decompress using zstd: (", decompress_res, " ", ZSTD_getErrorName(decompress_res), ")"); } auto const svb16_compressed_data_with_padding = gsl::make_span(intermediate->data(), intermediate->size()); auto const svb16_compressed_data_no_padding = svb16_compressed_data_with_padding.subspan(0, decompressed_zstd_size); // Validate the data. if (!svb16::validate(svb16_compressed_data_no_padding, destination.size())) { return pod5::Status::Invalid("Compressed signal data is corrupt"); } // Now decompress the data using svb: static constexpr bool UseDelta = true; static constexpr bool UseZigzag = true; auto consumed_count = svb16::decode( destination, svb16_compressed_data_with_padding); if (consumed_count != decompressed_zstd_size) { return pod5::Status::Invalid("Remaining data at end of signal buffer"); } return pod5::Status::OK(); } arrow::Result> decompress_signal( gsl::span compressed_bytes, std::uint32_t samples_count, arrow::MemoryPool * pool) { ARROW_ASSIGN_OR_RAISE( std::shared_ptr out, arrow::AllocateResizableBuffer(samples_count * sizeof(SampleType), pool)); auto signal_span = gsl::make_span(out->mutable_data(), out->size()).as_span(); ARROW_RETURN_NOT_OK(decompress_signal(compressed_bytes, pool, signal_span)); return out; } } // namespace pod5 ================================================ FILE: c++/pod5_format/signal_compression.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/result.h" #include namespace arrow { 
class MemoryPool; class Buffer; } // namespace arrow namespace pod5 { using SampleType = std::int16_t; POD5_FORMAT_EXPORT arrow::Result compressed_signal_max_size(std::size_t sample_count); POD5_FORMAT_EXPORT arrow::Result compress_signal( gsl::span samples, arrow::MemoryPool * pool, gsl::span destination); POD5_FORMAT_EXPORT arrow::Result> compress_signal( gsl::span samples, arrow::MemoryPool * pool); POD5_FORMAT_EXPORT arrow::Result> decompress_signal( gsl::span compressed_bytes, std::uint32_t samples_count, arrow::MemoryPool * pool); POD5_FORMAT_EXPORT arrow::Status decompress_signal( gsl::span compressed_bytes, arrow::MemoryPool * pool, gsl::span destination); } // namespace pod5 ================================================ FILE: c++/pod5_format/signal_table_reader.cpp ================================================ #include "pod5_format/signal_table_reader.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/signal_compression.h" #include "pod5_format/table_reader.h" #include #include #include #include namespace pod5 { struct SignalTableReaderCacheCleaner { static void make_space_in_table_batches( std::unordered_map & cached_batches) { std::vector> access_ordered_data; access_ordered_data.reserve(cached_batches.size()); for (auto const & item : cached_batches) { access_ordered_data.emplace_back( std::make_pair(item.first, item.second.last_access_index)); } std::sort( access_ordered_data.begin(), access_ordered_data.end(), [](auto const & a, auto const & b) { return a.second < b.second; }); // Clear about 20% of the cache to make space for further growth: auto const to_clear = std::max(1, cached_batches.size() * 0.2f); for (std::size_t i = 0; i < to_clear; ++i) { auto const index_to_remove = access_ordered_data[i].first; cached_batches.erase(index_to_remove); } } }; SignalTableRecordBatch::SignalTableRecordBatch( std::shared_ptr const & batch, SignalTableSchemaDescription field_locations, arrow::MemoryPool * pool) : TableRecordBatch(batch) , 
m_field_locations(field_locations) , m_pool(pool) { } std::shared_ptr SignalTableRecordBatch::read_id_column() const { return std::static_pointer_cast(batch()->column(m_field_locations.read_id)); } std::shared_ptr SignalTableRecordBatch::uncompressed_signal_column() const { return std::static_pointer_cast( batch()->column(m_field_locations.signal)); } std::shared_ptr SignalTableRecordBatch::vbz_signal_column() const { return std::static_pointer_cast(batch()->column(m_field_locations.signal)); } std::shared_ptr SignalTableRecordBatch::samples_column() const { return std::static_pointer_cast(batch()->column(m_field_locations.samples)); } Result SignalTableRecordBatch::samples_byte_count(std::size_t row_index) const { switch (m_field_locations.signal_type) { case SignalType::UncompressedSignal: { auto signal_column = uncompressed_signal_column(); auto signal = signal_column->value_slice(row_index); return signal->length() * sizeof(std::int16_t); } case SignalType::VbzSignal: { auto signal_column = vbz_signal_column(); auto signal_compressed = signal_column->Value(row_index); return signal_compressed.size(); } } return pod5::Status::Invalid("Unknown signal type"); } Status SignalTableRecordBatch::extract_signal_row( std::size_t row_index, gsl::span samples) const { if (row_index >= num_rows()) { return pod5::Status::Invalid( "Queried signal row ", row_index, " is outside the available rows (", num_rows(), " in batch)"); } auto sample_count = samples_column(); auto samples_in_row = sample_count->Value(row_index); if (samples_in_row != samples.size()) { return pod5::Status::Invalid( "Unexpected size for sample array ", samples.size(), " expected ", samples_in_row); } switch (m_field_locations.signal_type) { case SignalType::UncompressedSignal: { auto signal_column = uncompressed_signal_column(); auto signal = std::static_pointer_cast(signal_column->value_slice(row_index)); std::copy(signal->raw_values(), signal->raw_values() + signal->length(), samples.begin()); return 
Status::OK(); } case SignalType::VbzSignal: { auto signal_column = vbz_signal_column(); auto signal_compressed = signal_column->Value(row_index); return pod5::decompress_signal(signal_compressed, m_pool, samples); } } return pod5::Status::Invalid("Unknown signal type"); } Result> SignalTableRecordBatch::extract_signal_row_inplace( std::size_t row_index) const { if (row_index >= num_rows()) { return pod5::Status::Invalid( "Queried signal row ", row_index, " is outside the available rows (", num_rows(), " in batch)"); } switch (m_field_locations.signal_type) { case SignalType::UncompressedSignal: { auto signal_column = uncompressed_signal_column(); auto const value_slice = std::static_pointer_cast(signal_column->value_slice(row_index)); auto const element_size = sizeof(std::remove_reference::type::TypeClass); auto const values = value_slice->values(); auto offset = signal_column->value_offset(row_index); auto length = signal_column->value_length(row_index); return arrow::SliceBuffer(values, offset * element_size, length * element_size); } case SignalType::VbzSignal: { auto signal_column = vbz_signal_column(); return signal_column->ValueAsBuffer(row_index); } } return pod5::Status::Invalid("Unknown signal type"); } //--------------------------------------------------------------------------------------------------------------------- SignalTableReader::SignalTableReader( std::shared_ptr && input_source, std::shared_ptr && reader, SignalTableSchemaDescription field_locations, SchemaMetadataDescription && schema_metadata, std::size_t num_record_batches, std::size_t batch_size, std::size_t max_cached_table_batches, arrow::MemoryPool * pool) : TableReader(std::move(input_source), std::move(reader), std::move(schema_metadata), pool) , m_field_locations(field_locations) , m_pool(pool) , m_max_cached_table_batches(max_cached_table_batches) , m_table_batches(num_record_batches) , m_batch_size(batch_size) { } SignalTableReader::SignalTableReader(SignalTableReader && other) : 
TableReader(std::move(other)) , m_field_locations(std::move(other.m_field_locations)) , m_pool(other.m_pool) , m_max_cached_table_batches(other.m_max_cached_table_batches) , m_table_batches(std::move(other.m_table_batches)) , m_batch_size(other.m_batch_size) { } SignalTableReader & SignalTableReader::operator=(SignalTableReader && other) { m_field_locations = std::move(other.m_field_locations); m_pool = other.m_pool; m_max_cached_table_batches = other.m_max_cached_table_batches; m_batch_size = other.m_batch_size; m_table_batches = std::move(other.m_table_batches); static_cast(*this) = std::move(static_cast(other)); return *this; } Result SignalTableReader::read_record_batch(std::size_t i) const { std::lock_guard l(m_batch_get_mutex); if (m_last_read_record_batch_index == i) { return pod5::SignalTableRecordBatch{m_last_read_record_batch, m_field_locations, m_pool}; } auto it = m_table_batches.find(i); if (it != m_table_batches.end()) { it->second.last_access_index = m_last_access_index++; return it->second.item; } // If limited in cached batches, then ensure we apply limit: if (m_max_cached_table_batches != 0 && m_table_batches.size() >= m_max_cached_table_batches) { SignalTableReaderCacheCleaner::make_space_in_table_batches(m_table_batches); assert(m_table_batches.size() < m_max_cached_table_batches); } ARROW_ASSIGN_OR_RAISE(m_last_read_record_batch, TableReader::ReadRecordBatch(i)); m_last_read_record_batch_index = i; auto inserted = m_table_batches.emplace( i, CachedItem{ pod5::SignalTableRecordBatch{m_last_read_record_batch, m_field_locations, m_pool}, m_last_access_index++}); return inserted.first->second.item; } Result SignalTableReader::signal_batch_for_row_id( std::uint64_t row, std::size_t * batch_row) const { if (m_batch_size == 0) { return Status::Invalid("Invalid row '", row, "' for file with zero signal rows."); } auto batch = row / m_batch_size; if (batch_row) { *batch_row = row - (batch * m_batch_size); } if (batch >= num_record_batches()) { return 
Status::Invalid("Row outside batch bounds"); } return batch; } Result SignalTableReader::extract_sample_count( gsl::span const & row_indices) const { std::size_t sample_count = 0; for (auto const & signal_row : row_indices) { std::size_t batch_row = 0; ARROW_ASSIGN_OR_RAISE( auto const signal_batch_index, signal_batch_for_row_id(signal_row, &batch_row)); ARROW_ASSIGN_OR_RAISE(auto const & signal_batch, read_record_batch(signal_batch_index)); auto const & samples_column = signal_batch.samples_column(); sample_count += samples_column->Value(batch_row); } return sample_count; } Status SignalTableReader::extract_samples( gsl::span const & row_indices, gsl::span const & output_samples) const { std::size_t sample_count = 0; for (auto const & signal_row : row_indices) { std::size_t batch_row = 0; ARROW_ASSIGN_OR_RAISE( auto const signal_batch_index, signal_batch_for_row_id(signal_row, &batch_row)); ARROW_ASSIGN_OR_RAISE(auto const & signal_batch, read_record_batch(signal_batch_index)); auto const & samples_column = signal_batch.samples_column(); auto const row_samples_count = samples_column->Value(batch_row); std::size_t const sample_start = sample_count; sample_count += row_samples_count; if (sample_count > output_samples.size()) { return Status::Invalid("Too few samples in input samples array"); } ARROW_RETURN_NOT_OK(signal_batch.extract_signal_row( batch_row, output_samples.subspan(sample_start, row_samples_count))); } return Status::OK(); } Result>> SignalTableReader::extract_samples_inplace( gsl::span const & row_indices, std::vector & sample_count) const { std::vector> sample_buffers; for (auto const & signal_row : row_indices) { std::size_t batch_row = 0; ARROW_ASSIGN_OR_RAISE( auto const signal_batch_index, signal_batch_for_row_id(signal_row, &batch_row)); ARROW_ASSIGN_OR_RAISE(auto const & signal_batch, read_record_batch(signal_batch_index)); ARROW_ASSIGN_OR_RAISE(auto signal_data, signal_batch.extract_signal_row_inplace(batch_row)); 
sample_buffers.emplace_back(std::move(signal_data)); auto const & samples_column = signal_batch.samples_column(); sample_count.push_back(samples_column->Value(batch_row)); } return sample_buffers; } SignalType SignalTableReader::signal_type() const { return m_field_locations.signal_type; } //--------------------------------------------------------------------------------------------------------------------- Result make_signal_table_reader( std::shared_ptr const & input, std::size_t max_cached_table_batches, arrow::MemoryPool * pool) { arrow::ipc::IpcReadOptions options; options.memory_pool = pool; ARROW_ASSIGN_OR_RAISE(auto reader, arrow::ipc::RecordBatchFileReader::Open(input, options)); auto read_metadata_key_values = reader->schema()->metadata(); if (!read_metadata_key_values) { return Status::IOError("Missing metadata on signal table schema"); } ARROW_ASSIGN_OR_RAISE( auto read_metadata, read_schema_key_value_metadata(read_metadata_key_values)); ARROW_ASSIGN_OR_RAISE(auto field_locations, read_signal_table_schema(reader->schema())); std::size_t const num_record_batches = reader->num_record_batches(); std::size_t batch_size = 0; if (num_record_batches > 0) { ARROW_ASSIGN_OR_RAISE(auto const batch_zero, ReadRecordBatchAndValidate(*reader, 0)); batch_size = batch_zero->num_rows(); } return SignalTableReader( {input}, std::move(reader), field_locations, std::move(read_metadata), num_record_batches, batch_size, max_cached_table_batches, pool); } } // namespace pod5 ================================================ FILE: c++/pod5_format/signal_table_reader.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/result.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/signal_table_schema.h" #include "pod5_format/table_reader.h" #include "pod5_format/types.h" #include #include #include #include #include namespace arrow { class Schema; namespace io { class RandomAccessFile; } 
namespace ipc { class RecordBatchFileReader; } } // namespace arrow namespace pod5 { struct SignalTableReaderCacheCleaner; class POD5_FORMAT_EXPORT SignalTableRecordBatch : public TableRecordBatch { public: SignalTableRecordBatch( std::shared_ptr const & batch, SignalTableSchemaDescription field_locations, arrow::MemoryPool * pool); std::shared_ptr read_id_column() const; std::shared_ptr uncompressed_signal_column() const; std::shared_ptr vbz_signal_column() const; std::shared_ptr samples_column() const; Result samples_byte_count(std::size_t row_index) const; /// \brief Extract a row of sample data into [samples], decompressing if required. Status extract_signal_row(std::size_t row_index, gsl::span samples) const; Result> extract_signal_row_inplace(std::size_t row_index) const; private: SignalTableSchemaDescription m_field_locations; arrow::MemoryPool * m_pool; }; class POD5_FORMAT_EXPORT SignalTableReader : public TableReader { public: SignalTableReader( std::shared_ptr && input_source, std::shared_ptr && reader, SignalTableSchemaDescription field_locations, SchemaMetadataDescription && schema_metadata, std::size_t num_record_batches, std::size_t batch_size, std::size_t max_cached_table_batches, arrow::MemoryPool * pool); SignalTableReader(SignalTableReader &&); SignalTableReader & operator=(SignalTableReader &&); Result read_record_batch(std::size_t i) const; Result signal_batch_for_row_id(std::uint64_t row, std::size_t * batch_row) const; /// \brief Find the number of samples in a given list of rows. /// \param row_indices The rows to query for sample count. /// \returns The sum of all sample counts on input rows. Result extract_sample_count( gsl::span const & row_indices) const; /// \brief Extract the samples for a list of rows. /// \param row_indices The rows to query for samples. /// \param output_samples Destination span for the samples; it must be large enough to hold the samples of every requested row. 
Status extract_samples( gsl::span const & row_indices, gsl::span const & output_samples) const; /// \brief Extract the samples as written in the arrow table for a list of rows. /// \param row_indices The rows to query for samples. /// \param sample_count Receives the sample count of each queried row, appended in row order. Result>> extract_samples_inplace( gsl::span const & row_indices, std::vector & sample_count) const; /// \brief Find the signal type of this reader SignalType signal_type() const; private: SignalTableSchemaDescription m_field_locations; arrow::MemoryPool * m_pool; std::size_t m_max_cached_table_batches; mutable std::size_t m_last_read_record_batch_index = -1; mutable std::shared_ptr m_last_read_record_batch; mutable std::mutex m_batch_get_mutex; using AccessIndex = std::uint64_t; struct CachedItem { pod5::SignalTableRecordBatch item; AccessIndex last_access_index; }; mutable std::unordered_map m_table_batches; mutable AccessIndex m_last_access_index = 0; std::size_t m_batch_size; friend struct SignalTableReaderCacheCleaner; }; POD5_FORMAT_EXPORT Result make_signal_table_reader( std::shared_ptr const & sink, std::size_t max_cached_table_batches, arrow::MemoryPool * pool); } // namespace pod5 ================================================ FILE: c++/pod5_format/signal_table_schema.cpp ================================================ #include "pod5_format/signal_table_schema.h" #include "pod5_format/schema_utils.h" #include "pod5_format/types.h" #include namespace pod5 { std::shared_ptr make_signal_table_schema( SignalType signal_type, std::shared_ptr const & metadata, SignalTableSchemaDescription * field_locations) { auto const uuid_type = uuid(); if (field_locations) { *field_locations = {}; field_locations->signal_type = signal_type; } std::shared_ptr signal_schema_type; switch (signal_type) { case SignalType::UncompressedSignal: signal_schema_type = arrow::large_list(arrow::int16()); break; case SignalType::VbzSignal: signal_schema_type = vbz_signal(); break; } return arrow::schema( { arrow::field("read_id", uuid_type), 
arrow::field("signal", signal_schema_type), arrow::field("samples", arrow::uint32()), }, metadata); } Result read_signal_table_schema( std::shared_ptr const & schema) { ARROW_ASSIGN_OR_RAISE(auto read_id_field_idx, find_field(schema, "read_id", uuid())); ARROW_ASSIGN_OR_RAISE(auto samples_field_idx, find_field(schema, "samples", arrow::uint32())); ARROW_ASSIGN_OR_RAISE(auto signal_field_idx, find_field_untyped(schema, "signal")); SignalType signal_type = SignalType::UncompressedSignal; { auto const signal_field = schema->field(signal_field_idx); auto const signal_arrow_type = signal_field->type(); if (signal_arrow_type->id() == arrow::Type::LARGE_LIST) { auto const & signal_list_field = static_cast(*signal_arrow_type); if (signal_list_field.value_type()->id() != arrow::Type::INT16) { return Status::TypeError("Schema field 'signal' list value type is incorrect type"); } } else if (signal_arrow_type->Equals(vbz_signal())) { signal_type = SignalType::VbzSignal; } else { return Status::TypeError( "Schema field 'signal' is incorrect type: '", signal_arrow_type->name(), "'"); } } return SignalTableSchemaDescription{ signal_type, read_id_field_idx, signal_field_idx, samples_field_idx}; } } // namespace pod5 ================================================ FILE: c++/pod5_format/signal_table_schema.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/result.h" #include "pod5_format/signal_table_utils.h" #include namespace arrow { class KeyValueMetadata; class Schema; } // namespace arrow namespace pod5 { struct SignalTableSchemaDescription { SignalType signal_type; int read_id = 0; int signal = 1; int samples = 2; }; /// \brief Make a new schema for a signal table. /// \param signal_type The type of signal to use. /// \param metadata Metadata to be applied to the schema. /// \param field_locations [optional] The signal table field locations, for use when writing to the table. 
/// \returns The schema for a signal table. POD5_FORMAT_EXPORT std::shared_ptr make_signal_table_schema( SignalType signal_type, std::shared_ptr const & metadata, SignalTableSchemaDescription * field_locations); POD5_FORMAT_EXPORT Result read_signal_table_schema( std::shared_ptr const &); } // namespace pod5 ================================================ FILE: c++/pod5_format/signal_table_utils.h ================================================ #pragma once #include <cstdint> namespace pod5 { using SignalTableRowIndex = std::uint64_t; enum class SignalType { UncompressedSignal, VbzSignal, }; } // namespace pod5 ================================================ FILE: c++/pod5_format/signal_table_writer.cpp ================================================ #include "pod5_format/signal_table_writer.h" #include "pod5_format/file_output_stream.h" #include "pod5_format/internal/tracing/tracing.h" #include "pod5_format/types.h" #include #include #include #include #include #include #include #include namespace pod5 { SignalTableWriter::SignalTableWriter( std::shared_ptr && writer, std::shared_ptr && schema, SignalBuilderVariant && signal_builder, SignalTableSchemaDescription const & field_locations, std::shared_ptr const & output_stream, std::size_t table_batch_size, arrow::MemoryPool * pool) : m_pool(pool) , m_schema(schema) , m_field_locations(field_locations) , m_output_stream{output_stream} , m_table_batch_size(table_batch_size) , m_writer(std::move(writer)) , m_signal_builder(std::move(signal_builder)) { m_read_id_builder = make_read_id_builder(m_pool); m_samples_builder = std::make_unique(m_pool); } SignalTableWriter::SignalTableWriter(SignalTableWriter && other) = default; SignalTableWriter & SignalTableWriter::operator=(SignalTableWriter &&) = default; SignalTableWriter::~SignalTableWriter() { if (m_writer) { (void)close(); } } Result SignalTableWriter::add_signal( Uuid const & read_id, gsl::span const & signal) { POD5_TRACE_FUNCTION(); if (!m_writer) { return Status::IOError("Writer 
terminated"); } ARROW_RETURN_NOT_OK(reserve_rows()); auto row_id = m_written_batched_row_count + m_current_batch_row_count; ARROW_RETURN_NOT_OK(m_read_id_builder->Append(read_id.data())); ARROW_RETURN_NOT_OK(std::visit(visitors::append_signal{signal, m_pool}, m_signal_builder)); ARROW_RETURN_NOT_OK(m_samples_builder->Append(signal.size())); ++m_current_batch_row_count; if (m_current_batch_row_count >= m_table_batch_size) { ARROW_RETURN_NOT_OK(write_batch()); } return row_id; } Result SignalTableWriter::add_pre_compressed_signal( Uuid const & read_id, gsl::span const & signal, std::uint32_t sample_count) { POD5_TRACE_FUNCTION(); if (!m_writer) { return Status::IOError("Writer terminated"); } ARROW_RETURN_NOT_OK(reserve_rows()); auto row_id = m_written_batched_row_count + m_current_batch_row_count; ARROW_RETURN_NOT_OK(m_read_id_builder->Append(read_id.data())); ARROW_RETURN_NOT_OK( std::visit(visitors::append_pre_compressed_signal{signal}, m_signal_builder)); ARROW_RETURN_NOT_OK(m_samples_builder->Append(sample_count)); ++m_current_batch_row_count; if (m_current_batch_row_count >= m_table_batch_size) { ARROW_RETURN_NOT_OK(write_batch()); } return row_id; } pod5::Result> SignalTableWriter::add_signal_batch( std::size_t row_count, std::vector> && columns, bool final_batch) { POD5_TRACE_FUNCTION(); if (!m_writer) { return Status::Invalid("Unable to write batches, writer is closed."); } if (m_current_batch_row_count != 0) { return Status::Invalid("Unable to write batches directly and using per read methods"); } if (!final_batch && row_count != m_table_batch_size) { return Status::Invalid("Unable to write invalid sized signal batch to signal table"); } auto const record_batch = arrow::RecordBatch::Make(m_schema, row_count, std::move(columns)); ARROW_RETURN_NOT_OK(m_writer->WriteRecordBatch(*record_batch)); if (final_batch) { ARROW_RETURN_NOT_OK(close()); } auto first_row_id = m_written_batched_row_count; m_written_batched_row_count += row_count; return 
std::make_pair(first_row_id, m_written_batched_row_count); } Status SignalTableWriter::close() { // Check for already closed if (!m_writer) { return Status::OK(); } ARROW_RETURN_NOT_OK(write_batch()); ARROW_RETURN_NOT_OK(m_writer->Close()); m_writer = nullptr; return Status::OK(); } SignalType SignalTableWriter::signal_type() const { return m_field_locations.signal_type; } Status SignalTableWriter::write_batch(arrow::RecordBatch const & record_batch) { ARROW_RETURN_NOT_OK(m_writer->WriteRecordBatch(record_batch)); return m_output_stream->batch_complete(); } Status SignalTableWriter::write_batch() { POD5_TRACE_FUNCTION(); if (m_current_batch_row_count == 0) { return Status::OK(); } if (!m_writer) { return Status::IOError("Writer terminated"); } std::vector> columns{nullptr, nullptr, nullptr}; ARROW_RETURN_NOT_OK(m_read_id_builder->Finish(&columns[m_field_locations.read_id])); ARROW_RETURN_NOT_OK( std::visit(visitors::finish_column{&columns[m_field_locations.signal]}, m_signal_builder)); ARROW_RETURN_NOT_OK(m_samples_builder->Finish(&columns[m_field_locations.samples])); auto const record_batch = arrow::RecordBatch::Make(m_schema, m_current_batch_row_count, std::move(columns)); m_written_batched_row_count += m_current_batch_row_count; m_current_batch_row_count = 0; ARROW_RETURN_NOT_OK(m_writer->WriteRecordBatch(*record_batch)); return m_output_stream->batch_complete(); } Status SignalTableWriter::reserve_rows() { // Only reserve if we have not already reserved (at the start of a batch) if (m_current_batch_row_count > 0) { return arrow::Status::OK(); } ARROW_RETURN_NOT_OK(m_read_id_builder->Reserve(m_table_batch_size)); ARROW_RETURN_NOT_OK(m_samples_builder->Reserve(m_table_batch_size)); static constexpr std::uint32_t APPROX_READ_SIZE = 102'400; return std::visit( visitors::reserve_rows{m_table_batch_size, APPROX_READ_SIZE}, m_signal_builder); } Result make_signal_table_writer( std::shared_ptr const & sink, std::shared_ptr const & metadata, std::size_t 
table_batch_size, SignalType compression_type, arrow::MemoryPool * pool) { SignalTableSchemaDescription field_locations; auto schema = make_signal_table_schema(compression_type, metadata, &field_locations); arrow::ipc::IpcWriteOptions options; options.memory_pool = pool; ARROW_ASSIGN_OR_RAISE(auto writer, arrow::ipc::MakeFileWriter(sink, schema, options, metadata)); ARROW_ASSIGN_OR_RAISE(auto signal_builder, make_signal_builder(compression_type, pool)); auto signal_table_writer = SignalTableWriter( std::move(writer), std::move(schema), std::move(signal_builder), field_locations, sink, table_batch_size, pool); return signal_table_writer; } } // namespace pod5 ================================================ FILE: c++/pod5_format/signal_table_writer.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/result.h" #include "pod5_format/signal_builder.h" #include "pod5_format/signal_table_schema.h" #include "pod5_format/uuid.h" #include #include namespace arrow { class Schema; namespace ipc { class RecordBatchWriter; } } // namespace arrow namespace pod5 { class FileOutputStream; class POD5_FORMAT_EXPORT SignalTableWriter { public: SignalTableWriter( std::shared_ptr && writer, std::shared_ptr && schema, SignalBuilderVariant && signal_builder, SignalTableSchemaDescription const & field_locations, std::shared_ptr const & output_stream, std::size_t table_batch_size, arrow::MemoryPool * pool); SignalTableWriter(SignalTableWriter &&); SignalTableWriter & operator=(SignalTableWriter &&); SignalTableWriter(SignalTableWriter const &) = delete; SignalTableWriter & operator=(SignalTableWriter const &) = delete; ~SignalTableWriter(); /// \brief Find the size of table batches for the signal table writer. std::size_t table_batch_size() const { return m_table_batch_size; } /// \brief Add a read to the signal table, adding to the current batch. 
/// \param read_id The read id for the read entry /// \param signal The signal for the read entry /// \returns The row index of the inserted signal, or a status on failure. Result add_signal( Uuid const & read_id, gsl::span const & signal); /// \brief Add a pre-compressed read to the signal table, adding to the current batch. /// The batch is written out once it holds #table_batch_size rows, or when #close is called. /// /// The user should call #compress_signal on *this* writer to compress the signal prior /// to calling this method, to ensure the signal is compressed correctly for the table. /// /// \param read_id The read id for the read entry /// \param signal The pre-compressed signal for the read entry /// \param sample_count The number of samples in the original (uncompressed) signal /// \returns The row index of the inserted signal, or a status on failure. Result add_pre_compressed_signal( Uuid const & read_id, gsl::span const & signal, std::uint32_t sample_count); /// \brief Write a complete, pre-built batch of rows directly to the signal table. /// /// This cannot be mixed with a partially filled per-read batch: the current batch must be empty. /// Every batch except the final one must contain exactly #table_batch_size rows; /// passing final_batch = true also closes the writer. /// /// \returns The half-open range [first, last) of row indices written, or a status on failure. pod5::Result> add_signal_batch( std::size_t row_count, std::vector> && columns, bool final_batch); /// \brief Close this writer, signaling no further data will be written to the writer. Status close(); /// \brief Find the signal type of this writer SignalType signal_type() const; /// \brief Reserve space for future row writes, called automatically when the first row of a new batch is added. Status reserve_rows(); /// \brief Find the schema for the signal table std::shared_ptr const & schema() const { return m_schema; } /// \brief Flush passed data into the writer as a record batch. Status write_batch(arrow::RecordBatch const &); private: /// \brief Flush buffered data into the writer as a record batch. 
Status write_batch(); arrow::MemoryPool * m_pool = nullptr; std::shared_ptr m_schema; SignalTableSchemaDescription m_field_locations; std::shared_ptr m_output_stream; std::size_t m_table_batch_size; std::shared_ptr m_writer; std::unique_ptr m_read_id_builder; SignalBuilderVariant m_signal_builder; std::unique_ptr m_samples_builder; std::size_t m_written_batched_row_count = 0; std::size_t m_current_batch_row_count = 0; }; /// \brief Make a new writer for a signal table. /// \param sink Sink to be used for output of the table. /// \param metadata Metadata to be applied to the table schema. /// \param table_batch_size The size of each batch written for the table. /// \param compression_type The signal storage type for the table (uncompressed or VBZ). /// \param pool Pool to be used for building table in memory. /// \returns The writer for the new table. POD5_FORMAT_EXPORT Result make_signal_table_writer( std::shared_ptr const & sink, std::shared_ptr const & metadata, std::size_t table_batch_size, SignalType compression_type, arrow::MemoryPool * pool); } // namespace pod5 ================================================ FILE: c++/pod5_format/svb16/common.hpp ================================================ #pragma once #if __cplusplus >= 201703L #define SVB16_IF_CONSTEXPR if constexpr #else #define SVB16_IF_CONSTEXPR if #endif #ifdef _MSC_VER #define SVB_RESTRICT __restrict #else #define SVB_RESTRICT __restrict__ #endif #if defined(__x86_64__) || defined(_M_AMD64) // x64 #define SVB16_X64 #elif defined(__arm__) || defined(__aarch64__) #define SVB16_ARM #endif #ifndef __has_builtin #define __has_builtin(x) 0 #endif #if __has_builtin(__builtin_popcount) // likely to be a single instruction (POPCNT) on x86_64 #define svb16_popcount __builtin_popcount #else // optimising compilers can often convert this pattern to POPCNT on x86_64 inline int svb16_popcount(unsigned int i) { i = i - ((i >> 1) & 0x55555555); // add pairs of bits i = (i & 0x33333333) + ((i >> 2) & 0x33333333); // quads i = (i + (i >> 4)) & 0x0F0F0F0F; // groups of 8 return (i * 0x01010101) >> 24; 
// horizontal sum of bytes } #endif ================================================ FILE: c++/pod5_format/svb16/decode.hpp ================================================ #pragma once #include "common.hpp" #include "decode_scalar.hpp" #include "svb16.h" // svb16_key_length #include #ifdef SVB16_X64 #include "decode_x64.hpp" #include "simd_detect_x64.hpp" #endif namespace svb16 { // Required extra space after readable buffers passed in. // // Require 1 128 bit buffer beyond the end of all input readable buffers. inline std::size_t decode_input_buffer_padding_byte_count() { #ifdef SVB16_X64 return sizeof(__m128i); #else return 0; #endif } template size_t decode(gsl::span out, gsl::span in, Int16T prev = 0) { auto keys_length = ::svb16_key_length(out.size()); auto const keys = in.subspan(0, keys_length); auto const data = in.subspan(keys_length); #ifdef SVB16_X64 if (has_sse4_1()) { return decode_sse(out, keys, data, prev) - in.begin(); } #endif return decode_scalar(out, keys, data, prev) - in.begin(); } inline bool validate(gsl::span compressed_input, std::size_t out_size) { auto const keys_length = ::svb16_key_length(out_size); if (keys_length > compressed_input.size()) { return false; } // Pull out the parts of the input data. auto const keys_span = compressed_input.subspan(0, keys_length); auto const data_span = compressed_input.subspan(keys_length); auto keys_ptr = keys_span.begin(); // Accumulate the key sizes in a wider type to avoid overflow. using Accumulator = std:: conditional_t= sizeof(std::uint64_t), std::size_t, std::uint64_t>; Accumulator encoded_size = 0; // Give the compiler a hint that it can avoid branches in the inner loop. for (std::size_t c = 0; c < out_size / 8; c++) { uint8_t const key_byte = *keys_ptr++; for (uint8_t shift = 0; shift < 8; shift++) { uint8_t const code = (key_byte >> shift) & 0x01; encoded_size += code + 1; } } out_size &= 7; // Process the remainder one at a time. 
uint8_t shift = 0; uint8_t key_byte = *keys_ptr++; for (std::size_t c = 0; c < out_size; c++) { if (shift == 8) { shift = 0; key_byte = *keys_ptr++; } uint8_t const code = (key_byte >> shift) & 0x01; encoded_size += code + 1; shift++; } return encoded_size == data_span.size(); } } // namespace svb16 ================================================ FILE: c++/pod5_format/svb16/decode_scalar.hpp ================================================ #pragma once #include "common.hpp" #include #include #include #include #include namespace svb16 { namespace detail { inline uint16_t zigzag_decode(uint16_t val) { return (val >> 1) ^ static_cast(0 - (val & 1)); } inline uint16_t decode_data(gsl::span::iterator & dataPtr, uint8_t code) { uint16_t val; if (code == 0) { // 1 byte val = (uint16_t)*dataPtr; dataPtr += 1; } else { // 2 bytes val = 0; memcpy(&val, dataPtr, 2); // assumes little endian dataPtr += 2; } return val; } } // namespace detail template uint8_t const * decode_scalar( gsl::span out_span, gsl::span keys_span, gsl::span data_span, Int16T prev = 0) { auto const count = out_span.size(); if (count == 0) { return data_span.begin(); } auto out = out_span.begin(); auto keys = keys_span.begin(); auto data = data_span.begin(); uint8_t shift = 0; // cycles 0 through 7 then resets uint8_t key_byte = *keys++; // need to do the arithmetic in unsigned space so it wraps auto u_prev = static_cast(prev); for (uint32_t c = 0; c < count; c++, shift++) { if (shift == 8) { shift = 0; key_byte = *keys++; } uint16_t value = detail::decode_data(data, (key_byte >> shift) & 0x01); SVB16_IF_CONSTEXPR(UseZigzag) { value = detail::zigzag_decode(value); } SVB16_IF_CONSTEXPR(UseDelta) { value += u_prev; u_prev = value; } *out++ = static_cast(value); } assert(out == out_span.end()); assert(keys == keys_span.end()); assert(data <= data_span.end()); return data; } } // namespace svb16 ================================================ FILE: c++/pod5_format/svb16/decode_x64.hpp 
================================================ #pragma once #include "common.hpp" #include "decode_scalar.hpp" #include "intrinsics.hpp" #include "shuffle_tables.hpp" #include "svb16.h" // svb16_key_length #include #include #include #ifdef SVB16_X64 namespace svb16 { namespace detail { [[gnu::target("ssse3")]] inline __m128i zigzag_decode(__m128i val) { return _mm_xor_si128( // N >> 1 _mm_srli_epi16(val, 1), // 0xFFFF if N & 1 else 0x0000 _mm_srai_epi16(_mm_slli_epi16(val, 15), 15) // alternative: _mm_sign_epi16(ones, _mm_slli_epi16(buf, 15)) ); } [[gnu::target("ssse3")]] inline __m128i unpack(uint32_t key, uint8_t const * SVB_RESTRICT * data) { auto const len = static_cast(8 + svb16_popcount(key)); __m128i data_reg = _mm_loadu_si128(reinterpret_cast<__m128i const *>(*data)); __m128i const shuffle = *reinterpret_cast<__m128i const *>(&g_decode_shuffle_table[key]); data_reg = _mm_shuffle_epi8(data_reg, shuffle); *data += len; return data_reg; } template [[gnu::target("ssse3")]] inline void store_8(Int16T * to, __m128i value, __m128i * prev) { SVB16_IF_CONSTEXPR(UseZigzag) { value = zigzag_decode(value); } SVB16_IF_CONSTEXPR(UseDelta) { auto const broadcast_last_16 = m128i_from_bytes(14, 15, 14, 15, 14, 15, 14, 15, 14, 15, 14, 15, 14, 15, 14, 15); // value == [A B C D E F G H] (16 bit values) __m128i add = _mm_slli_si128(value, 2); // add == [- A B C D E F G] *prev = _mm_shuffle_epi8(*prev, broadcast_last_16); // *prev == [P P P P P P P P] value = _mm_add_epi16(value, add); // value == [A AB BC CD DE EF FG GH] add = _mm_slli_si128(value, 4); // add == [- - A AB BC CD DE EF] value = _mm_add_epi16(value, add); // value == [A AB ABC ABCD BCDE CDEF DEFG EFGH] add = _mm_slli_si128(value, 8); // add == [- - - - A AB ABC ABCD] value = _mm_add_epi16(value, add); // value == [A AB ABC ABCD ABCDE ABCDEF ABCDEFG ABCDEFGH] value = _mm_add_epi16(value, *prev); // value == [PA PAB PABC PABCD PABCDE PABCDEF PABCDEFG PABCDEFGH] *prev = value; } 
_mm_storeu_si128(reinterpret_cast<__m128i *>(to), value); } } // namespace detail template [[gnu::target("sse4.1")]] uint8_t const * decode_sse( gsl::span out_span, gsl::span keys_span, gsl::span data_span, Int16T prev = 0) { auto store_8 = [](Int16T * to, __m128i value, __m128i * prev) { detail::store_8(to, value, prev); }; // this code treats all input as uint16_t (except the zigzag code, which treats it as int16_t) // this isn't a problem, as the scalar code does the same auto out = out_span.begin(); auto const count = out_span.size(); auto keys_it = keys_span.begin(); auto data = data_span.begin(); // handle blocks of 64 values (8 key bytes per iteration) if (count >= 64) { size_t const key_bytes = count / 8; __m128i prev_reg; SVB16_IF_CONSTEXPR(UseDelta) { prev_reg = _mm_set1_epi16(prev); } int64_t offset = -static_cast(key_bytes) / 8 + 1; // 8 -> 4? uint64_t const * keyPtr64 = reinterpret_cast(keys_it) - offset; uint64_t nextkeys; memcpy(&nextkeys, keyPtr64 + offset, sizeof(nextkeys)); __m128i data_reg; for (; offset != 0; ++offset) { uint64_t keys = nextkeys; memcpy(&nextkeys, keyPtr64 + offset + 1, sizeof(nextkeys)); // faster 16-bit delta since we only have 8-bit values if (!keys) { // 64 1-byte ints in a row // _mm_cvtepu8_epi16: SSE4.1 data_reg = _mm_cvtepu8_epi16(_mm_lddqu_si128(reinterpret_cast<__m128i const *>(data))); store_8(out, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16(_mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + 8))); store_8(out + 8, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16( _mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + 16))); store_8(out + 16, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16( _mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + 24))); store_8(out + 24, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16( _mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + 32))); store_8(out + 32, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16( _mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + +40))); 
store_8(out + 40, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16( _mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + 48))); store_8(out + 48, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16( _mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + 56))); store_8(out + 56, data_reg, &prev_reg); out += 64; data += 64; continue; } data_reg = detail::unpack(keys & 0x00FF, &data); store_8(out, data_reg, &prev_reg); data_reg = detail::unpack((keys & 0xFF00) >> 8, &data); store_8(out + 8, data_reg, &prev_reg); keys >>= 16; data_reg = detail::unpack((keys & 0x00FF), &data); store_8(out + 16, data_reg, &prev_reg); data_reg = detail::unpack((keys & 0xFF00) >> 8, &data); store_8(out + 24, data_reg, &prev_reg); keys >>= 16; data_reg = detail::unpack((keys & 0x00FF), &data); store_8(out + 32, data_reg, &prev_reg); data_reg = detail::unpack((keys & 0xFF00) >> 8, &data); store_8(out + 40, data_reg, &prev_reg); keys >>= 16; data_reg = detail::unpack((keys & 0x00FF), &data); store_8(out + 48, data_reg, &prev_reg); // Note we load at least sizeof(__m128i) bytes from the end of data // here, need to ensure that is available to read. // // But we might not use it all depending on the unpacking. // // This is ok due to `decode_input_buffer_padding_byte_count` ensuring // extra space on the input buffer. 
data_reg = detail::unpack((keys & 0xFF00) >> 8, &data); store_8(out + 56, data_reg, &prev_reg); out += 64; } { uint64_t keys = nextkeys; // faster 16-bit delta since we only have 8-bit values if (!keys) { // 64 1-byte ints in a row data_reg = _mm_cvtepu8_epi16(_mm_lddqu_si128(reinterpret_cast<__m128i const *>(data))); store_8(out, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16(_mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + 8))); store_8(out + 8, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16( _mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + 16))); store_8(out + 16, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16( _mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + 24))); store_8(out + 24, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16( _mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + 32))); store_8(out + 32, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16( _mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + +40))); store_8(out + 40, data_reg, &prev_reg); data_reg = _mm_cvtepu8_epi16( _mm_lddqu_si128(reinterpret_cast<__m128i const *>(data + 48))); store_8(out + 48, data_reg, &prev_reg); // Only load the first 8 bytes here, otherwise we may run off the end of the buffer data_reg = _mm_cvtepu8_epi16( _mm_loadl_epi64(reinterpret_cast<__m128i const *>(data + 56))); store_8(out + 56, data_reg, &prev_reg); out += 64; data += 64; } else { data_reg = detail::unpack(keys & 0x00FF, &data); store_8(out, data_reg, &prev_reg); data_reg = detail::unpack((keys & 0xFF00) >> 8, &data); store_8(out + 8, data_reg, &prev_reg); keys >>= 16; data_reg = detail::unpack((keys & 0x00FF), &data); store_8(out + 16, data_reg, &prev_reg); data_reg = detail::unpack((keys & 0xFF00) >> 8, &data); store_8(out + 24, data_reg, &prev_reg); keys >>= 16; data_reg = detail::unpack((keys & 0x00FF), &data); store_8(out + 32, data_reg, &prev_reg); data_reg = detail::unpack((keys & 0xFF00) >> 8, &data); store_8(out + 40, data_reg, &prev_reg); keys >>= 16; 
data_reg = detail::unpack((keys & 0x00FF), &data); store_8(out + 48, data_reg, &prev_reg); data_reg = detail::unpack((keys & 0xFF00) >> 8, &data); store_8(out + 56, data_reg, &prev_reg); out += 64; } } prev = out[-1]; keys_it += key_bytes - (key_bytes & 7); } assert(out <= out_span.end()); assert(keys_it <= keys_span.end()); assert(data <= data_span.end()); auto out_scalar_span = gsl::make_span(out, out_span.end()); assert(out_scalar_span.size() == (count & 63)); auto keys_scalar_span = gsl::make_span(keys_it, keys_span.end()); auto data_scalar_span = gsl::make_span(data, data_span.end()); return decode_scalar( out_scalar_span, keys_scalar_span, data_scalar_span, prev); } #endif // SVB16_X64 } // namespace svb16 ================================================ FILE: c++/pod5_format/svb16/encode.hpp ================================================ #pragma once #include "common.hpp" #include "encode_scalar.hpp" #include "svb16.h" // svb16_key_length #ifdef SVB16_X64 #include "encode_x64.hpp" #include "simd_detect_x64.hpp" #endif namespace svb16 { template size_t encode(Int16T const * in, uint8_t * SVB_RESTRICT out, uint32_t count, Int16T prev = 0) { auto const keys = out; auto const data = keys + ::svb16_key_length(count); #ifdef SVB16_X64 if (has_ssse3()) { return encode_sse(in, keys, data, count, prev) - out; } #endif return encode_scalar(in, keys, data, count, prev) - out; } } // namespace svb16 ================================================ FILE: c++/pod5_format/svb16/encode_scalar.hpp ================================================ #pragma once #include "common.hpp" #include #include #include namespace svb16 { namespace detail { inline uint16_t zigzag_encode(uint16_t val) { return (val + val) ^ static_cast(static_cast(val) >> 15); } } // namespace detail template uint8_t * encode_scalar( Int16T const * in, uint8_t * SVB_RESTRICT keys, uint8_t * SVB_RESTRICT data, uint32_t count, Int16T prev = 0) { if (count == 0) { return data; } uint8_t shift = 0; // cycles 
0 through 7 then resets uint8_t key_byte = 0; for (uint32_t c = 0; c < count; c++) { if (shift == 8) { shift = 0; *keys++ = key_byte; key_byte = 0; } uint16_t value; SVB16_IF_CONSTEXPR(UseDelta) { // need to do the arithmetic in unsigned space so it wraps value = static_cast(in[c]) - static_cast(prev); SVB16_IF_CONSTEXPR(UseZigzag) { value = detail::zigzag_encode(value); } prev = in[c]; } else SVB16_IF_CONSTEXPR(UseZigzag) { value = detail::zigzag_encode(static_cast(in[c])); } else { value = static_cast(in[c]); } if (value < (1 << 8)) { // 1 byte *data = static_cast(value); ++data; } else { // 2 bytes std::memcpy(data, &value, 2); // assumes little endian data += 2; key_byte |= 1 << shift; } shift += 1; } *keys = key_byte; // write last key (no increment needed) return data; } } // namespace svb16 ================================================ FILE: c++/pod5_format/svb16/encode_x64.hpp ================================================ #pragma once #include "common.hpp" #include "encode_scalar.hpp" #include "intrinsics.hpp" #include "shuffle_tables.hpp" #include "svb16.h" // svb16_key_length #include #include #ifdef SVB16_X64 namespace svb16 { namespace detail { [[gnu::target("ssse3")]] inline __m128i delta(__m128i curr, __m128i prev) { return _mm_sub_epi16(curr, _mm_alignr_epi8(curr, prev, 14)); } [[gnu::target("ssse3")]] inline __m128i zigzag_encode(__m128i val) { return _mm_xor_si128(_mm_add_epi16(val, val), _mm_srai_epi16(val, 16)); } template [[gnu::target("ssse3")]] inline __m128i load_8(Int16T const * from, __m128i * prev) { auto const loaded = _mm_loadu_si128(reinterpret_cast<__m128i const *>(from)); SVB16_IF_CONSTEXPR(UseDelta && UseZigzag) { auto const result = delta(loaded, *prev); *prev = loaded; return zigzag_encode(result); } else SVB16_IF_CONSTEXPR(UseDelta) { auto const result = delta(loaded, *prev); *prev = loaded; return result; } else SVB16_IF_CONSTEXPR(UseZigzag) { return zigzag_encode(loaded); } else { return loaded; } } } // namespace detail 
template [[gnu::target("ssse3")]] uint8_t * encode_sse( Int16T const * in, uint8_t * SVB_RESTRICT keys_dest, uint8_t * SVB_RESTRICT data_dest, uint32_t count, Int16T prev = 0) { // this code treats all input as uint16_t (except the zigzag code, which treats it as int16_t) // this isn't a problem, as the scalar code does the same __m128i prev_reg; SVB16_IF_CONSTEXPR(UseDelta) { prev_reg = _mm_set1_epi16(prev); } //auto const key_len = svb16_key_length(count); auto const mask_01 = detail::m128i_from_bytes( 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01); for (Int16T const * end = &in [(count & ~15)]; in != end; in += 16) { // load up 16 values into r0 and r1 auto r0 = detail::load_8(in, &prev_reg); auto r1 = detail::load_8(in + 8, &prev_reg); // 1 byte per input byte: 1 if the byte is set, 0 if not auto r2 = _mm_min_epu8(mask_01, r0); auto r3 = _mm_min_epu8(mask_01, r1); // 1 byte per input Int16T: FF if the MSB is set, 00 or 01 if not // (us = unsigned saturation) r2 = _mm_packus_epi16(r2, r3); // 1 bit per input Int16T: 1 if the MSB is set, 0 if not // only the low 16 bits are set auto const keys = static_cast(_mm_movemask_epi8(r2)); // use the shuffle table to discard the MSB if the corresponding key bit is not set r2 = _mm_loadu_si128((__m128i *)&g_encode_shuffle_table[(keys << 4) & 0x07F0]); r3 = _mm_loadu_si128((__m128i *)&g_encode_shuffle_table[(keys >> 4) & 0x07F0]); r0 = _mm_shuffle_epi8(r0, r2); r1 = _mm_shuffle_epi8(r1, r3); // store the data to data_dest (note that we often end up with overlapping writes) _mm_storeu_si128(reinterpret_cast<__m128i *>(data_dest), r0); data_dest += 8 + svb16_popcount(keys & 0xFF); _mm_storeu_si128(reinterpret_cast<__m128i *>(data_dest), r1); data_dest += 8 + svb16_popcount(keys >> 8); *reinterpret_cast(keys_dest) = keys; keys_dest += 2; } SVB16_IF_CONSTEXPR(UseDelta) { prev = _mm_extract_epi16(prev_reg, 7); } // max two control bytes (16 values) left, use the scalar function 
count &= 15; return encode_scalar(in, keys_dest, data_dest, count, prev); } #endif // SVB16_X64 } // namespace svb16 ================================================ FILE: c++/pod5_format/svb16/generate_shuffle_tables.py ================================================ def encode_table_row(control): table = [] for i in range(7): offset = i * 2 # first byte table.append(offset) if (control >> i) & 1: table.append(offset + 1) final_offset = 14 for j in range(2): table.append(final_offset + j) for i in range(16 - len(table)): table.append(0xFF) return table def decode_table_row(control): table = [] offset = 0 for i in range(8): table.append(offset) offset += 1 if (control >> i) & 1: table.append(offset) offset += 1 else: table.append(0xFF) return table def print_x64_encode_table(): print("static constexpr uint8_t g_encode_shuffle_table[128*16] = {") for i in range(128): table = encode_table_row(i) print("\t", ", ".join(f"0x{v:02X}" for v in table), ",", sep="") print("};\n\n") def print_x64_decode_table(): print("static const uint8_t g_decode_shuffle_table[256][16] = {") for i in range(256): table = decode_table_row(i) print("\t{ ", ", ".join(f"0x{v:02X}" for v in table), "},", sep="") print("};\n\n") if __name__ == "__main__": print("#pragma once") print('#include "common.hpp" // arch macros') print("#include ") print() print("#ifdef SVB16_X64") print_x64_encode_table() print_x64_decode_table() print("#endif") ================================================ FILE: c++/pod5_format/svb16/intrinsics.hpp ================================================ #pragma once #include "common.hpp" // architecture macros #if defined(_MSC_VER) #include #elif defined(__GNUC__) && defined(SVB16_X64) #include #elif defined(__GNUC__) && defined(__ARM_NEON__) #include #endif #include namespace svb16 { namespace detail { [[gnu::target("sse2")]] inline constexpr __m128i m128i_from_bytes( uint8_t a, uint8_t b, uint8_t c, uint8_t d, uint8_t e, uint8_t f, uint8_t g, uint8_t h, uint8_t i, 
uint8_t j, uint8_t k, uint8_t l, uint8_t m, uint8_t n, uint8_t o, uint8_t p) { #ifdef _MSC_VER return __m128i{ (char)a, (char)b, (char)c, (char)d, (char)e, (char)f, (char)g, (char)h, (char)i, (char)j, (char)k, (char)l, (char)m, (char)n, (char)o, (char)p}; #else return __m128i{ static_cast(static_cast(h) << 56) + (static_cast(g) << 48) + (static_cast(f) << 40) + (static_cast(e) << 32) + (static_cast(d) << 24) + (static_cast(c) << 16) + (static_cast(b) << 8) + static_cast(a), static_cast(static_cast(p) << 56) + (static_cast(o) << 48) + (static_cast(n) << 40) + (static_cast(m) << 32) + (static_cast(l) << 24) + (static_cast(k) << 16) + (static_cast(j) << 8) + static_cast(i)}; #endif } }} // namespace svb16::detail ================================================ FILE: c++/pod5_format/svb16/shuffle_tables.hpp ================================================ #pragma once #include "common.hpp" // arch macros #include #ifdef SVB16_X64 static constexpr uint8_t g_encode_shuffle_table[128 * 16] = { 0x00, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x07, 
0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 
0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 
0x00, 0x02, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 
0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 
0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x08, 0x09, 
0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0x00, 0x02, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0x00, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0xFF, 0x00, 0x01, 0x02, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0x00, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, }; static uint8_t const g_decode_shuffle_table[256][16] = { {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 
0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 
0x0A, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0x03, 
0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF, 0x0C, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF, 0x0D, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 
0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B, 0x0C, 0xFF}, 
{0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C, 0x0D, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 
0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0xFF}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0xFF}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0xFF}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 
0x06, 0xFF, 0x07, 0x08}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0x01, 
0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0xFF, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 
0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0xFF, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF, 
0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0xFF, 0x0C, 0x0D}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0xFF, 0x0D, 0x0E}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0xFF, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0x02, 
0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0xFF, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0xFF, 0x0B, 0x0C, 0x0D, 0x0E}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0xFF, 
0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0xFF, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xFF, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0xFF, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0xFF, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E}, {0x00, 0xFF, 0x01, 0xFF, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C}, {0x00, 0x01, 0x02, 0xFF, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D}, {0x00, 0xFF, 0x01, 0x02, 0x03, 0xFF, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D}, 
{0x00, 0x01, 0x02, 0x03, 0x04, 0xFF, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E},
{0x00, 0xFF, 0x01, 0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D},
{0x00, 0x01, 0x02, 0xFF, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E},
{0x00, 0xFF, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E},
{0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F},
};

#endif

================================================
FILE: c++/pod5_format/svb16/simd_detect_x64.hpp
================================================
#pragma once

#include "common.hpp"  // architecture macros

#if defined(SVB16_X64)

#ifdef _MSC_VER
#include <intrin.h>
#endif

// __AVX__ is documented for MSVC, but __SSE4_1__ isn't
#if defined(__AVX__) || defined(__SSE4_1__)

inline constexpr bool has_ssse3() { return true; }

inline constexpr bool has_sse4_1() { return true; }

#else

struct CpuidResult {
    unsigned int eax;
    unsigned int ebx;
    unsigned int ecx;
    unsigned int edx;
};

inline CpuidResult cpuid(unsigned int leaf, unsigned int subleaf)
{
#ifdef _MSC_VER
    int info[4];
    __cpuidex(info, static_cast<int>(leaf), static_cast<int>(subleaf));
    return CpuidResult{
        static_cast<unsigned int>(info[0]),
        static_cast<unsigned int>(info[1]),
        static_cast<unsigned int>(info[2]),
        static_cast<unsigned int>(info[3]),
    };
#else
    CpuidResult info;
    asm("cpuid\n\t"
        : "=a"(info.eax), "=b"(info.ebx), "=c"(info.ecx), "=d"(info.edx)
        : "0"(leaf), "2"(subleaf));
    return info;
#endif
}

inline unsigned int cpuid_leaf1_ecx()
{
    // queried once; C++11 guarantees thread-safe initialisation of function-local statics
    static unsigned int const ecx = cpuid(1, 0).ecx;
    return ecx;
}

#if defined(__SSSE3__)
inline constexpr bool has_ssse3() { return true; }
#else
// CPUID leaf 1, ECX bit 9: SSSE3
inline bool has_ssse3() { return (cpuid_leaf1_ecx() & (1 << 9)) != 0; }
#endif

// CPUID leaf 1, ECX bit 19: SSE4.1
inline bool has_sse4_1() { return (cpuid_leaf1_ecx() & (1 << 19)) != 0; }

#endif  // defined(__AVX__) || defined(__SSE4_1__)

#endif  // defined(SVB16_X64)

================================================
FILE: 
c++/pod5_format/svb16/streamvbytedelta_decode_16.c
================================================
#include "streamvbyte_isadetection.h"
#include "streamvbytedelta.h"

#include <string.h>  // for memcpy

static inline uint16_t zigzag_decode_16(uint16_t val)
{
    return (val >> 1) ^ (uint16_t)(0 - (val & 1));
}

static inline uint16_t _decode_data(uint8_t const ** dataPtrPtr, uint8_t code)
{
    uint8_t const * dataPtr = *dataPtrPtr;
    uint16_t val;

    if (code == 0) {  // 1 byte
        val = (uint16_t)*dataPtr;
        dataPtr += 1;
    } else {  // 2 bytes
        val = 0;
        memcpy(&val, dataPtr, 2);  // assumes little endian
        dataPtr += 2;
    }

    *dataPtrPtr = dataPtr;
    return val;
}

static uint8_t const * svb_decode_scalar_d1_init(
    uint16_t * outPtr,
    uint8_t const * keyPtr,
    uint8_t const * dataPtr,
    uint32_t count,
    uint16_t prev)
{
    if (count == 0) {
        return dataPtr;  // no reads or writes if no data
    }

    uint8_t shift = 0;
    uint16_t key = *keyPtr++;

    for (uint32_t c = 0; c < count; c++) {
        if (shift == 8) {
            shift = 0;
            key = *keyPtr++;
        }
        uint16_t val = zigzag_decode_16(_decode_data(&dataPtr, (key >> shift) & 0x1));
        //uint16_t val = _decode_data(&dataPtr, (key >> shift) & 0x1);
        val += prev;
        *outPtr++ = val;
        prev = val;
        shift += 1;
    }

    return dataPtr;  // pointer to first unused byte after end
}

#ifdef STREAMVBYTE_X64
#include "streamvbytedelta_x64_decode_16.c"
#endif

size_t streamvbyte_zigzag_delta_decode_16(
    uint8_t const * in,
    uint16_t * out,
    uint32_t count,
    uint16_t prev)
{
    // keyLen = ceil(count / 8), without overflowing (1 bit per input value):
    uint32_t keyLen = (count >> 3) + (((count & 7) + 7) >> 3);
    uint8_t const * keyPtr = in;
    uint8_t const * dataPtr = keyPtr + keyLen;  // data starts at end of keys
#ifdef STREAMVBYTE_X64
    if (streamvbyte_ssse3()) {
        return svb_decode_avx_d1_init(out, keyPtr, dataPtr, count, prev) - in;
    }
#endif
    return svb_decode_scalar_d1_init(out, keyPtr, dataPtr, count, prev) - in;
}

================================================
FILE: c++/pod5_format/svb16/streamvbytedelta_encode_16.c
================================================ #include "streamvbyte_isadetection.h" #include "streamvbytedelta.h" #include #include // for memcpy #ifdef STREAMVBYTE_X64 #include "streamvbytedelta_x64_encode_16.c" #endif static inline uint16_t _zigzag_encode_16(uint16_t val) { return (val + val) ^ ((int16_t)val >> 15); } static uint8_t _encode_data(uint16_t val, uint8_t * __restrict__ * dataPtrPtr) { uint8_t * dataPtr = *dataPtrPtr; uint8_t code; if (val < (1 << 8)) { // 1 byte *dataPtr = (uint8_t)(val); *dataPtrPtr += 1; code = 0; } else { // 2 bytes memcpy(dataPtr, &val, 2); // assumes little endian *dataPtrPtr += 2; code = 1; } return code; } static uint8_t * svb_encode_scalar_d1_init( uint16_t const * in, uint8_t * __restrict__ keyPtr, uint8_t * __restrict__ dataPtr, uint32_t count, uint16_t prev) { if (count == 0) { return dataPtr; // exit immediately if no data } uint8_t shift = 0; // cycles 0 through 7 then resets uint8_t key = 0; for (uint32_t c = 0; c < count; c++) { if (shift == 8) { shift = 0; *keyPtr++ = key; key = 0; } uint16_t val = _zigzag_encode_16((uint16_t)(in[c] - prev)); //uint16_t val = in[c] - prev; prev = in[c]; uint8_t code = _encode_data(val, &dataPtr); key |= code << shift; shift += 1; } *keyPtr = key; // write last key (no increment needed) return dataPtr; // pointer to first unused data byte } size_t streamvbyte_zigzag_delta_encode_16( uint16_t const * in, uint32_t count, uint8_t * out, uint16_t prev) { #ifdef STREAMVBYTE_X64 if (streamvbyte_ssse3()) { return streamvbyte_zigzag_delta_encode_SSSE3_d1_init(in, count, out, prev); } #endif uint8_t * keyPtr = out; // keys come at start // keyLen = ceil(count / 8), without overflowing (1 bit per input value): uint32_t keyLen = (count >> 3) + (((count & 7) + 7) >> 3); uint8_t * dataPtr = keyPtr + keyLen; // variable byte data after all keys return svb_encode_scalar_d1_init(in, keyPtr, dataPtr, count, prev) - out; } ================================================ FILE: 
c++/pod5_format/svb16/streamvbytedelta_x64_decode_16.c
================================================
#include "streamvbyte_isadetection.h"
#include "streamvbyte_shuffle_tables_decode_16.h"

#include <string.h>  // for memcpy

#ifdef STREAMVBYTE_X64
STREAMVBYTE_TARGET_SSSE3
static __m128i undo_zigzag_16(__m128i buf)
{
    return _mm_xor_si128(
        // N >> 1
        _mm_srli_epi16(buf, 1),
        // 0xFFFF if N & 1 else 0x0000
        _mm_srai_epi16(_mm_slli_epi16(buf, 15), 15)
        // alternative: _mm_sign_epi16(ones, _mm_slli_epi16(buf, 15))
    );
}

STREAMVBYTE_UNTARGET_REGION

STREAMVBYTE_TARGET_SSSE3
static inline __m128i _decode_avx(uint32_t key, uint8_t const * __restrict__ * dataPtrPtr)
{
    uint8_t len = 8 + popcount(key);
    __m128i Data = _mm_loadu_si128((__m128i *)*dataPtrPtr);
    __m128i Shuf = *(__m128i *)&shuffleTable[key];

    Data = _mm_shuffle_epi8(Data, Shuf);
    *dataPtrPtr += len;

    return Data;
}

STREAMVBYTE_UNTARGET_REGION

STREAMVBYTE_TARGET_SSSE3
static inline void _write_avx(uint16_t * out, __m128i Vec)
{
    _mm_storeu_si128((__m128i *)out, Vec);
}

STREAMVBYTE_UNTARGET_REGION

STREAMVBYTE_TARGET_SSSE3
static inline __m128i _write_16bit_avx_d1(uint16_t * out, __m128i Vec, __m128i Prev)
{
#ifndef _MSC_VER
    __m128i BroadcastLast16 = {0x0F0E0F0E0F0E0F0E, 0x0F0E0F0E0F0E0F0E};
#else
    __m128i BroadcastLast16 = {14, 15, 14, 15, 14, 15, 14, 15, 14, 15, 14, 15, 14, 15, 14, 15};
#endif
    Vec = undo_zigzag_16(Vec);
    // vec == [A B C D E F G H] (16 bit values)
    __m128i Add = _mm_slli_si128(Vec, 2);            // [- A B C D E F G]
    Prev = _mm_shuffle_epi8(Prev, BroadcastLast16);  // [P P P P P P P P]
    Vec = _mm_add_epi16(Vec, Add);                   // [A AB BC CD DE EF FG GH]
    Add = _mm_slli_si128(Vec, 4);                    // [- - A AB BC CD DE EF]
    Vec = _mm_add_epi16(Vec, Add);                   // [A AB ABC ABCD BCDE CDEF DEFG EFGH]
    Add = _mm_slli_si128(Vec, 8);                    // [- - - - A AB ABC ABCD]
    Vec = _mm_add_epi16(Vec, Add);                   // [A AB ABC ABCD ABCDE ABCDEF ABCDEFG ABCDEFGH]
    Vec = _mm_add_epi16(Vec, Prev);                  // [PA PAB PABC PABCD PABCDE PABCDEF PABCDEFG PABCDEFGH]
    _write_avx(out, Vec);
    return Vec;
}
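// Illustrative sketch, not part of the library: a scalar model of the shift-and-add
// prefix sum that _write_16bit_avx_d1 performs on eight 16-bit lanes. The helper name
// prefix_sum_8_sketch is hypothetical, and the zigzag step is assumed to have been
// undone already. Each pass adds a copy of the lanes moved up by 1, 2, then 4
// positions (zero-filled, like _mm_slli_si128), so three passes produce the running
// sum; prev is then added to every lane.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper: in-place inclusive prefix sum over 8 lanes, offset by prev.
// Arithmetic wraps modulo 2^16, matching _mm_add_epi16.
static void prefix_sum_8_sketch(uint16_t lanes[8], uint16_t prev)
{
    for (int shift = 1; shift < 8; shift <<= 1) {
        // Model of _mm_slli_si128: lanes move up by `shift`, low lanes become zero.
        uint16_t shifted[8] = {0};
        memcpy(shifted + shift, lanes, (size_t)(8 - shift) * sizeof(uint16_t));

        for (int i = 0; i < 8; ++i) {
            lanes[i] = (uint16_t)(lanes[i] + shifted[i]);
        }
    }

    // Model of broadcasting the last decoded value and adding it to every lane.
    for (int i = 0; i < 8; ++i) {
        lanes[i] = (uint16_t)(lanes[i] + prev);
    }
}
```

// For deltas {1, 2, 3, 4, 5, 6, 7, 8} and prev = 100 the lanes end up as
// {101, 103, 106, 110, 115, 121, 128, 136}; the last lane is what the next
// group of eight values would use as its prev.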
STREAMVBYTE_UNTARGET_REGION STREAMVBYTE_TARGET_SSSE3 static uint8_t const * svb_decode_avx_d1_init( uint16_t * out, uint8_t const * __restrict__ keyPtr, uint8_t const * __restrict__ dataPtr, uint64_t count, uint16_t prev) { uint64_t keybytes = count / 4; // number of key bytes if (keybytes >= 8) { __m128i Prev = _mm_set1_epi16(prev); __m128i Data; int64_t Offset = -(int64_t)keybytes / 8 + 1; uint64_t const * keyPtr64 = (uint64_t const *)keyPtr - Offset; uint64_t nextkeys; memcpy(&nextkeys, keyPtr64 + Offset, sizeof(nextkeys)); for (; Offset != 0; ++Offset) { uint64_t keys = nextkeys; memcpy(&nextkeys, keyPtr64 + Offset + 1, sizeof(nextkeys)); // faster 16-bit delta since we only have 8-bit values if (!keys) { // 32 1-byte ints in a row // _mm_cvtepu8_epi16: SSE4.1 Data = _mm_cvtepu8_epi16(_mm_lddqu_si128((__m128i *)(dataPtr))); Prev = _write_16bit_avx_d1(out, Data, Prev); Data = _mm_cvtepu8_epi16(_mm_lddqu_si128((__m128i *)(dataPtr + 8))); Prev = _write_16bit_avx_d1(out + 8, Data, Prev); Data = _mm_cvtepu8_epi16(_mm_lddqu_si128((__m128i *)(dataPtr + 16))); Prev = _write_16bit_avx_d1(out + 16, Data, Prev); Data = _mm_cvtepu8_epi16(_mm_lddqu_si128((__m128i *)(dataPtr + 24))); Prev = _write_16bit_avx_d1(out + 24, Data, Prev); out += 32; dataPtr += 32; continue; } Data = _decode_avx(keys & 0x00FF, &dataPtr); Prev = _write_16bit_avx_d1(out, Data, Prev); Data = _decode_avx((keys & 0xFF00) >> 8, &dataPtr); Prev = _write_16bit_avx_d1(out + 4, Data, Prev); keys >>= 16; Data = _decode_avx((keys & 0x00FF), &dataPtr); Prev = _write_16bit_avx_d1(out + 8, Data, Prev); Data = _decode_avx((keys & 0xFF00) >> 8, &dataPtr); Prev = _write_16bit_avx_d1(out + 12, Data, Prev); keys >>= 16; Data = _decode_avx((keys & 0x00FF), &dataPtr); Prev = _write_16bit_avx_d1(out + 16, Data, Prev); Data = _decode_avx((keys & 0xFF00) >> 8, &dataPtr); Prev = _write_16bit_avx_d1(out + 20, Data, Prev); keys >>= 16; Data = _decode_avx((keys & 0x00FF), &dataPtr); Prev = _write_16bit_avx_d1(out + 24, Data, 
Prev); Data = _decode_avx((keys & 0xFF00) >> 8, &dataPtr); Prev = _write_16bit_avx_d1(out + 28, Data, Prev); out += 32; } { uint64_t keys = nextkeys; // faster 16-bit delta since we only have 8-bit values if (!keys) { // 32 1-byte ints in a row Data = _mm_cvtepu8_epi16(_mm_lddqu_si128((__m128i *)(dataPtr))); Prev = _write_16bit_avx_d1(out, Data, Prev); Data = _mm_cvtepu8_epi16(_mm_lddqu_si128((__m128i *)(dataPtr + 8))); Prev = _write_16bit_avx_d1(out + 8, Data, Prev); Data = _mm_cvtepu8_epi16(_mm_lddqu_si128((__m128i *)(dataPtr + 16))); Prev = _write_16bit_avx_d1(out + 16, Data, Prev); Data = _mm_cvtepu8_epi16(_mm_loadl_epi64((__m128i *)(dataPtr + 24))); Prev = _write_16bit_avx_d1(out + 24, Data, Prev); out += 32; dataPtr += 32; } else { Data = _decode_avx(keys & 0x00FF, &dataPtr); Prev = _write_16bit_avx_d1(out, Data, Prev); Data = _decode_avx((keys & 0xFF00) >> 8, &dataPtr); Prev = _write_16bit_avx_d1(out + 4, Data, Prev); keys >>= 16; Data = _decode_avx((keys & 0x00FF), &dataPtr); Prev = _write_16bit_avx_d1(out + 8, Data, Prev); Data = _decode_avx((keys & 0xFF00) >> 8, &dataPtr); Prev = _write_16bit_avx_d1(out + 12, Data, Prev); keys >>= 16; Data = _decode_avx((keys & 0x00FF), &dataPtr); Prev = _write_16bit_avx_d1(out + 16, Data, Prev); Data = _decode_avx((keys & 0xFF00) >> 8, &dataPtr); Prev = _write_16bit_avx_d1(out + 20, Data, Prev); keys >>= 16; Data = _decode_avx((keys & 0x00FF), &dataPtr); Prev = _write_16bit_avx_d1(out + 24, Data, Prev); Data = _decode_avx((keys & 0xFF00) >> 8, &dataPtr); Prev = _write_16bit_avx_d1(out + 28, Data, Prev); out += 32; } } prev = out[-1]; } uint64_t consumedkeys = keybytes - (keybytes & 7); return svb_decode_scalar_d1_init(out, keyPtr + consumedkeys, dataPtr, count & 31, prev); } STREAMVBYTE_UNTARGET_REGION #endif ================================================ FILE: c++/pod5_format/svb16/streamvbytedelta_x64_encode_16.c ================================================ #include "streamvbyte_isadetection.h" #include 
"streamvbyte_shuffle_tables_encode_16.h" #include #include #ifdef STREAMVBYTE_X64 STREAMVBYTE_TARGET_SSSE3 static __m128i Delta(__m128i curr, __m128i prev) { // _mm_alignr_epi8: SSSE3 return _mm_sub_epi16(curr, _mm_alignr_epi8(curr, prev, 14)); } STREAMVBYTE_UNTARGET_REGION STREAMVBYTE_TARGET_SSSE3 static __m128i zigzag_16(__m128i buf) { return _mm_xor_si128(_mm_add_epi16(buf, buf), _mm_srai_epi16(buf, 16)); } STREAMVBYTE_UNTARGET_REGION // based on code by aqrit (streamvbyte_encode_SSSE3) STREAMVBYTE_TARGET_SSSE3 size_t streamvbyte_zigzag_delta_encode_SSSE3_d1_init( uint16_t const * in, uint32_t count, uint8_t * out, uint16_t prev) { __m128i Prev = _mm_set1_epi16(prev); uint32_t keyLen = (count >> 3) + (((count & 7) + 7) >> 3); // 1-bit rounded to full byte uint8_t * restrict keyPtr = &out[0]; uint8_t * restrict dataPtr = &out[keyLen]; // variable length data after keys __m128i const mask_01 = _mm_set1_epi8(0x01); for (uint16_t const * end = &in [(count & ~15)]; in != end; in += 16) { __m128i rawr0, r0, rawr1, r1, r2, r3; size_t keys; rawr0 = _mm_loadu_si128((__m128i *)&in[0]); r0 = zigzag_16(Delta(rawr0, Prev)); Prev = rawr0; rawr1 = _mm_loadu_si128((__m128i *)&in[8]); r1 = zigzag_16(Delta(rawr1, Prev)); Prev = rawr1; // 1 if the byte is set, 0 if not r2 = _mm_min_epu8(mask_01, r0); r3 = _mm_min_epu8(mask_01, r1); // for each (u)int16, FF if the MSB is set, 00 or 01 if not (us = unsigned saturation) r2 = _mm_packus_epi16(r2, r3); // for each byte, store a bit: 1 if FF, 0 if 00 or 01 (so 1 if MSB is set, 0 if not) keys = (size_t)_mm_movemask_epi8(r2); r2 = _mm_loadu_si128((__m128i *)&shuf_lut[(keys << 4) & 0x07F0]); r3 = _mm_loadu_si128((__m128i *)&shuf_lut[(keys >> 4) & 0x07F0]); // _mm_shuffle_epi8: SSSE3 r0 = _mm_shuffle_epi8(r0, r2); r1 = _mm_shuffle_epi8(r1, r3); _mm_storeu_si128((__m128i *)dataPtr, r0); dataPtr += 8 + popcount(keys & 0xFF); _mm_storeu_si128((__m128i *)dataPtr, r1); dataPtr += 8 + popcount(keys >> 8); *((uint16_t *)keyPtr) = (uint16_t)keys; 
keyPtr += 2; } prev = _mm_extract_epi16(Prev, 7); // do remaining - max two control bytes left uint16_t key = 0; for (size_t i = 0; i < (count & 15); i++) { // TODO: can we factor this out to reuse the non-intrinsic code? uint16_t dw = in[i] - prev; prev = in[i]; uint16_t zz = (dw + dw) ^ ((int16_t)dw >> 15); uint16_t symbol = (zz > 0x00FF); key |= symbol << (i + i); *((uint16_t *)dataPtr) = zz; dataPtr += 1 + symbol; } memcpy(keyPtr, &key, ((count & 15) + 5) >> 3); return dataPtr - out; } STREAMVBYTE_UNTARGET_REGION #endif ================================================ FILE: c++/pod5_format/svb16/svb16.c ================================================ ================================================ FILE: c++/pod5_format/svb16/svb16.h ================================================ #ifndef SVB16_H #define SVB16_H #include #if defined(__cplusplus) extern "C" { #endif /// Get the number of key bytes required to encode a given number of 16-bit integers. inline uint32_t svb16_key_length(uint32_t count) { // ceil(count / 8.0), without overflowing or using fp arithmetic return (count >> 3) + (((count & 7) + 7) >> 3); } /// Get the maximum number of bytes required to encode a given number of 16-bit integers. 
inline uint32_t svb16_max_encoded_length(uint32_t count) { return svb16_key_length(count) + (2 * count); } #if defined(__cplusplus) }; #endif #endif // SVB16_H ================================================ FILE: c++/pod5_format/table_reader.cpp ================================================ #include "pod5_format/table_reader.h" #include #include #include namespace pod5 { TableRecordBatch::TableRecordBatch(std::shared_ptr const & batch) : m_batch(batch) { } TableRecordBatch::TableRecordBatch(std::shared_ptr && batch) : m_batch(std::move(batch)) { } TableRecordBatch::TableRecordBatch(TableRecordBatch const &) = default; TableRecordBatch & TableRecordBatch::operator=(TableRecordBatch const &) = default; TableRecordBatch::TableRecordBatch(TableRecordBatch &&) = default; TableRecordBatch & TableRecordBatch::operator=(TableRecordBatch &&) = default; TableRecordBatch::~TableRecordBatch() = default; std::size_t TableRecordBatch::num_rows() const { return m_batch->num_rows(); } //--------------------------------------------------------------------------------------------------------------------- TableReader::TableReader( std::shared_ptr && input_source, std::shared_ptr && reader, SchemaMetadataDescription && schema_metadata, arrow::MemoryPool * pool) : m_input_source(std::move(input_source)) , m_reader(std::move(reader)) , m_schema_metadata(std::move(schema_metadata)) { } TableReader::TableReader(TableReader &&) = default; TableReader & TableReader::operator=(TableReader &&) = default; TableReader::~TableReader() = default; std::size_t TableReader::num_record_batches() const { return m_reader->num_record_batches(); } Result TableReader::CountRows() const { return m_reader->CountRows(); } Result> TableReader::ReadRecordBatch(int i) const { return ReadRecordBatchAndValidate(*m_reader, i); } Result> ReadRecordBatchAndValidate( arrow::ipc::RecordBatchFileReader & reader, int i) { ARROW_ASSIGN_OR_RAISE(auto batch, reader.ReadRecordBatch(i)); 
ARROW_RETURN_NOT_OK(batch->ValidateFull()); // Check that the data buffers are aligned. std::vector unaligned_columns; unaligned_columns.reserve(batch->num_columns()); if (!arrow::util::CheckAlignment(*batch, arrow::util::kValueAlignment, &unaligned_columns)) { return Status::Invalid("Column data alignment check failed"); } return batch; } } // namespace pod5 ================================================ FILE: c++/pod5_format/table_reader.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/schema_metadata.h" #include namespace arrow { class MemoryPool; class RecordBatch; namespace ipc { class RecordBatchFileReader; } } // namespace arrow namespace pod5 { class POD5_FORMAT_EXPORT TableRecordBatch { public: TableRecordBatch(std::shared_ptr const & batch); TableRecordBatch(std::shared_ptr && batch); TableRecordBatch(TableRecordBatch &&); TableRecordBatch & operator=(TableRecordBatch &&); TableRecordBatch(TableRecordBatch const &); TableRecordBatch & operator=(TableRecordBatch const &); ~TableRecordBatch(); std::size_t num_rows() const; std::shared_ptr const & batch() const { return m_batch; } private: std::shared_ptr m_batch; }; class POD5_FORMAT_EXPORT TableReader { public: TableReader( std::shared_ptr && input_source, std::shared_ptr && reader, SchemaMetadataDescription && schema_metadata, arrow::MemoryPool * pool); TableReader(TableReader &&); TableReader & operator=(TableReader &&); TableReader(TableReader const &) = delete; TableReader & operator=(TableReader const &) = delete; ~TableReader(); SchemaMetadataDescription const & schema_metadata() const { return m_schema_metadata; } std::size_t num_record_batches() const; Result CountRows() const; Result> ReadRecordBatch(int i) const; private: std::shared_ptr m_input_source; std::shared_ptr m_reader; SchemaMetadataDescription m_schema_metadata; }; // Same as RecordBatchFileReader::ReadRecordBatch() but validates the contents. 
Result> ReadRecordBatchAndValidate( arrow::ipc::RecordBatchFileReader & reader, int i); } // namespace pod5 ================================================ FILE: c++/pod5_format/thread_pool.cpp ================================================ #include "pod5_format/thread_pool.h" #include #include #include #include #include #include #include #include #include namespace pod5 { namespace { class ThreadPoolImpl : public ThreadPool, public std::enable_shared_from_this { public: ThreadPoolImpl(std::size_t worker_count) { assert(worker_count > 0); for (std::size_t i = 0; i < std::max(1, worker_count); ++i) { m_threads.emplace_back([&] { run_thread(); }); } } ~ThreadPoolImpl() { stop_and_drain(); } void run_thread() { bool keep_alive = true; std::optional work; while (keep_alive) { { std::unique_lock lock{m_work_mutex}; if (work) { if (work->strand_id != NO_STRAND) { m_busy_strands[work->strand_id] = false; } work = std::nullopt; } // find the first piece of work whose strand isn't already busy for (auto it = m_work.begin(); it != m_work.end(); ++it) { if (it->strand_id == NO_STRAND || !m_busy_strands.at(it->strand_id)) { if (it->strand_id != NO_STRAND) { m_busy_strands[it->strand_id] = true; } work = std::move(*it); m_work.erase(it); break; } } if (!work) { if (m_keep_alive) { m_work_ready.wait(lock); keep_alive = m_keep_alive || !m_work.empty(); } else { // If there wasn't any work for us to pick up, any remaining work must be // for strands with running tasks (in which case the workers handling those // tasks will pick them up). This will work because once a task finishes, the // worker will check for work *before* checking m_keep_alive. Thus it's safe // for *this* worker to exit. 
keep_alive = false; } continue; } } assert(work); if (work->callback) { work->callback(); } } } void post(std::function callback) override { { std::lock_guard l{m_work_mutex}; if (!m_keep_alive) { throw std::logic_error{"ThreadPool: post() called after stop_and_drain()"}; } m_work.emplace_back(WorkItem{std::move(callback), NO_STRAND}); } m_work_ready.notify_one(); } void post(std::function callback, uint64_t const strand_id) { assert(strand_id != NO_STRAND); std::lock_guard l{m_work_mutex}; if (!m_keep_alive) { throw std::logic_error{"ThreadPool: post() called after stop_and_drain()"}; } if (m_busy_strands.size() <= strand_id) { m_busy_strands.resize(strand_id + 1); } m_work.emplace_back(WorkItem{std::move(callback), strand_id}); // only send a wakeup if the strand isn't already busy - it's generally more efficient to do // this outside the lock, but the conditional makes that hard to reason about if (!m_busy_strands.at(strand_id)) { m_work_ready.notify_one(); } } void stop_and_drain() override { { std::lock_guard lock{m_work_mutex}; m_keep_alive = false; } m_work_ready.notify_all(); for (auto & thread : m_threads) { if (thread.joinable()) { thread.join(); } } assert(m_work.empty()); } void wait_for_drain() override { auto const drained = [&]() { std::lock_guard lock{m_work_mutex}; return m_work.empty(); }; while (!drained()) { m_work_ready.notify_all(); std::this_thread::sleep_for(std::chrono::milliseconds{10}); } } std::shared_ptr create_strand() override; private: struct WorkItem { std::function callback; uint64_t strand_id; explicit operator bool() const { return !!callback; } }; static constexpr uint64_t NO_STRAND = UINT64_MAX; std::mutex m_work_mutex; bool m_keep_alive{true}; std::condition_variable m_work_ready; std::deque m_work; std::vector m_busy_strands; std::atomic m_next_strand_id{0}; std::vector m_threads; }; class StrandImpl : public ThreadPoolStrand { public: StrandImpl(std::shared_ptr pool, uint64_t const strand_id) : m_pool(std::move(pool)) , 
m_strand_id(strand_id) { } void post(std::function callback) override { m_pool->post(std::move(callback), m_strand_id); } std::shared_ptr m_pool; uint64_t m_strand_id; }; std::shared_ptr ThreadPoolImpl::create_strand() { uint64_t strand_id; { std::lock_guard l{m_work_mutex}; if (!m_keep_alive) { throw std::logic_error{"ThreadPool: create_strand() called after stop_and_drain()"}; } strand_id = m_next_strand_id++; } return std::make_shared(shared_from_this(), strand_id); } } // namespace std::shared_ptr make_thread_pool(std::size_t worker_threads) { return std::make_shared(worker_threads); } } // namespace pod5 ================================================ FILE: c++/pod5_format/thread_pool.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include #include namespace pod5 { class POD5_FORMAT_EXPORT ThreadPoolStrand { public: virtual ~ThreadPoolStrand() = default; virtual void post(std::function callback) = 0; }; class POD5_FORMAT_EXPORT ThreadPool { public: virtual ~ThreadPool() = default; virtual std::shared_ptr create_strand() = 0; virtual void post(std::function callback) = 0; /// Stops the thread pool and drains all active work. /// /// Further calls to create_strand() or post() (including on an existing strand created from /// this pool) will throw. virtual void stop_and_drain() = 0; /// Waits for the threads to process all posted work. 
virtual void wait_for_drain() = 0; }; POD5_FORMAT_EXPORT std::shared_ptr make_thread_pool(std::size_t worker_threads); } // namespace pod5 ================================================ FILE: c++/pod5_format/tuple_utils.h ================================================ #pragma once #include #include namespace pod5 { namespace detail { template void for_each(T && t, F f, std::integer_sequence) { auto l = {(f(std::get(t), Is), 0)...}; (void)l; } template void for_each_in_tuple(std::tuple & t, F f) { detail::for_each(t, f, std::make_integer_sequence()); } template void for_each_zipped(T1 && t1, T2 && t2, F f, std::integer_sequence) { auto l = {(f(std::get(t1), std::get(t2), Is), 0)...}; (void)l; } template void for_each_in_tuple_zipped(T1 & t1, T2 & t2, F f) { static_assert( std::tuple_size::value == std::tuple_size::value, "Tuples must be same size"); detail::for_each_zipped( t1, t2, f, std::make_integer_sequence::value>()); } }} // namespace pod5::detail ================================================ FILE: c++/pod5_format/types.cpp ================================================ #include "pod5_format/types.h" #include #include #include #include namespace pod5 { Uuid const * UuidArray::raw_values() const { auto const & array = static_cast(*storage_); return reinterpret_cast(array.GetValue(0)); } Uuid UuidArray::Value(int64_t i) const { auto const & array = static_cast(*storage_); return *reinterpret_cast(array.GetValue(i)); } bool UuidType::ExtensionEquals(ExtensionType const & other) const { // no parameters to consider return other.extension_name() == extension_name(); } std::shared_ptr UuidType::MakeArray(std::shared_ptr data) const { DCHECK_EQ(data->type->id(), arrow::Type::EXTENSION); DCHECK_EQ( static_cast(*data->type).extension_name(), extension_name()); return std::make_shared(data); } std::string UuidType::Serialize() const { return ""; } arrow::Result> UuidType::Deserialize( std::shared_ptr storage_type, std::string const & serialized_data) const { if 
(serialized_data != "") { return arrow::Status::Invalid("Unexpected type metadata: '", serialized_data, "'"); } if (!storage_type->Equals(*arrow::fixed_size_binary(16))) { return arrow::Status::Invalid( "Incorrect storage for UuidType: '", storage_type->ToString(), "'"); } return std::make_shared(); } gsl::span VbzSignalArray::Value(int64_t i) const { auto const & array = static_cast(*storage_); arrow::LargeBinaryArray::offset_type value_length = 0; auto value_ptr = array.GetValue(i, &value_length); return gsl::make_span(value_ptr, value_length); } std::shared_ptr VbzSignalArray::ValueAsBuffer(int64_t i) const { auto const & array = static_cast(*storage_); auto offset = array.value_offset(i); auto length = array.value_length(i); auto const value_data = array.value_data(); return arrow::SliceBuffer(value_data, offset, length); } bool VbzSignalType::ExtensionEquals(ExtensionType const & other) const { // no parameters to consider return other.extension_name() == extension_name(); } std::shared_ptr VbzSignalType::MakeArray(std::shared_ptr data) const { DCHECK_EQ(data->type->id(), arrow::Type::EXTENSION); DCHECK_EQ( static_cast(*data->type).extension_name(), extension_name()); return std::make_shared(data); } std::string VbzSignalType::Serialize() const { return ""; } arrow::Result> VbzSignalType::Deserialize( std::shared_ptr storage_type, std::string const & serialized_data) const { if (serialized_data != "") { return arrow::Status::Invalid("Unexpected type metadata: '", serialized_data, "'"); } if (!storage_type->Equals(*arrow::large_binary())) { return arrow::Status::Invalid( "Incorrect storage for VbzSignalType: '", storage_type->ToString(), "'"); } return std::make_shared(); } std::unique_ptr make_read_id_builder(arrow::MemoryPool * pool) { auto const & uuid_type = uuid(); assert(uuid_type->id() == arrow::Type::EXTENSION); auto result = std::make_unique(uuid_type->storage_type(), pool); assert(result->byte_width() == 16); return result; } std::shared_ptr const & 
vbz_signal() { static auto vbz_signal = std::make_shared(); return vbz_signal; } std::shared_ptr const & uuid() { static auto uuid = std::make_shared(); return uuid; } namespace { std::mutex & get_pod5_register_mutex() { // Heap allocated so that it's safe for user code to call during static init // and destruction, not that they should. static std::mutex & m = *new std::mutex{}; return m; } std::size_t s_pod5_register_count(0); } // namespace pod5::Status register_extension_types() { std::lock_guard lock(get_pod5_register_mutex()); if (++s_pod5_register_count == 1) { ARROW_RETURN_NOT_OK(arrow::RegisterExtensionType(uuid())); ARROW_RETURN_NOT_OK(arrow::RegisterExtensionType(vbz_signal())); } return pod5::Status::OK(); } pod5::Status unregister_extension_types() { std::lock_guard lock(get_pod5_register_mutex()); auto register_count = --s_pod5_register_count; if (register_count == 0) { if (arrow::GetExtensionType("minknow.uuid")) { ARROW_RETURN_NOT_OK(arrow::UnregisterExtensionType("minknow.uuid")); } if (arrow::GetExtensionType("minknow.vbz")) { ARROW_RETURN_NOT_OK(arrow::UnregisterExtensionType("minknow.vbz")); } } return pod5::Status::OK(); } bool check_extension_types_registered() { std::lock_guard lock(get_pod5_register_mutex()); return s_pod5_register_count > 0; } } // namespace pod5 ================================================ FILE: c++/pod5_format/types.h ================================================ #pragma once #include "pod5_format/pod5_format_export.h" #include "pod5_format/result.h" #include "pod5_format/uuid.h" #include #include #include namespace pod5 { class POD5_FORMAT_EXPORT UuidArray : public arrow::ExtensionArray { public: using IteratorType = arrow::stl::ArrayIterator; using ExtensionArray::ExtensionArray; Uuid const * raw_values() const; Uuid Value(int64_t i) const; // this isn't actually a view - it copies the data - but // (a) it's only 16 bytes, which is what a view (pointer + size) would require anyway // (b) arrow::std::ArrayIterator 
hard-codes the name of this method (even though it is supposed // to be configurable via the ValueAccessor template parameter) Uuid GetView(int64_t i) const { return Value(i); } std::optional operator[](int64_t i) const { return *IteratorType(*this, i); } IteratorType begin() const { return IteratorType(*this); } IteratorType end() const { return IteratorType(*this, length()); } }; class POD5_FORMAT_EXPORT UuidType : public arrow::ExtensionType { public: UuidType() : ExtensionType(arrow::fixed_size_binary(16)) {} std::string extension_name() const override { return "minknow.uuid"; } bool ExtensionEquals(ExtensionType const & other) const override; std::shared_ptr MakeArray(std::shared_ptr data) const override; std::string Serialize() const override; arrow::Result> Deserialize( std::shared_ptr storage_type, std::string const & serialized_data) const override; }; class POD5_FORMAT_EXPORT VbzSignalArray : public arrow::ExtensionArray { public: using IteratorType = arrow::stl::ArrayIterator; gsl::span Value(int64_t i) const; std::shared_ptr ValueAsBuffer(int64_t i) const; using ExtensionArray::ExtensionArray; }; class POD5_FORMAT_EXPORT VbzSignalType : public arrow::ExtensionType { public: VbzSignalType() : ExtensionType(arrow::large_binary()) {} std::string extension_name() const override { return "minknow.vbz"; } bool ExtensionEquals(ExtensionType const & other) const override; std::shared_ptr MakeArray(std::shared_ptr data) const override; std::string Serialize() const override; arrow::Result> Deserialize( std::shared_ptr storage_type, std::string const & serialized_data) const override; }; std::unique_ptr make_read_id_builder(arrow::MemoryPool * pool); std::shared_ptr const & vbz_signal(); std::shared_ptr const & uuid(); /// \brief Register all required extension types. POD5_FORMAT_EXPORT pod5::Status register_extension_types(); /// \brief Unregister all required extension types. 
POD5_FORMAT_EXPORT pod5::Status unregister_extension_types(); /// \brief Returns true iff the required extension types are registered. /// \details The caller can expect the extension types to be registered if the number of calls to /// `register_extension_types` exceeds the number of calls to `unregister_extension_types`. bool check_extension_types_registered(); } // namespace pod5 ================================================ FILE: c++/pod5_format/uuid.h ================================================ #pragma once // This file contains code from https://github.com/mariusbancila/stduuid/ which has the following // license: // // MIT License // // Copyright (c) 2017 // // Permission is hereby granted, free of charge, to any person obtaining a copy // of this software and associated documentation files (the "Software"), to deal // in the Software without restriction, including without limitation the rights // to use, copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the Software is // furnished to do so, subject to the following conditions: // // The above copyright notice and this permission notice shall be included in all // copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, // FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE // AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER // LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE // SOFTWARE. 
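// Illustrative sketch, not part of this header: the Uuid class declared further down
// stores 16 raw bytes and writes them as 36 characters in the 8-4-4-4-12 hex layout
// (see Uuid::write_to). The standalone helper below, format_uuid_sketch, is a
// hypothetical name used only to show that layout in plain C; unlike write_to it
// also appends a terminating NUL, so it assumes a 37-character output buffer.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: format 16 bytes as lowercase hex with dashes at
// character positions 8, 13, 18 and 23, followed by a NUL terminator.
static void format_uuid_sketch(uint8_t const bytes[16], char out[37])
{
    static char const hex[] = "0123456789abcdef";
    size_t byte_index = 0;

    for (size_t i = 0; i < 36; ++i) {
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            out[i] = '-';
            continue;
        }
        // Each byte produces two characters: high nibble first, then low nibble.
        out[i] = hex[(bytes[byte_index] >> 4) & 0x0f];
        out[++i] = hex[bytes[byte_index] & 0x0f];
        ++byte_index;
    }
    out[36] = '\0';
}
```

// For the bytes 0x00 through 0x0f in order, the result is
// "00010203-0405-0607-0809-0a0b0c0d0e0f".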
#include #include #include #include #include #include #include namespace pod5 { namespace uuid_detail { template [[nodiscard]] constexpr inline unsigned char hex2char(TChar const ch) noexcept { if (ch >= static_cast('0') && ch <= static_cast('9')) { return static_cast(ch - static_cast('0')); } if (ch >= static_cast('a') && ch <= static_cast('f')) { return static_cast(10 + ch - static_cast('a')); } if (ch >= static_cast('A') && ch <= static_cast('F')) { return static_cast(10 + ch - static_cast('A')); } return 0; } template [[nodiscard]] constexpr inline bool is_hex(TChar const ch) noexcept { return (ch >= static_cast('0') && ch <= static_cast('9')) || (ch >= static_cast('a') && ch <= static_cast('f')) || (ch >= static_cast('A') && ch <= static_cast('F')); } template [[nodiscard]] constexpr std::basic_string_view to_string_view(TChar const * str) noexcept { if (str) { return str; } return {}; } template [[nodiscard]] constexpr std:: basic_string_view to_string_view(StringType const & str) noexcept { return str; } template inline constexpr CharT empty_guid[37] = "00000000-0000-0000-0000-000000000000"; template <> inline constexpr wchar_t empty_guid[37] = L"00000000-0000-0000-0000-000000000000"; template inline constexpr CharT guid_encoder[17] = "0123456789abcdef"; template <> inline constexpr wchar_t guid_encoder[17] = L"0123456789abcdef"; } // namespace uuid_detail // Forward declare uuid & to_string so that we can declare to_string as a friend later. class Uuid; template < class CharT = char, class Traits = std::char_traits, class Allocator = std::allocator> std::basic_string to_string(Uuid const & id); /// A representation of a Universally Unique IDentifier. /// /// This code implements part of RFC 4122. It does not aim to be a complete implementation of UUIDs, /// but provides enough for the uses in the POD5 library. 
class Uuid { public: using value_type = uint8_t; constexpr Uuid() noexcept = default; Uuid(value_type const (&arr)[16]) noexcept { std::copy(std::cbegin(arr), std::cend(arr), std::begin(m_data)); } constexpr Uuid(std::array const & arr) noexcept : m_data{arr} {} template explicit Uuid(ForwardIterator first, ForwardIterator last) { if (std::distance(first, last) == 16) { std::copy(first, last, std::begin(m_data)); } } [[nodiscard]] constexpr bool is_nil() const noexcept { for (size_t i = 0; i < m_data.size(); ++i) { if (m_data[i] != 0) { return false; } } return true; } void swap(Uuid & other) noexcept { m_data.swap(other.m_data); } uint8_t const * data() const noexcept { return m_data.data(); } size_t size() const noexcept { return m_data.size(); } void to_c_array(value_type (&arr)[16]) const noexcept { std::copy(std::cbegin(m_data), std::cend(m_data), std::begin(arr)); } // Note: uustr must be at least 36 characters template void write_to(CharT * uustr) const noexcept { for (size_t i = 0, index = 0; i < 36; ++i) { if (i == 8 || i == 13 || i == 18 || i == 23) { uustr[i] = uuid_detail::empty_guid[i]; continue; } uustr[i] = uuid_detail::guid_encoder[m_data[index] >> 4 & 0x0f]; uustr[++i] = uuid_detail::guid_encoder[m_data[index] & 0x0f]; index++; } } template [[nodiscard]] constexpr static std::optional from_string( StringType const & in_str) noexcept { auto str = uuid_detail::to_string_view(in_str); bool firstDigit = true; size_t hasBraces = 0; size_t index = 0; std::array data{{0}}; if (str.empty()) { return {}; } if (str.front() == '{') { hasBraces = 1; } if (hasBraces && str.back() != '}') { return {}; } for (size_t i = hasBraces; i < str.size() - hasBraces; ++i) { if (str[i] == '-') { continue; } if (index >= 16 || !uuid_detail::is_hex(str[i])) { return {}; } if (firstDigit) { data[index] = static_cast(uuid_detail::hex2char(str[i]) << 4); firstDigit = false; } else { data[index] = static_cast(data[index] | uuid_detail::hex2char(str[i])); index++; firstDigit = 
true; } } if (index < 16) { return {}; } return Uuid{data}; } private: std::array m_data{{0}}; friend bool operator==(Uuid const & lhs, Uuid const & rhs) noexcept; friend bool operator<(Uuid const & lhs, Uuid const & rhs) noexcept; template friend std::basic_ostream & operator<<( std::basic_ostream & s, Uuid const & id); template friend std::basic_string to_string(Uuid const & id); friend std::hash; }; [[nodiscard]] inline bool operator==(Uuid const & lhs, Uuid const & rhs) noexcept { return lhs.m_data == rhs.m_data; } [[nodiscard]] inline bool operator!=(Uuid const & lhs, Uuid const & rhs) noexcept { return !(lhs == rhs); } [[nodiscard]] inline bool operator<(Uuid const & lhs, Uuid const & rhs) noexcept { return lhs.m_data < rhs.m_data; } template [[nodiscard]] inline std::basic_string to_string(Uuid const & id) { std::basic_string uustr{uuid_detail::empty_guid}; id.write_to(uustr.data()); return uustr; } template std::basic_ostream & operator<<(std::basic_ostream & s, Uuid const & id) { s << to_string(id); return s; } inline void swap(Uuid & lhs, Uuid & rhs) noexcept { lhs.swap(rhs); } template class BasicUuidRandomGenerator { public: using engine_type = UniformRandomNumberGenerator; explicit BasicUuidRandomGenerator(engine_type & gen) : generator(&gen) {} explicit BasicUuidRandomGenerator(engine_type * gen) : generator(gen) {} [[nodiscard]] Uuid operator()() { alignas(uint32_t) uint8_t bytes[16]; for (int i = 0; i < 16; i += 4) { *reinterpret_cast(bytes + i) = distribution(*generator); } // variant must be 10xxxxxx bytes[8] &= 0xBF; bytes[8] |= 0x80; // version must be 0100xxxx bytes[6] &= 0x4F; bytes[6] |= 0x40; return Uuid{std::begin(bytes), std::end(bytes)}; } private: std::uniform_int_distribution distribution; UniformRandomNumberGenerator * generator; }; using UuidRandomGenerator = BasicUuidRandomGenerator; } // namespace pod5 namespace std { template <> struct hash { using argument_type = pod5::Uuid; using result_type = std::size_t; [[nodiscard]] 
result_type operator()(argument_type const & uuid) const { uint64_t l = static_cast(uuid.m_data[0]) << 56 | static_cast(uuid.m_data[1]) << 48 | static_cast(uuid.m_data[2]) << 40 | static_cast(uuid.m_data[3]) << 32 | static_cast(uuid.m_data[4]) << 24 | static_cast(uuid.m_data[5]) << 16 | static_cast(uuid.m_data[6]) << 8 | static_cast(uuid.m_data[7]); uint64_t h = static_cast(uuid.m_data[8]) << 56 | static_cast(uuid.m_data[9]) << 48 | static_cast(uuid.m_data[10]) << 40 | static_cast(uuid.m_data[11]) << 32 | static_cast(uuid.m_data[12]) << 24 | static_cast(uuid.m_data[13]) << 16 | static_cast(uuid.m_data[14]) << 8 | static_cast(uuid.m_data[15]); if constexpr (sizeof(result_type) > 4) { return result_type(l ^ h); } else { uint64_t hash64 = l ^ h; return result_type(uint32_t(hash64 >> 32) ^ uint32_t(hash64)); } } }; } // namespace std ================================================ FILE: c++/pod5_format/version.h.in ================================================ #pragma once #include namespace pod5 { std::uint16_t const Pod5MajorVersion = @POD5_VERSION_MAJOR@; std::uint16_t const Pod5MinorVersion = @POD5_VERSION_MINOR@; std::uint16_t const Pod5RevVersion = @POD5_VERSION_REV@; std::string const Pod5Version = "@POD5_NUMERIC_VERSION@"; } ================================================ FILE: c++/pod5_format_pybind/CMakeLists.txt ================================================ pybind11_add_module(pod5_format_pybind api.h bindings.cpp utils.h subset.cpp subset.h repack/repack_functions.h repack/repack_states.h repack/repack_utils.h repack/repack_output.cpp repack/repack_output.h repack/repacker.cpp repack/repacker.h ) target_link_libraries(pod5_format_pybind PRIVATE pod5_format ) if (NOT MSVC) target_compile_options(pod5_format_pybind PRIVATE ${pod5_warning_options}) endif() set_target_properties(pod5_format_pybind PROPERTIES POSITION_INDEPENDENT_CODE 1 CXX_STANDARD 20 ) # Non-conan license files to copy. 
set(pod5_cxx_licenses_src "${CMAKE_SOURCE_DIR}/LICENSE.md" "${CMAKE_SOURCE_DIR}/third_party/licenses/gsl-lite.txt" "${CMAKE_SOURCE_DIR}/third_party/pybind11/LICENSE" ) # Destination name for the above files. set(pod5_cxx_licenses_dst "LICENSE.md" "gsl-lite.txt" "pybind11.txt" ) set(python_project_root "${CMAKE_SOURCE_DIR}/python/lib_pod5/") configure_file( ${CMAKE_CURRENT_SOURCE_DIR}/_version.py.in ${python_project_root}/src/lib_pod5/_version.py ) set(wheel_output_stub "${CMAKE_CURRENT_BINARY_DIR}/wheel.touch") set(wheel_output_dir "${CMAKE_CURRENT_BINARY_DIR}/wheel_${POD5_FULL_VERSION}") file(MAKE_DIRECTORY ${wheel_output_dir}) add_custom_command( OUTPUT "${wheel_output_stub}" COMMAND ${CMAKE_COMMAND} ARGS -D "PYTHON_EXECUTABLE=${Python_EXECUTABLE}" -D "PYTHON_PROJECT_DIR=${python_project_root}" -D "PYBIND_INPUT_LIB=$" -D "WHEEL_OUTPUT_DIR=${wheel_output_dir}" -D "POD5_CONAN_LICENSES=${CMAKE_BINARY_DIR}/pod5_conan_licenses" -D "POD5_CXX_LICENSES_SRC=${pod5_cxx_licenses_src}" -D "POD5_CXX_LICENSES_DST=${pod5_cxx_licenses_dst}" -P "${CMAKE_CURRENT_SOURCE_DIR}/build_wheel.cmake" DEPENDS pod5_format_pybind VERBATIM ) add_custom_target(lib_pod5_python_wheel ALL SOURCES build_wheel.cmake DEPENDS "${wheel_output_stub}" ) install( DIRECTORY "${wheel_output_dir}/" DESTINATION "." 
COMPONENT wheel FILES_MATCHING PATTERN "lib_pod5*.whl" ) ================================================ FILE: c++/pod5_format_pybind/_version.py.in ================================================ # This file is auto generated by cmake during compilation __version__ = version = "@POD5_FULL_VERSION@" __version_tuple__ = version_tuple = (@POD5_VERSION_MAJOR@, @POD5_VERSION_MINOR@, @POD5_VERSION_REV@) ================================================ FILE: c++/pod5_format_pybind/api.h ================================================ #pragma once #include "pod5_format/async_signal_loader.h" #include "pod5_format/c_api.h" #include "pod5_format/file_reader.h" #include "pod5_format/file_updater.h" #include "pod5_format/file_writer.h" #include "pod5_format/read_table_reader.h" #include "pod5_format/signal_compression.h" #include "pod5_format/signal_table_reader.h" #include "pod5_format/thread_pool.h" #include "pod5_format/uuid.h" #include "utils.h" #include #include #include #include namespace py = pybind11; inline std::shared_ptr create_file( char const * path, std::string const & writer_name, pod5::FileWriterOptions const * options) { pod5::FileWriterOptions dummy; POD5_PYTHON_ASSIGN_OR_RAISE( auto writer, pod5::create_file_writer( path, writer_name, options ? *options : pod5::FileWriterOptions{})); return writer; } inline pod5::RecoveryDetails recover_file( char const * src_filename, char const * dest_filename, pod5::RecoverFileOptions const * const options) { POD5_PYTHON_ASSIGN_OR_RAISE( pod5::RecoveryDetails details, pod5::recover_file( src_filename, dest_filename, options ? 
*options : pod5::RecoverFileOptions{})); return details; } class Pod5SignalCacheBatch { public: Pod5SignalCacheBatch( pod5::AsyncSignalLoader::SamplesMode samples_mode, pod5::CachedBatchSignalData && cached_data) : m_samples_mode(samples_mode) , m_cached_data(std::move(cached_data)) { } py::array_t sample_count() const { return py::array_t( m_cached_data.sample_count().size(), m_cached_data.sample_count().data()); } py::list samples() const { py::list py_samples; if (m_samples_mode != pod5::AsyncSignalLoader::SamplesMode::Samples) { return py_samples; } for (auto const & row_samples : m_cached_data.samples()) { py_samples.append( py::array_t( {row_samples.size()}, {sizeof(std::int16_t)}, row_samples.data())); } return py_samples; } std::uint32_t batch_index() const { return m_cached_data.batch_index(); } private: pod5::AsyncSignalLoader::SamplesMode m_samples_mode; pod5::CachedBatchSignalData m_cached_data; }; class Pod5AsyncSignalLoader { public: // Make an async loader for all reads in the file Pod5AsyncSignalLoader( std::shared_ptr const & reader, pod5::AsyncSignalLoader::SamplesMode samples_mode, std::size_t worker_count = std::thread::hardware_concurrency(), std::size_t max_pending_batches = 10) : m_samples_mode(samples_mode) , m_batch_counts_ref({}) , m_batch_rows_ref({}) , m_async_loader(reader, samples_mode, {}, {}, worker_count, max_pending_batches) { } // Make an async loader for specific batches Pod5AsyncSignalLoader( std::shared_ptr const & reader, pod5::AsyncSignalLoader::SamplesMode samples_mode, py::array_t && batches, std::size_t worker_count = std::thread::hardware_concurrency(), std::size_t max_pending_batches = 10) : m_samples_mode(samples_mode) , m_batch_sizes(make_batch_counts(reader, batches)) , m_async_loader( reader, samples_mode, gsl::make_span(m_batch_sizes), {}, worker_count, max_pending_batches) { } // Make an async loader for specific reads in specific batches Pod5AsyncSignalLoader( std::shared_ptr const & reader, 
pod5::AsyncSignalLoader::SamplesMode samples_mode, py::array_t && batch_counts, py::array_t && batch_rows, std::size_t worker_count = std::thread::hardware_concurrency(), std::size_t max_pending_batches = 10) : m_samples_mode(samples_mode) , m_batch_counts_ref(std::move(batch_counts)) , m_batch_rows_ref(std::move(batch_rows)) , m_async_loader( reader, samples_mode, gsl::make_span(m_batch_counts_ref.data(), m_batch_counts_ref.size()), gsl::make_span(m_batch_rows_ref.data(), m_batch_rows_ref.size()), worker_count, max_pending_batches) { } std::shared_ptr release_next_batch() { auto batch = m_async_loader.release_next_batch(); if (!batch.ok()) { throw std::runtime_error(batch.status().ToString()); } if (!*batch) { assert(m_async_loader.is_finished()); throw pybind11::stop_iteration(); } return std::make_shared(m_samples_mode, std::move(**batch)); } std::vector make_batch_counts( std::shared_ptr const & reader, py::array_t const & batches) { std::vector batch_counts(reader->num_read_record_batches(), 0); for (auto const & batch_idx : gsl::make_span(batches.data(), batches.shape(0))) { auto read_batch = reader->read_read_record_batch(batch_idx); if (!read_batch.ok()) { throw std::runtime_error( "Failed to query read batch count: " + read_batch.status().ToString()); } batch_counts[batch_idx] = read_batch->num_rows(); } return batch_counts; } pod5::AsyncSignalLoader::SamplesMode m_samples_mode; std::vector m_batch_sizes; py::array_t m_batch_counts_ref; py::array_t m_batch_rows_ref; pod5::AsyncSignalLoader m_async_loader; }; struct Pod5FileReaderPtr { std::shared_ptr reader = nullptr; Pod5FileReaderPtr(std::shared_ptr && reader_) : reader(std::move(reader_)) {} pod5::FileLocation get_file_run_info_table_location() const { return reader->run_info_table_location(); } pod5::FileLocation get_file_read_table_location() const { return reader->read_table_location(); } pod5::FileLocation get_file_signal_table_location() const { return reader->signal_table_location(); } std::string 
get_file_version_pre_migration() const { return reader->file_version_pre_migration().to_string(); } void close() { reader = nullptr; } std::size_t plan_traversal( py::array_t const & read_id_data, py::array_t & batch_counts, py::array_t & batch_rows) { auto const read_id_count = read_id_data.shape(0); auto search_input = pod5::ReadIdSearchInput( gsl::make_span( reinterpret_cast(read_id_data.data()), read_id_count)); POD5_PYTHON_ASSIGN_OR_RAISE( auto find_success_count, reader->search_for_read_ids( search_input, gsl::make_span(batch_counts.mutable_data(), reader->num_read_record_batches()), gsl::make_span(batch_rows.mutable_data(), read_id_count))); return find_success_count; } std::shared_ptr batch_get_signal(bool get_samples, bool get_sample_count) { return std::make_shared( reader, get_samples ? pod5::AsyncSignalLoader::SamplesMode::Samples : pod5::AsyncSignalLoader::SamplesMode::NoSamples); } std::shared_ptr batch_get_signal_batches( bool get_samples, bool get_sample_count, py::array_t && batches) { return std::make_shared( reader, get_samples ? pod5::AsyncSignalLoader::SamplesMode::Samples : pod5::AsyncSignalLoader::SamplesMode::NoSamples, std::move(batches)); } std::shared_ptr batch_get_signal_selection( bool get_samples, bool get_sample_count, py::array_t && batch_counts, py::array_t && batch_rows) { return std::make_shared( reader, get_samples ? 
pod5::AsyncSignalLoader::SamplesMode::Samples : pod5::AsyncSignalLoader::SamplesMode::NoSamples, std::move(batch_counts), std::move(batch_rows)); } }; inline Pod5FileReaderPtr open_file(char const * filename) { POD5_PYTHON_ASSIGN_OR_RAISE(auto reader, pod5::open_file_reader(filename, {})); return Pod5FileReaderPtr(std::move(reader)); } inline void write_updated_file_to_dest(Pod5FileReaderPtr source, char const * dest_filename) { POD5_PYTHON_RETURN_NOT_OK( pod5::update_file(arrow::default_memory_pool(), source.reader, dest_filename)); } inline pod5::RunInfoDictionaryIndex FileWriter_add_run_info( pod5::FileWriter & w, std::string & acquisition_id, std::int64_t acquisition_start_time, std::int16_t adc_max, std::int16_t adc_min, std::vector> && context_tags, std::string & experiment_name, std::string & flow_cell_id, std::string & flow_cell_product_code, std::string & protocol_name, std::string & protocol_run_id, std::int64_t protocol_start_time, std::string & sample_id, std::uint16_t sample_rate, std::string & sequencing_kit, std::string & sequencer_position, std::string & sequencer_position_type, std::string & software, std::string & system_name, std::string & system_type, std::vector> && tracking_id) { return throw_on_error(w.add_run_info( {std::move(acquisition_id), acquisition_start_time, adc_max, adc_min, std::move(context_tags), std::move(experiment_name), std::move(flow_cell_id), std::move(flow_cell_product_code), std::move(protocol_name), std::move(protocol_run_id), std::move(protocol_start_time), std::move(sample_id), sample_rate, std::move(sequencing_kit), std::move(sequencer_position), std::move(sequencer_position_type), std::move(software), std::move(system_name), std::move(system_type), std::move(tracking_id)})); } inline pod5::ReadData make_read_data( std::size_t row_id, py::array_t const & read_id_data, py::array_t const & read_numbers, py::array_t const & start_samples, py::array_t const & channels, py::array_t const & wells, py::array_t const & 
pore_types, py::array_t const & calibration_offsets, py::array_t const & calibration_scales, py::array_t const & median_befores, py::array_t const & end_reasons, py::array_t const & end_reason_forceds, py::array_t const & run_infos, py::array_t const & num_minknow_events, py::array_t const & tracked_scaling_scale, py::array_t const & tracked_scaling_shift, py::array_t const & predicted_scaling_scale, py::array_t const & predicted_scaling_shift, py::array_t const & num_reads_since_mux_change, py::array_t const & time_since_mux_change, py::array_t const & open_pore_level) { auto read_ids = reinterpret_cast(read_id_data.data(0)); return pod5::ReadData{ read_ids[row_id], *read_numbers.data(row_id), *start_samples.data(row_id), *channels.data(row_id), *wells.data(row_id), *pore_types.data(row_id), *calibration_offsets.data(row_id), *calibration_scales.data(row_id), *median_befores.data(row_id), *end_reasons.data(row_id), *end_reason_forceds.data(row_id), *run_infos.data(row_id), *num_minknow_events.data(row_id), *tracked_scaling_scale.data(row_id), *tracked_scaling_shift.data(row_id), *predicted_scaling_scale.data(row_id), *predicted_scaling_shift.data(row_id), *num_reads_since_mux_change.data(row_id), *time_since_mux_change.data(row_id), *open_pore_level.data(row_id)}; } inline void FileWriter_add_reads( pod5::FileWriter & w, std::size_t count, py::array_t const & read_id_data, py::array_t const & read_numbers, py::array_t const & start_samples, py::array_t const & channels, py::array_t const & wells, py::array_t const & pore_types, py::array_t const & calibration_offsets, py::array_t const & calibration_scales, py::array_t const & median_befores, py::array_t const & end_reasons, py::array_t const & end_reason_forceds, py::array_t const & run_infos, py::array_t const & num_minknow_events, py::array_t const & tracked_scaling_scales, py::array_t const & tracked_scaling_shifts, py::array_t const & predicted_scaling_scales, py::array_t const & predicted_scaling_shifts, 
py::array_t const & num_reads_since_mux_changes, py::array_t const & time_since_mux_changes, py::array_t const & open_pore_levels, py::list signal_ptrs) { if (read_id_data.shape(1) != 16) { throw std::runtime_error("Read id array is of unexpected size"); } auto signal_it = signal_ptrs.begin(); for (std::size_t i = 0; i < count; ++i, ++signal_it) { if (signal_it == signal_ptrs.end()) { throw std::runtime_error("Missing signal data"); } auto signal = signal_it->cast>(); auto signal_span = gsl::make_span(signal.data(), signal.size()); auto read_data = make_read_data( i, read_id_data, read_numbers, start_samples, channels, wells, pore_types, calibration_offsets, calibration_scales, median_befores, end_reasons, end_reason_forceds, run_infos, num_minknow_events, tracked_scaling_scales, tracked_scaling_shifts, predicted_scaling_scales, predicted_scaling_shifts, num_reads_since_mux_changes, time_since_mux_changes, open_pore_levels); throw_on_error(w.add_complete_read(read_data, signal_span)); } } inline void FileWriter_add_reads_pre_compressed( pod5::FileWriter & w, std::size_t count, py::array_t const & read_id_data, py::array_t const & read_numbers, py::array_t const & start_samples, py::array_t const & channels, py::array_t const & wells, py::array_t const & pore_types, py::array_t const & calibration_offsets, py::array_t const & calibration_scales, py::array_t const & median_befores, py::array_t const & end_reasons, py::array_t const & end_reason_forceds, py::array_t const & run_infos, py::array_t const & num_minknow_events, py::array_t const & tracked_scaling_scales, py::array_t const & tracked_scaling_shifts, py::array_t const & predicted_scaling_scales, py::array_t const & predicted_scaling_shifts, py::array_t const & num_reads_since_mux_changes, py::array_t const & time_since_mux_changes, py::array_t const & open_pore_levels, py::list compressed_signal_ptrs, py::array_t const & sample_counts, py::array_t const & signal_chunk_counts) { if (read_id_data.shape(1) != 
16) { throw std::runtime_error("Read id array is of unexpected size"); } auto read_ids = reinterpret_cast(read_id_data.data(0)); auto compressed_signal_it = compressed_signal_ptrs.begin(); auto sample_counts_it = sample_counts.data(); for (std::size_t i = 0; i < count; ++i) { auto const read_id = read_ids[i]; auto const signal_chunk_count = *signal_chunk_counts.data(i); std::uint64_t signal_duration_count = 0; std::vector signal_rows(signal_chunk_count); for (std::size_t signal_chunk_idx = 0; signal_chunk_idx < signal_chunk_count; ++signal_chunk_idx) { if (compressed_signal_it == compressed_signal_ptrs.end()) { throw std::runtime_error("Missing signal data"); } auto compressed_signal = compressed_signal_it ->cast>(); auto compressed_signal_span = gsl::make_span(compressed_signal.data(), compressed_signal.size()); auto signal_row = throw_on_error( w.add_pre_compressed_signal(read_id, compressed_signal_span, *sample_counts_it)); signal_rows[signal_chunk_idx] = signal_row; signal_duration_count += *sample_counts_it; ++compressed_signal_it; ++sample_counts_it; } auto read_data = make_read_data( i, read_id_data, read_numbers, start_samples, channels, wells, pore_types, calibration_offsets, calibration_scales, median_befores, end_reasons, end_reason_forceds, run_infos, num_minknow_events, tracked_scaling_scales, tracked_scaling_shifts, predicted_scaling_scales, predicted_scaling_shifts, num_reads_since_mux_changes, time_since_mux_changes, open_pore_levels); throw_on_error(w.add_complete_read(read_data, signal_rows, signal_duration_count)); } } inline void decompress_signal_wrapper( py::array_t const & compressed_signal, py::array_t & signal_out) { throw_on_error( pod5::decompress_signal( gsl::make_span(compressed_signal.data(0), compressed_signal.shape(0)), arrow::system_memory_pool(), gsl::make_span(signal_out.mutable_data(0), signal_out.shape(0)))); } inline std::size_t compress_signal_wrapper( py::array_t const & signal, py::array_t & compressed_signal_out) { auto 
size = throw_on_error( pod5::compress_signal( gsl::make_span(signal.data(), signal.shape(0)), arrow::system_memory_pool(), gsl::make_span(compressed_signal_out.mutable_data(), compressed_signal_out.shape(0)))); return size; } inline std::size_t vbz_compressed_signal_max_size(std::size_t sample_count) { POD5_PYTHON_ASSIGN_OR_RAISE( std::size_t const max_size, pod5::compressed_signal_max_size(sample_count)); return max_size; } inline std::size_t load_read_id_iterable( py::iterable const & read_ids_str, py::array_t & read_id_data_out) { std::size_t out_idx = 0; auto read_ids = reinterpret_cast(read_id_data_out.mutable_data()); auto read_ids_out_len = read_id_data_out.shape(0); std::string temp_uuid; for (auto & read_id : read_ids_str) { if (out_idx >= (std::size_t)read_ids_out_len) { throw std::runtime_error("Too many input uuids for output container"); } temp_uuid = read_id.cast(); if (auto const found_uuid = pod5::Uuid::from_string(temp_uuid)) { read_ids[out_idx++] = *found_uuid; } // If it's invalid, ignore it: we return one fewer read id than expected and the caller can deal with it.
} return out_idx; } inline py::list format_read_id_to_str( py::array_t & read_id_data_out) { if (read_id_data_out.size() % 16 != 0) { throw std::runtime_error( "Unexpected amount of data for read id - expected data to align to 16 bytes."); } py::list result; std::array str_data; std::size_t const count = read_id_data_out.size() / 16; for (std::size_t i = 0; i < count; ++i) { auto read_id_data = read_id_data_out.data() + (i * 16); pod5_format_read_id(read_id_data, str_data.data()); result.append(py::str(str_data.data(), str_data.size() - 1)); } return result; } ================================================ FILE: c++/pod5_format_pybind/bindings.cpp ================================================ #include "api.h" #include "pod5_format/c_api.h" #include "repack/repack_output.h" #include "repack/repacker.h" #include "subset.h" PYBIND11_MODULE(pod5_format_pybind, m) { using namespace pod5; pod5_init(); m.doc() = "POD5 Format Raw Bindings"; py::enum_(m, "SignalType", py::arithmetic(), "SignalType enum") .value("UncompressedSignal", SignalType::UncompressedSignal, "Signal is not compressed") .value("VbzSignal", SignalType::VbzSignal, "Signal is compressed using vbz") .export_values(); py::class_(m, "FileWriterOptions") .def(py::init<>()) .def_property( "max_signal_chunk_size", &FileWriterOptions::max_signal_chunk_size, &FileWriterOptions::set_max_signal_chunk_size) .def_property( "signal_table_batch_size", &FileWriterOptions::signal_table_batch_size, &FileWriterOptions::set_signal_table_batch_size) .def_property( "read_table_batch_size", &FileWriterOptions::read_table_batch_size, &FileWriterOptions::set_read_table_batch_size) .def_property( "signal_compression_type", &FileWriterOptions::signal_type, &FileWriterOptions::set_signal_type); py::class_>(m, "FileWriter") .def("close", [](pod5::FileWriter & w) { throw_on_error(w.close()); }) .def( "add_pore", [](pod5::FileWriter & w, std::string pore_type) { return throw_on_error(w.add_pore_type(std::move(pore_type))); }) 
.def( "add_end_reason", [](pod5::FileWriter & w, int name) { return throw_on_error(w.lookup_end_reason((pod5::ReadEndReason)name)); }) .def("add_run_info", FileWriter_add_run_info) .def("add_reads", FileWriter_add_reads) .def("add_reads_pre_compressed", FileWriter_add_reads_pre_compressed); py::class_(m, "EmbeddedFileData") .def_readonly("file_path", &pod5::FileLocation::file_path) .def_readonly("offset", &pod5::FileLocation::offset) .def_readonly("length", &pod5::FileLocation::size); py::class_>( m, "Pod5AsyncSignalLoader") .def("release_next_batch", &Pod5AsyncSignalLoader::release_next_batch); py::class_>( m, "Pod5SignalCacheBatch") .def_property_readonly("batch_index", &Pod5SignalCacheBatch::batch_index) .def_property_readonly("sample_count", &Pod5SignalCacheBatch::sample_count) .def_property_readonly("samples", &Pod5SignalCacheBatch::samples); py::class_(m, "Pod5FileReader") .def( "get_file_run_info_table_location", &Pod5FileReaderPtr::get_file_run_info_table_location) .def("get_file_read_table_location", &Pod5FileReaderPtr::get_file_read_table_location) .def("get_file_signal_table_location", &Pod5FileReaderPtr::get_file_signal_table_location) .def("get_file_version_pre_migration", &Pod5FileReaderPtr::get_file_version_pre_migration) .def("plan_traversal", &Pod5FileReaderPtr::plan_traversal) .def("batch_get_signal", &Pod5FileReaderPtr::batch_get_signal) .def("batch_get_signal_selection", &Pod5FileReaderPtr::batch_get_signal_selection) .def("batch_get_signal_batches", &Pod5FileReaderPtr::batch_get_signal_batches) .def("close", &Pod5FileReaderPtr::close); // Errors API m.def("get_error_string", &pod5_get_error_string, "Get the most recent error as a string"); // Creating files m.def( "create_file", &create_file, "Create a POD5 file for writing", py::arg("filename"), py::arg("writer_name"), py::arg("options") = nullptr); // Opening files m.def("open_file", &open_file, "Open a POD5 file for reading"); // Recovering files py::class_(m, "RecoverFileOptions") 
.def(py::init<>()) .def_readwrite("file_writer_options", &RecoverFileOptions::file_writer_options) .def_readwrite("cleanup", &RecoverFileOptions::cleanup); py::class_(m, "RecoveredRowCounts") .def(py::init<>()) .def_readwrite("signal", &RecoveredRowCounts::signal) .def_readwrite("run_info", &RecoveredRowCounts::run_info) .def_readwrite("reads", &RecoveredRowCounts::reads); py::class_(m, "CleanupError") .def(py::init<>()) .def_readwrite("file_path", &CleanupError::file_path) .def_readwrite("description", &CleanupError::description); py::class_(m, "RecoveryDetails") .def(py::init<>()) .def_readwrite("row_counts", &RecoveryDetails::row_counts) .def_readwrite("cleanup_errors", &RecoveryDetails::cleanup_errors); m.def( "recover_file", &::recover_file, "Recover a POD5 file which was not closed correctly", py::arg("src_filename"), py::arg("dest_filename"), py::arg("options") = nullptr); m.def( "update_file", &write_updated_file_to_dest, "Update a POD5 file to the latest writer format"); // Signal API m.def("decompress_signal", &decompress_signal_wrapper, "Decompress a numpy array of signal"); m.def("compress_signal", &compress_signal_wrapper, "Compress a numpy array of signal"); m.def("vbz_compressed_signal_max_size", &vbz_compressed_signal_max_size); // Repacker API py::class_>( m, "Pod5RepackerOutput"); py::class_>(m, "Repacker") .def(py::init<>()) .def("add_output", &repack::Pod5Repacker::add_output) .def("set_output_finished", &repack::Pod5Repacker::set_output_finished) .def("add_all_reads_to_output", &repack::Pod5Repacker::add_all_reads_to_output) .def("add_selected_reads_to_output", &repack::Pod5Repacker::py_add_selected_reads_to_output) .def("finish", &repack::Pod5Repacker::finish) .def_property_readonly("is_complete", &repack::Pod5Repacker::is_complete) .def_property_readonly( "currently_open_file_reader_count", &repack::Pod5Repacker::currently_open_file_reader_count) .def_property_readonly("reads_completed", &repack::Pod5Repacker::reads_completed); // Util API 
m.def( "load_read_id_iterable", &load_read_id_iterable, "Load an iterable of read ids into a numpy array of data"); m.def("format_read_id_to_str", &format_read_id_to_str, "Format an array of read ids to string"); m.def( "subset_pod5s_with_mapping", &subset_pod5s_with_mapping, "Subset pod5 files given a mapping"); } ================================================ FILE: c++/pod5_format_pybind/build_wheel.cmake ================================================ message("Building python lib-pod5 wheel using ${PYTHON_EXECUTABLE}") message(" project dir ${PYTHON_PROJECT_DIR}") message(" with lib ${PYBIND_INPUT_LIB}") message(" with conan licenses ${POD5_CONAN_LICENSES}") message(" with c++ licenses ${POD5_CXX_LICENSES_SRC}") message(" into ${WHEEL_OUTPUT_DIR}") message(" using: ${PYTHON_EXECUTABLE} -m pip wheel . --wheel-dir ${WHEEL_OUTPUT_DIR}") # Copy the prebuilt lib into the wheel src. file(COPY "${PYBIND_INPUT_LIB}" DESTINATION "${PYTHON_PROJECT_DIR}/src/lib_pod5") # Copy the licenses into the wheel src. # Note: the trailing / on src is important since it tells cmake to copy only the contents. if(EXISTS "${POD5_CONAN_LICENSES}") file(INSTALL "${POD5_CONAN_LICENSES}/" DESTINATION "${PYTHON_PROJECT_DIR}/licenses") endif() foreach(license_src license_dst IN ZIP_LISTS POD5_CXX_LICENSES_SRC POD5_CXX_LICENSES_DST) file(COPY_FILE "${license_src}" "${PYTHON_PROJECT_DIR}/licenses/${license_dst}") endforeach() execute_process( COMMAND ${PYTHON_EXECUTABLE} -m pip wheel . 
--wheel-dir ${WHEEL_OUTPUT_DIR} WORKING_DIRECTORY "${PYTHON_PROJECT_DIR}/" RESULT_VARIABLE exit_code OUTPUT_VARIABLE output ERROR_VARIABLE output ) if (NOT exit_code EQUAL 0) message(FATAL_ERROR "Could not generate wheel: ${output}") endif() file(GLOB pod5_wheel_names "${WHEEL_OUTPUT_DIR}/*.whl") foreach(wheel ${pod5_wheel_names}) message("Built wheel ${wheel}") endforeach() ================================================ FILE: c++/pod5_format_pybind/repack/repack_functions.h ================================================ #pragma once #include "pod5_format/internal/tracing/tracing.h" #include "pod5_format/read_table_reader.h" #include "pod5_format/signal_builder.h" #include "pod5_format/signal_table_schema.h" #include "pod5_format/uuid.h" #include "repack_utils.h" #include #include #include #include #include namespace repack { struct ReadReadData { std::shared_ptr input; std::vector reads; std::vector signal_durations; std::vector signal_row_sizes; std::vector signal_rows_read_ids; std::vector signal_rows; }; arrow::Result read_read_data( ReadsTableDictionaryThreadCache & reads_table_cache, states::unread_read_table_rows && in_batch) { POD5_TRACE_FUNCTION(); auto const & source_file = in_batch.input; ARROW_ASSIGN_OR_RAISE( auto source_read_table_batch, source_file->read_read_record_batch(in_batch.batch_index)); ARROW_ASSIGN_OR_RAISE(auto columns, source_read_table_batch.columns()); auto const & pore_type_columns = columns.pore_type->indices(); auto const & source_reads_pore_type_column = static_cast(*pore_type_columns); auto const & end_reason_columns = columns.end_reason->indices(); auto const & source_reads_end_reason_column = static_cast(*end_reason_columns); auto const & run_info_columns = columns.run_info->indices(); auto const & source_reads_run_info_column = static_cast(*run_info_columns); auto source_reads_signal_column = source_read_table_batch.signal_column(); auto batch_rows = std::move(in_batch.batch_rows); if (batch_rows.empty()) { auto const 
source_batch_row_count = source_read_table_batch.num_rows(); batch_rows.resize(source_batch_row_count); std::iota(batch_rows.begin(), batch_rows.end(), 0); } ReadReadData result; result.input = source_file; result.reads.reserve(batch_rows.size()); result.signal_rows.reserve(batch_rows.size()); result.signal_row_sizes.reserve(batch_rows.size()); for (std::size_t batch_row_index = 0; batch_row_index < batch_rows.size(); ++batch_row_index) { auto batch_row = batch_rows[batch_row_index]; // Find the read params auto const & read_id = columns.read_id->Value(batch_row); auto const & read_number = columns.read_number->Value(batch_row); auto const & start_sample = columns.start_sample->Value(batch_row); auto const & channel = columns.channel->Value(batch_row); auto const & well = columns.well->Value(batch_row); auto const & calibration_offset = columns.calibration_offset->Value(batch_row); auto const & calibration_scale = columns.calibration_scale->Value(batch_row); auto const & median_before = columns.median_before->Value(batch_row); auto const & end_reason_forced = columns.end_reason_forced->Value(batch_row); auto const & num_minknow_events = columns.num_minknow_events->Value(batch_row); auto const & tracked_scaling_scale = columns.tracked_scaling_scale->Value(batch_row); auto const & tracked_scaling_shift = columns.tracked_scaling_shift->Value(batch_row); auto const & predicted_scaling_scale = columns.predicted_scaling_scale->Value(batch_row); auto const & predicted_scaling_shift = columns.predicted_scaling_shift->Value(batch_row); auto const & num_reads_since_mux_change = columns.num_reads_since_mux_change->Value(batch_row); auto const & time_since_mux_change = columns.time_since_mux_change->Value(batch_row); auto const & open_pore_level = columns.open_pore_level->Value(batch_row); auto const & num_samples = columns.num_samples->Value(batch_row); auto const & pore_type_index = source_reads_pore_type_column.Value(batch_row); auto const & end_reason_index = 
source_reads_end_reason_column.Value(batch_row); auto const & run_info_index = source_reads_run_info_column.Value(batch_row); ARROW_ASSIGN_OR_RAISE( auto dest_pore_index, reads_table_cache.find_pore_index( source_file, source_read_table_batch, pore_type_index)); ARROW_ASSIGN_OR_RAISE( auto dest_run_info_index, reads_table_cache.find_run_info_index( source_file, source_read_table_batch, run_info_index)); result.reads.emplace_back( read_id, read_number, start_sample, channel, well, dest_pore_index, calibration_offset, calibration_scale, median_before, end_reason_index, end_reason_forced, dest_run_info_index, num_minknow_events, tracked_scaling_scale, tracked_scaling_shift, predicted_scaling_scale, predicted_scaling_shift, num_reads_since_mux_change, time_since_mux_change, open_pore_level); result.signal_durations.emplace_back(num_samples); auto const signal_rows = std::static_pointer_cast( source_reads_signal_column->value_slice(batch_row)); auto const signal_rows_span = gsl::make_span(signal_rows->raw_values(), signal_rows->length()); result.signal_rows.insert( result.signal_rows.end(), signal_rows_span.begin(), signal_rows_span.end()); for (std::size_t i = 0; i < signal_rows_span.size(); ++i) { result.signal_rows_read_ids.emplace_back(read_id); } result.signal_row_sizes.emplace_back(signal_rows_span.size()); } return result; } arrow::Status read_signal( std::shared_ptr const & source_file, pod5::SignalType input_compression_type, std::uint64_t abs_signal_row, pod5::Uuid read_id, pod5::SignalType output_compression_type, arrow::FixedSizeBinaryBuilder & read_id_builder, pod5::SignalBuilderVariant & signal_builder, arrow::UInt32Builder & samples_builder, arrow::MemoryPool * pool) { auto signal_rows_span = gsl::make_span(&abs_signal_row, 1); // If we're using the same compression type in both files, just copy the compressed data: if (input_compression_type == output_compression_type && output_compression_type == pod5::SignalType::VbzSignal) { std::vector sample_counts; 
ARROW_ASSIGN_OR_RAISE( auto extracted_signal, source_file->extract_samples_inplace(signal_rows_span, sample_counts)); assert(1 == extracted_signal.size()); assert(sample_counts.size() == extracted_signal.size()); auto signal_span = gsl::make_span(extracted_signal.front()->data(), extracted_signal.front()->size()); ARROW_RETURN_NOT_OK(read_id_builder.Append(read_id.data())); ARROW_RETURN_NOT_OK( std::visit(pod5::visitors::append_pre_compressed_signal{signal_span}, signal_builder)); ARROW_RETURN_NOT_OK(samples_builder.Append(sample_counts.front())); } else { // Find the sample count of the complete read: ARROW_ASSIGN_OR_RAISE( auto sample_count, source_file->extract_sample_count(signal_rows_span)); std::vector signal(sample_count); auto signal_buffer_span = gsl::make_span(signal); ARROW_RETURN_NOT_OK(source_file->extract_samples(signal_rows_span, signal_buffer_span)); ARROW_RETURN_NOT_OK(read_id_builder.Append(read_id.data())); ARROW_RETURN_NOT_OK( std::visit(pod5::visitors::append_signal{signal_buffer_span, pool}, signal_builder)); ARROW_RETURN_NOT_OK(samples_builder.Append(sample_count)); } return arrow::Status::OK(); } struct RequestedSignalReads { std::vector complete_requests; std::shared_ptr partial_request; }; arrow::Result request_signal_reads( std::shared_ptr const & source_file, pod5::SignalType output_compression_type, std::size_t signal_batch_size, std::vector read_ids, std::vector signal_rows, std::shared_ptr const & partial_request, std::shared_ptr const & dest_read_table_rows, arrow::MemoryPool * pool) { POD5_TRACE_FUNCTION(); auto const input_signal_type = source_file->signal_type(); assert(read_ids.size() == signal_rows.size()); RequestedSignalReads result; auto next_request = partial_request; assert(signal_rows.size() == dest_read_table_rows->signal_row_indices.size()); std::size_t signal_rows_position = 0; while (signal_rows_position < signal_rows.size()) { if (!next_request) { ARROW_ASSIGN_OR_RAISE( auto signal_builder, 
pod5::make_signal_builder(output_compression_type, pool)); next_request = std::make_shared( std::move(signal_builder), pool); next_request->patch_rows.reserve(signal_batch_size); } auto to_write = std::min( signal_rows.size() - signal_rows_position, signal_batch_size - next_request->patch_rows.size()); for (std::size_t i = 0; i < to_write; ++i) { auto const dest_batch_row_index = signal_rows_position + i; assert(dest_batch_row_index < signal_rows.size()); assert(dest_batch_row_index < dest_read_table_rows->signal_row_indices.size()); ARROW_RETURN_NOT_OK(read_signal( source_file, input_signal_type, signal_rows[signal_rows_position + i], read_ids[signal_rows_position + i], output_compression_type, *next_request->read_id_builder, next_request->signal_builder, next_request->samples_builder, pool)); next_request->patch_rows.emplace_back(dest_read_table_rows, dest_batch_row_index); } signal_rows_position += to_write; assert(next_request->row_count() <= signal_batch_size); if (next_request->row_count() >= signal_batch_size) { result.complete_requests.emplace_back(std::move(next_request)); next_request.reset(); } } result.partial_request = next_request; return result; } struct ReadSignal { std::size_t row_count; bool final_batch; std::vector> columns; }; arrow::Result read_signal_data(states::read_split_signal_table_batch_rows & signal_rows) { POD5_TRACE_FUNCTION(); ReadSignal result; pod5::SignalTableSchemaDescription field_locations; result.final_batch = signal_rows.final_batch; result.row_count = signal_rows.row_count(); result.columns = {nullptr, nullptr, nullptr}; ARROW_RETURN_NOT_OK( signal_rows.read_id_builder->Finish(&result.columns[field_locations.read_id])); ARROW_RETURN_NOT_OK( std::visit( pod5::visitors::finish_column{&result.columns[field_locations.signal]}, signal_rows.signal_builder)); ARROW_RETURN_NOT_OK( signal_rows.samples_builder.Finish(&result.columns[field_locations.samples])); return result; } 
arrow::Status write_reads( std::shared_ptr const & output, std::vector const & reads, std::vector const & signal_durations, std::vector const & signal_row_sizes, std::vector const & signal_row_indices) { POD5_TRACE_FUNCTION(); std::size_t signal_position = 0; auto signal_indices_span = gsl::make_span(signal_row_indices); for (std::size_t i = 0; i < reads.size(); ++i) { auto signal_rows = signal_indices_span.subspan(signal_position, signal_row_sizes[i]); signal_position += signal_row_sizes[i]; ARROW_RETURN_NOT_OK(output->add_complete_read(reads[i], signal_rows, signal_durations[i])); } return arrow::Status::OK(); } arrow::Status check_duplicate_read_ids( std::unordered_set & output_read_ids, std::vector const & new_reads) { for (auto const & read : new_reads) { auto result = output_read_ids.insert(read.read_id); if (!result.second) { return arrow::Status::Invalid( "Duplicate read id ", to_string(read.read_id), " found in file"); } } return arrow::Status::OK(); } } // namespace repack ================================================ FILE: c++/pod5_format_pybind/repack/repack_output.cpp ================================================ #include "repack_output.h" #include "pod5_format/internal/tracing/tracing.h" #include "repack_functions.h" #include #include #include namespace repack { namespace { struct is_not_nullptr { template bool operator()(T const & t) const { return t != nullptr; } }; #if 0 struct get_name{ template std::string operator()(T const& t) const { return typeid(typename T::element_type).name(); } }; template void dump_queued_items(T const& queued) { std::map items; for (auto const& item : queued) { items[std::visit(get_name{}, item)] += 1; } std::cout << "Queued items:\n"; for (auto pr : items) { std::cout << " " << pr.first << ": " << pr.second << "\n"; } } #endif } // namespace struct Pod5RepackerOutputThreadState { Pod5RepackerOutputThreadState(std::shared_ptr const & dict_manager) : dict_cache(dict_manager) { } ReadsTableDictionaryThreadCache 
dict_cache; }; struct Pod5RepackerOutputState { Pod5RepackerOutputState( std::shared_ptr const & _output_file, bool _check_duplicate_read_ids, arrow::MemoryPool * _memory_pool) : output_file(_output_file) , check_duplicate_read_ids(_check_duplicate_read_ids) , memory_pool(_memory_pool) , dict_manager( std::make_shared(_output_file, read_table_writer_mutex)) { } Pod5RepackerOutputThreadState * get_thread_state() { std::lock_guard l{thread_states_mutex}; auto it = thread_states.find(std::this_thread::get_id()); if (it == thread_states.end()) { it = thread_states.emplace(std::this_thread::get_id(), dict_manager).first; } return &it->second; } std::shared_ptr output_file; bool check_duplicate_read_ids; arrow::MemoryPool * memory_pool; std::mutex read_table_writer_mutex; std::mutex signal_table_writer_mutex; std::shared_ptr dict_manager; std::mutex partial_signal_batch_mutex; std::shared_ptr partial_signal_batch; std::atomic reads_completed{0}; std::mutex thread_states_mutex; std::unordered_map thread_states; std::mutex output_read_ids_mutex; std::unordered_set output_read_ids; }; namespace { struct StateProgressResult { StateProgressResult() = default; StateProgressResult(std::vector && _new_states) : new_states(_new_states) { } std::vector new_states; }; struct StateOperator { StateOperator(Pod5RepackerOutputState * _progress_state) : progress_state(_progress_state) {} arrow::Result operator()( std::shared_ptr & batch) const { POD5_TRACE_FUNCTION(); // Read out the read table data from the source file ARROW_ASSIGN_OR_RAISE( auto read_result, read_read_data(progress_state->get_thread_state()->dict_cache, std::move(*batch))); batch.reset(); auto read_table_rows = std::make_shared(); read_table_rows->reads = std::move(read_result.reads); read_table_rows->signal_durations = std::move(read_result.signal_durations); read_table_rows->signal_row_sizes = std::move(read_result.signal_row_sizes); read_table_rows->signal_row_indices.resize(read_result.signal_rows.size()); if 
(progress_state->check_duplicate_read_ids) { std::lock_guard l{progress_state->output_read_ids_mutex}; ARROW_RETURN_NOT_OK( check_duplicate_read_ids(progress_state->output_read_ids, read_table_rows->reads)); } // Split the read table rows into new signal table batches: { std::lock_guard l{progress_state->partial_signal_batch_mutex}; ARROW_ASSIGN_OR_RAISE( auto signal_request_result, request_signal_reads( read_result.input, progress_state->output_file->signal_type(), progress_state->output_file->signal_table_batch_size(), read_result.signal_rows_read_ids, read_result.signal_rows, progress_state->partial_signal_batch, read_table_rows, progress_state->memory_pool)); progress_state->partial_signal_batch = signal_request_result.partial_request; return StateProgressResult{std::move(signal_request_result.complete_requests)}; } } arrow::Result operator()( std::shared_ptr & batch) const { POD5_TRACE_FUNCTION(); ARROW_ASSIGN_OR_RAISE(auto read_signal_result, read_signal_data(*batch)); std::pair inserted_signal_rows; { std::lock_guard l(progress_state->signal_table_writer_mutex); ARROW_ASSIGN_OR_RAISE( inserted_signal_rows, progress_state->output_file->add_signal_batch( read_signal_result.row_count, std::move(read_signal_result.columns), read_signal_result.final_batch)); } std::vector result_new_states; for (std::size_t i = 0; i < batch->patch_rows.size(); ++i) { auto const & row = batch->patch_rows[i]; auto const & dest_read_table = row.dest_read_table; assert(dest_read_table); assert(row.dest_batch_row_index < dest_read_table->signal_row_indices.size()); dest_read_table->signal_row_indices[row.dest_batch_row_index] = inserted_signal_rows.first + i; dest_read_table->written_row_indices += 1; // Check if this read table is completed! 
if (dest_read_table->written_row_indices > dest_read_table->signal_row_indices.size()) { assert(false); } if (dest_read_table->written_row_indices == dest_read_table->signal_row_indices.size()) { result_new_states.push_back(dest_read_table); } } return StateProgressResult{std::move(result_new_states)}; } arrow::Result operator()( std::shared_ptr & batch) const { POD5_TRACE_FUNCTION(); assert(batch->written_row_indices == batch->signal_row_indices.size()); std::lock_guard l(progress_state->read_table_writer_mutex); ARROW_RETURN_NOT_OK(write_reads( progress_state->output_file, batch->reads, batch->signal_durations, batch->signal_row_sizes, batch->signal_row_indices)); progress_state->reads_completed += batch->reads.size(); return StateProgressResult{{}}; } arrow::Result operator()(std::shared_ptr & batch) const { POD5_TRACE_FUNCTION(); std::vector final_states; // No further reads expected, flush all partial state: std::lock_guard l{progress_state->partial_signal_batch_mutex}; if (progress_state->partial_signal_batch) { progress_state->partial_signal_batch->final_batch = true; final_states.emplace_back(std::move(progress_state->partial_signal_batch)); progress_state->partial_signal_batch.reset(); } return StateProgressResult{std::move(final_states)}; } Pod5RepackerOutputState * progress_state; }; } // namespace Pod5RepackerOutput::Pod5RepackerOutput( std::shared_ptr const & repacker, std::shared_ptr thread_pool, std::shared_ptr const & output, bool check_duplicate_read_ids) : m_repacker(repacker) , m_thread_pool(thread_pool) , m_output(output) , m_progress_state( std::make_unique( output, check_duplicate_read_ids, arrow::default_memory_pool())) { } Pod5RepackerOutput::~Pod5RepackerOutput() {} bool Pod5RepackerOutput::has_tasks() const { if (m_in_flight > 0) { return true; } std::lock_guard l{m_active_read_table_states_mutex}; return m_active_read_table_states.size() > 0; } void Pod5RepackerOutput::set_finished() { if (!m_finished) { // Wait for all other tasks to 
flush through the output. while (!m_has_error) { if (!has_tasks()) { break; } std::this_thread::sleep_for(std::chrono::milliseconds(1)); } { std::lock_guard l{m_active_read_table_states_mutex}; m_active_read_table_states.emplace_front(std::make_shared()); } post_try_work(); m_finished = true; } } bool Pod5RepackerOutput::is_complete() const { if (!m_finished) { return false; } std::lock_guard l{m_active_read_table_states_mutex}; return m_active_read_table_states.empty(); } std::size_t Pod5RepackerOutput::reads_completed() const { return m_progress_state->reads_completed; } void Pod5RepackerOutput::register_new_reads( std::shared_ptr const & input, std::size_t batch_index, std::vector && batch_rows) { if (m_finished) { throw std::runtime_error("Failed to add reads to finished output"); } { std::lock_guard l{m_active_read_table_states_mutex}; m_active_read_table_states.emplace_front( std::make_shared( input, batch_index, std::move(batch_rows))); } post_try_work(); } void Pod5RepackerOutput::post_try_work() { m_thread_pool->post([&]() { POD5_TRACE_FUNCTION(); auto get_next_work = [](auto & locked_states) -> states::shared_variant { if (locked_states.empty()) { return {}; } auto work = locked_states.back(); locked_states.pop_back(); return work; }; StateOperator state_operator{m_progress_state.get()}; states::shared_variant next_work; while (!m_has_error) { m_in_flight += 1; // It's important we don't release this until any new states // are in `m_active_read_table_states` auto remove_in_flight = gsl::finally([&] { m_in_flight -= 1; }); if (!std::visit(is_not_nullptr{}, next_work)) { std::lock_guard l{m_active_read_table_states_mutex}; next_work = get_next_work(m_active_read_table_states); if (!std::visit(is_not_nullptr{}, next_work)) { return; } } auto result = std::visit(state_operator, next_work); if (!result.ok()) { set_error(result.status()); return; } next_work = {}; { std::lock_guard l{m_active_read_table_states_mutex}; auto && states = 
m_active_read_table_states; states.insert(states.end(), result->new_states.begin(), result->new_states.end()); next_work = get_next_work(states); } } }); } } // namespace repack ================================================ FILE: c++/pod5_format_pybind/repack/repack_output.h ================================================ #pragma once #include "pod5_format/file_writer.h" #include "pod5_format/thread_pool.h" #include "repack_states.h" #include #include #include #include #include namespace repack { class Pod5Repacker; struct Pod5RepackerOutputState; class Pod5RepackerOutput { public: Pod5RepackerOutput( std::shared_ptr const & repacker, std::shared_ptr thread_pool, std::shared_ptr const & output, bool check_duplicate_read_ids); ~Pod5RepackerOutput(); std::string path() const { return m_output->path(); } std::shared_ptr const & repacker() const { return m_repacker; } bool has_tasks() const; arrow::Status error() { std::lock_guard l{m_error_mutex}; return m_error; } bool has_error() const { return m_has_error.load(); } // Inform the output no further reads will be added void set_finished(); // Check if the output has completed all writes bool is_complete() const; // Number of reads completed std::size_t reads_completed() const; // Register new reads with the output; must not be called after set_finished() void register_new_reads( std::shared_ptr const & input, std::size_t batch_index, std::vector && batch_rows = {} // All rows by default ); private: void post_try_work(); void set_error(arrow::Status error) { assert(!error.ok()); { std::lock_guard l{m_error_mutex}; m_error = std::move(error); } m_has_error = true; } std::shared_ptr m_repacker; std::shared_ptr m_thread_pool; std::shared_ptr m_output; std::atomic m_finished{false}; std::atomic m_has_error{false}; std::mutex m_error_mutex; arrow::Status m_error; std::atomic m_in_flight{0}; mutable std::mutex m_active_read_table_states_mutex; std::deque m_active_read_table_states; std::unique_ptr m_progress_state; 
}; } // namespace repack ================================================ FILE: c++/pod5_format_pybind/repack/repack_states.h ================================================ #pragma once #include "pod5_format/file_reader.h" #include "pod5_format/signal_builder.h" #include #include #include #include #include #include namespace repack { namespace states { class unread_read_table_rows { public: unread_read_table_rows( std::shared_ptr const & _input, std::size_t _batch_index, std::vector && _batch_rows) : input(_input) , batch_index(_batch_index) , batch_rows(std::move(_batch_rows)) { } std::shared_ptr input; std::size_t batch_index; std::vector batch_rows; }; class read_read_table_rows_no_signal { public: std::vector reads; std::vector signal_durations; std::vector signal_row_sizes; std::atomic written_row_indices{0}; std::vector signal_row_indices; }; class read_split_signal_table_batch_rows { public: struct PatchRecord { PatchRecord( std::shared_ptr dest_read_table, std::uint64_t dest_batch_row_index) : dest_read_table(dest_read_table) , dest_batch_row_index(dest_batch_row_index) { } std::shared_ptr dest_read_table; std::uint64_t dest_batch_row_index; }; read_split_signal_table_batch_rows( pod5::SignalBuilderVariant && signal_builder, arrow::MemoryPool * pool) : read_id_builder(pod5::make_read_id_builder(pool)) , signal_builder(std::move(signal_builder)) , samples_builder(pool) { } std::unique_ptr read_id_builder; pod5::SignalBuilderVariant signal_builder; arrow::UInt32Builder samples_builder; std::vector patch_rows; bool final_batch = false; std::size_t row_count() const { return patch_rows.size(); } }; struct finished {}; using shared_variant = std::variant< std::shared_ptr, std::shared_ptr, std::shared_ptr, std::shared_ptr>; }} // namespace repack::states ================================================ FILE: c++/pod5_format_pybind/repack/repack_utils.h ================================================ #pragma once #include "pod5_format/read_table_reader.h" 
#include #include #include namespace repack { struct pair_hasher { template std::size_t operator()(std::pair const & pair) const { return std::hash{}(pair.first) ^ std::hash{}(pair.second); } }; struct run_info_hasher { std::size_t operator()(pod5::RunInfoData const & run_info) const { return std::hash{}(run_info.acquisition_id); } }; class ReadsTableDictionaryManager { public: ReadsTableDictionaryManager( std::shared_ptr const & output_file, std::mutex & writer_mutex) : m_output_file(output_file) , m_writer_mutex(writer_mutex) { } // Find or create a pore index in the output file - expects to run on strand. arrow::Result find_pore_index( std::shared_ptr const & source_file, pod5::ReadTableRecordBatch const & source_batch, pod5::PoreDictionaryIndex source_index) { std::lock_guard l(m_writer_mutex); ARROW_ASSIGN_OR_RAISE(auto source_data, source_batch.get_pore_type(source_index)); pod5::PoreDictionaryIndex dest_index = 0; // See if we have the same pore type by value stored in the file: auto data_lookup_it = m_pore_data_indexes.find(source_data); if (data_lookup_it != m_pore_data_indexes.end()) { dest_index = data_lookup_it->second; } else { ARROW_ASSIGN_OR_RAISE(dest_index, m_output_file->add_pore_type(source_data)); } m_pore_data_indexes[source_data] = dest_index; return dest_index; } // Find or create a run_info index in the output file - expects to run on strand. 
arrow::Result find_run_info_index( std::shared_ptr const & source_file, pod5::ReadTableRecordBatch const & source_batch, pod5::RunInfoDictionaryIndex source_index) { std::lock_guard l(m_writer_mutex); ARROW_ASSIGN_OR_RAISE(auto source_data, source_batch.get_run_info(source_index)); ARROW_ASSIGN_OR_RAISE(auto const run_info, source_file->find_run_info(source_data)); pod5::RunInfoDictionaryIndex dest_index = 0; // See if we have the same run info by value stored in the file: auto data_lookup_it = m_run_info_data_indexes.find(*run_info); if (data_lookup_it != m_run_info_data_indexes.end()) { dest_index = data_lookup_it->second; } else { ARROW_ASSIGN_OR_RAISE(dest_index, m_output_file->add_run_info(*run_info)); } m_run_info_data_indexes[*run_info] = dest_index; return dest_index; } private: std::shared_ptr m_output_file; std::mutex & m_writer_mutex; std::unordered_map m_pore_data_indexes; std::unordered_map m_run_info_data_indexes; }; class ReadsTableDictionaryThreadCache { public: ReadsTableDictionaryThreadCache(std::shared_ptr const & main_cache) : m_main_cache(main_cache) { } // Find or create a pore index in the output file - expects to run on strand. arrow::Result find_pore_index( std::shared_ptr const & source_file, pod5::ReadTableRecordBatch const & source_batch, pod5::PoreDictionaryIndex source_index) { auto const key = std::make_pair(make_file_key(source_file), source_index); auto const it = m_pore_indexes.find(key); if (it != m_pore_indexes.end()) { return it->second; } ARROW_ASSIGN_OR_RAISE( auto dest_index, m_main_cache->find_pore_index(source_file, source_batch, source_index)); m_pore_indexes[key] = dest_index; return dest_index; } // Find or create a run_info index in the output file - expects to run on strand. 
arrow::Result find_run_info_index( std::shared_ptr const & source_file, pod5::ReadTableRecordBatch const & source_batch, pod5::RunInfoDictionaryIndex source_index) { auto const key = std::make_pair(make_file_key(source_file), source_index); auto const it = m_run_info_indexes.find(key); if (it != m_run_info_indexes.end()) { return it->second; } ARROW_ASSIGN_OR_RAISE( auto dest_index, m_main_cache->find_run_info_index(source_file, source_batch, source_index)); m_run_info_indexes[key] = dest_index; return dest_index; } private: using FileKey = std::uint64_t; FileKey make_file_key(std::shared_ptr const & file) { return reinterpret_cast(file.get()); } template using DictionaryLookup = std::unordered_map, IndexType, pair_hasher>; std::shared_ptr m_main_cache; DictionaryLookup m_pore_indexes; DictionaryLookup m_run_info_indexes; }; } // namespace repack ================================================ FILE: c++/pod5_format_pybind/repack/repacker.cpp ================================================ #include "repacker.h" #include "pod5_format/internal/tracing/tracing.h" #include "repack_output.h" #include "repack_states.h" namespace repack { namespace { void repacker_add_reads_preconditions( std::shared_ptr const & repacker, std::shared_ptr const & output, Pod5FileReaderPtr const & input) { if (output->repacker() != repacker) { throw std::runtime_error("Invalid repacker output passed, created by another repacker"); } if (!input.reader) { throw std::runtime_error("Invalid input passed to repacker, no reader"); } } } // namespace Pod5Repacker::Pod5Repacker() : m_thread_pool{pod5::make_thread_pool(10)} {} Pod5Repacker::~Pod5Repacker() { finish(); } void Pod5Repacker::finish() { POD5_TRACE_FUNCTION(); for (auto & output : m_outputs) { output->set_finished(); } check_for_error(); m_thread_pool->stop_and_drain(); for (auto & output : m_outputs) { m_reads_complete_deleted_outputs += output->reads_completed(); } m_outputs.clear(); } std::shared_ptr Pod5Repacker::add_output( 
std::shared_ptr const & output, bool check_duplicate_read_ids) { POD5_TRACE_FUNCTION(); auto repacker_output = std::make_shared( shared_from_this(), m_thread_pool, output, check_duplicate_read_ids); m_outputs.push_back(repacker_output); return repacker_output; } void Pod5Repacker::set_output_finished(std::shared_ptr const & output) { if (output->repacker() != shared_from_this()) { throw std::runtime_error("Invalid repacker output passed, created by another repacker"); } output->set_finished(); } void Pod5Repacker::add_all_reads_to_output( std::shared_ptr const & output, Pod5FileReaderPtr const & input) { POD5_TRACE_FUNCTION(); repacker_add_reads_preconditions(shared_from_this(), output, input); for (std::size_t i = 0; i < input.reader->num_read_record_batches(); ++i) { output->register_new_reads(input.reader, i); } register_submitted_reader(input.reader); } void Pod5Repacker::py_add_selected_reads_to_output( std::shared_ptr const & output, Pod5FileReaderPtr const & input, py::array_t && batch_counts, py::array_t && all_batch_rows) { repacker_add_reads_preconditions(shared_from_this(), output, input); auto batch_counts_span = gsl::make_span(batch_counts.data(), batch_counts.size()); auto all_batch_rows_span = gsl::make_span(all_batch_rows.data(), all_batch_rows.size()); add_selected_reads_to_output(output, input.reader, batch_counts_span, all_batch_rows_span); } void Pod5Repacker::add_selected_reads_to_output( std::shared_ptr const & output, std::shared_ptr const & input, gsl::span batch_counts_span, gsl::span all_batch_rows_span) { POD5_TRACE_FUNCTION(); std::size_t current_start_point = 0; for (std::size_t i = 0; i < input->num_read_record_batches(); ++i) { std::vector batch_rows; auto const batch_rows_span = all_batch_rows_span.subspan(current_start_point, batch_counts_span[i]); // If this batch has no selected rows, skip it if (batch_rows_span.empty()) { continue; } batch_rows.insert(batch_rows.end(), batch_rows_span.begin(), batch_rows_span.end()); current_start_point += 
batch_counts_span[i]; output->register_new_reads(input, i, std::move(batch_rows)); } register_submitted_reader(input); } void Pod5Repacker::check_for_error() const { for (auto const & output : m_outputs) { if (output->has_error()) { throw std::runtime_error(output->error().ToString()); } } } bool Pod5Repacker::is_complete() const { POD5_TRACE_FUNCTION(); check_for_error(); for (auto const & output : m_outputs) { if (!output->is_complete()) { return false; } } return true; } std::size_t Pod5Repacker::reads_completed() const { POD5_TRACE_FUNCTION(); check_for_error(); std::size_t reads_complete = 0; for (auto const & output : m_outputs) { reads_complete += output->reads_completed(); } return reads_complete + m_reads_complete_deleted_outputs; } } // namespace repack ================================================ FILE: c++/pod5_format_pybind/repack/repacker.h ================================================ #pragma once #include "pod5_format_pybind/api.h" #include #include #include #include namespace repack { class Pod5RepackerOutput; class Pod5Repacker : public std::enable_shared_from_this { public: Pod5Repacker(); ~Pod5Repacker(); void finish(); std::shared_ptr add_output( std::shared_ptr const & output, bool check_duplicate_read_ids); void set_output_finished(std::shared_ptr const & output); void add_all_reads_to_output( std::shared_ptr const & output, Pod5FileReaderPtr const & input); void py_add_selected_reads_to_output( std::shared_ptr const & output, Pod5FileReaderPtr const & input, py::array_t && batch_counts, py::array_t && all_batch_rows); void add_selected_reads_to_output( std::shared_ptr const & output, std::shared_ptr const & input, gsl::span batch_counts, gsl::span all_batch_rows); bool is_complete() const; std::size_t reads_completed() const; std::size_t currently_open_file_reader_count() { check_for_error(); cleanup_submitted_readers(); return m_file_readers.size(); } private: void check_for_error() const; void cleanup_submitted_readers() { 
std::erase_if(m_file_readers, [](auto const & ptr) { return ptr.expired(); }); } void register_submitted_reader(std::shared_ptr const & input) { cleanup_submitted_readers(); m_file_readers.insert(input); } std::shared_ptr m_thread_pool; std::set, std::owner_less<>> m_file_readers; std::vector> m_outputs; std::size_t m_reads_complete_deleted_outputs{0}; }; } // namespace repack ================================================ FILE: c++/pod5_format_pybind/subset.cpp ================================================ #include "subset.h" #include #include #include #include #include #include #include #include #ifndef _WIN32 #include #include #else #include #include // _getmaxstdio #endif #include "pod5_format/file_reader.h" #include "pod5_format/file_writer.h" #include "pod5_format/schema_metadata.h" #include "repack/repacker.h" #include namespace io_limits { // Balance the number of open inputs by the output-side handle usage. // Prefer outputs over inputs to reduce the number of output // batches which iterate over all inputs. 
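// Worked example of the clamping below (illustrative only; the soft limit of 1024 // and the 4 output files are assumed values for this sketch, not taken from the source): // clamp_open_outputs(1024): reserve = 16 + 1 = 17; soft_upper = floor(1024 * 0.7) = 716, // rounded down to a multiple of 16 -> 704; min(1024 - 17, 704) = 704, within [1, 4096]. // clamp_open_inputs(1024, 4): reserve = 16 + 4 = 20; 1024 - 20 = 1004, clamped to 256. 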
constexpr std::float_t kOutputsBias = 0.7f; constexpr std::size_t kMinHandles = 1; constexpr std::size_t kBaseReserve = 16; std::size_t clamp_open_inputs(std::size_t soft_limit, std::size_t output_files) { constexpr std::size_t kMaxInHandles = 256; std::size_t const reserve = kBaseReserve + output_files; if (soft_limit <= reserve + kMinHandles) { return kMinHandles; } return std::clamp(soft_limit - reserve, kMinHandles, kMaxInHandles); } std::size_t clamp_open_outputs(std::size_t soft_limit) { constexpr std::size_t kMaxOutHandles = 4096; std::size_t const reserve = kBaseReserve + kMinHandles; if (soft_limit <= reserve + kMinHandles) { return kMinHandles; } std::size_t soft_upper = (std::size_t)(soft_limit * kOutputsBias); if (soft_upper > 32) { soft_upper = (soft_upper / 16) * 16; } return std::clamp(std::min(soft_limit - reserve, soft_upper), kMinHandles, kMaxOutHandles); } std::size_t detect_soft_limit() { // constexpr std::size_t kSoftLimitFallback = 1024; #ifndef _WIN32 // Attempt to get the resource limits (if any) struct rlimit rl{}; if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY) { return static_cast(rl.rlim_cur); } long sc = sysconf(_SC_OPEN_MAX); return sc > 0 ? static_cast(sc) : kSoftLimitFallback; #else // Only stdio stream limit, not a true OS handle limit. int n = _getmaxstdio(); return n > 0 ? 
        static_cast<std::size_t>(n) : kSoftLimitFallback;
#endif
}

}  // namespace io_limits

/// \brief Simple progress bar for console applications
class ProgressBar {
public:
    static constexpr int PB_WIDTH = 60;

    ProgressBar() {}

    ~ProgressBar() { std::fputs("\n", stdout); }

    void set_task(std::string const & task_name)
    {
        m_task = task_name;
        print_progress();
    }

    void update_max_steps(std::size_t max_steps) { this->m_max_steps = max_steps; }

    void update(std::size_t current_step)
    {
        if (current_step == m_current_step) {
            return;
        }
        m_current_step = current_step;
        print_progress();
    }

    void print_progress()
    {
        // Guard against 0/0 when no steps have been registered yet
        // (set_task() may be called before update_max_steps()).
        float complete_ratio = 0.0f;
        if (m_max_steps > 0) {
            complete_ratio =
                static_cast<float>(m_current_step) / static_cast<float>(m_max_steps);
        }
        int complete_length = static_cast<int>(complete_ratio * PB_WIDTH);

        std::string complete_string{"\r["};
        for (int i = 0; i < PB_WIDTH; ++i) {
            if (i < complete_length) {
                complete_string += "=";
            } else {
                complete_string += " ";
            }
        }
        complete_string += "] (" + std::to_string(m_current_step) + "/"
                           + std::to_string(m_max_steps) + ") " + m_task;

        m_max_printed_width = std::max(m_max_printed_width, complete_string.size());
        // Pad to max width to overwrite previous longer lines
        complete_string.resize(m_max_printed_width, ' ');
        std::cout << complete_string.c_str() << std::flush;
    }

private:
    std::string m_task;
    std::size_t m_max_steps{0};
    std::size_t m_current_step{0};
    std::size_t m_max_printed_width{0};
};

void subset_pod5s_with_mapping(
    std::vector inputs,
    std::filesystem::path output,
    std::map> read_id_to_dest,
    bool missing_ok,
    bool duplicate_ok,
    bool force_overwrite)
{
    // Check for a pending Python signal (e.g. Ctrl-C) at most every 500 ms.
    auto next_interrupt_check = std::chrono::steady_clock::now();
    auto poll_python_interrupt = [&]() {
        auto const now = std::chrono::steady_clock::now();
        if (now < next_interrupt_check) {
            return;
        }
        next_interrupt_check = now + std::chrono::milliseconds(500);

        pybind11::gil_scoped_acquire gil;
        if (PyErr_CheckSignals() != 0) {
            throw pybind11::error_already_set();
        }
    };

    struct OutputInfo {
        OutputInfo(std::shared_ptr && repacker_output_) :
repacker_output(std::move(repacker_output_)) { } std::shared_ptr repacker_output; void clear_per_input_working_data() { batch_counts.clear(); all_batch_rows.clear(); batch_counts.reserve(32); all_batch_rows.reserve(128); } void add_row(std::uint32_t row_index) { all_batch_rows.push_back(row_index); next_batch_size += 1; } void finish_batch() { batch_counts.push_back(next_batch_size); next_batch_size = 0; } // Per file working vectors: std::uint32_t next_batch_size = 0; std::vector batch_counts; std::vector all_batch_rows; }; pod5::FileWriterOptions output_options{}; output_options.set_keep_signal_file_open(false); output_options.set_keep_read_table_file_open(false); output_options.set_keep_run_info_file_open(false); pod5::FileReaderOptions input_options{}; input_options.set_force_disable_file_mapping(true); // Process inputs in deterministic lexical path order. std::sort(inputs.begin(), inputs.end()); std::vector created_output_files; auto cleanup = gsl::finally([&]() { for (auto const & path : created_output_files) { std::error_code ec; std::filesystem::remove(path, ec); } }); bool issued_migration_warning = false; std::size_t const io_soft_limit = io_limits::detect_soft_limit(); std::size_t const max_out_size = io_limits::clamp_open_outputs(io_soft_limit); // Create indexable view of the map iterators so we can conveniently index in batches. std::vector>::const_iterator> read_id_dest_iters; read_id_dest_iters.reserve(read_id_to_dest.size()); std::size_t total_requested_read_ids = 0; for (auto it = read_id_to_dest.begin(); it != read_id_to_dest.end(); ++it) { poll_python_interrupt(); // Check we're not unintentionally overwriting files auto const output_path = output / it->first; if (std::filesystem::exists(output_path)) { if (!force_overwrite) { throw std::runtime_error( "Output files already exists and --force-overwrite not set. 
"); } else { std::filesystem::remove(output_path); } } // Index the map iterator and tally total reads read_id_dest_iters.push_back(it); total_requested_read_ids += it->second.size(); } std::size_t found_read_count = 0; std::size_t total_reads_completed = 0; std::size_t const total_output_batches = (read_id_dest_iters.size() + max_out_size - 1) / max_out_size; if (total_output_batches > 1) { std::cerr << "Subsetting inputs into " << std::to_string(read_id_dest_iters.size()) << " files in " << std::to_string(total_output_batches) << " batches of at most " << max_out_size << " outputs. IO limit: " << std::to_string(io_soft_limit) << std::endl; } ProgressBar progress_bar; progress_bar.set_task("Starting..."); progress_bar.update_max_steps(total_requested_read_ids); // Iterate over outputs in batches for (std::size_t out_st = 0; out_st < read_id_dest_iters.size(); out_st += max_out_size) { poll_python_interrupt(); std::size_t const output_batch_index = (out_st / max_out_size) + 1; std::size_t const out_end = std::min(out_st + max_out_size, read_id_dest_iters.size()); std::string const batch_prefix = "Batch [" + std::to_string(output_batch_index) + "/" + std::to_string(total_output_batches) + "]: "; auto repacker = std::make_shared(); std::unordered_multimap read_id_lookup; std::vector dest_to_output; dest_to_output.reserve(out_end - out_st); // For each output in this batch for (std::size_t out_idx = out_st; out_idx < out_end; ++out_idx) { poll_python_interrupt(); auto const & read_id_dest = *read_id_dest_iters[out_idx]; auto const output_path = output / read_id_dest.first; // Create the output file auto writer = pod5::create_file_writer(output_path.string(), "pod5_subset", output_options); if (!writer.ok()) { std::cerr << "Failed to create output file: " << output_path << std::endl; throw std::runtime_error("Failed to create output POD5 file"); } // Add the output file writer to the repacker created_output_files.push_back(output_path); auto repacker_output_file = 
                repacker->add_output(std::move(*writer), !duplicate_ok);
            std::size_t const repacker_output_idx = dest_to_output.size();
            dest_to_output.emplace_back(std::move(repacker_output_file));

            // Associate the requested read_ids to this output
            for (auto const & read_id : read_id_dest.second) {
                auto read_id_uuid = pod5::Uuid::from_string(read_id);
                if (!read_id_uuid) {
                    std::cerr << "Invalid read id uuid: " << read_id << std::endl;
                    throw std::runtime_error("Invalid read id uuid in mapping");
                }
                read_id_lookup.insert(std::make_pair(*read_id_uuid, repacker_output_idx));
            }
        }

        // Scale the max open input files by current output handle usage and system limits.
        std::size_t const max_open_input_files =
            io_limits::clamp_open_inputs(io_soft_limit, dest_to_output.size());
        std::size_t const max_in_size = std::max<std::size_t>(1, max_open_input_files);

        // Wait for the number of open readers in the repacker to go below `limit`
        auto wait_for_open_readers_below = [&](std::size_t limit) {
            auto last_update = std::chrono::steady_clock::now();
            while (repacker->currently_open_file_reader_count() >= limit) {
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
                poll_python_interrupt();

                auto const now = std::chrono::steady_clock::now();
                if (now - last_update >= std::chrono::milliseconds(2000)) {
                    progress_bar.update(total_reads_completed + repacker->reads_completed());
                    progress_bar.set_task(
                        batch_prefix + "Waiting for queued writes to complete from "
                        + std::to_string(repacker->currently_open_file_reader_count())
                        + " files...");
                    last_update = now;
                }
            }
        };

        // Wait for the repacker to finish with its currently open readers
        auto wait_for_open_readers_zero = [&]() {
            auto last_update = std::chrono::steady_clock::now();
            while (repacker->currently_open_file_reader_count() > 0) {
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
                poll_python_interrupt();

                auto const now = std::chrono::steady_clock::now();
                if (now - last_update >= std::chrono::milliseconds(2000)) {
                    progress_bar.update(total_reads_completed +
repacker->reads_completed()); progress_bar.set_task(batch_prefix + "Waiting for batch IO to complete..."); last_update = now; } } }; // Walk each input file in chunks for this output batch. for (std::size_t in_st = 0; in_st < inputs.size(); in_st += max_in_size) { poll_python_interrupt(); std::size_t const in_end = std::min(in_st + max_in_size, inputs.size()); // Add an input in this chunk for (std::size_t in_idx = in_st; in_idx < in_end; ++in_idx) { poll_python_interrupt(); auto const & input_path = inputs[in_idx]; // Keep in-flight readers below chunk limit. wait_for_open_readers_below(max_in_size); // Clear previous row selections from a previous input file. for (auto & output_file : dest_to_output) { output_file.clear_per_input_working_data(); } // "Batch [i/N]: Subsetting {input}" progress_bar.set_task( batch_prefix + "Subsetting " + input_path.filename().string()); // Open the input file auto input_reader_opt = pod5::open_file_reader(input_path.string(), input_options); if (!input_reader_opt.ok()) { std::cerr << "Failed to open input file: " << input_path << std::endl; throw std::runtime_error("Failed to open input POD5 file"); } auto const & input_reader = *input_reader_opt; if (!issued_migration_warning && out_st == 0) { auto const pre_migration_version = input_reader->file_version_pre_migration(); auto const post_migration_version = input_reader->schema_metadata().writing_pod5_version; if (pre_migration_version != post_migration_version) { std::cerr << "Warning: Migrated an input from POD5 version " << pre_migration_version.to_string() << " to " << post_migration_version.to_string() << " while subsetting. This can affect performance " "significantly. Consider updating input files." 
<< std::endl; } issued_migration_warning = true; } // Walk the input file batches: for (std::size_t i = 0; i < input_reader->num_read_record_batches(); ++i) { poll_python_interrupt(); auto batch = input_reader->read_read_record_batch(i); if (!batch.ok()) { std::cerr << "Failed to read batch " << i << " from input file: " << input_path << std::endl; throw std::runtime_error("Failed to read read record batch from POD5 file"); } // Test each read id in the batch to see if we want it: auto const & read_id_column = batch->read_id_column(); for (std::int64_t row = 0; row < read_id_column->length(); ++row) { if ((row & 0x3FF) == 0) { poll_python_interrupt(); } auto const found = read_id_lookup.equal_range(read_id_column->Value(row)); for (auto it = found.first; it != found.second; ++it) { dest_to_output[it->second].add_row(row); found_read_count += 1; } } // Store how many rows in this batch were selected: for (auto & output_file : dest_to_output) { output_file.finish_batch(); } progress_bar.update(total_reads_completed + repacker->reads_completed()); } // Submit selected reads to each output: for (auto & output_file : dest_to_output) { repacker->add_selected_reads_to_output( output_file.repacker_output, input_reader, gsl::make_span(output_file.batch_counts), gsl::make_span(output_file.all_batch_rows)); } } // Batch drain barrier for inputs in this output batch. 
wait_for_open_readers_zero(); } // Set this output batch to finished: std::thread finisher([&] { for (auto & output_file : dest_to_output) { repacker->set_output_finished(output_file.repacker_output); } }); auto join_finisher = gsl::finally([&] { if (finisher.joinable()) { finisher.join(); } }); // Wait for this batch to complete: progress_bar.set_task(batch_prefix + "Waiting for batch IO to complete..."); try { while (!repacker->is_complete()) { std::this_thread::sleep_for(std::chrono::milliseconds(100)); poll_python_interrupt(); progress_bar.update(total_reads_completed + repacker->reads_completed()); } } catch (pybind11::error_already_set const &) { throw; } catch (std::exception const & e) { std::cout << "\nError during repacking: " << e.what() << std::endl; } if (finisher.joinable()) { finisher.join(); } repacker->finish(); total_reads_completed += repacker->reads_completed(); } progress_bar.set_task("Finished"); if (found_read_count < total_requested_read_ids && !missing_ok) { throw std::runtime_error("Missing read_ids from inputs but --missing-ok not set"); } // Clear created output files from cleanup since we succeeded: created_output_files.clear(); } ================================================ FILE: c++/pod5_format_pybind/subset.h ================================================ #include #include #include #include #include void subset_pod5s_with_mapping( std::vector inputs, std::filesystem::path output, std::map> read_id_to_dest, bool missing_ok, bool duplicate_ok, bool force_overwrite); ================================================ FILE: c++/pod5_format_pybind/utils.h ================================================ #pragma once #include "pod5_format/result.h" inline void raise_error(arrow::Status const & status) { throw std::runtime_error(status.ToString()); } template inline void raise_error(arrow::Result const & result) { throw std::runtime_error(result.status().ToString()); } #define POD5_PYTHON_RETURN_NOT_OK(statement) \ do { \ auto const 
_res = (statement); \ if (!_res.ok()) { \ raise_error(_res); \ } \ } while (false) #define POD5_PYTHON_ASSIGN_OR_RAISE_IMPL(result_name, lhs, rexpr) \ auto && result_name = (rexpr); \ if (!(result_name).ok()) { \ raise_error(result_name); \ } \ lhs = std::move(result_name).ValueUnsafe(); #define POD5_PYTHON_ASSIGN_OR_RAISE(lhs, rexpr) \ POD5_PYTHON_ASSIGN_OR_RAISE_IMPL( \ ARROW_ASSIGN_OR_RAISE_NAME(_error_or_value, __COUNTER__), lhs, rexpr); inline void throw_on_error(pod5::Status const & s) { if (!s.ok()) { throw std::runtime_error(s.ToString()); } } template inline T throw_on_error(pod5::Result const & s) { if (!s.ok()) { throw std::runtime_error(s.status().ToString()); } return *s; } ================================================ FILE: c++/test/CMakeLists.txt ================================================ add_executable(pod5_unit_tests main.cpp c_api_null_input.cpp c_api_test_utils.h c_api_tests.cpp c_api_build_test.c file_reader_writer_tests.cpp output_stream_tests.cpp read_table_writer_utils_tests.cpp read_table_tests.cpp run_info_table_tests.cpp schema_tests.cpp signal_compression_tests.cpp signal_table_tests.cpp svb16_scalar_tests.cpp svb16_x64_tests.cpp test_utils.h thread_pool_tests.cpp utils.h uuid_tests.cpp ) if (${CMAKE_CXX_COMPILER_ID} MATCHES "Clang") set_source_files_properties(c_api_build_test.c PROPERTIES COMPILE_OPTIONS "-Wdocumentation") endif() target_link_libraries(pod5_unit_tests PUBLIC pod5_format ${maybe_public_libs} ) set_property(TARGET pod5_unit_tests PROPERTY CXX_STANDARD 20) if (NOT MSVC) target_compile_options(pod5_unit_tests PRIVATE ${pod5_warning_options}) endif() add_test( NAME pod5_unit_tests COMMAND pod5_unit_tests ) ================================================ FILE: c++/test/TemporaryDirectory.h ================================================ #pragma once #include "pod5_format/uuid.h" #include namespace ont { namespace testutils { static std::string make_unique_name() { std::random_device gen; auto uuid_gen = 
        pod5::BasicUuidRandomGenerator{gen};
    return to_string(uuid_gen());
}

/// A scoped directory with a fixed name.
class TemporaryDirectory {
public:
    /// Where to create the directory.
    enum class Location { CurrentDir, TempDir };

    /// Whether to also clear any existing directory contents on construction.
    enum class DeleteBehaviour { AfterOnly, BeforeAndAfter };

    /// Creates a randomly named directory under the system temp directory.
    TemporaryDirectory()
    : TemporaryDirectory(make_unique_name(), Location::TempDir, DeleteBehaviour::AfterOnly)
    {
    }

    /// Create a directory (relative paths resolve against the current directory).
    explicit TemporaryDirectory(std::filesystem::path path, DeleteBehaviour delete_behaviour)
    : TemporaryDirectory(std::move(path), Location::CurrentDir, delete_behaviour)
    {
    }

    /// Create a directory.
    explicit TemporaryDirectory(
        std::filesystem::path path,
        Location location = Location::CurrentDir,
        DeleteBehaviour delete_behaviour = DeleteBehaviour::AfterOnly)
    {
        if (!path.is_absolute()) {
            if (location == Location::CurrentDir) {
                path = std::filesystem::absolute(path);
            } else {
                path = std::filesystem::temp_directory_path() / path;
            }
        }
        if (delete_behaviour == DeleteBehaviour::BeforeAndAfter) {
            // m_path is still empty here, so remove the resolved target path.
            std::filesystem::remove_all(path);
        }
        std::filesystem::create_directories(path);
        m_path = path;
    }

    TemporaryDirectory(TemporaryDirectory const &) = delete;
    TemporaryDirectory & operator=(TemporaryDirectory const &) = delete;

    TemporaryDirectory(TemporaryDirectory &&) = default;
    TemporaryDirectory & operator=(TemporaryDirectory &&) = default;

    /// Remove the referenced directory.
    ///
    /// Does nothing if this is not a valid object.
    ~TemporaryDirectory()
    {
        if (!m_path.empty()) {
            std::error_code error;
            std::filesystem::remove_all(m_path, error);
        }
    }

    /// Path to the directory.
std::filesystem::path const & path() const { return m_path; } explicit operator bool() const { return !m_path.empty(); } private: std::filesystem::path m_path; }; template std::basic_ostream & operator<<( std::basic_ostream & os, TemporaryDirectory const & td) { return os << "TemporaryDirectory{ " << td.path() << " }"; } }} // namespace ont::testutils ================================================ FILE: c++/test/c_api_build_test.c ================================================ #include "pod5_format/c_api.h" // Build check to verify a c file can include the c_api ================================================ FILE: c++/test/c_api_null_input.cpp ================================================ #include "c_api_test_utils.h" #include "pod5_format/c_api.h" #include "utils.h" #include #include #include namespace { void pod5_reset_error() { pod5_vbz_compressed_signal_max_size(1); REQUIRE_POD5_OK(pod5_get_error_no()); REQUIRE(pod5_get_error_string() == std::string_view{}); } namespace detail { template constexpr std::size_t ptr_idx_to_arg_idx() { // Count how many pointers we've seen at each arg. std::size_t ptr_count[]{static_cast(std::is_pointer_v)...}; std::partial_sum(std::begin(ptr_count), std::end(ptr_count), std::begin(ptr_count)); // Find which arg matches our index. for (std::size_t arg_i = 0; arg_i < std::size(ptr_count); arg_i++) { if (ptr_count[arg_i] == PtrIdx + 1) { return arg_i; } } throw "Cannot find arg for ptr"; } template void make_ptr_null_impl(std::tuple & args, std::uint64_t valid_ptr_bitset) { // Grab the arg that we'll be modifying. constexpr std::size_t ArgIdx = ptr_idx_to_arg_idx(); auto & arg = std::get(args); using ArgT = std::remove_reference_t; static_assert(std::is_pointer_v); // If the arg isn't a valid one then replace it with a nullptr. 
auto const valid = (valid_ptr_bitset >> PtrIdx) & 1; if (!valid) { arg = nullptr; } } template void make_ptrs_null( std::tuple & args, std::uint64_t valid_ptr_bitset, std::index_sequence) { (make_ptr_null_impl(args, valid_ptr_bitset), ...); } template auto unpack_and_call( pod5_error_t (*func)(Args...), std::tuple args, std::index_sequence) { return func(std::get(args)...); } } // namespace detail template void call_with_nulls(pod5_error_t (*func)(Args...), Args... args) { auto const valid_inputs = std::make_tuple(args...); constexpr std::size_t num_args = sizeof...(Args); static_assert(num_args <= 64, "uint64_t isn't big enough for a bitmask"); constexpr std::size_t num_pointers = (std::is_pointer_v + ...); constexpr auto ArgIdxs = std::make_index_sequence(); constexpr auto PtrIdxs = std::make_index_sequence(); // Try every combination of NULL for the pointers. for (std::uint64_t valid_ptr_bitset = 0; std::popcount(valid_ptr_bitset) != num_pointers; valid_ptr_bitset++) { CAPTURE(valid_ptr_bitset); // Replace some args with nulls. auto inputs = valid_inputs; detail::make_ptrs_null(inputs, valid_ptr_bitset, PtrIdxs); // Make the call. pod5_reset_error(); pod5_error_t const result = detail::unpack_and_call(func, inputs, ArgIdxs); // Check that it was an error. // TODO: We could improve this to check that the first invalid arg matches the error that's // reported (ie null string, null file, etc...), but this is already overengineered enough. //int const first_ptr = std::countr_zero(~valid_ptr_bitset); // codespell:ignore CHECK_POD5_NOT_OK(result); CHECK_THAT(pod5_get_error_string(), Catch::Matchers::Contains("null")); } } TEST_CASE("NULL input doesn't crash") { using Catch::Matchers::Contains; pod5_init(); auto cleanup = gsl::finally([] { pod5_terminate(); }); // Make a temporary file for the read API to use. 
static constexpr char const temporary_filename[] = "./foo_c_api.pod5"; { REQUIRE(remove_file_if_exists(temporary_filename).ok()); Pod5FileWriter_t * writer = pod5_create_file(temporary_filename, "c_software", nullptr); REQUIRE_POD5_OK(pod5_get_error_no()); REQUIRE(writer); std::int16_t pore_type_id{}; REQUIRE_POD5_OK(pod5_add_pore(&pore_type_id, writer, "pore_type")); std::int16_t run_info_id{}; size_t const num_kv_pairs = 1; char const * keys[]{"key"}; char const * values[]{"value"}; REQUIRE_POD5_OK(pod5_add_run_info( &run_info_id, writer, "acquisition_id", 1, 1, -1, num_kv_pairs, keys, values, "experiment_name", "flow_cell_id", "flow_cell_product_code", "protocol_name", "protocol_run_id", 1, "sample_id", 1, "sequencing_kit", "sequencer_position", "sequencer_position_type", "software", "system_name", "system_type", num_kv_pairs, keys, values)); read_id_t const read_id{}; uint32_t const read_number{}; uint64_t const start_sample{}; float const median_before{}; uint16_t const channel{}; uint8_t const well{}; float const calibration_offset{}; float const calibration_scale{}; pod5_end_reason_t const end_reason{}; uint8_t const end_reason_forced{}; uint64_t const num_minknow_events{}; float const tracked_scaling_scale{}; float const tracked_scaling_shift{}; float const predicted_scaling_scale{}; float const predicted_scaling_shift{}; uint32_t const num_reads_since_mux_change{}; float const time_since_mux_change{}; float const open_pore_level{}; ReadBatchRowInfoArrayV3 const row_data_v3{ &read_id, &read_number, &start_sample, &median_before, &channel, &well, &pore_type_id, &calibration_offset, &calibration_scale, &end_reason, &end_reason_forced, &run_info_id, &num_minknow_events, &tracked_scaling_scale, &tracked_scaling_shift, &predicted_scaling_scale, &predicted_scaling_shift, &num_reads_since_mux_change, &time_since_mux_change}; int16_t const signal_data[]{1, 2, 3, 4, 5}; uint32_t const signal_size = std::size(signal_data); auto * signal_data_ptr = signal_data; 
REQUIRE_POD5_OK(pod5_add_reads_data( writer, 1, READ_BATCH_ROW_INFO_VERSION_3, &row_data_v3, &signal_data_ptr, &signal_size)); ReadBatchRowInfoArrayV4 const row_data_v4{ &read_id, &read_number, &start_sample, &median_before, &channel, &well, &pore_type_id, &calibration_offset, &calibration_scale, &end_reason, &end_reason_forced, &run_info_id, &num_minknow_events, &tracked_scaling_scale, &tracked_scaling_shift, &predicted_scaling_scale, &predicted_scaling_shift, &num_reads_since_mux_change, &time_since_mux_change, &open_pore_level}; REQUIRE_POD5_OK(pod5_add_reads_data( writer, 1, READ_BATCH_ROW_INFO_VERSION_4, &row_data_v4, &signal_data_ptr, &signal_size)); REQUIRE_POD5_OK(pod5_close_and_free_writer(writer)); } SECTION("Reader API") { { INFO("pod5_open_file") pod5_reset_error(); CHECK(pod5_open_file(nullptr) == nullptr); CHECK_THAT(pod5_get_error_string(), Contains("null string passed")); } { INFO("pod5_open_file_options") Pod5ReaderOptions_t options{}; pod5_reset_error(); CHECK(pod5_open_file_options(nullptr, nullptr) == nullptr); CHECK_THAT(pod5_get_error_string(), Contains("null string passed")); pod5_reset_error(); CHECK(pod5_open_file_options(temporary_filename, nullptr) == nullptr); CHECK_THAT(pod5_get_error_string(), Contains("null passed")); pod5_reset_error(); CHECK(pod5_open_file_options(nullptr, &options) == nullptr); CHECK_THAT(pod5_get_error_string(), Contains("null string passed")); } { INFO("pod5_close_and_free_reader") pod5_reset_error(); CHECK_POD5_OK(pod5_close_and_free_reader(nullptr)); } // The rest of these functions require a reader. 
Pod5FileReader_t * mutable_reader = pod5_open_file(temporary_filename); REQUIRE(mutable_reader); auto close_reader = gsl::finally([&mutable_reader] { pod5_close_and_free_reader(mutable_reader); }); Pod5FileReader_t const * reader = mutable_reader; { INFO("pod5_get_file_info") FileInfo file_info{}; call_with_nulls(pod5_get_file_info, reader, &file_info); } { INFO("pod5_get_file_read_table_location") EmbeddedFileData_t file_data{}; call_with_nulls(pod5_get_file_read_table_location, reader, &file_data); } { INFO("pod5_get_file_signal_table_location") EmbeddedFileData_t file_data{}; call_with_nulls(pod5_get_file_signal_table_location, reader, &file_data); } { INFO("pod5_get_file_run_info_table_location") EmbeddedFileData_t file_data{}; call_with_nulls(pod5_get_file_run_info_table_location, reader, &file_data); } { INFO("pod5_get_read_count") size_t count{}; call_with_nulls(pod5_get_read_count, reader, &count); } { INFO("pod5_get_read_ids") std::array read_ids{}; call_with_nulls(pod5_get_read_ids, reader, read_ids.size(), read_ids.data()); } { INFO("pod5_plan_traversal") constexpr std::size_t read_id_count = 1; uint8_t const read_id_array[read_id_count * 16]{}; uint32_t batch_counts{}; uint32_t batch_rows{}; size_t find_success_count_out{}; call_with_nulls( pod5_plan_traversal, reader, read_id_array, read_id_count, &batch_counts, &batch_rows, &find_success_count_out); } { INFO("pod5_get_read_batch_count") size_t count{}; call_with_nulls(pod5_get_read_batch_count, &count, reader); } { INFO("pod5_get_read_batch") Pod5ReadRecordBatch_t * batch = nullptr; size_t index{}; call_with_nulls(pod5_get_read_batch, &batch, reader, index); } { INFO("pod5_free_read_batch") pod5_reset_error(); CHECK_POD5_OK(pod5_free_read_batch(nullptr)); } // The rest of these functions require a batch. 
Pod5ReadRecordBatch_t * mutable_batch = nullptr; CHECK_POD5_OK(pod5_get_read_batch(&mutable_batch, reader, 0)); REQUIRE(mutable_batch); auto free_batch = gsl::finally([&mutable_batch] { pod5_free_read_batch(mutable_batch); }); Pod5ReadRecordBatch_t const * batch = mutable_batch; { INFO("pod5_get_read_batch_row_count") size_t count{}; call_with_nulls(pod5_get_read_batch_row_count, &count, batch); } { INFO("pod5_get_read_batch_row_info_data") ReadBatchRowInfoV4 row_info{}; size_t row = 0; uint16_t struct_version = READ_BATCH_ROW_INFO_VERSION; uint16_t read_table_version{}; call_with_nulls( pod5_get_read_batch_row_info_data, batch, row, struct_version, static_cast(&row_info), &read_table_version); } { INFO("pod5_get_signal_row_indices") size_t row = 0; uint64_t indices[1]; call_with_nulls( pod5_get_signal_row_indices, batch, row, static_cast(std::size(indices)), indices); } { INFO("pod5_get_calibration_extra_info") size_t row = 0; CalibrationExtraData_t data{}; call_with_nulls(pod5_get_calibration_extra_info, batch, row, &data); } { INFO("pod5_get_run_info") int16_t index = 0; RunInfoDictData_t * data = nullptr; call_with_nulls(pod5_get_run_info, batch, index, &data); } { INFO("pod5_get_file_run_info") run_info_index_t run_info_index = 0; RunInfoDictData_t * run_info_data = nullptr; call_with_nulls(pod5_get_file_run_info, reader, run_info_index, &run_info_data); } { INFO("pod5_free_run_info") pod5_reset_error(); CHECK_POD5_OK(pod5_free_run_info(nullptr)); } { INFO("pod5_release_run_info") pod5_reset_error(); #ifndef _WIN32 #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wdeprecated-declarations" #endif CHECK_POD5_OK(pod5_release_run_info(nullptr)); #ifndef _WIN32 #pragma GCC diagnostic pop #endif } { INFO("pod5_get_file_run_info_count") run_info_index_t count{}; call_with_nulls(pod5_get_file_run_info_count, reader, &count); } { INFO("pod5_get_end_reason") int16_t index = 0; pod5_end_reason end_reason{}; std::array string{}; size_t string_len = 
string.size(); call_with_nulls( pod5_get_end_reason, batch, index, &end_reason, string.data(), &string_len); } { INFO("pod5_get_pore_type") int16_t index = 0; std::array string{}; size_t string_len = string.size(); call_with_nulls(pod5_get_pore_type, batch, index, string.data(), &string_len); } { INFO("pod5_get_signal_row_info") std::array const signal_rows{}; SignalRowInfo * signal_row_info = nullptr; call_with_nulls( pod5_get_signal_row_info, reader, signal_rows.size(), signal_rows.data(), &signal_row_info); } { INFO("pod5_free_signal_row_info") pod5_reset_error(); CHECK_POD5_OK(pod5_free_signal_row_info(0, nullptr)); CHECK_POD5_NOT_OK(pod5_free_signal_row_info(1, nullptr)); } { INFO("pod5_get_signal") // We need a signal row info. uint64_t const signal_row_index = 0; SignalRowInfo_t * signal_row_info = nullptr; CHECK_POD5_OK(pod5_get_signal_row_info(reader, 1, &signal_row_index, &signal_row_info)); REQUIRE(signal_row_info); auto free_signal_row_info = gsl::finally( [&signal_row_info] { pod5_free_signal_row_info(1, &signal_row_info); }); std::array samples{}; call_with_nulls( pod5_get_signal, reader, static_cast(signal_row_info), samples.size(), samples.data()); } { INFO("pod5_get_read_complete_sample_count") size_t row = 0; size_t count{}; call_with_nulls(pod5_get_read_complete_sample_count, reader, batch, row, &count); } { INFO("pod5_get_read_complete_signal") size_t row = 1; std::array samples{}; call_with_nulls( pod5_get_read_complete_signal, reader, batch, row, samples.size(), samples.data()); } } SECTION("Writer API") { { INFO("pod5_create_file") pod5_reset_error(); CHECK(pod5_create_file(nullptr, nullptr, nullptr) == nullptr); CHECK_THAT(pod5_get_error_string(), Contains("null string passed")); pod5_reset_error(); CHECK(pod5_create_file(temporary_filename, nullptr, nullptr) == nullptr); CHECK_THAT(pod5_get_error_string(), Contains("null string passed")); pod5_reset_error(); CHECK(pod5_create_file(nullptr, temporary_filename, nullptr) == nullptr); 
CHECK_THAT(pod5_get_error_string(), Contains("null string passed")); } { INFO("pod5_close_and_free_writer") pod5_reset_error(); CHECK_POD5_OK(pod5_close_and_free_writer(nullptr)); } // The rest of these functions require a writer. REQUIRE(remove_file_if_exists(temporary_filename).ok()); Pod5FileWriter_t * writer = pod5_create_file(temporary_filename, "c_software", nullptr); REQUIRE(writer); auto close_writer = gsl::finally([&writer] { pod5_close_and_free_writer(writer); }); { INFO("pod5_add_pore") int16_t pore_index{}; char const pore_type[] = "test"; call_with_nulls(pod5_add_pore, &pore_index, writer, pore_type); } { INFO("pod5_add_run_info") int16_t run_info_index{}; char const dummy_string[] = "test"; char const * acquisition_id = dummy_string; int64_t acquisition_start_time_ms = 1; int16_t adc_max = 1; int16_t adc_min = -1; char const * experiment_name = dummy_string; char const * flow_cell_id = dummy_string; char const * flow_cell_product_code = dummy_string; char const * protocol_name = dummy_string; char const * protocol_run_id = dummy_string; int64_t protocol_start_time_ms = 1; char const * sample_id = dummy_string; uint16_t sample_rate = 1; char const * sequencing_kit = dummy_string; char const * sequencer_position = dummy_string; char const * sequencer_position_type = dummy_string; char const * software = dummy_string; char const * system_name = dummy_string; char const * system_type = dummy_string; size_t context_tags_count = 1; char const * context_tags_keys[]{dummy_string}; char const * context_tags_values[]{dummy_string}; size_t tracking_id_count = 1; char const * tracking_id_keys[]{dummy_string}; char const * tracking_id_values[]{dummy_string}; call_with_nulls( pod5_add_run_info, &run_info_index, writer, acquisition_id, acquisition_start_time_ms, adc_max, adc_min, context_tags_count, context_tags_keys, context_tags_values, experiment_name, flow_cell_id, flow_cell_product_code, protocol_name, protocol_run_id, protocol_start_time_ms, sample_id, 
sample_rate, sequencing_kit, sequencer_position, sequencer_position_type, software, system_name, system_type, tracking_id_count, tracking_id_keys, tracking_id_values); } { INFO("pod5_add_reads_data") uint32_t count = 1; uint16_t version = READ_BATCH_ROW_INFO_VERSION; ReadBatchRowInfoArray_t row_info{}; int16_t const signal[]{1, 2, 3, 4, 5}; int16_t const * signals[]{signal}; uint32_t const signal_size = std::size(signal); call_with_nulls( pod5_add_reads_data, writer, count, version, static_cast(&row_info), signals, &signal_size); } { INFO("pod5_add_reads_data_pre_compressed") uint32_t count = 1; uint16_t version = READ_BATCH_ROW_INFO_VERSION; ReadBatchRowInfoArray_t row_info{}; char const read0_compressed_signal_chunk0[]{1, 2, 3, 4, 5}; char const * read0_compressed_signal[]{read0_compressed_signal_chunk0}; size_t const read0_compressed_signal_sizes[]{std::size(read0_compressed_signal_chunk0)}; uint32_t const read0_sample_counts[]{3}; size_t const read0_signal_chunk_count = std::size(read0_compressed_signal); char const ** compressed_signals[]{read0_compressed_signal}; size_t const * compressed_signal_sizes[]{read0_compressed_signal_sizes}; uint32_t const * sample_counts[]{read0_sample_counts}; size_t const signal_chunk_counts[]{read0_signal_chunk_count}; call_with_nulls( pod5_add_reads_data_pre_compressed, writer, count, version, static_cast(&row_info), compressed_signals, compressed_signal_sizes, sample_counts, signal_chunk_counts); } } SECTION("VBZ API") { { INFO("pod5_vbz_compress_signal") std::array const signal{}; std::array compressed{}; size_t compressed_size = compressed.size(); call_with_nulls( pod5_vbz_compress_signal, signal.data(), signal.size(), compressed.data(), &compressed_size); } { INFO("pod5_vbz_decompress_signal") std::array const compressed{}; std::array signal{}; call_with_nulls( pod5_vbz_decompress_signal, compressed.data(), compressed.size(), signal.size(), signal.data()); } } SECTION("Misc API") { { INFO("pod5_format_read_id") read_id_t 
const read_id{}; char * string = nullptr; call_with_nulls(pod5_format_read_id, read_id, string); } } } } // namespace ================================================ FILE: c++/test/c_api_test_utils.h ================================================ #pragma once #include "pod5_format/c_api.h" #include #include #define CHECK_POD5_OK(statement) \ do { \ auto const & _res = (statement); \ CHECK_THAT(testutils::Pod5C_Result::capture(_res), testutils::IsPod5COk()); \ } while (false) #define REQUIRE_POD5_OK(statement) \ do { \ auto const & _res = (statement); \ REQUIRE_THAT(testutils::Pod5C_Result::capture(_res), testutils::IsPod5COk()); \ } while (false) #define CHECK_POD5_NOT_OK(statement) \ do { \ auto const & _res = (statement); \ CHECK_THAT(testutils::Pod5C_Result::capture(_res), !testutils::IsPod5COk()); \ } while (false) #define REQUIRE_POD5_NOT_OK(statement) \ do { \ auto const & _res = (statement); \ REQUIRE_THAT(testutils::Pod5C_Result::capture(_res), !testutils::IsPod5COk()); \ } while (false) namespace testutils { struct Pod5C_Result { static Pod5C_Result capture(pod5_error_t err_num) { return Pod5C_Result{err_num, pod5_get_error_string()}; } pod5_error_t error_code; std::string error_string; }; class IsPod5COk : public Catch::MatcherBase { public: IsPod5COk() = default; bool match(Pod5C_Result const & result) const override { return result.error_code == POD5_OK; } virtual std::string describe() const override { return "== POD5_OK"; } }; } // namespace testutils template <> struct Catch::StringMaker { static std::string convert(testutils::Pod5C_Result const & value) { return "{ code: " + std::to_string(value.error_code) + "| " + value.error_string + " }"; } }; ================================================ FILE: c++/test/c_api_tests.cpp ================================================ #include "pod5_format/c_api.h" #include "c_api_test_utils.h" #include "pod5_format/file_reader.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/uuid.h" 
#include "pod5_format/version.h" #include "utils.h" #include #include #include #include #include struct Pod5ReadId { Pod5ReadId() = default; Pod5ReadId(pod5::Uuid const & uid) { uid.to_c_array(read_id); } pod5::Uuid as_uuid() const { return pod5::Uuid{read_id}; } bool operator==(Pod5ReadId const & other) const { return as_uuid() == other.as_uuid(); } read_id_t read_id; }; std::ostream & operator<<(std::ostream & str, Pod5ReadId rid) { return str << rid.as_uuid(); } SCENARIO("C API Reads") { static constexpr char const * filename = "./foo_c_api.pod5"; pod5_init(); auto fin = gsl::finally([] { pod5_terminate(); }); std::mt19937 gen{Catch::rngSeed()}; auto uuid_gen = pod5::UuidRandomGenerator{gen}; auto input_read_id = uuid_gen(); auto input_read_id_2 = uuid_gen(); std::vector signal_1(10); std::iota(signal_1.begin(), signal_1.end(), -20000); std::vector signal_2(20); std::iota(signal_2.begin(), signal_2.end(), 0); std::int16_t adc_min = -4096; std::int16_t adc_max = 4095; float calibration_offset = 54.0f; float calibration_scale = 100.0f; float predicted_scale = 2.3f; float predicted_shift = 10.0f; float tracked_scale = 4.3f; float tracked_shift = 15.0f; std::uint32_t num_reads_since_mux_change = 1234; float time_since_mux_change = 2.4f; float open_pore_level = 123.0f; std::uint64_t num_minknow_events = 104; // Write the file: { CHECK_POD5_OK(pod5_get_error_no()); CHECK_FALSE(pod5_create_file(NULL, "c_software", NULL)); CHECK(pod5_get_error_no() == POD5_ERROR_INVALID); CHECK_FALSE(pod5_create_file("", "c_software", NULL)); CHECK(pod5_get_error_no() == POD5_ERROR_INVALID); CHECK_FALSE(pod5_create_file("", NULL, NULL)); CHECK(pod5_get_error_no() == POD5_ERROR_INVALID); REQUIRE(remove_file_if_exists(filename).ok()); auto file = pod5_create_file(filename, "c_software", NULL); REQUIRE(file); CHECK_POD5_OK(pod5_get_error_no()); std::int16_t pore_type_id = -1; CHECK_POD5_OK(pod5_add_pore(&pore_type_id, file, "pore_type")); CHECK(pore_type_id == 0); std::vector 
context_tags_keys{"thing", "foo"}; std::vector context_tags_values{"thing_val", "foo_val"}; std::vector tracking_id_keys{"baz", "other"}; std::vector tracking_id_values{"baz_val", "other_val"}; std::uint32_t read_number = 12; std::uint64_t start_sample = 10245; float median_before = 200.0f; std::uint16_t channel = 43; std::uint8_t well = 4; pod5_end_reason_t end_reason = POD5_END_REASON_MUX_CHANGE; uint8_t end_reason_forced = false; auto read_id_array = (read_id_t const *)input_read_id.data(); std::int16_t run_info_id = 0; ReadBatchRowInfoArrayV4 row_data{ read_id_array, &read_number, &start_sample, &median_before, &channel, &well, &pore_type_id, &calibration_offset, &calibration_scale, &end_reason, &end_reason_forced, &run_info_id, &num_minknow_events, &tracked_scale, &tracked_shift, &predicted_scale, &predicted_shift, &num_reads_since_mux_change, &time_since_mux_change, &open_pore_level}; std::int16_t const * signal_arr[] = {signal_1.data()}; std::uint32_t signal_size[] = {(std::uint32_t)signal_1.size()}; // Referencing a non-existent run id should fail: CHECK( pod5_add_reads_data( file, 1, READ_BATCH_ROW_INFO_VERSION_4, &row_data, signal_arr, signal_size) == POD5_ERROR_INVALID); // Now actually add the run info: CHECK_POD5_OK(pod5_add_run_info( &run_info_id, file, "acquisition_id", 15400, adc_max, adc_min, context_tags_keys.size(), context_tags_keys.data(), context_tags_values.data(), "experiment_name", "flow_cell_id", "flow_cell_product_code", "protocol_name", "protocol_run_id", 200000, "sample_id", 4000, "sequencing_kit", "sequencer_position", "sequencer_position_type", "software", "system_name", "system_type", tracking_id_keys.size(), tracking_id_keys.data(), tracking_id_values.data())); CHECK(run_info_id == 0); CHECK_POD5_OK(pod5_add_reads_data( file, 1, READ_BATCH_ROW_INFO_VERSION_4, &row_data, signal_arr, signal_size)); { auto compressed_read_max_size = pod5_vbz_compressed_signal_max_size(signal_2.size()); std::vector 
compressed_signal(compressed_read_max_size); char const * compressed_data[] = {compressed_signal.data()}; char const ** compressed_data_ptr = compressed_data; std::size_t compressed_size[] = {compressed_signal.size()}; std::size_t const * compressed_size_ptr = compressed_size; std::uint32_t signal_size[] = {(std::uint32_t)signal_2.size()}; std::uint32_t const * signal_size_ptr = signal_size; pod5_vbz_compress_signal( signal_2.data(), signal_2.size(), compressed_signal.data(), compressed_size); std::size_t signal_counts = 1; auto read_id_array = (read_id_t const *)input_read_id_2.data(); ReadBatchRowInfoArrayV3 row_data_v3{ read_id_array, &read_number, &start_sample, &median_before, &channel, &well, &pore_type_id, &calibration_offset, &calibration_scale, &end_reason, &end_reason_forced, &run_info_id, &num_minknow_events, &tracked_scale, &tracked_shift, &predicted_scale, &predicted_shift, &num_reads_since_mux_change, &time_since_mux_change}; CHECK_POD5_OK(pod5_add_reads_data_pre_compressed( file, 1, READ_BATCH_ROW_INFO_VERSION_3, &row_data_v3, &compressed_data_ptr, &compressed_size_ptr, &signal_size_ptr, &signal_counts)); } CHECK_POD5_OK(pod5_close_and_free_writer(file)); CHECK_POD5_OK(pod5_get_error_no()); } // Read the file back: { CHECK_POD5_OK(pod5_get_error_no()); CHECK_FALSE(pod5_open_file(NULL)); auto file = pod5_open_file(filename); CHECK_POD5_OK(pod5_get_error_no()); CHECK(file); FileInfo_t file_info; CHECK_POD5_OK(pod5_get_file_info(file, &file_info)); CHECK(file_info.version.major == pod5::Pod5MajorVersion); CHECK(file_info.version.minor == pod5::Pod5MinorVersion); CHECK(file_info.version.revision == pod5::Pod5RevVersion); { auto reader = pod5::open_file_reader(filename); pod5::Uuid file_identifier{file_info.file_identifier}; CHECK(file_identifier == (*reader)->schema_metadata().file_identifier); } std::size_t read_count = 0; CHECK_POD5_OK(pod5_get_read_count(file, &read_count)); REQUIRE(read_count == 2); std::vector read_ids(2); 
CHECK(pod5_get_read_ids(file, 1, (read_id_t *)read_ids.data()) != POD5_OK); CHECK_POD5_OK(pod5_get_read_ids(file, read_ids.size(), (read_id_t *)read_ids.data())); std::vector expected_read_ids{input_read_id, input_read_id_2}; CHECK(read_ids == expected_read_ids); std::size_t batch_count = 0; CHECK_POD5_OK(pod5_get_read_batch_count(&batch_count, file)); REQUIRE(batch_count == 1); Pod5ReadRecordBatch * batch_0 = nullptr; CHECK_POD5_OK(pod5_get_read_batch(&batch_0, file, 0)); REQUIRE(batch_0); std::size_t row_count = 0; CHECK_POD5_OK(pod5_get_read_batch_row_count(&row_count, batch_0)); REQUIRE(row_count == 2); // Check out of bounds accesses get errors { ReadBatchRowInfoV4 v3_struct; uint16_t input_version = 0; CHECK( pod5_get_read_batch_row_info_data( batch_0, row_count, READ_BATCH_ROW_INFO_VERSION, &v3_struct, &input_version) == POD5_ERROR_INDEXERROR); std::vector signal_row_indices{1}; CHECK( pod5_get_signal_row_indices( batch_0, row_count, signal_row_indices.size(), signal_row_indices.data()) == POD5_ERROR_INDEXERROR); CalibrationExtraData calibration_extra_data{}; CHECK( pod5_get_calibration_extra_info(batch_0, row_count, &calibration_extra_data) == POD5_ERROR_INDEXERROR); } for (std::size_t row = 0; row < row_count; ++row) { CAPTURE(row); auto signal = signal_1; if (row == 1) { signal = signal_2; } static_assert( std::is_same::value, "Update this if new structs added"); ReadBatchRowInfoV3 v3_struct; ReadBatchRowInfoV4 v4_struct; uint16_t input_version = 0; CHECK_POD5_OK(pod5_get_read_batch_row_info_data( batch_0, row, READ_BATCH_ROW_INFO_VERSION_3, &v3_struct, &input_version)); CHECK( input_version == 4); // We're reading from a v4 file, even if the input struct is v3. 
CHECK_POD5_OK(pod5_get_read_batch_row_info_data( batch_0, row, READ_BATCH_ROW_INFO_VERSION_4, &v4_struct, &input_version)); CHECK(input_version == 4); auto check_v3_or_v4 = [&](auto name, auto const & input_struct) { CAPTURE(name); std::string formatted_uuid(36, '\0'); CHECK_POD5_OK(pod5_format_read_id(input_struct.read_id, &formatted_uuid[0])); CHECK( formatted_uuid == to_string(*reinterpret_cast(input_struct.read_id))); CHECK(input_struct.read_number == 12); CHECK(input_struct.start_sample == 10245); CHECK(input_struct.median_before == 200.0f); CHECK(input_struct.channel == 43); CHECK(input_struct.well == 4); CHECK(input_struct.pore_type == 0); CHECK(input_struct.calibration_offset == calibration_offset); CHECK(input_struct.calibration_scale == calibration_scale); CHECK(input_struct.end_reason == 1); CHECK(input_struct.end_reason_forced == uint8_t{false}); CHECK(input_struct.run_info == 0); CHECK(input_struct.num_minknow_events == num_minknow_events); CHECK(input_struct.tracked_scaling_scale == tracked_scale); CHECK(input_struct.tracked_scaling_shift == tracked_shift); CHECK(input_struct.predicted_scaling_scale == predicted_scale); CHECK(input_struct.predicted_scaling_shift == predicted_shift); CHECK(input_struct.num_reads_since_mux_change == num_reads_since_mux_change); CHECK(input_struct.time_since_mux_change == time_since_mux_change); CHECK(input_struct.signal_row_count == 1); CHECK(input_struct.num_samples == signal.size()); }; check_v3_or_v4("v3", v3_struct); check_v3_or_v4("v4", v4_struct); if (row == 0) { CHECK(v4_struct.open_pore_level == open_pore_level); } else { CHECK(std::isnan(v4_struct.open_pore_level)); } std::vector signal_row_indices(v3_struct.signal_row_count); CHECK_POD5_OK(pod5_get_signal_row_indices( batch_0, row, signal_row_indices.size(), signal_row_indices.data())); std::vector signal_row_info(v3_struct.signal_row_count); CHECK_POD5_OK(pod5_get_signal_row_info( file, signal_row_indices.size(), signal_row_indices.data(), 
signal_row_info.data())); std::vector read_signal(signal_row_info.front()->stored_sample_count); REQUIRE(signal_row_info.front()->stored_sample_count == signal.size()); CHECK_POD5_OK(pod5_get_signal( file, signal_row_info.front(), signal_row_info.front()->stored_sample_count, read_signal.data())); CHECK(read_signal == signal); std::size_t sample_count = 0; CHECK_POD5_OK(pod5_get_read_complete_sample_count(file, batch_0, row, &sample_count)); CHECK(sample_count == signal_row_info.front()->stored_sample_count); CHECK_POD5_OK(pod5_get_read_complete_signal( file, batch_0, row, sample_count, read_signal.data())); CHECK(read_signal == signal); CHECK_POD5_OK( pod5_free_signal_row_info(signal_row_indices.size(), signal_row_info.data())); std::string expected_pore_type{"pore_type"}; std::array char_buffer{}; std::size_t returned_size = 2; // deliberately too short! { CHECK( pod5_get_pore_type( batch_0, v3_struct.pore_type, char_buffer.data(), &returned_size) == POD5_ERROR_STRING_NOT_LONG_ENOUGH); CHECK(returned_size == expected_pore_type.size() + 1); } { returned_size = char_buffer.size(); CHECK_POD5_OK(pod5_get_pore_type( batch_0, v3_struct.pore_type, char_buffer.data(), &returned_size)); CHECK(returned_size == expected_pore_type.size() + 1); CHECK(std::string{char_buffer.data()} == expected_pore_type); } { returned_size = char_buffer.size(); CHECK( pod5_get_pore_type(batch_0, -1, char_buffer.data(), &returned_size) == POD5_ERROR_INDEXERROR); CHECK(returned_size == char_buffer.size()); } std::string expected_end_reason{"mux_change"}; { returned_size = 2; // deliberately too short! 
pod5_end_reason end_reason = POD5_END_REASON_UNKNOWN; CHECK( pod5_get_end_reason( batch_0, v3_struct.end_reason, &end_reason, char_buffer.data(), &returned_size) == POD5_ERROR_STRING_NOT_LONG_ENOUGH); CHECK(returned_size == expected_end_reason.size() + 1); } { returned_size = char_buffer.size(); pod5_end_reason end_reason = POD5_END_REASON_UNKNOWN; CHECK_POD5_OK(pod5_get_end_reason( batch_0, v3_struct.end_reason, &end_reason, char_buffer.data(), &returned_size)); CHECK(returned_size == expected_end_reason.size() + 1); CHECK(end_reason == POD5_END_REASON_MUX_CHANGE); CHECK(std::string{char_buffer.data()} == expected_end_reason); } // Check getting with an invalid input end reason index: { returned_size = char_buffer.size(); pod5_end_reason end_reason = POD5_END_REASON_UNKNOWN; CHECK( pod5_get_end_reason( batch_0, v3_struct.end_reason + 100, &end_reason, char_buffer.data(), &returned_size) == POD5_ERROR_INDEXERROR); CHECK(returned_size == char_buffer.size()); CHECK(end_reason == POD5_END_REASON_UNKNOWN); } CalibrationExtraData calibration_extra_data{}; CHECK_POD5_OK(pod5_get_calibration_extra_info(batch_0, row, &calibration_extra_data)); CHECK(calibration_extra_data.digitisation == adc_max - adc_min + 1); CHECK(calibration_extra_data.range == 8192 * calibration_scale); } SECTION("Embedded files") { for (auto [get_file_location, name] : { std::make_tuple( pod5_get_file_read_table_location, "pod5_get_file_read_table_location"), std::make_tuple( pod5_get_file_signal_table_location, "pod5_get_file_signal_table_location"), std::make_tuple( pod5_get_file_run_info_table_location, "pod5_get_file_run_info_table_location"), }) { CAPTURE(name); EmbeddedFileData_t embedded_file_data{}; CHECK_POD5_OK(get_file_location(file, &embedded_file_data)); REQUIRE(embedded_file_data.file_name != nullptr); CHECK(embedded_file_data.file_name == std::string_view{filename}); CHECK(embedded_file_data.offset > 0); CHECK(embedded_file_data.length > 0); } } run_info_index_t run_info_count = 0; 
CHECK_POD5_OK(pod5_get_file_run_info_count(file, &run_info_count)); REQUIRE(run_info_count == 1); // Check getting invalid run info indexes fails correctly. RunInfoDictData * run_info_error = nullptr; CHECK(pod5_get_run_info(batch_0, -1, &run_info_error) == POD5_ERROR_INDEXERROR); CHECK_FALSE(run_info_error); CHECK(pod5_get_run_info(batch_0, run_info_count, &run_info_error) == POD5_ERROR_INDEXERROR); CHECK_FALSE(run_info_error); CHECK(pod5_get_file_run_info(file, -1, &run_info_error) == POD5_ERROR_INDEXERROR); CHECK_FALSE(run_info_error); CHECK( pod5_get_file_run_info(file, run_info_count, &run_info_error) == POD5_ERROR_INDEXERROR); CHECK_FALSE(run_info_error); auto check_run_info = [](RunInfoDictData * run_info) { REQUIRE(run_info); CHECK(run_info->tracking_id.size == 2); CHECK(run_info->tracking_id.keys[0] == std::string("baz")); CHECK(run_info->tracking_id.keys[1] == std::string("other")); CHECK(run_info->tracking_id.values[0] == std::string("baz_val")); CHECK(run_info->tracking_id.values[1] == std::string("other_val")); CHECK(run_info->context_tags.size == 2); CHECK(run_info->context_tags.keys[0] == std::string("thing")); CHECK(run_info->context_tags.keys[1] == std::string("foo")); CHECK(run_info->context_tags.values[0] == std::string("thing_val")); CHECK(run_info->context_tags.values[1] == std::string("foo_val")); }; RunInfoDictData * run_info_data_out_1 = nullptr; CHECK_POD5_OK(pod5_get_file_run_info(file, 0, &run_info_data_out_1)); check_run_info(run_info_data_out_1); pod5_free_run_info(run_info_data_out_1); RunInfoDictData * run_info_data_out_2 = nullptr; CHECK_POD5_OK(pod5_get_run_info(batch_0, 0, &run_info_data_out_2)); check_run_info(run_info_data_out_2); pod5_free_run_info(run_info_data_out_2); pod5_free_read_batch(batch_0); pod5_close_and_free_reader(file); CHECK_POD5_OK(pod5_get_error_no()); } } SCENARIO("C API Many Reads") { static constexpr char const * filename = "./foo_c_api.pod5"; pod5_init(); auto fin = gsl::finally([] { pod5_terminate(); }); 
std::mt19937 gen{Catch::rngSeed()}; auto uuid_gen = pod5::UuidRandomGenerator{gen}; std::vector signal_1(10); std::iota(signal_1.begin(), signal_1.end(), -20000); std::vector signal_2(20); std::iota(signal_2.begin(), signal_2.end(), 0); std::size_t const read_count = 10037; std::int16_t const adc_min = -4096; std::int16_t const adc_max = 4095; std::vector read_id_array(read_count); std::generate(read_id_array.begin(), read_id_array.end(), uuid_gen); // Write the file: { CHECK_POD5_OK(pod5_get_error_no()); CHECK_FALSE(pod5_create_file(NULL, "c_software", NULL)); CHECK(pod5_get_error_no() == POD5_ERROR_INVALID); CHECK_FALSE(pod5_create_file("", "c_software", NULL)); CHECK(pod5_get_error_no() == POD5_ERROR_INVALID); CHECK_FALSE(pod5_create_file("", NULL, NULL)); CHECK(pod5_get_error_no() == POD5_ERROR_INVALID); REQUIRE(remove_file_if_exists(filename).ok()); auto file = pod5_create_file(filename, "c_software", NULL); REQUIRE(file); CHECK_POD5_OK(pod5_get_error_no()); std::int16_t pore_type_id = -1; CHECK_POD5_OK(pod5_add_pore(&pore_type_id, file, "pore_type")); CHECK(pore_type_id == 0); std::vector context_tags_keys{"thing", "foo"}; std::vector context_tags_values{"thing_val", "foo_val"}; std::vector tracking_id_keys{"baz", "other"}; std::vector tracking_id_values{"baz_val", "other_val"}; std::int16_t run_info_id = -1; CHECK_POD5_OK(pod5_add_run_info( &run_info_id, file, "acquisition_id", 15400, adc_max, adc_min, context_tags_keys.size(), context_tags_keys.data(), context_tags_values.data(), "experiment_name", "flow_cell_id", "flow_cell_product_code", "protocol_name", "protocol_run_id", 200000, "sample_id", 4000, "sequencing_kit", "sequencer_position", "sequencer_position_type", "software", "system_name", "system_type", tracking_id_keys.size(), tracking_id_keys.data(), tracking_id_values.data())); CHECK(run_info_id == 0); std::vector read_number(read_count, 12); std::vector start_sample(read_count, 10245); std::vector median_before(read_count, 200.0f); std::vector 
channel(read_count, 43); std::vector well(read_count, 4); std::vector end_reason(read_count, POD5_END_REASON_MUX_CHANGE); std::vector end_reason_forced(read_count, false); std::vector calibration_offset(read_count, 54.0f); std::vector calibration_scale(read_count, 100.0f); std::vector predicted_scale(read_count, 2.3f); std::vector predicted_shift(read_count, 10.0f); std::vector tracked_scale(read_count, 4.3f); std::vector tracked_shift(read_count, 15.0f); std::vector num_reads_since_mux_change(read_count, 1234); std::vector time_since_mux_change(read_count, 2.4f); std::vector open_pore_level(read_count, 123.0f); std::vector num_minknow_events(read_count, 104); std::vector pore_type_ids(read_count, pore_type_id); std::vector run_info_ids(read_count, run_info_id); std::vector signal_arr; std::vector signal_size; ReadBatchRowInfoArrayV4 row_data{ (read_id_t *)read_id_array.data(), read_number.data(), start_sample.data(), median_before.data(), channel.data(), well.data(), pore_type_ids.data(), calibration_offset.data(), calibration_scale.data(), end_reason.data(), end_reason_forced.data(), run_info_ids.data(), num_minknow_events.data(), tracked_scale.data(), tracked_shift.data(), predicted_scale.data(), predicted_shift.data(), num_reads_since_mux_change.data(), time_since_mux_change.data(), open_pore_level.data()}; for (std::size_t i = 0; i < read_count; ++i) { signal_arr.push_back(signal_1.data()); signal_size.push_back((std::uint32_t)signal_1.size()); } CHECK_POD5_OK(pod5_add_reads_data( file, read_count, READ_BATCH_ROW_INFO_VERSION_3, &row_data, signal_arr.data(), signal_size.data())); CHECK_POD5_OK(pod5_close_and_free_writer(file)); CHECK_POD5_OK(pod5_get_error_no()); } // Read the file back: { Pod5ReaderOptions_t options{}; options.force_disable_file_mapping = true; CHECK_POD5_OK(pod5_get_error_no()); CHECK_FALSE(pod5_open_file_options(NULL, &options)); CHECK_FALSE(pod5_open_file_options(filename, NULL)); auto file = pod5_open_file_options(filename, &options); 
CHECK_POD5_OK(pod5_get_error_no()); CHECK(file); FileInfo_t file_info; CHECK_POD5_OK(pod5_get_file_info(file, &file_info)); CHECK(file_info.version.major == pod5::Pod5MajorVersion); CHECK(file_info.version.minor == pod5::Pod5MinorVersion); CHECK(file_info.version.revision == pod5::Pod5RevVersion); { auto reader = pod5::open_file_reader(filename); pod5::Uuid file_identifier{file_info.file_identifier}; CHECK(file_identifier == (*reader)->schema_metadata().file_identifier); } std::size_t read_count_returned = 0; CHECK_POD5_OK(pod5_get_read_count(file, &read_count_returned)); REQUIRE(read_count_returned == read_count); // Randomise the order of the read IDs and then try and plan a path through them. std::shuffle(read_id_array.begin(), read_id_array.end(), gen); std::vector batch_counts(read_count); std::vector batch_rows(read_count); std::size_t find_success_count = 0; CHECK_POD5_OK(pod5_plan_traversal( file, reinterpret_cast(read_id_array.data()), read_count, batch_counts.data(), batch_rows.data(), &find_success_count)); REQUIRE(find_success_count == read_count); CHECK_POD5_OK(pod5_close_and_free_reader(file)); } } SCENARIO("C API Run Info") { static constexpr char const * filename = "./foo_c_api.pod5"; pod5_init(); auto fin = gsl::finally([] { pod5_terminate(); }); std::int16_t adc_min = -4096; std::int16_t adc_max = 4095; auto expected_acq_id = [](std::size_t index) { std::string acquisition_id{"acquisition_id_"}; acquisition_id += std::to_string(index); return acquisition_id; }; // Write the file: { REQUIRE(remove_file_if_exists(filename).ok()); auto file = pod5_create_file(filename, "c_software", NULL); REQUIRE(file); CHECK_POD5_OK(pod5_get_error_no()); std::vector context_tags_keys{"thing", "foo"}; std::vector context_tags_values{"thing_val", "foo_val"}; std::vector tracking_id_keys{"baz", "other"}; std::vector tracking_id_values{"baz_val", "other_val"}; for (std::size_t i = 0; i < 10; ++i) { std::int16_t run_info_id = -1; CHECK_POD5_OK(pod5_add_run_info( 
&run_info_id, file, expected_acq_id(i).c_str(), 15400, adc_max, adc_min, context_tags_keys.size(), context_tags_keys.data(), context_tags_values.data(), "experiment_name", "flow_cell_id", "flow_cell_product_code", "protocol_name", "protocol_run_id", 200000, "sample_id", 4000, "sequencing_kit", "sequencer_position", "sequencer_position_type", "software", "system_name", "system_type", tracking_id_keys.size(), tracking_id_keys.data(), tracking_id_values.data())); CHECK(run_info_id == static_cast(i)); } CHECK_POD5_OK(pod5_close_and_free_writer(file)); } // Read the file back: { CHECK_POD5_OK(pod5_get_error_no()); CHECK_FALSE(pod5_open_file(NULL)); auto file = pod5_open_file(filename); CHECK_POD5_OK(pod5_get_error_no()); CHECK(pod5_get_error_string() == std::string{""}); REQUIRE(file); run_info_index_t run_info_count = 0; CHECK_POD5_OK(pod5_get_file_run_info_count(file, &run_info_count)); REQUIRE(run_info_count == 10); for (run_info_index_t i = 0; i < 10; ++i) { RunInfoDictData * run_info_data_out = nullptr; CHECK_POD5_OK(pod5_get_file_run_info(file, i, &run_info_data_out)); CHECK(run_info_data_out->acquisition_id == expected_acq_id(i)); pod5_free_run_info(run_info_data_out); } CHECK_POD5_OK(pod5_close_and_free_reader(file)); } } TEST_CASE("Missing file passed to pod5_open_file") { pod5_init(); auto cleanup = gsl::finally([] { pod5_terminate(); }); static constexpr char const temporary_filename[] = "./foo_c_api.pod5"; REQUIRE(remove_file_if_exists(temporary_filename).ok()); CHECK(pod5_open_file(temporary_filename) == nullptr); } TEST_CASE("Existing file passed to pod5_create_file") { pod5_init(); auto cleanup = gsl::finally([] { pod5_terminate(); }); static constexpr char const temporary_filename[] = "./foo_c_api.pod5"; REQUIRE(remove_file_if_exists(temporary_filename).ok()); // Create it once. 
Pod5FileWriter_t * writer = pod5_create_file(temporary_filename, "c_software", nullptr); REQUIRE_POD5_OK(pod5_get_error_no()); REQUIRE(writer != nullptr); REQUIRE_POD5_OK(pod5_close_and_free_writer(writer)); // File already exists so this should fail. CHECK(pod5_create_file(temporary_filename, "c_software", nullptr) == nullptr); } TEST_CASE("pod5_create_file with options") { pod5_init(); auto cleanup = gsl::finally([] { pod5_terminate(); }); static constexpr char const temporary_filename[] = "./foo_c_api.pod5"; REQUIRE(remove_file_if_exists(temporary_filename).ok()); Pod5WriterOptions_t test_options{}; Pod5WriterOptions_t const * options = nullptr; bool const with_options = GENERATE(false, true); if (with_options) { options = &test_options; } else { test_options.max_signal_chunk_size = GENERATE(0, 1, 2); test_options.signal_compression_type = GENERATE( CompressionOption::DEFAULT_SIGNAL_COMPRESSION, CompressionOption::VBZ_SIGNAL_COMPRESSION, CompressionOption::UNCOMPRESSED_SIGNAL); test_options.signal_table_batch_size = GENERATE(0, 1, 2); test_options.read_table_batch_size = GENERATE(0, 1, 2); } CAPTURE( with_options, test_options.max_signal_chunk_size, test_options.signal_compression_type, test_options.signal_table_batch_size, test_options.read_table_batch_size); Pod5FileWriter_t * writer = pod5_create_file(temporary_filename, "c_software", options); REQUIRE_POD5_OK(pod5_get_error_no()); REQUIRE(writer != nullptr); REQUIRE_POD5_OK(pod5_close_and_free_writer(writer)); } TEST_CASE("VBZ compression") { pod5_init(); auto cleanup = gsl::finally([] { pod5_terminate(); }); std::size_t const sample_count = 20; std::vector input_signal(sample_count); std::iota(input_signal.begin(), input_signal.end(), -sample_count / 2); // Determine max size. std::size_t const compressed_read_max_size = pod5_vbz_compressed_signal_max_size(input_signal.size()); REQUIRE(compressed_read_max_size > 0); // Compress it. 
std::vector compressed_signal(compressed_read_max_size); std::size_t compressed_size = compressed_read_max_size; REQUIRE_POD5_OK(pod5_vbz_compress_signal( input_signal.data(), input_signal.size(), compressed_signal.data(), &compressed_size)); REQUIRE(compressed_size <= compressed_read_max_size); compressed_signal.resize(compressed_size); // Decompress it. std::vector output_signal(sample_count); REQUIRE_POD5_OK(pod5_vbz_decompress_signal( compressed_signal.data(), compressed_signal.size(), sample_count, output_signal.data())); REQUIRE(input_signal == output_signal); // Providing incorrect buffer sizes should fail rather than crash. CHECK_POD5_NOT_OK(pod5_vbz_decompress_signal( compressed_signal.data(), compressed_signal.size(), sample_count * 2, output_signal.data())); std::size_t bad_compressed_size = compressed_size / 2; CHECK_POD5_NOT_OK(pod5_vbz_compress_signal( input_signal.data(), input_signal.size(), compressed_signal.data(), &bad_compressed_size)); // Going over the maximum size should produce an error. 
size_t const max_size_error = pod5_vbz_compressed_signal_max_size(std::uint64_t{1} << 48); CHECK(max_size_error == 0); CHECK_POD5_NOT_OK(pod5_get_error_no()); } ================================================ FILE: c++/test/file_reader_writer_tests.cpp ================================================ #include "pod5_format/async_signal_loader.h" #include "pod5_format/file_reader.h" #include "pod5_format/file_writer.h" #include "pod5_format/internal/combined_file_utils.h" #include "pod5_format/read_table_reader.h" #include "pod5_format/signal_table_reader.h" #include "pod5_format/thread_pool.h" #include "pod5_format/uuid.h" #include "TemporaryDirectory.h" #include "test_utils.h" #include "utils.h" #include #include #include #include #include #include #include #include #include #include #include void run_file_reader_writer_tests( char const * file, pod5::FileWriterOptions const & extra_options = {}) { REQUIRE_ARROW_STATUS_OK(remove_file_if_exists(file)); (void)pod5::register_extension_types(); auto fin = gsl::finally([] { (void)pod5::unregister_extension_types(); }); auto const run_info_data = get_test_run_info_data("_run_info"); std::mt19937 gen{Catch::rngSeed()}; auto uuid_gen = pod5::UuidRandomGenerator{gen}; auto read_id_1 = uuid_gen(); std::uint16_t channel = 25; std::uint8_t well = 3; std::uint32_t read_number = 1234; std::uint64_t start_sample = 12340; std::uint64_t num_minknow_events = 27; float median_before = 224.0f; float calib_offset = 22.5f; float calib_scale = 1.2f; float tracked_scaling_scale = 2.3f; float tracked_scaling_shift = 100.0f; float predicted_scaling_scale = 1.5f; float predicted_scaling_shift = 50.0f; std::uint32_t num_reads_since_mux_change = 3; float time_since_mux_change = 200.0f; float open_pore_level = 150.0f; std::vector signal_1(100'000); std::iota(signal_1.begin(), signal_1.end(), 0); // Write a file: { pod5::FileWriterOptions options = extra_options; options.set_max_signal_chunk_size(20'480); options.set_read_table_batch_size(1); 
options.set_signal_table_batch_size(5); auto writer = pod5::create_file_writer(file, "test_software", options); REQUIRE_ARROW_STATUS_OK(writer); auto run_info = (*writer)->add_run_info(run_info_data); auto end_reason = (*writer)->lookup_end_reason(pod5::ReadEndReason::signal_negative); bool end_reason_forced = true; auto pore_type = (*writer)->add_pore_type("Pore_type"); for (std::size_t i = 0; i < 10; ++i) { CHECK_ARROW_STATUS_OK((*writer)->add_complete_read( {read_id_1, read_number, start_sample, channel, well, *pore_type, calib_offset, calib_scale, median_before, *end_reason, end_reason_forced, *run_info, num_minknow_events, tracked_scaling_scale, tracked_scaling_shift, predicted_scaling_scale, predicted_scaling_shift, num_reads_since_mux_change, time_since_mux_change, open_pore_level}, gsl::make_span(signal_1))); } } // Open the file for reading: { auto reader = pod5::open_file_reader(file, {}); REQUIRE_ARROW_STATUS_OK(reader); REQUIRE((*reader)->num_read_record_batches() == 10); for (std::size_t i = 0; i < 10; ++i) { auto read_batch = (*reader)->read_read_record_batch(i); REQUIRE_ARROW_STATUS_OK(read_batch); auto read_id_array = read_batch->read_id_column(); CHECK(read_id_array->length() == 1); CHECK(read_id_array->Value(0) == read_id_1); auto columns = *read_batch->columns(); auto const run_info_dict_index = std::dynamic_pointer_cast(columns.run_info->indices())->Value(0); CHECK(run_info_dict_index == 0); auto const run_info_id = read_batch->get_run_info(run_info_dict_index); CHECK(*run_info_id == run_info_data.acquisition_id); auto const run_info = (*reader)->find_run_info(*run_info_id); CHECK(**run_info == run_info_data); REQUIRE((*reader)->num_signal_record_batches() == 10); auto signal_batch = (*reader)->read_signal_record_batch(i); REQUIRE_ARROW_STATUS_OK(signal_batch); auto signal_read_id_array = signal_batch->read_id_column(); CHECK(signal_read_id_array->length() == 5); CHECK(signal_read_id_array->Value(0) == read_id_1);
CHECK(signal_read_id_array->Value(1) == read_id_1); CHECK(signal_read_id_array->Value(2) == read_id_1); CHECK(signal_read_id_array->Value(3) == read_id_1); CHECK(signal_read_id_array->Value(4) == read_id_1); auto vbz_signal_array = signal_batch->vbz_signal_column(); CHECK(vbz_signal_array->length() == 5); auto samples_array = signal_batch->samples_column(); CHECK(samples_array->Value(0) == 20'480); CHECK(samples_array->Value(1) == 20'480); CHECK(samples_array->Value(2) == 20'480); CHECK(samples_array->Value(3) == 20'480); CHECK(samples_array->Value(4) == 18'080); } auto const samples_mode = GENERATE( pod5::AsyncSignalLoader::SamplesMode::NoSamples, pod5::AsyncSignalLoader::SamplesMode::Samples); pod5::AsyncSignalLoader async_no_samples_loader( *reader, samples_mode, {}, // Read all the batches {} // No specific rows within batches ); for (std::size_t i = 0; i < 10; ++i) { CAPTURE(i); auto first_batch_res = async_no_samples_loader.release_next_batch(); REQUIRE_ARROW_STATUS_OK(first_batch_res); auto first_batch = std::move(*first_batch_res); CHECK(first_batch->batch_index() == i); CHECK(first_batch->sample_count().size() == 1); CHECK(first_batch->sample_count()[0] == signal_1.size()); CHECK(first_batch->samples().size() == 1); if (samples_mode == pod5::AsyncSignalLoader::SamplesMode::Samples) { CHECK(first_batch->samples()[0] == signal_1); } else { CHECK(first_batch->samples()[0].size() == 0); } } } } SCENARIO("File Reader Writer Tests") { run_file_reader_writer_tests("./foo.pod5"); } #ifdef __linux__ TEST_CASE("Additional make_file_stream() tests") { // When direct I/O is requested but the user's filesystem doesn't support it, // make_file_stream() should fall back to a "regular" FileOutputStream. // Because this test has to mount a disk, it can only be run by someone or something that // is effectively a root user.
if (::geteuid() != 0) { WARN("SKIPPING TEST: Need root privileges to mount a test drive."); return; } std::filesystem::path const dir_path = "./ramdisk_" + std::to_string(std::time(nullptr)); // Create and mount tmpfs drive. try { std::filesystem::create_directory(dir_path); auto const mount_cmd = std::string{"mount -o size=500M -t tmpfs none "} + dir_path.string(); auto const mount_return = std::system(mount_cmd.c_str()); if (mount_return != 0) { // we have seen this fail in CI where the test thinks it can mount // but CI fails to mount, just skip the test WARN("SKIPPING TEST: Need root privileges to mount a test drive."); return; } } catch (std::exception const & e) { FAIL("Failed to create and mount a tmpfs drive: " << e.what()); } auto const umount_cmd = std::string{"umount "} + dir_path.string(); auto remove_directory = gsl::finally([&] { std::filesystem::remove(dir_path); }); auto remove_mount = gsl::finally([&] { std::ignore = std::system(umount_cmd.c_str()); }); pod5::FileWriterOptions options_for_direct_io; options_for_direct_io.set_use_directio(true); options_for_direct_io.set_use_sync_io(true); options_for_direct_io.set_write_chunk_size(524288); try { auto const test_file_path = dir_path / "bar.pod5"; run_file_reader_writer_tests(test_file_path.c_str(), options_for_direct_io); } catch (std::exception const & e) { FAIL("Failed to run file reader/writer tests: " << e.what()); } } #endif SCENARIO("Opening older files") { (void)pod5::register_extension_types(); auto fin = gsl::finally([] { (void)pod5::unregister_extension_types(); }); auto uuid_from_string = [](char const * val) -> pod5::Uuid { auto result = pod5::Uuid::from_string(val); REQUIRE(result); return *result; }; struct ReadData { pod5::Uuid read_id; std::uint32_t read_number; float calibration_offset; float calibration_scale; std::string end_reason; std::string pore_type; std::string run_info_id; }; std::vector test_read_data{ {{uuid_from_string("0000173c-bf67-44e7-9a9c-1ad0bc728e74")}, 1093, 
21.0f, 0.1755f, "unknown", "not_set", "a08e850aaa44c8b56765eee10b386fc3e516a62b"}, {{uuid_from_string("002fde30-9e23-4125-9eae-d112c18a81a7")}, 75, 4.0f, 0.1755f, "unknown", "not_set", "a08e850aaa44c8b56765eee10b386fc3e516a62b"}, {{uuid_from_string("006d1319-2877-4b34-85df-34de7250a47b")}, 1053, 6.0f, 0.1755f, "unknown", "not_set", "a08e850aaa44c8b56765eee10b386fc3e516a62b"}, {{uuid_from_string("00728efb-2120-4224-87d8-580fbb0bd4b2")}, 657, 2.0f, 0.1755f, "unknown", "not_set", "a08e850aaa44c8b56765eee10b386fc3e516a62b"}, {{uuid_from_string("007cc97e-6de2-4ff6-a0fd-1c1eca816425")}, 1625, 23.0f, 0.1755f, "unknown", "not_set", "a08e850aaa44c8b56765eee10b386fc3e516a62b"}, {{uuid_from_string("008468c3-e477-46c4-a6e2-7d021a4ebf0b")}, 411, 4.0f, 0.1755f, "unknown", "not_set", "a08e850aaa44c8b56765eee10b386fc3e516a62b"}, {{uuid_from_string("008ed3dc-86c2-452f-b107-6877a473d177")}, 513, 5.0f, 0.1755f, "unknown", "not_set", "a08e850aaa44c8b56765eee10b386fc3e516a62b"}, {{uuid_from_string("00919556-e519-4960-8aa5-c2dfa020980c")}, 56, 2.0f, 0.1755f, "unknown", "not_set", "a08e850aaa44c8b56765eee10b386fc3e516a62b"}, {{uuid_from_string("00925f34-6baf-47fc-b40c-22591e27fb5c")}, 930, 37.0f, 0.1755f, "unknown", "not_set", "a08e850aaa44c8b56765eee10b386fc3e516a62b"}, {{uuid_from_string("009dc9bd-c5f4-487b-ba4c-b9ce7e3a711e")}, 195, 14.0f, 0.1755f, "unknown", "not_set", "a08e850aaa44c8b56765eee10b386fc3e516a62b"}, }; auto repo_root = ::arrow::internal::PlatformFilename::FromString(__FILE__)->Parent().Parent().Parent(); auto path = GENERATE_COPY( *repo_root.Join("test_data/multi_fast5_zip_v0.pod5"), *repo_root.Join("test_data/multi_fast5_zip_v1.pod5"), *repo_root.Join("test_data/multi_fast5_zip_v2.pod5"), *repo_root.Join("test_data/multi_fast5_zip_v3.pod5"), *repo_root.Join("test_data/multi_fast5_zip_v4.pod5")); auto reader = pod5::open_file_reader(path.ToString(), {}); CHECK_ARROW_STATUS_OK(reader); auto metadata = (*reader)->schema_metadata(); CHECK(metadata.writing_software == 
"Python API"); std::size_t abs_row = 0; for (std::size_t i = 0; i < (*reader)->num_read_record_batches(); ++i) { auto batch = (*reader)->read_read_record_batch(i); auto columns = batch->columns(); REQUIRE_ARROW_STATUS_OK(columns); for (std::size_t row = 0; row < batch->num_rows(); ++row) { CAPTURE(abs_row); auto read_data = test_read_data[row]; CHECK(columns->read_id->Value(row) == read_data.read_id); CHECK(columns->read_number->Value(row) == read_data.read_number); CHECK(columns->calibration_offset->Value(row) == read_data.calibration_offset); CHECK(columns->calibration_scale->Value(row) == Approx(read_data.calibration_scale)); auto end_reason = *batch->get_end_reason(columns->end_reason->GetValueIndex(row)); CHECK(end_reason.first == pod5::end_reason_from_string(read_data.end_reason)); CHECK(end_reason.second == read_data.end_reason); auto pore_type = batch->get_pore_type(columns->pore_type->GetValueIndex(row)); CHECK(*pore_type == read_data.pore_type); auto run_info_id = batch->get_run_info(columns->run_info->GetValueIndex(row)); CHECK(*run_info_id == read_data.run_info_id); ++abs_row; } } CHECK(abs_row == test_read_data.size()); auto run_info = (*reader)->find_run_info(test_read_data[0].run_info_id); REQUIRE_ARROW_STATUS_OK(run_info); CHECK((*run_info)->acquisition_id == test_read_data[0].run_info_id); CHECK((*run_info)->adc_min == -4096); CHECK((*run_info)->adc_max == 4095); CHECK((*run_info)->protocol_run_id == "df049455-3552-438c-8176-d4a5b1dd9fc5"); CHECK((*run_info)->software == "python-pod5-converter"); CHECK( (*run_info)->tracking_id == pod5::RunInfoData::MapType{ {"asic_id", "131070"}, {"asic_id_eeprom", "0"}, {"asic_temp", "35.043102"}, {"asic_version", "IA02C"}, {"auto_update", "0"}, {"auto_update_source", "https://mirror.oxfordnanoportal.com/software/MinKNOW/"}, {"bream_is_standard", "0"}, {"device_id", "MS00000"}, {"device_type", "minion"}, {"distribution_status", "modified"}, {"distribution_version", "unknown"}, {"exp_script_name", 
"c449127e3461a521e0865fe6a88716f6f6b0b30c"}, {"exp_script_purpose", "sequencing_run"}, {"exp_start_time", "2019-05-13T11:11:43Z"}, {"flow_cell_id", ""}, {"guppy_version", "3.0.3+7e7b7d0"}, {"heatsink_temp", "35.000000"}, {"hostname", "happy_fish"}, {"installation_type", "prod"}, {"local_firmware_file", "1"}, {"operating_system", "ubuntu 16.04"}, {"protocol_group_id", "TEST_EXPERIMENT"}, {"protocol_run_id", "df049455-3552-438c-8176-d4a5b1dd9fc5"}, {"protocols_version", "4.0.6"}, {"run_id", "a08e850aaa44c8b56765eee10b386fc3e516a62b"}, {"sample_id", "TEST_SAMPLE"}, {"usb_config", "MinION_fx3_1.1.1_ONT#MinION_fpga_1.1.0#ctrl#Auto"}, {"version", "3.4.0-rc3"}, }); } /// Create empty file at \p path. static void touch(std::filesystem::path const & path) { std::ofstream const ofs(path); } /// Create file containing bytes of value zero at \p path. static void write_zeros(std::filesystem::path const & path) { std::ofstream file_stream(path, std::ios::binary); for (int i = 0; i < 1000000; ++i) { file_stream.put('\0'); } } /// Returns true iff the file exists and contains non-null data. static bool file_writing_started(std::filesystem::path const & file_path) { if (!exists(file_path)) { return false; } if (!is_regular_file(file_path)) { return false; } // This should be enough for the check as unwritten files are usually // empty or populated with nulls if writing has not been done. auto const MINIMUM_BYTES_WRITTEN = 3; if (file_size(file_path) < 3) { return false; } std::ifstream file{file_path, std::ios::in | std::ios::binary}; for (auto byte_index = 0; byte_index < MINIMUM_BYTES_WRITTEN; ++byte_index) { std::uint8_t byte; file >> byte; if (byte == 0) { return false; } } return true; } static bool files_ready_to_recover(std::filesystem::path const & directory_path) { using directory_iterator = std::filesystem::directory_iterator; // The directory should contain 3 files for recovery. 
    // A `.pod5.tmp` with the signal data,
    // a `.tmp-reads` for the reads and a `.tmp-run-info` for the run information.
    return std::count_if(
               directory_iterator(directory_path), directory_iterator{}, file_writing_started)
           >= 3;
}

static void wait_for_files_to_recover(std::filesystem::path const & directory_path)
{
    using clock = std::chrono::steady_clock;
    auto const begin_waiting = clock::now();
    auto const time_waited = [&]() {
        return std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - begin_waiting);
    };

    while (!files_ready_to_recover(directory_path)) {
        REQUIRE(time_waited() < std::chrono::milliseconds{100000});
        // Give any asynchronous file writing threads a chance to write to disk before we continue.
        std::this_thread::sleep_for(std::chrono::milliseconds{100});
    }
}

static std::filesystem::path create_files_for_recovery(
    std::string const & file_name,
    pod5::Uuid read_id_1,
    ont::testutils::TemporaryDirectory & recovery_directory)
{
    auto const run_info_data = get_test_run_info_data("_run_info");

    std::uint16_t channel = 25;
    std::uint8_t well = 3;
    std::uint32_t read_number = 1234;
    std::uint64_t start_sample = 12340;
    std::uint64_t num_minknow_events = 27;
    float median_before = 224.0f;
    float calib_offset = 22.5f;
    float calib_scale = 1.2f;
    float tracked_scaling_scale = 2.3f;
    float tracked_scaling_shift = 100.0f;
    float predicted_scaling_scale = 1.5f;
    float predicted_scaling_shift = 50.0f;
    std::uint32_t num_reads_since_mux_change = 3;
    float time_since_mux_change = 200.0f;
    float open_pore_level = 150.0f;

    std::vector<std::int16_t> signal_1(100'000);
    std::iota(signal_1.begin(), signal_1.end(), 0);

    ont::testutils::TemporaryDirectory data_writing_directory;
    auto file = data_writing_directory.path() / file_name;

    pod5::FileWriterOptions options;
    options.set_max_signal_chunk_size(20'480);
    options.set_read_table_batch_size(1);
    options.set_signal_table_batch_size(5);
    options.set_use_sync_io(true);
    auto thread_pool = pod5::make_thread_pool(4);
    options.set_thread_pool(thread_pool);

    auto writer_result =
        pod5::create_file_writer(file.string(), "test_software", options);
    REQUIRE_ARROW_STATUS_OK(writer_result);
    std::unique_ptr<pod5::FileWriter> writer = std::move(*writer_result);

    auto run_info = writer->add_run_info(run_info_data);
    auto end_reason = writer->lookup_end_reason(pod5::ReadEndReason::signal_negative);
    bool end_reason_forced = true;
    auto pore_type = writer->add_pore_type("Pore_type");

    for (std::size_t i = 0; i < 10; ++i) {
        CHECK_ARROW_STATUS_OK(writer->add_complete_read(
            {read_id_1,
             read_number,
             start_sample,
             channel,
             well,
             *pore_type,
             calib_offset,
             calib_scale,
             median_before,
             *end_reason,
             end_reason_forced,
             *run_info,
             num_minknow_events,
             tracked_scaling_scale,
             tracked_scaling_shift,
             predicted_scaling_scale,
             predicted_scaling_shift,
             num_reads_since_mux_change,
             time_since_mux_change,
             open_pore_level},
            gsl::make_span(signal_1)));
    }

    wait_for_files_to_recover(data_writing_directory.path());

    // Intermittent failures were seen on Windows, where the file was in the middle of being
    // written when we copied it. This ensures that the file writing threads are done before
    // we take the files.
    thread_pool->wait_for_drain();

    // The files are deliberately copied here before they can be properly finalised
    // by the destructor of the FileWriter.
    std::filesystem::copy(data_writing_directory.path(), recovery_directory.path());
    REQUIRE(files_ready_to_recover(recovery_directory.path()));
    return recovery_directory.path() / file_name;
}

/// This is equivalent to the C++20 `std::string::ends_with` function. It should be replaced with
/// the standard library function once we move to the C++20 standard and drop support for building
/// with GCC 8.
static bool ends_with(std::string const & search_in, std::string const & suffix)
{
    if (suffix.size() > search_in.size()) {
        return false;
    }
    return search_in.compare(search_in.size() - suffix.size(), std::string::npos, suffix) == 0;
}

TEST_CASE("Check custom rolled ends_with works", "[string_utilities]")
{
    CHECK(ends_with("abc", "abc"));
    CHECK(ends_with("abcdef", "def"));
    CHECK_FALSE(ends_with("abcdef", "abc"));
    CHECK_FALSE(ends_with("def", "abcdef"));
    CHECK_FALSE(ends_with("abc", "def"));
}

/// Prefix the characters that are special in the regexes used below with a backslash.
static std::string escape_for_regex(std::string const & input)
{
    std::string output;
    for (auto const & character : input) {
        switch (character) {
        case '\\':
        case '/':
        case '.':
        case '[':
        case ']':
        case '(':
        case ')':
            output += '\\';
            break;
        default:
            break;
        }
        output += character;
    }
    return output;
}

TEST_CASE("Recovering .pod5.tmp files", "[recovery]")
{
    std::string const file_name = "foo.pod5.tmp";
    ont::testutils::TemporaryDirectory recovery_directory;

    auto const registration_status = pod5::register_extension_types();
    REQUIRE(registration_status.ok());
    auto const unregister = [] { (void)pod5::unregister_extension_types(); };
    // Held through a unique_ptr so that a section can reset it and unregister early.
    auto fin = std::make_unique<gsl::final_action<decltype(unregister)>>(unregister);

    std::mt19937 gen{Catch::rngSeed()};
    auto uuid_gen = pod5::UuidRandomGenerator{gen};

    std::filesystem::path const path_to_recover =
        create_files_for_recovery(file_name, uuid_gen(), recovery_directory);
    REQUIRE(exists(path_to_recover));

    std::filesystem::path reads_path, run_path;
    for (auto const & directory_entry :
         std::filesystem::directory_iterator{recovery_directory.path()})
    {
        if (!directory_entry.is_regular_file()) {
            continue;
        }
        if (ends_with(directory_entry.path().filename().string(), ".tmp-reads")) {
            reads_path = directory_entry.path();
        }
        if (ends_with(directory_entry.path().filename().string(), ".tmp-run-info")) {
            run_path = directory_entry.path();
        }
    }
    REQUIRE(exists(reads_path));
    REQUIRE(exists(run_path));

    auto const recovered_file_path = recovery_directory.path() / (file_name + "-recovered.pod5");
    // Confirm
that no recovered file is left over from previous test runs. REQUIRE_FALSE(exists(recovered_file_path)); // Paths are implicitly convertible to the kind of strings used for paths // on the current platform. On Windows this is an `std::wstring`, but the // recover_file_writer takes a `std::string`, so we need the explicit // conversion to make the build work on that platform. // `generic_string()` is used rather than `native()` because Arrow paths // always use `/` as a separator, even on Windows. std::string const to_recover = path_to_recover.generic_string(); std::string const recovered = recovered_file_path.generic_string(); bool const cleanup = GENERATE(true, false); pod5::RecoverFileOptions const options{.cleanup = cleanup}; CAPTURE(to_recover, recovered, cleanup); SECTION("Recovering basic set of .tmp files.") { auto const recovery_details = pod5::recover_file(to_recover, recovered, options); REQUIRE_ARROW_STATUS_OK(recovery_details); CHECK(exists(recovered_file_path)); CHECK(recovery_details->row_counts.run_info == 1); CHECK(recovery_details->row_counts.signal == 50); CHECK(recovery_details->row_counts.reads == 10); CHECK(recovery_details->cleanup_errors.empty()); if (cleanup) { CHECK_FALSE(exists(path_to_recover)); CHECK_FALSE(exists(reads_path)); CHECK_FALSE(exists(run_path)); } else { CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK(exists(run_path)); } } SECTION("Recovering whilst extensions are not registered.") { fin = {}; auto recover_result2 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result2.ok()); REQUIRE( recover_result2.status().ToString() == "Invalid: POD5 library is not correctly initialised."); CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK(exists(run_path)); } SECTION("Recovering without run information.") { remove(run_path); std::string const run_info_string = run_path.generic_string(); CAPTURE(run_info_string); SECTION("Recovering set of .tmp files with run info file 
missing.") { auto recover_result3 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result3.ok()); auto const result_message3 = recover_result3.status().ToString(); auto const expected_regex3 = "IOError: Failed whilst attempting to recover run information from file - " + escape_for_regex(run_info_string) + R"(\. Detail: \[(errno|Windows error) 2\] )" + R"((No such file or directory|The system cannot find the file specified)[.\n\r]*)"; REQUIRE_THAT(result_message3, Catch::Matchers::Matches(expected_regex3)); if (cleanup) { CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK_FALSE(exists(run_path)); CHECK_FALSE(exists(recovered_file_path)); } else { CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); } } SECTION("Recovering set of .tmp files with run info file empty.") { touch(run_path); auto recover_result4 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result4.ok()); REQUIRE( recover_result4.status().ToString() == "Invalid: Failed whilst attempting to recover run information from file - " + run_info_string + ". Detail: File is empty/zero bytes long."); if (cleanup) { CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK_FALSE(exists(run_path)); CHECK_FALSE(exists(recovered_file_path)); } else { CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK(exists(run_path)); } } SECTION("Recovering set of .tmp files with run info file zeroed.") { write_zeros(run_path); auto recover_result5 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result5.ok()); REQUIRE( recover_result5.status().ToString() == "Invalid: Failed whilst attempting to recover run information from file - " + run_info_string + ". 
Detail: Not an Arrow file"); if (cleanup) { CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK_FALSE(exists(run_path)); CHECK_FALSE(exists(recovered_file_path)); } else { CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK(exists(run_path)); } } } SECTION("Recovering without read information.") { remove(reads_path); std::string const reads_string = reads_path.generic_string(); CAPTURE(reads_string); SECTION("Recovering set of .tmp files with reads file missing.") { auto recover_result6 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result6.ok()); auto const result_message6 = recover_result6.status().ToString(); auto const expected_regex6 = "IOError: Failed whilst attempting to recover reads from file - " + escape_for_regex(reads_string) + R"(\. Detail: \[(errno|Windows error) 2\] )" + R"((No such file or directory|The system cannot find the file specified)[.\n\r]*)"; REQUIRE_THAT(result_message6, Catch::Matchers::Matches(expected_regex6)); if (cleanup) { CHECK(exists(path_to_recover)); CHECK_FALSE(exists(reads_path)); CHECK(exists(run_path)); CHECK_FALSE(exists(recovered_file_path)); } else { CHECK(exists(path_to_recover)); CHECK(exists(run_path)); } } SECTION("Recovering set of .tmp files with reads file empty.") { touch(reads_path); auto recover_result7 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result7.ok()); REQUIRE( recover_result7.status().ToString() == "Invalid: Failed whilst attempting to recover reads from file - " + reads_string + ". 
Detail: File is empty/zero bytes long."); if (cleanup) { CHECK(exists(path_to_recover)); CHECK_FALSE(exists(reads_path)); CHECK(exists(run_path)); CHECK_FALSE(exists(recovered_file_path)); } else { CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK(exists(run_path)); } } SECTION("Recovering set of .tmp files with reads file zeroed.") { write_zeros(reads_path); auto recover_result7 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result7.ok()); REQUIRE( recover_result7.status().ToString() == "Invalid: Failed whilst attempting to recover reads from file - " + reads_string + ". Detail: Not an Arrow file"); if (cleanup) { CHECK(exists(path_to_recover)); CHECK_FALSE(exists(reads_path)); CHECK(exists(run_path)); CHECK_FALSE(exists(recovered_file_path)); } else { CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK(exists(run_path)); } } } SECTION("Error messages for problems with combined .pod5.tmp file.") { remove(path_to_recover); SECTION("Recovering set of .tmp files with .pod5.tmp file missing.") { auto recover_result8 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result8.ok()); auto const result_message = recover_result8.status().ToString(); auto const expected_regex = "IOError: Failed to open local file '" + escape_for_regex(to_recover) + R"('\. 
Detail: \[(errno|Windows error) 2\] )" + R"((No such file or directory|The system cannot find the file specified)[.\n\r]*)"; CAPTURE(result_message, expected_regex); REQUIRE_THAT(result_message, Catch::Matchers::Matches(expected_regex)); if (cleanup) { CHECK_FALSE(exists(recovered_file_path)); } CHECK_FALSE(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK(exists(run_path)); } SECTION("Recovering set of .tmp files with .pod5.tmp file empty.") { touch(path_to_recover); auto recover_result9 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result9.ok()); REQUIRE(recover_result9.status().ToString() == "IOError: Invalid signature in file"); if (cleanup) { CHECK_FALSE(exists(recovered_file_path)); CHECK_FALSE(exists(path_to_recover)); } else { CHECK(exists(path_to_recover)); } CHECK(exists(reads_path)); CHECK(exists(run_path)); } SECTION("Recovering set of .tmp files with .pod5.tmp file zeroed.") { write_zeros(path_to_recover); auto recover_result10 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result10.ok()); REQUIRE(recover_result10.status().ToString() == "IOError: Invalid signature in file"); if (cleanup) { CHECK_FALSE(exists(recovered_file_path)); CHECK_FALSE(exists(path_to_recover)); } else { CHECK(exists(path_to_recover)); } CHECK(exists(reads_path)); CHECK(exists(run_path)); } arrow::Result> result_tmp_file = arrow::io::FileOutputStream::Open(to_recover, false); REQUIRE_ARROW_STATUS_OK(result_tmp_file); std::shared_ptr tmp_file = std::move(*result_tmp_file); REQUIRE_ARROW_STATUS_OK(pod5::combined_file_utils::write_file_signature(tmp_file)); SECTION("Recover .pod5.tmp missing section marker after signature.") { tmp_file = {}; auto recover_result11 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result11.ok()); REQUIRE(recover_result11.status().ToString() == "IOError: Invalid offset into SubFile"); if (cleanup) { CHECK_FALSE(exists(recovered_file_path)); } 
CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK(exists(run_path)); } SECTION("Recover .pod5.tmp missing signal sub file.") { pod5::Uuid section_id = uuid_gen(); REQUIRE_ARROW_STATUS_OK(tmp_file->Write(section_id.data(), section_id.size())); tmp_file = {}; auto recover_result12 = pod5::recover_file(to_recover, recovered, options); REQUIRE_FALSE(recover_result12.ok()); REQUIRE( recover_result12.status().ToString() == "Invalid: Failed whilst attempting to recover signal data sub file from file - " + to_recover + ". Detail: Not an Arrow file"); if (cleanup) { CHECK_FALSE(exists(recovered_file_path)); } CHECK(exists(path_to_recover)); CHECK(exists(reads_path)); CHECK(exists(run_path)); } } } ================================================ FILE: c++/test/main.cpp ================================================ #define CATCH_CONFIG_MAIN #include ================================================ FILE: c++/test/output_stream_tests.cpp ================================================ #include "pod5_format/internal/async_output_stream.h" #include "pod5_format/internal/linux_output_stream.h" #include "test_utils.h" #include #include #include #include namespace { static constexpr std::size_t TestDataSize = 1024 * 1024 * 100; std::shared_ptr get_test_data() { static std::shared_ptr const data = [] { auto result = *arrow::AllocateResizableBuffer(TestDataSize); for (std::size_t i = 0; i < TestDataSize; ++i) { result->mutable_data()[i] = i % 256; } return result; }(); return data; } std::vector read_file(char const * filename) { std::ifstream fin(filename, std::ios::binary); return std::vector(std::istreambuf_iterator(fin), std::istreambuf_iterator()); } void check_file_contents(char const * filename) { auto contents = read_file(filename); auto expected_contents = get_test_data(); auto expected_contents_span = expected_contents->span_as(); REQUIRE(contents.size() == expected_contents_span.size()); for (std::size_t i = 0; i < expected_contents_span.size(); i += 1) 
{ CHECK(contents[i] == expected_contents_span[i]); } } } // namespace void run_output_stream_test(std::shared_ptr output_stream) { auto const data = get_test_data(); std::size_t small_writes_bytes_consumed = 0; { CHECK_ARROW_STATUS_OK(output_stream->Write(data->data() + small_writes_bytes_consumed, 1)); small_writes_bytes_consumed += 1; CHECK_ARROW_STATUS_OK(output_stream->Write(data->data() + small_writes_bytes_consumed, 2)); small_writes_bytes_consumed += 2; CHECK_ARROW_STATUS_OK(output_stream->Write(data->data() + small_writes_bytes_consumed, 4)); small_writes_bytes_consumed += 4; CHECK_ARROW_STATUS_OK(output_stream->Write(data->data() + small_writes_bytes_consumed, 8)); small_writes_bytes_consumed += 8; CHECK_ARROW_STATUS_OK(output_stream->Flush()); } auto remaining_data_buffer = arrow::SliceBuffer(data, small_writes_bytes_consumed); { auto chunk_1 = arrow::SliceBuffer(remaining_data_buffer, 0, 1024); auto chunk_2 = arrow::SliceBuffer(remaining_data_buffer, 1024, 63); remaining_data_buffer = arrow::SliceBuffer(remaining_data_buffer, 1024 + 63); CHECK_ARROW_STATUS_OK(output_stream->Write(chunk_1)); CHECK_ARROW_STATUS_OK(output_stream->Write(chunk_2)); CHECK_ARROW_STATUS_OK(output_stream->Flush()); } { auto chunk_1 = arrow::SliceBuffer(remaining_data_buffer, 0, 1024 * 1024); auto chunk_2 = arrow::SliceBuffer(remaining_data_buffer, 1024 * 1024, 1023); remaining_data_buffer = arrow::SliceBuffer(remaining_data_buffer, 1024 * 1024 + 1023); CHECK_ARROW_STATUS_OK(output_stream->Write(chunk_1)); CHECK_ARROW_STATUS_OK(output_stream->Write(chunk_2)); CHECK_ARROW_STATUS_OK(output_stream->Flush()); } CHECK_ARROW_STATUS_OK(output_stream->Write(remaining_data_buffer)); CHECK_ARROW_STATUS_OK(output_stream->Flush()); } TEST_CASE("AsyncOutputStream", "[OutputStream]") { using namespace pod5; auto const filename = "./test_file.bin"; { std::ofstream f(filename, std::ios_base::trunc); } { bool keep_file_open = GENERATE(true, false); auto thread_pool = make_thread_pool(1); auto 
stream = *AsyncOutputStream::make( filename, thread_pool, true, arrow::default_memory_pool(), keep_file_open); run_output_stream_test(stream); } check_file_contents(filename); } #ifdef __linux__ TEST_CASE("LinuxOutputStream IOManagerSyncImpl", "[OutputStream]") { using namespace pod5; bool keep_file_open = GENERATE(true, false); CAPTURE(keep_file_open); auto filename = "./test_file.bin"; { std::ofstream f(filename, std::ios_base::trunc); } { auto io_manager = pod5::make_sync_io_manager(); REQUIRE_ARROW_STATUS_OK(io_manager); auto stream = *LinuxOutputStream::make( filename, *io_manager, 10 * 1024 * 1024, true, false, true, keep_file_open); run_output_stream_test(stream); } check_file_contents(filename); } #endif ================================================ FILE: c++/test/read_table_tests.cpp ================================================ #include "pod5_format/internal/async_output_stream.h" #include "pod5_format/read_table_reader.h" #include "pod5_format/read_table_writer.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/types.h" #include "pod5_format/uuid.h" #include "pod5_format/version.h" #include "test_utils.h" #include "utils.h" #include #include #include #include #include #include #include bool operator==( std::shared_ptr const & array, std::vector const & vec) { auto const length = static_cast(array->length()); if (length != vec.size()) { return false; } for (std::size_t i = 0; i < length; ++i) { if ((*array)[i] != vec[i]) { return false; } } return true; } SCENARIO("Read table Tests") { using namespace pod5; (void)pod5::register_extension_types(); auto fin = gsl::finally([] { (void)pod5::unregister_extension_types(); }); std::mt19937 gen{Catch::rngSeed()}; auto uuid_gen = pod5::UuidRandomGenerator{gen}; auto file_identifier = uuid_gen(); auto data_for_index = [&](std::size_t index) { std::array uuid_source{}; Uuid read_id{uuid_source}; return std::make_tuple( pod5::ReadData{ read_id, std::uint32_t(index * 2), std::uint64_t(index * 10), 
std::uint16_t(index + 1), std::uint8_t(index + 2), 0, index * 0.1f, index * 0.2f, index * 100.0f, 0, true, 0, std::uint64_t(index * 150), index * 0.4f, index * 0.3f, index * 0.6f, index * 0.5f, std::uint32_t(index + 10), index * 50.0f, index * 0.7f}, std::vector{index + 2, index + 3}); }; GIVEN("A read table writer") { auto filename = "./foo.pod5"; auto pool = arrow::system_memory_pool(); auto const record_batch_count = GENERATE(as{}, 1, 2, 5, 10); auto const read_count = GENERATE(1, 2); { auto file_out = *pod5::AsyncOutputStream::make(filename, pod5::make_thread_pool(1), true); auto schema_metadata = make_schema_key_value_metadata( {file_identifier, "test_software", *parse_version_number(Pod5Version)}); REQUIRE_ARROW_STATUS_OK(schema_metadata); auto pore_writer = pod5::make_pore_writer(pool); REQUIRE_ARROW_STATUS_OK(pore_writer); auto end_reason_writer = pod5::make_end_reason_writer(pool); REQUIRE_ARROW_STATUS_OK(end_reason_writer); auto run_info_writer = pod5::make_run_info_writer(pool); REQUIRE_ARROW_STATUS_OK(run_info_writer); auto writer = pod5::make_read_table_writer( file_out, *schema_metadata, read_count, *pore_writer, *end_reason_writer, *run_info_writer, pool); REQUIRE_ARROW_STATUS_OK(writer); auto const pore_1 = (*pore_writer)->add("Well Type"); REQUIRE_ARROW_STATUS_OK(pore_1); auto const end_reason_1 = (*end_reason_writer)->lookup(pod5::ReadEndReason::mux_change); REQUIRE_ARROW_STATUS_OK(end_reason_1); auto const run_info_1 = (*run_info_writer)->add("acq_id_1"); REQUIRE_ARROW_STATUS_OK(run_info_1); auto const run_info_2 = (*run_info_writer)->add("acq_id_2"); REQUIRE_ARROW_STATUS_OK(run_info_2); for (std::size_t i = 0; i < record_batch_count; ++i) { for (std::size_t j = 0; j < static_cast(read_count); ++j) { auto const idx = j + i * read_count; pod5::ReadData read_data; std::vector signal; std::tie(read_data, signal) = data_for_index(idx); auto row = writer->add_read(read_data, signal, signal.size()); REQUIRE_ARROW_STATUS_OK(row); CHECK(*row == idx); } } 
REQUIRE_ARROW_STATUS_OK(writer->close()); } auto file_in = arrow::io::ReadableFile::Open(filename, pool); { REQUIRE_ARROW_STATUS_OK(file_in); auto reader = pod5::make_read_table_reader(*file_in, pool); REQUIRE_ARROW_STATUS_OK(reader); auto metadata = reader->schema_metadata(); CHECK(metadata.file_identifier == file_identifier); CHECK(metadata.writing_software == "test_software"); CHECK(metadata.writing_pod5_version == *parse_version_number(Pod5Version)); REQUIRE(reader->num_record_batches() == record_batch_count); for (std::size_t i = 0; i < record_batch_count; ++i) { auto const record_batch = reader->read_record_batch(i); REQUIRE_ARROW_STATUS_OK(record_batch); REQUIRE(record_batch->num_rows() == static_cast(read_count)); auto columns = record_batch->columns(); CHECK(columns->read_id->length() == read_count); CHECK(columns->signal->length() == read_count); CHECK(columns->channel->length() == read_count); CHECK(columns->well->length() == read_count); CHECK(columns->pore_type->length() == read_count); CHECK(columns->calibration_offset->length() == read_count); CHECK(columns->calibration_scale->length() == read_count); CHECK(columns->read_number->length() == read_count); CHECK(columns->start_sample->length() == read_count); CHECK(columns->median_before->length() == read_count); CHECK(columns->num_samples->length() == read_count); CHECK(columns->end_reason->length() == read_count); CHECK(columns->end_reason_forced->length() == read_count); CHECK(columns->run_info->length() == read_count); auto pore_indices = std::static_pointer_cast(columns->pore_type->indices()); auto end_reason_indices = std::static_pointer_cast(columns->end_reason->indices()); auto run_info_indices = std::static_pointer_cast(columns->run_info->indices()); for (auto j = 0; j < read_count; ++j) { auto idx = j + i * read_count; pod5::ReadData read_data; std::vector expected_signal; std::tie(read_data, expected_signal) = data_for_index(idx); CHECK(columns->read_id->Value(j) == read_data.read_id); auto 
signal_data = std::static_pointer_cast( columns->signal->value_slice(j)); CHECK( gsl::make_span(signal_data->raw_values(), signal_data->length()) == gsl::make_span(expected_signal)); CHECK(columns->read_number->Value(j) == read_data.read_number); CHECK(columns->start_sample->Value(j) == read_data.start_sample); CHECK(columns->median_before->Value(j) == read_data.median_before); CHECK(columns->num_samples->Value(j) == expected_signal.size()); CHECK(columns->calibration_offset->Value(j) == read_data.calibration_offset); CHECK(columns->calibration_scale->Value(j) == read_data.calibration_scale); CHECK(columns->channel->Value(j) == read_data.channel); CHECK(columns->well->Value(j) == read_data.well); CHECK(end_reason_indices->Value(j) == read_data.end_reason); CHECK(pore_indices->Value(j) == read_data.pore_type); CHECK(run_info_indices->Value(j) == read_data.run_info); } auto pore_data = record_batch->get_pore_type(0); REQUIRE_ARROW_STATUS_OK(pore_data); CHECK(*pore_data == "Well Type"); auto end_reason_data = record_batch->get_end_reason(1); REQUIRE_ARROW_STATUS_OK(end_reason_data); CHECK(end_reason_data->first == pod5::ReadEndReason::mux_change); CHECK(end_reason_data->second == "mux_change"); auto run_info_data = record_batch->get_run_info(0); REQUIRE_ARROW_STATUS_OK(run_info_data); CHECK(*run_info_data == "acq_id_1"); run_info_data = record_batch->get_run_info(1); REQUIRE_ARROW_STATUS_OK(run_info_data); CHECK(*run_info_data == "acq_id_2"); } } } } ================================================ FILE: c++/test/read_table_writer_utils_tests.cpp ================================================ #include "pod5_format/read_table_writer_utils.h" #include "test_utils.h" #include "utils.h" #include #include #include #include #include TEST_CASE("Run Info Writer Tests") { auto pool = arrow::system_memory_pool(); auto run_info_writer = pod5::make_run_info_writer(pool); REQUIRE_ARROW_STATUS_OK(run_info_writer); auto index = (*run_info_writer)->add("acq_id_1"); CHECK(*index == 
0); CHECK((*run_info_writer)->item_count() == 1); // Important to always call this so we test calling it twice auto const value_array = (*run_info_writer)->get_value_array(); WHEN("Checking the first row") { REQUIRE_ARROW_STATUS_OK(value_array); auto string_value_array = std::dynamic_pointer_cast(*value_array); REQUIRE(string_value_array); CHECK(string_value_array->length() == 1); CHECK(string_value_array->Value(0) == "acq_id_1"); } index = (*run_info_writer)->add("acq_id_2"); CHECK(*index == 1); CHECK((*run_info_writer)->item_count() == 2); WHEN("Checking the rows after a second append") { auto value_array = (*run_info_writer)->get_value_array(); REQUIRE_ARROW_STATUS_OK(value_array); auto string_value_array = std::dynamic_pointer_cast(*value_array); REQUIRE(string_value_array); CHECK(string_value_array->length() == 2); CHECK(string_value_array->Value(0) == "acq_id_1"); CHECK(string_value_array->Value(1) == "acq_id_2"); } } ================================================ FILE: c++/test/run_info_table_tests.cpp ================================================ #include "pod5_format/internal/async_output_stream.h" #include "pod5_format/run_info_table_reader.h" #include "pod5_format/run_info_table_writer.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/types.h" #include "pod5_format/uuid.h" #include "pod5_format/version.h" #include "test_utils.h" #include "utils.h" #include #include #include #include #include #include #include SCENARIO("Run Info table Tests") { using namespace pod5; (void)pod5::register_extension_types(); auto fin = gsl::finally([] { (void)pod5::unregister_extension_types(); }); std::mt19937 gen{Catch::rngSeed()}; auto uuid_gen = pod5::UuidRandomGenerator{gen}; auto file_identifier = uuid_gen(); GIVEN("A run info table writer") { auto filename = "./foo.pod5"; auto pool = arrow::system_memory_pool(); auto run_info_data_0 = get_test_run_info_data(); auto run_info_data_1 = get_test_run_info_data("_2"); { auto file_out = 
*pod5::AsyncOutputStream::make(filename, pod5::make_thread_pool(1), true); auto schema_metadata = make_schema_key_value_metadata( {file_identifier, "test_software", *parse_version_number(Pod5Version)}); REQUIRE_ARROW_STATUS_OK(schema_metadata); std::size_t run_info_per_batch = 2; auto writer = pod5::make_run_info_table_writer( file_out, *schema_metadata, run_info_per_batch, pool); REQUIRE_ARROW_STATUS_OK(writer); REQUIRE_ARROW_STATUS_OK(writer->add_run_info(run_info_data_0)); REQUIRE_ARROW_STATUS_OK(writer->add_run_info(run_info_data_1)); } auto file_in = arrow::io::ReadableFile::Open(filename, pool); { REQUIRE_ARROW_STATUS_OK(file_in); auto reader = pod5::make_run_info_table_reader(*file_in, pool); REQUIRE_ARROW_STATUS_OK(reader); auto metadata = reader->schema_metadata(); CHECK(metadata.file_identifier == file_identifier); CHECK(metadata.writing_software == "test_software"); CHECK(metadata.writing_pod5_version == *parse_version_number(Pod5Version)); REQUIRE(reader->num_record_batches() == 1); auto const record_batch = reader->read_record_batch(0); REQUIRE_ARROW_STATUS_OK(record_batch); REQUIRE(record_batch->num_rows() == 2); auto columns = record_batch->columns(); REQUIRE_ARROW_STATUS_OK(columns); auto check_run_info = [](auto & columns, std::size_t index, pod5::RunInfoData const & run_info_data) { CHECK(columns.acquisition_id->Value(index) == run_info_data.acquisition_id); CHECK( columns.acquisition_start_time->Value(index) == run_info_data.acquisition_start_time); CHECK(columns.adc_max->Value(index) == run_info_data.adc_max); CHECK(columns.adc_min->Value(index) == run_info_data.adc_min); CHECK(columns.experiment_name->Value(index) == run_info_data.experiment_name); CHECK(columns.flow_cell_id->Value(index) == run_info_data.flow_cell_id); CHECK( columns.flow_cell_product_code->Value(index) == run_info_data.flow_cell_product_code); CHECK(columns.protocol_name->Value(index) == run_info_data.protocol_name); CHECK(columns.protocol_run_id->Value(index) == 
run_info_data.protocol_run_id); CHECK( columns.protocol_start_time->Value(index) == run_info_data.protocol_start_time); CHECK(columns.sample_id->Value(index) == run_info_data.sample_id); CHECK(columns.sample_rate->Value(index) == run_info_data.sample_rate); CHECK(columns.sequencing_kit->Value(index) == run_info_data.sequencing_kit); CHECK(columns.sequencer_position->Value(index) == run_info_data.sequencer_position); CHECK( columns.sequencer_position_type->Value(index) == run_info_data.sequencer_position_type); CHECK(columns.software->Value(index) == run_info_data.software); CHECK(columns.system_name->Value(index) == run_info_data.system_name); CHECK(columns.system_type->Value(index) == run_info_data.system_type); }; check_run_info(*columns, 0, run_info_data_0); check_run_info(*columns, 1, run_info_data_1); auto found_run_info_0 = reader->find_run_info(run_info_data_0.acquisition_id); CHECK_ARROW_STATUS_OK(found_run_info_0); CHECK(**found_run_info_0 == run_info_data_0); auto found_run_info_1 = reader->find_run_info(run_info_data_1.acquisition_id); CHECK_ARROW_STATUS_OK(found_run_info_1); CHECK(**found_run_info_1 == run_info_data_1); } } } ================================================ FILE: c++/test/schema_tests.cpp ================================================ #include "pod5_format/schema_metadata.h" #include "test_utils.h" #include SCENARIO("Version Tests") { using namespace pod5; CHECK(Version(1, 2, 3) < Version(3, 2, 1)); CHECK(Version(1, 2, 3) < Version(1, 3, 3)); CHECK(Version(1, 2, 3) < Version(1, 2, 4)); CHECK(Version(3, 2, 1) > Version(1, 2, 3)); CHECK(Version(1, 3, 3) > Version(1, 2, 3)); CHECK(Version(1, 2, 4) > Version(1, 2, 3)); CHECK(Version(1, 2, 3) == Version(1, 2, 3)); CHECK(Version(1, 2, 3) != Version(2, 2, 3)); CHECK(Version(1, 2, 3) != Version(1, 3, 3)); CHECK(Version(1, 2, 3) != Version(1, 2, 4)); CHECK_ARROW_STATUS_NOT_OK(parse_version_number("1.2.3.4")); CHECK_ARROW_STATUS_NOT_OK(parse_version_number("1.2.3-pre")); auto const 
parsed_version = parse_version_number("10.200.3"); REQUIRE_ARROW_STATUS_OK(parsed_version); CHECK(Version(10, 200, 3) == *parsed_version); CHECK(parsed_version->major_version() == 10); CHECK(parsed_version->minor_version() == 200); CHECK(parsed_version->revision_version() == 3); CHECK(Version(1, 200, 30).to_string() == "1.200.30"); } ================================================ FILE: c++/test/signal_compression_tests.cpp ================================================ #include "pod5_format/signal_compression.h" #include "test_utils.h" #include "utils.h" #include #include #include #include #include SCENARIO("Signal compression Tests") { auto pool = arrow::system_memory_pool(); std::vector signal(10'000); std::iota(signal.begin(), signal.end(), 0); auto compressed = pod5::compress_signal(gsl::make_span(signal), pool); REQUIRE_ARROW_STATUS_OK(compressed); auto compressed_span = gsl::make_span((*compressed)->data(), (*compressed)->size()); auto decompressed = pod5::decompress_signal(compressed_span, signal.size(), pool); REQUIRE_ARROW_STATUS_OK(decompressed); auto decompressed_span = gsl::make_span((*decompressed)->data(), (*decompressed)->size()) .as_span(); CHECK(gsl::make_span(signal) == decompressed_span); } ================================================ FILE: c++/test/signal_table_tests.cpp ================================================ #include "pod5_format/internal/async_output_stream.h" #include "pod5_format/schema_metadata.h" #include "pod5_format/signal_compression.h" #include "pod5_format/signal_table_reader.h" #include "pod5_format/signal_table_writer.h" #include "pod5_format/types.h" #include "pod5_format/uuid.h" #include "pod5_format/version.h" #include "test_utils.h" #include "utils.h" #include #include #include #include #include #include #include SCENARIO("Signal table Tests") { using namespace pod5; (void)pod5::register_extension_types(); auto fin = gsl::finally([] { (void)pod5::unregister_extension_types(); }); std::mt19937 
gen{Catch::rngSeed()}; auto uuid_gen = pod5::UuidRandomGenerator{gen}; auto file_identifier = uuid_gen(); auto read_id_1 = uuid_gen(); auto read_id_2 = uuid_gen(); std::vector signal_1(100'000); std::iota(signal_1.begin(), signal_1.end(), 0); std::vector signal_2(10'000, 1); GIVEN("A signal table writer") { auto filename = "./foo.pod5"; auto pool = arrow::system_memory_pool(); auto signal_type = GENERATE(SignalType::UncompressedSignal, SignalType::VbzSignal); { auto file_out = *pod5::AsyncOutputStream::make(filename, pod5::make_thread_pool(1), true); auto schema_metadata = make_schema_key_value_metadata( {file_identifier, "test_software", *parse_version_number(Pod5Version)}); REQUIRE_ARROW_STATUS_OK(schema_metadata); auto writer = pod5::make_signal_table_writer(file_out, *schema_metadata, 100, signal_type, pool); REQUIRE_ARROW_STATUS_OK(writer); WHEN("Writing a read") { auto row_1 = writer->add_signal(read_id_1, gsl::make_span(signal_1)); auto row_2 = writer->add_signal(read_id_2, gsl::make_span(signal_2)); REQUIRE_ARROW_STATUS_OK(writer->close()); THEN("Read row ids are correct") { REQUIRE_ARROW_STATUS_OK(row_1); REQUIRE_ARROW_STATUS_OK(row_2); CHECK(*row_1 == 0); CHECK(*row_2 == 1); } } } auto file_in = arrow::io::ReadableFile::Open(filename, pool); { REQUIRE_ARROW_STATUS_OK(file_in); auto reader = pod5::make_signal_table_reader(*file_in, 20, pool); CAPTURE(reader); REQUIRE_ARROW_STATUS_OK(reader); auto metadata = reader->schema_metadata(); CHECK(metadata.file_identifier == file_identifier); CHECK(metadata.writing_software == "test_software"); CHECK(metadata.writing_pod5_version == *parse_version_number(Pod5Version)); REQUIRE(reader->num_record_batches() == 1); auto const record_batch_0 = reader->read_record_batch(0); REQUIRE_ARROW_STATUS_OK(record_batch_0); REQUIRE(record_batch_0->num_rows() == 2); auto read_id = record_batch_0->read_id_column(); CHECK(read_id->length() == 2); CHECK(read_id->Value(0) == read_id_1); CHECK(read_id->Value(1) == read_id_2); if 
(signal_type == SignalType::VbzSignal) { auto signal = record_batch_0->vbz_signal_column(); CHECK(signal->length() == 2); auto compare_compressed_signal = [&](gsl::span compressed_actual, std::vector const & expected) { auto decompressed = pod5::decompress_signal(compressed_actual, expected.size(), pool); REQUIRE_ARROW_STATUS_OK(decompressed); auto actual = gsl::make_span((*decompressed)->data(), (*decompressed)->size()) .as_span(); CHECK(actual == gsl::make_span(expected)); }; auto signal_typed = std::static_pointer_cast(signal); compare_compressed_signal(signal_typed->Value(0), signal_1); compare_compressed_signal(signal_typed->Value(1), signal_2); } else if (signal_type == SignalType::UncompressedSignal) { auto signal = record_batch_0->uncompressed_signal_column(); CHECK(signal->length() == 2); auto signal_1_read = std::static_pointer_cast(signal->value_slice(0)); std::vector stored_values_1( signal_1_read->raw_values(), signal_1_read->raw_values() + signal_1_read->length()); CHECK(stored_values_1 == signal_1); auto signal_2_read = std::static_pointer_cast(signal->value_slice(1)); std::vector stored_values_2( signal_2_read->raw_values(), signal_2_read->raw_values() + signal_2_read->length()); CHECK(stored_values_2 == signal_2); } else { FAIL("Unknown signal type"); } auto samples = record_batch_0->samples_column(); CHECK(samples->length() == 2); CHECK(samples->Value(0) == signal_1.size()); CHECK(samples->Value(1) == signal_2.size()); } } } ================================================ FILE: c++/test/svb16_scalar_tests.cpp ================================================ #include "pod5_format/svb16/decode.hpp" #include "pod5_format/svb16/encode.hpp" #include #include #include #include #include #include using Catch::Matchers::Equals; template void test_scalar_encode_scalar_decode() { static constexpr uint32_t DATA_COUNT = 1024; std::minstd_rand rng; std::vector data(DATA_COUNT); std::uniform_int_distribution dist{ std::numeric_limits::min(), 
std::numeric_limits::max()}; std::generate(data.begin(), data.end(), [&] { return dist(rng); }); std::vector encoded(svb16_max_encoded_length(data.size())); auto const encoded_count = svb16::encode_scalar( data.data(), encoded.data(), encoded.data() + svb16_key_length(data.size()), DATA_COUNT) - encoded.data(); CHECK(encoded_count <= svb16_max_encoded_length(data.size())); std::vector decoded(DATA_COUNT); auto const encoded_span = gsl::make_span(encoded); auto const key_length = svb16_key_length(data.size()); auto const consumed = svb16::decode_scalar( gsl::make_span(decoded), encoded_span.subspan(0, key_length), encoded_span.subspan(key_length)) - encoded.data(); CHECK(consumed == encoded_count); CHECK_THAT(decoded, Equals(data)); } TEST_CASE("Scalar decode is inverse of scalar encode", "[scalar]") { SECTION("Unsigned, no delta, no zig-zag") { test_scalar_encode_scalar_decode(); } SECTION("Signed, no delta, no zig-zag") { test_scalar_encode_scalar_decode(); } SECTION("Unsigned, delta, no zig-zag") { test_scalar_encode_scalar_decode(); } SECTION("Signed, delta, no zig-zag") { test_scalar_encode_scalar_decode(); } SECTION("Unsigned, delta, zig-zag") { test_scalar_encode_scalar_decode(); } SECTION("Signed, delta, zig-zag") { test_scalar_encode_scalar_decode(); } SECTION("Unsigned, no delta, zig-zag") { // this scenario doesn't really make sense, but it's possible, so let's test it test_scalar_encode_scalar_decode(); } SECTION("Signed, no delta, zig-zag") { test_scalar_encode_scalar_decode(); } } ================================================ FILE: c++/test/svb16_x64_tests.cpp ================================================ #include "pod5_format/svb16/decode.hpp" #include "pod5_format/svb16/encode.hpp" #include #include #include #include #include #include #include #ifdef SVB16_X64 using Catch::Matchers::Equals; template void test_sse_encode_scalar_decode() { uint32_t const DATA_COUNT = GENERATE( 1000, 20000); // Deliberately not aligned to 64 so we test the scalar 
tidy up code at the end. std::minstd_rand rng; std::vector data(DATA_COUNT); std::uniform_int_distribution dist{ std::numeric_limits::min(), std::numeric_limits::max()}; std::generate(data.begin(), data.end(), [&] { return dist(rng); }); std::vector encoded(svb16_max_encoded_length(data.size())); auto const encoded_count = svb16::encode_sse( data.data(), encoded.data(), encoded.data() + svb16_key_length(data.size()), DATA_COUNT) - encoded.data(); CHECK(encoded_count <= svb16_max_encoded_length(data.size())); std::vector encoded_scalar(svb16_max_encoded_length(data.size())); auto const scalar_encoded_count = svb16::encode_scalar( data.data(), encoded_scalar.data(), encoded_scalar.data() + svb16_key_length(data.size()), DATA_COUNT) - encoded_scalar.data(); CHECK(scalar_encoded_count == encoded_count); CHECK(encoded == encoded_scalar); std::vector decoded(DATA_COUNT); auto const encoded_span = gsl::make_span(encoded); auto const key_length = svb16_key_length(data.size()); auto const consumed = svb16::decode_sse( gsl::make_span(decoded), encoded_span.subspan(0, key_length), encoded_span.subspan(key_length)) - encoded.data(); CHECK(consumed == encoded_count); CHECK_THAT(decoded, Equals(data)); } TEST_CASE("SSE decode is inverse of scalar encode", "[scalar]") { SECTION("Unsigned, no delta, no zig-zag") { test_sse_encode_scalar_decode(); } SECTION("Signed, no delta, no zig-zag") { test_sse_encode_scalar_decode(); } SECTION("Unsigned, delta, no zig-zag") { test_sse_encode_scalar_decode(); } SECTION("Signed, delta, no zig-zag") { test_sse_encode_scalar_decode(); } SECTION("Unsigned, delta, zig-zag") { test_sse_encode_scalar_decode(); } SECTION("Signed, delta, zig-zag") { test_sse_encode_scalar_decode(); } SECTION("Unsigned, no delta, zig-zag") { // this scenario doesn't really make sense, but it's possible, so let's test it test_sse_encode_scalar_decode(); } SECTION("Signed, no delta, zig-zag") { test_sse_encode_scalar_decode(); } } #endif 
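The SECTION names in the two svb16 test files refer to "delta" and "zig-zag" transforms applied before the variable-byte encoding. As a rough illustration of what those two terms usually mean for 16-bit samples, here is a minimal standalone sketch. It is not taken from `pod5_format/svb16`; the function names `zigzag_encode`, `zigzag_decode`, `delta_zigzag_encode` and `delta_zigzag_decode` are hypothetical and exist only for this example.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch only: these helpers are hypothetical and are not the svb16 API.

// Zig-zag maps signed values onto unsigned ones so that small magnitudes stay small:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
inline std::uint16_t zigzag_encode(std::int16_t value)
{
    auto const bits = static_cast<std::uint16_t>(value);
    auto const sign_mask = static_cast<std::uint16_t>(value < 0 ? 0xFFFFu : 0x0000u);
    return static_cast<std::uint16_t>(static_cast<std::uint16_t>(bits << 1) ^ sign_mask);
}

// Inverse of zigzag_encode. The final cast assumes two's complement conversion,
// which C++20 guarantees and mainstream compilers already implement.
inline std::int16_t zigzag_decode(std::uint16_t value)
{
    auto const magnitude = static_cast<std::uint16_t>(value >> 1);
    auto const sign_mask = static_cast<std::uint16_t>((value & 1u) ? 0xFFFFu : 0x0000u);
    return static_cast<std::int16_t>(static_cast<std::uint16_t>(magnitude ^ sign_mask));
}

// Delta coding stores each sample as the difference from the previous one
// (the first sample is a difference from zero), then zig-zags the difference.
inline std::vector<std::uint16_t> delta_zigzag_encode(std::vector<std::int16_t> const & samples)
{
    std::vector<std::uint16_t> out;
    out.reserve(samples.size());
    std::uint16_t previous = 0;
    for (auto const sample : samples) {
        auto const current = static_cast<std::uint16_t>(sample);
        // Differences are taken modulo 2^16, so they always round-trip.
        auto const delta = static_cast<std::uint16_t>(current - previous);
        out.push_back(zigzag_encode(static_cast<std::int16_t>(delta)));
        previous = current;
    }
    return out;
}

// Undo the zig-zag step, then accumulate the differences back into samples.
inline std::vector<std::int16_t> delta_zigzag_decode(std::vector<std::uint16_t> const & encoded)
{
    std::vector<std::int16_t> out;
    out.reserve(encoded.size());
    std::uint16_t previous = 0;
    for (auto const value : encoded) {
        auto const delta = static_cast<std::uint16_t>(zigzag_decode(value));
        previous = static_cast<std::uint16_t>(previous + delta);
        out.push_back(static_cast<std::int16_t>(previous));
    }
    return out;
}
```

Slowly varying signals produce small differences, and zig-zag keeps small negative differences small as unsigned values, which is the property a variable-byte encoder can exploit. That is also why the tests describe "no delta, zig-zag" as a combination that does not really make sense.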
================================================ FILE: c++/test/test_utils.h ================================================ #pragma once #include #include #include template <typename T> struct Catch::StringMaker<arrow::Result<T>> { static std::string convert(arrow::Result<T> const & value) { return value.status().ToString(); } }; template <bool CheckOk> class IsStatusOk : public Catch::MatcherBase<arrow::Status> { public: IsStatusOk() = default; bool match(arrow::Status const & status) const override { return status.ok() == CheckOk; } virtual std::string describe() const override { return CheckOk ? "== arrow::Status::OK()" : "!= arrow::Status::OK()"; } }; template <typename T, bool CheckOk> class IsResultOk : public Catch::MatcherBase<arrow::Result<T>> { public: IsResultOk() = default; bool match(arrow::Result<T> const & status) const override { return status.ok() == CheckOk; } virtual std::string describe() const override { return CheckOk ? "== arrow::Status::OK()" : "!= arrow::Status::OK()"; } }; template <typename T> inline IsResultOk<T, true> _is_arrow_ok(arrow::Result<T> const &) { return IsResultOk<T, true>(); } inline IsStatusOk<true> _is_arrow_ok(arrow::Status const &) { return IsStatusOk<true>(); } template <typename T> inline IsResultOk<T, false> _is_arrow_not_ok(arrow::Result<T> const &) { return IsResultOk<T, false>(); } inline IsStatusOk<false> _is_arrow_not_ok(arrow::Status const &) { return IsStatusOk<false>(); } #define CHECK_ARROW_STATUS_OK(statement) \ do { \ auto const & _res = (statement); \ CHECK_THAT(_res, _is_arrow_ok(_res)); \ } while (false) #define REQUIRE_ARROW_STATUS_OK(statement) \ do { \ auto const & _res = (statement); \ REQUIRE_THAT(_res, _is_arrow_ok(_res)); \ } while (false) #define CHECK_ARROW_STATUS_NOT_OK(statement) \ do { \ auto const & _res = (statement); \ CHECK_THAT(_res, _is_arrow_not_ok(_res)); \ } while (false) #define REQUIRE_ARROW_STATUS_NOT_OK(statement) \ do { \ auto const & _res = (statement); \ REQUIRE_THAT(_res, _is_arrow_not_ok(_res)); \ } while (false) ================================================ FILE: c++/test/thread_pool_tests.cpp ================================================ #include "pod5_format/thread_pool.h" #include #include #include #include TEST_CASE("Thread 
pool runs tasks concurrently", "[thread_pool]") { using namespace std::chrono_literals; auto const explicit_stop = GENERATE(true, false); CAPTURE(explicit_stop); auto const use_strands = GENERATE(true, false); CAPTURE(use_strands); // semaphores only in std lib in c++20, so fake them std::mutex sem_mutex; int sem1 = 2; std::condition_variable cv1; int sem2 = 2; std::condition_variable cv2; auto const create_task = [&]() -> std::function { return [&] { std::unique_lock l{sem_mutex}; sem1--; if (sem1 > 0) { cv1.wait(l, [&] { return sem1 == 0; }); } else { l.unlock(); cv1.notify_all(); std::this_thread::sleep_for(1ms); l.lock(); } sem2--; if (sem2 > 0) { cv2.wait(l, [&] { return sem2 == 0; }); } else { l.unlock(); cv2.notify_all(); } }; }; auto thread_pool = pod5::make_thread_pool(2); std::shared_ptr strands[2]; if (use_strands) { for (unsigned i = 0; i < 2; ++i) { strands[i] = thread_pool->create_strand(); strands[i]->post(create_task()); } } else { thread_pool->post(create_task()); thread_pool->post(create_task()); } if (explicit_stop) { thread_pool->stop_and_drain(); } else { thread_pool.reset(); for (unsigned i = 0; i < 2; ++i) { strands[i].reset(); } } REQUIRE(sem1 == 0); REQUIRE(sem2 == 0); } TEST_CASE("Tasks on the same strand are serialised", "[thread_pool]") { using namespace std::chrono_literals; auto const explicit_stop = GENERATE(true, false); CAPTURE(explicit_stop); std::mutex seq_mutex; std::vector seq; seq.reserve(4); auto const create_task = [&](int const num) -> std::function { return [&, num] { { std::lock_guard l{seq_mutex}; seq.push_back(num); } std::this_thread::sleep_for(50ms); { std::lock_guard l{seq_mutex}; seq.push_back(num); } }; }; auto thread_pool = pod5::make_thread_pool(2); auto strand = thread_pool->create_strand(); strand->post(create_task(0)); strand->post(create_task(1)); if (explicit_stop) { thread_pool->stop_and_drain(); } else { thread_pool.reset(); strand.reset(); } REQUIRE(seq.size() == 4); if (seq[0] == 0) { REQUIRE(seq == 
(std::vector{0, 0, 1, 1})); } else { REQUIRE(seq == (std::vector{1, 1, 0, 0})); } } ================================================ FILE: c++/test/utils.h ================================================ #pragma once #include "pod5_format/read_table_utils.h" #include "test_utils.h" #include #include inline pod5::RunInfoData get_test_run_info_data( std::string suffix = "", std::int16_t adc_center_offset = 0, std::int16_t sample_rate = 4000) { return pod5::RunInfoData( "acquisition_id" + suffix, 1005, 4095 + adc_center_offset, -4096 + adc_center_offset, {{"context" + suffix, "tags" + suffix}, {"other" + suffix, "tagz" + suffix}, {"third" + suffix, "thing" + suffix}}, "experiment_name" + suffix, "flow_cell_id" + suffix, "flow_cell_product_code" + suffix, "protocol_name" + suffix, "protocol_run_id" + suffix, 200005, "sample_id" + suffix, sample_rate, "sequencing_kit" + suffix, "sequencer_position" + suffix, "sequencer_position_type" + suffix, "software" + suffix, "system_name" + suffix, "system_type" + suffix, {{"tracking" + suffix, "id" + suffix}}); } inline arrow::Status remove_file_if_exists(std::string const & file) { ARROW_ASSIGN_OR_RAISE( auto arrow_reads_path, ::arrow::internal::PlatformFilename::FromString(file)); ARROW_ASSIGN_OR_RAISE(bool file_exists, arrow::internal::FileExists(arrow_reads_path)); if (file_exists) { ARROW_RETURN_NOT_OK(arrow::internal::DeleteFile(arrow_reads_path)); } return arrow::Status::OK(); } ================================================ FILE: c++/test/uuid_tests.cpp ================================================ // This file contains code from https://github.com/mariusbancila/stduuid/ which has the following // license: // // MIT License // // Copyright (c) 2017 // // Permission is hereby granted, free of charge, to any person obtaining a copy // of this software and associated documentation files (the "Software"), to deal // in the Software without restriction, including without limitation the rights // to use, copy, modify, 
merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the Software is // furnished to do so, subject to the following conditions: // // The above copyright notice and this permission notice shall be included in all // copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, // FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE // AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER // LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE // SOFTWARE. #include "pod5_format/uuid.h" #include #include #include TEST_CASE("Default constructor returns nil UUID", "[pod5::Uuid]") { pod5::Uuid nil; REQUIRE(nil.is_nil()); } TEST_CASE("Default constructor produces all-zero string", "[pod5::Uuid]") { pod5::Uuid nil; REQUIRE(to_string(nil) == "00000000-0000-0000-0000-000000000000"); REQUIRE(pod5::to_string(nil) == L"00000000-0000-0000-0000-000000000000"); } TEST_CASE("Parsing the nil UUID is nil", "[pod5::Uuid]") { auto const no_braces = pod5::Uuid::from_string("00000000-0000-0000-0000-000000000000"); auto const braces = pod5::Uuid::from_string("{00000000-0000-0000-0000-000000000000}"); auto const no_braces_w = pod5::Uuid::from_string(L"00000000-0000-0000-0000-000000000000"); auto const braces_w = pod5::Uuid::from_string(L"{00000000-0000-0000-0000-000000000000}"); REQUIRE(no_braces); REQUIRE(no_braces->is_nil()); REQUIRE(braces); REQUIRE(braces->is_nil()); REQUIRE(no_braces_w); REQUIRE(no_braces_w->is_nil()); REQUIRE(braces_w); REQUIRE(braces_w->is_nil()); } TEST_CASE("Parsing produces the same value with or without braces", "[pod5::Uuid]") { auto const no_braces = pod5::Uuid::from_string("1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7"); auto 
const braces = pod5::Uuid::from_string("{1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7}"); auto const no_braces_w = pod5::Uuid::from_string(L"1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7"); auto const braces_w = pod5::Uuid::from_string(L"{1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7}"); REQUIRE(no_braces == braces); REQUIRE(no_braces_w == braces_w); } TEST_CASE("Parsing produces the same value from char or wchar_t", "[pod5::Uuid]") { auto const no_braces = pod5::Uuid::from_string("1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7"); auto const braces = pod5::Uuid::from_string("{1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7}"); auto const no_braces_w = pod5::Uuid::from_string(L"1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7"); auto const braces_w = pod5::Uuid::from_string(L"{1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7}"); REQUIRE(no_braces == no_braces_w); REQUIRE(braces == braces_w); } TEST_CASE("A parsed UUID prints the same value", "[pod5::Uuid]") { auto const guid = pod5::Uuid::from_string("1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7"); REQUIRE(guid); REQUIRE(to_string(*guid) == "1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7"); REQUIRE(pod5::to_string(*guid) == L"1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7"); } TEST_CASE("Invalid UUIDs cannot be parsed", "[pod5::Uuid]") { REQUIRE_FALSE(pod5::Uuid::from_string("")); REQUIRE_FALSE(pod5::Uuid::from_string("{}")); // mismatched braces REQUIRE_FALSE(pod5::Uuid::from_string("{1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7")); REQUIRE_FALSE(pod5::Uuid::from_string("1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c7}")); // missing a char REQUIRE_FALSE(pod5::Uuid::from_string("1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c")); // too many chars REQUIRE_FALSE(pod5::Uuid::from_string("1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96c77")); // invalid characters REQUIRE_FALSE(pod5::Uuid::from_string("1d5a3dd9-2d50-4f2b-a0fb-a3a749eb96cg")); } TEST_CASE("Construction from iterators", "[pod5::Uuid]") { using namespace std::string_literals; { std::array const arr{ {0x47, 0x18, 0x38, 0x23, 0x25, 0x74, 0x4b, 0xfd, 0xb4, 0x11, 0x99, 0xed, 0x17, 0x7d, 0x3e, 
0x43}}; pod5::Uuid guid(arr.begin(), arr.end()); REQUIRE(to_string(guid) == "47183823-2574-4bfd-b411-99ed177d3e43"s); } { uint8_t const arr[16] = { 0x47, 0x18, 0x38, 0x23, 0x25, 0x74, 0x4b, 0xfd, 0xb4, 0x11, 0x99, 0xed, 0x17, 0x7d, 0x3e, 0x43}; pod5::Uuid guid(std::begin(arr), std::end(arr)); REQUIRE(to_string(guid) == "47183823-2574-4bfd-b411-99ed177d3e43"s); } } TEST_CASE("Construction from arrays", "[pod5::Uuid]") { using namespace std::string_literals; { pod5::Uuid guid{ {0x47, 0x18, 0x38, 0x23, 0x25, 0x74, 0x4b, 0xfd, 0xb4, 0x11, 0x99, 0xed, 0x17, 0x7d, 0x3e, 0x43}}; REQUIRE(to_string(guid) == "47183823-2574-4bfd-b411-99ed177d3e43"s); } { std::array const arr{ {0x47, 0x18, 0x38, 0x23, 0x25, 0x74, 0x4b, 0xfd, 0xb4, 0x11, 0x99, 0xed, 0x17, 0x7d, 0x3e, 0x43}}; pod5::Uuid guid(arr); REQUIRE(to_string(guid) == "47183823-2574-4bfd-b411-99ed177d3e43"s); } { uint8_t const arr[16] = { 0x47, 0x18, 0x38, 0x23, 0x25, 0x74, 0x4b, 0xfd, 0xb4, 0x11, 0x99, 0xed, 0x17, 0x7d, 0x3e, 0x43}; pod5::Uuid guid(arr); REQUIRE(to_string(guid) == "47183823-2574-4bfd-b411-99ed177d3e43"s); } } TEST_CASE("Test equality", "[operators]") { pod5::Uuid empty; auto engine = pod5::UuidRandomGenerator::engine_type{Catch::rngSeed()}; pod5::Uuid guid = pod5::UuidRandomGenerator{engine}(); REQUIRE(empty == empty); REQUIRE(guid == guid); REQUIRE(empty != guid); } TEST_CASE("Test comparison", "[operators]") { auto empty = pod5::Uuid{}; auto engine = pod5::UuidRandomGenerator::engine_type{Catch::rngSeed()}; pod5::UuidRandomGenerator gen{engine}; auto id = gen(); REQUIRE(empty < id); std::set ids{pod5::Uuid{}, gen(), gen(), gen(), gen()}; REQUIRE(ids.size() == 5); REQUIRE(ids.find(pod5::Uuid{}) != ids.end()); } TEST_CASE("Test hashing", "[ops]") { using namespace std::string_literals; auto str = "47183823-2574-4bfd-b411-99ed177d3e43"s; auto guid = pod5::Uuid::from_string(str).value(); auto h1 = std::hash{}; auto h2 = std::hash{}; #ifdef UUID_HASH_STRING_BASED REQUIRE(h1(str) == h2(guid)); #else 
REQUIRE(h1(str) != h2(guid)); #endif auto engine = pod5::UuidRandomGenerator::engine_type{Catch::rngSeed()}; pod5::UuidRandomGenerator gen{engine}; std::unordered_set ids{pod5::Uuid{}, gen(), gen(), gen(), gen()}; REQUIRE(ids.size() == 5); REQUIRE(ids.find(pod5::Uuid{}) != ids.end()); } TEST_CASE("Test swap", "[ops]") { pod5::Uuid empty; auto engine = pod5::UuidRandomGenerator::engine_type{Catch::rngSeed()}; pod5::Uuid guid = pod5::UuidRandomGenerator{engine}(); REQUIRE(empty.is_nil()); REQUIRE_FALSE(guid.is_nil()); std::swap(empty, guid); REQUIRE_FALSE(empty.is_nil()); REQUIRE(guid.is_nil()); empty.swap(guid); REQUIRE(empty.is_nil()); REQUIRE_FALSE(guid.is_nil()); } TEST_CASE("Test constexpr", "[const]") { constexpr pod5::Uuid empty; static_assert(empty.is_nil()); } TEST_CASE("Test size", "[operators]") { REQUIRE(sizeof(pod5::Uuid) == 16); } ================================================ FILE: ci/docker/Dockerfile.conda ================================================ FROM condaforge/mambaforge:latest WORKDIR / ================================================ FILE: ci/docker/Dockerfile.py39.arm64 ================================================ from git.oxfordnanolabs.local:4567/minknow/images/build-aarch64-gcc9 RUN yum groupinstall "Development Tools" -y RUN yum install wget openssl-devel libffi-devel bzip2-devel -y RUN wget https://www.python.org/ftp/python/3.9.10/Python-3.9.10.tgz RUN tar xvf Python-* WORKDIR Python-3.9.10/ RUN ./configure --enable-optimizations RUN make altinstall RUN rm /usr/bin/python3 && ln -s /usr/local/bin/python3.9 /usr/bin/python3 WORKDIR / ================================================ FILE: ci/docker/Dockerfile.py39.x64 ================================================ from git.oxfordnanolabs.local:4567/minknow/images/build-x86_64-gcc9 RUN yum groupinstall "Development Tools" -y RUN yum install wget openssl-devel libffi-devel bzip2-devel -y RUN wget https://www.python.org/ftp/python/3.9.10/Python-3.9.10.tgz RUN tar xvf Python-* 
WORKDIR Python-3.9.10/ RUN ./configure --enable-optimizations RUN make altinstall RUN rm /usr/bin/python3 && ln -s /usr/local/bin/python3.9 /usr/bin/python3 WORKDIR / ================================================ FILE: ci/generate_coverage_report.sh ================================================ #!/bin/bash -e # Parse args. if [ $# -ne 1 ]; then echo "Usage: $0 build_dir" exit 1 fi build_dir=$(realpath "$1") # Set up the venv. echo "Setting up venv" if [ ! -e .coverage_venv ]; then python3 -m venv .coverage_venv fi # shellcheck disable=SC1091 # "Not following: .coverage_venv/bin/activate was not specified as input" source .coverage_venv/bin/activate # --cobertura support added in 5.1. pip install -U 'gcovr>=5.1' # Determine the root of the project. # Note: shellcheck wants these split up into separate lines. project_root=$(realpath "$0") project_root=$(dirname "${project_root}") project_root=$(dirname "${project_root}") cd "${project_root}" gcovr_args=( # work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68080 --gcov-ignore-parse-errors=negative_hits.warn --filter "${project_root}/c\+\+" ) function generate_coverage { test_name=$1 regex=$2 echo "Generating coverage report for ${test_name}" # Clear out old coverage info. find "${project_root}" -name "*.gcda" -delete # Run the test. # shellcheck disable=SC2086 # the regex is intentionally split ctest --test-dir "${build_dir}" ${regex} # Generate the coverage report for this test. gcovr "${gcovr_args[@]}" --cobertura "${project_root}/coverage-report-${test_name}.xml" gcovr "${gcovr_args[@]}" --html-single-page --html-details "${project_root}/coverage-report-${test_name}.html" } # Generate a report for each test. for test_name in $(ctest --test-dir "${build_dir}" -N | sed -rn 's/^ +Test +#[0-9]+: +(.*)$/\1/p'); do generate_coverage "${test_name}" "-R ^${test_name}\$" done # Generate a full coverage report too. 
generate_coverage "all" ""

# CI wants to see a TOTAL line in order to report coverage, so give it the one from all tests.
# gcovr only has a resolution of 1%, so do the calculation ourselves.
gcovr "${gcovr_args[@]}" | grep TOTAL | awk '{print $1, $2, $3, 100 * $3 / $2 "%"}'

================================================
FILE: ci/get_tag_version.cmake
================================================
set(CANONICAL_TAG_BUILD TRUE)
include("${CMAKE_CURRENT_LIST_DIR}/../cmake/POD5Version.cmake")
message("${POD5_FULL_VERSION}")

================================================
FILE: ci/gitlab-ci-common.yml
================================================
variables:
  CONAN_USER: nanopore
  CONAN_CHANNEL: stable
  CONAN_REFERENCE: '.'
  # Location of the .conan dir: having it in $CI_PROJECT_DIR makes it easy to grab the packages as
  # artifacts, and putting it in a job-specific subdir allows multiple packages to be unpacked
  # into a single upload job (otherwise the metadata.json files would overwrite each other)
  CONAN_USER_HOME: '${CI_PROJECT_DIR}/${CI_JOB_ID}'
  PACKAGES_PER_VERSION: 2
  # can set this instead for the total number:
  #EXPECTED_PACKAGE_COUNT: 2

stages:
  - build
  - upload

before_script:
  - conan config install --verify-ssl=no "${CONAN_CONFIG_URL}"

#
# use the extends keyword to inherit the job templates defined below
#

.parallel-cppstd:
  # A matrix definition to allow conan builds with different cppstd
  parallel:
    matrix:
      - CONAN_PROFILE_CPPSTD: [17, 20]

.tarball-package: &tarball-package
  # gitlab-runner on Windows silently fails to archive files whose full path is longer than 260
  # characters; the MSYS `tar` command is not subject to this limitation (providing Windows has
  # been configured to allow long paths), so we tar up packages in the build job and untar them in
  # the upload job.
  #
  # This also allows us to only archive the package we just built, and not any of its dependencies
  # (because we can use `conan inspect` to find the name of the right packages).
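  #
  # A worked example of the path handling in the commands that follow. This is only a sketch:
  # the project dir, job ID and package name are hypothetical values chosen for illustration,
  # and it assumes the commands run from $CI_PROJECT_DIR.
  #
  #   PWD=/builds/group/project
  #   CONAN_USER_HOME=/builds/group/project/1234
  #   ${CONAN_USER_HOME#${PWD}/}             expands to  1234
  #   conan inspect --raw name .             prints      mypkg
  #   PACKAGE_DIR                            becomes     1234/.conan/data/mypkg
  #
  # The `#` parameter expansion strips the leading "${PWD}/", so the tarball stores paths that
  # are relative to the project dir and start with the job ID.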
- PACKAGE_DIR="${CONAN_USER_HOME#${PWD}/}/.conan/data/$(conan inspect --raw name ${CONAN_REFERENCE})" - echo "Packing from $PACKAGE_DIR" - tar -czvf "conan-${CI_JOB_ID}.tar.gz" "$PACKAGE_DIR"/*/${CONAN_USER}/${CONAN_CHANNEL}/{package,metadata.json} - rm -rf "${CONAN_USER_HOME}/.conan" .profile-variables: &profile-variables # The caller (an individual package) should have set up either PROFILE_BASE or PROFILE_BASE_HOST # and PROFILE_BASE_BUILD. We set variables so that both PROFILE_BASE_HOST and PROFILE_BASE_BUILD # are defined correctly after this call, or exit. - if [[ -n ${PROFILE_BASE} && ( -n ${PROFILE_BASE_HOST} || -n ${PROFILE_BASE_BUILD} ) ]]; then - echo "Only one of PROFILE_BASE or (PROFILE_BASE_HOST and PROFILE_BASE_BUILD) should be defined" - exit 1 - fi - if [[ -n ${PROFILE_BASE} ]]; then - PROFILE_BASE_HOST=${PROFILE_BASE} - PROFILE_BASE_BUILD=${PROFILE_BASE} - fi - if [[ -z ${PROFILE_BASE_HOST} || -z ${PROFILE_BASE_BUILD} ]]; then - echo "Both PROFILE_BASE_HOST and PROFILE_BASE_BUILD variables need to be defined" - exit 1 - fi .build-package: # The script builds all required conan packages. The caller needs to set up: # Either PROFILE_BASE or both PROFILE_BASE_HOST and PROFILE_BASE_BUILD # VERSIONS as an array if one or more version numbers are wanted. # EXTRA_CREATE_ARGS is passed to conan unchanged, if present. # # EXTRA_CREATE_ARGS is only used by libcurl, which builds the libcurl in parallel with c_ares # set to True and to False. # # # The after_script removes unneeded builds and sources and packages everything into a tarball, # artifacts defines the name and path for build artifacts. stage: build variables: # For Linux we need to tell arrow to not use boost. 
EXTRA_CREATE_ARGS: "-o arrow:with_boost=False -o arrow:with_thrift=False -o arrow:parquet=False -o arrow:with_zstd=True" script: - *profile-variables - | if [[ -n ${VERSIONS} ]]; then for version in ${VERSIONS}; do export CONAN_PROFILE_BUILD_TYPE=Debug conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${version}@${CONAN_USER}/${CONAN_CHANNEL} ${EXTRA_CREATE_ARGS} export CONAN_PROFILE_BUILD_TYPE=Release conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${version}@${CONAN_USER}/${CONAN_CHANNEL} ${EXTRA_CREATE_ARGS} done else export CONAN_PROFILE_BUILD_TYPE=Debug conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${CONAN_USER}/${CONAN_CHANNEL} ${EXTRA_CREATE_ARGS} export CONAN_PROFILE_BUILD_TYPE=Release conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${CONAN_USER}/${CONAN_CHANNEL} ${EXTRA_CREATE_ARGS} fi after_script: # Re-load the venv if it exists - if ls .venv/*/activate >/dev/null 2>&1; then source .venv/*/activate; fi - conan --version # Avoid storing things on the CI node unnecessarily - conan remove "*" --builds --src --force - *tarball-package artifacts: name: "${CI_PROJECT_NAME}-${CI_JOB_ID}" paths: - 'conan-*.tar.gz' .build-package-win: # Almost the same as build-package. Sets two additional variables CONAN_USER_HOME_SHORT and # CONAN_USE_ALWAYS_SHORT_PATHS. "script" is exactly the same as for build-package. "after_script" # does some additional processing needed for Windows between removing conan builds and sources, # and creating tarballs. 
extends: .build-package variables: # avoid interfering with the standard conan short-path directory CONAN_USER_HOME_SHORT: 'c:\.conan-tmp' # we're nesting conan's data dir pretty deep, so build systems that would normally be ok can # fail if we don't use short paths CONAN_USE_ALWAYS_SHORT_PATHS: '1' # We need to override arrow's boost 1.85.0 requirement to match the version we use internally. EXTRA_CREATE_ARGS: "-o arrow:with_thrift=False -o arrow:parquet=False --require=boost/1.86.0@ -o boost:without_locale=True" after_script: # Avoid storing things on the CI node unnecessarily - conan remove "*" --builds --src --force # CONAN_USE_ALWAYS_SHORT_PATHS links paths deep in the data dir to dirs in c:\.conan # Resolve package links (so they can be gathered into artifacts): - shopt -s nullglob # allow there to be nothing, eg: if CONAN_USE_ALWAYS_SHORT_PATHS is off # MOVE_COMMAND can be set to, say, "cp -r" if necessary. Moving has been seen to fail for # packages with executables (especially if those executables are run as part of the test # package), such as protobuf. - for link in ${CONAN_USER_HOME}/.conan/data/*/*/$CONAN_USER/$CONAN_CHANNEL/package/*/.conan_link; do source=$(cat $link) && ${MOVE_COMMAND:-mv} $(cygpath "$source")/* $(dirname $link) && rm $link; done # Clean up the short_paths folder (even on failure): - rm -rf "/c/.conan-tmp" - *tarball-package # This can be used to override the script stage to build both static and shared versions of a # library. The "conan create" commands are duplicates with either -o ${PACKAGE}:shared=False or # -o ${PACKAGE}:shared=True added. Since this doesn't use "extends" the caller has to extend # either build-package or build-package-win as well. 
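#
# A minimal sketch of a caller, assuming a hypothetical job name (`build-linux-gcc11` is not
# defined anywhere in this file); the templates it extends are the ones defined in this file:
#
#   build-linux-gcc11:
#     extends:
#       - .build-package
#       - .build-shared-and-static
#       - .profile-linux-x86_64-gcc11
#
# When `extends` lists several templates, GitLab merges them in order and later entries win for
# duplicate keys, which is why `.build-shared-and-static` is listed after `.build-package` here:
# its `script` then replaces the default one.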
.build-shared-and-static: script: - *profile-variables - PACKAGE="$(conan inspect --raw name .)" - if [[ -n ${VERSIONS} ]]; then - for version in ${VERSIONS}; do - export CONAN_PROFILE_BUILD_TYPE=Debug - conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${version}@${CONAN_USER}/${CONAN_CHANNEL} -o ${PACKAGE}:shared=False ${EXTRA_CREATE_ARGS} - conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${version}@${CONAN_USER}/${CONAN_CHANNEL} -o ${PACKAGE}:shared=True ${EXTRA_CREATE_ARGS} - export CONAN_PROFILE_BUILD_TYPE=Release - conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${version}@${CONAN_USER}/${CONAN_CHANNEL} -o ${PACKAGE}:shared=False ${EXTRA_CREATE_ARGS} - conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${version}@${CONAN_USER}/${CONAN_CHANNEL} -o ${PACKAGE}:shared=True ${EXTRA_CREATE_ARGS} - done - else - export CONAN_PROFILE_BUILD_TYPE=Debug - conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${CONAN_USER}/${CONAN_CHANNEL} -o ${PACKAGE}:shared=False ${EXTRA_CREATE_ARGS} - conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${CONAN_USER}/${CONAN_CHANNEL} -o ${PACKAGE}:shared=True ${EXTRA_CREATE_ARGS} - export CONAN_PROFILE_BUILD_TYPE=Release - conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${CONAN_USER}/${CONAN_CHANNEL} -o ${PACKAGE}:shared=False ${EXTRA_CREATE_ARGS} - conan create --profile:build ${PROFILE_BASE_BUILD} --profile:host ${PROFILE_BASE_HOST} ${CONAN_REFERENCE} ${CONAN_USER}/${CONAN_CHANNEL} -o ${PACKAGE}:shared=True ${EXTRA_CREATE_ARGS} - fi .upload-package: stage: upload image: git.oxfordnanolabs.local:4567/traque/ont-docker-base/ont-base-python:3.8 tags: - 
linux - docker before_script: - echo -e "\e[0Ksection_start:`date +%s`:install_conan[collapsed=true]\r\e[0KInstalling conan" - pip install 'conan<2' - echo -e "\e[0Ksection_end:`date +%s`:install_conan\r\e[0K" script: # BSD tar (on macOS) puts some extra optional information into the tarballs that GNU tar # complains about. --warning=no-unknown-keyword suppresses this. - for tarball in conan-*.tar.gz; do tar --warning=no-unknown-keyword -xf "$tarball"; done - for conan_dir in ./*/.conan; do - job_dir="$(dirname "$conan_dir")" - echo -e "\e[0Ksection_start:`date +%s`:upload_package\r\e[0KUploading from $job_dir" - export CONAN_USER_HOME="$PWD/$job_dir" - conan config install --verify-ssl=no "${CONAN_CONFIG_URL}" - if [[ -n ${VERSIONS} ]]; then - expected_recipe_count=$(echo ${VERSIONS} | wc -w) - for version in ${VERSIONS}; do - conan export ${CONAN_REFERENCE} ${version}@${CONAN_USER}/${CONAN_CHANNEL} - done - else - expected_recipe_count=1 - conan export ${CONAN_REFERENCE} ${CONAN_USER}/${CONAN_CHANNEL} - fi - PACKAGE="$(conan inspect --raw name ${CONAN_REFERENCE})" - recipes="$(conan search --raw "${PACKAGE}/*@${CONAN_USER}/${CONAN_CHANNEL}")" - recipe_count="$(echo $recipes | wc -w)" - package_count=0 - for recipe in $recipes; do - echo "${recipe}:" - conan search "$recipe" - package_count=$(($package_count + $(conan search "$recipe" | grep "Package_ID:" | wc -l))) - done - if [[ -z $EXPECTED_PACKAGE_COUNT ]]; then - EXPECTED_PACKAGE_COUNT=$((PACKAGES_PER_VERSION * expected_recipe_count)) - fi - if [[ $recipe_count -ne $expected_recipe_count ]] || [[ $package_count -ne $EXPECTED_PACKAGE_COUNT ]]; then - echo "Expected $expected_recipe_count recipe(s) with $EXPECTED_PACKAGE_COUNT package(s), got $recipe_count recipe(s) with $package_count package(s)" - exit 1 - fi # conan claims it should pick this information up automatically, given the variable names, # but it doesn't seem to work if you don't do this: - conan user -r ont-artifactory -p "${CONAN_PASSWORD}" 
"${CONAN_LOGIN_USERNAME}" - EXTRA_ARGS= - if [[ -z $DO_UPLOAD ]]; then - DO_UPLOAD=no - if [[ $CI_COMMIT_REF_NAME == stable/* ]] || [[ $CI_COMMIT_REF_NAME == release/* ]] || [[ $CI_COMMIT_REF_NAME == $STABLE_BRANCH_NAME ]]; then - DO_UPLOAD=yes - fi - fi - if [[ $DO_UPLOAD == "yes" ]]; then - EXTRA_ARGS=--force - else - 'echo "WARNING: NOT uploading to artifactory for this branch"' - EXTRA_ARGS=--skip-upload - fi - for recipe in $recipes; do - conan upload -r ont-artifactory --all --check --confirm ${EXTRA_ARGS} "$recipe" - done - echo -e "\e[0Ksection_end:`date +%s`:upload_package\r\e[0K" - done # for conan_dir # # Various setup methods. Each sets a number of relevant tags, and one or two variables: For # non-cross compiling one variable PROFILE_BASE is set with the name of a profile which will be # adapted by adding "" or "". For cross compiling two variables PROFILE_BASE_BUILD # for the profile of the build machine and PROFILE_BASE_HOST for the host machine are set. # # # Set up for Windows versions # .profile-windows-x86_64-vs2019: # Set up for Windows x86 using VS 2019, using conan and the profile windows-x86_64-vs2019, # adapted for debug and release. To be called from individual packages by using "extends". tags: - windows - cmake - VS2019 - conan variables: PROFILE_BASE: windows-x86_64-vs2019.jinja parallel: !reference [.parallel-cppstd,parallel] .profile-windows-x86_64-vs2019-conan2: # Set up for Windows x86 using VS 2019, using conan and the profile windows-x86_64-vs2019, # adapted for debug and release. To be called from individual packages by using "extends". 
  tags:
    - windows
    - cmake
    - VS2019
    - conan
  variables:
    CMAKE_GENERATOR: "Visual Studio 16 2019"
    PROFILE_BASE: windows-x86_64-vs2019.jinja
    CMAKE_PRESET: "conan2-windows-x86_64-vs2019-cppstd${CONAN_PROFILE_CPPSTD}-release"

#
# Set up for MacOS versions
#

.profile-macos-aarch64-appleclang-15.0:
  # Set up for MacOS arm 64 using clang 15.0, using conan and the profile
  # macos-aarch64-appleclang-15.0, adapted for debug and release. To be called from individual
  # packages by using "extends".
  tags:
    - osx_arm64
    - xcode-15.3
    - conan
  variables:
    PROFILE_BASE: macos-aarch64-appleclang-15.0.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-macos-aarch64-appleclang-16.0:
  # Set up for MacOS arm 64 using clang 16.0, using conan and the profile
  # macos-aarch64-appleclang-16.0, adapted for debug and release. To be called from individual
  # packages by using "extends".
  tags:
    - osx_arm64
    - xcode-16.1
    - conan
  variables:
    PROFILE_BASE: macos-aarch64-appleclang-16.0.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-macos-aarch64-appleclang-15.0-conan2:
  # Set up for MacOS arm 64 using clang 15.0, using conan and the profile
  # macos-aarch64-appleclang-15.0, adapted for debug and release. To be called from individual
  # packages by using "extends".
  tags:
    - osx_arm64
    - xcode-15.3
    - conan
  variables:
    PROFILE_BASE: macos-aarch64-appleclang-15.0.jinja
    CMAKE_PRESET: "conan2-macos-appleclang-15.0-aarch64-cppstd${CONAN_PROFILE_CPPSTD}-release"

.profile-macos-aarch64-appleclang-16.0-conan2:
  # Set up for MacOS arm 64 using clang 16.0, using conan and the profile
  # macos-aarch64-appleclang-16.0, adapted for debug and release. To be called from individual
  # packages by using "extends".
  tags:
    - osx_arm64
    - xcode-16.1
    - conan
  variables:
    PROFILE_BASE: macos-aarch64-appleclang-16.0.jinja
    CMAKE_PRESET: "conan2-macos-appleclang-16.0-aarch64-cppstd${CONAN_PROFILE_CPPSTD}-release"

#
# Set up for linux versions
#

.profile-linux-x86_64-gcc9:
  # Set up for linux x86 using gcc9, using docker and the profile linux-x86_64-gcc9, adapted
  # for debug and release. To be called from individual packages by using "extends".
  #
  # The docker image builds on CentOS 7 using devtoolset-9, for maximum compatibility. This means
  # the compiled code will work on any Ubuntu distro from Xenial onwards (and most other
  # still-supported Linux distros). Differences between GCC 9's libstdc++ and GCC 4.8's libstdc++
  # are handled by a static library, so no special handling of libstdc++ is required.
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc9:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc9.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-linux-x86_64-gcc9-conan2:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc9:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc9.jinja
    CMAKE_PRESET: "conan2-linux-gcc9-x86_64-cppstd${CONAN_PROFILE_CPPSTD}-release"

.profile-linux-x86_64-gcc11:
  # Set up for linux x86 using gcc11, using docker and the profile linux-x86_64-gcc11, adapted
  # for debug and release. To be called from individual packages by using "extends".
  #
  # Note that the docker image uses a GCC 11 backport to Ubuntu Bionic. Compiled artifacts will
  # be mostly compatible with Ubuntu Bionic and later, except that they will need the correct
  # libstdc++ to be available. This can be achieved by installing libstdc++6 from the GCC 11
  # backport (available in the ~ubuntu-toolchain-r/test PPA), or by otherwise shipping that
  # version of libstdc++6 in a way that the software can find it.
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc11:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc11.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-linux-x86_64-gcc11-conan2:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc11:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc11.jinja
    CMAKE_PRESET: "conan2-linux-gcc11-x86_64-cppstd${CONAN_PROFILE_CPPSTD}-release"

.profile-linux-x86_64-gcc11-asan-static-conan2:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc11:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc11-asan-static.jinja
    CMAKE_PRESET: "conan2-linux-gcc11-asan-static-x86_64-cppstd${CONAN_PROFILE_CPPSTD}-release"

.profile-linux-x86_64-gcc11-usan-static-conan2:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc11:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc11-usan-static.jinja
    CMAKE_PRESET: "conan2-linux-gcc11-usan-static-x86_64-cppstd${CONAN_PROFILE_CPPSTD}-release"

.profile-linux-x86_64-gcc11-tsan-static-conan2:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc11:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc11-tsan-static.jinja
    CMAKE_PRESET: "conan2-linux-gcc11-tsan-static-x86_64-cppstd${CONAN_PROFILE_CPPSTD}-release"

.profile-linux-x86_64-gcc11-asan-static:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc11:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc11-asan-static.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-linux-x86_64-gcc11-ausan-static:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc11:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc11-ausan-static.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-linux-x86_64-gcc11-tsan-static:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc11:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc11-tsan-static.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-linux-x86_64-gcc11-usan-static:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc11:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc11-usan-static.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-linux-x86_64-gcc13:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc13:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc13.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-linux-x86_64-gcc13-conan2:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-x86_64-gcc13:latest
  tags:
    - linux_x86
    - docker
  variables:
    PROFILE_BASE: linux-x86_64-gcc13.jinja
    CMAKE_PRESET: "conan2-linux-gcc13-x86_64-cppstd${CONAN_PROFILE_CPPSTD}-release"

.profile-linux-x86_64-gcc13-asan-static:
  extends: .profile-linux-x86_64-gcc13
  variables:
    PROFILE_BASE: linux-x86_64-gcc13-asan-static.jinja

.profile-linux-x86_64-gcc13-tsan-static:
  extends: .profile-linux-x86_64-gcc13
  variables:
    PROFILE_BASE: linux-x86_64-gcc13-tsan-static.jinja

.profile-linux-x86_64-gcc13-usan-static:
  extends: .profile-linux-x86_64-gcc13
  variables:
    PROFILE_BASE: linux-x86_64-gcc13-usan-static.jinja

.profile-linux-aarch64-gcc9:
  # Set up for linux arm64 using gcc9, using docker and the profile linux-aarch64-gcc9, adapted
  # for debug and release. To be called from individual packages by using "extends".
  #
  # The docker image builds on CentOS 7 using devtoolset-9, for maximum compatibility. This means
  # the compiled code will work on any Ubuntu distro from Xenial onwards (and most other
  # still-supported Linux distros).
  # Differences between GCC 9's libstdc++ and GCC 4.8's libstdc++
  # are handled by a static library, so no special handling of libstdc++ is required.
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-aarch64-gcc9:latest
  tags:
    - linux_aarch64
    - docker
  variables:
    PROFILE_BASE: linux-aarch64-gcc9.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-linux-aarch64-gcc9-conan2:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-aarch64-gcc9:latest
  tags:
    - linux_aarch64
    - docker
  variables:
    PROFILE_BASE: linux-aarch64-gcc9.jinja
    CMAKE_PRESET: "conan2-linux-gcc9-aarch64-cppstd${CONAN_PROFILE_CPPSTD}-release"

.profile-linux-aarch64-gcc11:
  # Set up for linux arm64 using gcc11, using docker and the profile linux-aarch64-gcc11, adapted
  # for debug and release. To be called from individual packages by using "extends".
  #
  # Note that the docker image uses a GCC 11 backport to Ubuntu Bionic. Compiled artifacts will
  # be mostly compatible with Ubuntu Bionic and later, except that they will need the correct
  # libstdc++ to be available. This can be achieved by installing libstdc++6 from the GCC 11
  # backport (available in the ~ubuntu-toolchain-r/test PPA), or by otherwise shipping that
  # version of libstdc++6 in a way that the software can find it.
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-aarch64-gcc11:latest
  tags:
    - linux_aarch64
    - docker
  variables:
    PROFILE_BASE: linux-aarch64-gcc11.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-linux-aarch64-gcc11-conan2:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-aarch64-gcc11:latest
  tags:
    - linux_aarch64
    - docker
  variables:
    PROFILE_BASE: linux-aarch64-gcc11.jinja
    CMAKE_PRESET: "conan2-linux-gcc11-aarch64-cppstd${CONAN_PROFILE_CPPSTD}-release"

.profile-linux-aarch64-gcc13:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-aarch64-gcc13:latest
  tags:
    - linux_aarch64
    - docker
  variables:
    PROFILE_BASE: linux-aarch64-gcc13.jinja
  parallel: !reference [.parallel-cppstd,parallel]

.profile-linux-aarch64-gcc13-conan2:
  image: git.oxfordnanolabs.local:4567/informatics/conan-config/linux-aarch64-gcc13:latest
  tags:
    - linux_aarch64
    - docker
  variables:
    PROFILE_BASE: linux-aarch64-gcc13.jinja
    CMAKE_PRESET: "conan2-linux-gcc13-aarch64-cppstd${CONAN_PROFILE_CPPSTD}-release"

================================================
FILE: ci/install.sh
================================================
#!/bin/bash

set -o errexit
set -o pipefail
set -o nounset
# set -o xtrace

# Install the archive component (and, for static builds, the third-party libs):
(
    cmake -DCMAKE_INSTALL_PREFIX="archive" -DBUILD_TYPE="Release" -DCOMPONENT="archive" -P "cmake_install.cmake"
    if [ "$#" -ge 1 ] && [ "$1" == "STATIC_BUILD" ]; then
        if [[ "$OSTYPE" == "linux-gnu"* ]] && [[ -e "archive/lib64" ]]; then
            cp "../build/third_party/libs"/* "archive/lib64"
        else
            cp "../build/third_party/libs"/* "archive/lib"
        fi
    fi
)

# Install the wheel component:
(
    cmake -DCMAKE_INSTALL_PREFIX="wheel" -DBUILD_TYPE="Release" -DCOMPONENT="wheel" -P "cmake_install.cmake"
)

================================================
FILE: ci/package.sh
================================================
#!/bin/bash

set -o errexit
set -o pipefail
set -o nounset
# set -o xtrace

output_sku=$1
auditwheel_platform=
if [ $# -gt 1 ]; then
    auditwheel_platform="${2}"
fi

CURRENT_DIR=$(pwd)
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
REPO_ROOT="${SCRIPT_DIR}/../"

cd "${REPO_ROOT}"
pod5_version="$(cmake -P ci/get_tag_version.cmake 2>&1)"
cd "${CURRENT_DIR}"

# Tar up the archive build:
(
    cd ./archive
    tar -cvzf "${REPO_ROOT}/lib_pod5-${pod5_version}-${output_sku}.tar.gz" .
)

# Find the wheel:
(
    # Wheels are optional:
    if [ -d "wheel/" ] ; then
        cd wheel/
        if [ -z "${auditwheel_platform}" ]; then
            mv ./*.whl "${REPO_ROOT}/"
        else
            echo "Running audit wheel"
            pwd
            ls
            auditwheel repair ./*.whl --plat "${auditwheel_platform}" -w "${REPO_ROOT}/"
        fi
    fi
)

================================================
FILE: ci/unpack_libs_for_python.sh
================================================
#!/bin/bash

input_dir=$1
output_dir=$2

echo "Unpacking builds from $input_dir to $output_dir"

file_regex=".*/lib_pod5-[0-9\.]*-(.*).tar.gz"

for i in "${input_dir}"/lib_pod5*.tar.gz; do
    if [[ $i =~ $file_regex ]]
    then
        sku="${BASH_REMATCH[1]}"
        echo "Extracting for SKU: $sku"
    else
        echo "$i doesn't match expected file pattern" >&2
        exit 1
    fi

    sku_out_dir="$output_dir/$sku/"
    mkdir -p "${sku_out_dir}"

    tmp_dir="$output_dir/tmp"
    mkdir -p "$tmp_dir"
    tar -xzf "$i" --directory "$output_dir/tmp"
    mv "${tmp_dir}"/lib/* "${sku_out_dir}"
    rm -r "$tmp_dir"
done

echo "unpacked skus:"
ls "${output_dir}/"
echo "contents:"
ls "${output_dir}"/*

================================================
FILE: cmake/BuildFlatBuffers.cmake
================================================
# Copyright 2015 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# General function to create FlatBuffer build rules for the given list of
# schemas.
#
# flatbuffers_schemas: A list of flatbuffer schema files to process.
#
# schema_include_dirs: A list of schema file include directories, which will be
# passed to flatc via the -I parameter.
#
# custom_target_name: The generated files will be added as dependencies for a
# new custom target with this name. You should add that target as a dependency
# for your main target to ensure these files are built. You can also retrieve
# various properties from this target, such as GENERATED_INCLUDES_DIR,
# BINARY_SCHEMAS_DIR, and COPY_TEXT_SCHEMAS_DIR.
#
# additional_dependencies: A list of additional dependencies that you'd like
# all generated files to depend on. Pass in a blank string if you have none.
#
# generated_includes_dir: Where to generate the C++ header files for these
# schemas. The generated includes directory will automatically be added to
# CMake's include directories, and will be where generated header files are
# placed. This parameter is optional; pass in empty string if you don't want to
# generate include files for these schemas.
#
# binary_schemas_dir: If you specify an optional binary schema directory, binary
# schemas will be generated for these schemas as well, and placed into the given
# directory.
#
# copy_text_schemas_dir: If you want all text schemas (including schemas from
# all schema include directories) copied into a directory (for example, if you
# need them within your project to build JSON files), you can specify that
# folder here.
# All text schemas will be copied to that folder.
#
# IMPORTANT: Make sure you quote all list arguments you pass to this function!
# Otherwise CMake will only pass in the first element.
# Example: build_flatbuffers("${fb_files}" "${include_dirs}" target_name ...)
function(build_flatbuffers flatbuffers_schemas
                           schema_include_dirs
                           custom_target_name
                           additional_dependencies
                           generated_includes_dir
                           binary_schemas_dir
                           copy_text_schemas_dir)

  # Test if including from FindFlatBuffers
  if(FLATBUFFERS_FLATC_EXECUTABLE)
    set(FLATC_TARGET "")
    set(FLATC ${FLATBUFFERS_FLATC_EXECUTABLE})
  else()
    set(FLATC_TARGET flatc)
    set(FLATC flatc)
  endif()
  set(FLATC_SCHEMA_ARGS --gen-mutable)
  if(FLATBUFFERS_FLATC_SCHEMA_EXTRA_ARGS)
    set(FLATC_SCHEMA_ARGS
      ${FLATBUFFERS_FLATC_SCHEMA_EXTRA_ARGS}
      ${FLATC_SCHEMA_ARGS}
      )
  endif()

  set(working_dir "${CMAKE_CURRENT_SOURCE_DIR}")

  set(schema_glob "*.fbs")
  # Generate the include files parameters.
  set(include_params "")
  set(all_generated_files "")
  foreach (include_dir ${schema_include_dirs})
    set(include_params -I ${include_dir} ${include_params})
    if (NOT ${copy_text_schemas_dir} STREQUAL "")
      # Copy text schemas from dependent folders.
      file(GLOB_RECURSE dependent_schemas ${include_dir}/${schema_glob})
      foreach (dependent_schema ${dependent_schemas})
        file(COPY ${dependent_schema} DESTINATION ${copy_text_schemas_dir})
      endforeach()
    endif()
  endforeach()

  foreach(schema ${flatbuffers_schemas})
    get_filename_component(filename ${schema} NAME_WE)
    # For each schema, do the things we requested.
    if (NOT ${generated_includes_dir} STREQUAL "")
      set(generated_include ${generated_includes_dir}/${filename}_generated.h)
      add_custom_command(
        OUTPUT ${generated_include}
        COMMAND ${FLATC} ${FLATC_SCHEMA_ARGS}
        -o ${generated_includes_dir}
        ${include_params}
        -c ${schema}
        DEPENDS ${FLATC_TARGET} ${schema} ${additional_dependencies}
        WORKING_DIRECTORY "${working_dir}")
      list(APPEND all_generated_files ${generated_include})
    endif()

    if (NOT ${binary_schemas_dir} STREQUAL "")
      set(binary_schema ${binary_schemas_dir}/${filename}.bfbs)
      add_custom_command(
        OUTPUT ${binary_schema}
        COMMAND ${FLATC} -b --schema
        -o ${binary_schemas_dir}
        ${include_params}
        ${schema}
        DEPENDS ${FLATC_TARGET} ${schema} ${additional_dependencies}
        WORKING_DIRECTORY "${working_dir}")
      list(APPEND all_generated_files ${binary_schema})
    endif()

    if (NOT ${copy_text_schemas_dir} STREQUAL "")
      file(COPY ${schema} DESTINATION ${copy_text_schemas_dir})
    endif()
  endforeach()

  # Create a custom target that depends on all the generated files.
  # This is the target that you can depend on to trigger all these
  # to be built.
  add_custom_target(${custom_target_name}
                    DEPENDS ${all_generated_files} ${additional_dependencies})

  # Register the include directory we are using.
  if (NOT ${generated_includes_dir} STREQUAL "")
    include_directories(${generated_includes_dir})
    set_property(
      TARGET ${custom_target_name}
      PROPERTY GENERATED_INCLUDES_DIR
      ${generated_includes_dir})
  endif()

  # Register the binary schemas dir we are using.
  if (NOT ${binary_schemas_dir} STREQUAL "")
    set_property(
      TARGET ${custom_target_name}
      PROPERTY BINARY_SCHEMAS_DIR
      ${binary_schemas_dir})
  endif()

  # Register the text schema copy dir we are using.
  if (NOT ${copy_text_schemas_dir} STREQUAL "")
    set_property(
      TARGET ${custom_target_name}
      PROPERTY COPY_TEXT_SCHEMAS_DIR
      ${copy_text_schemas_dir})
  endif()
endfunction()

# Creates a target that can be linked against that generates flatbuffer headers.
#
# This function takes a target name and a list of schemas.
# You can also specify
# other flatc flags using the FLAGS option to change the behavior of the flatc
# tool.
#
# Arguments:
#   TARGET: The name of the target to generate.
#   SCHEMAS: The list of schema files to generate code for.
#   BINARY_SCHEMAS_DIR: Optional. The directory in which to generate binary
#       schemas. Binary schemas will only be generated if a path is provided.
#   INCLUDE: Optional. Search for includes in the specified paths. (Use this
#       instead of "-I " and the FLAGS option so that CMake is aware of
#       the directories that need to be searched).
#   INCLUDE_PREFIX: Optional. The directory in which to place the generated
#       files. Use this instead of the --include-prefix option.
#   FLAGS: Optional. A list of any additional flags that you would like to pass
#       to flatc.
#
# Example:
#
#     flatbuffers_generate_headers(
#         TARGET my_generated_headers_target
#         INCLUDE_PREFIX "${MY_INCLUDE_PREFIX}"
#         SCHEMAS ${MY_SCHEMA_FILES}
#         BINARY_SCHEMAS_DIR "${MY_BINARY_SCHEMA_DIRECTORY}"
#         FLAGS --gen-object-api)
#
#     target_link_libraries(MyExecutableTarget
#         PRIVATE my_generated_headers_target
#     )
function(flatbuffers_generate_headers)
  # Parse function arguments.
  set(options)
  set(one_value_args
    "TARGET"
    "INCLUDE_PREFIX"
    "BINARY_SCHEMAS_DIR")
  set(multi_value_args
    "SCHEMAS"
    "INCLUDE"
    "FLAGS")
  cmake_parse_arguments(
    PARSE_ARGV 0
    FLATBUFFERS_GENERATE_HEADERS
    "${options}"
    "${one_value_args}"
    "${multi_value_args}")

  # Test if including from FindFlatBuffers
  if(FLATBUFFERS_FLATC_EXECUTABLE)
    set(FLATC_TARGET "")
    set(FLATC ${FLATBUFFERS_FLATC_EXECUTABLE})
  else()
    set(FLATC_TARGET flatc)
    set(FLATC flatc)
  endif()

  set(working_dir "${CMAKE_CURRENT_SOURCE_DIR}")

  # Generate the include files parameters.
  set(include_params "")
  foreach (include_dir ${FLATBUFFERS_GENERATE_HEADERS_INCLUDE})
    set(include_params -I ${include_dir} ${include_params})
  endforeach()

  # Create a directory to place the generated code.
set(generated_target_dir "${CMAKE_CURRENT_BINARY_DIR}/${FLATBUFFERS_GENERATE_HEADERS_TARGET}") set(generated_include_dir "${generated_target_dir}") if (NOT ${FLATBUFFERS_GENERATE_HEADERS_INCLUDE_PREFIX} STREQUAL "") set(generated_include_dir "${generated_include_dir}/${FLATBUFFERS_GENERATE_HEADERS_INCLUDE_PREFIX}") list(APPEND FLATBUFFERS_GENERATE_HEADERS_FLAGS "--include-prefix" ${FLATBUFFERS_GENERATE_HEADERS_INCLUDE_PREFIX}) endif() # Create rules to generate the code for each schema. foreach(schema ${FLATBUFFERS_GENERATE_HEADERS_SCHEMAS}) get_filename_component(filename ${schema} NAME_WE) set(generated_include "${generated_include_dir}/${filename}_generated.h") # Generate files for grpc if needed set(generated_source_file) if("${FLATBUFFERS_GENERATE_HEADERS_FLAGS}" MATCHES "--grpc") # Check if schema file contain a rpc_service definition file(STRINGS ${schema} has_grpc REGEX "rpc_service") if(has_grpc) list(APPEND generated_include "${generated_include_dir}/${filename}.grpc.fb.h") set(generated_source_file "${generated_include_dir}/${filename}.grpc.fb.cc") endif() endif() add_custom_command( OUTPUT ${generated_include} ${generated_source_file} COMMAND ${FLATC} ${FLATC_ARGS} -o ${generated_include_dir} ${include_params} -c ${schema} ${FLATBUFFERS_GENERATE_HEADERS_FLAGS} DEPENDS ${FLATC_TARGET} ${schema} WORKING_DIRECTORY "${working_dir}" COMMENT "Building ${schema} flatbuffers...") list(APPEND all_generated_header_files ${generated_include}) list(APPEND all_generated_source_files ${generated_source_file}) # Generate the binary flatbuffers schemas if instructed to. 
if (NOT ${FLATBUFFERS_GENERATE_HEADERS_BINARY_SCHEMAS_DIR} STREQUAL "") set(binary_schema "${FLATBUFFERS_GENERATE_HEADERS_BINARY_SCHEMAS_DIR}/${filename}.bfbs") add_custom_command( OUTPUT ${binary_schema} COMMAND ${FLATC} -b --schema -o ${FLATBUFFERS_GENERATE_HEADERS_BINARY_SCHEMAS_DIR} ${include_params} ${schema} DEPENDS ${FLATC_TARGET} ${schema} WORKING_DIRECTORY "${working_dir}") list(APPEND all_generated_binary_files ${binary_schema}) endif() endforeach() # Set up interface library add_library(${FLATBUFFERS_GENERATE_HEADERS_TARGET} INTERFACE) target_sources( ${FLATBUFFERS_GENERATE_HEADERS_TARGET} INTERFACE ${all_generated_header_files} ${all_generated_binary_files} ${all_generated_source_files} ${FLATBUFFERS_GENERATE_HEADERS_SCHEMAS}) add_dependencies( ${FLATBUFFERS_GENERATE_HEADERS_TARGET} ${FLATC} ${FLATBUFFERS_GENERATE_HEADERS_SCHEMAS}) target_include_directories( ${FLATBUFFERS_GENERATE_HEADERS_TARGET} INTERFACE ${generated_target_dir}) # Organize file layout for IDEs. source_group( TREE "${generated_target_dir}" PREFIX "Flatbuffers/Generated/Headers Files" FILES ${all_generated_header_files}) source_group( TREE "${generated_target_dir}" PREFIX "Flatbuffers/Generated/Source Files" FILES ${all_generated_source_files}) source_group( TREE ${working_dir} PREFIX "Flatbuffers/Schemas" FILES ${FLATBUFFERS_GENERATE_HEADERS_SCHEMAS}) if (NOT ${FLATBUFFERS_GENERATE_HEADERS_BINARY_SCHEMAS_DIR} STREQUAL "") source_group( TREE "${FLATBUFFERS_GENERATE_HEADERS_BINARY_SCHEMAS_DIR}" PREFIX "Flatbuffers/Generated/Binary Schemas" FILES ${all_generated_binary_files}) endif() endfunction() # Creates a target that can be linked against that generates flatbuffer binaries # from json files. # # This function takes a target name and a list of schemas and Json files. You # can also specify other flatc flags and options to change the behavior of the # flatc compiler.
# # Adding this target to your executable ensures that the flatbuffer binaries # are compiled before your executable is run. # # Arguments: # TARGET: The name of the target to generate. # JSON_FILES: The list of json files to compile to flatbuffers binaries. # SCHEMA: The flatbuffers schema of the Json files to be compiled. # INCLUDE: Optional. Search for includes in the specified paths. (Use this # instead of "-I <path>" and the FLAGS option so that CMake is aware of # the directories that need to be searched). # OUTPUT_DIR: The directory where the generated flatbuffers binaries should be # placed. # FLAGS: Optional. A list of any additional flags that you would like to pass # to flatc. # # Example: # # flatbuffers_generate_binary_files( # TARGET my_binary_data # SCHEMA "${MY_SCHEMA_DIR}/my_example_schema.fbs" # JSON_FILES ${MY_JSON_FILES} # OUTPUT_DIR "${MY_BINARY_DATA_DIRECTORY}" # FLAGS --strict-json) # # target_link_libraries(MyExecutableTarget # PRIVATE my_binary_data # ) function(flatbuffers_generate_binary_files) # Parse function arguments. set(options) set(one_value_args "TARGET" "SCHEMA" "OUTPUT_DIR") set(multi_value_args "JSON_FILES" "INCLUDE" "FLAGS") cmake_parse_arguments( PARSE_ARGV 0 FLATBUFFERS_GENERATE_BINARY_FILES "${options}" "${one_value_args}" "${multi_value_args}") # Test if including from FindFlatBuffers if(FLATBUFFERS_FLATC_EXECUTABLE) set(FLATC_TARGET "") set(FLATC ${FLATBUFFERS_FLATC_EXECUTABLE}) else() set(FLATC_TARGET flatc) set(FLATC flatc) endif() set(working_dir "${CMAKE_CURRENT_SOURCE_DIR}") # Generate the include files parameters. set(include_params "") foreach (include_dir ${FLATBUFFERS_GENERATE_BINARY_FILES_INCLUDE}) set(include_params -I ${include_dir} ${include_params}) endforeach() # Create rules to generate the flatbuffers binary for each json file.
foreach(json_file ${FLATBUFFERS_GENERATE_BINARY_FILES_JSON_FILES}) get_filename_component(filename ${json_file} NAME_WE) set(generated_binary_file "${FLATBUFFERS_GENERATE_BINARY_FILES_OUTPUT_DIR}/${filename}.bin") add_custom_command( OUTPUT ${generated_binary_file} COMMAND ${FLATC} ${FLATC_ARGS} -o ${FLATBUFFERS_GENERATE_BINARY_FILES_OUTPUT_DIR} ${include_params} -b ${FLATBUFFERS_GENERATE_BINARY_FILES_SCHEMA} ${json_file} ${FLATBUFFERS_GENERATE_BINARY_FILES_FLAGS} DEPENDS ${FLATC_TARGET} ${json_file} WORKING_DIRECTORY "${working_dir}" COMMENT "Building ${json_file} binary flatbuffers...") list(APPEND all_generated_binary_files ${generated_binary_file}) endforeach() # Set up interface library add_library(${FLATBUFFERS_GENERATE_BINARY_FILES_TARGET} INTERFACE) target_sources( ${FLATBUFFERS_GENERATE_BINARY_FILES_TARGET} INTERFACE ${all_generated_binary_files} ${FLATBUFFERS_GENERATE_BINARY_FILES_JSON_FILES} ${FLATBUFFERS_GENERATE_BINARY_FILES_SCHEMA}) add_dependencies( ${FLATBUFFERS_GENERATE_BINARY_FILES_TARGET} ${FLATC}) # Organize file layout for IDEs. 
source_group( TREE ${working_dir} PREFIX "Flatbuffers/JSON Files" FILES ${FLATBUFFERS_GENERATE_BINARY_FILES_JSON_FILES}) source_group( TREE ${working_dir} PREFIX "Flatbuffers/Schemas" FILES ${FLATBUFFERS_GENERATE_BINARY_FILES_SCHEMA}) source_group( TREE ${FLATBUFFERS_GENERATE_BINARY_FILES_OUTPUT_DIR} PREFIX "Flatbuffers/Generated/Binary Files" FILES ${all_generated_binary_files}) endfunction() ================================================ FILE: cmake/Findzstd.cmake ================================================ find_path(ZSTD_INCLUDE_DIR NAMES zstd.h PATHS ${CONAN_INCLUDE_DIRS_RELEASE} ${CONAN_INCLUDE_DIRS_DEBUG} ) set(ZSTD_NAMES zstd zstd_static) set(ZSTD_NAMES_DEBUG zstdd zstd_staticd) find_library(ZSTD_LIBRARY_RELEASE NAMES ${ZSTD_NAMES} PATHS ${CONAN_LIB_DIRS_RELEASE} ) find_library(ZSTD_LIBRARY_DEBUG NAMES ${ZSTD_NAMES_DEBUG} ${ZSTD_NAMES} PATHS ${CONAN_LIB_DIRS_DEBUG} ) include(SelectLibraryConfigurations) select_library_configurations(ZSTD) if(ZSTD_INCLUDE_DIR AND EXISTS "${ZSTD_INCLUDE_DIR}/zstd.h") file(STRINGS "${ZSTD_INCLUDE_DIR}/zstd.h" ZSTD_VERSION_MAJOR_LINE REGEX "^#define ZSTD_VERSION_MAJOR.*$") file(STRINGS "${ZSTD_INCLUDE_DIR}/zstd.h" ZSTD_VERSION_MINOR_LINE REGEX "^#define ZSTD_VERSION_MINOR.*$") file(STRINGS "${ZSTD_INCLUDE_DIR}/zstd.h" ZSTD_VERSION_RELEASE_LINE REGEX "^#define ZSTD_VERSION_RELEASE.*$") string(REGEX REPLACE "^.*ZSTD_VERSION_MAJOR *([0-9]+)$" "\\1" ZSTD_VERSION_MAJOR "${ZSTD_VERSION_MAJOR_LINE}") string(REGEX REPLACE "^.*ZSTD_VERSION_MINOR *([0-9]+)$" "\\1" ZSTD_VERSION_MINOR "${ZSTD_VERSION_MINOR_LINE}") string(REGEX REPLACE "^.*ZSTD_VERSION_RELEASE *([0-9]+)$" "\\1" ZSTD_VERSION_RELEASE "${ZSTD_VERSION_RELEASE_LINE}") set(ZSTD_VERSION_STRING "${ZSTD_VERSION_MAJOR}.${ZSTD_VERSION_MINOR}.${ZSTD_VERSION_RELEASE}") endif() # handle the QUIETLY and REQUIRED arguments and set ZSTD_FOUND to TRUE if # all listed variables are TRUE include(FindPackageHandleStandardArgs) find_package_handle_standard_args(zstd REQUIRED_VARS
ZSTD_LIBRARY ZSTD_INCLUDE_DIR VERSION_VAR ZSTD_VERSION_STRING) if (ZSTD_FOUND) set(ZSTD_INCLUDE_DIRS ${ZSTD_INCLUDE_DIR}) if (NOT ZSTD_LIBRARIES) set(ZSTD_LIBRARIES ${ZSTD_LIBRARY}) endif() if (NOT TARGET zstd::zstd) add_library(zstd::zstd UNKNOWN IMPORTED) set_target_properties(zstd::zstd PROPERTIES INTERFACE_INCLUDE_DIRECTORIES "${ZSTD_INCLUDE_DIRS}") if(ZSTD_LIBRARY_RELEASE) set_property(TARGET zstd::zstd APPEND PROPERTY IMPORTED_CONFIGURATIONS RELEASE) set_target_properties(zstd::zstd PROPERTIES IMPORTED_LOCATION_RELEASE "${ZSTD_LIBRARY_RELEASE}") endif() if(ZSTD_LIBRARY_DEBUG) set_property(TARGET zstd::zstd APPEND PROPERTY IMPORTED_CONFIGURATIONS DEBUG) set_target_properties(zstd::zstd PROPERTIES IMPORTED_LOCATION_DEBUG "${ZSTD_LIBRARY_DEBUG}") endif() if(NOT ZSTD_LIBRARY_RELEASE AND NOT ZSTD_LIBRARY_DEBUG) set_property(TARGET zstd::zstd APPEND PROPERTY IMPORTED_LOCATION "${ZSTD_LIBRARY}") endif() endif() endif() ================================================ FILE: cmake/conan_provider.cmake ================================================ # The MIT License (MIT) # # Copyright (c) 2024 JFrog # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal # in the Software without restriction, including without limitation the rights # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # # The above copyright notice and this permission notice shall be included in all # copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE # SOFTWARE. set(CONAN_MINIMUM_VERSION 2.0.5) # Create a new policy scope and set the minimum required cmake version so the # features behind a policy setting like if(... IN_LIST ...) behave as expected # even if the parent project does not specify a minimum cmake version or a minimum # version less than this module requires (e.g. 3.0) before the first project() call. # (see: https://cmake.org/cmake/help/latest/variable/CMAKE_PROJECT_TOP_LEVEL_INCLUDES.html) # # The policy-affecting calls like cmake_policy(SET...) or `cmake_minimum_required` only # affect the current policy scope, i.e. between the PUSH and POP in this case. # # https://cmake.org/cmake/help/book/mastering-cmake/chapter/Policies.html#the-policy-stack cmake_policy(PUSH) cmake_minimum_required(VERSION 3.24) function(detect_os os os_api_level os_sdk os_subsystem os_version) # it could be cross compilation message(STATUS "CMake-Conan: cmake_system_name=${CMAKE_SYSTEM_NAME}") if(CMAKE_SYSTEM_NAME AND NOT CMAKE_SYSTEM_NAME STREQUAL "Generic") if(CMAKE_SYSTEM_NAME STREQUAL "Darwin") set(${os} Macos PARENT_SCOPE) elseif(CMAKE_SYSTEM_NAME STREQUAL "QNX") set(${os} Neutrino PARENT_SCOPE) elseif(CMAKE_SYSTEM_NAME STREQUAL "CYGWIN") set(${os} Windows PARENT_SCOPE) set(${os_subsystem} cygwin PARENT_SCOPE) elseif(CMAKE_SYSTEM_NAME MATCHES "^MSYS") set(${os} Windows PARENT_SCOPE) set(${os_subsystem} msys2 PARENT_SCOPE) elseif(CMAKE_SYSTEM_NAME STREQUAL "Emscripten") # https://github.com/emscripten-core/emscripten/blob/4.0.6/cmake/Modules/Platform/Emscripten.cmake#L17C1-L17C34 set(${os} Emscripten PARENT_SCOPE) else() set(${os} ${CMAKE_SYSTEM_NAME} PARENT_SCOPE) endif() if(CMAKE_SYSTEM_NAME STREQUAL "Android") if(DEFINED ANDROID_PLATFORM) string(REGEX MATCH
"[0-9]+" _os_api_level ${ANDROID_PLATFORM}) elseif(DEFINED CMAKE_SYSTEM_VERSION) set(_os_api_level ${CMAKE_SYSTEM_VERSION}) endif() message(STATUS "CMake-Conan: android api level=${_os_api_level}") set(${os_api_level} ${_os_api_level} PARENT_SCOPE) endif() if(CMAKE_SYSTEM_NAME MATCHES "Darwin|iOS|tvOS|watchOS") # CMAKE_OSX_SYSROOT contains the full path to the SDK for MakeFile/Ninja # generators, but just has the original input string for Xcode. if(NOT IS_DIRECTORY ${CMAKE_OSX_SYSROOT}) set(_os_sdk ${CMAKE_OSX_SYSROOT}) else() if(CMAKE_OSX_SYSROOT MATCHES Simulator) set(apple_platform_suffix simulator) else() set(apple_platform_suffix os) endif() if(CMAKE_OSX_SYSROOT MATCHES AppleTV) set(_os_sdk "appletv${apple_platform_suffix}") elseif(CMAKE_OSX_SYSROOT MATCHES iPhone) set(_os_sdk "iphone${apple_platform_suffix}") elseif(CMAKE_OSX_SYSROOT MATCHES Watch) set(_os_sdk "watch${apple_platform_suffix}") endif() endif() if(DEFINED os_sdk) message(STATUS "CMake-Conan: cmake_osx_sysroot=${CMAKE_OSX_SYSROOT}") set(${os_sdk} ${_os_sdk} PARENT_SCOPE) endif() if(DEFINED CMAKE_OSX_DEPLOYMENT_TARGET) message(STATUS "CMake-Conan: cmake_osx_deployment_target=${CMAKE_OSX_DEPLOYMENT_TARGET}") set(${os_version} ${CMAKE_OSX_DEPLOYMENT_TARGET} PARENT_SCOPE) endif() endif() endif() endfunction() function(detect_arch arch) # CMAKE_OSX_ARCHITECTURES can contain multiple architectures, but Conan only supports one. # Therefore this code only finds one. If the recipes support multiple architectures, the # build will work. Otherwise, there will be a linker error for the missing architecture(s). 
if(DEFINED CMAKE_OSX_ARCHITECTURES) string(REPLACE " " ";" apple_arch_list "${CMAKE_OSX_ARCHITECTURES}") list(LENGTH apple_arch_list apple_arch_count) if(apple_arch_count GREATER 1) message(WARNING "CMake-Conan: Multiple architectures detected, this will only work if Conan recipe(s) produce fat binaries.") endif() endif() if(CMAKE_SYSTEM_NAME MATCHES "Darwin|iOS|tvOS|watchOS" AND NOT CMAKE_OSX_ARCHITECTURES STREQUAL "") set(host_arch ${CMAKE_OSX_ARCHITECTURES}) elseif(MSVC) set(host_arch ${CMAKE_CXX_COMPILER_ARCHITECTURE_ID}) else() set(host_arch ${CMAKE_SYSTEM_PROCESSOR}) endif() if(host_arch MATCHES "aarch64|arm64|ARM64") set(_arch armv8) elseif(host_arch MATCHES "armv7|armv7-a|armv7l|ARMV7") set(_arch armv7) elseif(host_arch MATCHES armv7s) set(_arch armv7s) elseif(host_arch MATCHES "i686|i386|X86") set(_arch x86) elseif(host_arch MATCHES "AMD64|amd64|x86_64|x64") set(_arch x86_64) endif() if(EMSCRIPTEN) # https://github.com/emscripten-core/emscripten/blob/4.0.6/cmake/Modules/Platform/Emscripten.cmake#L294C1-L294C80 set(_arch wasm) endif() message(STATUS "CMake-Conan: cmake_system_processor=${_arch}") set(${arch} ${_arch} PARENT_SCOPE) endfunction() function(detect_cxx_standard cxx_standard) set(${cxx_standard} ${CMAKE_CXX_STANDARD} PARENT_SCOPE) if(CMAKE_CXX_EXTENSIONS) set(${cxx_standard} "gnu${CMAKE_CXX_STANDARD}" PARENT_SCOPE) endif() endfunction() macro(detect_gnu_libstdcxx) # _conan_is_gnu_libstdcxx true if GNU libstdc++ check_cxx_source_compiles(" #include <cstddef> #if !defined(__GLIBCXX__) && !defined(__GLIBCPP__) static_assert(false); #endif int main(){}" _conan_is_gnu_libstdcxx) # _conan_gnu_libstdcxx_is_cxx11_abi true if C++11 ABI check_cxx_source_compiles(" #include <string> static_assert(sizeof(std::string) != sizeof(void*), \"using libstdc++\"); int main () {}" _conan_gnu_libstdcxx_is_cxx11_abi) set(_conan_gnu_libstdcxx_suffix "") if(_conan_gnu_libstdcxx_is_cxx11_abi) set(_conan_gnu_libstdcxx_suffix "11") endif() unset (_conan_gnu_libstdcxx_is_cxx11_abi) endmacro()
macro(detect_libcxx) # _conan_is_libcxx true if LLVM libc++ check_cxx_source_compiles(" #include <cstddef> #if !defined(_LIBCPP_VERSION) static_assert(false); #endif int main(){}" _conan_is_libcxx) endmacro() function(detect_lib_cxx lib_cxx) if(CMAKE_SYSTEM_NAME STREQUAL "Android") message(STATUS "CMake-Conan: android_stl=${CMAKE_ANDROID_STL_TYPE}") set(${lib_cxx} ${CMAKE_ANDROID_STL_TYPE} PARENT_SCOPE) return() endif() include(CheckCXXSourceCompiles) if(CMAKE_CXX_COMPILER_ID MATCHES "GNU") detect_gnu_libstdcxx() set(${lib_cxx} "libstdc++${_conan_gnu_libstdcxx_suffix}" PARENT_SCOPE) elseif(CMAKE_CXX_COMPILER_ID MATCHES "AppleClang") set(${lib_cxx} "libc++" PARENT_SCOPE) elseif(CMAKE_CXX_COMPILER_ID MATCHES "Clang" AND NOT CMAKE_SYSTEM_NAME MATCHES "Windows") # Check for libc++ detect_libcxx() if(_conan_is_libcxx) set(${lib_cxx} "libc++" PARENT_SCOPE) return() endif() # Check for libstdc++ detect_gnu_libstdcxx() if(_conan_is_gnu_libstdcxx) set(${lib_cxx} "libstdc++${_conan_gnu_libstdcxx_suffix}" PARENT_SCOPE) return() endif() # TODO: it would be an error if we reach this point elseif(CMAKE_CXX_COMPILER_ID MATCHES "MSVC") # Do nothing - compiler.runtime and compiler.runtime_type # should be handled separately: https://github.com/conan-io/cmake-conan/pull/516 return() else() # TODO: unable to determine, ask user to provide a full profile file instead endif() endfunction() function(detect_compiler compiler compiler_version compiler_runtime compiler_runtime_type) if(DEFINED CMAKE_CXX_COMPILER_ID) set(_compiler ${CMAKE_CXX_COMPILER_ID}) set(_compiler_version ${CMAKE_CXX_COMPILER_VERSION}) else() if(NOT DEFINED CMAKE_C_COMPILER_ID) message(FATAL_ERROR "C or C++ compiler not defined") endif() set(_compiler ${CMAKE_C_COMPILER_ID}) set(_compiler_version ${CMAKE_C_COMPILER_VERSION}) endif() message(STATUS "CMake-Conan: CMake compiler=${_compiler}") message(STATUS "CMake-Conan: CMake compiler version=${_compiler_version}") if(_compiler MATCHES MSVC) set(_compiler "msvc")
string(SUBSTRING ${MSVC_VERSION} 0 3 _compiler_version) # Configure compiler.runtime and compiler.runtime_type settings for MSVC if(CMAKE_MSVC_RUNTIME_LIBRARY) set(_msvc_runtime_library ${CMAKE_MSVC_RUNTIME_LIBRARY}) else() set(_msvc_runtime_library MultiThreaded$<$<CONFIG:Debug>:Debug>DLL) # default value documented by CMake endif() set(_KNOWN_MSVC_RUNTIME_VALUES "") list(APPEND _KNOWN_MSVC_RUNTIME_VALUES MultiThreaded MultiThreadedDLL) list(APPEND _KNOWN_MSVC_RUNTIME_VALUES MultiThreadedDebug MultiThreadedDebugDLL) list(APPEND _KNOWN_MSVC_RUNTIME_VALUES MultiThreaded$<$<CONFIG:Debug>:Debug> MultiThreaded$<$<CONFIG:Debug>:Debug>DLL) # only accept the 6 possible values, otherwise we don't know how to map this if(NOT _msvc_runtime_library IN_LIST _KNOWN_MSVC_RUNTIME_VALUES) message(FATAL_ERROR "CMake-Conan: unable to map MSVC runtime: ${_msvc_runtime_library} to Conan settings") endif() # Runtime is "dynamic" in all cases if it ends in DLL if(_msvc_runtime_library MATCHES ".*DLL$") set(_compiler_runtime "dynamic") else() set(_compiler_runtime "static") endif() message(STATUS "CMake-Conan: CMake compiler.runtime=${_compiler_runtime}") # Only define compiler.runtime_type when explicitly requested # If a generator expression is used, let Conan handle it conditional on build_type if(NOT _msvc_runtime_library MATCHES ":Debug>") if(_msvc_runtime_library MATCHES "Debug") set(_compiler_runtime_type "Debug") else() set(_compiler_runtime_type "Release") endif() message(STATUS "CMake-Conan: CMake compiler.runtime_type=${_compiler_runtime_type}") endif() unset(_KNOWN_MSVC_RUNTIME_VALUES) elseif(_compiler MATCHES AppleClang) set(_compiler "apple-clang") string(REPLACE "." ";" VERSION_LIST ${_compiler_version}) list(GET VERSION_LIST 0 _compiler_version) elseif(_compiler MATCHES Clang) set(_compiler "clang") string(REPLACE "." ";" VERSION_LIST ${_compiler_version}) list(GET VERSION_LIST 0 _compiler_version) elseif(_compiler MATCHES GNU) set(_compiler "gcc") string(REPLACE "."
";" VERSION_LIST ${_compiler_version}) list(GET VERSION_LIST 0 _compiler_version) endif() message(STATUS "CMake-Conan: [settings] compiler=${_compiler}") message(STATUS "CMake-Conan: [settings] compiler.version=${_compiler_version}") if (_compiler_runtime) message(STATUS "CMake-Conan: [settings] compiler.runtime=${_compiler_runtime}") endif() if (_compiler_runtime_type) message(STATUS "CMake-Conan: [settings] compiler.runtime_type=${_compiler_runtime_type}") endif() set(${compiler} ${_compiler} PARENT_SCOPE) set(${compiler_version} ${_compiler_version} PARENT_SCOPE) set(${compiler_runtime} ${_compiler_runtime} PARENT_SCOPE) set(${compiler_runtime_type} ${_compiler_runtime_type} PARENT_SCOPE) endfunction() function(detect_build_type build_type) get_property(multiconfig_generator GLOBAL PROPERTY GENERATOR_IS_MULTI_CONFIG) if(NOT multiconfig_generator) # Only set when we know we are in a single-configuration generator # Note: we may want to fail early if `CMAKE_BUILD_TYPE` is not defined set(${build_type} ${CMAKE_BUILD_TYPE} PARENT_SCOPE) endif() endfunction() macro(set_conan_compiler_if_appleclang lang command output_variable) if(CMAKE_${lang}_COMPILER_ID STREQUAL "AppleClang") execute_process(COMMAND xcrun --find ${command} OUTPUT_VARIABLE _xcrun_out OUTPUT_STRIP_TRAILING_WHITESPACE) cmake_path(GET _xcrun_out PARENT_PATH _xcrun_toolchain_path) cmake_path(GET CMAKE_${lang}_COMPILER PARENT_PATH _compiler_parent_path) if ("${_xcrun_toolchain_path}" STREQUAL "${_compiler_parent_path}") set(${output_variable} "") endif() unset(_xcrun_out) unset(_xcrun_toolchain_path) unset(_compiler_parent_path) endif() endmacro() macro(append_compiler_executables_configuration) set(_conan_c_compiler "") set(_conan_cpp_compiler "") set(_conan_rc_compiler "") set(_conan_compilers_list "") if(CMAKE_C_COMPILER) set(_conan_c_compiler "\"c\":\"${CMAKE_C_COMPILER}\"") set_conan_compiler_if_appleclang(C cc _conan_c_compiler) list(APPEND _conan_compilers_list ${_conan_c_compiler}) else() 
message(WARNING "CMake-Conan: The C compiler is not defined. " "Please define CMAKE_C_COMPILER or enable the C language.") endif() if(CMAKE_CXX_COMPILER) set(_conan_cpp_compiler "\"cpp\":\"${CMAKE_CXX_COMPILER}\"") set_conan_compiler_if_appleclang(CXX c++ _conan_cpp_compiler) list(APPEND _conan_compilers_list ${_conan_cpp_compiler}) else() message(WARNING "CMake-Conan: The C++ compiler is not defined. " "Please define CMAKE_CXX_COMPILER or enable the C++ language.") endif() if(CMAKE_RC_COMPILER) set(_conan_rc_compiler "\"rc\":\"${CMAKE_RC_COMPILER}\"") list(APPEND _conan_compilers_list ${_conan_rc_compiler}) # Not necessary to warn if RC not defined endif() if(NOT "x${_conan_compilers_list}" STREQUAL "x") string(REPLACE ";" "," _conan_compilers_list "${_conan_compilers_list}") string(APPEND profile "tools.build:compiler_executables={${_conan_compilers_list}}\n") endif() unset(_conan_c_compiler) unset(_conan_cpp_compiler) unset(_conan_rc_compiler) unset(_conan_compilers_list) endmacro() function(detect_host_profile output_file) detect_os(os os_api_level os_sdk os_subsystem os_version) detect_arch(arch) detect_compiler(compiler compiler_version compiler_runtime compiler_runtime_type) detect_cxx_standard(compiler_cppstd) detect_lib_cxx(compiler_libcxx) detect_build_type(build_type) set(profile "") string(APPEND profile "[settings]\n") if(arch) string(APPEND profile arch=${arch} "\n") endif() if(os) string(APPEND profile os=${os} "\n") endif() if(os_api_level) string(APPEND profile os.api_level=${os_api_level} "\n") endif() if(os_version) string(APPEND profile os.version=${os_version} "\n") endif() if(os_sdk) string(APPEND profile os.sdk=${os_sdk} "\n") endif() if(os_subsystem) string(APPEND profile os.subsystem=${os_subsystem} "\n") endif() if(compiler) string(APPEND profile compiler=${compiler} "\n") endif() if(compiler_version) string(APPEND profile compiler.version=${compiler_version} "\n") endif() if(compiler_runtime) string(APPEND profile 
compiler.runtime=${compiler_runtime} "\n") endif() if(compiler_runtime_type) string(APPEND profile compiler.runtime_type=${compiler_runtime_type} "\n") endif() if(compiler_cppstd) string(APPEND profile compiler.cppstd=${compiler_cppstd} "\n") endif() if(compiler_libcxx) string(APPEND profile compiler.libcxx=${compiler_libcxx} "\n") endif() if(build_type) string(APPEND profile "build_type=${build_type}\n") endif() if(NOT DEFINED output_file) set(file_name "${CMAKE_BINARY_DIR}/profile") else() set(file_name ${output_file}) endif() string(APPEND profile "[conf]\n") string(APPEND profile "tools.cmake.cmaketoolchain:generator=${CMAKE_GENERATOR}\n") # propagate compilers via profile append_compiler_executables_configuration() if(os STREQUAL "Android") string(APPEND profile "tools.android:ndk_path=${CMAKE_ANDROID_NDK}\n") endif() message(STATUS "CMake-Conan: Creating profile ${file_name}") file(WRITE ${file_name} ${profile}) message(STATUS "CMake-Conan: Profile: \n${profile}") endfunction() function(conan_profile_detect_default) message(STATUS "CMake-Conan: Checking if a default profile exists") execute_process(COMMAND ${CONAN_COMMAND} profile path default RESULT_VARIABLE return_code OUTPUT_VARIABLE conan_stdout ERROR_VARIABLE conan_stderr ECHO_ERROR_VARIABLE # show the text output regardless ECHO_OUTPUT_VARIABLE WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) if(NOT ${return_code} EQUAL "0") message(STATUS "CMake-Conan: The default profile doesn't exist, detecting it.") execute_process(COMMAND ${CONAN_COMMAND} profile detect RESULT_VARIABLE return_code OUTPUT_VARIABLE conan_stdout ERROR_VARIABLE conan_stderr ECHO_ERROR_VARIABLE # show the text output regardless ECHO_OUTPUT_VARIABLE WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) endif() endfunction() function(conan_install) cmake_parse_arguments(ARGS conan_args ${ARGN}) set(conan_output_folder ${CMAKE_BINARY_DIR}/conan) # Invoke "conan install" with the provided arguments set(conan_args ${conan_args} 
-of=${conan_output_folder}) message(STATUS "CMake-Conan: conan install ${CMAKE_SOURCE_DIR} ${conan_args} ${ARGN}") # In case there was not a valid cmake executable in the PATH, we inject the # same we used to invoke the provider to the PATH if(DEFINED PATH_TO_CMAKE_BIN) set(old_path $ENV{PATH}) set(ENV{PATH} "$ENV{PATH}:${PATH_TO_CMAKE_BIN}") endif() execute_process(COMMAND ${CONAN_COMMAND} install ${CMAKE_SOURCE_DIR} ${conan_args} ${ARGN} --format=json RESULT_VARIABLE return_code OUTPUT_VARIABLE conan_stdout ERROR_VARIABLE conan_stderr ECHO_ERROR_VARIABLE # show the text output regardless WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) if(DEFINED PATH_TO_CMAKE_BIN) set(ENV{PATH} "${old_path}") endif() if(NOT "${return_code}" STREQUAL "0") message(FATAL_ERROR "Conan install failed='${return_code}'") endif() # the files are generated in a folder that depends on the layout used, if # one is specified, but we don't know a priori where this is. # TODO: this can be made more robust if Conan can provide this in the json output string(JSON conan_generators_folder GET "${conan_stdout}" graph nodes 0 generators_folder) cmake_path(CONVERT ${conan_generators_folder} TO_CMAKE_PATH_LIST conan_generators_folder) message(STATUS "CMake-Conan: CONAN_GENERATORS_FOLDER=${conan_generators_folder}") set_property(GLOBAL PROPERTY CONAN_GENERATORS_FOLDER "${conan_generators_folder}") # reconfigure on conanfile changes string(JSON conanfile GET "${conan_stdout}" graph nodes 0 label) message(STATUS "CMake-Conan: CONANFILE=${CMAKE_SOURCE_DIR}/${conanfile}") set_property(DIRECTORY ${CMAKE_SOURCE_DIR} APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS "${CMAKE_SOURCE_DIR}/${conanfile}") # success set_property(GLOBAL PROPERTY CONAN_INSTALL_SUCCESS TRUE) endfunction() function(conan_get_version conan_command conan_current_version) execute_process( COMMAND ${conan_command} --version OUTPUT_VARIABLE conan_output RESULT_VARIABLE conan_result OUTPUT_STRIP_TRAILING_WHITESPACE ) if(conan_result) 
message(FATAL_ERROR "CMake-Conan: Error when trying to run Conan") endif() string(REGEX MATCH "[0-9]+\\.[0-9]+\\.[0-9]+" conan_version ${conan_output}) set(${conan_current_version} ${conan_version} PARENT_SCOPE) endfunction() function(conan_version_check) set(options ) set(one_value_args MINIMUM CURRENT) set(multi_value_args ) cmake_parse_arguments(conan_version_check "${options}" "${one_value_args}" "${multi_value_args}" ${ARGN}) if(NOT conan_version_check_MINIMUM) message(FATAL_ERROR "CMake-Conan: Required parameter MINIMUM not set!") endif() if(NOT conan_version_check_CURRENT) message(FATAL_ERROR "CMake-Conan: Required parameter CURRENT not set!") endif() if(conan_version_check_CURRENT VERSION_LESS conan_version_check_MINIMUM) message(FATAL_ERROR "CMake-Conan: Conan version must be ${conan_version_check_MINIMUM} or later") endif() endfunction() macro(construct_profile_argument argument_variable profile_list) set(${argument_variable} "") if("${profile_list}" STREQUAL "CONAN_HOST_PROFILE") set(_arg_flag "--profile:host=") elseif("${profile_list}" STREQUAL "CONAN_BUILD_PROFILE") set(_arg_flag "--profile:build=") endif() set(_profile_list "${${profile_list}}") list(TRANSFORM _profile_list REPLACE "auto-cmake" "${CMAKE_BINARY_DIR}/conan_host_profile") list(TRANSFORM _profile_list PREPEND ${_arg_flag}) set(${argument_variable} ${_profile_list}) unset(_arg_flag) unset(_profile_list) endmacro() macro(conan_provide_dependency method package_name) set_property(GLOBAL PROPERTY CONAN_PROVIDE_DEPENDENCY_INVOKED TRUE) get_property(_conan_install_success GLOBAL PROPERTY CONAN_INSTALL_SUCCESS) if(NOT _conan_install_success) find_program(CONAN_COMMAND "conan" REQUIRED) conan_get_version(${CONAN_COMMAND} CONAN_CURRENT_VERSION) conan_version_check(MINIMUM ${CONAN_MINIMUM_VERSION} CURRENT ${CONAN_CURRENT_VERSION}) message(STATUS "CMake-Conan: first find_package() found. 
Installing dependencies with Conan") if("default" IN_LIST CONAN_HOST_PROFILE OR "default" IN_LIST CONAN_BUILD_PROFILE) conan_profile_detect_default() endif() if("auto-cmake" IN_LIST CONAN_HOST_PROFILE) detect_host_profile(${CMAKE_BINARY_DIR}/conan_host_profile) endif() construct_profile_argument(_host_profile_flags CONAN_HOST_PROFILE) construct_profile_argument(_build_profile_flags CONAN_BUILD_PROFILE) if(EXISTS "${CMAKE_SOURCE_DIR}/conanfile.py") file(READ "${CMAKE_SOURCE_DIR}/conanfile.py" outfile) if(NOT "${outfile}" MATCHES ".*CMakeDeps.*") message(WARNING "Cmake-conan: CMakeDeps generator was not defined in the conanfile") endif() set(generator "") elseif (EXISTS "${CMAKE_SOURCE_DIR}/conanfile.txt") file(READ "${CMAKE_SOURCE_DIR}/conanfile.txt" outfile) if(NOT "${outfile}" MATCHES ".*CMakeDeps.*") message(WARNING "Cmake-conan: CMakeDeps generator was not defined in the conanfile. " "Please define the generator as it will be mandatory in the future") endif() set(generator "-g;CMakeDeps") endif() get_property(_multiconfig_generator GLOBAL PROPERTY GENERATOR_IS_MULTI_CONFIG) if(NOT _multiconfig_generator) message(STATUS "CMake-Conan: Installing single configuration ${CMAKE_BUILD_TYPE}") conan_install(${_host_profile_flags} ${_build_profile_flags} ${CONAN_INSTALL_ARGS} ${generator}) else() message(STATUS "CMake-Conan: Installing both Debug and Release") conan_install(${_host_profile_flags} ${_build_profile_flags} -s build_type=Release ${CONAN_INSTALL_ARGS} ${generator}) conan_install(${_host_profile_flags} ${_build_profile_flags} -s build_type=Debug ${CONAN_INSTALL_ARGS} ${generator}) endif() unset(_host_profile_flags) unset(_build_profile_flags) unset(_multiconfig_generator) unset(_conan_install_success) else() message(STATUS "CMake-Conan: find_package(${ARGV1}) found, 'conan install' already ran") unset(_conan_install_success) endif() get_property(_conan_generators_folder GLOBAL PROPERTY CONAN_GENERATORS_FOLDER) # Ensure that we consider Conan-provided packages 
ahead of any other, # irrespective of other settings that modify the search order or search paths # This follows the guidelines from the find_package documentation # (https://cmake.org/cmake/help/latest/command/find_package.html): # find_package (<PackageName> PATHS paths... NO_DEFAULT_PATH) # find_package (<PackageName>) # Filter out `REQUIRED` from the argument list, as the first call may fail set(_find_args_${package_name} "${ARGN}") list(REMOVE_ITEM _find_args_${package_name} "REQUIRED") if(NOT "MODULE" IN_LIST _find_args_${package_name}) find_package(${package_name} ${_find_args_${package_name}} BYPASS_PROVIDER PATHS "${_conan_generators_folder}" NO_DEFAULT_PATH NO_CMAKE_FIND_ROOT_PATH) unset(_find_args_${package_name}) endif() # Invoke find_package a second time - if the first call succeeded, # this will simply reuse the result. If not, fall back to CMake default search # behaviour, also allowing modules to be searched. if(NOT ${package_name}_FOUND) list(FIND CMAKE_MODULE_PATH "${_conan_generators_folder}" _index) if(_index EQUAL -1) list(PREPEND CMAKE_MODULE_PATH "${_conan_generators_folder}") endif() unset(_index) find_package(${package_name} ${ARGN} BYPASS_PROVIDER) list(REMOVE_ITEM CMAKE_MODULE_PATH "${_conan_generators_folder}") endif() endmacro() cmake_language( SET_DEPENDENCY_PROVIDER conan_provide_dependency SUPPORTED_METHODS FIND_PACKAGE ) macro(conan_provide_dependency_check) set(_conan_provide_dependency_invoked FALSE) get_property(_conan_provide_dependency_invoked GLOBAL PROPERTY CONAN_PROVIDE_DEPENDENCY_INVOKED) if(NOT _conan_provide_dependency_invoked) message(WARNING "Conan is correctly configured as dependency provider, " "but Conan has not been invoked. Please add at least one " "call to `find_package()`.") if(DEFINED CONAN_COMMAND) # suppress warning in case `CONAN_COMMAND` was specified but unused.
set(_conan_command ${CONAN_COMMAND}) unset(_conan_command) endif() endif() unset(_conan_provide_dependency_invoked) endmacro() # Add a deferred call at the end of processing the top-level directory # to check if the dependency provider was invoked at all. cmake_language(DEFER DIRECTORY "${CMAKE_SOURCE_DIR}" CALL conan_provide_dependency_check) # Configurable variables for Conan profiles set(CONAN_HOST_PROFILE "default;auto-cmake" CACHE STRING "Conan host profile") set(CONAN_BUILD_PROFILE "default" CACHE STRING "Conan build profile") set(CONAN_INSTALL_ARGS "--build=missing" CACHE STRING "Command line arguments for conan install") find_program(_cmake_program NAMES cmake NO_PACKAGE_ROOT_PATH NO_CMAKE_PATH NO_CMAKE_ENVIRONMENT_PATH NO_CMAKE_SYSTEM_PATH NO_CMAKE_FIND_ROOT_PATH) if(NOT _cmake_program) get_filename_component(PATH_TO_CMAKE_BIN "${CMAKE_COMMAND}" DIRECTORY) set(PATH_TO_CMAKE_BIN "${PATH_TO_CMAKE_BIN}" CACHE INTERNAL "Path where the CMake executable is") endif() cmake_policy(POP) ================================================ FILE: cmake/pod5_fuzz.cmake ================================================ if (NOT CMAKE_CXX_COMPILER_ID MATCHES "Clang") message(FATAL_ERROR "Only LLVM based compilers are supported for fuzzing. Assuming that " "'clang' is installed, it can be picked by setting the environment " "variables 'CC=clang' and 'CXX=clang++' before invoking cmake." 
) endif() # Build everything with fuzzing instrumentation and sanitizers set(POD5_SANITIZER_FLAGS -fsanitize=address,undefined,fuzzer-no-link) add_compile_options(-g ${POD5_SANITIZER_FLAGS} -UNDEBUG -O1) add_link_options(${POD5_SANITIZER_FLAGS}) ================================================ FILE: cmake/pod5_packaging.cmake ================================================ set(CPACK_PACKAGE_NAME "lib_pod5") set(CPACK_PACKAGE_VENDOR "Oxford Nanopore") set(CPACK_VERBATIM_VARIABLES true) set(CPACK_PACKAGE_VERSION_MAJOR ${POD5_VERSION_MAJOR}) set(CPACK_PACKAGE_VERSION_MINOR ${POD5_VERSION_MINOR}) set(CPACK_PACKAGE_VERSION_PATCH ${POD5_VERSION_REV}) set(CPACK_RESOURCE_FILE_LICENSE "${CMAKE_SOURCE_DIR}/LICENSE.md") include(CPack) ================================================ FILE: cmake/presets/conan-build-options.json ================================================ { "version": 4, "configurePresets": [ { "name": "conan2-debug", "binaryDir": "${sourceDir}/build", "hidden": true, "cacheVariables": { "CMAKE_BUILD_TYPE": "Debug" }, "environment": { "CONAN_PROFILE_BUILD_TYPE": "Debug" } }, { "name": "conan2-release", "binaryDir": "${sourceDir}/build", "hidden": true, "cacheVariables": { "CMAKE_BUILD_TYPE": "Release" }, "environment": { "CONAN_PROFILE_BUILD_TYPE": "Release" } }, { "name": "conan2-cppstd20", "hidden": true, "cacheVariables": { "CMAKE_CXX_STANDARD": "20" }, "environment": { "CONAN_PROFILE_CPPSTD": "20" } }, { "name": "conan2-cppstd17", "hidden": true, "cacheVariables": { "CMAKE_CXX_STANDARD": "17" }, "environment": { "CONAN_PROFILE_CPPSTD": "17" } } ] } ================================================ FILE: cmake/presets/conan-profiles.json ================================================ { "version": 4, "configurePresets": [ { "name": "conan2-gcc9-x86_64-profile", "hidden": true, "cacheVariables": { "CONAN_HOST_PROFILE": "linux-x86_64-gcc9.jinja", "CONAN_BUILD_PROFILE": "linux-x86_64-gcc9.jinja" } }, { "name": "conan2-gcc11-x86_64-profile", "hidden": 
true, "cacheVariables": { "CONAN_HOST_PROFILE": "linux-x86_64-gcc11.jinja", "CONAN_BUILD_PROFILE": "linux-x86_64-gcc11.jinja" } }, { "name": "conan2-gcc11-asan-static-x86_64-profile", "hidden": true, "cacheVariables": { "CONAN_HOST_PROFILE": "linux-x86_64-gcc11-asan-static.jinja", "CONAN_BUILD_PROFILE": "linux-x86_64-gcc11-asan-static.jinja" } }, { "name": "conan2-gcc11-usan-static-x86_64-profile", "hidden": true, "cacheVariables": { "CONAN_HOST_PROFILE": "linux-x86_64-gcc11-usan-static.jinja", "CONAN_BUILD_PROFILE": "linux-x86_64-gcc11-usan-static.jinja" } }, { "name": "conan2-gcc11-tsan-static-x86_64-profile", "hidden": true, "cacheVariables": { "CONAN_HOST_PROFILE": "linux-x86_64-gcc11-tsan-static.jinja", "CONAN_BUILD_PROFILE": "linux-x86_64-gcc11-tsan-static.jinja" } }, { "name": "conan2-gcc13-x86_64-profile", "hidden": true, "cacheVariables": { "CONAN_HOST_PROFILE": "linux-x86_64-gcc13.jinja", "CONAN_BUILD_PROFILE": "linux-x86_64-gcc13.jinja" } }, { "name": "conan2-gcc13-aarch64-profile", "hidden": true, "cacheVariables": { "CONAN_HOST_PROFILE": "linux-aarch64-gcc13.jinja", "CONAN_BUILD_PROFILE": "linux-aarch64-gcc13.jinja" } }, { "name": "conan2-gcc11-aarch64-profile", "hidden": true, "cacheVariables": { "CONAN_HOST_PROFILE": "linux-aarch64-gcc11.jinja", "CONAN_BUILD_PROFILE": "linux-aarch64-gcc11.jinja" } }, { "name": "conan2-gcc9-aarch64-profile", "hidden": true, "cacheVariables": { "CONAN_HOST_PROFILE": "linux-aarch64-gcc9.jinja", "CONAN_BUILD_PROFILE": "linux-aarch64-gcc9.jinja" } }, { "name": "conan2-appleclang-15.0-aarch64-profile", "hidden": true, "cacheVariables": { "CONAN_HOST_PROFILE": "macos-aarch64-appleclang-15.0.jinja", "CONAN_BUILD_PROFILE": "macos-aarch64-appleclang-15.0.jinja" } }, { "name": "conan2-appleclang-16.0-aarch64-profile", "hidden": true, "cacheVariables": { "CONAN_HOST_PROFILE": "macos-aarch64-appleclang-16.0.jinja", "CONAN_BUILD_PROFILE": "macos-aarch64-appleclang-16.0.jinja" } }, { "name": "conan2-windows-x86_64-vs2019-profile", 
"hidden": true, "cacheVariables": { "CONAN_HOST_PROFILE": "windows-x86_64-vs2019.jinja", "CONAN_BUILD_PROFILE": "windows-x86_64-vs2019.jinja" }, "environment": { "CMAKE_GENERATOR": "Visual Studio 16 2019" } } ] } ================================================ FILE: cmake/presets/conan-provider.json ================================================ { "version": 4, "configurePresets": [ { "name": "conan2-provider", "hidden": true, "cacheVariables": { "CMAKE_PROJECT_TOP_LEVEL_INCLUDES": "${sourceDir}/cmake/conan_provider.cmake", "CONAN2": "ON", "CONAN_INSTALL_ARGS": "--build=never;-o:a=arrow/*:with_boost=False;-o:a=arrow/*:with_thrift=False;-o:a=arrow/*:parquet=False;-o:a=arrow/*:with_zstd=True;" } } ] } ================================================ FILE: conanfile.py ================================================ from conan import ConanFile from conan.tools.cmake import CMakeToolchain, CMake, CMakeDeps from conan.tools.files import collect_libs, copy from conan.tools.build import cross_building from conan.tools.cmake import cmake_layout import os class Pod5Conan(ConanFile): name = "pod5_file_format" license = "MPL 2.0" url = "https://github.com/nanoporetech/pod5-file-format" description = "POD5 File format" topics = "nanopore", "sequencing", "genomic", "dna", "arrow" settings = "os", "compiler", "build_type", "arch" options = {"shared": [True, False]} default_options = { "shared": False, } exports_sources = [ "c++/*", "cmake/*", "python/*", "third_party/*", "CMakeLists.txt", "LICENSE.md", ] """ When building a static library, we need to pack arrow, zstd and if on linux jemalloc, alongside pod5 static lib to avoid linking errors. This function copies those libs to a folder called third_party in the build directory. The ci/install.sh ensures they end up in the correct location to be deployed, if install is done via cmake. 
""" def _setup_third_party_deps_packaging(self): deps_to_pack = ( ["arrow", "zstd", "jemalloc"] if self.settings.os == "Linux" else ["arrow", "zstd"] ) static_lib_ext_wildcard = "*.a" if self.settings.os != "Windows" else "*.lib" for dep in deps_to_pack: if dep == "jemalloc": static_lib_ext_wildcard = ( "*_pic.a" if self.settings.os != "Windows" else "*_pic.lib" ) dep_object = self.dependencies[dep] copy( self, static_lib_ext_wildcard, dep_object.cpp_info.libdir, f"{self.build_folder}/third_party/libs", ) def _licenses_path(self): # This needs to match the install step inside CMake. return os.path.join(self.build_folder, "pod5_conan_licenses") def _copy_licenses(self): # Copy each dependency's licenses. for require, dependency in self.dependencies.items(): # package_folder will be None if this dependency isn't used. if dependency.package_folder is not None: copy( self, "license*", dependency.package_folder, os.path.join(self._licenses_path(), dependency.ref.name), ignore_case=True, ) def layout(self): cmake_layout(self, "Ninja Multi-Config") def requirements(self): self.requires("arrow/18.0.0") self.requires("flatbuffers/2.0.0") self.requires("zstd/[>=1.4.8 <=2.0.0]") self.requires("zlib/[>=1.2.11 <=2.0.0]") if not ( self.settings.os == "Windows" or self.settings.os == "Macos" or self.settings.os == "iOS" ): self.requires("jemalloc/5.2.1") """ When cross compiling we need pre compiled flatbuffers for flatc to run on the build machine which is not the target. The flatbuffers version is most likely available already; it is on the master branch and quite likely already built on the development branch. However, it seems that conan doesn't realise this since it is the same package that it tries to build, even though it is a different revision, flatbuffers on the other hand is downloaded. """ def build_requirements(self): if hasattr(self, "settings_build") and cross_building(self): # We are using an older version of flatbuffers not available on CCI. 
# @TODO: Update to a version that exists in CCI # When this line changes a corresponding change in .gitlab-ci.yml is required where this # package is uninstalled. self.tool_requires("flatbuffers/2.0.0") def generate(self): if not self.options.shared: self._setup_third_party_deps_packaging() self._copy_licenses() tc = CMakeToolchain(self) tc.variables["ENABLE_CONAN"] = "ON" tc.variables["BUILD_PYTHON_WHEEL"] = "OFF" tc.variables["POD5_DISABLE_TESTS"] = "ON" tc.variables["POD5_BUILD_EXAMPLES"] = "OFF" tc.variables["BUILD_SHARED_LIB"] = "ON" if self.options.shared else "OFF" tc.generate() deps = CMakeDeps(self) deps.check_components_exist = True # This ensures that target names in cmake would be in the form of libname::libname deps.set_property("zstd", "cmake_target_name", None) deps.generate() def build(self): cmake = CMake(self) cmake.configure() cmake.build() def package(self): cmake = CMake(self) cmake.install() # Copy the license files copy( self, "*", self._licenses_path(), os.path.join(self.package_folder, "licenses"), ) # Package the required third party libs after installing pod5 static if not self.options.shared: src = f"{self.build_folder}/third_party/libs/" dst = f"{self.build_folder}/lib/" copy(self, "*", src, dst) def package_info(self): # Note: package_info collects information in self.cpp_info. It is called from the Conan # application. # # This call is made immediately after the pre_package_info hook and before the # post_package_info hook. To get more information, we can "import traceback" and "import inspect", # then call traceback.print_stack() to print the complete call stack, or examine # inspect.stack(). # # The caller has created self.cpp_info with the name set to the name of self, with a rootpath, # version and description from self, env_info and user_info set with default values, # public_deps set to an array with the names of public requirements in conanfile.requires.items. # Additions for this package. 
Note that everything in requirements needs to be mentioned # here. Except for Windows and Macos, jemalloc is also needed. self.cpp_info.libs = collect_libs(self) self.cpp_info.requires = [ "arrow::arrow", "flatbuffers::flatbuffers", "zstd::zstd", "zlib::zlib", ] # Workaround for broken Arrow package - ensure transitive includes are available # Since our headers include Arrow headers, we need Arrow's includes to be transitively available try: arrow_dep = self.dependencies["arrow"] arrow_include_path = os.path.join(arrow_dep.package_folder, "include") if os.path.exists(arrow_include_path): self.cpp_info.includedirs.append(arrow_include_path) except Exception: # Arrow dependency not found or other issue - let it fail naturally pass # self.cpp if self.settings.os == "Linux": self.cpp_info.requires.append("jemalloc::jemalloc") if self.settings.os in ["iOS", "Macos"]: self.cpp_info.frameworks = ["CoreFoundation"] ================================================ FILE: docs/DESIGN.md ================================================ POD5 File Format Design Details ============================== ## Summary This file format has the following design goals (roughly in priority order): - Good write performance for MinKNOW - Recoverable if the writing process crashes - Good read performance for downstream tools, including basecall model generation - Efficient use of space - Straightforward to implement and maintain - Extensibility Note that trade-offs have been made between these goals, but we have mostly aimed to make those run-time decisions. We have also chosen not to optimise for editing existing files. 
### Write performance The aspects of this format that are designed to maximise write performance are: - Data can be written sequentially - The sequential access pattern makes it easy to use efficient operating system APIs (such as io_uring on Linux) - The sequential access pattern helps the operating system's I/O scheduler maximise throughput - Signal data from different reads can be interleaved, and data streams can be safely abandoned (at the cost of using more space than necessary) - This allows MinKNOW to write out data as it arrives, potentially avoiding the need to have an intermediate caching format (this file format can be used for the cache and the final output) - Support for space- and CPU-efficient compression routines (VBZ) - This reduces the amount of data that needs to be written, which reduces I/O load ### Recovery The aspects of this format that are designed to allow for recovery if the writing process crashes are: - A way to indicate that a file is actually complete as intended (complete files end with a recognisable footer) - The Apache Feather format can be assembled by reading it sequentially, without using the footer - The data file format is append-only, which means that once data is recorded it cannot be corrupted by later updates ### Read performance The aspects of this format that are designed to maximise read performance are: - The Apache Feather format can be memory mapped and used directly - Apache Arrow has significant existing engineering work geared around efficient access to data, from the layout of the data itself to the library tooling - Storing direct information about signal data locations with the row table - This allows quick access to a read's data without scanning the data file - It is possible to only decode part of a long read, due to read data being stored in chunks - This is useful for model training - Read access does not require locking or otherwise modifying the file - This allows multi-threaded and multi-process access 
to a file for reading ### Efficient use of space The aspects of this format that are designed to make efficient use of space are: - Support for efficient compression routines (VBZ) - Apache Arrow's support for dictionary encoding - Apache Arrow's support for compressing buffers with standard compression routines ### Ease of implementation The aspects of this format that are designed to make the format easy to implement are: - Relying on an existing, widely-used format (Apache Arrow) ### Extensibility The aspects of this format that are designed to make the format extensible are: - Apache Arrow uses a self-describing schema with named columns, so it is straightforward to write code that is resilient in the face of things like additional columns being added. ================================================ FILE: docs/README.md ================================================ Design documentation for POD5 ============================ The POD5 file format has been specifically designed to be suitable for Nanopore read data, with the specific design goals listed below. Design Goals ------------ The primary purpose of this file format is to store reads produced by Oxford Nanopore sequencing, and in particular the signal data from those reads (which can then be basecalled or processed in other ways). This file format has the following design goals: - Good write performance for MinKNOW - Recoverable if the writing process crashes - Good read performance for downstream tools, including basecall model generation - Efficient use of space - Straightforward to implement and maintain - Extensibility Note that trade-offs have been made between these goals, but we have mostly aimed to make those run-time decisions. We have also chosen not to optimise for editing existing files. More detailed information around general format goals can be found in [DESIGN](./DESIGN.md); a more detailed format specification is available in [SPECIFICATION](./SPECIFICATION.md). 
================================================ FILE: docs/SPECIFICATION.md ================================================ POD5 Format Specification ========================= ## Overview The file format is, at its core, a collection of Apache Arrow tables, stored in the Apache Feather 2 (also known as Apache Arrow IPC File) format, and bundled into a container format. The container file has the extension `.pod5`. ### Table Schemas POD5 files are a custom wrapper format around arrow that contain several [arrow tables](https://arrow.apache.org/docs/python/data.html#tables). All the tables should have the following `custom_metadata` fields set on them: | Name | Example Value | Notes | |-------------------------|--------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------| | MINKNOW:pod5_version | 1.0.0 | The version of this specification that the schema was based on. | | MINKNOW:software | MinKNOW Core 5.2.3 | A free-form description of the software that wrote the file, intended to help pin down the source of files that violate the specification. | | MINKNOW:file_identifier | cbf91180-0684-4a39-bf56-41eaf437de9e | Must be identical across all tables. Allows checking that the files correspond to each other. | ### Extension Types Several fields in the table schemas use [custom arrow types](https://arrow.apache.org/docs/python/data.html#custom-schema-and-field-metadata). #### minknow.uuid The schemas make extensive use of UUIDs to identify reads. These are stored using an extension type, with the following properties: Name: "minknow.uuid" Physical storage: FixedBinary(16) #### minknow.vbz Storage for VBZ-encoded data: Name: "minknow.vbz" Physical storage: LargeBinary ### Tables The Reads, Signal and Run Info tables must all be present in a POD5 file. 
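As a brief aside on the `minknow.uuid` extension type described above: a UUID is 128 bits, so it fits exactly in the 16 bytes of a `FixedBinary(16)` value. The sketch below illustrates that round trip using only Python's standard `uuid` module. The helper names are hypothetical, and the use of the canonical big-endian byte order (`UUID.bytes`) is an assumption made for illustration rather than something this specification states.

```python
import uuid

# Width of the physical storage named in the specification: FixedBinary(16).
UUID_STORAGE_WIDTH = 16


def uuid_to_storage(text: str) -> bytes:
    """Pack a textual UUID into 16 raw bytes (hypothetical helper).

    Assumes the canonical big-endian layout exposed by uuid.UUID.bytes.
    """
    return uuid.UUID(text).bytes


def storage_to_uuid(raw: bytes) -> str:
    """Unpack 16 raw bytes back into the canonical textual form."""
    if len(raw) != UUID_STORAGE_WIDTH:
        raise ValueError(f"expected {UUID_STORAGE_WIDTH} bytes, got {len(raw)}")
    return str(uuid.UUID(bytes=raw))


# The example UUID from the metadata table above, reused purely as sample input.
example = "cbf91180-0684-4a39-bf56-41eaf437de9e"
packed = uuid_to_storage(example)
restored = storage_to_uuid(packed)
```

Storing the 16 raw bytes rather than the 36-character text form keeps every value the same width and less than half the size, which suits a fixed-width binary column.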
Note that some very early POD5 files produced by pre-0.1 versions of the pod5 library did not include a Run Info table, instead including that information in the Reads table. #### Reads Table The Reads table contains a single row per read, and describes the metadata for each read. The `signal` column of the read links to the Signal table, and allows a read's signal to be retrieved. The `run_info` column links to the Run Info table, providing more context for the read and avoiding duplicating data that is common to many or all reads in the file. Some fields of the Reads table are [dictionaries](https://arrow.apache.org/docs/python/data.html#dictionary-arrays): the contents of the table are stored in a lookup written prior to each batch of read rows and the read row itself then contains an integer index. This allows space savings on fields that would otherwise be repeated. Only simple types are stored in dictionaries as third party tools have limited support for dictionaries of structs. [tables/reads.toml] contains specific information about fields in the reads table. #### Signal Table The signal table contains the (optionally compressed) signal data, where each row contains a sequence of sample data and some information about the origin of that data. [tables/signal.toml] contains specific information about fields in the signal table. #### Run Info Table The run info table contains a single row per MinKNOW run that any read in the file came from. Several fields of the Reads table are [dictionaries](https://arrow.apache.org/docs/python/data.html#dictionary-arrays): the contents of the table are stored in a lookup written prior to each batch of read rows, and the read row itself then contains an integer index. This allows space savings on fields that would otherwise be repeated. [tables/run_info.toml] contains specific information about fields in the run info table. ### Combined file Layout #### Layout ```
...