Showing preview only (420K chars total). Download the full file or copy to clipboard to get everything.
Repository: coderamp-labs/gitingest
Branch: main
Commit: 4e259a02fe72
Files: 110
Total size: 391.9 KB
Directory structure:
gitextract_380b_654/
├── .docker/
│ └── minio/
│ └── setup.sh
├── .dockerignore
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ ├── bug_report.yml
│ │ └── feature_request.yml
│ └── workflows/
│ ├── ci.yml
│ ├── codeql.yml
│ ├── dependency-review.yml
│ ├── deploy-pr.yml
│ ├── docker-build.ecr.yml
│ ├── docker-build.ghcr.yml
│ ├── pr-title-check.yml
│ ├── publish_to_pypi.yml
│ ├── rebase-needed.yml
│ ├── release-please.yml
│ ├── scorecard.yml
│ └── stale.yml
├── .gitignore
├── .pre-commit-config.yaml
├── .release-please-manifest.json
├── .vscode/
│ └── launch.json
├── CHANGELOG.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Dockerfile
├── LICENSE
├── README.md
├── SECURITY.md
├── compose.yml
├── eslint.config.cjs
├── pyproject.toml
├── release-please-config.json
├── renovate.json
├── requirements-dev.txt
├── requirements.txt
├── src/
│ ├── gitingest/
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── clone.py
│ │ ├── config.py
│ │ ├── entrypoint.py
│ │ ├── ingestion.py
│ │ ├── output_formatter.py
│ │ ├── query_parser.py
│ │ ├── schemas/
│ │ │ ├── __init__.py
│ │ │ ├── cloning.py
│ │ │ ├── filesystem.py
│ │ │ └── ingestion.py
│ │ └── utils/
│ │ ├── __init__.py
│ │ ├── auth.py
│ │ ├── compat_func.py
│ │ ├── compat_typing.py
│ │ ├── exceptions.py
│ │ ├── file_utils.py
│ │ ├── git_utils.py
│ │ ├── ignore_patterns.py
│ │ ├── ingestion_utils.py
│ │ ├── logging_config.py
│ │ ├── notebook.py
│ │ ├── os_utils.py
│ │ ├── pattern_utils.py
│ │ ├── query_parser_utils.py
│ │ └── timeout_wrapper.py
│ ├── server/
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── form_types.py
│ │ ├── main.py
│ │ ├── metrics_server.py
│ │ ├── models.py
│ │ ├── query_processor.py
│ │ ├── routers/
│ │ │ ├── __init__.py
│ │ │ ├── dynamic.py
│ │ │ ├── index.py
│ │ │ └── ingest.py
│ │ ├── routers_utils.py
│ │ ├── s3_utils.py
│ │ ├── server_config.py
│ │ ├── server_utils.py
│ │ └── templates/
│ │ ├── base.jinja
│ │ ├── components/
│ │ │ ├── _macros.jinja
│ │ │ ├── footer.jinja
│ │ │ ├── git_form.jinja
│ │ │ ├── navbar.jinja
│ │ │ ├── result.jinja
│ │ │ └── tailwind_components.html
│ │ ├── git.jinja
│ │ ├── index.jinja
│ │ └── swagger_ui.jinja
│ └── static/
│ ├── js/
│ │ ├── git.js
│ │ ├── git_form.js
│ │ ├── index.js
│ │ ├── navbar.js
│ │ ├── posthog.js
│ │ └── utils.js
│ ├── llms.txt
│ └── robots.txt
└── tests/
├── .pylintrc
├── __init__.py
├── conftest.py
├── query_parser/
│ ├── __init__.py
│ ├── test_git_host_agnostic.py
│ └── test_query_parser.py
├── server/
│ ├── __init__.py
│ └── test_flow_integration.py
├── test_cli.py
├── test_clone.py
├── test_git_utils.py
├── test_gitignore_feature.py
├── test_ingestion.py
├── test_notebook_utils.py
├── test_pattern_utils.py
└── test_summary.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .docker/minio/setup.sh
================================================
#!/bin/sh
# Simple script to set up MinIO bucket and user
# Based on example from MinIO issues
# Format bucket name to ensure compatibility
BUCKET_NAME=$(echo "${S3_BUCKET_NAME}" | tr '[:upper:]' '[:lower:]' | tr '_' '-')
# Configure MinIO client
mc alias set myminio http://minio:9000 ${MINIO_ROOT_USER} ${MINIO_ROOT_PASSWORD}
# Remove bucket if it exists (for clean setup)
mc rm -r --force myminio/${BUCKET_NAME} || true
# Create bucket
mc mb myminio/${BUCKET_NAME}
# Set bucket policy to allow downloads
mc anonymous set download myminio/${BUCKET_NAME}
# Create user with access and secret keys
mc admin user add myminio ${S3_ACCESS_KEY} ${S3_SECRET_KEY} || echo "User already exists"
# Create policy for the bucket
echo '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["s3:*"],"Resource":["arn:aws:s3:::'${BUCKET_NAME}'/*","arn:aws:s3:::'${BUCKET_NAME}'"]}]}' > /tmp/policy.json
# Apply policy
mc admin policy create myminio gitingest-policy /tmp/policy.json || echo "Policy already exists"
mc admin policy attach myminio gitingest-policy --user ${S3_ACCESS_KEY}
echo "MinIO setup completed successfully"
echo "Bucket: ${BUCKET_NAME}"
echo "Access via console: http://localhost:9001"
================================================
FILE: .dockerignore
================================================
# -------------------------------------------------
# Base: reuse patterns from .gitignore
# -------------------------------------------------
# Operating-system
.DS_Store
Thumbs.db
# Editor / IDE settings
.vscode/
!.vscode/launch.json
.idea/
*.swp
# Python virtual-envs & tooling
.venv*/
.python-version
__pycache__/
*.egg-info/
*.egg
.ruff_cache/
# Test artifacts & coverage
.pytest_cache/
.coverage
coverage.xml
htmlcov/
# Build, distribution & docs
build/
dist/
*.wheel
# Logs & runtime output
*.log
logs/
*.tmp
tmp/
# Project-specific files
history.txt
digest.txt
# -------------------------------------------------
# Extra for Docker
# -------------------------------------------------
# Git history
.git/
.gitignore
# Tests
tests/
# Docs
docs/
*.md
LICENSE
# Local overrides & secrets
.env
# Docker files
.dockerignore
Dockerfile*
# -------------------------------------------------
# Files required during build
# -------------------------------------------------
!pyproject.toml
!src/
================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.yml
================================================
name: Bug report 🐞
description: Report a bug or internal server error when using Gitingest
title: "(bug): "
labels: ["bug"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to report a bug! :lady_beetle:
Please fill out the following details to help us reproduce and fix the issue. :point_down:
- type: dropdown
id: interface
attributes:
label: Which interface did you use?
default: 0
options:
- "Select one..."
- Web UI
- CLI
- PyPI package
validations:
required: true
- type: input
id: repo_url
attributes:
label: Repository URL (if public)
placeholder: e.g., https://github.com/<username>/<repo>/commit_branch_or_tag/blob_or_tree/subdir
- type: dropdown
id: git_host
attributes:
label: Git host
description: The Git host of the repository.
default: 0
options:
- "Select one..."
- GitHub (github.com)
- GitLab (gitlab.com)
- Bitbucket (bitbucket.org)
- Gitea (gitea.com)
- Codeberg (codeberg.org)
- Gist (gist.github.com)
- Kaggle (kaggle.com)
- GitHub Enterprise (github.company.com)
- Other (specify below)
validations:
required: true
- type: input
id: git_host_other
attributes:
label: Other Git host
placeholder: If you selected "Other", please specify the Git host here.
- type: dropdown
id: repo_visibility
attributes:
label: Repository visibility
default: 0
options:
- "Select one..."
- public
- private
validations:
required: true
- type: dropdown
id: revision
attributes:
label: Commit, branch, or tag
default: 0
options:
- "Select one..."
- default branch
- commit
- branch
- tag
validations:
required: true
- type: dropdown
id: ingest_scope
attributes:
label: Did you ingest the full repository or a subdirectory?
default: 0
options:
- "Select one..."
- full repository
- subdirectory
validations:
required: true
- type: dropdown
id: os
attributes:
label: Operating system
default: 0
options:
- "Select one..."
- Not relevant (Web UI)
- macOS
- Windows
- Linux
validations:
required: true
- type: dropdown
id: browser
attributes:
label: Browser (Web UI only)
default: 0
options:
- "Select one..."
- Not relevant (CLI / PyPI)
- Chrome
- Firefox
- Safari
- Edge
- Other (specify below)
validations:
required: true
- type: input
id: browser_other
attributes:
label: Other browser
placeholder: If you selected "Other", please specify the browser here.
- type: input
id: gitingest_version
attributes:
label: Gitingest version
placeholder: e.g., v0.1.5
description: Not required if you used the Web UI.
- type: input
id: python_version
attributes:
label: Python version
placeholder: e.g., 3.11.5
description: Not required if you used the Web UI.
- type: textarea
id: bug_description
attributes:
label: Bug description
placeholder: Describe the bug here.
description: A detailed but concise description of the bug.
validations:
required: true
- type: textarea
id: steps_to_reproduce
attributes:
label: Steps to reproduce
placeholder: Include the exact commands or actions that led to the error.
description: Include the exact commands or actions that led to the error *(if relevant)*.
render: shell
- type: textarea
id: expected_behavior
attributes:
label: Expected behavior
placeholder: Describe what you expected to happen.
description: Describe what you expected to happen *(if relevant)*.
- type: textarea
id: actual_behavior
attributes:
label: Actual behavior
description: Paste the full error message or stack trace here.
- type: textarea
id: additional_context
attributes:
label: Additional context, logs, or screenshots
placeholder: Add any other context, links, or screenshots about the issue here.
================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.yml
================================================
name: Feature request 💡
description: Suggest a new feature or improvement for Gitingest
title: "(feat): "
labels: ["enhancement"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to help us improve **Gitingest**! :sparkles:
Please fill in the sections below to describe your idea. The more detail you provide, the easier it is for us to evaluate and plan the work. :point_down:
- type: input
id: summary
attributes:
label: Feature summary
placeholder: One-sentence description of the feature.
validations:
required: true
- type: textarea
id: problem
attributes:
label: Problem / motivation
description: What problem does this feature solve? How does it affect your workflow?
placeholder: Why is this feature important? Describe the pain point or limitation you're facing.
validations:
required: true
- type: textarea
id: proposal
attributes:
label: Proposed solution
placeholder: Describe what you would like to see happen.
description: Outline the feature as you imagine it. *(optional)*
- type: textarea
id: alternatives
attributes:
label: Alternatives considered
placeholder: List other approaches you've considered or work-arounds you use today.
description: Feel free to mention why those alternatives don't fully solve the problem.
- type: dropdown
id: interface
attributes:
label: Which interface would this affect?
default: 0
options:
- "Select one..."
- Web UI
- CLI
- PyPI package
- CLI + PyPI package
- All
validations:
required: true
- type: dropdown
id: priority
attributes:
label: How important is this to you?
default: 0
options:
- "Select one..."
- Nice to have
- Important
- Critical
validations:
required: true
- type: dropdown
id: willingness
attributes:
label: Would you like to work on this feature yourself?
default: 0
options:
- "Select one..."
- Yes, I'd like to implement it
- Maybe, if I get some guidance
- No, just requesting (absolutely fine!)
validations:
required: true
- type: dropdown
id: support_needed
attributes:
label: Would you need support from the maintainers (if you're implementing it yourself)?
default: 0
options:
- "Select one..."
- No, I can handle it solo
- Yes, I'd need some guidance
- Not sure yet
- This is just a suggestion, I'm not planning to implement it myself (absolutely fine!)
- type: textarea
id: additional_context
attributes:
label: Additional context, screenshots, or examples
placeholder: Add links, sketches, or any other context that would help us understand and implement the feature.
================================================
FILE: .github/workflows/ci.yml
================================================
name: CI
on:
push:
branches: [main]
pull_request:
branches: [main]
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
permissions:
contents: read
jobs:
test:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
python-version: ["3.8", "3.13"]
include:
- os: ubuntu-latest
python-version: "3.13"
coverage: true
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@ec9f2d5744a09debf3a187a3f4f675c53b671911 # v2.13.0
with:
egress-policy: audit
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
- name: Set up Python
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: ${{ matrix.python-version }}
cache: 'pip'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install ".[dev,server]"
- name: Cache pytest results
uses: actions/cache@v4
with:
path: .pytest_cache
key: ${{ runner.os }}-pytest-${{ matrix.python-version }}-${{ hashFiles('**/pytest.ini') }}
restore-keys: |
${{ runner.os }}-pytest-${{ matrix.python-version }}-
- name: Run tests
if: ${{ matrix.coverage != true }}
run: pytest
- name: Run tests
if: ${{ matrix.coverage == true }}
run: pytest
- name: Run pre-commit hooks
uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
if: ${{ matrix.python-version == '3.13' && matrix.os == 'ubuntu-latest' }}
================================================
FILE: .github/workflows/codeql.yml
================================================
# For most projects, this workflow file will not need changing; you simply need
# to commit it to your repository.
#
# You may wish to alter this file to override the set of languages analyzed,
# or to provide custom queries or build logic.
#
# ******** NOTE ********
# We have attempted to detect the languages in your repository. Please check
# the `language` matrix defined below to confirm you have the correct set of
# supported CodeQL languages.
#
name: "CodeQL"
on:
push:
branches: ["main"]
pull_request:
# The branches below must be a subset of the branches above
branches: ["main"]
schedule:
- cron: "0 0 * * 1"
permissions:
contents: read
jobs:
analyze:
name: Analyze
runs-on: ubuntu-latest
permissions:
actions: read
contents: read
security-events: write
strategy:
fail-fast: false
matrix:
language: ["javascript", "python"]
# CodeQL supports [ $supported-codeql-languages ]
# Learn more about CodeQL language support at https://aka.ms/codeql-docs/language-support
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@ec9f2d5744a09debf3a187a3f4f675c53b671911 # v2.13.0
with:
egress-policy: audit
- name: Checkout repository
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@df559355d593797519d70b90fc8edd5db049e7a2 # v3.29.9
with:
languages: ${{ matrix.language }}
# If you wish to specify custom queries, you can do so here or in a config file.
# By default, queries listed here will override any specified in a config file.
# Prefix the list here with "+" to use these queries and those in the config file.
# Autobuild attempts to build any compiled languages (C/C++, C#, or Java).
# If this step fails, then you should remove it and run the build manually (see below)
- name: Autobuild
uses: github/codeql-action/autobuild@df559355d593797519d70b90fc8edd5db049e7a2 # v3.29.9
# ℹ️ Command-line programs to run using the OS shell.
# 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun
# If the Autobuild fails above, remove it and uncomment the following three lines.
# modify them (or add more) to build your code if your project, please refer to the EXAMPLE below for guidance.
# - run: |
# echo "Run, Build Application using script"
# ./location_of_script_within_repo/buildscript.sh
- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@df559355d593797519d70b90fc8edd5db049e7a2 # v3.29.9
with:
category: "/language:${{matrix.language}}"
================================================
FILE: .github/workflows/dependency-review.yml
================================================
# Dependency Review Action
#
# This Action will scan dependency manifest files that change as part of a Pull Request,
# surfacing known-vulnerable versions of the packages declared or updated in the PR.
# Once installed, if the workflow run is marked as required,
# PRs introducing known-vulnerable packages will be blocked from merging.
#
# Source repository: https://github.com/actions/dependency-review-action
name: 'Dependency Review'
on: [pull_request]
permissions:
contents: read
jobs:
dependency-review:
runs-on: ubuntu-latest
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@ec9f2d5744a09debf3a187a3f4f675c53b671911 # v2.13.0
with:
egress-policy: audit
- name: 'Checkout Repository'
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
- name: 'Dependency Review'
uses: actions/dependency-review-action@da24556b548a50705dd671f47852072ea4c105d9 # v4.7.1
================================================
FILE: .github/workflows/deploy-pr.yml
================================================
name: Manage PR Temp Envs
'on':
pull_request:
types:
- labeled
- unlabeled
- closed
permissions:
contents: read
pull-requests: write
env:
APP_NAME: gitingest
FLUX_OWNER: '${{ github.repository_owner }}'
FLUX_REPO: '${{ secrets.CR_FLUX_REPO }}'
jobs:
deploy-pr-env:
if: >-
${{ github.event.action == 'labeled' && github.event.label.name ==
'deploy-pr-temp-env' }}
runs-on: ubuntu-latest
steps:
- name: Create GitHub App token
uses: actions/create-github-app-token@v2
id: app-token
with:
app-id: '${{ secrets.CR_APP_CI_APP_ID }}'
private-key: '${{ secrets.CR_APP_CI_PRIVATE_KEY }}'
owner: '${{ env.FLUX_OWNER }}'
repositories: '${{ env.FLUX_REPO }}'
- name: Checkout Flux repo
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
repository: '${{ env.FLUX_OWNER }}/${{ env.FLUX_REPO }}'
token: '${{ steps.app-token.outputs.token }}'
path: flux-repo
persist-credentials: false
- name: Export PR ID
shell: bash
run: 'echo "PR_ID=${{ github.event.pull_request.number }}" >> $GITHUB_ENV'
- name: Ensure template exists
shell: bash
run: >
T="flux-repo/pr-template/${APP_NAME}"
[[ -d "$T" ]] || { echo "Missing $T"; exit 1; }
[[ $(find "$T" -type f | wc -l) -gt 0 ]] || { echo "No files in $T";
exit 1; }
- name: Render & copy template
shell: bash
run: |
SRC="flux-repo/pr-template/${APP_NAME}"
DST="flux-repo/deployments/prs-${APP_NAME}/${PR_ID}"
mkdir -p "$DST"
cp -r "$SRC/." "$DST/"
find "$DST" -type f -print0 \
| xargs -0 -n1 sed -i "s|@PR-ID@|${PR_ID}|g"
- name: Sanity‑check rendered output
shell: bash
run: >
E=$(find "flux-repo/pr-template/${APP_NAME}" -type f | wc -l)
G=$(find "flux-repo/deployments/prs-${APP_NAME}/${PR_ID}" -type f | wc
-l)
(( G == E )) || { echo "Expected $E files, got $G"; exit 1; }
- name: Commit & push creation
shell: bash
run: >
cd flux-repo
git config user.name "${{ steps.app-token.outputs.app-slug }}[bot]"
git config user.email "${{ steps.app-token.outputs.app-slug
}}[bot]@users.noreply.github.com"
git add .
git commit -m "chore(prs-${APP_NAME}): create temp env for PR #${{
env.PR_ID }} [skip ci]" || echo "Nothing to commit"
git remote set-url origin \
https://x-access-token:${{ steps.app-token.outputs.token }}@github.com/${{ env.FLUX_OWNER }}/${{ env.FLUX_REPO }}.git
git push origin HEAD:main
- name: Comment preview URL on PR
uses: thollander/actions-comment-pull-request@v3
with:
github-token: '${{ secrets.GITHUB_TOKEN }}'
pr-number: '${{ github.event.pull_request.number }}'
comment-tag: 'pr-preview'
create-if-not-exists: 'true'
message: |
🌐 [Preview environment](https://pr-${{ env.PR_ID }}.${{ env.APP_NAME }}.coderamp.dev/) for PR #${{ env.PR_ID }}
📊 [Log viewer](https://app.datadoghq.eu/logs?query=kube_namespace%3Aprs-gitingest%20version%3Apr-${{ env.PR_ID }})
remove-pr-env:
if: >-
(github.event.action == 'unlabeled' && github.event.label.name ==
'deploy-pr-temp-env') || (github.event.action == 'closed')
runs-on: ubuntu-latest
steps:
- name: Create GitHub App token
uses: actions/create-github-app-token@v2
id: app-token
with:
app-id: '${{ secrets.CR_APP_CI_APP_ID }}'
private-key: '${{ secrets.CR_APP_CI_PRIVATE_KEY }}'
owner: '${{ env.FLUX_OWNER }}'
repositories: '${{ env.FLUX_REPO }}'
- name: Checkout Flux repo
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
repository: '${{ env.FLUX_OWNER }}/${{ env.FLUX_REPO }}'
token: '${{ steps.app-token.outputs.token }}'
path: flux-repo
persist-credentials: false
- name: Export PR ID
shell: bash
run: 'echo "PR_ID=${{ github.event.pull_request.number }}" >> $GITHUB_ENV'
- name: Remove deployed directory
shell: bash
run: |
DST="flux-repo/deployments/prs-${APP_NAME}/${PR_ID}"
if [[ -d "$DST" ]]; then
rm -rf "$DST"
echo "✅ Deleted $DST"
else
echo "⏭️ Nothing to delete at $DST"
fi
- name: Commit & push deletion
shell: bash
run: >
cd flux-repo
git config user.name "${{ steps.app-token.outputs.app-slug }}[bot]"
git config user.email "${{ steps.app-token.outputs.app-slug
}}[bot]@users.noreply.github.com"
git add -A
git commit -m "chore(prs-${APP_NAME}): remove temp env for PR #${{
env.PR_ID }} [skip ci]" || echo "Nothing to commit"
git remote set-url origin \
https://x-access-token:${{ steps.app-token.outputs.token }}@github.com/${{ env.FLUX_OWNER }}/${{ env.FLUX_REPO }}.git
git push origin HEAD:main
- name: Comment preview URL on PR
uses: thollander/actions-comment-pull-request@v3
with:
github-token: '${{ secrets.GITHUB_TOKEN }}'
pr-number: '${{ github.event.pull_request.number }}'
comment-tag: 'pr-preview'
create-if-not-exists: 'true'
message: |
⚙️ Preview environment was undeployed.
================================================
FILE: .github/workflows/docker-build.ecr.yml
================================================
name: Build & Push Container
on:
push:
branches:
- 'main'
tags:
- '*'
merge_group:
pull_request:
types: [labeled, synchronize, reopened, ready_for_review, opened]
env:
PUSH_FROM_PR: >-
${{ github.event_name == 'pull_request' &&
(
contains(github.event.pull_request.labels.*.name, 'push-container') ||
contains(github.event.pull_request.labels.*.name, 'deploy-pr-temp-env')
)
}}
jobs:
terraform:
name: "ECR"
runs-on: ubuntu-latest
if: github.repository == 'coderamp-labs/gitingest'
permissions:
id-token: write
contents: read
pull-requests: write
steps:
- name: Checkout
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.CODERAMP_AWS_ECR_REGISTRY_PUSH_ROLE_ARN }}
role-session-name: GitHub_to_AWS_via_FederatedOIDC
aws-region: eu-west-1
- name: Set current timestamp
id: vars
run: |
echo "timestamp=$(date +%s)" >> $GITHUB_OUTPUT
echo "sha_short=$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
echo "sha_full=$(git rev-parse HEAD)" >> $GITHUB_OUTPUT
- name: Determine version and deployment context
id: version
run: |
REPO_URL="https://github.com/${{ github.repository }}"
if [[ "${{ github.ref_type }}" == "tag" ]]; then
# Tag deployment - display version, link to release
echo "version=${{ github.ref_name }}" >> $GITHUB_OUTPUT
echo "app_version=${{ github.ref_name }}" >> $GITHUB_OUTPUT
echo "app_version_url=${REPO_URL}/releases/tag/${{ github.ref_name }}" >> $GITHUB_OUTPUT
elif [[ "${{ github.event_name }}" == "pull_request" ]]; then
# PR deployment - display pr-XXX, link to PR commit
PR_NUMBER="${{ github.event.pull_request.number }}"
COMMIT_HASH="${{ steps.vars.outputs.sha_full }}"
echo "version=${PR_NUMBER}/merge-${COMMIT_HASH}" >> $GITHUB_OUTPUT
echo "app_version=pr-${PR_NUMBER}" >> $GITHUB_OUTPUT
echo "app_version_url=${REPO_URL}/pull/${PR_NUMBER}/commits/${COMMIT_HASH}" >> $GITHUB_OUTPUT
else
# Branch deployment - display branch name, link to commit
BRANCH_NAME="${{ github.ref_name }}"
COMMIT_HASH="${{ steps.vars.outputs.sha_full }}"
echo "app_version=${BRANCH_NAME}" >> $GITHUB_OUTPUT
echo "app_version_url=${REPO_URL}/commit/${COMMIT_HASH}" >> $GITHUB_OUTPUT
fi
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
- name: Docker Meta
id: meta
uses: docker/metadata-action@v5
with:
images: |
${{ secrets.ECR_REGISTRY_URL }}
flavor: |
latest=false
tags: |
type=ref,event=branch,branch=main,suffix=-${{ steps.vars.outputs.sha_short }}-${{ steps.vars.outputs.timestamp }}
type=ref,event=pr,suffix=-${{ steps.vars.outputs.sha_short }}-${{ steps.vars.outputs.timestamp }}
type=pep440,pattern={{raw}}
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build and push
uses: docker/build-push-action@v6
with:
context: .
platforms: linux/amd64, linux/arm64
push: ${{ github.event_name != 'pull_request' || env.PUSH_FROM_PR == 'true' }}
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
build-args: |
APP_REPOSITORY=https://github.com/${{ github.repository }}
APP_VERSION=${{ steps.version.outputs.app_version }}
APP_VERSION_URL=${{ steps.version.outputs.app_version_url }}
cache-from: type=gha
cache-to: type=gha,mode=max
================================================
FILE: .github/workflows/docker-build.ghcr.yml
================================================
name: Build & Push Container
on:
push:
branches:
- 'main'
tags:
- '*'
merge_group:
pull_request:
types: [labeled, synchronize, reopened, ready_for_review, opened]
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.ref }}
cancel-in-progress: true
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
PUSH_FROM_PR: >-
${{ github.event_name == 'pull_request' &&
(
contains(github.event.pull_request.labels.*.name, 'push-container') ||
contains(github.event.pull_request.labels.*.name, 'deploy-pr-temp-env')
)
}}
permissions:
contents: read
jobs:
docker-build:
name: "GHCR"
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
attestations: write
id-token: write
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@ec9f2d5744a09debf3a187a3f4f675c53b671911 # v2.13.0
with:
egress-policy: audit
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
- name: Set current timestamp
id: vars
run: |
echo "timestamp=$(date +%s)" >> $GITHUB_OUTPUT
echo "sha_short=$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
echo "sha_full=$(git rev-parse HEAD)" >> $GITHUB_OUTPUT
- name: Determine version and deployment context
id: version
run: |
REPO_URL="https://github.com/${{ github.repository }}"
if [[ "${{ github.ref_type }}" == "tag" ]]; then
# Tag deployment - display version, link to release
echo "version=${{ github.ref_name }}" >> $GITHUB_OUTPUT
echo "app_version=${{ github.ref_name }}" >> $GITHUB_OUTPUT
echo "app_version_url=${REPO_URL}/releases/tag/${{ github.ref_name }}" >> $GITHUB_OUTPUT
elif [[ "${{ github.event_name }}" == "pull_request" ]]; then
# PR deployment - display pr-XXX, link to PR commit
PR_NUMBER="${{ github.event.pull_request.number }}"
COMMIT_HASH="${{ steps.vars.outputs.sha_full }}"
echo "version=${PR_NUMBER}/merge-${COMMIT_HASH}" >> $GITHUB_OUTPUT
echo "app_version=pr-${PR_NUMBER}" >> $GITHUB_OUTPUT
echo "app_version_url=${REPO_URL}/pull/${PR_NUMBER}/commits/${COMMIT_HASH}" >> $GITHUB_OUTPUT
else
# Branch deployment - display branch name, link to commit
BRANCH_NAME="${{ github.ref_name }}"
COMMIT_HASH="${{ steps.vars.outputs.sha_full }}"
echo "app_version=${BRANCH_NAME}" >> $GITHUB_OUTPUT
echo "app_version_url=${REPO_URL}/commit/${COMMIT_HASH}" >> $GITHUB_OUTPUT
fi
- name: Log in to the Container registry
uses: docker/login-action@184bdaa0721073962dff0199f1fb9940f07167d1 # v3.5.0
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Docker Meta
id: meta
uses: docker/metadata-action@c1e51972afc2121e065aed6d45c65596fe445f3f # v5.8.0
with:
images: |
${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
flavor: |
latest=false
tags: |
type=ref,event=branch,branch=main
type=ref,event=branch,branch=main,suffix=-${{ steps.vars.outputs.sha_short }}-${{ steps.vars.outputs.timestamp }}
type=pep440,pattern={{raw}}
type=ref,event=pr,suffix=-${{ steps.vars.outputs.sha_short }}-${{ steps.vars.outputs.timestamp }}
- name: Set up QEMU
uses: docker/setup-qemu-action@29109295f81e9208d7d86ff1c6c12d2833863392 # v3.6.0
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@e468171a9de216ec08956ac3ada2f0791b6bd435 # v3.11.1
- name: Build and push
uses: docker/build-push-action@263435318d21b8e681c14492fe198d362a7d2c83 # v6.18.0
id: push
with:
context: .
platforms: linux/amd64, linux/arm64
push: ${{ github.event_name != 'pull_request' || env.PUSH_FROM_PR == 'true' }}
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
build-args: |
APP_REPOSITORY=https://github.com/${{ github.repository }}
APP_VERSION=${{ steps.version.outputs.app_version }}
APP_VERSION_URL=${{ steps.version.outputs.app_version_url }}
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Generate artifact attestation
if: github.event_name != 'pull_request' || env.PUSH_FROM_PR == 'true'
uses: actions/attest-build-provenance@e8998f949152b193b063cb0ec769d69d929409be # v2.4.0
with:
subject-name: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME}}
subject-digest: ${{ steps.push.outputs.digest }}
push-to-registry: true
================================================
FILE: .github/workflows/pr-title-check.yml
================================================
name: PR Conventional Commit Validation
on:
pull_request:
types: [opened, synchronize, reopened, edited]
jobs:
validate-pr-title:
runs-on: ubuntu-latest
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@ec9f2d5744a09debf3a187a3f4f675c53b671911 # v2.13.0
with:
egress-policy: audit
- name: PR Conventional Commit Validation
uses: ytanikin/pr-conventional-commits@b72758283dcbee706975950e96bc4bf323a8d8c0 # 1.4.2
with:
task_types: '["feat","fix","docs","test","ci","refactor","perf","chore","revert"]'
add_label: 'false'
================================================
FILE: .github/workflows/publish_to_pypi.yml
================================================
name: Publish to PyPI
on:
release:
types: [created] # Run when you click "Publish release"
workflow_dispatch: # ... or run it manually from the Actions tab
permissions:
contents: read
jobs:
release-build:
runs-on: ubuntu-latest
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@ec9f2d5744a09debf3a187a3f4f675c53b671911 # v2.13.0
with:
egress-policy: audit
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
- name: Set up Python 3.13
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: "3.13"
cache: pip
cache-dependency-path: pyproject.toml
- name: Build package
run: |
python -m pip install --upgrade pip
python -m pip install build twine
python -m build
twine check dist/*
- name: Upload dist artefact
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
with:
name: dist
path: dist/
# Publish to PyPI (only if "dist/" succeeded)
pypi-publish:
needs: release-build
runs-on: ubuntu-latest
environment: pypi
permissions:
id-token: write # OIDC token for trusted publishing
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@ec9f2d5744a09debf3a187a3f4f675c53b671911 # v2.13.0
with:
egress-policy: audit
- uses: actions/download-artifact@634f93cb2916e3fdff6788551b99b062d0335ce0 # v5.0.0
with:
name: dist
path: dist/
- uses: pypa/gh-action-pypi-publish@76f52bc884231f62b9a034ebfe128415bbaabdfc # release/v1
with:
verbose: true
================================================
FILE: .github/workflows/rebase-needed.yml
================================================
name: PR Needs Rebase
on:
workflow_dispatch: {}
schedule:
- cron: '0 * * * *'
permissions:
pull-requests: write
jobs:
label-rebase-needed:
runs-on: ubuntu-latest
if: github.repository == 'coderamp-labs/gitingest'
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
steps:
- name: Check for merge conflicts
uses: eps1lon/actions-label-merge-conflict@v3
with:
dirtyLabel: 'rebase needed :construction:'
repoToken: '${{ secrets.GITHUB_TOKEN }}'
commentOnClean: This pull request has resolved merge conflicts and is ready for review.
commentOnDirty: This pull request has merge conflicts that must be resolved before it can be merged.
retryMax: 30
continueOnMissingPermissions: false
================================================
FILE: .github/workflows/release-please.yml
================================================
name: release-please
on:
push:
branches:
- main
permissions:
contents: write
pull-requests: write
jobs:
release:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
- name: Create GitHub App token
uses: actions/create-github-app-token@v2
id: app-token
with:
app-id: '${{ secrets.CR_APP_CI_APP_ID }}'
private-key: '${{ secrets.CR_APP_CI_PRIVATE_KEY }}'
owner: '${{ env.FLUX_OWNER }}'
repositories: '${{ env.FLUX_REPO }}'
- name: Release Please
uses: googleapis/release-please-action@v4
with:
token: '${{ steps.app-token.outputs.token }}'
================================================
FILE: .github/workflows/scorecard.yml
================================================
name: OSSF Scorecard
on:
branch_protection_rule:
schedule:
- cron: '33 11 * * 2' # Every Tuesday at 11:33 AM UTC
push:
branches: [ main ]
permissions: read-all
concurrency: # avoid overlapping runs
group: scorecard-${{ github.ref }}
cancel-in-progress: true
jobs:
analysis:
name: Scorecard analysis
runs-on: ubuntu-latest
permissions:
security-events: write # upload SARIF to code-scanning
id-token: write # publish results for the badge
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@ec9f2d5744a09debf3a187a3f4f675c53b671911 # v2.13.0
with:
egress-policy: audit
- name: Checkout
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
with:
persist-credentials: false
- name: Run Scorecard
uses: ossf/scorecard-action@f35c64557cf912815708bb1126d9948f3e459487
with:
results_file: results.sarif
results_format: sarif
publish_results: true # enables the public badge
- name: Upload to code-scanning
uses: github/codeql-action/upload-sarif@df559355d593797519d70b90fc8edd5db049e7a2 # v3.29.9
with:
sarif_file: results.sarif
================================================
FILE: .github/workflows/stale.yml
================================================
name: "Close stale issues and PRs"
on:
schedule:
- cron: "0 6 * * *"
workflow_dispatch: {}
permissions:
issues: write
pull-requests: write
jobs:
stale:
runs-on: ubuntu-latest
steps:
- uses: actions/stale@v9
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
days-before-stale: 45
days-before-close: 10
stale-issue-label: stale
stale-pr-label: stale
stale-issue-message: |
Hi there! We haven’t seen activity here for 45 days, so I’m marking this issue as stale.
If you’d like to keep it open, please leave a comment within 10 days. Thanks!
stale-pr-message: |
Hi there! We haven’t seen activity on this pull request for 45 days, so I’m marking it as stale.
If you’d like to keep it open, please leave a comment within 10 days. Thanks!
close-issue-message: |
Hi there! We haven’t heard anything for 10 days, so I’m closing this issue. Feel free to reopen if you’d like to continue the discussion. Thanks!
close-pr-message: |
Hi there! We haven’t heard anything for 10 days, so I’m closing this pull request. Feel free to reopen if you’d like to continue working on it. Thanks!
================================================
FILE: .gitignore
================================================
# Operating-system
.DS_Store
Thumbs.db
# Editor / IDE settings
.vscode/
!.vscode/launch.json
.idea/
*.swp
# Python virtual-envs & tooling
.venv*/
venv/
.python-version
__pycache__/
*.egg-info/
*.egg
.ruff_cache/
# Test artifacts & coverage
.pytest_cache/
.coverage
coverage.xml
htmlcov/
# Build, distribution & docs
build/
dist/
*.wheel
# Logs & runtime output
*.log
logs/
*.tmp
tmp/
# Project-specific files
history.txt
digest.txt
# Environment variables
.env
================================================
FILE: .pre-commit-config.yaml
================================================
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v5.0.0
hooks:
- id: check-added-large-files
description: 'Prevent large files from being committed.'
args: ['--maxkb=10000']
- id: check-case-conflict
description: 'Check for files that would conflict in case-insensitive filesystems.'
- id: fix-byte-order-marker
description: 'Remove utf-8 byte order marker.'
- id: mixed-line-ending
description: 'Replace mixed line ending.'
- id: destroyed-symlinks
description: 'Detect symlinks which are changed to regular files with a content of a path which that symlink was pointing to.'
- id: check-ast
description: 'Check for parseable syntax.'
- id: end-of-file-fixer
description: 'Ensure that a file is either empty, or ends with one newline.'
- id: trailing-whitespace
description: 'Trim trailing whitespace.'
exclude: CHANGELOG.md
- id: check-docstring-first
description: 'Check a common error of defining a docstring after code.'
- id: requirements-txt-fixer
description: 'Sort entries in requirements.txt.'
- repo: https://github.com/MarcoGorelli/absolufy-imports
rev: v0.3.1
hooks:
- id: absolufy-imports
description: 'Automatically convert relative imports to absolute. (Use `args: [--never]` to revert.)'
- repo: https://github.com/asottile/pyupgrade
rev: v3.20.0
hooks:
- id: pyupgrade
description: 'Automatically upgrade syntax for newer versions.'
args: [--py3-plus, --py36-plus]
- repo: https://github.com/pre-commit/pygrep-hooks
rev: v1.10.0
hooks:
- id: python-check-blanket-noqa
description: 'Enforce that `# noqa` annotations always occur with specific codes.'
- id: python-check-blanket-type-ignore
description: 'Enforce that `# type: ignore` annotations always occur with specific codes.'
- id: python-use-type-annotations
description: 'Enforce that python3.6+ type annotations are used instead of type comments.'
- repo: https://github.com/PyCQA/isort
rev: 6.0.1
hooks:
- id: isort
description: 'Sort imports alphabetically, and automatically separated into sections and by type.'
- repo: https://github.com/pre-commit/mirrors-eslint
rev: v9.30.1
hooks:
- id: eslint
description: 'Lint javascript files.'
files: \.js$
args: [--max-warnings=0, --fix]
additional_dependencies:
[
'eslint@9.30.1',
'@eslint/js@9.30.1',
'eslint-plugin-import@2.32.0',
'globals@16.3.0',
]
- repo: https://github.com/djlint/djLint
rev: v1.36.4
hooks:
- id: djlint-reformat-jinja
- repo: https://github.com/igorshubovych/markdownlint-cli
rev: v0.45.0
hooks:
- id: markdownlint
description: 'Lint markdown files.'
args: ['--disable=line-length', '--ignore=CHANGELOG.md']
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.12.2
hooks:
- id: ruff-check
- id: ruff-format
- repo: https://github.com/jsh9/pydoclint
rev: 0.6.7
hooks:
- id: pydoclint
name: pydoclint for source
args: [--style=numpy]
files: ^src/
- repo: https://github.com/pycqa/pylint
rev: v3.3.7
hooks:
- id: pylint
name: pylint for source
files: ^src/
additional_dependencies:
[
boto3>=1.28.0,
click>=8.0.0,
'fastapi[standard]>=0.109.1',
gitpython>=3.1.0,
httpx,
loguru>=0.7.0,
pathspec>=0.12.1,
prometheus-client,
pydantic,
pytest-asyncio,
pytest-mock,
python-dotenv,
'sentry-sdk[fastapi]',
slowapi,
starlette>=0.40.0,
strenum; python_version < '3.11',
tiktoken>=0.7.0,
typing_extensions>= 4.0.0; python_version < '3.10',
uvicorn>=0.11.7,
]
- id: pylint
name: pylint for tests
files: ^tests/
args:
- --rcfile=tests/.pylintrc
additional_dependencies:
[
boto3>=1.28.0,
click>=8.0.0,
'fastapi[standard]>=0.109.1',
gitpython>=3.1.0,
httpx,
loguru>=0.7.0,
pathspec>=0.12.1,
prometheus-client,
pydantic,
pytest-asyncio,
pytest-mock,
python-dotenv,
'sentry-sdk[fastapi]',
slowapi,
starlette>=0.40.0,
strenum; python_version < '3.11',
tiktoken>=0.7.0,
typing_extensions>= 4.0.0; python_version < '3.10',
uvicorn>=0.11.7,
]
- repo: meta
hooks:
- id: check-hooks-apply
- id: check-useless-excludes
- repo: https://github.com/gitleaks/gitleaks
rev: v8.16.3
hooks:
- id: gitleaks
================================================
FILE: .release-please-manifest.json
================================================
{".":"0.3.1"}
================================================
FILE: .vscode/launch.json
================================================
{
"configurations": [
{
"name": "Python Debugger: Module",
"type": "debugpy",
"request": "launch",
"module": "server",
"args": [],
"cwd": "${workspaceFolder}/src"
}
]
}
================================================
FILE: CHANGELOG.md
================================================
# Changelog
## [0.3.1](https://github.com/coderamp-labs/gitingest/compare/v0.3.0...v0.3.1) (2025-07-31)
### Bug Fixes
* make cache aware of subpaths ([#481](https://github.com/coderamp-labs/gitingest/issues/481)) ([8b59bef](https://github.com/coderamp-labs/gitingest/commit/8b59bef541f858ef44eba8fce6ace77df9dea01c))
## [0.3.0](https://github.com/coderamp-labs/gitingest/compare/v0.2.1...v0.3.0) (2025-07-30)
### Features
* **logging:** implement loguru ([#473](https://github.com/coderamp-labs/gitingest/issues/473)) ([d061b48](https://github.com/coderamp-labs/gitingest/commit/d061b4877a253ba3f0480d329f025427c7f70177))
* serve cached digest if available ([#462](https://github.com/coderamp-labs/gitingest/issues/462)) ([efe5a26](https://github.com/coderamp-labs/gitingest/commit/efe5a2686142b5ee4984061ebcec23c3bf3495d5))
### Bug Fixes
* handle network errors gracefully in token count estimation ([#437](https://github.com/coderamp-labs/gitingest/issues/437)) ([5fbb445](https://github.com/coderamp-labs/gitingest/commit/5fbb445cd8725e56972f43ec8b5e12cb299e9e83))
* improved server side cleanup after ingest ([#477](https://github.com/coderamp-labs/gitingest/issues/477)) ([2df0eb4](https://github.com/coderamp-labs/gitingest/commit/2df0eb43989731ae40a9dd82d310ff76a794a46d))
### Documentation
* **contributing:** update PR title guidelines to enforce convention ([#476](https://github.com/coderamp-labs/gitingest/issues/476)) ([d1f8a80](https://github.com/coderamp-labs/gitingest/commit/d1f8a80826ca38ec105a1878742fe351d4939d6e))
## [0.2.1](https://github.com/coderamp-labs/gitingest/compare/v0.2.0...v0.2.1) (2025-07-27)
### Bug Fixes
* remove logarithm conversion from the backend and correctly process max file size in kb ([#464](https://github.com/coderamp-labs/gitingest/issues/464)) ([932bfef](https://github.com/coderamp-labs/gitingest/commit/932bfef85db66704985c83f3f7c427756bd14023))
## [0.2.0](https://github.com/coderamp-labs/gitingest/compare/v0.1.5...v0.2.0) (2025-07-26)
### Features
* `include_submodules` option ([#313](https://github.com/coderamp-labs/gitingest/issues/313)) ([38c2317](https://github.com/coderamp-labs/gitingest/commit/38c23171a14556a2cdd05c0af8219f4dc789defd))
* add Tailwind CSS pipeline, tag-aware cloning & overhaul CI/CD ([#352](https://github.com/coderamp-labs/gitingest/issues/352)) ([b683e59](https://github.com/coderamp-labs/gitingest/commit/b683e59b5b1a31d27cc5c6ce8fb62da9b660613b))
* add Tailwind CSS pipeline, tag-aware cloning & overhaul CI/CD ([#352](https://github.com/coderamp-labs/gitingest/issues/352)) ([016817d](https://github.com/coderamp-labs/gitingest/commit/016817d5590c1412498b7532f6e854d20239c6be))
* **ci:** build Docker Image on PRs ([#382](https://github.com/coderamp-labs/gitingest/issues/382)) ([bc8cdb4](https://github.com/coderamp-labs/gitingest/commit/bc8cdb459482948c27e780b733ac7216d822529a))
* implement prometheus exporter ([#406](https://github.com/coderamp-labs/gitingest/issues/406)) ([1016f6e](https://github.com/coderamp-labs/gitingest/commit/1016f6ecb3b1b066d541d1eba1ddffec49b15f16))
* implement S3 integration for storing and retrieving digest files ([#427](https://github.com/coderamp-labs/gitingest/issues/427)) ([414e851](https://github.com/coderamp-labs/gitingest/commit/414e85189fb9055491530ba8c0665c798474451e))
* integrate Sentry for error tracking and performance monitoring ([#408](https://github.com/coderamp-labs/gitingest/issues/408)) ([590e55a](https://github.com/coderamp-labs/gitingest/commit/590e55a4d28a4f5c0beafbd12c525828fa79e221))
* Refactor backend to a rest api ([#346](https://github.com/coderamp-labs/gitingest/issues/346)) ([2b1f228](https://github.com/coderamp-labs/gitingest/commit/2b1f228ae1f6d1f7ee471794d258b13fcac25a96))
* **ui:** add inline PAT info tooltip inside token field ([#348](https://github.com/coderamp-labs/gitingest/issues/348)) ([2592303](https://github.com/coderamp-labs/gitingest/commit/25923037ea6cd2f8ef33a6cf1f0406c2b4f0c9b6))
### Bug Fixes
* enable metrics if env var is defined instead of being "True" ([#407](https://github.com/coderamp-labs/gitingest/issues/407)) ([fa2e192](https://github.com/coderamp-labs/gitingest/commit/fa2e192c05864c8db90bda877e9efb9b03caf098))
* fix docker container not launching ([#449](https://github.com/coderamp-labs/gitingest/issues/449)) ([998cea1](https://github.com/coderamp-labs/gitingest/commit/998cea15b4f79c5d6f840b5d3d916f83c8be3a07))
* frontend directory tree ([#363](https://github.com/coderamp-labs/gitingest/issues/363)) ([0fcf8a9](https://github.com/coderamp-labs/gitingest/commit/0fcf8a956f7ec8403a025177f998f92ddee96de0))
* gitignore and gitingestignore files are now correctly processed … ([#416](https://github.com/coderamp-labs/gitingest/issues/416)) ([74e503f](https://github.com/coderamp-labs/gitingest/commit/74e503fa1140feb74aa5350a32f0025c43097da1))
* Potential fix for code scanning alert no. 75: Uncontrolled data used in path expression ([#421](https://github.com/coderamp-labs/gitingest/issues/421)) ([9ceaf6c](https://github.com/coderamp-labs/gitingest/commit/9ceaf6cbbb0cdefbc79f78c5285406b9188b2d3d))
* reset pattern form when switching between include/exclude patterns ([#417](https://github.com/coderamp-labs/gitingest/issues/417)) ([7085e13](https://github.com/coderamp-labs/gitingest/commit/7085e138a74099b1df189b3bf9b8a333c8769380))
* temp files cleanup after ingest([#309](https://github.com/coderamp-labs/gitingest/issues/309)) ([e669e44](https://github.com/coderamp-labs/gitingest/commit/e669e444fa1e6130f3f22952dd81f0ca3fe08fa5))
* **ui:** update layout in PAT section to avoid overlaps & overflows ([#331](https://github.com/coderamp-labs/gitingest/issues/331)) ([b39ef54](https://github.com/coderamp-labs/gitingest/commit/b39ef5416c1f8a7993a8249161d2a898b7387595))
* **windows:** warn if Git long path support is disabled, do not fail ([b8e375f](https://github.com/coderamp-labs/gitingest/commit/b8e375f71cae7d980cf431396c4414a6dbd0588c))
### Documentation
* add GitHub Issue Form for bug reports ([#403](https://github.com/coderamp-labs/gitingest/issues/403)) ([4546449](https://github.com/coderamp-labs/gitingest/commit/4546449bbc1e4a7ad0950c4b831b8855a98628fd))
* add GitHub Issue Form for feature requests ([#404](https://github.com/coderamp-labs/gitingest/issues/404)) ([9b1fc58](https://github.com/coderamp-labs/gitingest/commit/9b1fc58900ae18a3416fe3cf9b5e301a65a8e9fd))
* Fix CLI help text accuracy ([#332](https://github.com/coderamp-labs/gitingest/issues/332)) ([fdcbc53](https://github.com/coderamp-labs/gitingest/commit/fdcbc53cadde6a5dc3c3626120df1935b63693b2))
### Code Refactoring
* centralize PAT validation, streamline repo checks & misc cleanup ([#349](https://github.com/coderamp-labs/gitingest/issues/349)) ([cea0edd](https://github.com/coderamp-labs/gitingest/commit/cea0eddce8c6846bc6271cb3a8d15320e103214c))
* centralize PAT validation, streamline repo checks & misc cleanup ([#349](https://github.com/coderamp-labs/gitingest/issues/349)) ([f8d397e](https://github.com/coderamp-labs/gitingest/commit/f8d397e66e3382d12f8a0ed05d291a39db830bda))
================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
<romain@coderamp.io>.
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org),
version 2.0, available at
<https://www.contributor-covenant.org/version/2/0/code_of_conduct.html>.
Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).
For answers to common questions about this code of conduct, see the FAQ at
<https://www.contributor-covenant.org/faq>. Translations are available at
<https://www.contributor-covenant.org/translations>.
================================================
FILE: CONTRIBUTING.md
================================================
# Contributing to Gitingest
Thanks for your interest in contributing to **Gitingest** 🚀 Our goal is to keep the codebase friendly to first-time
contributors.
If you ever get stuck, reach out on [Discord](https://discord.com/invite/zerRaGK9EC).
---
## How to Contribute (non-technical)
- **Create an Issue** – found a bug or have a feature idea?
[Open an issue](https://github.com/coderamp-labs/gitingest/issues/new).
- **Spread the Word** – tweet, blog, or tell a friend.
- **Use Gitingest** – real-world usage gives the best feedback. File issues or ping us
on [Discord](https://discord.com/invite/zerRaGK9EC) with anything you notice.
---
## How to submit a Pull Request
> **Prerequisites**: The project uses **Python 3.9+** and `pre-commit` for development.
1. **Fork** the repository.
2. **Clone** your fork:
```bash
git clone https://github.com/coderamp-labs/gitingest.git
cd gitingest
```
3. **Set up the dev environment**:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,server]"
pre-commit install
```
4. **Create a branch** for your changes:
```bash
git checkout -b your-branch
```
5. **Make your changes** (and add tests when relevant).
6. **Stage** the changes:
```bash
git add .
```
7. **Run the backend test suite**:
```bash
pytest
```
8. *(Optional)* **Run `pre-commit` on all files** to check hooks without committing:
```bash
pre-commit run --all-files
```
9. **Run the local server** to sanity-check:
```bash
python -m server
```
Open [http://localhost:8000](http://localhost:8000) to confirm everything works.
10. **Commit** (signed):
```bash
git commit -S -m "Your commit message"
```
If *pre-commit* complains, fix the problems and repeat **5 – 9**.
11. **Push** your branch:
```bash
git push origin your-branch
```
12. **Open a pull request** on GitHub with a clear description.
> **Important:** Pull request titles **must follow
the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) specification**. This helps with
changelogs and automated releases.
13. **Iterate** on any review feedback—update your branch and repeat **6 – 11** as needed.
*(Optional) Invite a maintainer to your branch for easier collaboration.*
================================================
FILE: Dockerfile
================================================
# Stage 1: Install Python dependencies
FROM python:3.13.5-slim@sha256:4c2cf9917bd1cbacc5e9b07320025bdb7cdf2df7b0ceaccb55e9dd7e30987419 AS python-builder
WORKDIR /build
RUN set -eux; \
apt-get update; \
apt-get install -y --no-install-recommends gcc python3-dev; \
rm -rf /var/lib/apt/lists/*
COPY pyproject.toml .
COPY src/ ./src/
RUN set -eux; \
pip install --no-cache-dir --upgrade pip; \
pip install --no-cache-dir --timeout 1000 .[server,mcp]
# Stage 2: Runtime image
FROM python:3.13.5-slim@sha256:4c2cf9917bd1cbacc5e9b07320025bdb7cdf2df7b0ceaccb55e9dd7e30987419
ARG UID=1000
ARG GID=1000
ARG APP_REPOSITORY=https://github.com/coderamp-labs/gitingest
ARG APP_VERSION=unknown
ARG APP_VERSION_URL=https://github.com/coderamp-labs/gitingest
ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
APP_REPOSITORY=${APP_REPOSITORY} \
APP_VERSION=${APP_VERSION} \
APP_VERSION_URL=${APP_VERSION_URL}
RUN set -eux; \
apt-get update; \
apt-get install -y --no-install-recommends git curl; \
apt-get clean; \
rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN set -eux; \
groupadd -g "$GID" appuser; \
useradd -m -u "$UID" -g "$GID" appuser
COPY --from=python-builder --chown=$UID:$GID /usr/local/lib/python3.13/site-packages/ /usr/local/lib/python3.13/site-packages/
COPY --chown=$UID:$GID src/ ./
RUN set -eux; \
chown -R appuser:appuser /app
USER appuser
EXPOSE 8000
EXPOSE 9090
CMD ["python", "-m", "server"]
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2024 Romain Courtois
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Gitingest
[](https://gitingest.com)
<!-- Badges -->
<!-- markdownlint-disable MD033 -->
<p align="center">
<!-- row 1 — install & compat -->
<a href="https://pypi.org/project/gitingest"><img src="https://img.shields.io/pypi/v/gitingest.svg" alt="PyPI"></a>
<a href="https://pypi.org/project/gitingest"><img src="https://img.shields.io/pypi/pyversions/gitingest.svg" alt="Python Versions"></a>
<br>
<!-- row 2 — quality & community -->
<a href="https://github.com/coderamp-labs/gitingest/actions/workflows/ci.yml?query=branch%3Amain"><img src="https://github.com/coderamp-labs/gitingest/actions/workflows/ci.yml/badge.svg?branch=main" alt="CI"></a>
<a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff"></a>
<a href="https://scorecard.dev/viewer/?uri=github.com/coderamp-labs/gitingest"><img src="https://api.scorecard.dev/projects/github.com/coderamp-labs/gitingest/badge" alt="OpenSSF Scorecard"></a>
<br>
<a href="https://github.com/coderamp-labs/gitingest/blob/main/LICENSE"><img src="https://img.shields.io/github/license/coderamp-labs/gitingest.svg" alt="License"></a>
<a href="https://pepy.tech/project/gitingest"><img src="https://pepy.tech/badge/gitingest" alt="Downloads"></a>
<a href="https://github.com/coderamp-labs/gitingest"><img src="https://img.shields.io/github/stars/coderamp-labs/gitingest" alt="GitHub Stars"></a>
<a href="https://discord.com/invite/zerRaGK9EC"><img src="https://img.shields.io/badge/Discord-Join_chat-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
<br>
<a href="https://trendshift.io/repositories/13519"><img src="https://trendshift.io/api/badge/repositories/13519" alt="Trendshift" height="50"></a>
</p>
<!-- markdownlint-enable MD033 -->
Turn any Git repository into a prompt-friendly text ingest for LLMs.
You can also replace `hub` with `ingest` in any GitHub URL to access the corresponding digest.
<!-- Extensions -->
[gitingest.com](https://gitingest.com) · [Chrome Extension](https://chromewebstore.google.com/detail/adfjahbijlkjfoicpjkhjicpjpjfaood) · [Firefox Add-on](https://addons.mozilla.org/firefox/addon/gitingest)
<!-- Languages -->
[Deutsch](https://www.readme-i18n.com/coderamp-labs/gitingest?lang=de) |
[Español](https://www.readme-i18n.com/coderamp-labs/gitingest?lang=es) |
[Français](https://www.readme-i18n.com/coderamp-labs/gitingest?lang=fr) |
[日本語](https://www.readme-i18n.com/coderamp-labs/gitingest?lang=ja) |
[한국어](https://www.readme-i18n.com/coderamp-labs/gitingest?lang=ko) |
[Português](https://www.readme-i18n.com/coderamp-labs/gitingest?lang=pt) |
[Русский](https://www.readme-i18n.com/coderamp-labs/gitingest?lang=ru) |
[中文](https://www.readme-i18n.com/coderamp-labs/gitingest?lang=zh)
## 🚀 Features
- **Easy code context**: Get a text digest from a Git repository URL or a directory
- **Smart Formatting**: Optimized output format for LLM prompts
- **Statistics about**:
- File and directory structure
- Size of the extract
- Token count
- **CLI tool**: Run it as a shell command
- **Python package**: Import it in your code
## 📚 Requirements
- Python 3.8+
- For private repositories: A GitHub Personal Access Token (PAT). [Generate your token **here**!](https://github.com/settings/tokens/new?description=gitingest&scopes=repo)
### 📦 Installation
Gitingest is available on [PyPI](https://pypi.org/project/gitingest/).
You can install it using `pip`:
```bash
pip install gitingest
```
or
```bash
pip install gitingest[server]
```
to include server dependencies for self-hosting.
However, it might be a good idea to use `pipx` to install it.
You can install `pipx` using your preferred package manager.
```bash
brew install pipx
apt install pipx
scoop install pipx
...
```
If you are using pipx for the first time, run:
```bash
pipx ensurepath
```
```bash
# install gitingest
pipx install gitingest
```
## 🧩 Browser Extension Usage
<!-- markdownlint-disable MD033 -->
<a href="https://chromewebstore.google.com/detail/adfjahbijlkjfoicpjkhjicpjpjfaood" target="_blank" title="Get Gitingest Extension from Chrome Web Store"><img height="48" src="https://github.com/user-attachments/assets/20a6e44b-fd46-4e6c-8ea6-aad436035753" alt="Available in the Chrome Web Store" /></a>
<a href="https://addons.mozilla.org/firefox/addon/gitingest" target="_blank" title="Get Gitingest Extension from Firefox Add-ons"><img height="48" src="https://github.com/user-attachments/assets/c0e99e6b-97cf-4af2-9737-099db7d3538b" alt="Get The Add-on for Firefox" /></a>
<a href="https://microsoftedge.microsoft.com/addons/detail/nfobhllgcekbmpifkjlopfdfdmljmipf" target="_blank" title="Get Gitingest Extension from Microsoft Edge Add-ons"><img height="48" src="https://github.com/user-attachments/assets/204157eb-4cae-4c0e-b2cb-db514419fd9e" alt="Get from the Edge Add-ons" /></a>
<!-- markdownlint-enable MD033 -->
The extension is open source at [lcandy2/gitingest-extension](https://github.com/lcandy2/gitingest-extension).
Issues and feature requests are welcome to the repo.
## 💡 Command line usage
The `gitingest` command line tool allows you to analyze codebases and create a text dump of their contents.
```bash
# Basic usage (writes to digest.txt by default)
gitingest /path/to/directory
# From URL
gitingest https://github.com/coderamp-labs/gitingest
# or from specific subdirectory
gitingest https://github.com/coderamp-labs/gitingest/tree/main/src/gitingest/utils
```
For private repositories, use the `--token/-t` option.
```bash
# Get your token from https://github.com/settings/personal-access-tokens
gitingest https://github.com/username/private-repo --token github_pat_...
# Or set it as an environment variable
export GITHUB_TOKEN=github_pat_...
gitingest https://github.com/username/private-repo
# Include repository submodules
gitingest https://github.com/username/repo-with-submodules --include-submodules
```
By default, files listed in `.gitignore` are skipped. Use `--include-gitignored` if you
need those files in the digest.
By default, the digest is written to a text file (`digest.txt`) in your current working directory. You can customize the output in two ways:
- Use `--output/-o <filename>` to write to a specific file.
- Use `--output/-o -` to output directly to `STDOUT` (useful for piping to other tools).
See more options and usage details with:
```bash
gitingest --help
```
## 🐍 Python package usage
```python
# Synchronous usage
from gitingest import ingest
summary, tree, content = ingest("path/to/directory")
# or from URL
summary, tree, content = ingest("https://github.com/coderamp-labs/gitingest")
# or from a specific subdirectory
summary, tree, content = ingest("https://github.com/coderamp-labs/gitingest/tree/main/src/gitingest/utils")
```
For private repositories, you can pass a token:
```python
# Using token parameter
summary, tree, content = ingest("https://github.com/username/private-repo", token="github_pat_...")
# Or set it as an environment variable
import os
os.environ["GITHUB_TOKEN"] = "github_pat_..."
summary, tree, content = ingest("https://github.com/username/private-repo")
# Include repository submodules
summary, tree, content = ingest("https://github.com/username/repo-with-submodules", include_submodules=True)
```
By default, this won't write a file but can be enabled with the `output` argument.
```python
# Asynchronous usage
from gitingest import ingest_async
import asyncio
result = asyncio.run(ingest_async("path/to/directory"))
```
### Jupyter notebook usage
```python
from gitingest import ingest_async
# Use await directly in Jupyter
summary, tree, content = await ingest_async("path/to/directory")
```
This is because Jupyter notebooks are asynchronous by default.
## 🐳 Self-host
### Using Docker
1. Build the image:
``` bash
docker build -t gitingest .
```
2. Run the container:
``` bash
docker run -d --name gitingest -p 8000:8000 gitingest
```
The application will be available at `http://localhost:8000`.
If you are hosting it on a domain, you can specify the allowed hostnames via env variable `ALLOWED_HOSTS`.
```bash
# Default: "gitingest.com, *.gitingest.com, localhost, 127.0.0.1".
ALLOWED_HOSTS="example.com, localhost, 127.0.0.1"
```
### Environment Variables
The application can be configured using the following environment variables:
- **ALLOWED_HOSTS**: Comma-separated list of allowed hostnames (default: "gitingest.com, *.gitingest.com, localhost, 127.0.0.1")
- **GITINGEST_METRICS_ENABLED**: Enable Prometheus metrics server (set to any value to enable)
- **GITINGEST_METRICS_HOST**: Host for the metrics server (default: "127.0.0.1")
- **GITINGEST_METRICS_PORT**: Port for the metrics server (default: "9090")
- **GITINGEST_SENTRY_ENABLED**: Enable Sentry error tracking (set to any value to enable)
- **GITINGEST_SENTRY_DSN**: Sentry DSN (required if Sentry is enabled)
- **GITINGEST_SENTRY_TRACES_SAMPLE_RATE**: Sampling rate for performance data (default: "1.0", range: 0.0-1.0)
- **GITINGEST_SENTRY_PROFILE_SESSION_SAMPLE_RATE**: Sampling rate for profile sessions (default: "1.0", range: 0.0-1.0)
- **GITINGEST_SENTRY_PROFILE_LIFECYCLE**: Profile lifecycle mode (default: "trace")
- **GITINGEST_SENTRY_SEND_DEFAULT_PII**: Send default personally identifiable information (default: "true")
- **S3_ALIAS_HOST**: Public URL/CDN for accessing S3 resources (default: "127.0.0.1:9000/gitingest-bucket")
- **S3_DIRECTORY_PREFIX**: Optional prefix for S3 file paths (if set, prefixes all S3 paths with this value)
### Using Docker Compose
The project includes a `compose.yml` file that allows you to easily run the application in both development and production environments.
#### Compose File Structure
The `compose.yml` file uses YAML anchoring with `&app-base` and `<<: *app-base` to define common configuration that is shared between services:
```yaml
# Common base configuration for all services
x-app-base: &app-base
build:
context: .
dockerfile: Dockerfile
ports:
- "${APP_WEB_BIND:-8000}:8000" # Main application port
- "${GITINGEST_METRICS_HOST:-127.0.0.1}:${GITINGEST_METRICS_PORT:-9090}:9090" # Metrics port
# ... other common configurations
```
#### Services
The file defines three services:
1. **app**: Production service configuration
- Uses the `prod` profile
- Sets the Sentry environment to "production"
- Configured for stable operation with `restart: unless-stopped`
2. **app-dev**: Development service configuration
- Uses the `dev` profile
- Enables debug mode
- Mounts the source code for live development
- Uses hot reloading for faster development
3. **minio**: S3-compatible object storage for development
- Uses the `dev` profile (only available in development mode)
- Provides S3-compatible storage for local development
- Accessible via:
- API: Port 9000 ([localhost:9000](http://localhost:9000))
- Web Console: Port 9001 ([localhost:9001](http://localhost:9001))
- Default admin credentials:
- Username: `minioadmin`
- Password: `minioadmin`
- Configurable via environment variables:
- `MINIO_ROOT_USER`: Custom admin username (default: minioadmin)
- `MINIO_ROOT_PASSWORD`: Custom admin password (default: minioadmin)
- Includes persistent storage via Docker volume
- Auto-creates a bucket and application-specific credentials:
- Bucket name: `gitingest-bucket` (configurable via `S3_BUCKET_NAME`)
- Access key: `gitingest` (configurable via `S3_ACCESS_KEY`)
- Secret key: `gitingest123` (configurable via `S3_SECRET_KEY`)
- These credentials are automatically passed to the app-dev service via environment variables:
- `S3_ENDPOINT`: URL of the MinIO server
- `S3_ACCESS_KEY`: Access key for the S3 bucket
- `S3_SECRET_KEY`: Secret key for the S3 bucket
- `S3_BUCKET_NAME`: Name of the S3 bucket
- `S3_REGION`: Region for the S3 bucket (default: us-east-1)
- `S3_ALIAS_HOST`: Public URL/CDN for accessing S3 resources (default: "127.0.0.1:9000/gitingest-bucket")
#### Usage Examples
To run the application in development mode:
```bash
docker compose --profile dev up
```
To run the application in production mode:
```bash
docker compose --profile prod up -d
```
To build and run the application:
```bash
docker compose --profile prod build
docker compose --profile prod up -d
```
## 🤝 Contributing
### Non-technical ways to contribute
- **Create an Issue**: If you find a bug or have an idea for a new feature, please [create an issue](https://github.com/coderamp-labs/gitingest/issues/new) on GitHub. This will help us track and prioritize your request.
- **Spread the Word**: If you like Gitingest, please share it with your friends, colleagues, and on social media. This will help us grow the community and make Gitingest even better.
- **Use Gitingest**: The best feedback comes from real-world usage! If you encounter any issues or have ideas for improvement, please let us know by [creating an issue](https://github.com/coderamp-labs/gitingest/issues/new) on GitHub or by reaching out to us on [Discord](https://discord.com/invite/zerRaGK9EC).
### Technical ways to contribute
Gitingest aims to be friendly for first time contributors, with a simple Python and HTML codebase. If you need any help while working with the code, reach out to us on [Discord](https://discord.com/invite/zerRaGK9EC). For detailed instructions on how to make a pull request, see [CONTRIBUTING.md](./CONTRIBUTING.md).
## 🛠️ Stack
- [Tailwind CSS](https://tailwindcss.com) - Frontend
- [FastAPI](https://github.com/fastapi/fastapi) - Backend framework
- [Jinja2](https://jinja.palletsprojects.com) - HTML templating
- [tiktoken](https://github.com/openai/tiktoken) - Token estimation
- [posthog](https://github.com/PostHog/posthog) - Amazing analytics
- [Sentry](https://sentry.io) - Error tracking and performance monitoring
### Looking for a JavaScript/FileSystemNode package?
Check out the NPM alternative 📦 Repomix: <https://github.com/yamadashy/repomix>
## 🚀 Project Growth
[](https://star-history.com/#coderamp-labs/gitingest&Date)
================================================
FILE: SECURITY.md
================================================
# Security Policy
## Reporting a Vulnerability
If you have discovered a vulnerability inside the project, report it privately at <romain@coderamp.io>. This way the maintainer can work on a proper fix without disclosing the problem to the public before it has been solved.
================================================
FILE: compose.yml
================================================
x-base-environment: &base-environment
# Python Configuration
PYTHONUNBUFFERED: "1"
PYTHONDONTWRITEBYTECODE: "1"
# Host Configuration
ALLOWED_HOSTS: ${ALLOWED_HOSTS:-gitingest.com,*.gitingest.com,localhost,127.0.0.1}
# Metrics Configuration
GITINGEST_METRICS_ENABLED: ${GITINGEST_METRICS_ENABLED:-true}
GITINGEST_METRICS_HOST: ${GITINGEST_METRICS_HOST:-0.0.0.0}
GITINGEST_METRICS_PORT: ${GITINGEST_METRICS_PORT:-9090}
# Sentry Configuration
GITINGEST_SENTRY_ENABLED: ${GITINGEST_SENTRY_ENABLED:-false}
GITINGEST_SENTRY_DSN: ${GITINGEST_SENTRY_DSN:-}
GITINGEST_SENTRY_TRACES_SAMPLE_RATE: ${GITINGEST_SENTRY_TRACES_SAMPLE_RATE:-1.0}
GITINGEST_SENTRY_PROFILE_SESSION_SAMPLE_RATE: ${GITINGEST_SENTRY_PROFILE_SESSION_SAMPLE_RATE:-1.0}
GITINGEST_SENTRY_PROFILE_LIFECYCLE: ${GITINGEST_SENTRY_PROFILE_LIFECYCLE:-trace}
GITINGEST_SENTRY_SEND_DEFAULT_PII: ${GITINGEST_SENTRY_SEND_DEFAULT_PII:-true}
x-prod-environment: &prod-environment
GITINGEST_SENTRY_ENVIRONMENT: ${GITINGEST_SENTRY_ENVIRONMENT:-production}
x-dev-environment: &dev-environment
DEBUG: "true"
LOG_LEVEL: "DEBUG"
RELOAD: "true"
GITINGEST_SENTRY_ENVIRONMENT: ${GITINGEST_SENTRY_ENVIRONMENT:-development}
# S3 Configuration for development
S3_ENABLED: "true"
S3_ENDPOINT: http://minio:9000
S3_ACCESS_KEY: ${S3_ACCESS_KEY:-gitingest}
S3_SECRET_KEY: ${S3_SECRET_KEY:-gitingest123}
S3_BUCKET_NAME: ${S3_BUCKET_NAME:-gitingest-bucket}
S3_REGION: ${S3_REGION:-us-east-1}
S3_DIRECTORY_PREFIX: ${S3_DIRECTORY_PREFIX:-dev}
S3_ALIAS_HOST: ${S3_ALIAS_HOST:-http://127.0.0.1:9000/${S3_BUCKET_NAME:-gitingest-bucket}}
x-app-base: &app-base
ports:
- "${APP_WEB_BIND:-8000}:8000" # Main application port
- "${GITINGEST_METRICS_HOST:-127.0.0.1}:${GITINGEST_METRICS_PORT:-9090}:9090" # Metrics port
user: "1000:1000"
command: ["python", "-m", "server"]
services:
# Production service configuration
app:
<<: *app-base
image: ghcr.io/coderamp-labs/gitingest:latest
profiles:
- prod
environment:
<<: [*base-environment, *prod-environment]
restart: unless-stopped
# Development service configuration
app-dev:
<<: *app-base
build:
context: .
dockerfile: Dockerfile
profiles:
- dev
environment:
<<: [*base-environment, *dev-environment]
volumes:
# Mount source code for live development
- ./src:/app:ro
# Use --reload flag for hot reloading during development
command: ["python", "-m", "server"]
depends_on:
minio-setup:
condition: service_completed_successfully
# MinIO S3-compatible object storage for development
minio:
image: minio/minio:latest
profiles:
- dev
ports:
- "9000:9000" # API port
- "9001:9001" # Console port
environment: &minio-environment
MINIO_ROOT_USER: ${MINIO_ROOT_USER:-minioadmin}
MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD:-minioadmin}
volumes:
- minio-data:/data
command: server /data --console-address ":9001"
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
interval: 30s
timeout: 30s
start_period: 30s
start_interval: 1s
# MinIO setup service to create bucket and user
minio-setup:
image: minio/mc
profiles:
- dev
depends_on:
minio:
condition: service_healthy
environment:
<<: *minio-environment
S3_ACCESS_KEY: ${S3_ACCESS_KEY:-gitingest}
S3_SECRET_KEY: ${S3_SECRET_KEY:-gitingest123}
S3_BUCKET_NAME: ${S3_BUCKET_NAME:-gitingest-bucket}
volumes:
- ./.docker/minio/setup.sh:/setup.sh:ro
entrypoint: sh
command: -c /setup.sh
volumes:
minio-data:
driver: local
================================================
FILE: eslint.config.cjs
================================================
const js = require('@eslint/js');
const globals = require('globals');
const importPlugin = require('eslint-plugin-import');
module.exports = [
js.configs.recommended,
{
files: ['src/static/js/**/*.js'],
languageOptions: {
parserOptions: { ecmaVersion: 2021, sourceType: 'module' },
globals: {
...globals.browser,
changePattern: 'readonly',
copyFullDigest: 'readonly',
copyText: 'readonly',
downloadFullDigest: 'readonly',
handleSubmit: 'readonly',
posthog: 'readonly',
submitExample: 'readonly',
toggleAccessSettings: 'readonly',
toggleFile: 'readonly',
},
},
plugins: { import: importPlugin },
rules: {
// Import hygiene (eslint-plugin-import)
'import/no-extraneous-dependencies': 'error',
'import/no-unresolved': 'error',
'import/order': ['warn', { alphabetize: { order: 'asc' } }],
// Safety & bug-catchers
'consistent-return': 'error',
'default-case': 'error',
'no-implicit-globals': 'error',
'no-shadow': 'error',
// Maintainability / complexity
complexity: ['warn', 10],
'max-depth': ['warn', 4],
'max-lines': ['warn', 500],
'max-params': ['warn', 5],
// Stylistic consistency (auto-fixable)
'arrow-parens': ['error', 'always'],
curly: ['error', 'all'],
indent: ['error', 4, { SwitchCase: 2 }],
'newline-per-chained-call': ['warn', { ignoreChainWithDepth: 2 }],
'no-multi-spaces': 'error',
'object-shorthand': ['error', 'always'],
'padding-line-between-statements': [
'warn',
{ blankLine: 'always', prev: '*', next: 'return' },
{ blankLine: 'always', prev: ['const', 'let', 'var'], next: '*' },
{ blankLine: 'any', prev: ['const', 'let', 'var'], next: ['const', 'let', 'var'] },
],
'quote-props': ['error', 'consistent-as-needed'],
quotes: ['error', 'single', { avoidEscape: true }],
semi: 'error',
// Modern / performance tips
'arrow-body-style': ['warn', 'as-needed'],
'prefer-arrow-callback': 'error',
'prefer-exponentiation-operator': 'error',
'prefer-numeric-literals': 'error',
'prefer-object-has-own': 'warn',
'prefer-object-spread': 'error',
'prefer-template': 'error',
},
},
];
================================================
FILE: pyproject.toml
================================================
[project]
name = "gitingest"
version = "0.3.1"
description="CLI tool to analyze and create text dumps of codebases for LLMs"
readme = {file = "README.md", content-type = "text/markdown" }
requires-python = ">= 3.8"
dependencies = [
"click>=8.0.0",
"gitpython>=3.1.0",
"httpx",
"loguru>=0.7.0",
"pathspec>=0.12.1",
"pydantic",
"python-dotenv",
"starlette>=0.40.0", # Minimum safe release (https://osv.dev/vulnerability/GHSA-f96h-pmfr-66vw)
"strenum; python_version < '3.11'",
"tiktoken>=0.7.0", # Support for o200k_base encoding
"typing_extensions>= 4.0.0; python_version < '3.10'",
]
license = {file = "LICENSE"}
authors = [
{ name = "Romain Courtois", email = "romain@coderamp.io" },
{ name = "Filip Christiansen"},
]
classifiers=[
"Development Status :: 3 - Alpha",
"Intended Audience :: Developers",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
]
[project.optional-dependencies]
dev = [
"eval-type-backport",
"pre-commit",
"pytest",
"pytest-asyncio",
"pytest-mock",
]
server = [
"boto3>=1.28.0", # AWS SDK for S3 support
"fastapi[standard]>=0.109.1", # Minimum safe release (https://osv.dev/vulnerability/PYSEC-2024-38)
"prometheus-client",
"sentry-sdk[fastapi]",
"slowapi",
"uvicorn>=0.11.7", # Minimum safe release (https://osv.dev/vulnerability/PYSEC-2020-150)
]
[project.scripts]
gitingest = "gitingest.__main__:main"
[project.urls]
homepage = "https://gitingest.com"
github = "https://github.com/coderamp-labs/gitingest"
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"
[tool.setuptools]
packages = {find = {where = ["src"]}}
include-package-data = true
# Linting configuration
[tool.pylint.format]
max-line-length = 119
[tool.pylint.'MESSAGES CONTROL']
disable = [
"too-many-arguments",
"too-many-positional-arguments",
"too-many-locals",
"too-few-public-methods",
"broad-exception-caught",
"duplicate-code",
"fixme",
]
[tool.ruff]
line-length = 119
fix = true
[tool.ruff.lint]
select = ["ALL"]
ignore = [ # https://docs.astral.sh/ruff/rules/...
"D107", # undocumented-public-init
"FIX002", # line-contains-todo
"TD002", # missing-todo-author
"PLR0913", # too-many-arguments,
# TODO: fix the following issues:
"TD003", # missing-todo-link, TODO: add issue links
"S108", # hardcoded-temp-file, TODO: replace with tempfile
"BLE001", # blind-except, TODO: replace with specific exceptions
"FAST003", # fast-api-unused-path-parameter, TODO: fix
]
per-file-ignores = { "tests/**/*.py" = ["S101"] } # Skip the "assert used" warning
[tool.ruff.lint.pylint]
max-returns = 10
[tool.ruff.lint.isort]
order-by-type = true
case-sensitive = true
[tool.pycln]
all = true
# TODO: Remove this once we figure out how to use ruff-isort
[tool.isort]
profile = "black"
line_length = 119
remove_redundant_aliases = true
float_to_top = true # https://github.com/astral-sh/ruff/issues/6514
order_by_type = true
filter_files = true
# Test configuration
[tool.pytest.ini_options]
pythonpath = ["src"]
testpaths = ["tests/"]
python_files = "test_*.py"
asyncio_mode = "auto"
asyncio_default_fixture_loop_scope = "function"
python_classes = "Test*"
python_functions = "test_*"
================================================
FILE: release-please-config.json
================================================
{
"$schema": "https://raw.githubusercontent.com/googleapis/release-please/main/schemas/config.json",
"packages": {
".": {
"release-type": "python",
"bump-minor-pre-major": true
}
}
}
================================================
FILE: renovate.json
================================================
{
"$schema": "https://docs.renovatebot.com/renovate-schema.json",
"extends": [
"config:recommended"
]
}
================================================
FILE: requirements-dev.txt
================================================
-r requirements.txt
eval-type-backport
pre-commit
pytest
pytest-asyncio
pytest-cov
pytest-mock
================================================
FILE: requirements.txt
================================================
boto3>=1.28.0 # AWS SDK for S3 support
click>=8.0.0
fastapi[standard]>=0.109.1 # Vulnerable to https://osv.dev/vulnerability/PYSEC-2024-38
httpx
loguru>=0.7.0
pathspec>=0.12.1
prometheus-client
pydantic
python-dotenv
sentry-sdk[fastapi]
slowapi
starlette>=0.40.0 # Vulnerable to https://osv.dev/vulnerability/GHSA-f96h-pmfr-66vw
tiktoken>=0.7.0 # Support for o200k_base encoding
uvicorn>=0.11.7 # Vulnerable to https://osv.dev/vulnerability/PYSEC-2020-150
================================================
FILE: src/gitingest/__init__.py
================================================
"""Gitingest: A package for ingesting data from Git repositories."""
from gitingest.entrypoint import ingest, ingest_async
__all__ = ["ingest", "ingest_async"]
================================================
FILE: src/gitingest/__main__.py
================================================
"""Command-line interface (CLI) for Gitingest."""
# pylint: disable=no-value-for-parameter
from __future__ import annotations
import asyncio
from typing import TypedDict
import click
from typing_extensions import Unpack
from gitingest.config import MAX_FILE_SIZE, OUTPUT_FILE_NAME
from gitingest.entrypoint import ingest_async
# Import logging configuration first to intercept all logging
from gitingest.utils.logging_config import get_logger
# Initialize logger for this module
logger = get_logger(__name__)
class _CLIArgs(TypedDict):
source: str
max_size: int
exclude_pattern: tuple[str, ...]
include_pattern: tuple[str, ...]
branch: str | None
include_gitignored: bool
include_submodules: bool
token: str | None
output: str | None
@click.command()
@click.argument("source", type=str, default=".")
@click.option(
"--max-size",
"-s",
default=MAX_FILE_SIZE,
show_default=True,
help="Maximum file size to process in bytes",
)
@click.option("--exclude-pattern", "-e", multiple=True, help="Shell-style patterns to exclude.")
@click.option(
"--include-pattern",
"-i",
multiple=True,
help="Shell-style patterns to include.",
)
@click.option("--branch", "-b", default=None, help="Branch to clone and ingest")
@click.option(
"--include-gitignored",
is_flag=True,
default=False,
help="Include files matched by .gitignore and .gitingestignore",
)
@click.option(
"--include-submodules",
is_flag=True,
help="Include repository's submodules in the analysis",
default=False,
)
@click.option(
"--token",
"-t",
envvar="GITHUB_TOKEN",
default=None,
help=(
"GitHub personal access token (PAT) for accessing private repositories. "
"If omitted, the CLI will look for the GITHUB_TOKEN environment variable."
),
)
@click.option(
"--output",
"-o",
default=None,
help="Output file path (default: digest.txt in current directory). Use '-' for stdout.",
)
def main(**cli_kwargs: Unpack[_CLIArgs]) -> None:
"""Run the CLI entry point to analyze a repo / directory and dump its contents.
Parameters
----------
**cli_kwargs : Unpack[_CLIArgs]
A dictionary of keyword arguments forwarded to ``ingest_async``.
Notes
-----
See ``ingest_async`` for a detailed description of each argument.
Examples
--------
Basic usage:
$ gitingest
$ gitingest /path/to/repo
$ gitingest https://github.com/user/repo
Output to stdout:
$ gitingest -o -
$ gitingest https://github.com/user/repo --output -
With filtering:
$ gitingest -i "*.py" -e "*.log"
$ gitingest --include-pattern "*.js" --exclude-pattern "node_modules/*"
Private repositories:
$ gitingest https://github.com/user/private-repo -t ghp_token
$ GITHUB_TOKEN=ghp_token gitingest https://github.com/user/private-repo
Include submodules:
$ gitingest https://github.com/user/repo --include-submodules
"""
asyncio.run(_async_main(**cli_kwargs))
async def _async_main(
source: str,
*,
max_size: int = MAX_FILE_SIZE,
exclude_pattern: tuple[str, ...] | None = None,
include_pattern: tuple[str, ...] | None = None,
branch: str | None = None,
include_gitignored: bool = False,
include_submodules: bool = False,
token: str | None = None,
output: str | None = None,
) -> None:
"""Analyze a directory or repository and create a text dump of its contents.
This command scans the specified ``source`` (a local directory or Git repo),
applies custom include and exclude patterns, and generates a text summary of
the analysis. The summary is written to an output file or printed to ``stdout``.
Parameters
----------
source : str
A directory path or a Git repository URL.
max_size : int
Maximum file size in bytes to ingest (default: 10 MB).
exclude_pattern : tuple[str, ...] | None
Glob patterns for pruning the file set.
include_pattern : tuple[str, ...] | None
Glob patterns for including files in the output.
branch : str | None
Git branch to ingest. If ``None``, the repository's default branch is used.
include_gitignored : bool
If ``True``, also ingest files matched by ``.gitignore`` or ``.gitingestignore`` (default: ``False``).
include_submodules : bool
If ``True``, recursively include all Git submodules within the repository (default: ``False``).
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Can also be set via the ``GITHUB_TOKEN`` environment variable.
output : str | None
The path where the output file will be written (default: ``digest.txt`` in current directory).
Use ``"-"`` to write to ``stdout``.
Raises
------
click.Abort
Raised if an error occurs during execution and the command must be aborted.
"""
try:
# Normalise pattern containers (the ingest layer expects sets)
exclude_patterns = set(exclude_pattern) if exclude_pattern else set()
include_patterns = set(include_pattern) if include_pattern else set()
output_target = output if output is not None else OUTPUT_FILE_NAME
if output_target == "-":
click.echo("Analyzing source, preparing output for stdout...", err=True)
else:
click.echo(f"Analyzing source, output will be written to '{output_target}'...", err=True)
summary, _, _ = await ingest_async(
source,
max_file_size=max_size,
include_patterns=include_patterns,
exclude_patterns=exclude_patterns,
branch=branch,
include_gitignored=include_gitignored,
include_submodules=include_submodules,
token=token,
output=output_target,
)
except Exception as exc:
# Convert any exception into Click.Abort so that exit status is non-zero
click.echo(f"Error: {exc}", err=True)
raise click.Abort from exc
if output_target == "-": # stdout
click.echo("\n--- Summary ---", err=True)
click.echo(summary, err=True)
click.echo("--- End Summary ---", err=True)
click.echo("Analysis complete! Output sent to stdout.", err=True)
else: # file
click.echo(f"Analysis complete! Output written to: {output_target}")
click.echo("\nSummary:")
click.echo(summary)
if __name__ == "__main__":
main()
================================================
FILE: src/gitingest/clone.py
================================================
"""Module containing functions for cloning a Git repository to a local path."""
from __future__ import annotations
from pathlib import Path
from typing import TYPE_CHECKING
import git
from gitingest.config import DEFAULT_TIMEOUT
from gitingest.utils.git_utils import (
check_repo_exists,
checkout_partial_clone,
create_git_repo,
ensure_git_installed,
git_auth_context,
is_github_host,
resolve_commit,
)
from gitingest.utils.logging_config import get_logger
from gitingest.utils.os_utils import ensure_directory_exists_or_create
from gitingest.utils.timeout_wrapper import async_timeout
if TYPE_CHECKING:
from gitingest.schemas import CloneConfig
# Initialize logger for this module
logger = get_logger(__name__)
@async_timeout(DEFAULT_TIMEOUT)
async def clone_repo(config: CloneConfig, *, token: str | None = None) -> None:
"""Clone a repository to a local path based on the provided configuration.
This function handles the process of cloning a Git repository to the local file system.
It can clone a specific branch, tag, or commit if provided, and it raises exceptions if
any errors occur during the cloning process.
Parameters
----------
config : CloneConfig
The configuration for cloning the repository.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Raises
------
ValueError
If the repository is not found, if the provided URL is invalid, or if the token format is invalid.
RuntimeError
If Git operations fail during the cloning process.
"""
# Extract and validate query parameters
url: str = config.url
local_path: str = config.local_path
partial_clone: bool = config.subpath != "/"
logger.info(
"Starting git clone operation",
extra={
"url": url,
"local_path": local_path,
"partial_clone": partial_clone,
"subpath": config.subpath,
"branch": config.branch,
"tag": config.tag,
"commit": config.commit,
"include_submodules": config.include_submodules,
},
)
logger.debug("Ensuring git is installed")
await ensure_git_installed()
logger.debug("Creating local directory", extra={"parent_path": str(Path(local_path).parent)})
await ensure_directory_exists_or_create(Path(local_path).parent)
logger.debug("Checking if repository exists", extra={"url": url})
if not await check_repo_exists(url, token=token):
logger.error("Repository not found", extra={"url": url})
msg = "Repository not found. Make sure it is public or that you have provided a valid token."
raise ValueError(msg)
logger.debug("Resolving commit reference")
commit = await resolve_commit(config, token=token)
logger.debug("Resolved commit", extra={"commit": commit})
# Clone the repository using GitPython with proper authentication
logger.info("Executing git clone operation", extra={"url": "<redacted>", "local_path": local_path})
try:
clone_kwargs = {
"single_branch": True,
"no_checkout": True,
"depth": 1,
}
with git_auth_context(url, token) as (git_cmd, auth_url):
if partial_clone:
# For partial clones, use git.Git() with filter and sparse options
cmd_args = ["--single-branch", "--no-checkout", "--depth=1"]
cmd_args.extend(["--filter=blob:none", "--sparse"])
cmd_args.extend([auth_url, local_path])
git_cmd.clone(*cmd_args)
elif token and is_github_host(url):
# For authenticated GitHub repos, use git_cmd with auth URL
cmd_args = ["--single-branch", "--no-checkout", "--depth=1", auth_url, local_path]
git_cmd.clone(*cmd_args)
else:
# For non-authenticated repos, use the standard GitPython method
git.Repo.clone_from(url, local_path, **clone_kwargs)
logger.info("Git clone completed successfully")
except git.GitCommandError as exc:
msg = f"Git clone failed: {exc}"
raise RuntimeError(msg) from exc
# Checkout the subpath if it is a partial clone
if partial_clone:
logger.info("Setting up partial clone for subpath", extra={"subpath": config.subpath})
await checkout_partial_clone(config, token=token)
logger.debug("Partial clone setup completed")
# Perform post-clone operations
await _perform_post_clone_operations(config, local_path, url, token, commit)
logger.info("Git clone operation completed successfully", extra={"local_path": local_path})
async def _perform_post_clone_operations(
config: CloneConfig,
local_path: str,
url: str,
token: str | None,
commit: str,
) -> None:
"""Perform post-clone operations like fetching, checkout, and submodule updates.
Parameters
----------
config : CloneConfig
The configuration for cloning the repository.
local_path : str
The local path where the repository was cloned.
url : str
The repository URL.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
commit : str
The commit SHA to checkout.
Raises
------
RuntimeError
If any Git operation fails.
"""
try:
repo = create_git_repo(local_path, url, token)
# Ensure the commit is locally available
logger.debug("Fetching specific commit", extra={"commit": commit})
repo.git.fetch("--depth=1", "origin", commit)
# Write the work-tree at that commit
logger.info("Checking out commit", extra={"commit": commit})
repo.git.checkout(commit)
# Update submodules
if config.include_submodules:
logger.info("Updating submodules")
repo.git.submodule("update", "--init", "--recursive", "--depth=1")
logger.debug("Submodules updated successfully")
except git.GitCommandError as exc:
msg = f"Git operation failed: {exc}"
raise RuntimeError(msg) from exc
================================================
FILE: src/gitingest/config.py
================================================
"""Configuration file for the project."""
import tempfile
from pathlib import Path
MAX_FILE_SIZE = 10 * 1024 * 1024 # Maximum size of a single file to process (10 MB)
MAX_DIRECTORY_DEPTH = 20 # Maximum depth of directory traversal
MAX_FILES = 10_000 # Maximum number of files to process
MAX_TOTAL_SIZE_BYTES = 500 * 1024 * 1024 # Maximum size of output file (500 MB)
DEFAULT_TIMEOUT = 60 # seconds
OUTPUT_FILE_NAME = "digest.txt"
TMP_BASE_PATH = Path(tempfile.gettempdir()) / "gitingest"
================================================
FILE: src/gitingest/entrypoint.py
================================================
"""Main entry point for ingesting a source and processing its contents."""
from __future__ import annotations
import asyncio
import errno
import shutil
import stat
import sys
from contextlib import asynccontextmanager
from pathlib import Path
from typing import TYPE_CHECKING, AsyncGenerator, Callable
from urllib.parse import urlparse
from gitingest.clone import clone_repo
from gitingest.config import MAX_FILE_SIZE
from gitingest.ingestion import ingest_query
from gitingest.query_parser import parse_local_dir_path, parse_remote_repo
from gitingest.utils.auth import resolve_token
from gitingest.utils.compat_func import removesuffix
from gitingest.utils.ignore_patterns import load_ignore_patterns
from gitingest.utils.logging_config import get_logger
from gitingest.utils.pattern_utils import process_patterns
from gitingest.utils.query_parser_utils import KNOWN_GIT_HOSTS
if TYPE_CHECKING:
from types import TracebackType
from gitingest.schemas import IngestionQuery
# Initialize logger for this module
logger = get_logger(__name__)
async def ingest_async(
source: str,
*,
max_file_size: int = MAX_FILE_SIZE,
include_patterns: str | set[str] | None = None,
exclude_patterns: str | set[str] | None = None,
branch: str | None = None,
tag: str | None = None,
include_gitignored: bool = False,
include_submodules: bool = False,
token: str | None = None,
output: str | None = None,
) -> tuple[str, str, str]:
"""Ingest a source and process its contents.
This function analyzes a source (URL or local path), clones the corresponding repository (if applicable),
and processes its files according to the specified query parameters. It returns a summary, a tree-like
structure of the files, and the content of the files. The results can optionally be written to an output file.
Parameters
----------
source : str
The source to analyze, which can be a URL (for a Git repository) or a local directory path.
max_file_size : int
Maximum allowed file size for file ingestion. Files larger than this size are ignored (default: 10 MB).
include_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to include. If ``None``, all files are included.
exclude_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to exclude. If ``None``, no files are excluded.
branch : str | None
The branch to clone and ingest (default: the default branch).
tag : str | None
The tag to clone and ingest. If ``None``, no tag is used.
include_gitignored : bool
If ``True``, include files ignored by ``.gitignore`` and ``.gitingestignore`` (default: ``False``).
include_submodules : bool
If ``True``, recursively include all Git submodules within the repository (default: ``False``).
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Can also be set via the ``GITHUB_TOKEN`` environment variable.
output : str | None
File path where the summary and content should be written.
If ``"-"`` (dash), the results are written to ``stdout``.
If ``None``, the results are not written to a file.
Returns
-------
tuple[str, str, str]
A tuple containing:
- A summary string of the analyzed repository or directory.
- A tree-like string representation of the file structure.
- The content of the files in the repository or directory.
"""
logger.info("Starting ingestion process", extra={"source": source})
token = resolve_token(token)
source = removesuffix(source.strip(), ".git")
# Determine the parsing method based on the source type
if urlparse(source).scheme in ("https", "http") or any(h in source for h in KNOWN_GIT_HOSTS):
# We either have a full URL or a domain-less slug
logger.info("Parsing remote repository", extra={"source": source})
query = await parse_remote_repo(source, token=token)
query.include_submodules = include_submodules
_override_branch_and_tag(query, branch=branch, tag=tag)
else:
# Local path scenario
logger.info("Processing local directory", extra={"source": source})
query = parse_local_dir_path(source)
query.max_file_size = max_file_size
query.ignore_patterns, query.include_patterns = process_patterns(
exclude_patterns=exclude_patterns,
include_patterns=include_patterns,
)
if query.url:
_override_branch_and_tag(query, branch=branch, tag=tag)
query.include_submodules = include_submodules
logger.debug(
"Configuration completed",
extra={
"max_file_size": query.max_file_size,
"include_submodules": query.include_submodules,
"include_gitignored": include_gitignored,
"has_include_patterns": bool(query.include_patterns),
"has_exclude_patterns": bool(query.ignore_patterns),
},
)
async with _clone_repo_if_remote(query, token=token):
if query.url:
logger.info("Repository cloned, starting file processing")
else:
logger.info("Starting local directory processing")
if not include_gitignored:
logger.debug("Applying gitignore patterns")
_apply_gitignores(query)
logger.info("Processing files and generating output")
summary, tree, content = ingest_query(query)
if output:
logger.debug("Writing output to file", extra={"output_path": output})
await _write_output(tree, content=content, target=output)
logger.info("Ingestion completed successfully")
return summary, tree, content
def ingest(
source: str,
*,
max_file_size: int = MAX_FILE_SIZE,
include_patterns: str | set[str] | None = None,
exclude_patterns: str | set[str] | None = None,
branch: str | None = None,
tag: str | None = None,
include_gitignored: bool = False,
include_submodules: bool = False,
token: str | None = None,
output: str | None = None,
) -> tuple[str, str, str]:
"""Provide a synchronous wrapper around ``ingest_async``.
This function analyzes a source (URL or local path), clones the corresponding repository (if applicable),
and processes its files according to the specified query parameters. It returns a summary, a tree-like
structure of the files, and the content of the files. The results can optionally be written to an output file.
Parameters
----------
source : str
The source to analyze, which can be a URL (for a Git repository) or a local directory path.
max_file_size : int
Maximum allowed file size for file ingestion. Files larger than this size are ignored (default: 10 MB).
include_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to include. If ``None``, all files are included.
exclude_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to exclude. If ``None``, no files are excluded.
branch : str | None
The branch to clone and ingest (default: the default branch).
tag : str | None
The tag to clone and ingest. If ``None``, no tag is used.
include_gitignored : bool
If ``True``, include files ignored by ``.gitignore`` and ``.gitingestignore`` (default: ``False``).
include_submodules : bool
If ``True``, recursively include all Git submodules within the repository (default: ``False``).
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Can also be set via the ``GITHUB_TOKEN`` environment variable.
output : str | None
File path where the summary and content should be written.
If ``"-"`` (dash), the results are written to ``stdout``.
If ``None``, the results are not written to a file.
Returns
-------
tuple[str, str, str]
A tuple containing:
- A summary string of the analyzed repository or directory.
- A tree-like string representation of the file structure.
- The content of the files in the repository or directory.
See Also
--------
``ingest_async`` : The asynchronous version of this function.
"""
return asyncio.run(
ingest_async(
source=source,
max_file_size=max_file_size,
include_patterns=include_patterns,
exclude_patterns=exclude_patterns,
branch=branch,
tag=tag,
include_gitignored=include_gitignored,
include_submodules=include_submodules,
token=token,
output=output,
),
)
def _override_branch_and_tag(query: IngestionQuery, branch: str | None, tag: str | None) -> None:
"""Compare the caller-supplied ``branch`` and ``tag`` with the ones already in ``query``.
If they differ, update ``query`` to the chosen values and issue a warning.
If both are specified, the tag wins over the branch.
Parameters
----------
query : IngestionQuery
The query to update.
branch : str | None
The branch to use.
tag : str | None
The tag to use.
"""
if tag and query.tag and tag != query.tag:
msg = f"Warning: The specified tag '{tag}' overrides the tag found in the URL '{query.tag}'."
logger.warning(msg)
query.tag = tag or query.tag
if branch and query.branch and branch != query.branch:
msg = f"Warning: The specified branch '{branch}' overrides the branch found in the URL '{query.branch}'."
logger.warning(msg)
query.branch = branch or query.branch
if tag and branch:
msg = "Warning: Both tag and branch are specified. The tag will be used."
logger.warning(msg)
# Tag wins over branch if both supplied
if query.tag:
query.branch = None
def _apply_gitignores(query: IngestionQuery) -> None:
"""Update ``query.ignore_patterns`` in-place.
Parameters
----------
query : IngestionQuery
The query to update.
"""
for fname in (".gitignore", ".gitingestignore"):
query.ignore_patterns.update(load_ignore_patterns(query.local_path, filename=fname))
@asynccontextmanager
async def _clone_repo_if_remote(query: IngestionQuery, *, token: str | None) -> AsyncGenerator[None]:
"""Async context-manager that clones ``query.url`` if present.
If ``query.url`` is set, the repo is cloned, control is yielded, and the temp directory is removed on exit.
If no URL is given, the function simply yields immediately.
Parameters
----------
query : IngestionQuery
Parsed query describing the source to ingest.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
"""
kwargs = {}
if sys.version_info >= (3, 12):
kwargs["onexc"] = _handle_remove_readonly
else:
kwargs["onerror"] = _handle_remove_readonly
if query.url:
clone_config = query.extract_clone_config()
await clone_repo(clone_config, token=token)
try:
yield
finally:
shutil.rmtree(query.local_path.parent, **kwargs)
else:
yield
def _handle_remove_readonly(
func: Callable,
path: str,
exc_info: BaseException | tuple[type[BaseException], BaseException, TracebackType],
) -> None:
"""Handle permission errors raised by ``shutil.rmtree()``.
* Makes the target writable (removes the read-only attribute).
* Retries the original operation (``func``) once.
"""
# 'onerror' passes a (type, value, tb) tuple; 'onexc' passes the exception
if isinstance(exc_info, tuple): # 'onerror' (Python <3.12)
exc: BaseException = exc_info[1]
else: # 'onexc' (Python 3.12+)
exc = exc_info
# Handle only'Permission denied' and 'Operation not permitted'
if not isinstance(exc, OSError) or exc.errno not in {errno.EACCES, errno.EPERM}:
raise exc
# Make the target writable
Path(path).chmod(stat.S_IWRITE)
func(path)
async def _write_output(tree: str, content: str, target: str | None) -> None:
"""Write combined output to ``target`` (``"-"`` ⇒ stdout).
Parameters
----------
tree : str
The tree-like string representation of the file structure.
content : str
The content of the files in the repository or directory.
target : str | None
The path to the output file. If ``None``, the results are not written to a file.
"""
data = f"{tree}\n{content}"
loop = asyncio.get_running_loop()
if target == "-":
await loop.run_in_executor(None, sys.stdout.write, data)
await loop.run_in_executor(None, sys.stdout.flush)
elif target is not None:
await loop.run_in_executor(None, Path(target).write_text, data, "utf-8")
================================================
FILE: src/gitingest/ingestion.py
================================================
"""Functions to ingest and analyze a codebase directory or single file."""
from __future__ import annotations
from pathlib import Path
from typing import TYPE_CHECKING
from gitingest.config import MAX_DIRECTORY_DEPTH, MAX_FILES, MAX_TOTAL_SIZE_BYTES
from gitingest.output_formatter import format_node
from gitingest.schemas import FileSystemNode, FileSystemNodeType, FileSystemStats
from gitingest.utils.ingestion_utils import _should_exclude, _should_include
from gitingest.utils.logging_config import get_logger
if TYPE_CHECKING:
from gitingest.schemas import IngestionQuery
# Initialize logger for this module
logger = get_logger(__name__)
def ingest_query(query: IngestionQuery) -> tuple[str, str, str]:
"""Run the ingestion process for a parsed query.
This is the main entry point for analyzing a codebase directory or single file. It processes the query
parameters, reads the file or directory content, and generates a summary, directory structure, and file content,
along with token estimations.
Parameters
----------
query : IngestionQuery
The parsed query object containing information about the repository and query parameters.
Returns
-------
tuple[str, str, str]
A tuple containing the summary, directory structure, and file contents.
Raises
------
ValueError
If the path cannot be found, is not a file, or the file has no content.
"""
logger.info(
"Starting file ingestion",
extra={
"slug": query.slug,
"subpath": query.subpath,
"local_path": str(query.local_path),
"max_file_size": query.max_file_size,
},
)
subpath = Path(query.subpath.strip("/")).as_posix()
path = query.local_path / subpath
if not path.exists():
logger.error("Path not found", extra={"path": str(path), "slug": query.slug})
msg = f"{query.slug} cannot be found"
raise ValueError(msg)
if (query.type and query.type == "blob") or query.local_path.is_file():
# TODO: We do this wrong! We should still check the branch and commit!
logger.info("Processing single file", extra={"file_path": str(path)})
if not path.is_file():
logger.error("Expected file but found non-file", extra={"path": str(path)})
msg = f"Path {path} is not a file"
raise ValueError(msg)
relative_path = path.relative_to(query.local_path)
file_node = FileSystemNode(
name=path.name,
type=FileSystemNodeType.FILE,
size=path.stat().st_size,
file_count=1,
path_str=str(relative_path),
path=path,
)
if not file_node.content:
logger.error("File has no content", extra={"file_name": file_node.name})
msg = f"File {file_node.name} has no content"
raise ValueError(msg)
logger.info(
"Single file processing completed",
extra={
"file_name": file_node.name,
"file_size": file_node.size,
},
)
return format_node(file_node, query=query)
logger.info("Processing directory", extra={"directory_path": str(path)})
root_node = FileSystemNode(
name=path.name,
type=FileSystemNodeType.DIRECTORY,
path_str=str(path.relative_to(query.local_path)),
path=path,
)
stats = FileSystemStats()
_process_node(node=root_node, query=query, stats=stats)
logger.info(
"Directory processing completed",
extra={
"total_files": root_node.file_count,
"total_directories": root_node.dir_count,
"total_size_bytes": root_node.size,
"stats_total_files": stats.total_files,
"stats_total_size": stats.total_size,
},
)
return format_node(root_node, query=query)
def _process_node(node: FileSystemNode, query: IngestionQuery, stats: FileSystemStats) -> None:
"""Process a file or directory item within a directory.
This function handles each file or directory item, checking if it should be included or excluded based on the
provided patterns. It handles symlinks, directories, and files accordingly.
Parameters
----------
node : FileSystemNode
The current directory or file node being processed.
query : IngestionQuery
The parsed query object containing information about the repository and query parameters.
stats : FileSystemStats
Statistics tracking object for the total file count and size.
"""
if limit_exceeded(stats, depth=node.depth):
return
for sub_path in node.path.iterdir():
if query.ignore_patterns and _should_exclude(sub_path, query.local_path, query.ignore_patterns):
continue
if query.include_patterns and not _should_include(sub_path, query.local_path, query.include_patterns):
continue
if sub_path.is_symlink():
_process_symlink(path=sub_path, parent_node=node, stats=stats, local_path=query.local_path)
elif sub_path.is_file():
if sub_path.stat().st_size > query.max_file_size:
logger.debug(
"Skipping file: would exceed max file size limit",
extra={
"file_path": str(sub_path),
"file_size": sub_path.stat().st_size,
"max_file_size": query.max_file_size,
},
)
continue
_process_file(path=sub_path, parent_node=node, stats=stats, local_path=query.local_path)
elif sub_path.is_dir():
child_directory_node = FileSystemNode(
name=sub_path.name,
type=FileSystemNodeType.DIRECTORY,
path_str=str(sub_path.relative_to(query.local_path)),
path=sub_path,
depth=node.depth + 1,
)
_process_node(node=child_directory_node, query=query, stats=stats)
if not child_directory_node.children:
continue
node.children.append(child_directory_node)
node.size += child_directory_node.size
node.file_count += child_directory_node.file_count
node.dir_count += 1 + child_directory_node.dir_count
else:
logger.warning("Unknown file type, skipping", extra={"file_path": str(sub_path)})
node.sort_children()
def _process_symlink(path: Path, parent_node: FileSystemNode, stats: FileSystemStats, local_path: Path) -> None:
"""Process a symlink in the file system.
This function checks the symlink's target.
Parameters
----------
path : Path
The full path of the symlink.
parent_node : FileSystemNode
The parent directory node.
stats : FileSystemStats
Statistics tracking object for the total file count and size.
local_path : Path
The base path of the repository or directory being processed.
"""
child = FileSystemNode(
name=path.name,
type=FileSystemNodeType.SYMLINK,
path_str=str(path.relative_to(local_path)),
path=path,
depth=parent_node.depth + 1,
)
stats.total_files += 1
parent_node.children.append(child)
parent_node.file_count += 1
def _process_file(path: Path, parent_node: FileSystemNode, stats: FileSystemStats, local_path: Path) -> None:
"""Process a file in the file system.
This function checks the file's size, increments the statistics, and reads its content.
If the file size exceeds the maximum allowed, it raises an error.
Parameters
----------
path : Path
The full path of the file.
parent_node : FileSystemNode
The dictionary to accumulate the results.
stats : FileSystemStats
Statistics tracking object for the total file count and size.
local_path : Path
The base path of the repository or directory being processed.
"""
if stats.total_files + 1 > MAX_FILES:
logger.warning(
"Maximum file limit reached",
extra={
"current_files": stats.total_files,
"max_files": MAX_FILES,
"file_path": str(path),
},
)
return
file_size = path.stat().st_size
if stats.total_size + file_size > MAX_TOTAL_SIZE_BYTES:
logger.warning(
"Skipping file: would exceed total size limit",
extra={
"file_path": str(path),
"file_size": file_size,
"current_total_size": stats.total_size,
"max_total_size": MAX_TOTAL_SIZE_BYTES,
},
)
return
stats.total_files += 1
stats.total_size += file_size
child = FileSystemNode(
name=path.name,
type=FileSystemNodeType.FILE,
size=file_size,
file_count=1,
path_str=str(path.relative_to(local_path)),
path=path,
depth=parent_node.depth + 1,
)
parent_node.children.append(child)
parent_node.size += file_size
parent_node.file_count += 1
def limit_exceeded(stats: FileSystemStats, depth: int) -> bool:
"""Check if any of the traversal limits have been exceeded.
This function checks if the current traversal has exceeded any of the configured limits:
maximum directory depth, maximum number of files, or maximum total size in bytes.
Parameters
----------
stats : FileSystemStats
Statistics tracking object for the total file count and size.
depth : int
The current depth of directory traversal.
Returns
-------
bool
``True`` if any limit has been exceeded, ``False`` otherwise.
"""
if depth > MAX_DIRECTORY_DEPTH:
logger.warning(
"Maximum directory depth limit reached",
extra={
"current_depth": depth,
"max_depth": MAX_DIRECTORY_DEPTH,
},
)
return True
if stats.total_files >= MAX_FILES:
logger.warning(
"Maximum file limit reached",
extra={
"current_files": stats.total_files,
"max_files": MAX_FILES,
},
)
return True # TODO: end recursion
if stats.total_size >= MAX_TOTAL_SIZE_BYTES:
logger.warning(
"Maximum total size limit reached",
extra={
"current_size_mb": stats.total_size / 1024 / 1024,
"max_size_mb": MAX_TOTAL_SIZE_BYTES / 1024 / 1024,
},
)
return True # TODO: end recursion
return False
================================================
FILE: src/gitingest/output_formatter.py
================================================
"""Functions to ingest and analyze a codebase directory or single file."""
from __future__ import annotations
import ssl
from typing import TYPE_CHECKING
import requests.exceptions
import tiktoken
from gitingest.schemas import FileSystemNode, FileSystemNodeType
from gitingest.utils.compat_func import readlink
from gitingest.utils.logging_config import get_logger
if TYPE_CHECKING:
from gitingest.schemas import IngestionQuery
# Initialize logger for this module
logger = get_logger(__name__)
_TOKEN_THRESHOLDS: list[tuple[int, str]] = [
(1_000_000, "M"),
(1_000, "k"),
]
def format_node(node: FileSystemNode, query: IngestionQuery) -> tuple[str, str, str]:
"""Generate a summary, directory structure, and file contents for a given file system node.
If the node represents a directory, the function will recursively process its contents.
Parameters
----------
node : FileSystemNode
The file system node to be summarized.
query : IngestionQuery
The parsed query object containing information about the repository and query parameters.
Returns
-------
tuple[str, str, str]
A tuple containing the summary, directory structure, and file contents.
"""
is_single_file = node.type == FileSystemNodeType.FILE
summary = _create_summary_prefix(query, single_file=is_single_file)
if node.type == FileSystemNodeType.DIRECTORY:
summary += f"Files analyzed: {node.file_count}\n"
elif node.type == FileSystemNodeType.FILE:
summary += f"File: {node.name}\n"
summary += f"Lines: {len(node.content.splitlines()):,}\n"
tree = "Directory structure:\n" + _create_tree_structure(query, node=node)
content = _gather_file_contents(node)
token_estimate = _format_token_count(tree + content)
if token_estimate:
summary += f"\nEstimated tokens: {token_estimate}"
return summary, tree, content
def _create_summary_prefix(query: IngestionQuery, *, single_file: bool = False) -> str:
"""Create a prefix string for summarizing a repository or local directory.
Includes repository name (if provided), commit/branch details, and subpath if relevant.
Parameters
----------
query : IngestionQuery
The parsed query object containing information about the repository and query parameters.
single_file : bool
A flag indicating whether the summary is for a single file (default: ``False``).
Returns
-------
str
A summary prefix string containing repository, commit, branch, and subpath details.
"""
parts = []
if query.user_name:
parts.append(f"Repository: {query.user_name}/{query.repo_name}")
else:
# Local scenario
parts.append(f"Directory: {query.slug}")
if query.tag:
parts.append(f"Tag: {query.tag}")
elif query.branch and query.branch not in ("main", "master"):
parts.append(f"Branch: {query.branch}")
if query.commit:
parts.append(f"Commit: {query.commit}")
if query.subpath != "/" and not single_file:
parts.append(f"Subpath: {query.subpath}")
return "\n".join(parts) + "\n"
def _gather_file_contents(node: FileSystemNode) -> str:
"""Recursively gather contents of all files under the given node.
This function recursively processes a directory node and gathers the contents of all files
under that node. It returns the concatenated content of all files as a single string.
Parameters
----------
node : FileSystemNode
The current directory or file node being processed.
Returns
-------
str
The concatenated content of all files under the given node.
"""
if node.type != FileSystemNodeType.DIRECTORY:
return node.content_string
# Recursively gather contents of all files under the current directory
return "\n".join(_gather_file_contents(child) for child in node.children)
def _create_tree_structure(
query: IngestionQuery,
*,
node: FileSystemNode,
prefix: str = "",
is_last: bool = True,
) -> str:
"""Generate a tree-like string representation of the file structure.
This function generates a string representation of the directory structure, formatted
as a tree with appropriate indentation for nested directories and files.
Parameters
----------
query : IngestionQuery
The parsed query object containing information about the repository and query parameters.
node : FileSystemNode
The current directory or file node being processed.
prefix : str
A string used for indentation and formatting of the tree structure (default: ``""``).
is_last : bool
A flag indicating whether the current node is the last in its directory (default: ``True``).
Returns
-------
str
A string representing the directory structure formatted as a tree.
"""
if not node.name:
# If no name is present, use the slug as the top-level directory name
node.name = query.slug
tree_str = ""
current_prefix = "└── " if is_last else "├── "
# Indicate directories with a trailing slash
display_name = node.name
if node.type == FileSystemNodeType.DIRECTORY:
display_name += "/"
elif node.type == FileSystemNodeType.SYMLINK:
display_name += " -> " + readlink(node.path).name
tree_str += f"{prefix}{current_prefix}{display_name}\n"
if node.type == FileSystemNodeType.DIRECTORY and node.children:
prefix += " " if is_last else "│ "
for i, child in enumerate(node.children):
tree_str += _create_tree_structure(query, node=child, prefix=prefix, is_last=i == len(node.children) - 1)
return tree_str
def _format_token_count(text: str) -> str | None:
"""Return a human-readable token-count string (e.g. 1.2k, 1.2 M).
Parameters
----------
text : str
The text string for which the token count is to be estimated.
Returns
-------
str | None
The formatted number of tokens as a string (e.g., ``"1.2k"``, ``"1.2M"``), or ``None`` if an error occurs.
"""
try:
encoding = tiktoken.get_encoding("o200k_base") # gpt-4o, gpt-4o-mini
total_tokens = len(encoding.encode(text, disallowed_special=()))
except (ValueError, UnicodeEncodeError) as exc:
logger.warning("Failed to estimate token size", extra={"error": str(exc)})
return None
except (requests.exceptions.RequestException, ssl.SSLError) as exc:
# If network errors, skip token count estimation instead of erroring out
logger.warning("Failed to download tiktoken model", extra={"error": str(exc)})
return None
for threshold, suffix in _TOKEN_THRESHOLDS:
if total_tokens >= threshold:
return f"{total_tokens / threshold:.1f}{suffix}"
return str(total_tokens)
================================================
FILE: src/gitingest/query_parser.py
================================================
"""Module containing functions to parse and validate input sources and patterns."""
from __future__ import annotations
import uuid
from pathlib import Path
from typing import Literal
from gitingest.config import TMP_BASE_PATH
from gitingest.schemas import IngestionQuery
from gitingest.utils.git_utils import fetch_remote_branches_or_tags, resolve_commit
from gitingest.utils.logging_config import get_logger
from gitingest.utils.query_parser_utils import (
PathKind,
_fallback_to_root,
_get_user_and_repo_from_path,
_is_valid_git_commit_hash,
_normalise_source,
)
# Initialize logger for this module
logger = get_logger(__name__)
async def parse_remote_repo(source: str, token: str | None = None) -> IngestionQuery:
"""Parse a repository URL and return an ``IngestionQuery`` object.
If source is:
- A fully qualified URL ('https://gitlab.com/...'), parse & verify that domain
- A URL missing 'https://' ('gitlab.com/...'), add 'https://' and parse
- A *slug* ('pandas-dev/pandas'), attempt known domains until we find one that exists.
Parameters
----------
source : str
The URL or domain-less slug to parse.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Returns
-------
IngestionQuery
A dictionary containing the parsed details of the repository.
"""
parsed_url = await _normalise_source(source, token=token)
host = parsed_url.netloc
user, repo = _get_user_and_repo_from_path(parsed_url.path)
_id = uuid.uuid4()
slug = f"{user}-{repo}"
local_path = TMP_BASE_PATH / str(_id) / slug
url = f"https://{host}/{user}/{repo}"
query = IngestionQuery(
host=host,
user_name=user,
repo_name=repo,
url=url,
local_path=local_path,
slug=slug,
id=_id,
)
path_parts = parsed_url.path.strip("/").split("/")[2:]
# main branch
if not path_parts:
return await _fallback_to_root(query, token=token)
kind = PathKind(path_parts.pop(0)) # may raise ValueError
query.type = kind
# TODO: Handle issues and pull requests
if query.type in {PathKind.ISSUES, PathKind.PULL}:
msg = f"Warning: Issues and pull requests are not yet supported: {url}. Returning repository root."
return await _fallback_to_root(query, token=token, warn_msg=msg)
# If no extra path parts, just return
if not path_parts:
msg = f"Warning: No extra path parts: {url}. Returning repository root."
return await _fallback_to_root(query, token=token, warn_msg=msg)
if query.type not in {PathKind.TREE, PathKind.BLOB}:
# TODO: Handle other types
msg = f"Warning: Type '{query.type}' is not yet supported: {url}. Returning repository root."
return await _fallback_to_root(query, token=token, warn_msg=msg)
# Commit, branch, or tag
ref = path_parts[0]
if _is_valid_git_commit_hash(ref): # Commit
query.commit = ref
path_parts.pop(0) # Consume the commit hash
else: # Branch or tag
# Try to resolve a tag
query.tag = await _configure_branch_or_tag(
path_parts,
url=url,
ref_type="tags",
token=token,
)
# If no tag found, try to resolve a branch
if not query.tag:
query.branch = await _configure_branch_or_tag(
path_parts,
url=url,
ref_type="branches",
token=token,
)
# Only configure subpath if we have identified a commit, branch, or tag.
if path_parts and (query.commit or query.branch or query.tag):
query.subpath += "/".join(path_parts)
query.commit = await resolve_commit(query.extract_clone_config(), token=token)
return query
def parse_local_dir_path(path_str: str) -> IngestionQuery:
"""Parse the given file path into a structured query dictionary.
Parameters
----------
path_str : str
The file path to parse.
Returns
-------
IngestionQuery
A dictionary containing the parsed details of the file path.
"""
path_obj = Path(path_str).resolve()
slug = path_obj.name if path_str == "." else path_str.strip("/")
return IngestionQuery(local_path=path_obj, slug=slug, id=uuid.uuid4())
async def _configure_branch_or_tag(
path_parts: list[str],
*,
url: str,
ref_type: Literal["branches", "tags"],
token: str | None = None,
) -> str | None:
"""Configure the branch or tag based on the remaining parts of the URL.
Parameters
----------
path_parts : list[str]
The path parts of the URL.
url : str
The URL of the repository.
ref_type : Literal["branches", "tags"]
The type of reference to configure. Can be "branches" or "tags".
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Returns
-------
str | None
The branch or tag name if found, otherwise ``None``.
"""
_ref_type = "tags" if ref_type == "tags" else "branches"
try:
# Fetch the list of branches or tags from the remote repository
branches_or_tags: list[str] = await fetch_remote_branches_or_tags(url, ref_type=_ref_type, token=token)
except RuntimeError as exc:
# If remote discovery fails, we optimistically treat the first path segment as the branch/tag.
msg = f"Warning: Failed to fetch {_ref_type}: {exc}"
logger.warning(msg)
return path_parts.pop(0) if path_parts else None
# Iterate over the path components and try to find a matching branch/tag
candidate_parts: list[str] = []
for part in path_parts:
candidate_parts.append(part)
candidate_name = "/".join(candidate_parts)
if candidate_name in branches_or_tags:
# We found a match — now consume exactly the parts that form the branch/tag
del path_parts[: len(candidate_parts)]
return candidate_name
# No match found; leave path_parts intact
return None
================================================
FILE: src/gitingest/schemas/__init__.py
================================================
"""Module containing the schemas for the Gitingest package."""
from gitingest.schemas.cloning import CloneConfig
from gitingest.schemas.filesystem import FileSystemNode, FileSystemNodeType, FileSystemStats
from gitingest.schemas.ingestion import IngestionQuery
__all__ = ["CloneConfig", "FileSystemNode", "FileSystemNodeType", "FileSystemStats", "IngestionQuery"]
================================================
FILE: src/gitingest/schemas/cloning.py
================================================
"""Schema for the cloning process."""
from __future__ import annotations
from pydantic import BaseModel, Field
class CloneConfig(BaseModel): # pylint: disable=too-many-instance-attributes
"""Configuration for cloning a Git repository.
This model holds the necessary parameters for cloning a repository to a local path, including
the repository's URL, the target local path, and optional parameters for a specific commit, branch, or tag.
Attributes
----------
url : str
The URL of the Git repository to clone.
local_path : str
The local directory where the repository will be cloned.
commit : str | None
The specific commit hash to check out after cloning.
branch : str | None
The branch to clone.
tag : str | None
The tag to clone.
subpath : str
The subpath to clone from the repository (default: ``"/"``).
blob : bool
Whether the repository is a blob (default: ``False``).
include_submodules : bool
Whether to clone submodules (default: ``False``).
"""
url: str
local_path: str
commit: str | None = None
branch: str | None = None
tag: str | None = None
subpath: str = Field(default="/")
blob: bool = Field(default=False)
include_submodules: bool = Field(default=False)
================================================
FILE: src/gitingest/schemas/filesystem.py
================================================
"""Schema for the filesystem representation."""
from __future__ import annotations
import os
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import TYPE_CHECKING
from gitingest.utils.compat_func import readlink
from gitingest.utils.file_utils import _decodes, _get_preferred_encodings, _read_chunk
from gitingest.utils.notebook import process_notebook
if TYPE_CHECKING:
from pathlib import Path
SEPARATOR = "=" * 48 # Tiktoken, the tokenizer openai uses, counts 2 tokens if we have more than 48
class FileSystemNodeType(Enum):
"""Enum representing the type of a file system node (directory or file)."""
DIRECTORY = auto()
FILE = auto()
SYMLINK = auto()
@dataclass
class FileSystemStats:
"""Class for tracking statistics during file system traversal."""
total_files: int = 0
total_size: int = 0
@dataclass
class FileSystemNode: # pylint: disable=too-many-instance-attributes
"""Class representing a node in the file system (either a file or directory).
Tracks properties of files/directories for comprehensive analysis.
"""
name: str
type: FileSystemNodeType
path_str: str
path: Path
size: int = 0
file_count: int = 0
dir_count: int = 0
depth: int = 0
children: list[FileSystemNode] = field(default_factory=list)
def sort_children(self) -> None:
"""Sort the children nodes of a directory according to a specific order.
Order of sorting:
2. Regular files (not starting with dot)
3. Hidden files (starting with dot)
4. Regular directories (not starting with dot)
5. Hidden directories (starting with dot)
All groups are sorted alphanumerically within themselves.
Raises
------
ValueError
If the node is not a directory.
"""
if self.type != FileSystemNodeType.DIRECTORY:
msg = "Cannot sort children of a non-directory node"
raise ValueError(msg)
def _sort_key(child: FileSystemNode) -> tuple[int, str]:
# returns the priority order for the sort function, 0 is first
# Groups: 0=README, 1=regular file, 2=hidden file, 3=regular dir, 4=hidden dir
name = child.name.lower()
if child.type == FileSystemNodeType.FILE:
if name == "readme" or name.startswith("readme."):
return (0, name)
return (1 if not name.startswith(".") else 2, name)
return (3 if not name.startswith(".") else 4, name)
self.children.sort(key=_sort_key)
@property
def content_string(self) -> str:
"""Return the content of the node as a string, including path and content.
Returns
-------
str
A string representation of the node's content.
"""
parts = [
SEPARATOR,
f"{self.type.name}: {str(self.path_str).replace(os.sep, '/')}"
+ (f" -> {readlink(self.path).name}" if self.type == FileSystemNodeType.SYMLINK else ""),
SEPARATOR,
f"{self.content}",
]
return "\n".join(parts) + "\n\n"
@property
def content(self) -> str: # pylint: disable=too-many-return-statements
"""Return file content (if text / notebook) or an explanatory placeholder.
Heuristically decides whether the file is text or binary by decoding a small chunk of the file
with multiple encodings and checking for common binary markers.
Returns
-------
str
The content of the file, or an error message if the file could not be read.
Raises
------
ValueError
If the node is a directory.
"""
if self.type == FileSystemNodeType.DIRECTORY:
msg = "Cannot read content of a directory node"
raise ValueError(msg)
if self.type == FileSystemNodeType.SYMLINK:
return "" # TODO: are we including the empty content of symlinks?
if self.path.suffix == ".ipynb": # Notebook
try:
return process_notebook(self.path)
except Exception as exc:
return f"Error processing notebook: {exc}"
chunk = _read_chunk(self.path)
if chunk is None:
return "Error reading file"
if chunk == b"":
return "[Empty file]"
if not _decodes(chunk, "utf-8"):
return "[Binary file]"
# Find the first encoding that decodes the sample
good_enc: str | None = next(
(enc for enc in _get_preferred_encodings() if _decodes(chunk, encoding=enc)),
None,
)
if good_enc is None:
return "Error: Unable to decode file with available encodings"
try:
with self.path.open(encoding=good_enc) as fp:
return fp.read()
except (OSError, UnicodeDecodeError) as exc:
return f"Error reading file with {good_enc!r}: {exc}"
================================================
FILE: src/gitingest/schemas/ingestion.py
================================================
"""Module containing the dataclasses for the ingestion process."""
from __future__ import annotations
from pathlib import Path # noqa: TC003 (typing-only-standard-library-import) needed for type checking (pydantic)
from uuid import UUID # noqa: TC003 (typing-only-standard-library-import) needed for type checking (pydantic)
from pydantic import BaseModel, Field
from gitingest.config import MAX_FILE_SIZE
from gitingest.schemas.cloning import CloneConfig
class IngestionQuery(BaseModel): # pylint: disable=too-many-instance-attributes
"""Pydantic model to store the parsed details of the repository or file path.
Attributes
----------
host : str | None
The host of the repository.
user_name : str | None
The username or owner of the repository.
repo_name : str | None
The name of the repository.
local_path : Path
The local path to the repository or file.
url : str | None
The URL of the repository.
slug : str
The slug of the repository.
id : UUID
The ID of the repository.
subpath : str
The subpath to the repository or file (default: ``"/"``).
type : str | None
The type of the repository or file.
branch : str | None
The branch of the repository.
commit : str | None
The commit of the repository.
tag : str | None
The tag of the repository.
max_file_size : int
The maximum file size to ingest in bytes (default: 10 MB).
ignore_patterns : set[str]
The patterns to ignore (default: ``set()``).
include_patterns : set[str] | None
The patterns to include.
include_submodules : bool
Whether to include all Git submodules within the repository. (default: ``False``)
s3_url : str | None
The S3 URL where the digest is stored if S3 is enabled.
"""
host: str | None = None
user_name: str | None = None
repo_name: str | None = None
local_path: Path
url: str | None = None
slug: str
id: UUID
subpath: str = Field(default="/")
type: str | None = None
branch: str | None = None
commit: str | None = None
tag: str | None = None
max_file_size: int = Field(default=MAX_FILE_SIZE)
ignore_patterns: set[str] = Field(default_factory=set) # TODO: ssame type for ignore_* and include_* patterns
include_patterns: set[str] | None = None
include_submodules: bool = Field(default=False)
s3_url: str | None = None
def extract_clone_config(self) -> CloneConfig:
"""Extract the relevant fields for the CloneConfig object.
Returns
-------
CloneConfig
A CloneConfig object containing the relevant fields.
Raises
------
ValueError
If the ``url`` parameter is not provided.
"""
if not self.url:
msg = "The 'url' parameter is required."
raise ValueError(msg)
return CloneConfig(
url=self.url,
local_path=str(self.local_path),
commit=self.commit,
branch=self.branch,
tag=self.tag,
subpath=self.subpath,
blob=self.type == "blob",
include_submodules=self.include_submodules,
)
================================================
FILE: src/gitingest/utils/__init__.py
================================================
"""Utility functions for the gitingest package."""
================================================
FILE: src/gitingest/utils/auth.py
================================================
"""Utilities for handling authentication."""
from __future__ import annotations
import os
from gitingest.utils.git_utils import validate_github_token
def resolve_token(token: str | None) -> str | None:
"""Resolve the token to use for the query.
Parameters
----------
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Returns
-------
str | None
The resolved token.
"""
token = token or os.getenv("GITHUB_TOKEN")
if token:
validate_github_token(token)
return token
================================================
FILE: src/gitingest/utils/compat_func.py
================================================
"""Compatibility functions for Python 3.8."""
import os
from pathlib import Path
def readlink(path: Path) -> Path:
"""Read the target of a symlink.
Compatible with Python 3.8.
Parameters
----------
path : Path
Path to the symlink.
Returns
-------
Path
The target of the symlink.
"""
return Path(os.readlink(path))
def removesuffix(s: str, suffix: str) -> str:
"""Remove a suffix from a string.
Compatible with Python 3.8.
Parameters
----------
s : str
String to remove suffix from.
suffix : str
Suffix to remove.
Returns
-------
str
String with suffix removed.
"""
return s[: -len(suffix)] if s.endswith(suffix) else s
================================================
FILE: src/gitingest/utils/compat_typing.py
================================================
"""Compatibility layer for typing."""
try:
from enum import StrEnum # type: ignore[attr-defined] # Py ≥ 3.11
except ImportError:
from strenum import StrEnum # type: ignore[import-untyped] # Py ≤ 3.10
try:
from typing import ParamSpec, TypeAlias # type: ignore[attr-defined] # Py ≥ 3.10
except ImportError:
from typing_extensions import ParamSpec, TypeAlias # type: ignore[attr-defined] # Py ≤ 3.9
try:
from typing import Annotated # type: ignore[attr-defined] # Py ≥ 3.9
except ImportError:
from typing_extensions import Annotated # type: ignore[attr-defined] # Py ≤ 3.8
__all__ = ["Annotated", "ParamSpec", "StrEnum", "TypeAlias"]
================================================
FILE: src/gitingest/utils/exceptions.py
================================================
"""Custom exceptions for the Gitingest package."""
class AsyncTimeoutError(Exception):
"""Exception raised when an async operation exceeds its timeout limit.
This exception is used by the ``async_timeout`` decorator to signal that the wrapped
asynchronous function has exceeded the specified time limit for execution.
"""
class InvalidNotebookError(Exception):
"""Exception raised when a Jupyter notebook is invalid or cannot be processed."""
def __init__(self, message: str) -> None:
super().__init__(message)
class InvalidGitHubTokenError(ValueError):
"""Exception raised when a GitHub Personal Access Token is malformed."""
def __init__(self) -> None:
msg = (
"Invalid GitHub token format. To generate a token, go to "
"https://github.com/settings/tokens/new?description=gitingest&scopes=repo."
)
super().__init__(msg)
================================================
FILE: src/gitingest/utils/file_utils.py
================================================
"""Utility functions for working with files and directories."""
from __future__ import annotations
import locale
import platform
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from pathlib import Path
try:
locale.setlocale(locale.LC_ALL, "")
except locale.Error:
locale.setlocale(locale.LC_ALL, "C")
_CHUNK_SIZE = 1024 # bytes
def _get_preferred_encodings() -> list[str]:
"""Get list of encodings to try, prioritized for the current platform.
Returns
-------
list[str]
List of encoding names to try in priority order, starting with the
platform's default encoding followed by common fallback encodings.
"""
encodings = [locale.getpreferredencoding(), "utf-8", "utf-16", "utf-16le", "utf-8-sig", "latin"]
if platform.system() == "Windows":
encodings += ["cp1252", "iso-8859-1"]
return list(dict.fromkeys(encodings))
def _read_chunk(path: Path) -> bytes | None:
"""Attempt to read the first *size* bytes of *path* in binary mode.
Parameters
----------
path : Path
The path to the file to read.
Returns
-------
bytes | None
The first ``_CHUNK_SIZE`` bytes of ``path``, or ``None`` on any ``OSError``.
"""
try:
with path.open("rb") as fp:
return fp.read(_CHUNK_SIZE)
except OSError:
return None
def _decodes(chunk: bytes, encoding: str) -> bool:
"""Return ``True`` if ``chunk`` decodes cleanly with ``encoding``.
Parameters
----------
chunk : bytes
The chunk of bytes to decode.
encoding : str
The encoding to use to decode the chunk.
Returns
-------
bool
``True`` if the chunk decodes cleanly with the encoding, ``False`` otherwise.
"""
try:
chunk.decode(encoding)
except UnicodeDecodeError:
return False
return True
================================================
FILE: src/gitingest/utils/git_utils.py
================================================
"""Utility functions for interacting with Git repositories."""
from __future__ import annotations
import asyncio
import base64
import re
import sys
from contextlib import contextmanager
from pathlib import Path
from typing import TYPE_CHECKING, Final, Generator, Iterable
from urllib.parse import urlparse, urlunparse
import git
from gitingest.utils.compat_func import removesuffix
from gitingest.utils.exceptions import InvalidGitHubTokenError
from gitingest.utils.logging_config import get_logger
if TYPE_CHECKING:
from gitingest.schemas import CloneConfig
# Initialize logger for this module
logger = get_logger(__name__)
# GitHub Personal-Access tokens (classic + fine-grained).
# - ghp_ / gho_ / ghu_ / ghs_ / ghr_ → 36 alphanumerics
# - github_pat_ → 22 alphanumerics + "_" + 59 alphanumerics
_GITHUB_PAT_PATTERN: Final[str] = r"^(?:gh[pousr]_[A-Za-z0-9]{36}|github_pat_[A-Za-z0-9]{22}_[A-Za-z0-9]{59})$"
def is_github_host(url: str) -> bool:
"""Check if a URL is from a GitHub host (github.com or GitHub Enterprise).
Parameters
----------
url : str
The URL to check
Returns
-------
bool
True if the URL is from a GitHub host, False otherwise
"""
hostname = urlparse(url).hostname or ""
return hostname.startswith("github.")
async def run_command(*args: str) -> tuple[bytes, bytes]:
"""Execute a shell command asynchronously and return (stdout, stderr) bytes.
This function is kept for backward compatibility with non-git commands.
Git operations should use GitPython directly.
Parameters
----------
*args : str
The command and its arguments to execute.
Returns
-------
tuple[bytes, bytes]
A tuple containing the stdout and stderr of the command.
Raises
------
RuntimeError
If command exits with a non-zero status.
"""
# Execute the requested command
proc = await asyncio.create_subprocess_exec(
*args,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
stdout, stderr = await proc.communicate()
if proc.returncode != 0:
msg = f"Command failed: {' '.join(args)}\nError: {stderr.decode().strip()}"
raise RuntimeError(msg)
return stdout, stderr
async def ensure_git_installed() -> None:
"""Ensure Git is installed and accessible on the system.
On Windows, this also checks whether Git is configured to support long file paths.
Raises
------
RuntimeError
If Git is not installed or not accessible.
"""
try:
# Use GitPython to check git availability
git_cmd = git.Git()
git_cmd.version()
except git.GitCommandError as exc:
msg = "Git is not installed or not accessible. Please install Git first."
raise RuntimeError(msg) from exc
except Exception as exc:
msg = "Git is not installed or not accessible. Please install Git first."
raise RuntimeError(msg) from exc
if sys.platform == "win32":
try:
longpaths_value = git_cmd.config("core.longpaths")
if longpaths_value.lower() != "true":
logger.warning(
"Git clone may fail on Windows due to long file paths. "
"Consider enabling long path support with: 'git config --global core.longpaths true'. "
"Note: This command may require administrator privileges.",
extra={"platform": "windows", "longpaths_enabled": False},
)
except git.GitCommandError:
# Ignore if checking 'core.longpaths' fails.
pass
async def check_repo_exists(url: str, token: str | None = None) -> bool:
"""Check whether a remote Git repository is reachable.
Parameters
----------
url : str
URL of the Git repository to check.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Returns
-------
bool
``True`` if the repository exists, ``False`` otherwise.
"""
try:
# Try to resolve HEAD - if repo exists, this will work
await _resolve_ref_to_sha(url, "HEAD", token=token)
except (ValueError, Exception):
# Repository doesn't exist, is private without proper auth, or other error
return False
return True
def _parse_github_url(url: str) -> tuple[str, str, str]:
"""Parse a GitHub URL and return (hostname, owner, repo).
Parameters
----------
url : str
The URL of the GitHub repository to parse.
Returns
-------
tuple[str, str, str]
A tuple containing the hostname, owner, and repository name.
Raises
------
ValueError
If the URL is not a valid GitHub repository URL.
"""
parsed = urlparse(url)
if parsed.scheme not in {"http", "https"}:
msg = f"URL must start with http:// or https://: {url!r}"
raise ValueError(msg)
if not parsed.hostname or not parsed.hostname.startswith("github."):
msg = f"Un-recognised GitHub hostname: {parsed.hostname!r}"
raise ValueError(msg)
parts = removesuffix(parsed.path, ".git").strip("/").split("/")
expected_path_length = 2
if len(parts) != expected_path_length:
msg = f"Path must look like /<owner>/<repo>: {parsed.path!r}"
raise ValueError(msg)
owner, repo = parts
return parsed.hostname, owner, repo
async def fetch_remote_branches_or_tags(url: str, *, ref_type: str, token: str | None = None) -> list[str]:
"""Fetch the list of branches or tags from a remote Git repository.
Parameters
----------
url : str
The URL of the Git repository to fetch branches or tags from.
ref_type: str
The type of reference to fetch. Can be "branches" or "tags".
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Returns
-------
list[str]
A list of branch names available in the remote repository.
Raises
------
ValueError
If the ``ref_type`` parameter is not "branches" or "tags".
RuntimeError
If fetching branches or tags from the remote repository fails.
"""
if ref_type not in ("branches", "tags"):
msg = f"Invalid fetch type: {ref_type}"
raise ValueError(msg)
await ensure_git_installed()
# Use GitPython to get remote references
try:
fetch_tags = ref_type == "tags"
to_fetch = "tags" if fetch_tags else "heads"
# Build ls-remote command
cmd_args = [f"--{to_fetch}"]
if fetch_tags:
cmd_args.append("--refs") # Filter out peeled tag objects
cmd_args.append(url)
# Run the command with proper authentication
with git_auth_context(url, token) as (git_cmd, auth_url):
# Replace the URL in cmd_args with the authenticated URL
cmd_args[-1] = auth_url # URL is the last argument
output = git_cmd.ls_remote(*cmd_args)
# Parse output
return [
line.split(f"refs/{to_fetch}/", 1)[1]
for line in output.splitlines()
if line.strip() and f"refs/{to_fetch}/" in line
]
except git.GitCommandError as exc:
msg = f"Failed to fetch {ref_type} from {url}: {exc}"
raise RuntimeError(msg) from exc
def create_git_repo(local_path: str, url: str, token: str | None = None) -> git.Repo:
"""Create a GitPython Repo object with authentication if needed.
Parameters
----------
local_path : str
The local path where the git repository is located.
url : str
The repository URL to check if it's a GitHub repository.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Returns
-------
git.Repo
A GitPython Repo object configured with authentication.
Raises
------
ValueError
If the local path is not a valid git repository.
"""
try:
repo = git.Repo(local_path)
# Configure authentication if needed
if token and is_github_host(url):
auth_header = create_git_auth_header(token, url=url)
# Set the auth header in git config for this repo
key, value = auth_header.split("=", 1)
repo.git.config(key, value)
except git.InvalidGitRepositoryError as exc:
msg = f"Invalid git repository at {local_path}"
raise ValueError(msg) from exc
return repo
def create_git_auth_header(token: str, url: str = "https://github.com") -> str:
"""Create a Basic authentication header for GitHub git operations.
Parameters
----------
token : str
GitHub personal access token (PAT) for accessing private repositories.
url : str
The GitHub URL to create the authentication header for.
Defaults to "https://github.com" if not provided.
Returns
-------
str
The git config command for setting the authentication header.
Raises
------
ValueError
If the URL is not a valid GitHub repository URL.
"""
hostname = urlparse(url).hostname
if not hostname:
msg = f"Invalid GitHub URL: {url!r}"
raise ValueError(msg)
basic = base64.b64encode(f"x-oauth-basic:{token}".encode()).decode()
return f"http.https://{hostname}/.extraheader=Authorization: Basic {basic}"
def create_authenticated_url(url: str, token: str | None = None) -> str:
"""Create an authenticated URL for Git operations.
This is the safest approach for multi-user environments - no global state.
Parameters
----------
url : str
The repository URL.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Returns
-------
str
The URL with authentication embedded (for GitHub) or original URL.
"""
if not (token and is_github_host(url)):
return url
parsed = urlparse(url)
# Add token as username in URL (GitHub supports this)
netloc = f"x-oauth-basic:{token}@{parsed.hostname}"
if parsed.port:
netloc += f":{parsed.port}"
return urlunparse(
(
parsed.scheme,
netloc,
parsed.path,
parsed.params,
parsed.query,
parsed.fragment,
),
)
@contextmanager
def git_auth_context(url: str, token: str | None = None) -> Generator[tuple[git.Git, str]]:
"""Context manager that provides Git command and authenticated URL.
Returns both a Git command object and the authenticated URL to use.
This avoids any global state contamination between users.
Parameters
----------
url : str
The repository URL to check if authentication is needed.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Yields
------
Generator[tuple[git.Git, str]]
Tuple of (Git command object, authenticated URL to use).
"""
git_cmd = git.Git()
auth_url = create_authenticated_url(url, token)
yield git_cmd, auth_url
def validate_github_token(token: str) -> None:
"""Validate the format of a GitHub Personal Access Token.
Parameters
----------
token : str
GitHub personal access token (PAT) for accessing private repositories.
Raises
------
InvalidGitHubTokenError
If the token format is invalid.
"""
if not re.fullmatch(_GITHUB_PAT_PATTERN, token):
raise InvalidGitHubTokenError
async def checkout_partial_clone(config: CloneConfig, token: str | None) -> None:
"""Configure sparse-checkout for a partially cloned repository.
Parameters
----------
config : CloneConfig
The configuration for cloning the repository, including subpath and blob flag.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Raises
------
RuntimeError
If the sparse-checkout configuration fails.
"""
subpath = config.subpath.lstrip("/")
if config.blob:
# Remove the file name from the subpath when ingesting from a file url (e.g. blob/branch/path/file.txt)
subpath = str(Path(subpath).parent.as_posix())
try:
repo = create_git_repo(config.local_path, config.url, token)
repo.git.sparse_checkout("set", subpath)
except git.GitCommandError as exc:
msg = f"Failed to configure sparse-checkout: {exc}"
raise RuntimeError(msg) from exc
async def resolve_commit(config: CloneConfig, token: str | None) -> str:
"""Resolve the commit to use for the clone.
Parameters
----------
config : CloneConfig
The configuration for cloning the repository.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Returns
-------
str
The commit SHA.
"""
if config.commit:
commit = config.commit
elif config.tag:
commit = await _resolve_ref_to_sha(config.url, pattern=f"refs/tags/{config.tag}*", token=token)
elif config.branch:
commit = await _resolve_ref_to_sha(config.url, pattern=f"refs/heads/{config.branch}", token=token)
else:
commit = await _resolve_ref_to_sha(config.url, pattern="HEAD", token=token)
return commit
async def _resolve_ref_to_sha(url: str, pattern: str, token: str | None = None) -> str:
"""Return the commit SHA that <kind>/<ref> points to in <url>.
* Branch → first line from ``git ls-remote``.
* Tag → if annotated, prefer the peeled ``^{}`` line (commit).
Parameters
----------
url : str
The URL of the remote repository.
pattern : str
The pattern to use to resolve the commit SHA.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Returns
-------
str
The commit SHA.
Raises
------
ValueError
If the ref does not exist in the remote repository.
"""
try:
# Execute ls-remote command with proper authentication
with git_auth_context(url, token) as (git_cmd, auth_url):
output = git_cmd.ls_remote(auth_url, pattern)
lines = output.splitlines()
sha = _pick_commit_sha(lines)
if not sha:
msg = f"{pattern!r} not found in {url}"
raise ValueError(msg)
except git.GitCommandError as exc:
msg = f"Failed to resolve {pattern} in {url}:\n{exc}"
raise ValueError(msg) from exc
return sha
def _pick_commit_sha(lines: Iterable[str]) -> str | None:
"""Return a commit SHA from ``git ls-remote`` output.
• Annotated tag → prefer the peeled line (<sha> refs/tags/x^{})
• Branch / lightweight tag → first non-peeled line
Parameters
----------
lines : Iterable[str]
The lines of a ``git ls-remote`` output.
Returns
-------
str | None
The commit SHA, or ``None`` if no commit SHA is found.
"""
first_non_peeled: str | None = None
for ln in lines:
if not ln.strip():
continue
sha, ref = ln.split(maxsplit=1)
if ref.endswith("^{}"): # peeled commit of annotated tag
return sha # ← best match, done
if first_non_peeled is None: # remember the first ordinary line
first_non_peeled = sha
return first_non_peeled # branch or lightweight tag (or None)
================================================
FILE: src/gitingest/utils/ignore_patterns.py
================================================
"""Default ignore patterns for Gitingest."""
from __future__ import annotations
from pathlib import Path
DEFAULT_IGNORE_PATTERNS: set[str] = {
# Python
"*.pyc",
"*.pyo",
"*.pyd",
"__pycache__",
".pytest_cache",
".coverage",
".tox",
".nox",
".mypy_cache",
".ruff_cache",
".hypothesis",
"poetry.lock",
"Pipfile.lock",
# JavaScript/FileSystemNode
"node_modules",
"bower_components",
"package-lock.json",
"yarn.lock",
".npm",
".yarn",
".pnpm-store",
"bun.lock",
"bun.lockb",
# Java
"*.class",
"*.jar",
"*.war",
"*.ear",
"*.nar",
".gradle/",
"build/",
".settings/",
".classpath",
"gradle-app.setting",
"*.gradle",
# IDEs and editors / Java
".project",
# C/C++
"*.o",
"*.obj",
"*.dll",
"*.dylib",
"*.exe",
"*.lib",
"*.out",
"*.a",
"*.pdb",
# Binary
"*.bin",
# Swift/Xcode
".build/",
"*.xcodeproj/",
"*.xcworkspace/",
"*.pbxuser",
"*.mode1v3",
"*.mode2v3",
"*.perspectivev3",
"*.xcuserstate",
"xcuserdata/",
".swiftpm/",
# Ruby
"*.gem",
".bundle/",
"vendor/bundle",
"Gemfile.lock",
".ruby-version",
".ruby-gemset",
".rvmrc",
# Rust
"Cargo.lock",
"**/*.rs.bk",
# Java / Rust
"target/",
# Go
"pkg/",
# .NET/C#
"obj/",
"*.suo",
"*.user",
"*.userosscache",
"*.sln.docstates",
"*.nupkg",
# Go / .NET / C#
"bin/",
# Version control
".git",
".svn",
".hg",
".gitignore",
".gitattributes",
".gitmodules",
# Images and media
"*.svg",
"*.png",
"*.jpg",
"*.jpeg",
"*.gif",
"*.ico",
"*.pdf",
"*.mov",
"*.mp4",
"*.mp3",
"*.wav",
# Virtual environments
"venv",
".venv",
"env",
".env",
"virtualenv",
# IDEs and editors
".idea",
".vscode",
".vs",
"*.swo",
"*.swn",
".settings",
"*.sublime-*",
# Temporary and cache files
"*.log",
"*.bak",
"*.swp",
"*.tmp",
"*.temp",
".cache",
".sass-cache",
".eslintcache",
".DS_Store",
"Thumbs.db",
"desktop.ini",
# Build directories and artifacts
"build",
"dist",
"target",
"out",
"*.egg-info",
"*.egg",
"*.whl",
"*.so",
# Documentation
"site-packages",
".docusaurus",
".next",
".nuxt",
# Database
"*.db",
"*.sqlite",
"*.sqlite3",
# Other common patterns
## Minified files
"*.min.js",
"*.min.css",
## Source maps
"*.map",
## Terraform
"*.tfstate*",
## Dependencies in various languages
"vendor/",
# Gitingest
"digest.txt",
}
def load_ignore_patterns(root: Path, filename: str) -> set[str]:
"""Load ignore patterns from ``filename`` found under ``root``.
The loader walks the directory tree, looks for the supplied ``filename``,
and returns a unified set of patterns. It implements the same parsing rules
we use for ``.gitignore`` and ``.gitingestignore`` (git-wildmatch syntax with
support for negation and root-relative paths).
Parameters
----------
root : Path
Directory to walk.
filename : str
The filename to look for in each directory.
Returns
-------
set[str]
A set of ignore patterns extracted from the ``filename`` file found under the ``root`` directory.
"""
patterns: set[str] = set()
for ignore_file in root.rglob(filename):
if ignore_file.is_file():
patterns.update(_parse_ignore_file(ignore_file, root))
return patterns
def _parse_ignore_file(ignore_file: Path, root: Path) -> set[str]:
"""Parse an ignore file and return a set of ignore patterns.
Parameters
----------
ignore_file : Path
The path to the ignore file.
root : Path
The root directory of the repository.
Returns
-------
set[str]
A set of ignore patterns.
"""
patterns: set[str] = set()
# Path of the ignore file relative to the repository root
rel_dir = ignore_file.parent.relative_to(root)
base_dir = Path() if rel_dir == Path() else rel_dir
with ignore_file.open(encoding="utf-8") as fh:
for raw in fh:
line = raw.strip()
if not line or line.startswith("#"): # comments / blank lines
continue
# Handle negation ("!foobar")
negated = line.startswith("!")
if negated:
line = line[1:]
# Handle leading slash ("/foobar")
if line.startswith("/"):
line = line.lstrip("/")
pattern_body = (base_dir / line).as_posix()
patterns.add(f"!{pattern_body}" if negated else pattern_body)
return patterns
================================================
FILE: src/gitingest/utils/ingestion_utils.py
================================================
"""Utility functions for the ingestion process."""
from __future__ import annotations
from typing import TYPE_CHECKING
from pathspec import PathSpec
if TYPE_CHECKING:
from pathlib import Path
def _should_include(path: Path, base_path: Path, include_patterns: set[str]) -> bool:
"""Return ``True`` if ``path`` matches any of ``include_patterns``.
Parameters
----------
path : Path
The absolute path of the file or directory to check.
base_path : Path
The base directory from which the relative path is calculated.
include_patterns : set[str]
A set of patterns to check against the relative path.
Returns
-------
bool
``True`` if the path matches any of the include patterns, ``False`` otherwise.
"""
rel_path = _relative_or_none(path, base_path)
if rel_path is None: # outside repo → do *not* include
return False
if path.is_dir(): # keep directories so children are visited
return True
spec = PathSpec.from_lines("gitwildmatch", include_patterns)
return spec.match_file(str(rel_path))
def _should_exclude(path: Path, base_path: Path, ignore_patterns: set[str]) -> bool:
"""Return ``True`` if ``path`` matches any of ``ignore_patterns``.
Parameters
----------
path : Path
The absolute path of the file or directory to check.
base_path : Path
The base directory from which the relative path is calculated.
ignore_patterns : set[str]
A set of patterns to check against the relative path.
Returns
-------
bool
``True`` if the path matches any of the ignore patterns, ``False`` otherwise.
"""
rel_path = _relative_or_none(path, base_path)
if rel_path is None: # outside repo → already "excluded"
return True
spec = PathSpec.from_lines("gitwildmatch", ignore_patterns)
return spec.match_file(str(rel_path))
def _relative_or_none(path: Path, base: Path) -> Path | None:
"""Return *path* relative to *base* or ``None`` if *path* is outside *base*.
Parameters
----------
path : Path
The absolute path of the file or directory to check.
base : Path
The base directory from which the relative path is calculated.
Returns
-------
Path | None
The relative path of ``path`` to ``base``, or ``None`` if ``path`` is outside ``base``.
"""
try:
return path.relative_to(base)
except ValueError: # path is not a sub-path of base
return None
================================================
FILE: src/gitingest/utils/logging_config.py
================================================
"""Logging configuration for gitingest using loguru.
This module provides structured JSON logging suitable for Kubernetes deployments
while also supporting human-readable logging for development.
"""
from __future__ import annotations
import json
import logging
import os
import sys
from typing import Any
from loguru import logger
def json_sink(message: Any) -> None: # noqa: ANN401
"""Create JSON formatted log output.
Parameters
----------
message : Any
The loguru message record
"""
record = message.record
log_entry = {
"timestamp": record["time"].isoformat(),
"level": record["level"].name.upper(),
"logger": record["name"],
"module": record["module"],
"function": record["function"],
"line": record["line"],
"message": record["message"],
}
# Add exception info if present
if record["exception"]:
log_entry["exception"] = {
"type": record["exception"].type.__name__,
"value": str(record["exception"].value),
"traceback": record["exception"].traceback,
}
# Add extra fields if present
if record["extra"]:
log_entry.update(record["extra"])
sys.stdout.write(json.dumps(log_entry, ensure_ascii=False, separators=(",", ":")) + "\n")
def format_extra_fields(record: dict) -> str:
"""Format extra fields as JSON string.
Parameters
----------
record : dict
The loguru record dictionary
Returns
-------
str
JSON formatted extra fields or empty string
"""
if not record.get("extra"):
return ""
# Filter out loguru's internal extra fields
filtered_extra = {k: v for k, v in record["extra"].items() if not k.startswith("_") and k not in ["name"]}
# Handle nested extra structure - if there's an 'extra' key, use its contents
if "extra" in filtered_extra and isinstance(filtered_extra["extra"], dict):
filtered_extra = filtered_extra["extra"]
if filtered_extra:
extra_json = json.dumps(filtered_extra, ensure_ascii=False, separators=(",", ":"))
return f" | {extra_json}"
return ""
def extra_filter(record: dict) -> dict:
"""Filter function to add extra fields to the message.
Parameters
----------
record : dict
The loguru record dictionary
Returns
-------
dict
Modified record with extra fields appended to message
"""
extra_str = format_extra_fields(record)
if extra_str:
record["message"] = record["message"] + extra_str
return record
class InterceptHandler(logging.Handler):
"""Intercept standard library logging and redirect to loguru."""
def emit(self, record: logging.LogRecord) -> None:
"""Emit a record to loguru."""
# Get corresponding loguru level
try:
level = logger.level(record.levelname).name
except ValueError:
level = record.levelno
# Find caller from where originated the logged message
frame, depth = logging.currentframe(), 2
while frame.f_code.co_filename == logging.__file__:
frame = frame.f_back
depth += 1
logger.opt(depth=depth, exception=record.exc_info).log(
level,
record.getMessage(),
)
def configure_logging() -> None:
"""Configure loguru for the application.
Sets up JSON logging for production/Kubernetes environments
or human-readable logging for development.
Intercepts all standard library logging including uvicorn.
"""
# Remove default handler
logger.remove()
# Check if we're in Kubernetes or production environment
is_k8s = os.getenv("KUBERNETES_SERVICE_HOST") is not None
log_format = os.getenv("LOG_FORMAT", "json" if is_k8s else "human")
log_level = os.getenv("LOG_LEVEL", "INFO")
if log_format.lower() == "json":
# JSON format for structured logging (Kubernetes/production)
logger.add(
json_sink,
level=log_level,
enqueue=True, # Async logging for better performance
diagnose=False, # Don't include variable values in exceptions (security)
backtrace=True, # Include full traceback
serialize=True, # Ensure proper serialization
)
else:
# Human-readable format for development
logger_format = (
"<green>{time:YYYY-MM-DD HH:mm:ss.SSS}</green> | "
"<level>{level: <8}</level> | "
"<cyan>{name}</cyan>:<cyan>{function}</cyan>:<cyan>{line}</cyan> | "
"{message}"
)
logger.add(
sys.stderr,
format=logger_format,
filter=extra_filter,
level=log_level,
enqueue=True,
diagnose=True, # Include variable values in development
backtrace=True,
)
# Intercept all standard library logging
logging.basicConfig(handlers=[InterceptHandler()], level=0, force=True)
# Intercept specific loggers that might bypass basicConfig
for name in logging.root.manager.loggerDict: # pylint: disable=no-member
logging.getLogger(name).handlers = []
logging.getLogger(name).propagate = True
def get_logger(name: str | None = None) -> logger.__class__:
"""Get a configured logger instance.
Parameters
----------
name : str | None, optional
Logger name, defaults to the calling module name
Returns
-------
logger.__class__
Configured logger instance
"""
if name:
return logger.bind(name=name)
return logger
# Initialize logging when module is imported
configure_logging()
================================================
FILE: src/gitingest/utils/notebook.py
================================================
"""Utilities for processing Jupyter notebooks."""
from __future__ import annotations
import json
from itertools import chain
from typing import TYPE_CHECKING, Any
from gitingest.utils.exceptions import InvalidNotebookError
from gitingest.utils.logging_config import get_logger
if TYPE_CHECKING:
from pathlib import Path
# Initialize logger for this module
logger = get_logger(__name__)
def process_notebook(file: Path, *, include_output: bool = True) -> str:
"""Process a Jupyter notebook file and return an executable Python script as a string.
Parameters
----------
file : Path
The path to the Jupyter notebook file.
include_output : bool
Whether to include cell outputs in the generated script (default: ``True``).
Returns
-------
str
The executable Python script as a string.
Raises
------
InvalidNotebookError
If the notebook file is invalid or cannot be processed.
"""
try:
with file.open(encoding="utf-8") as f:
notebook: dict[str, Any] = json.load(f)
except json.JSONDecodeError as exc:
msg = f"Invalid JSON in notebook: {file}"
raise InvalidNotebookError(msg) from exc
# Check if the notebook contains worksheets
worksheets = notebook.get("worksheets")
if worksheets:
logger.warning(
"Worksheets are deprecated as of IPEP-17. Consider updating the notebook. "
"(See: https://github.com/jupyter/nbformat and "
"https://github.com/ipython/ipython/wiki/IPEP-17:-Notebook-Format-4#remove-multiple-worksheets "
"for more information.)",
)
if len(worksheets) > 1:
logger.warning(
"Multiple worksheets detected. Combining all worksheets into a single script.",
)
cells = list(chain.from_iterable(ws["cells"] for ws in worksheets))
else:
cells = notebook["cells"]
result = ["# Jupyter notebook converted to Python script."]
for cell in cells:
cell_str = _process_cell(cell, include_output=include_output)
if cell_str:
result.append(cell_str)
return "\n\n".join(result) + "\n"
def _process_cell(cell: dict[str, Any], *, include_output: bool) -> str | None:
"""Process a Jupyter notebook cell and return the cell content as a string.
Parameters
----------
cell : dict[str, Any]
The cell dictionary from a Jupyter notebook.
include_output : bool
Whether to include cell outputs in the generated script.
Returns
-------
str | None
The cell content as a string, or ``None`` if the cell is empty.
Raises
------
ValueError
If an unexpected cell type is encountered.
"""
cell_type = cell["cell_type"]
# Validate cell type and handle unexpected types
if cell_type not in ("markdown", "code", "raw"):
msg = f"Unknown cell type: {cell_type}"
raise ValueError(msg)
cell_str = "".join(cell["source"])
# Skip empty cells
if not cell_str:
return None
# Convert Markdown and raw cells to multi-line comments
if cell_type in ("markdown", "raw"):
return f'"""\n{cell_str}\n"""'
# Add cell output as comments
outputs = cell.get("outputs")
if include_output and outputs:
# Include cell outputs as comments
raw_lines: list[str] = []
for output in outputs:
raw_lines += _extract_output(output)
cell_str += "\n# Output:\n# " + "\n# ".join(raw_lines)
return cell_str
def _extract_output(output: dict[str, Any]) -> list[str]:
"""Extract the output from a Jupyter notebook cell.
Parameters
----------
output : dict[str, Any]
The output dictionary from a Jupyter notebook cell.
Returns
-------
list[str]
The output as a list of strings.
Raises
------
ValueError
If an unknown output type is encountered.
"""
output_type = output["output_type"]
if output_type == "stream":
return output["text"]
if output_type in ("execute_result", "display_data"):
return output["data"]["text/plain"]
if output_type == "error":
return [f"Error: {output['ename']}: {output['evalue']}"]
msg = f"Unknown output type: {output_type}"
raise ValueError(msg)
================================================
FILE: src/gitingest/utils/os_utils.py
================================================
"""Utility functions for working with the operating system."""
from pathlib import Path
async def ensure_directory_exists_or_create(path: Path) -> None:
"""Ensure the directory exists, creating it if necessary.
Parameters
----------
path : Path
The path to ensure exists.
Raises
------
OSError
If the directory cannot be created.
"""
try:
path.mkdir(parents=True, exist_ok=True)
except OSError as exc:
msg = f"Failed to create directory {path}: {exc}"
raise OSError(msg) from exc
================================================
FILE: src/gitingest/utils/pattern_utils.py
================================================
"""Pattern utilities for the Gitingest package."""
from __future__ import annotations
import re
from typing import Iterable
from gitingest.utils.ignore_patterns import DEFAULT_IGNORE_PATTERNS
_PATTERN_SPLIT_RE = re.compile(r"[,\s]+")
def process_patterns(
exclude_patterns: str | set[str] | None = None,
include_patterns: str | set[str] | None = None,
) -> tuple[set[str], set[str] | None]:
"""Process include and exclude patterns.
Parameters
----------
exclude_patterns : str | set[str] | None
Exclude patterns to process.
include_patterns : str | set[str] | None
Include patterns to process.
Returns
-------
tuple[set[str], set[str] | None]
A tuple containing the processed ignore patterns and include patterns.
"""
# Combine default ignore patterns + custom patterns
ignore_patterns_set = DEFAULT_IGNORE_PATTERNS.copy()
if exclude_patterns:
ignore_patterns_set.update(_parse_patterns(exclude_patterns))
# Process include patterns and override ignore patterns accordingly
if include_patterns:
parsed_include = _parse_patterns(include_patterns)
# Override ignore patterns with include patterns
ignore_patterns_set = set(ignore_patterns_set) - set(parsed_include)
else:
parsed_include = None
return ignore_patterns_set, parsed_include
def _parse_patterns(patterns: str | Iterable[str]) -> set[str]:
"""Normalize a collection of file or directory patterns.
Parameters
----------
patterns : str | Iterable[str]
One pattern string or an iterable of pattern strings. Each pattern may contain multiple comma- or
whitespace-separated sub-patterns, e.g. "src/*, tests *.md".
Returns
-------
set[str]
Normalized patterns with Windows back-slashes converted to forward-slashes and duplicates removed.
"""
# Treat a lone string as the iterable [string]
if isinstance(patterns, str):
patterns = [patterns]
# Flatten, split on commas/whitespace, strip empties, normalise slashes
return {
part.replace("\\", "/")
for pat in patterns
for part in _PATTERN_SPLIT_RE.split(pat.strip())
if part # discard empty tokens
}
================================================
FILE: src/gitingest/utils/query_parser_utils.py
================================================
"""Utility functions for parsing and validating query parameters."""
from __future__ import annotations
import string
from typing import TYPE_CHECKING, cast
from urllib.parse import ParseResult, unquote, urlparse
from gitingest.utils.compat_typing import StrEnum
from gitingest.utils.git_utils import _resolve_ref_to_sha, check_repo_exists
from gitingest.utils.logging_config import get_logger
if TYPE_CHECKING:
from gitingest.schemas import IngestionQuery
# Initialize logger for this module
logger = get_logger(__name__)
HEX_DIGITS: set[str] = set(string.hexdigits)
KNOWN_GIT_HOSTS: list[str] = [
"github.com",
"gitlab.com",
"bitbucket.org",
"gitea.com",
"codeberg.org",
"gist.github.com",
]
class PathKind(StrEnum):
"""Path kind enum."""
TREE = "tree"
BLOB = "blob"
ISSUES = "issues"
PULL = "pull"
async def _fallback_to_root(query: IngestionQuery, token: str | None, warn_msg: str | None = None) -> IngestionQuery:
"""Fallback to the root of the repository if no extra path parts are provided.
Parameters
----------
query : IngestionQuery
The query to fallback to the root of the repository.
token : str | None
The token to use to access the repository.
warn_msg : str | None
The message to warn.
Returns
-------
IngestionQuery
The query with the fallback to the root of the repository.
"""
url = cast("str", query.url)
query.commit = await _resolve_ref_to_sha(url, pattern="HEAD", token=token)
if warn_msg:
logger.warning(warn_msg)
return query
async def _normalise_source(raw: str, token: str | None) -> ParseResult:
"""Return a fully-qualified ParseResult or raise.
Parameters
----------
raw : str
The raw URL to parse.
token : str | None
The token to use to access the repository.
Returns
-------
ParseResult
The parsed URL.
"""
raw = unquote(raw)
parsed = urlparse(raw)
if parsed.scheme:
_validate_url_scheme(parsed.scheme)
_validate_host(parsed.netloc)
return parsed
# no scheme ('host/user/repo' or 'user/repo')
host = raw.split("/", 1)[0].lower()
if "." in host:
_validate_host(host)
return urlparse(f"https://{raw}")
# "user/repo" slug
host = await _try_domains_for_user_and_repo(*_get_user_and_repo_from_path(raw), token=token)
return urlparse(f"https://{host}/{raw}")
async def _try_domains_for_user_and_repo(user_name: str, repo_name: str, token: str | None = None) -> str:
"""Attempt to find a valid repository host for the given ``user_name`` and ``repo_name``.
Parameters
----------
user_name : str
The username or owner of the repository.
repo_name : str
The name of the repository.
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Returns
-------
str
The domain of the valid repository host.
Raises
------
ValueError
If no valid repository host is found for the given ``user_name`` and ``repo_name``.
"""
for domain in KNOWN_GIT_HOSTS:
candidate = f"https://{domain}/{user_name}/{repo_name}"
if await check_repo_exists(candidate, token=token if domain.startswith("github.") else None):
return domain
msg = f"Could not find a valid repository host for '{user_name}/{repo_name}'."
raise ValueError(msg)
def _is_valid_git_commit_hash(commit: str) -> bool:
"""Validate if the provided string is a valid Git commit hash.
This function checks if the commit hash is a 40-character string consisting only
of hexadecimal digits, which is the standard format for Git commit hashes.
Parameters
----------
commit : str
The string to validate as a Git commit hash.
Returns
-------
bool
``True`` if the string is a valid 40-character Git commit hash, otherwise ``False``.
"""
sha_hex_length = 40
return len(commit) == sha_hex_length and all(c in HEX_DIGITS for c in commit)
def _validate_host(host: str) -> None:
"""Validate a hostname.
The host is accepted if it is either present in the hard-coded ``KNOWN_GIT_HOSTS`` list or if it satisfies the
simple heuristics in ``_looks_like_git_host``, which try to recognise common self-hosted Git services (e.g. GitLab
instances on sub-domains such as 'gitlab.example.com' or 'git.example.com').
Parameters
----------
host : str
Hostname (case-insensitive).
Raises
------
ValueError
If the host cannot be recognised as a probable Git hosting domain.
"""
host = host.lower()
if host not in KNOWN_GIT_HOSTS and not _looks_like_git_host(host):
msg = f"Unknown domain '{host}' in URL"
raise ValueError(msg)
def _looks_like_git_host(host: str) -> bool:
"""Check if the given host looks like a Git host.
The current heuristic returns ``True`` when the host starts with ``git.`` (e.g. 'git.example.com'), starts with
gitextract_380b_654/
├── .docker/
│ └── minio/
│ └── setup.sh
├── .dockerignore
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ ├── bug_report.yml
│ │ └── feature_request.yml
│ └── workflows/
│ ├── ci.yml
│ ├── codeql.yml
│ ├── dependency-review.yml
│ ├── deploy-pr.yml
│ ├── docker-build.ecr.yml
│ ├── docker-build.ghcr.yml
│ ├── pr-title-check.yml
│ ├── publish_to_pypi.yml
│ ├── rebase-needed.yml
│ ├── release-please.yml
│ ├── scorecard.yml
│ └── stale.yml
├── .gitignore
├── .pre-commit-config.yaml
├── .release-please-manifest.json
├── .vscode/
│ └── launch.json
├── CHANGELOG.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Dockerfile
├── LICENSE
├── README.md
├── SECURITY.md
├── compose.yml
├── eslint.config.cjs
├── pyproject.toml
├── release-please-config.json
├── renovate.json
├── requirements-dev.txt
├── requirements.txt
├── src/
│ ├── gitingest/
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── clone.py
│ │ ├── config.py
│ │ ├── entrypoint.py
│ │ ├── ingestion.py
│ │ ├── output_formatter.py
│ │ ├── query_parser.py
│ │ ├── schemas/
│ │ │ ├── __init__.py
│ │ │ ├── cloning.py
│ │ │ ├── filesystem.py
│ │ │ └── ingestion.py
│ │ └── utils/
│ │ ├── __init__.py
│ │ ├── auth.py
│ │ ├── compat_func.py
│ │ ├── compat_typing.py
│ │ ├── exceptions.py
│ │ ├── file_utils.py
│ │ ├── git_utils.py
│ │ ├── ignore_patterns.py
│ │ ├── ingestion_utils.py
│ │ ├── logging_config.py
│ │ ├── notebook.py
│ │ ├── os_utils.py
│ │ ├── pattern_utils.py
│ │ ├── query_parser_utils.py
│ │ └── timeout_wrapper.py
│ ├── server/
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── form_types.py
│ │ ├── main.py
│ │ ├── metrics_server.py
│ │ ├── models.py
│ │ ├── query_processor.py
│ │ ├── routers/
│ │ │ ├── __init__.py
│ │ │ ├── dynamic.py
│ │ │ ├── index.py
│ │ │ └── ingest.py
│ │ ├── routers_utils.py
│ │ ├── s3_utils.py
│ │ ├── server_config.py
│ │ ├── server_utils.py
│ │ └── templates/
│ │ ├── base.jinja
│ │ ├── components/
│ │ │ ├── _macros.jinja
│ │ │ ├── footer.jinja
│ │ │ ├── git_form.jinja
│ │ │ ├── navbar.jinja
│ │ │ ├── result.jinja
│ │ │ └── tailwind_components.html
│ │ ├── git.jinja
│ │ ├── index.jinja
│ │ └── swagger_ui.jinja
│ └── static/
│ ├── js/
│ │ ├── git.js
│ │ ├── git_form.js
│ │ ├── index.js
│ │ ├── navbar.js
│ │ ├── posthog.js
│ │ └── utils.js
│ ├── llms.txt
│ └── robots.txt
└── tests/
├── .pylintrc
├── __init__.py
├── conftest.py
├── query_parser/
│ ├── __init__.py
│ ├── test_git_host_agnostic.py
│ └── test_query_parser.py
├── server/
│ ├── __init__.py
│ └── test_flow_integration.py
├── test_cli.py
├── test_clone.py
├── test_git_utils.py
├── test_gitignore_feature.py
├── test_ingestion.py
├── test_notebook_utils.py
├── test_pattern_utils.py
└── test_summary.py
SYMBOL INDEX (238 symbols across 51 files)
FILE: src/gitingest/__main__.py
class _CLIArgs (line 22) | class _CLIArgs(TypedDict):
function main (line 79) | def main(**cli_kwargs: Unpack[_CLIArgs]) -> None:
function _async_main (line 117) | async def _async_main(
FILE: src/gitingest/clone.py
function clone_repo (line 32) | async def clone_repo(config: CloneConfig, *, token: str | None = None) -...
function _perform_post_clone_operations (line 130) | async def _perform_post_clone_operations(
FILE: src/gitingest/entrypoint.py
function ingest_async (line 35) | async def ingest_async(
function ingest (line 151) | def ingest(
function _override_branch_and_tag (line 225) | def _override_branch_and_tag(query: IngestionQuery, branch: str | None, ...
function _apply_gitignores (line 262) | def _apply_gitignores(query: IngestionQuery) -> None:
function _clone_repo_if_remote (line 276) | async def _clone_repo_if_remote(query: IngestionQuery, *, token: str | N...
function _handle_remove_readonly (line 307) | def _handle_remove_readonly(
function _write_output (line 333) | async def _write_output(tree: str, content: str, target: str | None) -> ...
FILE: src/gitingest/ingestion.py
function ingest_query (line 21) | def ingest_query(query: IngestionQuery) -> tuple[str, str, str]:
function _process_node (line 123) | def _process_node(node: FileSystemNode, query: IngestionQuery, stats: Fi...
function _process_symlink (line 187) | def _process_symlink(path: Path, parent_node: FileSystemNode, stats: Fil...
function _process_file (line 216) | def _process_file(path: Path, parent_node: FileSystemNode, stats: FileSy...
function limit_exceeded (line 276) | def limit_exceeded(stats: FileSystemStats, depth: int) -> bool:
FILE: src/gitingest/output_formatter.py
function format_node (line 27) | def format_node(node: FileSystemNode, query: IngestionQuery) -> tuple[st...
function _create_summary_prefix (line 65) | def _create_summary_prefix(query: IngestionQuery, *, single_file: bool =...
function _gather_file_contents (line 105) | def _gather_file_contents(node: FileSystemNode) -> str:
function _create_tree_structure (line 129) | def _create_tree_structure(
function _format_token_count (line 181) | def _format_token_count(text: str) -> str | None:
FILE: src/gitingest/query_parser.py
function parse_remote_repo (line 25) | async def parse_remote_repo(source: str, token: str | None = None) -> In...
function parse_local_dir_path (line 122) | def parse_local_dir_path(path_str: str) -> IngestionQuery:
function _configure_branch_or_tag (line 141) | async def _configure_branch_or_tag(
FILE: src/gitingest/schemas/cloning.py
class CloneConfig (line 8) | class CloneConfig(BaseModel): # pylint: disable=too-many-instance-attri...
FILE: src/gitingest/schemas/filesystem.py
class FileSystemNodeType (line 20) | class FileSystemNodeType(Enum):
class FileSystemStats (line 29) | class FileSystemStats:
class FileSystemNode (line 37) | class FileSystemNode: # pylint: disable=too-many-instance-attributes
method sort_children (line 53) | def sort_children(self) -> None:
method content_string (line 87) | def content_string(self) -> str:
method content (line 107) | def content(self) -> str: # pylint: disable=too-many-return-statements
FILE: src/gitingest/schemas/ingestion.py
class IngestionQuery (line 14) | class IngestionQuery(BaseModel): # pylint: disable=too-many-instance-at...
method extract_clone_config (line 74) | def extract_clone_config(self) -> CloneConfig:
FILE: src/gitingest/utils/auth.py
function resolve_token (line 10) | def resolve_token(token: str | None) -> str | None:
FILE: src/gitingest/utils/compat_func.py
function readlink (line 7) | def readlink(path: Path) -> Path:
function removesuffix (line 26) | def removesuffix(s: str, suffix: str) -> str:
FILE: src/gitingest/utils/exceptions.py
class AsyncTimeoutError (line 4) | class AsyncTimeoutError(Exception):
class InvalidNotebookError (line 12) | class InvalidNotebookError(Exception):
method __init__ (line 15) | def __init__(self, message: str) -> None:
class InvalidGitHubTokenError (line 19) | class InvalidGitHubTokenError(ValueError):
method __init__ (line 22) | def __init__(self) -> None:
FILE: src/gitingest/utils/file_utils.py
function _get_preferred_encodings (line 20) | def _get_preferred_encodings() -> list[str]:
function _read_chunk (line 36) | def _read_chunk(path: Path) -> bytes | None:
function _decodes (line 57) | def _decodes(chunk: bytes, encoding: str) -> bool:
FILE: src/gitingest/utils/git_utils.py
function is_github_host (line 32) | def is_github_host(url: str) -> bool:
function run_command (line 50) | async def run_command(*args: str) -> tuple[bytes, bytes]:
function ensure_git_installed (line 86) | async def ensure_git_installed() -> None:
function check_repo_exists (line 123) | async def check_repo_exists(url: str, token: str | None = None) -> bool:
function _parse_github_url (line 149) | def _parse_github_url(url: str) -> tuple[str, str, str]:
function fetch_remote_branches_or_tags (line 187) | async def fetch_remote_branches_or_tags(url: str, *, ref_type: str, toke...
function create_git_repo (line 246) | def create_git_repo(local_path: str, url: str, token: str | None = None)...
function create_git_auth_header (line 286) | def create_git_auth_header(token: str, url: str = "https://github.com") ...
function create_authenticated_url (line 317) | def create_authenticated_url(url: str, token: str | None = None) -> str:
function git_auth_context (line 357) | def git_auth_context(url: str, token: str | None = None) -> Generator[tu...
function validate_github_token (line 381) | def validate_github_token(token: str) -> None:
function checkout_partial_clone (line 399) | async def checkout_partial_clone(config: CloneConfig, token: str | None)...
function resolve_commit (line 428) | async def resolve_commit(config: CloneConfig, token: str | None) -> str:
function _resolve_ref_to_sha (line 455) | async def _resolve_ref_to_sha(url: str, pattern: str, token: str | None ...
function _pick_commit_sha (line 499) | def _pick_commit_sha(lines: Iterable[str]) -> str | None:
FILE: src/gitingest/utils/ignore_patterns.py
function load_ignore_patterns (line 171) | def load_ignore_patterns(root: Path, filename: str) -> set[str]:
function _parse_ignore_file (line 200) | def _parse_ignore_file(ignore_file: Path, root: Path) -> set[str]:
FILE: src/gitingest/utils/ingestion_utils.py
function _should_include (line 13) | def _should_include(path: Path, base_path: Path, include_patterns: set[s...
function _should_exclude (line 43) | def _should_exclude(path: Path, base_path: Path, ignore_patterns: set[st...
function _relative_or_none (line 69) | def _relative_or_none(path: Path, base: Path) -> Path | None:
FILE: src/gitingest/utils/logging_config.py
function json_sink (line 18) | def json_sink(message: Any) -> None: # noqa: ANN401
function format_extra_fields (line 54) | def format_extra_fields(record: dict) -> str:
function extra_filter (line 85) | def extra_filter(record: dict) -> dict:
class InterceptHandler (line 105) | class InterceptHandler(logging.Handler):
method emit (line 108) | def emit(self, record: logging.LogRecord) -> None:
function configure_logging (line 128) | def configure_logging() -> None:
function get_logger (line 180) | def get_logger(name: str | None = None) -> logger.__class__:
FILE: src/gitingest/utils/notebook.py
function process_notebook (line 19) | def process_notebook(file: Path, *, include_output: bool = True) -> str:
function _process_cell (line 77) | def _process_cell(cell: dict[str, Any], *, include_output: bool) -> str ...
function _extract_output (line 128) | def _extract_output(output: dict[str, Any]) -> list[str]:
FILE: src/gitingest/utils/os_utils.py
function ensure_directory_exists_or_create (line 6) | async def ensure_directory_exists_or_create(path: Path) -> None:
FILE: src/gitingest/utils/pattern_utils.py
function process_patterns (line 13) | def process_patterns(
function _parse_patterns (line 48) | def _parse_patterns(patterns: str | Iterable[str]) -> set[str]:
FILE: src/gitingest/utils/query_parser_utils.py
class PathKind (line 31) | class PathKind(StrEnum):
function _fallback_to_root (line 40) | async def _fallback_to_root(query: IngestionQuery, token: str | None, wa...
function _normalise_source (line 65) | async def _normalise_source(raw: str, token: str | None) -> ParseResult:
function _try_domains_for_user_and_repo (line 101) | async def _try_domains_for_user_and_repo(user_name: str, repo_name: str,...
function _is_valid_git_commit_hash (line 133) | def _is_valid_git_commit_hash(commit: str) -> bool:
function _validate_host (line 154) | def _validate_host(host: str) -> None:
function _looks_like_git_host (line 178) | def _looks_like_git_host(host: str) -> bool:
function _validate_url_scheme (line 199) | def _validate_url_scheme(scheme: str) -> None:
function _get_user_and_repo_from_path (line 219) | def _get_user_and_repo_from_path(path: str) -> tuple[str, str]:
FILE: src/gitingest/utils/timeout_wrapper.py
function async_timeout (line 14) | def async_timeout(seconds: int) -> Callable[[Callable[P, Awaitable[T]]],...
FILE: src/server/main.py
function health_check (line 96) | async def health_check() -> dict[str, str]:
function head_root (line 108) | async def head_root() -> HTMLResponse:
function robots (line 124) | async def robots() -> FileResponse:
function llm_txt (line 140) | async def llm_txt() -> FileResponse:
function custom_swagger_ui (line 156) | async def custom_swagger_ui(request: Request) -> HTMLResponse:
function openapi_json_get (line 178) | def openapi_json_get() -> JSONResponse:
function openapi_json (line 195) | def openapi_json() -> JSONResponse:
FILE: src/server/metrics_server.py
function metrics (line 23) | async def metrics() -> HTMLResponse:
function start_metrics_server (line 41) | def start_metrics_server(host: str = "127.0.0.1", port: int = 9090) -> N...
FILE: src/server/models.py
class PatternType (line 18) | class PatternType(str, Enum):
class IngestRequest (line 25) | class IngestRequest(BaseModel):
method validate_input_text (line 51) | def validate_input_text(cls, v: str) -> str:
method validate_pattern (line 60) | def validate_pattern(cls, v: str) -> str:
class IngestSuccessResponse (line 65) | class IngestSuccessResponse(BaseModel):
class IngestErrorResponse (line 102) | class IngestErrorResponse(BaseModel):
class S3Metadata (line 119) | class S3Metadata(BaseModel):
class QueryForm (line 138) | class QueryForm(BaseModel):
method as_form (line 163) | def as_form(
FILE: src/server/query_processor.py
function _cleanup_repository (line 35) | def _cleanup_repository(clone_config: CloneConfig) -> None:
function _check_s3_cache (line 46) | async def _check_s3_cache(
function _store_digest_content (line 139) | def _store_digest_content(
function _generate_digest_url (line 200) | def _generate_digest_url(query: IngestionQuery) -> str:
function process_query (line 229) | async def process_query(
function _print_query (line 345) | def _print_query(url: str, max_file_size: int, pattern_type: str, patter...
function _print_error (line 373) | def _print_error(url: str, exc: Exception, max_file_size: int, pattern_t...
function _print_success (line 402) | def _print_success(url: str, max_file_size: int, pattern_type: str, patt...
FILE: src/server/routers/dynamic.py
function catch_all (line 12) | async def catch_all(request: Request, full_path: str) -> HTMLResponse:
FILE: src/server/routers/index.py
function home (line 12) | async def home(request: Request) -> HTMLResponse:
FILE: src/server/routers/ingest.py
function api_ingest (line 24) | async def api_ingest(
function api_ingest_get (line 57) | async def api_ingest_get(
function download_ingest (line 98) | async def download_ingest(
FILE: src/server/routers_utils.py
function _perform_ingestion (line 20) | async def _perform_ingestion(
FILE: src/server/s3_utils.py
class S3UploadError (line 30) | class S3UploadError(Exception):
function is_s3_enabled (line 34) | def is_s3_enabled() -> bool:
function get_s3_config (line 39) | def get_s3_config() -> dict[str, str | None]:
function get_s3_bucket_name (line 50) | def get_s3_bucket_name() -> str:
function get_s3_alias_host (line 55) | def get_s3_alias_host() -> str | None:
function generate_s3_file_path (line 60) | def generate_s3_file_path(
function create_s3_client (line 133) | def create_s3_client() -> BaseClient:
function upload_to_s3 (line 149) | def upload_to_s3(content: str, s3_file_path: str, ingest_id: UUID) -> str:
function upload_metadata_to_s3 (line 246) | def upload_metadata_to_s3(metadata: S3Metadata, s3_file_path: str, inges...
function get_metadata_from_s3 (line 345) | def get_metadata_from_s3(s3_file_path: str) -> S3Metadata | None:
function _build_s3_url (line 389) | def _build_s3_url(key: str) -> str:
function _check_object_tags (line 405) | def _check_object_tags(s3_client: BaseClient, bucket_name: str, key: str...
function check_s3_object_exists (line 415) | def check_s3_object_exists(s3_file_path: str) -> bool:
function get_s3_url_for_ingest_id (line 486) | def get_s3_url_for_ingest_id(ingest_id: UUID) -> str | None:
FILE: src/server/server_config.py
function get_version_info (line 31) | def get_version_info() -> dict[str, str]:
FILE: src/server/server_utils.py
function rate_limit_exception_handler (line 18) | async def rate_limit_exception_handler(request: Request, exc: Exception)...
class Colors (line 47) | class Colors:
FILE: src/static/js/git.js
function waitForStars (line 1) | function waitForStars() {
FILE: src/static/js/git_form.js
function changePattern (line 2) | function changePattern() {
function toggleAccessSettings (line 24) | function toggleAccessSettings() {
FILE: src/static/js/index.js
function submitExample (line 1) | function submitExample(repoName) {
FILE: src/static/js/navbar.js
function formatStarCount (line 2) | function formatStarCount(count) {
function fetchGitHubStars (line 8) | async function fetchGitHubStars() {
FILE: src/static/js/posthog.js
function g (line 9) | function g(t, e) {
FILE: src/static/js/utils.js
function getFileName (line 1) | function getFileName(element) {
function toggleFile (line 30) | function toggleFile(element) {
function copyText (line 58) | function copyText(className) {
function showLoading (line 105) | function showLoading() {
function showResults (line 110) | function showResults() {
function showError (line 115) | function showError(msg) {
function collectFormData (line 125) | function collectFormData(form) {
function setButtonLoadingState (line 143) | function setButtonLoadingState(submitButton, isLoading) {
function handleSuccessfulResponse (line 171) | function handleSuccessfulResponse(data) {
function handleSubmit (line 203) | function handleSubmit(event, showLoadingSpinner = false) {
function copyFullDigest (line 277) | function copyFullDigest() {
function downloadFullDigest (line 301) | function downloadFullDigest() {
function logSliderToSize (line 345) | function logSliderToSize(position) {
function initializeSlider (line 355) | function initializeSlider() {
function formatSize (line 378) | function formatSize(sizeInKB) {
function setupGlobalEnterHandler (line 387) | function setupGlobalEnterHandler() {
FILE: tests/conftest.py
function get_ensure_git_installed_call_count (line 30) | def get_ensure_git_installed_call_count() -> int:
function sample_query (line 50) | def sample_query() -> IngestionQuery:
function temp_directory (line 74) | def temp_directory(tmp_path: Path) -> Path:
function write_notebook (line 136) | def write_notebook(tmp_path: Path) -> WriteNotebookFunc:
function stub_resolve_sha (line 162) | def stub_resolve_sha(mocker: MockerFixture) -> dict[str, AsyncMock]:
function stub_branches (line 182) | def stub_branches(mocker: MockerFixture) -> Callable[[list[str]], None]:
function repo_exists_true (line 205) | def repo_exists_true(mocker: MockerFixture) -> AsyncMock:
function run_command_mock (line 211) | def run_command_mock(mocker: MockerFixture) -> AsyncMock:
function gitpython_mocks (line 227) | def gitpython_mocks(mocker: MockerFixture) -> dict[str, MagicMock]:
function _setup_gitpython_mocks (line 232) | def _setup_gitpython_mocks(mocker: MockerFixture) -> dict[str, MagicMock]:
function _fake_run_command (line 275) | async def _fake_run_command(*args: str) -> tuple[bytes, bytes]:
FILE: tests/query_parser/test_git_host_agnostic.py
function test_parse_query_without_host (line 31) | async def test_parse_query_without_host(
FILE: tests/query_parser/test_query_parser.py
function test_parse_url_valid_https (line 42) | async def test_parse_url_valid_https(url: str, stub_resolve_sha: dict[st...
function test_parse_url_valid_http (line 51) | async def test_parse_url_valid_http(url: str, stub_resolve_sha: dict[str...
function test_parse_url_invalid (line 57) | async def test_parse_url_invalid(stub_resolve_sha: dict[str, AsyncMock])...
function test_parse_query_basic (line 74) | async def test_parse_query_basic(url: str, stub_resolve_sha: dict[str, A...
function test_parse_query_mixed_case (line 90) | async def test_parse_query_mixed_case(stub_resolve_sha: dict[str, AsyncM...
function test_parse_url_with_subpaths (line 106) | async def test_parse_url_with_subpaths(
function test_parse_url_invalid_repo_structure (line 129) | async def test_parse_url_invalid_repo_structure(stub_resolve_sha: dict[s...
function test_parse_local_dir_path_local_path (line 144) | async def test_parse_local_dir_path_local_path() -> None:
function test_parse_local_dir_path_relative_path (line 160) | async def test_parse_local_dir_path_relative_path() -> None:
function test_parse_remote_repo_empty_source (line 176) | async def test_parse_remote_repo_empty_source(stub_resolve_sha: dict[str...
function test_parse_url_branch_and_commit_distinction (line 199) | async def test_parse_url_branch_and_commit_distinction(
function test_parse_local_dir_path_uuid_uniqueness (line 222) | async def test_parse_local_dir_path_uuid_uniqueness() -> None:
function test_parse_url_with_query_and_fragment (line 237) | async def test_parse_url_with_query_and_fragment(stub_resolve_sha: dict[...
function test_parse_url_unsupported_host (line 254) | async def test_parse_url_unsupported_host(stub_resolve_sha: dict[str, As...
function test_parse_query_with_branch (line 270) | async def test_parse_query_with_branch() -> None:
function test_parse_repo_source_with_various_url_patterns (line 304) | async def test_parse_repo_source_with_various_url_patterns(
function _assert_basic_repo_fields (line 330) | async def _assert_basic_repo_fields(url: str, sha_mock: AsyncMock) -> In...
FILE: tests/server/test_flow_integration.py
function test_client (line 21) | def test_client() -> Generator[TestClient, None, None]:
function mock_static_files (line 29) | def mock_static_files(mocker: MockerFixture) -> None:
function cleanup_tmp_dir (line 37) | def cleanup_tmp_dir() -> Generator[None, None, None]:
function test_remote_repository_analysis (line 49) | async def test_remote_repository_analysis(request: pytest.FixtureRequest...
function test_invalid_repository_url (line 74) | async def test_invalid_repository_url(request: pytest.FixtureRequest) ->...
function test_large_repository (line 95) | async def test_large_repository(request: pytest.FixtureRequest) -> None:
function test_concurrent_requests (line 119) | async def test_concurrent_requests(request: pytest.FixtureRequest) -> None:
function test_large_file_handling (line 148) | async def test_large_file_handling(request: pytest.FixtureRequest) -> None:
function test_repository_with_patterns (line 171) | async def test_repository_with_patterns(request: pytest.FixtureRequest) ...
FILE: tests/test_cli.py
function test_cli_writes_file (line 37) | def test_cli_writes_file(
function test_cli_with_stdout_output (line 62) | def test_cli_with_stdout_output() -> None:
function _invoke_isolated_cli_runner (line 93) | def _invoke_isolated_cli_runner(args: list[str]) -> Result:
FILE: tests/test_clone.py
function test_clone_with_commit (line 34) | async def test_clone_with_commit(repo_exists_true: AsyncMock, gitpython_...
function test_clone_nonexistent_repository (line 70) | async def test_clone_nonexistent_repository(repo_exists_true: AsyncMock)...
function test_check_repo_exists (line 100) | async def test_check_repo_exists(
function test_clone_without_commit (line 121) | async def test_clone_without_commit(repo_exists_true: AsyncMock, gitpyth...
function test_clone_creates_parent_directory (line 149) | async def test_clone_creates_parent_directory(tmp_path: Path, gitpython_...
function test_clone_with_specific_subpath (line 170) | async def test_clone_with_specific_subpath(gitpython_mocks: dict) -> None:
function test_clone_with_include_submodules (line 192) | async def test_clone_with_include_submodules(gitpython_mocks: dict) -> N...
function test_check_repo_exists_with_auth_token (line 209) | async def test_check_repo_exists_with_auth_token(mocker: MockerFixture) ...
FILE: tests/test_git_utils.py
function test_validate_github_token_valid (line 35) | def test_validate_github_token_valid(token: str) -> None:
function test_validate_github_token_invalid (line 52) | def test_validate_github_token_invalid(token: str) -> None:
function test_create_git_repo (line 81) | def test_create_git_repo(
function test_create_git_auth_header (line 113) | def test_create_git_auth_header(token: str) -> None:
function test_create_git_repo_helper_calls (line 129) | def test_create_git_repo_helper_calls(
function test_is_github_host (line 178) | def test_is_github_host(url: str, *, expected: bool) -> None:
function test_create_git_auth_header_with_ghe_url (line 195) | def test_create_git_auth_header_with_ghe_url(token: str, url: str, expec...
function test_create_git_repo_with_ghe_urls (line 234) | def test_create_git_repo_with_ghe_urls(
function test_create_git_repo_ignores_non_github_urls (line 264) | def test_create_git_repo_ignores_non_github_urls(
FILE: tests/test_gitignore_feature.py
function repo_fixture (line 12) | def repo_fixture(tmp_path: Path) -> Path:
function test_load_gitignore_patterns (line 35) | def test_load_gitignore_patterns(tmp_path: Path) -> None:
function test_ingest_with_gitignore (line 52) | async def test_ingest_with_gitignore(repo_path: Path) -> None:
FILE: tests/test_ingestion.py
function test_run_ingest_query (line 22) | def test_run_ingest_query(temp_directory: Path, sample_query: IngestionQ...
class PatternScenario (line 55) | class PatternScenario(TypedDict):
function test_include_ignore_patterns (line 201) | def test_include_ignore_patterns(
FILE: tests/test_notebook_utils.py
function test_process_notebook_all_cells (line 14) | def test_process_notebook_all_cells(write_notebook: WriteNotebookFunc) -...
function test_process_notebook_with_worksheets (line 48) | def test_process_notebook_with_worksheets(write_notebook: WriteNotebookF...
function test_process_notebook_multiple_worksheets (line 80) | def test_process_notebook_multiple_worksheets(write_notebook: WriteNoteb...
function test_process_notebook_code_only (line 118) | def test_process_notebook_code_only(write_notebook: WriteNotebookFunc) -...
function test_process_notebook_markdown_only (line 139) | def test_process_notebook_markdown_only(write_notebook: WriteNotebookFun...
function test_process_notebook_raw_only (line 161) | def test_process_notebook_raw_only(write_notebook: WriteNotebookFunc) ->...
function test_process_notebook_empty_cells (line 183) | def test_process_notebook_empty_cells(write_notebook: WriteNotebookFunc)...
function test_process_notebook_invalid_cell_type (line 206) | def test_process_notebook_invalid_cell_type(write_notebook: WriteNoteboo...
function test_process_notebook_with_output (line 225) | def test_process_notebook_with_output(write_notebook: WriteNotebookFunc)...
FILE: tests/test_pattern_utils.py
function test_process_patterns_empty_patterns (line 7) | def test_process_patterns_empty_patterns() -> None:
function test_parse_patterns_valid (line 20) | def test_parse_patterns_valid() -> None:
function test_process_patterns_include_and_ignore_overlap (line 33) | def test_process_patterns_include_and_ignore_overlap() -> None:
FILE: tests/test_summary.py
function test_ingest_summary (line 28) | def test_ingest_summary(path_type: str, path: str, ref_type: str, ref: s...
function _calculate_expected_lines (line 86) | def _calculate_expected_lines(ref_type: str, *, is_main_branch: bool) ->...
Condensed preview — 110 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (427K chars).
[
{
"path": ".docker/minio/setup.sh",
"chars": 1215,
"preview": "#!/bin/sh\n\n# Simple script to set up MinIO bucket and user\n# Based on example from MinIO issues\n\n# Format bucket name to"
},
{
"path": ".dockerignore",
"chars": 1009,
"preview": "# -------------------------------------------------\n# Base: reuse patterns from .gitignore\n# ---------------------------"
},
{
"path": ".github/ISSUE_TEMPLATE/bug_report.yml",
"chars": 4417,
"preview": "name: Bug report 🐞\ndescription: Report a bug or internal server error when using Gitingest\ntitle: \"(bug): \"\nlabels: [\"bu"
},
{
"path": ".github/ISSUE_TEMPLATE/feature_request.yml",
"chars": 2955,
"preview": "name: Feature request 💡\ndescription: Suggest a new feature or improvement for Gitingest\ntitle: \"(feat): \"\nlabels: [\"enha"
},
{
"path": ".github/workflows/ci.yml",
"chars": 1831,
"preview": "name: CI\n\non:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n\nconcurrency:\n group: ${{ github.workfl"
},
{
"path": ".github/workflows/codeql.yml",
"chars": 2925,
"preview": "# For most projects, this workflow file will not need changing; you simply need\n# to commit it to your repository.\n#\n# Y"
},
{
"path": ".github/workflows/dependency-review.yml",
"chars": 1001,
"preview": "# Dependency Review Action\n#\n# This Action will scan dependency manifest files that change as part of a Pull Request,\n# "
},
{
"path": ".github/workflows/deploy-pr.yml",
"chars": 5722,
"preview": "name: Manage PR Temp Envs\n'on':\n pull_request:\n types:\n - labeled\n - unlabeled\n - closed\n\npermissions"
},
{
"path": ".github/workflows/docker-build.ecr.yml",
"chars": 4235,
"preview": "name: Build & Push Container\n\non:\n push:\n branches:\n - 'main'\n tags:\n - '*'\n merge_group:\n pull_reque"
},
{
"path": ".github/workflows/docker-build.ghcr.yml",
"chars": 5118,
"preview": "name: Build & Push Container\n\non:\n push:\n branches:\n - 'main'\n tags:\n - '*'\n merge_group:\n pull_reque"
},
{
"path": ".github/workflows/pr-title-check.yml",
"chars": 657,
"preview": "name: PR Conventional Commit Validation\n\non:\n pull_request:\n types: [opened, synchronize, reopened, edited]\n\njobs:\n "
},
{
"path": ".github/workflows/publish_to_pypi.yml",
"chars": 1847,
"preview": "name: Publish to PyPI\n\non:\n release:\n types: [created] # Run when you click \"Publish release\"\n workflow_dispatch: #"
},
{
"path": ".github/workflows/rebase-needed.yml",
"chars": 841,
"preview": "name: PR Needs Rebase\n\non:\n workflow_dispatch: {}\n schedule:\n - cron: '0 * * * *'\n\npermissions:\n pull-requests: wr"
},
{
"path": ".github/workflows/release-please.yml",
"chars": 728,
"preview": "name: release-please\non:\n push:\n branches:\n - main\n\npermissions:\n contents: write\n pull-requests: write\n\njobs"
},
{
"path": ".github/workflows/scorecard.yml",
"chars": 1289,
"preview": "name: OSSF Scorecard\non:\n branch_protection_rule:\n schedule:\n - cron: '33 11 * * 2' # Every Tuesday at 11:33 AM UT"
},
{
"path": ".github/workflows/stale.yml",
"chars": 1271,
"preview": "name: \"Close stale issues and PRs\"\n\non:\n schedule:\n - cron: \"0 6 * * *\"\n workflow_dispatch: {}\n\npermissions:\n issu"
},
{
"path": ".gitignore",
"chars": 470,
"preview": "# Operating-system\n.DS_Store\nThumbs.db\n\n# Editor / IDE settings\n.vscode/\n!.vscode/launch.json\n.idea/\n*.swp\n\n# Python vir"
},
{
"path": ".pre-commit-config.yaml",
"chars": 5111,
"preview": "repos:\n - repo: https://github.com/pre-commit/pre-commit-hooks\n rev: v5.0.0\n hooks:\n - id: check-added-large"
},
{
"path": ".release-please-manifest.json",
"chars": 14,
"preview": "{\".\":\"0.3.1\"}\n"
},
{
"path": ".vscode/launch.json",
"chars": 265,
"preview": "{\n \"configurations\": [\n {\n \"name\": \"Python Debugger: Module\",\n \"type\": \"debugpy\",\n "
},
{
"path": "CHANGELOG.md",
"chars": 7107,
"preview": "# Changelog\n\n## [0.3.1](https://github.com/coderamp-labs/gitingest/compare/v0.3.0...v0.3.1) (2025-07-31)\n\n\n### Bug Fixes"
},
{
"path": "CODE_OF_CONDUCT.md",
"chars": 5206,
"preview": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make participa"
},
{
"path": "CONTRIBUTING.md",
"chars": 2346,
"preview": "# Contributing to Gitingest\n\nThanks for your interest in contributing to **Gitingest** 🚀 Our goal is to keep the codebas"
},
{
"path": "Dockerfile",
"chars": 1475,
"preview": "# Stage 1: Install Python dependencies\nFROM python:3.13.5-slim@sha256:4c2cf9917bd1cbacc5e9b07320025bdb7cdf2df7b0ceaccb55"
},
{
"path": "LICENSE",
"chars": 1072,
"preview": "MIT License\n\nCopyright (c) 2024 Romain Courtois\n\nPermission is hereby granted, free of charge, to any person obtaining a"
},
{
"path": "README.md",
"chars": 14460,
"preview": "# Gitingest\n\n[;\nconst globals = require('globals');\nconst importPlugin = require('eslint-plugin-import"
},
{
"path": "pyproject.toml",
"chars": 3572,
"preview": "[project]\nname = \"gitingest\"\nversion = \"0.3.1\"\ndescription=\"CLI tool to analyze and create text dumps of codebases for L"
},
{
"path": "release-please-config.json",
"chars": 209,
"preview": "{\n \"$schema\": \"https://raw.githubusercontent.com/googleapis/release-please/main/schemas/config.json\",\n \"packages\": {\n "
},
{
"path": "renovate.json",
"chars": 114,
"preview": "{\n \"$schema\": \"https://docs.renovatebot.com/renovate-schema.json\",\n \"extends\": [\n \"config:recommended\"\n ]\n}\n"
},
{
"path": "requirements-dev.txt",
"chars": 95,
"preview": "-r requirements.txt\neval-type-backport\npre-commit\npytest\npytest-asyncio\npytest-cov\npytest-mock\n"
},
{
"path": "requirements.txt",
"chars": 461,
"preview": "boto3>=1.28.0 # AWS SDK for S3 support\nclick>=8.0.0\nfastapi[standard]>=0.109.1 # Vulnerable to https://osv.dev/vulnera"
},
{
"path": "src/gitingest/__init__.py",
"chars": 162,
"preview": "\"\"\"Gitingest: A package for ingesting data from Git repositories.\"\"\"\n\nfrom gitingest.entrypoint import ingest, ingest_as"
},
{
"path": "src/gitingest/__main__.py",
"chars": 6600,
"preview": "\"\"\"Command-line interface (CLI) for Gitingest.\"\"\"\n\n# pylint: disable=no-value-for-parameter\nfrom __future__ import annot"
},
{
"path": "src/gitingest/clone.py",
"chars": 6233,
"preview": "\"\"\"Module containing functions for cloning a Git repository to a local path.\"\"\"\n\nfrom __future__ import annotations\n\nfro"
},
{
"path": "src/gitingest/config.py",
"chars": 497,
"preview": "\"\"\"Configuration file for the project.\"\"\"\n\nimport tempfile\nfrom pathlib import Path\n\nMAX_FILE_SIZE = 10 * 1024 * 1024 #"
},
{
"path": "src/gitingest/entrypoint.py",
"chars": 13064,
"preview": "\"\"\"Main entry point for ingesting a source and processing its contents.\"\"\"\n\nfrom __future__ import annotations\n\nimport a"
},
{
"path": "src/gitingest/ingestion.py",
"chars": 10789,
"preview": "\"\"\"Functions to ingest and analyze a codebase directory or single file.\"\"\"\n\nfrom __future__ import annotations\n\nfrom pat"
},
{
"path": "src/gitingest/output_formatter.py",
"chars": 6917,
"preview": "\"\"\"Functions to ingest and analyze a codebase directory or single file.\"\"\"\n\nfrom __future__ import annotations\n\nimport s"
},
{
"path": "src/gitingest/query_parser.py",
"chars": 6162,
"preview": "\"\"\"Module containing functions to parse and validate input sources and patterns.\"\"\"\n\nfrom __future__ import annotations\n"
},
{
"path": "src/gitingest/schemas/__init__.py",
"chars": 366,
"preview": "\"\"\"Module containing the schemas for the Gitingest package.\"\"\"\n\nfrom gitingest.schemas.cloning import CloneConfig\nfrom g"
},
{
"path": "src/gitingest/schemas/cloning.py",
"chars": 1331,
"preview": "\"\"\"Schema for the cloning process.\"\"\"\n\nfrom __future__ import annotations\n\nfrom pydantic import BaseModel, Field\n\n\nclass"
},
{
"path": "src/gitingest/schemas/filesystem.py",
"chars": 5071,
"preview": "\"\"\"Schema for the filesystem representation.\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nfrom dataclasses import d"
},
{
"path": "src/gitingest/schemas/ingestion.py",
"chars": 3289,
"preview": "\"\"\"Module containing the dataclasses for the ingestion process.\"\"\"\n\nfrom __future__ import annotations\n\nfrom pathlib imp"
},
{
"path": "src/gitingest/utils/__init__.py",
"chars": 51,
"preview": "\"\"\"Utility functions for the gitingest package.\"\"\"\n"
},
{
"path": "src/gitingest/utils/auth.py",
"chars": 579,
"preview": "\"\"\"Utilities for handling authentication.\"\"\"\n\nfrom __future__ import annotations\n\nimport os\n\nfrom gitingest.utils.git_ut"
},
{
"path": "src/gitingest/utils/compat_func.py",
"chars": 756,
"preview": "\"\"\"Compatibility functions for Python 3.8.\"\"\"\n\nimport os\nfrom pathlib import Path\n\n\ndef readlink(path: Path) -> Path:\n "
},
{
"path": "src/gitingest/utils/compat_typing.py",
"chars": 671,
"preview": "\"\"\"Compatibility layer for typing.\"\"\"\n\ntry:\n from enum import StrEnum # type: ignore[attr-defined] # Py ≥ 3.11\nexce"
},
{
"path": "src/gitingest/utils/exceptions.py",
"chars": 919,
"preview": "\"\"\"Custom exceptions for the Gitingest package.\"\"\"\n\n\nclass AsyncTimeoutError(Exception):\n \"\"\"Exception raised when an"
},
{
"path": "src/gitingest/utils/file_utils.py",
"chars": 1878,
"preview": "\"\"\"Utility functions for working with files and directories.\"\"\"\n\nfrom __future__ import annotations\n\nimport locale\nimpor"
},
{
"path": "src/gitingest/utils/git_utils.py",
"chars": 15654,
"preview": "\"\"\"Utility functions for interacting with Git repositories.\"\"\"\n\nfrom __future__ import annotations\n\nimport asyncio\nimpor"
},
{
"path": "src/gitingest/utils/ignore_patterns.py",
"chars": 4897,
"preview": "\"\"\"Default ignore patterns for Gitingest.\"\"\"\n\nfrom __future__ import annotations\n\nfrom pathlib import Path\n\nDEFAULT_IGNO"
},
{
"path": "src/gitingest/utils/ingestion_utils.py",
"chars": 2540,
"preview": "\"\"\"Utility functions for the ingestion process.\"\"\"\n\nfrom __future__ import annotations\n\nfrom typing import TYPE_CHECKING"
},
{
"path": "src/gitingest/utils/logging_config.py",
"chars": 5742,
"preview": "\"\"\"Logging configuration for gitingest using loguru.\n\nThis module provides structured JSON logging suitable for Kubernet"
},
{
"path": "src/gitingest/utils/notebook.py",
"chars": 4385,
"preview": "\"\"\"Utilities for processing Jupyter notebooks.\"\"\"\n\nfrom __future__ import annotations\n\nimport json\nfrom itertools import"
},
{
"path": "src/gitingest/utils/os_utils.py",
"chars": 566,
"preview": "\"\"\"Utility functions for working with the operating system.\"\"\"\n\nfrom pathlib import Path\n\n\nasync def ensure_directory_ex"
},
{
"path": "src/gitingest/utils/pattern_utils.py",
"chars": 2272,
"preview": "\"\"\"Pattern utilities for the Gitingest package.\"\"\"\n\nfrom __future__ import annotations\n\nimport re\nfrom typing import Ite"
},
{
"path": "src/gitingest/utils/query_parser_utils.py",
"chars": 6635,
"preview": "\"\"\"Utility functions for parsing and validating query parameters.\"\"\"\n\nfrom __future__ import annotations\n\nimport string\n"
},
{
"path": "src/gitingest/utils/timeout_wrapper.py",
"chars": 1582,
"preview": "\"\"\"Utility functions for the Gitingest package.\"\"\"\n\nimport asyncio\nimport functools\nfrom typing import Awaitable, Callab"
},
{
"path": "src/server/__init__.py",
"chars": 21,
"preview": "\"\"\"Server module.\"\"\"\n"
},
{
"path": "src/server/__main__.py",
"chars": 798,
"preview": "\"\"\"Server module entry point for running with python -m server.\"\"\"\n\nimport os\n\nimport uvicorn\n\n# Import logging configur"
},
{
"path": "src/server/form_types.py",
"chars": 448,
"preview": "\"\"\"Reusable form type aliases for FastAPI form parameters.\"\"\"\n\nfrom __future__ import annotations\n\nfrom typing import TY"
},
{
"path": "src/server/main.py",
"chars": 7413,
"preview": "\"\"\"Main module for the FastAPI application.\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nimport threading\nfrom path"
},
{
"path": "src/server/metrics_server.py",
"chars": 1814,
"preview": "\"\"\"Prometheus metrics server running on a separate port.\"\"\"\n\nimport uvicorn\nfrom fastapi import FastAPI\nfrom fastapi.res"
},
{
"path": "src/server/models.py",
"chars": 6193,
"preview": "\"\"\"Pydantic models for the query form.\"\"\"\n\nfrom __future__ import annotations\n\nfrom enum import Enum\nfrom typing import "
},
{
"path": "src/server/query_processor.py",
"chars": 14392,
"preview": "\"\"\"Process a query by parsing input, cloning a repository, and generating a summary.\"\"\"\n\nfrom __future__ import annotati"
},
{
"path": "src/server/routers/__init__.py",
"chars": 261,
"preview": "\"\"\"Module containing the routers for the FastAPI application.\"\"\"\n\nfrom server.routers.dynamic import router as dynamic\nf"
},
{
"path": "src/server/routers/dynamic.py",
"chars": 1262,
"preview": "\"\"\"The dynamic router module defines handlers for dynamic path requests.\"\"\"\n\nfrom fastapi import APIRouter, Request\nfrom"
},
{
"path": "src/server/routers/index.py",
"chars": 1216,
"preview": "\"\"\"Module defining the FastAPI router for the home page of the application.\"\"\"\n\nfrom fastapi import APIRouter, Request\nf"
},
{
"path": "src/server/routers/ingest.py",
"chars": 6211,
"preview": "\"\"\"Ingest endpoint for the API.\"\"\"\n\nfrom typing import Union\nfrom uuid import UUID\n\nfrom fastapi import APIRouter, HTTPE"
},
{
"path": "src/server/routers_utils.py",
"chars": 2257,
"preview": "\"\"\"Utility functions for the ingest endpoints.\"\"\"\n\nfrom __future__ import annotations\n\nfrom typing import Any\n\nfrom fast"
},
{
"path": "src/server/s3_utils.py",
"chars": 17775,
"preview": "\"\"\"S3 utility functions for uploading and managing digest files.\"\"\"\n\nfrom __future__ import annotations\n\nimport hashlib\n"
},
{
"path": "src/server/server_config.py",
"chars": 1849,
"preview": "\"\"\"Configuration for the server.\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nfrom pathlib import Path\n\nfrom fastap"
},
{
"path": "src/server/server_utils.py",
"chars": 1892,
"preview": "\"\"\"Utility functions for the server.\"\"\"\n\nfrom fastapi import Request\nfrom fastapi.responses import Response\nfrom slowapi"
},
{
"path": "src/server/templates/base.jinja",
"chars": 3391,
"preview": "<!DOCTYPE html>\n<html lang=\"en\">\n <head>\n <meta charset=\"UTF-8\">\n <meta name=\"viewport\" content=\"width="
},
{
"path": "src/server/templates/components/_macros.jinja",
"chars": 324,
"preview": "{# Icon link #}\n{% macro footer_icon_link(href, icon, label) -%}\n <a href=\"{{ href }}\"\n target=\"_blank\"\n "
},
{
"path": "src/server/templates/components/footer.jinja",
"chars": 1611,
"preview": "{% from 'components/_macros.jinja' import footer_icon_link %}\n<footer class=\"w-full border-t-[3px] border-gray-900 mt-au"
},
{
"path": "src/server/templates/components/git_form.jinja",
"chars": 12294,
"preview": "<div class=\"relative\">\n <div class=\"w-full h-full absolute inset-0 bg-gray-900 rounded-xl translate-y-2 translate-x-2"
},
{
"path": "src/server/templates/components/navbar.jinja",
"chars": 1879,
"preview": "<header class=\"sticky top-0 bg-[#FFFDF8] border-b-[3px] border-gray-900 z-50\">\n <div class=\"max-w-4xl mx-auto px-4\">\n"
},
{
"path": "src/server/templates/components/result.jinja",
"chars": 8134,
"preview": "<div class=\"mt-10\">\n <!-- Error Message (hidden by default) -->\n <div id=\"results-error\" style=\"display:none\"></di"
},
{
"path": "src/server/templates/components/tailwind_components.html",
"chars": 1327,
"preview": "<style type=\"text/tailwindcss\">\n @layer components {\n .badge-new {\n @apply inline-block -rotate-6 -translate-y-"
},
{
"path": "src/server/templates/git.jinja",
"chars": 452,
"preview": "{% extends \"base.jinja\" %}\n{% block content %}\n {% if error_message %}\n <div class=\"mb-6 p-4 bg-red-50 border "
},
{
"path": "src/server/templates/index.jinja",
"chars": 1279,
"preview": "{% extends \"base.jinja\" %}\n{% block content %}\n <div class=\"mb-8\">\n <div class=\"relative w-full flex sm:flex-r"
},
{
"path": "src/server/templates/swagger_ui.jinja",
"chars": 1422,
"preview": "{% extends \"base.jinja\" %}\n{% block title %}GitIngest API{% endblock %}\n{% block content %}\n <div class=\"mb-8\">\n "
},
{
"path": "src/static/js/git.js",
"chars": 967,
"preview": "function waitForStars() {\n return new Promise((resolve) => {\n const check = () => {\n const stars = "
},
{
"path": "src/static/js/git_form.js",
"chars": 1367,
"preview": "// Strike-through / un-strike file lines when the pattern-type menu flips.\nfunction changePattern() {\n const dirPre ="
},
{
"path": "src/static/js/index.js",
"chars": 258,
"preview": "function submitExample(repoName) {\n const input = document.getElementById('input_text');\n\n if (input) {\n in"
},
{
"path": "src/static/js/navbar.js",
"chars": 776,
"preview": "// Fetch GitHub stars\nfunction formatStarCount(count) {\n if (count >= 1000) {return `${ (count / 1000).toFixed(1) }k`"
},
{
"path": "src/static/js/posthog.js",
"chars": 2798,
"preview": "/* eslint-disable */\n!function (t, e) {\n let o, n, p, r;\n if (e.__SV) {return;} // already loaded\n"
},
{
"path": "src/static/js/utils.js",
"chars": 13898,
"preview": "function getFileName(element) {\n const indentSize = 4;\n let path = '';\n let prevIndentLevel = null;\n\n while "
},
{
"path": "src/static/llms.txt",
"chars": 12047,
"preview": "# GitIngest – **AI Agent Integration Guide**\n\nTurn any Git repository into a prompt-ready text digest. GitIngest fetches"
},
{
"path": "src/static/robots.txt",
"chars": 69,
"preview": "User-agent: *\nAllow: /\nAllow: /api/\nAllow: /coderamp-labs/gitingest/\n"
},
{
"path": "tests/.pylintrc",
"chars": 196,
"preview": "[MASTER]\ninit-hook=\n import sys\n sys.path.append('./src')\n\n[MESSAGES CONTROL]\ndisable=missing-class-docstring,miss"
},
{
"path": "tests/__init__.py",
"chars": 39,
"preview": "\"\"\"Tests for the gitingest package.\"\"\"\n"
},
{
"path": "tests/conftest.py",
"chars": 8720,
"preview": "\"\"\"Fixtures for tests.\n\nThis file provides shared fixtures for creating sample queries, a temporary directory structure,"
},
{
"path": "tests/query_parser/__init__.py",
"chars": 34,
"preview": "\"\"\"Tests for the query parser.\"\"\"\n"
},
{
"path": "tests/query_parser/test_git_host_agnostic.py",
"chars": 2734,
"preview": "\"\"\"Tests to verify that the query parser is Git host agnostic.\n\nThese tests confirm that ``parse_query`` correctly ident"
},
{
"path": "tests/query_parser/test_query_parser.py",
"chars": 11907,
"preview": "\"\"\"Tests for the ``query_parser`` module.\n\nThese tests cover URL parsing, pattern parsing, and handling of branches/subp"
},
{
"path": "tests/server/__init__.py",
"chars": 28,
"preview": "\"\"\"Tests for the server.\"\"\"\n"
},
{
"path": "tests/server/test_flow_integration.py",
"chars": 6683,
"preview": "\"\"\"Integration tests covering core functionalities, edge cases, and concurrency handling.\"\"\"\n\nimport shutil\nimport sys\nf"
},
{
"path": "tests/test_cli.py",
"chars": 3548,
"preview": "\"\"\"Tests for the Gitingest CLI.\"\"\"\n\nfrom __future__ import annotations\n\nfrom inspect import signature\nfrom pathlib impor"
},
{
"path": "tests/test_clone.py",
"chars": 7746,
"preview": "\"\"\"Tests for the ``clone`` module.\n\nThese tests cover various scenarios for cloning repositories, verifying that the app"
},
{
"path": "tests/test_git_utils.py",
"chars": 9459,
"preview": "\"\"\"Tests for the ``git_utils`` module.\n\nThese tests validate the ``validate_github_token`` function, which ensures that\n"
},
{
"path": "tests/test_gitignore_feature.py",
"chars": 2714,
"preview": "\"\"\"Tests for the gitignore functionality in Gitingest.\"\"\"\n\nfrom pathlib import Path\n\nimport pytest\n\nfrom gitingest.entry"
},
{
"path": "tests/test_ingestion.py",
"chars": 8861,
"preview": "\"\"\"Tests for the ``query_ingestion`` module.\n\nThese tests validate directory scanning, file content extraction, notebook"
},
{
"path": "tests/test_notebook_utils.py",
"chars": 10375,
"preview": "\"\"\"Tests for the ``notebook`` utils module.\n\nThese tests validate how notebooks are processed into Python-like output, e"
},
{
"path": "tests/test_pattern_utils.py",
"chars": 1642,
"preview": "\"\"\"Test pattern utilities.\"\"\"\n\nfrom gitingest.utils.ignore_patterns import DEFAULT_IGNORE_PATTERNS\nfrom gitingest.utils."
},
{
"path": "tests/test_summary.py",
"chars": 3487,
"preview": "\"\"\"Test that ``gitingest.ingest()`` emits a concise, 5-or-6-line summary.\"\"\"\n\nimport re\nfrom pathlib import Path\n\nimport"
}
]
About this extraction
This page contains the full source code of the coderamp-labs/gitingest GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 110 files (391.9 KB), approximately 98.9k tokens, and a symbol index with 238 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.