Full Code of kelvins/awesome-dataops for AI

Repository: kelvins/awesome-dataops
Branch: main
Commit: e3f89a46010d
Files: 7
Total size: 26.2 KB

Directory structure:
gitextract_rh9xc5ik/

├── .github/
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── workflows/
│       └── validate.yml
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── README.md
├── check_order.py
└── mlc_config.json

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/PULL_REQUEST_TEMPLATE.md
================================================
## What is this tool for?

Describe features.

## What's the difference between this tool and similar ones?

Enumerate comparisons.

---

If you agree with this pull request, please submit an *Approve* review.


================================================
FILE: .github/workflows/validate.yml
================================================
name: Validate README

on: [push, pull_request]

jobs:
  check-order:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v2
      with:
        python-version: '3.8'
    - run: python check_order.py
  check-links:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: gaurav-nelson/github-action-markdown-link-check@v1


================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, gender identity and expression, level of experience,
nationality, personal appearance, race, religion, or sexual identity and
orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at [http://contributor-covenant.org/version/1/4][version]

[homepage]: http://contributor-covenant.org
[version]: http://contributor-covenant.org/version/1/4/


================================================
FILE: CONTRIBUTING.md
================================================
# Contributing

Your contributions are always welcome!

## Guidelines

* Add one link per Pull Request.
    * Make sure the PR title is in the format of `Add project-name`.
* Add the link: `* [project-name](http://example.com/) - A short description that ends with a period.`
    * Keep descriptions concise.
* Add a section if needed.
    * Add the section description.
    * Add the section title to Table of Contents.
* Search previous Pull Requests or Issues before making a new one, as yours may be a duplicate.
* Check your spelling and grammar.
* Remove any trailing whitespace.


================================================
FILE: README.md
================================================
# Awesome DataOps [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

A curated list of awesome DataOps tools.

- [Awesome DataOps](#awesome-dataops)
    - [Data Catalog](#data-catalog)
    - [Data Exploration](#data-exploration)
    - [Data Ingestion](#data-ingestion)
    - [Data Processing](#data-processing)
    - [Data Quality](#data-quality)
    - [Data Serialization](#data-serialization)
        - [Data Compression](#data-compression)
        - [Data Table Format](#data-table-format)
    - [Data Visualization](#data-visualization)
    - [Data Warehouse](#data-warehouse)
    - [Data Workflow](#data-workflow)
    - [Database](#database)
        - [Columnar Database](#columnar-database)
        - [Document-Oriented Database](#document-oriented-database)
        - [Graph Database](#graph-database)
        - [Key-Value Database](#key-value-database)
        - [Relational Database](#relational-database)
        - [Time Series Database](#time-series-database)
        - [Vector Database](#vector-database)
    - [File System](#file-system)
    - [Logging and Monitoring](#logging-and-monitoring)
    - [Metadata Service](#metadata-service)
    - [SQL Playground](#sql-playground)
    - [SQL Query Engine](#sql-query-engine)
- [Resources](#resources)
    - [Books](#books)
    - [Other Lists](#other-lists)
    - [Slack](#slack)
- [Contributing](#contributing)

---

## Data Catalog

*Tools related to data cataloging.*

* [Amundsen](https://www.amundsen.io/) - Data discovery and metadata engine for improving productivity when interacting with data.
* [Apache Atlas](https://atlas.apache.org) - Provides open metadata management and governance capabilities to build a data catalog.
* [CKAN](https://github.com/ckan/ckan) - Open-source DMS (data management system) for powering data hubs and data portals.
* [DataHub](https://github.com/linkedin/datahub) - LinkedIn's generalized metadata search & discovery tool.
* [Magda](https://github.com/magda-io/magda) - A federated, open-source data catalog for all your big data and small data.
* [Marquez](https://github.com/MarquezProject/marquez) - Service for the collection, aggregation, and visualization of a data ecosystem's metadata.
* [Metacat](https://github.com/Netflix/metacat) - Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra.
* [OpenLineage](https://github.com/OpenLineage/openlineage) - Open standard for metadata and lineage collection.
* [OpenMetadata](https://open-metadata.org/) - A single place to discover, collaborate on, and get your data right.
* [Unity Catalog](https://www.unitycatalog.io/) - A universal catalog for data and AI.

## Data Exploration

*Tools for performing data exploration.*

* [Apache Zeppelin](https://zeppelin.apache.org/) - Enables data-driven, interactive data analytics and collaborative documents.
* [Jupyter Notebook](https://jupyter.org/) - Web-based notebook environment for interactive computing.
* [JupyterLab](https://jupyterlab.readthedocs.io) - The next-generation user interface for Project Jupyter.
* [Jupytext](https://github.com/mwouts/jupytext) - Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts.
* [Marimo](https://github.com/marimo-team/marimo) - A reactive Python notebook that's reproducible, git-friendly, and deployable as scripts or apps.
* [Polynote](https://polynote.org/) - The polyglot notebook with first-class Scala support.

## Data Ingestion

*Tools for performing data ingestion; a minimal producer sketch follows the list.*

* [Amazon Kinesis](https://aws.amazon.com/kinesis/) - Easily collect, process, and analyze video and data streams in real time.
* [Apache Gobblin](https://github.com/apache/gobblin) - A framework that simplifies common aspects of big data such as data ingestion.
* [Apache Kafka](https://github.com/apache/kafka) - Open-source distributed event streaming platform used by thousands of companies.
* [Apache Pulsar](https://github.com/apache/pulsar) - Distributed pub-sub messaging platform with a flexible messaging model and intuitive API.
* [Embulk](https://github.com/embulk/embulk) - A parallel bulk data loader that helps data transfer between various storages.
* [Fluentd](https://github.com/fluent/fluentd) - Collects events from various data sources and writes them to files.
* [Google PubSub](https://cloud.google.com/pubsub) - Ingest events for streaming into BigQuery, data lakes or operational databases.
* [Nakadi](https://github.com/zalando/nakadi) - A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues.
* [Pravega](https://github.com/pravega/pravega) - An open source distributed storage service implementing Streams.
* [RabbitMQ](https://www.rabbitmq.com/) - One of the most popular open source message brokers.
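
To make the ingestion pattern concrete, here is a minimal event-publishing sketch using the `kafka-python` client for Apache Kafka (listed above). The broker address and topic name are assumptions for illustration, not part of this list.

```python
# Minimal Kafka producer sketch (assumes `kafka-python` is installed and a
# broker is reachable at localhost:9092; the topic name is hypothetical).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user": 42, "action": "page_view"}')
producer.flush()  # block until the broker acknowledges the event
```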

## Data Processing

*Tools related to data processing (batch and stream); a small batch-job sketch follows the list.*

* [Apache Beam](https://github.com/apache/beam) - A unified model for defining both batch and streaming data-parallel processing pipelines.
* [Apache Flink](https://github.com/apache/flink) - An open source stream processing framework with powerful capabilities.
* [Apache Hadoop MapReduce](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) - A framework for writing applications which process vast amounts of data.
* [Apache Nifi](https://github.com/apache/nifi) - An easy to use, powerful, and reliable system to process and distribute data.
* [Apache Samza](https://github.com/apache/samza) - A distributed stream processing framework which uses Apache Kafka and Hadoop YARN.
* [Apache Spark](https://github.com/apache/spark) - A unified analytics engine for large-scale data processing.
* [Apache Storm](https://github.com/apache/storm) - An open source distributed realtime computation system.
* [Apache Tez](https://github.com/apache/tez) - A generic data-processing pipeline engine envisioned as a low-level engine.
* [Faust](https://github.com/robinhood/faust) - A stream processing library, porting the ideas from Kafka Streams to Python.
* [skrub](http://skrub-data.org) - Python library to ease preprocessing and feature engineering for tabular machine learning.
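
For a feel of the batch side, a minimal PySpark job (Apache Spark is listed above) might look like the sketch below; it runs in local mode on made-up data.

```python
# Minimal PySpark batch job; local mode, toy data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()  # aggregate per key, print to stdout
spark.stop()
```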

## Data Quality

*Tools for ensuring data quality; a validation sketch follows the list.*

* [Cerberus](https://github.com/pyeve/cerberus) - Lightweight, extensible data validation library for Python.
* [Cleanlab](https://github.com/cleanlab/cleanlab) - Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers.
* [DataProfiler](https://github.com/capitalone/DataProfiler) - A Python library designed to make data analysis, monitoring, and sensitive data detection easy.
* [Deequ](https://github.com/awslabs/deequ) - A library built on top of Apache Spark for measuring data quality in large datasets.
* [Great Expectations](https://greatexpectations.io) - A Python data validation framework for testing your data against declared expectations.
* [JSON Schema](https://json-schema.org/) - A vocabulary that allows you to annotate and validate JSON documents.
* [SodaSQL](https://github.com/sodadata/soda-sql) - Data profiling, testing, and monitoring for SQL accessible data.
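
As an illustration, record-level validation with Cerberus (listed above) can be as small as the sketch below; the schema and records are hypothetical.

```python
# Validate a record against a schema with Cerberus; schema is made up.
from cerberus import Validator

schema = {
    "name": {"type": "string", "required": True},
    "age": {"type": "integer", "min": 0},
}
v = Validator(schema)
print(v.validate({"name": "Ada", "age": 36}))  # True
print(v.validate({"age": -1}))                 # False
print(v.errors)                                # which fields failed and why
```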

## Data Serialization

*Tools related to data serialization; a Parquet example follows the list.*

* [Apache Avro](https://github.com/apache/avro) - A data serialization system which is compact, fast and provides rich data structures.
* [Apache ORC](https://github.com/apache/orc) - A self-describing type-aware columnar file format designed for Hadoop workloads.
* [Apache Parquet](https://github.com/apache/parquet-mr) - A columnar storage format which provides efficient storage and encoding of data.
* [Kryo](https://github.com/EsotericSoftware/kryo) - A fast and efficient binary object graph serialization framework for Java.
* [ProtoBuf](https://github.com/protocolbuffers/protobuf) - Language-neutral, platform-neutral, extensible mechanism for serializing structured data.
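
To illustrate, writing and reading Parquet (listed above) via the PyArrow library takes a few lines; the file path and columns are made up.

```python
# Round-trip a small table through Parquet with PyArrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
pq.write_table(table, "example.parquet")  # columnar, compressed on disk
print(pq.read_table("example.parquet").to_pydict())
```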

### Data Compression

* [Pigz](https://github.com/madler/pigz) - A parallel implementation of gzip for modern multi-processor, multi-core machines.
* [Snappy](https://github.com/google/snappy) - Open source compression library that is fast, stable and robust.

### Data Table Format

* [Apache Hudi](https://github.com/apache/hudi) - Manages the storage of large analytical datasets on DFS.
* [Apache Iceberg](https://github.com/apache/iceberg) - Open table format for huge analytic datasets.
* [Delta Lake](https://github.com/delta-io/delta) - An open source project that enables building a Lakehouse architecture on top of data lakes.
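
For example, a Delta Lake table (listed above) can be written and read from Python with the `deltalake` package (the delta-rs bindings); the path and data are assumptions.

```python
# Write then read a local Delta Lake table via the `deltalake` package.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# mode="overwrite" makes the sketch safe to re-run against the same path.
write_deltalake("/tmp/events", pd.DataFrame({"id": [1, 2]}), mode="overwrite")
print(DeltaTable("/tmp/events").to_pandas())
```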

## Data Visualization

*Tools for performing data visualization (DataViz).*

* [Apache Superset](https://github.com/apache/superset) - A modern data exploration and data visualization platform.
* [Count](https://count.co) - SQL/drag-and-drop querying and visualisation tool based on notebooks.
* [Dash](https://github.com/plotly/dash) - Analytical Web Apps for Python, R, Julia, and Jupyter.
* [Data Studio](https://datastudio.google.com) - Reporting solution for power users who want to go beyond the data and dashboards of Google Analytics.
* [HUE](https://github.com/cloudera/hue) - A mature SQL Assistant for querying Databases & Data Warehouses.
* [Lux](https://github.com/lux-org/lux) - Fast and easy data exploration by automating the visualization and data analysis process.
* [Metabase](https://www.metabase.com/) - The simplest, fastest way to get business intelligence and analytics to everyone.
* [Redash](https://redash.io/) - Connect to any data source, easily visualize, dashboard and share your data.
* [Tableau](https://www.tableau.com) - A powerful and fast-growing data visualization tool used in the business intelligence industry.

## Data Warehouse

*Tools related to storing data in data warehouses (DW).*

* [Amazon Redshift](https://aws.amazon.com/redshift/) - Accelerate your time to insights with fast, easy, and secure cloud data warehousing.
* [Apache Hive](https://github.com/apache/hive) - Facilitates reading, writing, and managing large datasets residing in distributed storage.
* [Apache Kylin](https://github.com/apache/kylin) - An open source, distributed analytical data warehouse for big data.
* [Google BigQuery](https://cloud.google.com/bigquery) - Serverless, highly scalable, and cost-effective multicloud data warehouse.

## Data Workflow

*Tools related to data workflow/pipeline.*

* [Apache Airflow](https://github.com/apache/airflow) - A platform to programmatically author, schedule, and monitor workflows.
* [Apache Oozie](https://github.com/apache/oozie) - An extensible, scalable and reliable system to manage complex Hadoop workloads.
* [Azkaban](https://github.com/azkaban/azkaban) - Batch workflow job scheduler created at LinkedIn to run Hadoop jobs.
* [Dagster](https://github.com/dagster-io/dagster) - An orchestration platform for the development, production, and observation of data assets.
* [Luigi](https://github.com/spotify/luigi) - Python module that helps you build complex pipelines of batch jobs.
* [Prefect](https://docs.prefect.io/) - A workflow management system, designed for modern infrastructure.

## Database

*Database tools for storing data.*

### Columnar Database

* [Apache Cassandra](https://github.com/apache/cassandra) - Open-source column-based DBMS designed to handle large amounts of data.
* [Apache Druid](https://github.com/apache/druid) - Designed to quickly ingest massive quantities of event data, and provide low-latency queries.
* [Apache HBase](https://github.com/apache/hbase) - An open-source, distributed, versioned, column-oriented store.
* [Scylla](https://github.com/scylladb/scylla) - Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies.

### Document-Oriented Database

* [Apache CouchDB](https://github.com/apache/couchdb) - An open-source document-oriented NoSQL database, implemented in Erlang.
* [Elasticsearch](https://github.com/elastic/elasticsearch) - A distributed document-oriented database with a RESTful search engine.
* [MongoDB](https://github.com/mongodb/mongo) - A cross-platform document database that uses JSON-like documents with optional schemas.
* [RethinkDB](https://github.com/rethinkdb/rethinkdb) - The first open-source scalable database built for realtime applications.

### Graph Database

* [Age](https://github.com/apache/age) - A multi-model database that supports both graph and relational data models.
* [ArangoDB](https://github.com/arangodb/arangodb) - A scalable open-source multi-model database natively supporting graph, document and search.
* [JanusGraph](https://github.com/JanusGraph/janusgraph) - Manage large graphs with billions of data distributed across a multi-machine cluster.
* [Memgraph](https://github.com/memgraph/memgraph) - An open source graph database, built for real-time streaming data, compatible with Neo4j.
* [Neo4j](https://github.com/neo4j/neo4j) - A high performance graph store with all the features expected of a mature and robust database.
* [Titan](https://github.com/thinkaurelius/titan) - A highly scalable graph database optimized for storing and querying large graphs.

### Key-Value Database

* [Apache Accumulo](https://github.com/apache/accumulo) - A sorted, distributed key-value store that provides robust and scalable data storage.
* [Dragonfly](https://github.com/dragonflydb/dragonfly) - A modern in-memory datastore, fully compatible with Redis and Memcached APIs.
* [DynamoDB](https://aws.amazon.com/dynamodb/) - Fast, flexible NoSQL database service for single-digit millisecond performance at any scale.
* [etcd](https://github.com/etcd-io/etcd) - Distributed reliable key-value store for the most critical data of a distributed system.
* [EVCache](https://github.com/Netflix/EVCache) - A distributed in-memory data store for the cloud.
* [Memcached](https://github.com/memcached/memcached) - A high performance multithreaded event-based key/value cache store.
* [Redis](https://github.com/redis/redis) - An in-memory key-value database that persists on disk.
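
A minimal taste of the key-value model with `redis-py` (Redis is listed above); assumes a local server on the default port.

```python
# Set a key with a TTL and read it back; assumes Redis on localhost:6379.
import redis

r = redis.Redis(host="localhost", port=6379)
r.set("session:42", "active", ex=300)  # expires after 300 seconds
print(r.get("session:42"))             # b'active'
```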

### Relational Database

* [CockroachDB](https://github.com/cockroachdb/cockroach) - A distributed database designed to build, scale, and manage data-intensive apps.
* [Crate](https://github.com/crate/crate) - A distributed SQL database that makes it simple to store and analyze massive amounts of data.
* [MariaDB](https://github.com/MariaDB/server) - A drop-in replacement for MySQL with more features, new storage engines and better performance.
* [MySQL](https://github.com/mysql/mysql-server) - One of the most popular open source transactional databases.
* [PostgreSQL](https://github.com/postgres/postgres) - An advanced RDBMS that supports an extended subset of the SQL standard.
* [RQLite](https://github.com/rqlite/rqlite) - A lightweight, distributed relational database, which uses SQLite as its storage engine.
* [SQLite](https://github.com/sqlite/sqlite) - A popular choice as embedded database software for local/client storage.
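
SQLite (listed above) ships with Python's standard library, so a relational round trip needs no extra installs; the table and rows below are hypothetical.

```python
# In-memory SQLite round trip using only the standard library.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
conn.execute("INSERT INTO events (kind) VALUES (?)", ("page_view",))
print(conn.execute("SELECT id, kind FROM events").fetchall())  # [(1, 'page_view')]
conn.close()
```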

### Time Series Database

* [Akumuli](https://github.com/akumuli/Akumuli) - Can be used to capture, store and process time-series data in real-time.
* [Atlas](https://github.com/Netflix/Atlas) - An in-memory dimensional time series database.
* [InfluxDB](https://github.com/influxdata/influxdb) - Scalable datastore for metrics, events, and real-time analytics.
* [QuestDB](https://github.com/questdb/questdb) - An open source SQL database designed to process time series data faster.
* [TimescaleDB](https://github.com/timescale/timescaledb) - Open-source time-series SQL database optimized for fast ingest and complex queries.

### Vector Database

* [Milvus](https://github.com/milvus-io/milvus/) - An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy.
* [Pinecone](https://www.pinecone.io) - Managed and distributed vector similarity search used with a lightweight SDK.
* [Qdrant](https://github.com/qdrant/qdrant) - An open source vector similarity search engine with extended filtering support.
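
What these engines automate is nearest-neighbour search over embeddings. A brute-force NumPy sketch of the underlying idea, for illustration only (this is not any vendor's API):

```python
# Brute-force cosine-similarity search; real vector DBs add ANN indexes
# (e.g. HNSW), metadata filtering, and horizontal scaling on top of this.
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((1000, 128))  # 1000 stored embeddings
query = rng.random(128)
scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
print(np.argsort(scores)[-5:][::-1])  # indices of the 5 most similar vectors
```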

## File System

*Tools related to file system and data storage.*

* [Alluxio](https://github.com/Alluxio/alluxio) - A virtual distributed storage system.
* [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/) - Object storage built to retrieve any amount of data from anywhere.
* [Apache Hadoop Distributed File System (HDFS)](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) - A distributed file system.
* [GlusterFS](https://github.com/gluster/glusterfs) - A software defined distributed storage that can scale to several petabytes.
* [Google Cloud Storage (GCS)](https://cloud.google.com/storage) - Object storage for companies of all sizes, to store any amount of data.
* [LakeFS](https://github.com/treeverse/lakeFS) - Open source tool that transforms your object storage into a Git-like repository.
* [LizardFS](https://github.com/lizardfs/lizardfs) - A highly reliable, scalable and efficient distributed file system.
* [MinIO](https://github.com/minio/minio) - High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API.
* [SeaweedFS](https://github.com/chrislusf/seaweedfs) - A fast distributed storage system for blobs, objects, files, and data lake.
* [Swift](https://github.com/openstack/swift) - A distributed object storage system designed to scale from a single machine to thousands of servers.

## Logging and Monitoring

*Tools used for logging and monitoring data workflows; a metrics example follows the list.*

* [Grafana](https://github.com/grafana/grafana) - Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more.
* [Loki](https://github.com/grafana/loki) - A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus.
* [Prometheus](https://github.com/prometheus/prometheus) - A monitoring system and time series database.
* [Whylogs](https://github.com/whylabs/whylogs) - A tool for creating data logs, enabling monitoring for data drift and data quality issues.
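
As an example, exposing a custom application metric for Prometheus (listed above) with its official Python client; the port, metric name, and workload are assumptions.

```python
# Serve a counter metric that Prometheus can scrape from :8000/metrics.
import time

from prometheus_client import Counter, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
start_http_server(8000)  # exposes http://localhost:8000/metrics
while True:
    REQUESTS.inc()       # stand-in for real work being handled
    time.sleep(1)
```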

## Metadata Service

*Tools used for storing and serving metadata.*

* [Hive Metastore](https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore) - Service that stores metadata related to Apache Hive and other services.
* [Metacat](https://github.com/Netflix/metacat) - Provides you information about what data you have, where it resides and how to process it.

## SQL Playground

*Tools for testing and sharing SQL snippets in mock databases.*

* [RunSQL](https://runsql.com/) - Free online SQL playground for MySQL, PostgreSQL, and SQL Server.
* [SQLFiddle](https://sqlfiddle.com/) - Online SQL compiler for learning and practicing SQL.

## SQL Query Engine

*Tools for parallel processing SQL statements.*

* [Apache Drill](https://github.com/apache/drill) - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
* [Apache Impala](https://github.com/apache/impala) - Lightning-fast, distributed SQL queries for petabytes of data.
* [Dremio](https://www.dremio.com/) - Powers high-performing BI dashboards and interactive analytics directly on the data lake.
* [Presto](https://github.com/prestodb/presto) - A distributed SQL query engine for big data.
* [Trino](https://github.com/trinodb/trino) - A fast distributed SQL query engine for big data analytics.

---

# Resources

Where to discover new tools and discuss existing ones.

## Books

* [Data Mesh: Delivering Data-Driven Value at Scale](https://www.oreilly.com/library/view/data-mesh/9781492092384/) (O'Reilly)
* [Designing Data-Intensive Applications](https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/) (O'Reilly)
* [Fundamentals of Data Engineering](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/) (O'Reilly)
* [Getting Started with Impala](https://www.oreilly.com/library/view/getting-started-with/9781491905760/) (O'Reilly)
* [Learning and Operating Presto](https://www.oreilly.com/library/view/learning-and-operating/9781098141844/) (O'Reilly)
* [Learning Spark: Lightning-Fast Data Analytics](https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/) (O'Reilly)
* [Spark in Action](https://www.oreilly.com/library/view/spark-in-action/9781617295522/) (O'Reilly)
* [Spark: The Definitive Guide](https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/) (O'Reilly)

## Other Lists

* [Awesome Data Engineering](https://github.com/igorbarinov/awesome-data-engineering)
* [Awesome MLOps](https://github.com/kelvins/awesome-mlops)
* [DataOps Resource](https://github.com/chen1649chenli/dataOpsResource)

## Slack

* [Delta Lake Workspace](https://delta-users.slack.com/ssb/redirect)
* [Trino Workspace](https://trinodb.slack.com/ssb/redirect)

---

# Contributing

All contributions are welcome! Please take a look at the [contribution guidelines](https://github.com/kelvins/awesome-dataops/blob/main/CONTRIBUTING.md) first.


================================================
FILE: check_order.py
================================================

def check_order(lines, match_pattern, stop_pattern):
    """Check that the matched lines are alphabetically sorted (case-insensitive)."""
    data = list()
    for line in lines:
        if line.startswith(match_pattern):
            data.append(line)
        elif line.startswith(stop_pattern):
            # A stop marker closes the current block: validate it and start over.
            if data != sorted(data, key=str.casefold):
                raise Exception('The content is not alphabetically sorted!')
            data = list()
    # Also validate a trailing block that no stop marker follows.
    if data != sorted(data, key=str.casefold):
        raise Exception('The content is not alphabetically sorted!')


def check_menu(lines, level):
    """Check if a menu level of the Table of Contents is alphabetically sorted."""
    whitespaces = level * 4
    match_pattern = f'{" " * whitespaces}- ['
    # Stop at a horizontal rule or at an entry one indentation level up.
    stop_pattern = ('---', f'{" " * (whitespaces - 4)}- [')
    check_order(lines, match_pattern, stop_pattern)


def check_content(lines):
    """Check if the links under each section are alphabetically sorted."""
    match_pattern = '* ['
    stop_pattern = ('## ', '### ')
    check_order(lines, match_pattern, stop_pattern)


def main(path):
    """Check if the menus and the section contents are alphabetically sorted."""
    with open(path, 'r') as f:
        lines = f.readlines()
    check_menu(lines, level=1)
    check_menu(lines, level=2)
    check_content(lines)


if __name__ == '__main__':
    main('README.md')
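
# Usage sketch (hypothetical, not invoked by the CI workflow): the checks can
# be exercised directly with in-memory lines; unsorted input raises.
#
#     lines = ['    - [Data Catalog](#data-catalog)\n',
#              '    - [Data Exploration](#data-exploration)\n',
#              '---\n']
#     check_menu(lines, level=1)  # passes silently; swap the items to see it raise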


================================================
FILE: mlc_config.json
================================================
{
  "ignorePatterns": [
    {
      "pattern": "https://www.tableau.com"
    }
  ],
  "timeout": "20s",
  "retryOn429": true,
  "retryCount": 5,
  "fallbackRetryDelay": "30s",
  "aliveStatusCodes": [0, 200, 403]
}
SYMBOL INDEX (4 symbols across 1 file)

FILE: check_order.py
  function check_order (line 2) | def check_order(lines, match_pattern, stop_pattern):
  function check_menu (line 14) | def check_menu(lines, level):
  function check_content (line 22) | def check_content(lines):
  function main (line 29) | def main(path):
