Repository: datacontract/datacontract-specification Branch: main Commit: 145852c67604 Files: 55 Total size: 1.7 MB Directory structure: gitextract_glrnu_dz/ ├── .github/ │ ├── validate-examples │ └── workflows/ │ └── ci.yaml ├── .gitignore ├── CHANGELOG.md ├── CNAME ├── LICENSE ├── README.md ├── _config.yml ├── _layouts/ │ └── default.html ├── datacontract.init.yaml ├── datacontract.schema.json ├── definition.schema.json ├── diagrams/ │ ├── automation.drawio │ ├── datacontract.drawio │ └── favicon.drawio ├── examples/ │ ├── covid-cases/ │ │ ├── datacontract.html │ │ └── datacontract.yaml │ ├── datacontract.html │ ├── generate-catalog │ ├── index.html │ ├── muellimperium/ │ │ ├── data.csv │ │ ├── datacontract.html │ │ └── datacontract.yaml │ ├── orders-latest/ │ │ ├── datacontract.html │ │ └── datacontract.yaml │ ├── orders-latest-nested/ │ │ ├── datacontract.html │ │ └── datacontract.yaml │ ├── time-example/ │ │ ├── datacontract.html │ │ └── datacontract.yaml │ └── variant-json-example/ │ └── datacontract.yaml ├── gen-openapi-yaml ├── versions/ │ ├── 0.9.0/ │ │ ├── README.md │ │ ├── datacontract.init.yaml │ │ └── datacontract.schema.json │ ├── 0.9.1/ │ │ ├── README.md │ │ ├── datacontract.init.yaml │ │ └── datacontract.schema.json │ ├── 0.9.2/ │ │ ├── README.md │ │ ├── datacontract.init.yaml │ │ └── datacontract.schema.json │ ├── 0.9.3/ │ │ ├── README.md │ │ ├── datacontract.init.yaml │ │ ├── datacontract.schema.json │ │ └── definition.schema.json │ ├── 1.1.0/ │ │ ├── README.md │ │ ├── datacontract.init.yaml │ │ ├── datacontract.schema.json │ │ └── definition.schema.json │ ├── 1.2.0/ │ │ ├── datacontract.init.yaml │ │ ├── datacontract.schema.json │ │ └── definition.schema.json │ └── 1.2.1/ │ ├── datacontract.init.yaml │ ├── datacontract.schema.json │ └── definition.schema.json └── workshop.md ================================================ FILE CONTENTS ================================================ ================================================ FILE: 
.github/validate-examples ================================================ #!/bin/bash set -ex #function datacontract() { # docker run --rm -v "${PWD}:/home/datacontract" --platform linux/amd64 datacontract/cli:latest "$@" #} datacontract --version SCHEMA=datacontract.schema.json awk '/^```yaml$/{flag=1; next} /^```$/{print ""; flag=0; exit} flag' README.md > datacontract-from-readme.yaml datacontract lint datacontract-from-readme.yaml --schema $SCHEMA datacontract test --examples datacontract-from-readme.yaml --schema $SCHEMA # Compare with example? datacontract lint examples/orders-latest/datacontract.yaml --schema $SCHEMA datacontract test --examples examples/orders-latest/datacontract.yaml --schema $SCHEMA datacontract lint examples/orders-latest-nested/datacontract.yaml --schema $SCHEMA datacontract test --examples examples/orders-latest-nested/datacontract.yaml --schema $SCHEMA || true # examples are not nested datacontract lint examples/covid-cases/datacontract.yaml --schema $SCHEMA datacontract test --examples examples/covid-cases/datacontract.yaml --schema $SCHEMA || true ================================================ FILE: .github/workflows/ci.yaml ================================================ on: push: pull_request: workflow_call: name: CI jobs: test: if: false # skip as the example structure has changed with v1.1.0 runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Set up Python uses: actions/setup-python@v5 with: python-version: 3.11 - name: Install dependencies run: | python -m pip install --upgrade pip pip install datacontract-cli[all] datacontract --version - name: Validate examples run: .github/validate-examples ================================================ FILE: .gitignore ================================================ .idea/ *.bkp datacontract.schema.openapi-format.* .soda/ datacontract-from-readme.yaml .duckdb/ ================================================ FILE: CHANGELOG.md 
================================================
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.2.1] - 2025-09-24

### Added
- Support for data quality metrics that align with ODCS 3.1

### Changed
- Replaced threshold operators `mustBeGreaterThanOrEqualTo` with `mustBeGreaterOrEqualTo` and `mustBeLessThanOrEqualTo` with `mustBeLessOrEqualTo` to align with ODCS 3.1, even if it feels wrong...

## [1.2.0] - 2025-07-05

### Added
- Support for `models.additionalFields` to define if additional fields (columns) are allowed or not in the physical server ([#99](https://github.com/datacontract/datacontract-specification/pull/99))
- Added `time` data type ([#123](https://github.com/datacontract/datacontract-specification/issues/123))
- Added `variant` data type ([#113](https://github.com/datacontract/datacontract-specification/issues/113))
- Added `json` data type ([#112](https://github.com/datacontract/datacontract-specification/issues/112))

### Changed
- `server.type` changed from enum to simple string to support custom types ([#107](https://github.com/datacontract/datacontract-specification/pull/107))

## [1.1.0] - 2024-10-30

### Added
- Data quality on model and field level ([#55](https://github.com/datacontract/datacontract-specification/issues/55))
- Lineage support ([#90](https://github.com/datacontract/datacontract-specification/issues/90))
- Field and definition `examples` as array of any type, instead of `example` as a single value ([#29](https://github.com/datacontract/datacontract-specification/issues/29))
- Support for server-specific data types as config map ([#63](https://github.com/datacontract/datacontract-specification/issues/63))
- AWS Glue Catalog server support
- sftp server support
- info.status field
- oracle server support
- field.title attribute
- model.title
attribute - AWS Kinesis Data Streams server support - field.links attribute - Trino support - Field `type: map` support with properties `keys` and `values` - Definitions: `fields`, for type `object`, `record`, and `struct` - Field `field.primaryKey` (Replaces `field.primary`) - Field `model.primaryKey` to describe a composite primary key - Add Redshift server properties `clusterIdentifier`, `endpoint`, `host` and `port`. ### Removed - `definitions.domain` removed (use a hierarchical structure instead) - `definitions.name` removed (use a hierarchical structure instead) - `quality` on top-level removed - `examples` on top-level removed - `schema` removed in favor of encoding any physical schema configuration in the `model` using the `config` map at the field level and supporting import/export ([#21](https://github.com/datacontract/datacontract-specification/issues/21)). ### Deprecated - `field.primary` (use `field.primaryKey` instead) ## [0.9.3] - 2024-03-06 ### Added - Service levels as a top level `servicelevels` element - pubsub server support - primary key and relationship support via `field.primary` and `field.references` attributes - databricks server support improved ## [0.9.2] - 2024-01-04 ### Added - Format and validation attributes to fields in models and definitions - Postgres support - Databricks support ## [0.9.1] - 2023-11-19 ### Added - A logical data model (#13), mainly to simplify editor support with a defined schema, easier to detect breaking changes, and better Databricks support. - Definitions (#14) for reusable semantic definitions within one data contract or across data contracts. ### Removed - Property `info.dataProduct` as data products should define which data contracts they implement. - Property `info.outputPort` as data products should define which data contracts they implement. Those removals are not considered as breaking changes, as these attributes are now treated as specification extensions. 
## [0.9.0] - 2023-09-12 First public release. ================================================ FILE: CNAME ================================================ datacontract-specification.com ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2023 Data Mesh Architecture Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # Data Contract Specification Stars Slack Status > **Deprecation Notice** > With the release of the [Open Data Contract Standard v3.1.0](https://github.com/bitol-io/open-data-contract-standard), we deprecate the Data Contract Specification in line with our commitment to focus on a single industry standard for data contracts. We have actively contributed to the Open Data Contract Standard in the TSC and will continue to support it.

> If you are using Data Contract Specification, we recommend [migrating to the Open Data Contract Standard](#migration) within the next few months.
> The Data Contract Specification will be supported in Data Contract CLI and Entropy Data until the end of 2026.

![datacontract.png](images/datacontract.png)

Data contracts bring data providers and data consumers together.

A _data contract_ is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. Think of an API, but for data. A data contract is implemented by a data product or other data technologies, even legacy data warehouses. Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees.

The _data contract specification_ defines a YAML format to describe attributes of provided data sets. It is data platform neutral and can be used with any data platform, such as AWS S3, Google BigQuery, Azure, Databricks, and Snowflake. The data contract specification is an open initiative to define a common data contract format. It follows [OpenAPI](https://www.openapis.org/) and [AsyncAPI](https://www.asyncapi.com/) conventions.

If you haven't adopted a YAML format yet, we recommend starting directly with the [Open Data Contract Standard](https://github.com/bitol-io/open-data-contract-standard). It is considered the conceptual successor.

Data contracts come into play when data is exchanged between different teams or organizational units, such as in a [data mesh architecture](https://www.datamesh-architecture.com/). First and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted. They make semantic and quality expectations explicit. They are often created collaboratively in [workshops](./workshop.md) together with data providers and data consumers.
Later in development and production, they also serve as the basis for code generation, testing, schema validations, quality checks, monitoring, access control, and computational governance policies.

The specification comes along with the [Data Contract CLI](https://github.com/datacontract/datacontract-cli), an open-source tool to develop, validate, and enforce data contracts.

> _Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties.
> The term "contract" may be somewhat misleading, but it is how it is used by the industry.
> The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract.
> Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes._

Version
---

1.2.1 ([Changelog](CHANGELOG.md))

Example
---

View in [Data Contract Catalog](https://datacontract.com/examples/index.html)

```yaml
dataContractSpecification: 1.2.1
id: orders-latest
info:
  title: Orders Latest
  version: 2.0.0
  description: |
    Successful customer orders in the webshop.
    All orders since 2020-01-01.
    Orders with their line items are in their current state (no history included).
  owner: Checkout Team
  status: active
  contact:
    name: John Doe (Data Product Owner)
    url: https://teams.microsoft.com/l/channel/example/checkout
servers:
  production:
    type: s3
    environment: prod
    location: s3://datacontract-example-orders-latest/v2/{model}/*.json
    format: json
    delimiter: new_line
    description: "One folder per model. One file per day."
    roles:
      - name: analyst_us
        description: Access to the data for US region
      - name: analyst_cn
        description: Access to the data for China region
terms:
  usage: |
    Data can be used for reports, analytics and machine learning use cases.
    Orders may be linked and joined with other tables.
  limitations: |
    Not suitable for real-time use cases.
    Data may not be used to identify individual customers.
    Max data processing per day: 10 TiB
  policies:
    - name: privacy-policy
      url: https://example.com/privacy-policy
    - name: license
      description: External data is licensed under agreement 1234.
      url: https://example.com/license/1234
  billing: 5000 USD per month
  noticePeriod: P3M
models:
  orders:
    description: One record per order. Includes cancelled and deleted orders.
    type: table
    fields:
      order_id:
        $ref: '#/definitions/order_id'
        required: true
        unique: true
        primaryKey: true
      order_timestamp:
        description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
        type: timestamp
        required: true
        examples:
          - "2024-09-09T08:30:00Z"
        tags: ["business-timestamp"]
      order_total:
        description: Total amount in the smallest monetary unit (e.g., cents).
        type: long
        required: true
        examples:
          - 9999
        quality:
          - type: sql
            description: 95% of all order total values are expected to be between 10 and 499 EUR.
            query: |
              SELECT quantile_cont(order_total, 0.95) AS percentile_95
              FROM orders
            mustBeBetween: [1000, 49900]
      customer_id:
        description: Unique identifier for the customer.
        type: text
        minLength: 10
        maxLength: 20
      customer_email_address:
        description: The email address, as entered by the customer.
        type: text
        format: email
        required: true
        pii: true
        classification: sensitive
        quality:
          - type: text
            description: The email address is not verified and may be invalid.
        lineage:
          inputFields:
            - namespace: com.example.service.checkout
              name: checkout_db.orders
              field: email_address
      processed_timestamp:
        description: The timestamp when the record was processed by the data platform.
        type: timestamp
        required: true
        config:
          jsonType: string
          jsonFormat: date-time
    quality:
      - type: sql
        description: The maximum duration between two orders should be less than 3600 seconds
        query: |
          SELECT MAX(duration) AS max_duration
          FROM (
            SELECT EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp))) AS duration
            FROM orders
          )
        mustBeLessThan: 3600
      - type: sql
        description: Row Count
        query: |
          SELECT count(*) as row_count
          FROM orders
        mustBeGreaterThan: 5
    examples:
      - |
        order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp
        "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z"
        "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z"
        "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z"
        "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
        "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
        "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z"
        "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z"
        "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z"
        "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z"
        "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z"
  line_items:
    description: A single article that is part of an order.
    type: table
    fields:
      line_item_id:
        type: text
        description: Primary key of the line_items table
        required: true
      order_id:
        $ref: '#/definitions/order_id'
        references: orders.order_id
      sku:
        description: The purchased article number
        $ref: '#/definitions/sku'
    primaryKey: ["order_id", "line_item_id"]
    examples:
      - |
        line_item_id,order_id,sku
        "LI-1","1001","5901234123457"
        "LI-2","1001","4001234567890"
        "LI-3","1002","5901234123457"
        "LI-4","1002","2001234567893"
        "LI-5","1003","4001234567890"
        "LI-6","1003","5001234567892"
        "LI-7","1004","5901234123457"
        "LI-8","1005","2001234567893"
        "LI-9","1005","5001234567892"
        "LI-10","1005","6001234567891"
definitions:
  order_id:
    title: Order ID
    type: text
    format: uuid
    description: An internal ID that identifies an order in the online shop.
    examples:
      - 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
    pii: true
    classification: restricted
    tags:
      - orders
  sku:
    title: Stock Keeping Unit
    type: text
    pattern: ^[A-Za-z0-9]{8,14}$
    examples:
      - "96385074"
    description: |
      A Stock Keeping Unit (SKU) is an internal unique identifier for an article.
      It is typically associated with an article's barcode, such as the EAN/GTIN.
    links:
      wikipedia: https://en.wikipedia.org/wiki/Stock_keeping_unit
    tags:
      - inventory
servicelevels:
  availability:
    description: The server is available during support hours
    percentage: 99.9%
  retention:
    description: Data is retained for one year
    period: P1Y
    unlimited: false
  latency:
    description: Data is available within 25 hours after the order was placed
    threshold: 25h
    sourceTimestampField: orders.order_timestamp
    processedTimestampField: orders.processed_timestamp
  freshness:
    description: The age of the youngest row in a table.
    threshold: 25h
    timestampField: orders.order_timestamp
  frequency:
    description: Data is delivered once a day
    type: batch # or streaming
    interval: daily # for batch, either or cron
    cron: 0 0 * * * # for batch, either or interval
  support:
    description: The data is available during typical business hours at headquarters
    time: 9am to 5pm in EST on business days
    responseTime: 1h
  backup:
    description: Data is backed up once a week, every Sunday at 0:00 UTC.
    interval: weekly
    cron: 0 0 * * 0
    recoveryTime: 24 hours
    recoveryPoint: 1 week
tags:
  - checkout
  - orders
  - s3
links:
  datacontractCli: https://cli.datacontract.com
```

Migration
---

To migrate from the Data Contract Specification to the Open Data Contract Standard, you can use the [Data Contract CLI](https://github.com/datacontract/datacontract-cli):

```
uv tool install --python python3.11 --upgrade 'datacontract-cli[all]'
datacontract export --format odcs --output odcs.yaml datacontract.yaml
```

You can now continue to work with the _odcs.yaml_ file.

Data Contract CLI
---

The [Data Contract CLI](https://cli.datacontract.com) is a command line tool and Python library to lint, test, import and export data contracts (supporting Data Contract Specification and ODCS).

Here is a short example of how to verify that your actual dataset matches the data contract:

```bash
pip3 install "datacontract-cli[all]"
datacontract test https://datacontract.com/examples/orders-latest/datacontract.yaml
```

or, if you prefer Docker:

```bash
docker run datacontract/cli test https://datacontract.com/examples/orders-latest/datacontract.yaml
```

The Data Contract contains all required information to verify data:

- The _servers_ block has the connection details to the actual data set.
- The _models_ define the syntax, formats, and constraints.
- The _quality_ block defines further quality checks.

The Data Contract CLI chooses the appropriate engine, formulates test cases, connects to the server, and executes the tests, based on the server type.
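
As a minimal sketch of how the three blocks listed above fit together, the contract below uses only fields described in this specification: a `local` server, a single model, and one SQL quality check. The identifier, file path, model name, and threshold are hypothetical values chosen for illustration and are not part of the official examples.

```yaml
dataContractSpecification: 1.2.1
id: orders-minimal                # hypothetical identifier
info:
  title: Orders Minimal
  version: 0.1.0
servers:
  local:
    type: local
    path: ./data/orders.csv       # assumed location of the data file
    format: csv
models:
  orders:
    type: table
    fields:
      order_id:
        type: text
        required: true
        unique: true
      order_total:
        type: long
        required: true
    quality:
      - type: sql
        description: The table is expected to contain at least one row.
        query: |
          SELECT count(*) AS row_count
          FROM orders
        mustBeGreaterThan: 0      # illustrative threshold
```

Saved as `datacontract.yaml`, a file like this could be passed to `datacontract test datacontract.yaml` in the same way as the hosted example above.
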
More information and configuration options on [cli.datacontract.com](https://cli.datacontract.com). Specification --- ![The eight major categories in the data contract specification](images/categories.png) - [Data Contract Object](#data-contract-object) - [Info Object](#info-object) - [Contact Object](#contact-object) - [Server Object](#server-object) - [Terms Object](#terms-object) - [Model Object](#model-object) - [Field Object](#field-object) - [Definition Object](#definition-object) - [Service Level Object](#service-levels-object) - [Quality Object](#quality-object) - [Lineage Object](#lineage-object) - [Data Types](#data-types) - [Specification Extensions](#specification-extensions) [JSON Schema](https://github.com/datacontract/datacontract-specification/blob/main/datacontract.schema.json) of the Data Contract Specification. ### Data Contract Object This is the root document. It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. | Field | Type | Description | |---------------------------|--------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------| | dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | | id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | | info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | | servers | Map[`string`, [Server Object](#server-object)] | Specifies the servers of the data contract. | | terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | | models | Map[`string`, [Model Object](#model-object)] | Specifies the logical data model. | | definitions | Map[`string`, [Definition Object](#definition-object)] | Specifies definitions. 
| | servicelevels | [Service Levels Object](#service-levels-object) | Specifies the service level of the provided data | | links | Map[`string`, `string`] | Additional external documentation links. | | tags | Array of `string` | Custom metadata to provide additional context. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Info Object Metadata and life cycle information about the data contract. | Field | Type | Description | |-------------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| | title | `string` | REQUIRED. The title of the data contract. | | version | `string` | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). | | status | `string` | The status of the data contract. Can be `proposed`, `in development`, `active`, `deprecated`, `retired`. | | description | `string` | A description of the data contract. | | owner | `string` | The owner or team responsible for managing the data contract and providing the data. | | contact | [Contact Object](#contact-object) | Contact information for the data contract. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Contact Object Contact information for the data contract. | Field | Type | Description | |-------|----------|-------------------------------------------------------------------------------------------------------| | name | `string` | The identifying name of the contact person/organization. | | url | `string` | The URL pointing to the contact information. This _MUST_ be in the form of a URL. | | email | `string` | The email address of the contact person/organization. This _MUST_ be in the form of an email address. 
| This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Server Object The fields are dependent on the defined type. | Field | Type | Description | |-------------|----------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | REQUIRED. The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `clickhouse`, `s3`, `glue`, `redshift`, `azure`, `sqlserver`, `snowflake`, `databricks`, `postgres`, `oracle`, `kafka`, `pubsub`, `sftp`, `kinesis`, `trino`, `local` | | description | `string` | An optional string describing the server. | | environment | `string` | An optional string describing the environment, e.g., prod, sit, stg. | | roles | Array of [Server Role Object](#server-role-object) | An optional array of roles that are available and can be requested to access the server for role-based access control. E.g. separate roles for different regions or sensitive data. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### BigQuery Server Object | Field | Type | Description | |---------|----------|-----------------------| | type | `string` | `bigquery` | | project | `string` | The GCP project name. 
| | dataset | `string` | |

#### S3 Server Object

| Field       | Type     | Description                                                                                                             |
|-------------|----------|-------------------------------------------------------------------------------------------------------------------------|
| type        | `string` | `s3`                                                                                                                    |
| location    | `string` | S3 URL, starting with `s3://`                                                                                           |
| endpointUrl | `string` | The server endpoint for S3-compatible servers, such as MinIO or Google Cloud Storage, e.g., `https://minio.example.com` |
| format      | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv`                                                              |
| delimiter   | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array`        |

Example (AWS S3):

```yaml
servers:
  production:
    type: s3
    location: s3://acme-orders-prod/orders/
    format: json
    delimiter: new_line
```

Example (MinIO):

```yaml
servers:
  minio:
    type: s3
    endpointUrl: http://localhost:9000
    location: s3://my-bucket/path/
    format: delta
```

Example (Google Cloud Storage):

```yaml
servers:
  gcs:
    type: s3
    endpointUrl: https://storage.googleapis.com
    location: s3://my-bucket/path/*/*/*/*/*.parquet
    format: parquet
```

#### Redshift Server Object

| Field             | Type     | Description                                                                                                         |
|-------------------|----------|---------------------------------------------------------------------------------------------------------------------|
| type              | `string` | `redshift`                                                                                                          |
| account           | `string` |                                                                                                                     |
| database          | `string` |                                                                                                                     |
| schema            | `string` |                                                                                                                     |
| clusterIdentifier | `string` | Identifier of the cluster.
Example: `analytics-cluster` | | host | `string` | Host of the cluster.
Example: `analytics-cluster.example.eu-west-1.redshift.amazonaws.com` | | port | `number` | Port of the cluster.
Example: `5439` | | endpoint | `string` | Endpoint of the cluster
Example: `analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics` |

Example, specifying an endpoint:

```yaml
servers:
  analytics:
    type: redshift
    account: '123456789012'
    database: analytics
    schema: analytics
    endpoint: analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics
```

Example, specifying the cluster identifier:

```yaml
servers:
  analytics:
    type: redshift
    account: '123456789012'
    database: analytics
    schema: analytics
    clusterIdentifier: analytics-cluster
```

Example, specifying the cluster host:

```yaml
servers:
  analytics:
    type: redshift
    account: '123456789012'
    database: analytics
    schema: analytics
    host: analytics-cluster.example.eu-west-1.redshift.amazonaws.com
    port: 5439
```

#### Azure Server Object

| Field          | Type     | Description |
|----------------|----------|-------------|
| type           | `string` | `azure` |
| storageAccount | `string` | The storage account name that contains the files |
| location       | `string` | Path to Azure Blob Storage or Azure Data Lake Storage (ADLS) in the storage account, supports globs. Starting with `az://` or `abfss`
Recommended pattern is `abfss:///`. Examples: `az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet` or `abfss://my_container_name/path/*.parquet` |
| format    | `string` | Format of files, such as `parquet`, `json`, `csv` |
| delimiter | `string` | (Only for format = `json`), how multiple JSON documents are delimited within one file, e.g., `new_line`, `array` |

#### SQL-Server Server Object

| Field    | Type      | Description |
|----------|-----------|-------------|
| type     | `string`  | `sqlserver` |
| host     | `string`  | The host of the database server |
| port     | `integer` | The port of the database server, default: `1433` |
| database | `string`  | The name of the database, e.g., `database`. |
| schema   | `string`  | The name of the schema in the database, e.g., `dbo`. |
| driver   | `string`  | The name of the supported driver, e.g., `ODBC Driver 18 for SQL Server`. |

#### Snowflake Server Object

| Field    | Type     | Description |
|----------|----------|-------------|
| type     | `string` | `snowflake` |
| account  | `string` |             |
| database | `string` |             |
| schema   | `string` |             |

#### Databricks Server Object

| Field   | Type     | Description |
|---------|----------|-------------|
| type    | `string` | `databricks` |
| host    | `string` | The Databricks host, e.g., `dbc-abcdefgh-1234.cloud.databricks.com` |
| catalog | `string` | The name of the Hive or Unity catalog |
| schema  | `string` | The schema name in the catalog |

#### Postgres Server Object

| Field    | Type      | Description |
|----------|-----------|-------------|
| type     | `string`  | `postgres` |
| host     | `string`  | The host of the database server |
| port     | `integer` | The port of the database server |
| database | `string`  | The name of the database, e.g., `postgres`. |
| schema   | `string`  | The name of the schema in the database, e.g., `public`. |

#### Oracle Server Object

| Field       | Type      | Description |
|-------------|-----------|-------------|
| type        | `string`  | `oracle` |
| host        | `string`  | The host of the Oracle server |
| port        | `integer` | The port of the Oracle server |
| serviceName | `string`  | The name of the service |

#### Kafka Server Object

| Field  | Type     | Description |
|--------|----------|-------------|
| type   | `string` | `kafka` |
| host   | `string` | The bootstrap server of the Kafka cluster. |
| topic  | `string` | The topic name. |
| format | `string` | The format of the message. Examples: json, avro, protobuf. Default: json. |

#### Pub/Sub Server Object

| Field   | Type     | Description |
|---------|----------|-------------|
| type    | `string` | `pubsub` |
| project | `string` | The GCP project name. |
| topic   | `string` | The topic name. |

#### sftp Server Object

| Field     | Type     | Description |
|-----------|----------|-------------|
| type      | `string` | `sftp` |
| location  | `string` | SFTP URL, starting with `sftp://` |
| format    | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` |
| delimiter | `string` | (Only for format = `json`), how multiple JSON documents are delimited within one file, e.g., `new_line`, `array` |

#### AWS Kinesis Data Streams Server Object

| Field  | Type     | Description |
|--------|----------|-------------|
| type   | `string` | `kinesis` |
| stream | `string` | The name of the Kinesis data stream. |
| region | `string` | AWS region, e.g., `eu-west-1`. |
| format | `string` | The format of the records. Examples: json, avro, protobuf. |

#### Trino Server Object

| Field   | Type      | Description |
|---------|-----------|-------------|
| type    | `string`  | `trino` |
| host    | `string`  | The Trino host |
| port    | `integer` | The Trino port |
| catalog | `string`  | The name of the catalog, e.g., `my_catalog`. |
| schema  | `string`  | The name of the schema in the catalog, e.g., `my_schema`. |

#### Local Server Object

| Field  | Type     | Description |
|--------|----------|-------------|
| type   | `string` | `local` |
| path   | `string` | The relative or absolute path to the data file(s), such as `./folder/data.parquet`. |
| format | `string` | The format of the file(s), such as `parquet`, `delta`, `csv`, or `json`. |

#### Server Role Object

| Field       | Type     | Description |
|-------------|----------|-------------|
| name        | `string` | Name of the role |
| description | `string` | A description of the role and what access the role provides. |

### Terms Object

The terms and conditions of the data contract.

| Field        | Type                                     | Description |
|--------------|------------------------------------------|-------------|
| usage        | `string`                                 | The usage describes the way the data is expected to be used. Can contain business and technical information. |
| limitations  | `string`                                 | The limitations describe restrictions on how the data can be used. These can be technical restrictions or restrictions on what the data may not be used for. |
| policies     | Array of [Policy Object](#policy-object) | A list of policies, licenses, and standards that are applicable for this data contract and that must be acknowledged by data consumers. |
| billing      | `string`                                 | The billing describes the pricing model for using the data, such as whether it's free, has a monthly fee, or is metered pay-per-use. |
| noticePeriod | `string`                                 | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Policy Object

| Field       | Type     | Description |
|-------------|----------|-------------|
| name        | `string` | Name of the policy. |
| description | `string` | A description of the policy. |
| url         | `string` | A URL that refers to the policy. |

### Model Object

The Model Object describes the structure and semantics of a data model, such as tables, views, or structured files.

The name of the data model (table name) is defined by the key that refers to this Model Object.

| Field            | Type                                         | Description |
|------------------|----------------------------------------------|-------------|
| type             | `string`                                     | The type of the model. Examples: `table`, `view`, `object`. Default: `table`. |
| description      | `string`                                     | An optional string describing the data model. |
| title            | `string`                                     | An optional string for the title of the data model. Especially useful if the name of the model is cryptic or contains abbreviations. |
| fields           | Map[`string`, [Field Object](#field-object)] | The fields (e.g. columns) of the data model. |
| primaryKey       | Array of `string`                            | If the primary key is a compound key, list the field names that constitute the primary key. Alternative to field-level `primaryKey`. |
| quality          | Array of [Quality Object](#quality-object)   | Specifies the quality attributes on model level. |
| examples         | Array of `Any`                               | Specifies example data sets for the model. |
| additionalFields | `Boolean`                                    | Specifies whether the model can have additional fields that are not defined in the contract. Default: `false`. |
| config           | [Config Object](#config-object)              | Any additional key-value pairs that might be useful for further tooling. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

### Field Object

The Field Object describes one field (column, property, nested field) of a data model.

| Field            | Type                                         | Description |
|------------------|----------------------------------------------|-------------|
| description      | `string`                                     | An optional string describing the semantics of the data in this field. |
| type             | [Data Type](#data-types)                     | The logical data type of the field. |
| title            | `string`                                     | An optional string providing a human readable name for the field. Especially useful if the field name is cryptic or contains abbreviations. |
| enum             | array of `string`                            | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. |
| required         | `boolean`                                    | Indicates whether this field must contain a value and may not be null. Default: `false` |
| primaryKey       | `boolean`                                    | Indicates whether this field is a primary key. Default: `false` |
| references       | `string`                                     | The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship. |
| unique           | `boolean`                                    | Indicates whether the value must be unique within the model. Default: `false` |
| format           | `string`                                     | `email`: A value must be compliant with [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).<br>`uri`: A value must be compliant with [RFC 3986](https://www.rfc-editor.org/info/rfc3986).<br>`uuid`: A value must be compliant with [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| precision        | `number`                                     | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. |
| scale            | `number`                                     | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. |
| minLength        | `number`                                     | The length of a value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| maxLength        | `number`                                     | The length of a value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| pattern          | `string`                                     | A value must be valid according to the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| minimum          | `number`                                     | A numeric value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values. |
| exclusiveMinimum | `number`                                     | A numeric value must be greater than this value. Only evaluated if the value is not null. Only applies to numeric values. |
| maximum          | `number`                                     | A numeric value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values. |
| exclusiveMaximum | `number`                                     | A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values. |
| ~~example~~      | `string`                                     | DEPRECATED, use examples. An example value. |
| examples         | Array of Any                                 | A list of example values. |
| pii              | `boolean`                                    | Indicates whether this field contains Personally Identifiable Information (PII). |
| classification   | `string`                                     | The data class defining the sensitivity level for this field, according to the organization's classification scheme. Examples may be: `sensitive`, `restricted`, `internal`, `public`. |
| tags             | Array of `string`                            | Custom metadata to provide additional context. |
| links            | Map[`string`,`string`]                       | Additional external documentation links. |
| $ref             | `string`                                     | A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition. |
| fields           | Map[`string`, [Field Object](#field-object)] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is `object`, `record`, or `struct`. |
| items            | [Field Object](#field-object)                | The type of the elements in the array. Use only when type is `array`. |
| keys             | [Field Object](#field-object)                | Describes the key structure of a map. Defaults to `type: string` if a map is defined as type. Not all server types support different key types. Use only when type is `map`. |
| values           | [Field Object](#field-object)                | Describes the value structure of a map. Use only when type is `map`. |
| quality          | Array of [Quality Object](#quality-object)   | Specifies the quality attributes on field level. |
| lineage          | [Lineage Object](#lineage-object)            | Provides information where the data comes from. |
| config           | [Config Object](#config-object)              | Any additional key-value pairs that might be useful for further tooling. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

### Definition Object

The Definition Object provides a clear and concise explanation of the syntax, semantics, and classification of a business object in a given domain.
It serves as a reference for a common understanding of terminology, ensures consistent usage, and helps to identify joinable fields.
Model fields can refer to definitions using the `$ref` field to link to existing definitions and avoid duplicate documentation.

| Field            | Type                                         | Description |
|------------------|----------------------------------------------|-------------|
| type             | [Data Type](#data-types)                     | REQUIRED. The logical data type |
| title            | `string`                                     | The business name of this definition. |
| description      | `string`                                     | Clear and concise explanations related to the domain |
| enum             | array of `string`                            | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. |
| format           | `string`                                     | `email`: A value must be compliant with [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).<br>`uri`: A value must be compliant with [RFC 3986](https://www.rfc-editor.org/info/rfc3986).<br>`uuid`: A value must be compliant with [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| precision        | `number`                                     | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. |
| scale            | `number`                                     | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. |
| minLength        | `number`                                     | The length of a value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| maxLength        | `number`                                     | The length of a value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| pattern          | `string`                                     | A value must be valid according to the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| minimum          | `number`                                     | A numeric value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values. |
| exclusiveMinimum | `number`                                     | A numeric value must be greater than this value. Only evaluated if the value is not null. Only applies to numeric values. |
| maximum          | `number`                                     | A numeric value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values. |
| exclusiveMaximum | `number`                                     | A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values. |
| examples         | Array of Any                                 | A list of example values. |
| pii              | `boolean`                                    | Indicates whether this field contains Personally Identifiable Information (PII). |
| classification   | `string`                                     | The data class defining the sensitivity level for this field, according to the organization's classification scheme. |
| tags             | Array of `string`                            | Custom metadata to provide additional context. |
| links            | Map[`string`, `string`]                      | Additional external documentation links. |
| fields           | Map[`string`, [Field Object](#field-object)] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is `object`, `record`, or `struct`. |
| items            | [Field Object](#field-object)                | The type of the elements in the array. Use only when type is `array`. |
| keys             | [Field Object](#field-object)                | Describes the key structure of a map. Defaults to `type: string` if a map is defined as type. Not all server types support different key types. Use only when type is `map`. |
| values           | [Field Object](#field-object)                | Describes the value structure of a map. Use only when type is `map`. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

### Service Levels Object

A service level is defined as an agreed-upon, measurable level of performance for the provided data.
The Data Contract Specification defines well-known service levels.
This list can be extended with custom service levels.

One can either describe each service level informally using the `description` field, or make use of the predefined fields for automation support, e.g., via the [Data Contract CLI](https://cli.datacontract.com).

| Field        | Type                                        | Description |
|--------------|---------------------------------------------|-------------|
| availability | [Availability Object](#availability-object) | The promised uptime of the system that provides the data |
| retention    | [Retention Object](#retention-object)       | The period how long data will be available. |
| latency      | [Latency Object](#latency-object)           | The maximum amount of time from the source to its destination. |
| freshness    | [Freshness Object](#freshness-object)       | The maximum age of the youngest entry. |
| frequency    | [Frequency Object](#frequency-object)       | The update frequency. |
| support      | [Support Object](#support-object)           | The times when support is provided. |
| backup       | [Backup Object](#backup-object)             | The details about data backup procedures. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Availability Object

Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data.

| Field       | Type     | Description |
|-------------|----------|-------------|
| description | `string` | An optional string describing the availability service level. |
| percentage  | `string` | An optional string describing the guaranteed uptime in percent (e.g., `99.9%`) |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Retention Object

Retention covers how long data will be available.

| Field          | Type      | Description |
|----------------|-----------|-------------|
| description    | `string`  | An optional string describing the retention service level. |
| period         | `string`  | An optional period of time, how long data is available. Supported formats: Simple duration (e.g., `1 year`, `30d`) and ISO 8601 duration (e.g., `P1Y`). |
| unlimited      | `boolean` | An optional indicator that data is kept forever. |
| timestampField | `string`  | An optional reference to the field that contains the timestamp that the period refers to. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Latency Object

Latency refers to the maximum amount of time from the source to its destination.
An example is the maximum duration between an order being recorded in the ecommerce shop and its availability in the orders table in the data analytics platform.
This includes the waiting times until the next batch run is started and the processing time of the pipeline.

| Field                   | Type     | Description |
|-------------------------|----------|-------------|
| description             | `string` | An optional string describing the latency service level. |
| threshold               | `string` | An optional maximum duration between the source timestamp and the processed timestamp. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`). |
| sourceTimestampField    | `string` | An optional reference to the field that contains the timestamp when the data was provided at the source. |
| processedTimestampField | `string` | An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Freshness Object

Freshness refers to the maximum age of the youngest entry.

| Field          | Type     | Description |
|----------------|----------|-------------|
| description    | `string` | An optional string describing the freshness service level. |
| threshold      | `string` | An optional maximum age of the youngest entry. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`). |
| timestampField | `string` | An optional reference to the field that contains the timestamp that the threshold refers to. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Frequency Object

Frequency describes how often data is updated.

| Field       | Type     | Description |
|-------------|----------|-------------|
| description | `string` | An optional string describing the frequency service level. |
| type        | `string` | An optional type of data processing. Typical values are `batch`, `micro-batching`, `streaming`, `manual`. |
| interval    | `string` | Optional. Only for batch: How often the pipeline is triggered, e.g., `daily`. |
| cron        | `string` | Optional. Only for batch: A cron expression that defines when the pipeline is triggered, e.g., `0 0 * * *`. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Support Object

Support describes the times when support will be available for contact.

| Field        | Type     | Description |
|--------------|----------|-------------|
| description  | `string` | An optional string describing the support service level. |
| time         | `string` | An optional string describing the times when support will be available for contact such as `24/7` or `business hours only`. |
| responseTime | `string` | An optional string describing the time it takes for the support team to acknowledge a request. This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Backup Object

Backup specifies details about data backup procedures.

| Field         | Type     | Description |
|---------------|----------|-------------|
| description   | `string` | An optional string describing the backup service level. |
| interval      | `string` | An optional interval that defines how often data will be backed up, e.g., `daily`. |
| cron          | `string` | An optional cron expression when data will be backed up, e.g., `0 0 * * *`. |
| recoveryTime  | `string` | An optional Recovery Time Objective (RTO) that specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours). |
| recoveryPoint | `string` | An optional Recovery Point Objective (RPO) that defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours). |

### Quality Object

The quality object defines quality attributes.

Quality attributes are checks that can be applied to the data to ensure its quality.
Data can be verified by executing these checks through a data quality engine.

Quality attributes can be:

- A text in natural language that describes the quality of the data.
- A predefined metric from the library of commonly used metrics.
- An individual SQL query that returns a single value that can be compared.
- Engine-specific types: Pre-defined quality checks, as defined by data quality libraries. Currently, the engines `soda` and `great-expectations` are supported.

A quality object can be specified on the field level and on the model level.
The top-level quality object is deprecated.
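
As a minimal sketch of where quality attributes can be placed, the following fragment attaches one attribute on field level and one on model level. The model name `orders`, the field name `order_id`, and the check descriptions are hypothetical and only serve as an illustration; the `text` and `sql` types, the threshold keys, and the `{model}` placeholder are the ones described in the sections below.

```yaml
models:
  orders:                       # hypothetical model name
    fields:
      order_id:                 # hypothetical field name
        type: string
        quality:                # quality attributes on field level
          - type: text
            description: Must be unique and must not be empty.
    quality:                    # quality attributes on model level
      - type: sql
        description: The model must contain at least one row.
        query: |
          SELECT COUNT(*) FROM {model}
        mustBeGreaterThan: 0
```

Placing a check on field level keeps it next to the field it describes, while checks that span several fields or the whole table fit better on model level.
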
#### Description Text

A description in natural language that defines the expected quality of the data.
This is useful to express requirements or expectations when discussing the data contract with stakeholders.
Later in the development process, these might be translated into an executable check (such as `sql`).
It can also be used as a prompt to check the data with an AI engine.

| Field       | Type     | Description |
|-------------|----------|-------------|
| type        | `string` | `text` |
| description | `string` | A plain text describing the quality attribute in natural language. |

Example:

```yaml
models:
  my_table:
    fields:
      account_iban:
        quality:
          - type: text
            description: Must be a valid IBAN. Must not be empty.
```

#### SQL

An individual SQL query that returns a single number that can be compared with a threshold.
The SQL query must be in the SQL dialect of the provided server.

> __Note:__ Establish a secure development process and use read-only connections, as the misuse of SQL queries can lead to SQL injection attacks.

| Field                      | Type                  | Description |
|----------------------------|-----------------------|-------------|
| type                       | `string`              | `sql` |
| description                | `string`              | A plain text describing the quality of the data. |
| query                      | `string`              | A SQL query that returns a single number to compare with the threshold. |
| dialect                    | `string`              | The SQL dialect that is used for the query. Should be compatible with the server type. Examples: `postgres`, `spark`, `bigquery`, `snowflake`, `duckdb`, ... |
| mustBe                     | `integer`             | The threshold to check the return value of the query |
| mustNotBe                  | `integer`             | The threshold to check the return value of the query |
| mustBeGreaterThan          | `integer`             | The threshold to check the return value of the query |
| mustBeGreaterThanOrEqualTo | `integer`             | The threshold to check the return value of the query |
| mustBeLessThan             | `integer`             | The threshold to check the return value of the query |
| mustBeLessThanOrEqualTo    | `integer`             | The threshold to check the return value of the query |
| mustBeBetween              | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. |
| mustNotBeBetween           | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. |

In the query the following placeholders can be used:

| Placeholder | Description |
|-------------|-------------|
| `{model}`   | The name of the model that is checked. |
| `{table}`   | Alias for `{model}`. |
| `{field}`   | The name of the field that is checked (only if the quality is defined on field-level). |
| `{column}`  | Alias for `{field}`. |

Example:

```yaml
models:
  orders:
    quality:
      - type: sql
        description: The maximum duration between two orders must be less than 3600 seconds
        query: |
          SELECT MAX(duration) AS max_duration
          FROM (
            SELECT EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp))) AS duration
            FROM {model}
          )
        mustBeLessThan: 3600
```

SQL queries allow powerful checks for custom business logic.
A SQL query should not run longer than 10 minutes.

#### Library / Metrics

A set of predefined metrics commonly used in data quality checks, designed to be compatible with all major data quality engines. This simplifies the work for data engineers by eliminating the need to manually write SQL queries.

These metrics are aligned with ODCS 3.1.

| Field                  | Type                  | Description |
|------------------------|-----------------------|-------------|
| type                   | `string`              | `library` (can be omitted, if `metric` is defined) |
| metric                 | `string`              | `nullValues`, `missingValues`, `invalidValues`, `duplicateValues`, or `rowCount` |
| arguments              | `object`              | Some metrics require additional arguments |
| description            | `string`              | A plain text describing the quality of the data. |
| mustBe                 | `integer`             | The threshold to check the return value of the query |
| mustNotBe              | `integer`             | The threshold to check the return value of the query |
| mustBeGreaterThan      | `integer`             | The threshold to check the return value of the query |
| mustBeGreaterOrEqualTo | `integer`             | The threshold to check the return value of the query |
| mustBeLessThan         | `integer`             | The threshold to check the return value of the query |
| mustBeLessOrEqualTo    | `integer`             | The threshold to check the return value of the query |
| mustBeBetween          | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. |
| mustNotBeBetween       | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. |
| unit                   | `string`              | `rows` (default) or `percent` |

Metrics:

| Metric            | Level    | Description | Arguments | Arguments Example |
|-------------------|----------|-------------|-----------|-------------------|
| `nullValues`      | Property | Counts null values in a column/field | None | |
| `missingValues`   | Property | Counts values considered as missing (empty strings, N/A, etc.) | `missingValues`: Array of values considered missing | `missingValues: [null, '', 'N/A']` |
| `invalidValues`   | Property | Counts values that don't match valid criteria | `validValues`: Array of valid values<br>`pattern`: Regex pattern | `validValues: ['pounds', 'kg']`<br>`pattern: '^[A-Z]{2}[0-9]{2}...'` |
| `duplicateValues` | Property | Counts duplicate values in a column | None | |
| `duplicateValues` | Schema   | Counts duplicate values across multiple columns | `properties`: Array of property names | `properties: ['tenant_id', 'order_id']` |
| `rowCount`        | Schema   | Counts total number of rows in a table/object store | None | |

Example:

```yaml
properties:
  - name: email_address
    quality:
      - metric: missingValues
        arguments:
          missingValues: [null, '', 'N/A', 'n/a']
        mustBeLessThan: 5
        unit: percent # rows (default) or percent
```

#### Custom

You can define custom quality attributes that are specific to a data quality engine.

#### Custom (Engine: Soda)

Soda has a number of predefined quality [checks](https://docs.soda.io/soda/data-contracts-checks.html) that can be referenced as quality attributes.

Soda checks can be applied on model and field level.

> Note: Soda Data contract check reference is experimental and may change in the future. Currently only supported by Postgres, Snowflake, and Spark (Databricks)

| Field          | Type     | Description |
|----------------|----------|-------------|
| type           | `string` | `custom` |
| description    | `string` | Optional. A plain text describing the quality attribute in natural language. |
| engine         | `string` | `soda` |
| implementation | `object` | A check type as defined in the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) |

See the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) for all possible types and configuration values.

Example:

```yaml
models:
  my_table:
    fields:
      order_id:
        type: string
        quality:
          - type: custom
            description: This is a check on field level
            engine: soda
            implementation:
              type: no_duplicate_values
      carrier:
        type: string
      shipment_numer:
        type: string
    quality:
      - type: custom
        description: This is a check on model level
        engine: soda
        implementation:
          type: duplicate_percent
          columns:
            - carrier
            - shipment_numer
          must_be_less_than: 1.0
      - type: custom
        description: This is a check on model level
        engine: soda
        implementation:
          type: row_count
          must_be_greater_than: 500000
```

#### Custom (Engine: Great Expectations)

Quality attributes defined as Great Expectations [Expectation](https://greatexpectations.io/expectations/).

Expectations are applied on model level.

| Field          | Type     | Description |
|----------------|----------|-------------|
| description    | `string` | Optional. A plain text describing the quality attribute in natural language. |
| engine         | `string` | `great-expectations` |
| implementation | `object` | An expectation type as listed in [Expectation](https://greatexpectations.io/expectations/) as YAML. |

Example:

```yaml
models:
  my_table:
    quality:
      - type: custom
        engine: great-expectations
        implementation:
          expectation_type: expect_table_row_count_to_be_between
          kwargs:
            min_value: 10000
            max_value: 50000
          meta:
            notes: "This expectation is crucial to avoid processing datasets that are too small or too large."
      - type: custom
        engine: great-expectations
        description: "Check that passenger_count values are between 1 and 6."
        implementation:
          expectation_type: expect_column_values_to_be_between
          kwargs:
            column: passenger_count
            max_value: 6
            min_value: 1
            mostly: 1.0
            strict_max: false
            strict_min: false
          meta:
            tags:
              - business-critical
              - range_check
```

### Lineage Object

Field level lineage provides optional fine-grained information where the data comes from and how it was transformed.
The lineage object is based on the OpenLineage [Column Level Lineage Dataset Facet](https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet) to describe the input fields.

| Field       | Type                                             | Description                                                                                                                                                                              |
|-------------|--------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| inputFields | Array of [InputField Object](#inputfield-object) | The input fields refer to specific fields, columns, or data points from source systems or other data contracts that feed into a particular transformation, calculation, or final result. |

#### InputField Object

| Field           | Type                                                     | Description |
|-----------------|----------------------------------------------------------|-------------|
| namespace       | `string`                                                 | The input dataset namespace, such as the name of the source system or the domain of another data contract. Examples: `com.example.crm`, `checkout`, `snowflake://{account name}`. [More on namespace](https://openlineage.io/blog/whats-in-a-namespace/#namespaces-in-the-spec) |
| name            | `string`                                                 | The input dataset name, such as a reference to a data contract, a fully qualified table name, a Kafka topic. |
| field           | `string`                                                 | The input field name, such as the field in an upstream data contract, a table column or a JSON Path. |
| transformations | Array of [Transformation Object](#transformation-object) | Optional. This describes how the input field data was used to generate the final result. |

#### Transformation Object

| Field       | Type      | Description |
|-------------|-----------|-------------|
| type        | `string`  | Indicates how direct the relationship is, e.g., in a query. Allowed values are: `DIRECT` and `INDIRECT`. |
| subtype     | `string`  | Optional. Contains more specific information about the transformation. Allowed values for type `DIRECT`: `IDENTITY`, `TRANSFORMATION`, `AGGREGATION`. Allowed values for type `INDIRECT`: `JOIN`, `GROUP_BY`, `FILTER`, `SORT`, `WINDOW`, `CONDITIONAL`. |
| description | `string`  | Optional. A string representation of the transformation applied. |
| masking     | `boolean` | Optional. Boolean value indicating if the input value was obfuscated during the transformation. |

Example:

```yaml
models:
  orders:
    fields:
      order_id:
        type: string
        lineage:
          inputFields:
            - namespace: com.example.service.checkout
              name: checkout_db.orders
              field: order_id
              transformations:
                - type: DIRECT
                  subtype: IDENTITY
                  description: The order ID from the checkout order
            - namespace: com.example.service.checkout
              name: checkout_db.orders
              field: order_timestamp
              transformations:
                - type: INDIRECT
                  subtype: SORT
      customer_email_address_hash:
        type: string
        lineage:
          inputFields:
            - namespace: com.example.service.checkout
              name: checkout_db.orders
              field: email_address
              transformations:
                - type: DIRECT
                  subtype: TRANSFORMATION
                  description: The email address from the checkout order, hashed with SHA-256
                  masking: true
```

### Config Object

The config field can be used to set additional metadata that may be used by tools, e.g. to define a namespace for code generation, specify physical data types, toggle tests, etc.

A config field can be added with any name. The value can be null, a primitive, an array or an object.

For developer experience, a list of well-known field names is maintained here, as these fields are used in the Data Contract CLI:

| Field           | Type     | Description                                                                                                     |
|-----------------|----------|-----------------------------------------------------------------------------------------------------------------|
| avroNamespace   | `string` | (Only on model level) The namespace to use when importing and exporting the data model from / to Apache Avro.   |
| avroType        | `string` | (Only on field level) Specify the field type to use when exporting the data model to Apache Avro.               |
| avroLogicalType | `string` | (Only on field level) Specify the logical field type to use when exporting the data model to Apache Avro.       |
| bigqueryType    | `string` | (Only on field level) Specify the physical column type that is used in a BigQuery table, e.g., `NUMERIC(5, 2)`  |
| snowflakeType   | `string` | (Only on field level) Specify the physical column type that is used in a Snowflake table, e.g., `TIMESTAMP_LTZ` |
| redshiftType    | `string` | (Only on field level) Specify the physical column type that is used in a Redshift table, e.g., `SMALLINT`       |
| sqlserverType   | `string` | (Only on field level) Specify the physical column type that is used in a SQL Server table, e.g., `DATETIME2`    |
| databricksType  | `string` | (Only on field level) Specify the physical column type that is used in a Databricks table                       |
| glueType        | `string` | (Only on field level) Specify the physical column type that is used in an AWS Glue Data Catalog table           |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).
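The level restrictions in the table above (model level versus field level) can be expressed as a small check. The following is a minimal sketch under stated assumptions: the helper name `misplaced_config_keys` and the two key sets are hypothetical and simply mirror the table; they are not taken from the Data Contract CLI. Keys that are not well-known are ignored, because a config field can be added with any name.

```python
# Well-known config keys, copied from the table above.
MODEL_LEVEL_KEYS = {"avroNamespace"}
FIELD_LEVEL_KEYS = {
    "avroType", "avroLogicalType", "bigqueryType", "snowflakeType",
    "redshiftType", "sqlserverType", "databricksType", "glueType",
}


def misplaced_config_keys(config: dict, level: str) -> list:
    """Return well-known keys used at the wrong level, sorted by name.

    `level` is either "model" or "field". Custom keys are never reported.
    """
    if level not in ("model", "field"):
        raise ValueError("level must be 'model' or 'field'")
    # At model level the field-only keys are misplaced, and vice versa.
    wrong_here = FIELD_LEVEL_KEYS if level == "model" else MODEL_LEVEL_KEYS
    return sorted(key for key in config if key in wrong_here)
```

For instance, calling the helper with `{"avroNamespace": "my.namespace", "snowflakeType": "timestamp_tz"}` and level `"field"` reports only `avroNamespace`.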
Example:

```yaml
models:
  orders:
    config:
      avroNamespace: "my.namespace"
    fields:
      my_field_1:
        description: Example for AVRO with Timestamp (millisecond precision)
        type: timestamp
        config:
          avroType: long
          avroLogicalType: timestamp-millis
          snowflakeType: timestamp_tz
```

### Data Types

The following data types are supported for model fields and definitions:

- Unicode character sequence: `string`, `text`, `varchar`
- Any numeric type, either integers or floating point numbers: `number`, `decimal`, `numeric`
- 32-bit signed integer: `int`, `integer`
- 64-bit signed integer: `long`, `bigint`
- Single precision (32-bit) IEEE 754 floating-point number: `float`
- Double precision (64-bit) IEEE 754 floating-point number: `double`
- Binary value: `boolean`
- Timestamp with timezone: `timestamp`, `timestamp_tz`
- Timestamp with no timezone: `timestamp_ntz`
- Date with no time information: `date`
- Time with no date information: `time`
- Array: `array`
- Map: `map` (may not be supported by some server types)
- Sequence of 8-bit unsigned bytes: `bytes`
- Complex type: `object`, `record`, `struct`
- Semi-structured data: `variant` (may not be supported by some server types)
- JSON data: `json` (may not be supported by some server types)
- No value: `null`

### Specification Extensions

While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points.

A custom field can be added with any name. The value can be null, a primitive, an array or an object.

Tooling
---

- [Data Contract CLI](https://github.com/datacontract/datacontract-cli) is an open-source CLI tool to help you create, develop, and maintain your data contracts.
- [Data Contract Manager](https://www.datamesh-manager.com/) is a commercial tool to manage data contracts. It includes a data contract catalog, a Web-Editor, and a request and approval workflow to automate access to data products for a full enterprise data marketplace.
- [Data Contract GPT](https://gpt.datacontract.com) is a custom GPT that can help you write data contracts.
- [Data Contract Editor](https://editor.datacontract.com) is an open-source editor for Data Contracts, including a live HTML preview.

Code Completion
---

The [JSON Schema](https://datacontract.com/datacontract.schema.json) of the current data contract specification is registered in [Schema Store](https://www.schemastore.org/), which brings code completion and syntax checks for all major IDEs.
IntelliJ comes with a built-in YAML plugin which will show you autocompletions.
For VS Code we recommend installing the [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) plugin.
No additional configuration is required.

Autocompletion is then enabled for files following these patterns:

```
datacontract.yaml
datacontract.yml
*-datacontract.yaml
*-datacontract.yml
*.datacontract.yaml
*.datacontract.yml
datacontract-*.yaml
datacontract-*.yml
**/datacontract/*.yml
**/datacontract/*.yaml
**/datacontracts/*.yml
**/datacontracts/*.yaml
```

Authors
---

The Data Contract Specification was originally created by [Jochen Christ](https://www.linkedin.com/in/jochenchrist/) and [Dr. Simon Harrer](https://www.linkedin.com/in/simonharrer/), and is currently maintained by them.

Contributing
---

Contributions are welcome! Please open an issue or a pull request.

License
---

[MIT License](LICENSE)

================================================
FILE: _config.yml
================================================
plugins:
  - jekyll-sitemap
name: Data Contract Specification
title: null
description: Data contracts bring data providers and data consumers together.

================================================
FILE: _layouts/default.html
================================================
{% seo %}
{% if site.title and site.title != page.title %}

{{ site.title }}

{% endif %} {{ content }} {% if site.github.private != true and site.github.license %} {% endif %}
{% if site.google_analytics %} {% endif %} ================================================ FILE: datacontract.init.yaml ================================================ dataContractSpecification: 1.2.1 id: my-data-contract-id info: title: My Data Contract version: 0.0.1 # description: # owner: # contact: # name: # url: # email: ### servers #servers: # production: # type: s3 # location: s3:// # format: parquet # delimiter: new_line ### terms #terms: # usage: # limitations: # billing: # noticePeriod: ### models # models: # my_model: # description: # type: # fields: # my_field: # type: # description: ### definitions # definitions: # my_field: # domain: # name: # title: # type: # description: # example: # pii: # classification: ### servicelevels #servicelevels: # availability: # description: The server is available during support hours # percentage: 99.9% # retention: # description: Data is retained for one year because! # period: P1Y # unlimited: false # latency: # description: Data is available within 25 hours after the order was placed # threshold: 25h # sourceTimestampField: orders.order_timestamp # processedTimestampField: orders.processed_timestamp # freshness: # description: The age of the youngest row in a table. # threshold: 25h # timestampField: orders.order_timestamp # frequency: # description: Data is delivered once a day # type: batch # or streaming # interval: daily # for batch, either or cron # cron: 0 0 * * * # for batch, either or interval # support: # description: The data is available during typical business hours at headquarters # time: 9am to 5pm in EST on business days # responseTime: 1h # backup: # description: Data is backed up once a week, every Sunday at 0:00 UTC. 
# interval: weekly # cron: 0 0 * * 0 # recoveryTime: 24 hours # recoveryPoint: 1 week ================================================ FILE: datacontract.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "title": "DataContractSpecification", "properties": { "dataContractSpecification": { "type": "string", "title": "DataContractSpecificationVersion", "enum": [ "1.2.1", "1.2.0", "1.1.0", "0.9.3", "0.9.2", "0.9.1", "0.9.0" ], "description": "Specifies the Data Contract Specification being used." }, "id": { "type": "string", "description": "Specifies the identifier of the data contract." }, "info": { "type": "object", "properties": { "title": { "type": "string", "description": "The title of the data contract." }, "version": { "type": "string", "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." }, "status": { "type": "string", "description": "The status of the data contract. Can be proposed, in development, active, retired.", "examples": [ "proposed", "in development", "active", "deprecated", "retired" ] }, "description": { "type": "string", "description": "A description of the data contract." }, "owner": { "type": "string", "description": "The owner or team responsible for managing the data contract and providing the data." }, "contact": { "type": "object", "properties": { "name": { "type": "string", "description": "The identifying name of the contact person/organization." }, "url": { "type": "string", "format": "uri", "description": "The URL pointing to the contact information. This MUST be in the form of a URL." }, "email": { "type": "string", "format": "email", "description": "The email address of the contact person/organization. This MUST be in the form of an email address." 
} }, "description": "Contact information for the data contract.", "additionalProperties": true } }, "additionalProperties": true, "required": [ "title", "version" ], "description": "Metadata and life cycle information about the data contract." }, "servers": { "type": "object", "description": "Information about the servers.", "additionalProperties": { "$ref": "#/$defs/BaseServer", "allOf": [ { "if": { "properties": { "type": { "const": "bigquery" } } }, "then": { "$ref": "#/$defs/BigQueryServer" } }, { "if": { "properties": { "type": { "const": "postgres" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PostgresServer" } }, { "if": { "properties": { "type": { "const": "s3" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/S3Server" } }, { "if": { "properties": { "type": { "const": "sftp" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SftpServer" } }, { "if": { "properties": { "type": { "const": "redshift" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/RedshiftServer" } }, { "if": { "properties": { "type": { "const": "azure" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/AzureServer" } }, { "if": { "properties": { "type": { "const": "sqlserver" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SqlserverServer" } }, { "if": { "properties": { "type": { "const": "snowflake" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SnowflakeServer" } }, { "if": { "properties": { "type": { "const": "databricks" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/DatabricksServer" } }, { "if": { "properties": { "type": { "const": "dataframe" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/DataframeServer" } }, { "if": { "properties": { "type": { "const": "glue" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/GlueServer" } }, { "if": { "properties": { "type": { "const": "postgres" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PostgresServer" } }, { "if": { "properties": { "type": 
{ "const": "oracle" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/OracleServer" } }, { "if": { "properties": { "type": { "const": "kafka" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/KafkaServer" } }, { "if": { "properties": { "type": { "const": "pubsub" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PubSubServer" } }, { "if": { "properties": { "type": { "const": "kinesis" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/KinesisDataStreamsServer" } }, { "if": { "properties": { "type": { "const": "trino" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/TrinoServer" } }, { "if": { "properties": { "type": { "const": "clickhouse" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/ClickhouseServer" } }, { "if": { "properties": { "type": { "const": "local" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/LocalServer" } } ] } }, "terms": { "type": "object", "description": "The terms and conditions of the data contract.", "properties": { "usage": { "type": "string", "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." }, "limitations": { "type": "string", "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." }, "policies": { "type": "array", "items": { "type": "object", "properties": { "type": { "type": "string", "description": "The type of the policy.", "examples": [ "privacy", "security", "retention", "compliance" ] }, "description": { "type": "string", "description": "A description of the policy." }, "url": { "type": "string", "format": "uri", "description": "A URL to the policy document." } }, "additionalProperties": true }, "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." 
}, "billing": { "type": "string", "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." }, "noticePeriod": { "type": "string", "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." } }, "additionalProperties": true }, "models": { "description": "Specifies the logical data model. Use the models name (e.g., the table name) as the key.", "type": "object", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "object", "title": "Model", "properties": { "description": { "type": "string" }, "type": { "description": "The type of the model. Examples: table, view, object. Default: table.", "type": "string", "title": "ModelType", "default": "table", "enum": [ "table", "view", "object" ] }, "title": { "type": "string", "description": "An optional string providing a human readable name for the model. Especially useful if the model name is cryptic or contains abbreviations.", "examples": [ "Purchase Orders", "Air Shipments" ] }, "fields": { "description": "Specifies a field in the data model. Use the field name (e.g., the column name) as the key.", "type": "object", "additionalProperties": { "type": "object", "title": "Field", "properties": { "description": { "type": "string", "description": "An optional string describing the semantic of the data in this field." }, "title": { "type": "string", "description": "An optional string providing a human readable name for the field. Especially useful if the field name is cryptic or contains abbreviations." }, "type": { "$ref": "#/$defs/FieldType" }, "required": { "type": "boolean", "default": false, "description": "An indication, if this field must contain a value and may not be null." }, "fields": { "description": "The nested fields (e.g. 
columns) of the object, record, or struct.", "type": "object", "additionalProperties": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" } }, "items": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "keys": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "values": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "primary": { "type": "boolean", "deprecationMessage": "Use the primaryKey field instead." }, "primaryKey": { "type": "boolean", "default": false, "description": "If this field is a primary key." }, "references": { "type": "string", "description": "The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship.", "examples": [ "orders.order_id", "model.nested_field.field" ] }, "unique": { "type": "boolean", "default": false, "description": "An indication, if the value must be unique within the model." }, "enum": { "type": "array", "items": { "type": "string" }, "uniqueItems": true, "description": "A value must be equal to one of the elements in this array value. Only evaluated if the value is not null." }, "minLength": { "type": "integer", "description": "A value must greater than, or equal to, the value of this. Only applies to string types." }, "maxLength": { "type": "integer", "description": "A value must less than, or equal to, the value of this. Only applies to string types." }, "format": { "type": "string", "description": "A specific format the value must comply with (e.g., 'email', 'uri', 'uuid').", "examples": [ "email", "uri", "uuid" ] }, "precision": { "type": "number", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." 
}, "scale": { "type": "number", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression the value must match. Only applies to string types.", "examples": [ "^[a-zA-Z0-9_-]+$" ] }, "minimum": { "type": "number", "description": "A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMinimum": { "type": "number", "description": "A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "maximum": { "type": "number", "description": "A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMaximum": { "type": "number", "description": "A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "example": { "type": "string", "description": "An example value for this field.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "A examples value for this field." }, "pii": { "type": "boolean", "description": "An indication, if this field contains Personal Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field, according to the organization's classification scheme.", "examples": [ "sensitive", "restricted", "internal", "public" ] }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." 
}, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } }, "$ref": { "type": "string", "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." }, "quality": { "type": "array", "items": { "$ref": "#/$defs/Quality" } }, "lineage": { "$ref": "#/$defs/Lineage" }, "config": { "type": "object", "description": "Additional metadata for field configuration.", "additionalProperties": { "type": [ "string", "number", "boolean", "object", "array", "null" ] }, "properties": { "avroType": { "type": "string", "description": "Specify the field type to use when exporting the data model to Apache Avro." }, "avroLogicalType": { "type": "string", "description": "Specify the logical field type to use when exporting the data model to Apache Avro." }, "bigqueryType": { "type": "string", "description": "Specify the physical column type that is used in a BigQuery table, e.g., `NUMERIC(5, 2)`." }, "snowflakeType": { "type": "string", "description": "Specify the physical column type that is used in a Snowflake table, e.g., `TIMESTAMP_LTZ`." }, "redshiftType": { "type": "string", "description": "Specify the physical column type that is used in a Redshift table, e.g., `SMALLINT`." }, "sqlserverType": { "type": "string", "description": "Specify the physical column type that is used in a SQL Server table, e.g., `DATETIME2`." }, "databricksType": { "type": "string", "description": "Specify the physical column type that is used in a Databricks Unity Catalog table." }, "glueType": { "type": "string", "description": "Specify the physical column type that is used in an AWS Glue Data Catalog table." 
} } } } } }, "primaryKey": { "type": "array", "items": { "type": "string" }, "description": "The compound primary key of the model." }, "quality": { "type": "array", "items": { "$ref": "#/$defs/Quality" } }, "examples": { "type": "array" }, "additionalFields": { "type": "boolean", "description": " Specify, if the model can have additional fields that are not defined in the contract. ", "default": false }, "config": { "type": "object", "description": "Additional metadata for model configuration.", "additionalProperties": { "type": [ "string", "number", "boolean", "object", "array", "null" ] }, "properties": { "avroNamespace": { "type": "string", "description": "The namespace to use when importing and exporting the data model from / to Apache Avro." } } } } } }, "definitions": { "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "type": "object", "propertyNames": { "pattern": "^[a-zA-Z0-9/_-]+$" }, "additionalProperties": { "type": "object", "title": "Definition", "properties": { "domain": { "type": "string", "description": "The domain in which this definition is valid.", "default": "global", "deprecationMessage": "This field is deprecated. Encode the domain into the ID using slashes." }, "name": { "type": "string", "description": "The technical name of this definition.", "deprecationMessage": "This field is deprecated. Encode the name into the ID using slashes." }, "title": { "type": "string", "description": "The business name of this definition." }, "description": { "type": "string", "description": "Clear and concise explanations related to the domain." }, "type": { "$ref": "#/$defs/FieldType" }, "fields": { "description": "The nested fields (e.g. 
columns) of the object, record, or struct.", "type": "object", "additionalProperties": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" } }, "items": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "keys": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "values": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "minLength": { "type": "integer", "description": "A value must be greater than or equal to this value. Applies only to string types." }, "maxLength": { "type": "integer", "description": "A value must be less than or equal to this value. Applies only to string types." }, "format": { "type": "string", "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." }, "precision": { "type": "integer", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "integer", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression pattern the value must match. Applies only to string types." }, "minimum": { "type": "number", "description": "A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMinimum": { "type": "number", "description": "A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "maximum": { "type": "number", "description": "A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." 
}, "exclusiveMaximum": { "type": "number", "description": "A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "example": { "type": "string", "description": "An example value.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "Example value." }, "pii": { "type": "boolean", "description": "Indicates if the field contains Personal Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field." }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." }, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } } }, "required": [ "type" ] } }, "servicelevels": { "type": "object", "description": "Specifies the service level agreements for the provided data, including availability, data retention policies, latency requirements, data freshness, update frequency, support availability, and backup policies.", "properties": { "availability": { "type": "object", "description": "Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data.", "properties": { "description": { "type": "string", "description": "An optional string describing the availability service level.", "example": "The server is available during support hours" }, "percentage": { "type": "string", "description": "An optional string describing the guaranteed uptime in percent (e.g., `99.9%`)", "pattern": "^\\d+(\\.\\d+)?%$", "example": "99.9%" } } }, "retention": { "type": "object", "description": "Retention 
covers the period how long data will be available.", "properties": { "description": { "type": "string", "description": "An optional string describing the retention service level.", "example": "Data is retained for one year." }, "period": { "type": "string", "description": "An optional period of time, how long data is available. Supported formats: Simple duration (e.g., `1 year`, `30d`) and ISO 8601 duration (e.g, `P1Y`).", "example": "P1Y" }, "unlimited": { "type": "boolean", "description": "An optional indicator that data is kept forever.", "example": false }, "timestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp that the period refers to.", "example": "orders.order_timestamp" } } }, "latency": { "type": "object", "description": "Latency refers to the maximum amount of time from the source to its destination.", "properties": { "description": { "type": "string", "description": "An optional string describing the latency service level.", "example": "Data is available within 25 hours after the order was placed." }, "threshold": { "type": "string", "description": "An optional maximum duration between the source timestamp and the processed timestamp. 
Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g, `PT24H`).", "example": "25h" }, "sourceTimestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp when the data was provided at the source.", "example": "orders.order_timestamp" }, "processedTimestampField": { "type": "string", "description": "An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract.", "example": "orders.processed_timestamp" } } }, "freshness": { "type": "object", "description": "The maximum age of the youngest row in a table.", "properties": { "description": { "type": "string", "description": "An optional string describing the freshness service level.", "example": "The age of the youngest row in a table is within 25 hours." }, "threshold": { "type": "string", "description": "An optional maximum age of the youngest entry. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`).", "example": "25h" }, "timestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp that the threshold refers to.", "example": "orders.order_timestamp" } } }, "frequency": { "type": "object", "description": "Frequency describes how often data is updated.", "properties": { "description": { "type": "string", "description": "An optional string describing the frequency service level.", "example": "Data is delivered once a day." }, "type": { "type": "string", "enum": [ "batch", "micro-batching", "streaming", "manual" ], "description": "The method of data processing.", "example": "batch" }, "interval": { "type": "string", "description": "Optional. Only for batch: How often the pipeline is triggered, e.g., `daily`.", "example": "daily" }, "cron": { "type": "string", "description": "Optional. 
Only for batch: A cron expression for when the pipeline is triggered. E.g., `0 0 * * *`.", "example": "0 0 * * *" } } }, "support": { "type": "object", "description": "Support describes the times when support will be available for contact.", "properties": { "description": { "type": "string", "description": "An optional string describing the support service level.", "example": "The data is available during typical business hours at headquarters." }, "time": { "type": "string", "description": "An optional string describing the times when support will be available for contact such as `24/7` or `business hours only`.", "example": "9am to 5pm in EST on business days" }, "responseTime": { "type": "string", "description": "An optional string describing the time it takes for the support team to acknowledge a request. This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with.", "example": "24 hours" } } }, "backup": { "type": "object", "description": "Backup specifies details about data backup procedures.", "properties": { "description": { "type": "string", "description": "An optional string describing the backup service level.", "example": "Data is backed up once a week, every Sunday at 0:00 UTC." 
}, "interval": { "type": "string", "description": "An optional interval that defines how often data will be backed up, e.g., `daily`.", "example": "weekly" }, "cron": { "type": "string", "description": "An optional cron expression when data will be backed up, e.g., `0 0 * * *`.", "example": "0 0 * * 0" }, "recoveryTime": { "type": "string", "description": "An optional Recovery Time Objective (RTO) specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours).", "example": "24 hours" }, "recoveryPoint": { "type": "string", "description": "An optional Recovery Point Objective (RPO) defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours).", "example": "1 week" } } } } }, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } }, "tags": { "type": "array", "items": { "type": "string", "description": "Tags to facilitate searching and filtering.", "examples": [ "databricks", "pii", "sensitive" ] }, "description": "Tags to facilitate searching and filtering." 
} }, "required": [ "dataContractSpecification", "id", "info" ], "$defs": { "FieldType": { "type": "string", "title": "FieldType", "description": "The logical data type of the field.", "enum": [ "number", "decimal", "numeric", "int", "integer", "long", "bigint", "float", "double", "string", "text", "varchar", "boolean", "timestamp", "timestamp_tz", "timestamp_ntz", "date", "time", "array", "map", "object", "record", "struct", "bytes", "variant", "json", "null" ] }, "BaseServer": { "type": "object", "properties": { "description": { "type": "string", "description": "An optional string describing the servers." }, "environment": { "type": "string", "description": "The environment in which the servers are running. Examples: prod, sit, stg." }, "type": { "type": "string", "description": "The type of the data product technology that implements the data contract.", "examples": [ "azure", "bigquery", "BigQuery", "clickhouse", "databricks", "dataframe", "glue", "kafka", "kinesis", "local", "oracle", "postgres", "pubsub", "redshift", "sftp", "sqlserver", "snowflake", "s3", "trino" ] }, "roles": { "description": " An optional array of roles that are available and can be requested to access the server for role-based access control. E.g. separate roles for different regions or sensitive data.", "type": "array", "items": { "type": "object", "properties": { "name": { "type": "string", "description": "The name of the role." }, "description": { "type": "string", "description": "A description of the role and what access the role provides." } }, "required": [ "name" ] } } }, "additionalProperties": true, "required": [ "type" ] }, "BigQueryServer": { "type": "object", "title": "BigQueryServer", "properties": { "project": { "type": "string", "description": "The GCP project name." }, "dataset": { "type": "string", "description": "The GCP dataset name." 
} }, "required": [ "project", "dataset" ] }, "S3Server": { "type": "object", "title": "S3Server", "properties": { "location": { "type": "string", "format": "uri", "description": "S3 URL, starting with `s3://`", "examples": [ "s3://datacontract-example-orders-latest/data/{model}/*.json" ] }, "endpointUrl": { "type": "string", "format": "uri", "description": "The server endpoint for S3-compatible servers.", "examples": [ "https://minio.example.com" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. How multiple json documents are delimited within one file" } }, "required": [ "location" ] }, "SftpServer": { "type": "object", "title": "SftpServer", "properties": { "location": { "type": "string", "format": "uri", "pattern": "^sftp://.*", "description": "SFTP URL, starting with `sftp://`", "examples": [ "sftp://123.123.12.123/{model}/*.json" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. How multiple json documents are delimited within one file" } }, "required": [ "location" ] }, "RedshiftServer": { "type": "object", "title": "RedshiftServer", "properties": { "account": { "type": "string", "description": "An optional string describing the server." }, "host": { "type": "string", "description": "An optional string describing the host name." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." 
}, "clusterIdentifier": { "type": "string", "description": "An optional string describing the cluster's identifier.", "examples": [ "redshift-prod-eu", "analytics-cluster" ] }, "port": { "type": "integer", "description": "An optional integer specifying the cluster's port.", "examples": [ 5439 ] }, "endpoint": { "type": "string", "description": "An optional string describing the cluster's endpoint.", "examples": [ "analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics" ] } }, "additionalProperties": true, "required": [ "account", "database", "schema" ] }, "AzureServer": { "type": "object", "title": "AzureServer", "properties": { "location": { "type": "string", "format": "uri", "description": "Path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs. Recommended pattern is 'abfss:///'", "examples": [ "abfss://my_container_name/path", "abfss://my_container_name/path/*.json", "az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet", "abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. 
How multiple json documents are delimited within one file" } }, "required": [ "location", "format" ] }, "SqlserverServer": { "type": "object", "title": "SqlserverServer", "properties": { "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server.", "default": 1433, "examples": [ 1433 ] }, "database": { "type": "string", "description": "The name of the database.", "examples": [ "database" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "dbo" ] } }, "required": [ "host", "database", "schema" ] }, "SnowflakeServer": { "type": "object", "title": "SnowflakeServer", "properties": { "account": { "type": "string", "description": "An optional string describing the server." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "account", "database", "schema" ] }, "DatabricksServer": { "type": "object", "title": "DatabricksServer", "properties": { "host": { "type": "string", "description": "The Databricks host", "examples": [ "dbc-abcdefgh-1234.cloud.databricks.com" ] }, "catalog": { "type": "string", "description": "The name of the Hive or Unity catalog" }, "schema": { "type": "string", "description": "The schema name in the catalog" } }, "required": [ "catalog", "schema" ] }, "DataframeServer": { "type": "object", "title": "DataframeServer", "required": [ "type" ] }, "GlueServer": { "type": "object", "title": "GlueServer", "properties": { "account": { "type": "string", "description": "The AWS Glue account", "examples": [ "1234-5678-9012" ] }, "database": { "type": "string", "description": "The AWS Glue database name", "examples": [ "my_database" ] }, "location": { "type": "string", "format": "uri", "description": "The AWS S3 path. 
Must be in the form of a URL.", "examples": [ "s3://datacontract-example-orders-latest/data/{model}" ] }, "format": { "type": "string", "description": "The format of the files", "examples": [ "parquet", "csv", "json", "delta" ] } }, "required": [ "account", "database" ] }, "PostgresServer": { "type": "object", "title": "PostgresServer", "properties": { "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server." }, "database": { "type": "string", "description": "The name of the database.", "examples": [ "postgres" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "public" ] } }, "required": [ "host", "port", "database", "schema" ] }, "OracleServer": { "type": "object", "title": "OracleServer", "properties": { "host": { "type": "string", "description": "The host to the oracle server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the oracle server.", "examples": [ 1523 ] }, "serviceName": { "type": "string", "description": "The name of the service.", "examples": [ "service" ] } }, "required": [ "host", "port", "serviceName" ] }, "KafkaServer": { "type": "object", "title": "KafkaServer", "description": "Kafka Server", "properties": { "host": { "type": "string", "description": "The bootstrap server of the kafka cluster." }, "topic": { "type": "string", "description": "The topic name." }, "format": { "type": "string", "description": "The format of the message. Examples: json, avro, protobuf.", "default": "json" } }, "required": [ "host", "topic" ] }, "PubSubServer": { "type": "object", "title": "PubSubServer", "properties": { "project": { "type": "string", "description": "The GCP project name." }, "topic": { "type": "string", "description": "The topic name." 
} }, "required": [ "project", "topic" ] }, "KinesisDataStreamsServer": { "type": "object", "title": "KinesisDataStreamsServer", "description": "Kinesis Data Streams Server", "properties": { "stream": { "type": "string", "description": "The name of the Kinesis data stream." }, "region": { "type": "string", "description": "AWS region.", "examples": [ "eu-west-1" ] }, "format": { "type": "string", "description": "The format of the record", "examples": [ "json", "avro", "protobuf" ] } }, "required": [ "stream" ] }, "TrinoServer": { "type": "object", "title": "TrinoServer", "properties": { "host": { "type": "string", "description": "The Trino host URL.", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The Trino port." }, "catalog": { "type": "string", "description": "The name of the catalog.", "examples": [ "hive" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "my_schema" ] } }, "required": [ "host", "port", "catalog", "schema" ] }, "ClickhouseServer": { "type": "object", "title": "ClickhouseServer", "properties": { "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server." 
}, "database": { "type": "string", "description": "The name of the database.", "examples": [ "postgres" ] } }, "required": [ "host", "port", "database" ] }, "LocalServer": { "type": "object", "title": "LocalServer", "properties": { "path": { "type": "string", "description": "The relative or absolute path to the data file(s).", "examples": [ "./folder/data.parquet", "./folder/*.parquet" ] }, "format": { "type": "string", "description": "The format of the file(s)", "examples": [ "json", "parquet", "delta", "csv" ] } }, "required": [ "path", "format" ] }, "Quality": { "allOf": [ { "type": "object", "properties": { "type": { "type": "string", "description": "The type of quality check", "enum": [ "text", "library", "sql", "custom" ] }, "description": { "type": "string", "description": "A plain text describing the quality attribute in natural language." } } }, { "if": { "properties": { "type": { "const": "text" } } }, "then": { "required": [ "description" ] } }, { "if": { "properties": { "type": { "const": "sql" } } }, "then": { "properties": { "query": { "type": "string", "description": "A SQL query that returns a single number to compare with the threshold." }, "dialect": { "type": "string", "description": "The SQL dialect that is used for the query. 
Should be compatible to the server.type.", "examples": [ "athena", "bigquery", "redshift", "snowflake", "trino", "postgres", "oracle" ] }, "mustBe": { "type": "number" }, "mustNotBe": { "type": "number" }, "mustBeGreaterThan": { "type": "number" }, "mustBeGreaterOrEqualTo": { "type": "number" }, "mustBeGreaterThanOrEqualTo": { "type": "number", "deprecated": true }, "mustBeLessThan": { "type": "number" }, "mustBeLessThanOrEqualTo": { "type": "number", "deprecated": true }, "mustBeLessOrEqualTo": { "type": "number" }, "mustBeBetween": { "type": "array", "items": { "type": "number" }, "minItems": 2, "maxItems": 2 }, "mustNotBeBetween": { "type": "array", "items": { "type": "number" }, "minItems": 2, "maxItems": 2 } }, "required": [ "query" ] } }, { "if": { "properties": { "type": { "const": "library" } } }, "then": { "properties": { "metric": { "type": "string", "description": "The DataQualityLibrary metric to use for the quality check.", "examples": ["nullValues", "missingValues", "invalidValues", "duplicateValues", "rowCount"] }, "rule": { "type": "string", "deprecated": true, "description": "Deprecated. Use metric instead" }, "arguments": { "type": "object", "description": "Additional metric-specific parameters for the quality check.", "additionalProperties": { "type": ["string", "number", "boolean", "array", "object"] } }, "mustBe": { "description": "Must be equal to the value to be valid. When using numbers, it is equivalent to '='." }, "mustNotBe": { "description": "Must not be equal to the value to be valid. When using numbers, it is equivalent to '!='." }, "mustBeGreaterThan": { "type": "number", "description": "Must be greater than the value to be valid. It is equivalent to '>'." }, "mustBeGreaterOrEqualTo": { "type": "number", "description": "Must be greater than or equal to the value to be valid. It is equivalent to '>='." }, "mustBeLessThan": { "type": "number", "description": "Must be less than the value to be valid. It is equivalent to '<'." 
}, "mustBeLessOrEqualTo": { "type": "number", "description": "Must be less than or equal to the value to be valid. It is equivalent to '<='." }, "mustBeBetween": { "type": "array", "description": "Must be between the two numbers to be valid. Smallest number first in the array.", "minItems": 2, "maxItems": 2, "uniqueItems": true, "items": { "type": "number" } }, "mustNotBeBetween": { "type": "array", "description": "Must not be between the two numbers to be valid. Smallest number first in the array.", "minItems": 2, "maxItems": 2, "uniqueItems": true, "items": { "type": "number" } } }, "required": [ "metric" ] } }, { "if": { "properties": { "type": { "const": "custom" } } }, "then": { "properties": { "description": { "type": "string", "description": "A plain text describing the quality attribute in natural language." }, "engine": { "type": "string", "examples": [ "soda", "great-expectations" ], "description": "The engine used for custom quality checks." }, "implementation": { "type": [ "object", "array", "string" ], "description": "Engine-specific quality checks and expectations." } }, "required": [ "engine" ] } } ] }, "Lineage": { "type": "object", "properties": { "inputFields": { "type": "array", "items": { "type": "object", "properties": { "namespace": { "type": "string", "description": "The input dataset namespace" }, "name": { "type": "string", "description": "The input dataset name" }, "field": { "type": "string", "description": "The input field" }, "transformations": { "type": "array", "items": { "type": "object", "properties": { "type": { "description": "The type of the transformation. 
Allowed values are: DIRECT, INDIRECT", "type": "string" }, "subtype": { "type": "string", "description": "The subtype of the transformation" }, "description": { "type": "string", "description": "a string representation of the transformation applied" }, "masking": { "type": "boolean", "description": "is transformation masking the data or not" } }, "required": [ "type" ], "additionalProperties": true } } }, "additionalProperties": true, "required": [ "namespace", "name", "field" ] } }, "transformationDescription": { "type": "string", "description": "a string representation of the transformation applied", "deprecated": true }, "transformationType": { "type": "string", "description": "IDENTITY|MASKED reflects a clearly defined behavior. IDENTITY: exact same as input; MASKED: no original data available (like a hash of PII for example)", "deprecated": true } }, "additionalProperties": true, "required": [ "inputFields" ] } } } ================================================ FILE: definition.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "properties": { "id": { "type": "string", "description": "A unique identifier for this definition. Encode the domain into the ID, separated by slashes.", "examples": [ "checkout/order_id" ] }, "title": { "type": "string", "description": "The business name of this definition." }, "description": { "type": "string", "description": "Clear and concise explanations related to the domain." }, "type": { "type": "string", "description": "The logical data type." }, "minLength": { "type": "integer", "description": "A value must be greater than or equal to this value. Applies only to string types." }, "maxLength": { "type": "integer", "description": "A value must be less than or equal to this value. Applies only to string types." 
}, "format": { "type": "string", "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." }, "precision": { "type": "integer", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "integer", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression pattern the value must match. Applies only to string types." }, "example": { "type": "string", "description": "An example value for this field.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "Example values for this field." }, "pii": { "type": "boolean", "description": "Indicates if the field contains Personally Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field." }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." 
}, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } } }, "required": [ "type" ] } ================================================ FILE: diagrams/automation.drawio ================================================ ================================================ FILE: diagrams/datacontract.drawio ================================================ ================================================ FILE: diagrams/favicon.drawio ================================================ ================================================ FILE: examples/covid-cases/datacontract.html ================================================ Data Contract

Data Contract

covid_cases

Info

Information about the data contract

Title
COVID-19 cases
Version
0.0.1
Description
Johns Hopkins University Consolidated data on COVID-19 cases, sourced from Enigma
links
{'blog': 'https://aws.amazon.com/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/', 'data-explorer': 'https://dj2taa9i652rf.cloudfront.net/', 'data': 'https://covid19-lake.s3.us-east-2.amazonaws.com/enigma-jhu/json/part-00000-adec1cd2-96df-4c6b-a5f2-780f092951ba-c000.json'}

Servers

Servers of the data contract

  • Server
    s3-json
    Type
    s3
    Location
    s3://covid19-lake/enigma-jhu/json/*.json
    Format
    json
    Delimiter
    new_line

Data Model

The logical data model

covid_cases table
the number of confirmed covid cases reported for a specified region, with location and county/province/country information.
fips
string
state and county two digits code
admin2
string
county name
province_state
string
province name or state name
country_region
string
country name or region name
last_update
timestamp_ntz
last update timestamp
latitude
double
location (latitude)
longitude
double
location (longitude)
confirmed
int
number of confirmed cases
combined_key
string
county name+state name+country name

Quality

SodaCL

checks for covid_cases:
- freshness(last_update::datetime) < 5000d
- row_count > 1000
Created at 27 Jun 2024 14:50:11 UTC with Data Contract CLI v0.10.8
dataContractSpecification: 0.9.3
id: covid_cases
info:
  title: COVID-19 cases
  version: 0.0.1
  description: Johns Hopkins University Consolidated data on COVID-19 cases, sourced
    from Enigma
  links:
    blog: https://aws.amazon.com/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/
    data-explorer: https://dj2taa9i652rf.cloudfront.net/
    data: https://covid19-lake.s3.us-east-2.amazonaws.com/enigma-jhu/json/part-00000-adec1cd2-96df-4c6b-a5f2-780f092951ba-c000.json
servers:
  s3-json:
    type: s3
    format: json
    delimiter: new_line
    location: s3://covid19-lake/enigma-jhu/json/*.json
models:
  covid_cases:
    description: the number of confirmed covid cases reported for a specified region,
      with location and county/province/country information.
    type: table
    fields:
      fips:
        type: string
        required: false
        primary: false
        unique: false
        description: state and county two digits code
      admin2:
        type: string
        required: false
        primary: false
        unique: false
        description: county name
      province_state:
        type: string
        required: false
        primary: false
        unique: false
        description: province name or state name
      country_region:
        type: string
        required: false
        primary: false
        unique: false
        description: country name or region name
      last_update:
        type: timestamp_ntz
        required: false
        primary: false
        unique: false
        description: last update timestamp
      latitude:
        type: double
        required: false
        primary: false
        unique: false
        description: location (latitude)
      longitude:
        type: double
        required: false
        primary: false
        unique: false
        description: location (longitude)
      confirmed:
        type: int
        required: false
        primary: false
        unique: false
        description: number of confirmed cases
      combined_key:
        type: string
        required: false
        primary: false
        unique: false
        description: county name+state name+country name
quality:
  type: SodaCL
  specification:
    checks for covid_cases:
    - freshness(last_update::datetime) < 5000d
    - row_count > 1000
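
The SodaCL check `freshness(last_update::datetime) < 5000d` above compares the age of the newest `last_update` value with a simple duration. The following is a minimal Python sketch of that comparison, not SodaCL's or the Data Contract CLI's implementation. The helpers `parse_simple_duration` and `is_fresh` are hypothetical, and the sketch assumes only the compact `<number><unit>` form with the units `s`, `m`, `h`, and `d`.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit table for this sketch; real tools may accept more forms
# (for example `24 hours` or ISO 8601 durations such as `PT24H`).
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_simple_duration(text: str) -> timedelta:
    """Parse a compact duration such as `5000d` or `25h` into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*([smhd])\s*", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def is_fresh(timestamps, threshold: str, now: datetime) -> bool:
    """Return True if the newest timestamp is younger than the threshold."""
    newest = max(timestamps)
    return now - newest < parse_simple_duration(threshold)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the real dataset.
    now = datetime(2024, 6, 27, tzinfo=timezone.utc)
    rows = [
        datetime(2020, 3, 1, tzinfo=timezone.utc),
        datetime(2021, 5, 1, tzinfo=timezone.utc),
    ]
    print(is_fresh(rows, "5000d", now))
```

Passing `now` explicitly keeps the check deterministic, which makes it easier to test than a version that reads the system clock.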
================================================ FILE: examples/covid-cases/datacontract.yaml ================================================ dataContractSpecification: 0.9.3 id: covid_cases info: title: COVID-19 cases description: Johns Hopkins University Consolidated data on COVID-19 cases, sourced from Enigma version: "0.0.1" links: blog: https://aws.amazon.com/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/ data-explorer: https://dj2taa9i652rf.cloudfront.net/ data: https://covid19-lake.s3.us-east-2.amazonaws.com/enigma-jhu/json/part-00000-adec1cd2-96df-4c6b-a5f2-780f092951ba-c000.json servers: s3-json: type: s3 location: s3://covid19-lake/enigma-jhu/json/*.json format: json delimiter: new_line models: covid_cases: description: the number of confirmed covid cases reported for a specified region, with location and county/province/country information. fields: fips: type: string description: state and county two digits code admin2: type: string description: county name province_state: type: string description: province name or state name country_region: type: string description: country name or region name last_update: type: timestamp_ntz description: last update timestamp latitude: type: double description: location (latitude) longitude: type: double description: location (longitude) confirmed: type: int description: number of confirmed cases combined_key: type: string description: county name+state name+country name quality: type: SodaCL specification: checks for covid_cases: - freshness(last_update::datetime) < 5000d # dataset is not updated anymore - row_count > 1000 ================================================ FILE: examples/datacontract.html ================================================ Data Contract

Data Contract

covid_cases

Info

Information about the data contract

Title
COVID-19 cases
Version
0.0.1
Description
Johns Hopkins University Consolidated data on COVID-19 cases, sourced from Enigma

Servers

Servers of the data contract

Data Model

The logical data model

covid_cases table
the number of confirmed covid cases reported for a specified region, with location and county/province/country information.
fips
string
state and county two digits code
admin2
string
county name
province_state
string
province name or state name
country_region
string
country name or region name
last_update
timestamp_ntz
last update timestamp
latitude
double
location (latitude)
longitude
double
location (longitude)
confirmed
int
number of confirmed cases
combined_key
string
county name+state name+country name

Quality

SodaCL

checks for covid_cases:
- freshness(last_update::datetime) < 5000d
- row_count > 1000
Created at 29 Apr 2024 19:30:08 UTC with Data Contract CLI v0.10.1
dataContractSpecification: 0.9.3
id: covid_cases
info:
  title: COVID-19 cases
  version: 0.0.1
  description: Johns Hopkins University Consolidated data on COVID-19 cases, sourced
    from Enigma
servers:
  s3-json:
    type: s3
    format: json
    delimiter: new_line
    location: s3://covid19-lake/enigma-jhu/json/*.json
models:
  covid_cases:
    description: the number of confirmed covid cases reported for a specified region,
      with location and county/province/country information.
    type: table
    fields:
      fips:
        type: string
        required: false
        primary: false
        unique: false
        description: state and county two digits code
      admin2:
        type: string
        required: false
        primary: false
        unique: false
        description: county name
      province_state:
        type: string
        required: false
        primary: false
        unique: false
        description: province name or state name
      country_region:
        type: string
        required: false
        primary: false
        unique: false
        description: country name or region name
      last_update:
        type: timestamp_ntz
        required: false
        primary: false
        unique: false
        description: last update timestamp
      latitude:
        type: double
        required: false
        primary: false
        unique: false
        description: location (latitude)
      longitude:
        type: double
        required: false
        primary: false
        unique: false
        description: location (longitude)
      confirmed:
        type: int
        required: false
        primary: false
        unique: false
        description: number of confirmed cases
      combined_key:
        type: string
        required: false
        primary: false
        unique: false
        description: county name+state name+country name
quality:
  type: SodaCL
  specification:
    checks for covid_cases:
    - freshness(last_update::datetime) < 5000d
    - row_count > 1000
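
The check `row_count > 1000` above has a counterpart in the `Quality` definition of `datacontract.schema.json`, where a `library` check names a metric such as `rowCount` and a threshold operator such as `mustBeGreaterThan`. The following is a minimal Python sketch of how those operators could be evaluated against a measured value. `check_passes` is a hypothetical helper, and treating `mustBeBetween` as inclusive is an assumption, since the schema does not state whether the bounds are included.

```python
import operator

# Operators described in the schema as equivalent to =, !=, >, >=, <, <=.
_COMPARATORS = {
    "mustBe": operator.eq,
    "mustNotBe": operator.ne,
    "mustBeGreaterThan": operator.gt,
    "mustBeGreaterOrEqualTo": operator.ge,
    "mustBeLessThan": operator.lt,
    "mustBeLessOrEqualTo": operator.le,
}


def check_passes(value, check: dict) -> bool:
    """Return True if `value` satisfies every threshold operator in `check`."""
    for key, compare in _COMPARATORS.items():
        if key in check and not compare(value, check[key]):
            return False
    if "mustBeBetween" in check:
        low, high = check["mustBeBetween"]  # smallest number first
        if not low <= value <= high:  # assumption: bounds are inclusive
            return False
    if "mustNotBeBetween" in check:
        low, high = check["mustNotBeBetween"]
        if low <= value <= high:
            return False
    return True


if __name__ == "__main__":
    row_count_check = {"type": "library", "metric": "rowCount", "mustBeGreaterThan": 1000}
    print(check_passes(1500, row_count_check))
```

Keys that are not threshold operators, such as `type` and `metric`, are ignored by the helper, so a whole check object can be passed in unchanged.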
================================================ FILE: examples/generate-catalog ================================================ datacontract catalog --files "**/*.yaml" --output "." ================================================ FILE: examples/index.html ================================================ Data Contract ================================================ FILE: examples/muellimperium/data.csv ================================================ Pluto,residual_waste,2021-01-09 Pluto,bio_waste,2021-01-02 Pluto,paper,2021-01-11 Pluto,plastic,2021-01-12 Pluto,bulky_waste,2021-02-04 Earth,residual_waste,2021-01-14 Earth,bio_waste,2021-01-08 Earth,paper,2021-01-12 Earth,plastic,2021-01-27 Earth,bulky_waste,2021-02-03 ================================================ FILE: examples/muellimperium/datacontract.html ================================================ Data Contract

Data Contract

muellimperium-exchange-format

Info

Information about the data contract

Title
Muellimperium Exchange Format
Version
0.0.1
Description
The Muellimperium Exchange Format is a data contract for exchanging data between the Muellimperium and its partners.
Owner
Emperor of the Muellimperium
contract
{'name': 'The Emperor', 'email': 'the-emperor@muellimperium.com'}

Servers

Servers of the data contract

  • Server
    exchange
    Type
    local
    Path
    data.csv
    Format
    csv

Data Model

The logical data model

garbage_collection table
None
location
text
The location where the garbage is collected.
required
garbage_type
text
The type of garbage that is collected.
required
collection_date
date
The date when the garbage is collected.
required

Examples

Examples for models in the data contract

garbage_collection json
None
[{'location': 'Musterstadt', 'garbage_type': 'paper', 'collection_date': '2022-01-01'}, {'location': 'Musterstadt', 'garbage_type': 'plastic', 'collection_date': '2022-01-02'}, {'location': 'Musterstadt', 'garbage_type': 'residual_waste', 'collection_date': '2022-01-03'}]
Created at 27 Jun 2024 14:50:12 UTC with Data Contract CLI v0.10.8
dataContractSpecification: 0.9.3
id: muellimperium-exchange-format
info:
  title: Muellimperium Exchange Format
  version: 0.0.1
  description: 'The Muellimperium Exchange Format is a data contract for exchanging
    data between the Muellimperium and its partners.

    '
  owner: Emperor of the Muellimperium
  contract:
    name: The Emperor
    email: the-emperor@muellimperium.com
servers:
  exchange:
    type: local
    format: csv
    path: data.csv
models:
  garbage_collection:
    type: table
    fields:
      location:
        type: text
        required: true
        primary: false
        unique: false
        description: The location where the garbage is collected.
      garbage_type:
        type: text
        required: true
        primary: false
        unique: false
        description: The type of garbage that is collected.
        enum:
        - paper
        - plastic
        - residual_waste
        - bio_waste
        - bulky_waste
        - hazardous_waste
      collection_date:
        type: date
        required: true
        primary: false
        unique: false
        description: The date when the garbage is collected.
examples:
- type: json
  model: garbage_collection
  data:
  - location: Musterstadt
    garbage_type: paper
    collection_date: '2022-01-01'
  - location: Musterstadt
    garbage_type: plastic
    collection_date: '2022-01-02'
  - location: Musterstadt
    garbage_type: residual_waste
    collection_date: '2022-01-03'

================================================ FILE: examples/muellimperium/datacontract.yaml ================================================
dataContractSpecification: 0.9.3
id: muellimperium-exchange-format
info:
  title: Muellimperium Exchange Format
  version: 0.0.1
  description: |
    The Muellimperium Exchange Format is a data contract for exchanging data between the Muellimperium and its partners.
  owner: Emperor of the Muellimperium
  contract:
    name: The Emperor
    email: the-emperor@muellimperium.com
servers:
  exchange:
    type: local
    path: data.csv
    format: csv
models:
  garbage_collection:
    type: table
    fields:
      location:
        type: text
        required: true
        description: The location where the garbage is collected.
      garbage_type:
        type: text
        required: true
        description: The type of garbage that is collected.
        enum:
          - paper
          - plastic
          - residual_waste
          - bio_waste
          - bulky_waste
          - hazardous_waste
      collection_date:
        type: date
        required: true
        description: The date when the garbage is collected.
examples:
  - model: garbage_collection
    type: json
    data:
      - location: "Musterstadt"
        garbage_type: "paper"
        collection_date: "2022-01-01"
      - location: "Musterstadt"
        garbage_type: "plastic"
        collection_date: "2022-01-02"
      - location: "Musterstadt"
        garbage_type: "residual_waste"
        collection_date: "2022-01-03"

================================================ FILE: examples/orders-latest/datacontract.html ================================================
Data Contract

Data Contract

urn:orders-latest
checkout orders s3

Info

Information about the data contract

Title
Orders Latest
Version
1.0.0
Description
Successful customer orders in the webshop. All orders since 2020-01-01. Orders with their line items are in their current state (no history included).
Owner
Checkout Team
slackChannel
#checkout
Contact
John Doe (Data Product Owner)

Servers

Servers of the data contract

  • Server
    production
    Environment
    prod
    Type
    s3
    Location
    s3://datacontract-example-orders-latest/data/{model}/*.json
    Format
    json
    Delimiter
    new_line
    Description
    One folder per model. One file per day.

Terms

Terms and conditions of the data contract

Usage
Data can be used for reports, analytics and machine learning use cases. Orders may be linked and joined with other tables.
Limitations
Not suitable for real-time use cases. Data may not be used to identify individual customers. Max data processing per day: 10 TiB
Billing
5000 USD per month
Notice Period
P3M

Data Model

The logical data model

orders table
One record per order. Includes cancelled and deleted orders.
Order ID
order_id
text
An internal ID that identifies an order in the online shop.
Example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
primary required unique format:uuid restricted PII
order_timestamp
timestamp
The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
Example: 2024-09-09T08:30:00Z
required
order_total
long
Total amount in the smallest monetary unit (e.g., cents).
Example: 9999
required
customer_id
text
Unique identifier for the customer.
minLength:10 maxLength:20
customer_email_address
text
The email address, as entered by the customer. The email address was not verified.
required format:email sensitive PII
processed_timestamp
timestamp
The timestamp when the record was processed by the data platform.
required
line_items table
A single article that is part of an order.
lines_item_id
text
Primary key of the line_items table
primary required unique
Order ID
order_id
text
An internal ID that identifies an order in the online shop.
Example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
format:uuid restricted PII
Stock Keeping Unit
sku
text
The purchased article number
Example: 96385074
pattern:^[A-Za-z0-9]{8,14}$ wikipedia

Definitions

Domain specific definitions in the data contract

order_id checkout
An internal ID that identifies an order in the online shop.
Order ID
order_id
text uuid
Example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
Tags: orders
restricted PII
sku inventory
A Stock Keeping Unit (SKU) is an internal unique identifier for an article. It is typically associated with an article's barcode, such as the EAN/GTIN.
Stock Keeping Unit
sku
text
Example: 96385074
Tags: inventory
pattern:^[A-Za-z0-9]{8,14}$ wikipedia

Examples

Examples for models in the data contract

orders csv
An example list of order records.
order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp
"1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z"
"1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z"
"1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z"
"1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
"1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
"1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z"
"1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z"
"1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z"
"1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z"
"1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z"
line_items csv
An example list of line items.
lines_item_id,order_id,sku
"LI-1","1001","5901234123457"
"LI-2","1001","4001234567890"
"LI-3","1002","5901234123457"
"LI-4","1002","2001234567893"
"LI-5","1003","4001234567890"
"LI-6","1003","5001234567892"
"LI-7","1004","5901234123457"
"LI-8","1005","2001234567893"
"LI-9","1005","5001234567892"
"LI-10","1005","6001234567891"

Service Levels

Service levels of the data contract

Availability

Description
The server is available during support hours
Percentage
99.9%

Retention

Description
Data is retained for one year
Period
P1Y

Latency

Description
Data is available within 25 hours after the order was placed
Threshold
25h
Source Timestamp field
orders.order_timestamp
Processed Timestamp field
orders.processed_timestamp

Freshness

Description
The age of the youngest row in a table.
Threshold
25h
Timestamp field
orders.order_timestamp

Frequency

Description
Data is delivered once a day
Type
batch
Interval
daily
Cron
0 0 * * *

Support

Description
The data is available during typical business hours at headquarters
Time
9am to 5pm in EST on business days
Response Time
1h

Backup

Description
Data is backed up once a week, every Sunday at 0:00 UTC.
Cron
0 0 * * 0
Recovery Time
24 hours
Recovery Point
1 week

Quality

SodaCL

checks for orders:
- row_count >= 5
- duplicate_count(order_id) = 0
checks for line_items:
- values in (order_id) must exist in orders (order_id)
- row_count >= 5
Created at 27 Jun 2024 14:50:10 UTC with Data Contract CLI v0.10.8
dataContractSpecification: 0.9.3
id: urn:orders-latest
info:
  title: Orders Latest
  version: 1.0.0
  description: "Successful customer orders in the webshop. \nAll orders since 2020-01-01.\
    \ \nOrders with their line items are in their current state (no history included).\n"
  owner: Checkout Team
  contact:
    name: John Doe (Data Product Owner)
    url: https://teams.microsoft.com/l/channel/example/checkout
  slackChannel: '#checkout'
servers:
  production:
    type: s3
    description: One folder per model. One file per day.
    environment: prod
    format: json
    delimiter: new_line
    location: s3://datacontract-example-orders-latest/data/{model}/*.json
terms:
  usage: 'Data can be used for reports, analytics and machine learning use cases.

    Orders may be linked and joined with other tables.

    '
  limitations: 'Not suitable for real-time use cases.

    Data may not be used to identify individual customers.

    Max data processing per day: 10 TiB

    '
  billing: 5000 USD per month
  noticePeriod: P3M
models:
  orders:
    description: One record per order. Includes cancelled and deleted orders.
    type: table
    fields:
      order_id:
        ref: '#/definitions/order_id'
        title: Order ID
        type: text
        format: uuid
        required: true
        primary: true
        unique: true
        description: An internal ID that identifies an order in the online shop.
        pii: true
        classification: restricted
        tags:
        - orders
        example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
      order_timestamp:
        type: timestamp
        required: true
        primary: false
        unique: false
        description: The business timestamp in UTC when the order was successfully
          registered in the source system and the payment was successful.
        example: '2024-09-09T08:30:00Z'
      order_total:
        type: long
        required: true
        primary: false
        unique: false
        description: Total amount in the smallest monetary unit (e.g., cents).
        example: '9999'
      customer_id:
        type: text
        required: false
        primary: false
        unique: false
        description: Unique identifier for the customer.
        minLength: 10
        maxLength: 20
      customer_email_address:
        type: text
        format: email
        required: true
        primary: false
        unique: false
        description: The email address, as entered by the customer. The email address
          was not verified.
        pii: true
        classification: sensitive
      processed_timestamp:
        type: timestamp
        required: true
        primary: false
        unique: false
        description: The timestamp when the record was processed by the data platform.
        config:
          jsonType: string
          jsonFormat: date-time
  line_items:
    description: A single article that is part of an order.
    type: table
    fields:
      lines_item_id:
        type: text
        required: true
        primary: true
        unique: true
        description: Primary key of the line_items table
      order_id:
        ref: '#/definitions/order_id'
        title: Order ID
        type: text
        format: uuid
        required: false
        primary: false
        unique: false
        references: orders.order_id
        description: An internal ID that identifies an order in the online shop.
        pii: true
        classification: restricted
        tags:
        - orders
        example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
      sku:
        ref: '#/definitions/sku'
        title: Stock Keeping Unit
        type: text
        required: false
        primary: false
        unique: false
        description: The purchased article number
        pattern: ^[A-Za-z0-9]{8,14}$
        tags:
        - inventory
        links:
          wikipedia: https://en.wikipedia.org/wiki/Stock_keeping_unit
        example: '96385074'
definitions:
  order_id:
    domain: checkout
    name: order_id
    title: Order ID
    description: An internal ID that identifies an order in the online shop.
    type: text
    format: uuid
    pii: true
    classification: restricted
    tags:
    - orders
    example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
  sku:
    domain: inventory
    name: sku
    title: Stock Keeping Unit
    description: "A Stock Keeping Unit (SKU) is an internal unique identifier for\
      \ an article. \nIt is typically associated with an article's barcode, such as\
      \ the EAN/GTIN.\n"
    type: text
    pattern: ^[A-Za-z0-9]{8,14}$
    tags:
    - inventory
    links:
      wikipedia: https://en.wikipedia.org/wiki/Stock_keeping_unit
    example: '96385074'
examples:
- type: csv
  description: An example list of order records.
  model: orders
  data: 'order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp

    "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z"

    "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z"

    "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z"

    "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"

    "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"

    "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z"

    "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z"

    "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z"

    "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z"

    "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z"

    '
- type: csv
  description: An example list of line items.
  model: line_items
  data: 'lines_item_id,order_id,sku

    "LI-1","1001","5901234123457"

    "LI-2","1001","4001234567890"

    "LI-3","1002","5901234123457"

    "LI-4","1002","2001234567893"

    "LI-5","1003","4001234567890"

    "LI-6","1003","5001234567892"

    "LI-7","1004","5901234123457"

    "LI-8","1005","2001234567893"

    "LI-9","1005","5001234567892"

    "LI-10","1005","6001234567891"

    '
quality:
  type: SodaCL
  specification:
    checks for orders:
    - row_count >= 5
    - duplicate_count(order_id) = 0
    checks for line_items:
    - values in (order_id) must exist in orders (order_id)
    - row_count >= 5
servicelevels:
  availability:
    description: The server is available during support hours
    percentage: 99.9%
  retention:
    description: Data is retained for one year
    period: P1Y
    unlimited: false
  latency:
    description: Data is available within 25 hours after the order was placed
    threshold: 25h
    sourceTimestampField: orders.order_timestamp
    processedTimestampField: orders.processed_timestamp
  freshness:
    description: The age of the youngest row in a table.
    threshold: 25h
    timestampField: orders.order_timestamp
  frequency:
    description: Data is delivered once a day
    type: batch
    interval: daily
    cron: 0 0 * * *
  support:
    description: The data is available during typical business hours at headquarters
    time: 9am to 5pm in EST on business days
    responseTime: 1h
  backup:
    description: Data is backed up once a week, every Sunday at 0:00 UTC.
    interval: weekly
    cron: 0 0 * * 0
    recoveryTime: 24 hours
    recoveryPoint: 1 week
links:
  datacontractCli: https://cli.datacontract.com
tags:
- checkout
- orders
- s3

================================================ FILE: examples/orders-latest/datacontract.yaml ================================================
dataContractSpecification: 1.2.0
id: orders-latest
info:
  title: Orders Latest
  version: 2.0.0
  description: |
    Successful customer orders in the webshop.
    All orders since 2020-01-01.
    Orders with their line items are in their current state (no history included).
  owner: Checkout Team
  contact:
    name: John Doe (Data Product Owner)
    url: https://teams.microsoft.com/l/channel/example/checkout
servers:
  production:
    type: s3
    environment: prod
    location: s3://datacontract-example-orders-latest/v2/{model}/*.json
    format: json
    delimiter: new_line
    description: "One folder per model. One file per day."
    roles:
      - name: analyst_us
        description: Access to the data for US region
      - name: analyst_cn
        description: Access to the data for China region
terms:
  usage: |
    Data can be used for reports, analytics and machine learning use cases.
    Orders may be linked and joined with other tables.
  limitations: |
    Not suitable for real-time use cases.
    Data may not be used to identify individual customers.
    Max data processing per day: 10 TiB
  policies:
    - name: privacy-policy
      url: https://example.com/privacy-policy
    - name: license
      description: External data is licensed under agreement 1234.
      url: https://example.com/license/1234
  billing: 5000 USD per month
  noticePeriod: P3M
models:
  orders:
    description: One record per order. Includes cancelled and deleted orders.
    type: table
    fields:
      order_id:
        $ref: '#/definitions/order_id'
        required: true
        unique: true
        primaryKey: true
      order_timestamp:
        description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
        type: timestamp
        required: true
        examples:
          - "2024-09-09T08:30:00Z"
        tags: ["business-timestamp"]
      order_total:
        description: Total amount in the smallest monetary unit (e.g., cents).
        type: long
        required: true
        examples:
          - 9999
        quality:
          - type: sql
            description: 95% of all order total values are expected to be between 10 and 499 EUR.
            query: |
              SELECT quantile_cont(order_total, 0.95) AS percentile_95
              FROM orders
            mustBeBetween: [1000, 49900]
      customer_id:
        description: Unique identifier for the customer.
        type: text
        minLength: 10
        maxLength: 20
      customer_email_address:
        description: The email address, as entered by the customer.
        type: text
        format: email
        required: true
        pii: true
        classification: sensitive
        quality:
          - type: text
            description: The email address is not verified and may be invalid.
        lineage:
          inputFields:
            - namespace: com.example.service.checkout
              name: checkout_db.orders
              field: email_address
      processed_timestamp:
        description: The timestamp when the record was processed by the data platform.
        type: timestamp
        required: true
        config:
          jsonType: string
          jsonFormat: date-time
    quality:
      - type: sql
        description: The maximum duration between two orders should be less than 3600 seconds
        query: |
          SELECT MAX(duration) AS max_duration
          FROM (SELECT EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp))) AS duration FROM orders)
        mustBeLessThan: 3600
      - type: sql
        description: Row Count
        query: |
          SELECT count(*) as row_count
          FROM orders
        mustBeGreaterThan: 5
    examples:
      - |
        order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp
        "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z"
        "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z"
        "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z"
        "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
        "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
        "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z"
        "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z"
        "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z"
        "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z"
        "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z"
  line_items:
    description: A single article that is part of an order.
    type: table
    fields:
      line_item_id:
        type: text
        description: Primary key of the line_items table
        required: true
      order_id:
        $ref: '#/definitions/order_id'
        references: orders.order_id
      sku:
        description: The purchased article number
        $ref: '#/definitions/sku'
    primaryKey: ["order_id", "line_item_id"]
    examples:
      - |
        line_item_id,order_id,sku
        "LI-1","1001","5901234123457"
        "LI-2","1001","4001234567890"
        "LI-3","1002","5901234123457"
        "LI-4","1002","2001234567893"
        "LI-5","1003","4001234567890"
        "LI-6","1003","5001234567892"
        "LI-7","1004","5901234123457"
        "LI-8","1005","2001234567893"
        "LI-9","1005","5001234567892"
        "LI-10","1005","6001234567891"
definitions:
  order_id:
    title: Order ID
    type: text
    format: uuid
    description: An internal ID that identifies an order in the online shop.
    examples:
      - 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
    pii: true
    classification: restricted
    tags:
      - orders
  sku:
    title: Stock Keeping Unit
    type: text
    pattern: ^[A-Za-z0-9]{8,14}$
    examples:
      - "96385074"
    description: |
      A Stock Keeping Unit (SKU) is an internal unique identifier for an article.
      It is typically associated with an article's barcode, such as the EAN/GTIN.
    links:
      wikipedia: https://en.wikipedia.org/wiki/Stock_keeping_unit
    tags:
      - inventory
servicelevels:
  availability:
    description: The server is available during support hours
    percentage: 99.9%
  retention:
    description: Data is retained for one year
    period: P1Y
    unlimited: false
  latency:
    description: Data is available within 25 hours after the order was placed
    threshold: 25h
    sourceTimestampField: orders.order_timestamp
    processedTimestampField: orders.processed_timestamp
  freshness:
    description: The age of the youngest row in a table.
    threshold: 25h
    timestampField: orders.order_timestamp
  frequency:
    description: Data is delivered once a day
    type: batch # or streaming
    interval: daily # for batch, either or cron
    cron: 0 0 * * * # for batch, either or interval
  support:
    description: The data is available during typical business hours at headquarters
    time: 9am to 5pm in EST on business days
    responseTime: 1h
  backup:
    description: Data is backed up once a week, every Sunday at 0:00 UTC.
    interval: weekly
    cron: 0 0 * * 0
    recoveryTime: 24 hours
    recoveryPoint: 1 week
tags:
  - checkout
  - orders
  - s3
links:
  datacontractCli: https://cli.datacontract.com

================================================ FILE: examples/orders-latest-nested/datacontract.html ================================================
Data Contract

Data Contract

urn:orders-latest-nested

Info

Information about the data contract

Title
Orders Latest (Nested)
Version
1.0.0
Description
Successful customer orders in the webshop. All orders since 2020-01-01. Orders with their line items are in their current state (no history included).
Owner
Checkout Team
Contact
John Doe (Data Product Owner)

Terms

Terms and conditions of the data contract

Usage
Data can be used for reports, analytics and machine learning use cases. Orders may be linked and joined with other tables.
Limitations
Not suitable for real-time use cases. Data may not be used to identify individual customers. Max data processing per day: 10 TiB
Billing
5000 USD per month
Notice Period
P3M

Data Model

The logical data model

orders table
One record per order. Includes cancelled and deleted orders.
Order ID
order_id
text
An internal ID that identifies an order in the online shop.
Example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
primary required unique format:uuid restricted PII
order_timestamp
timestamp
The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
required
order_total
long
Total amount in the smallest monetary unit (e.g., cents).
required
customer_id
text
Unique identifier for the customer.
minLength:10 maxLength:20
customer_email_address
text
The email address, as entered by the customer. The email address was not verified.
required format:email
address
object
The delivery address of the customer.
 
street
text
The street name and house number.
 
city
text
The city name.
 
additional_lines
array
Additional address lines, such as floor, apartment, or company name.
processed_timestamp
timestamp
The timestamp when the record was processed by the data platform.
required
line_items table
A single article that is part of an order.
lines_item_id
text
Primary key of the line_items table
primary required unique
Order ID
order_id
text
An internal ID that identifies an order in the online shop.
Example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
format:uuid restricted PII
Stock Keeping Unit
sku
text
The purchased article number
Example: 96385074
pattern:^[A-Za-z0-9]{8,14}$

Definitions

Domain specific definitions in the data contract

order_id checkout
An internal ID that identifies an order in the online shop.
Order ID
order_id
text uuid
Example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
restricted PII
sku inventory
A Stock Keeping Unit (SKU) is an internal unique identifier for an article. It is typically associated with an article's barcode, such as the EAN/GTIN.
Stock Keeping Unit
sku
text
Example: 96385074
pattern:^[A-Za-z0-9]{8,14}$
Created at 27 Jun 2024 14:50:06 UTC with Data Contract CLI v0.10.8
dataContractSpecification: 0.9.3
id: urn:orders-latest-nested
info:
  title: Orders Latest (Nested)
  version: 1.0.0
  description: "Successful customer orders in the webshop. \nAll orders since 2020-01-01.\
    \ \nOrders with their line items are in their current state (no history included).\n"
  owner: Checkout Team
  contact:
    name: John Doe (Data Product Owner)
    url: https://teams.microsoft.com/l/channel/example/checkout
terms:
  usage: 'Data can be used for reports, analytics and machine learning use cases.

    Orders may be linked and joined with other tables.

    '
  limitations: 'Not suitable for real-time use cases.

    Data may not be used to identify individual customers.

    Max data processing per day: 10 TiB

    '
  billing: 5000 USD per month
  noticePeriod: P3M
models:
  orders:
    description: One record per order. Includes cancelled and deleted orders.
    type: table
    fields:
      order_id:
        ref: '#/definitions/order_id'
        title: Order ID
        type: text
        format: uuid
        required: true
        primary: true
        unique: true
        description: An internal ID that identifies an order in the online shop.
        pii: true
        classification: restricted
        example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
      order_timestamp:
        type: timestamp
        required: true
        primary: false
        unique: false
        description: The business timestamp in UTC when the order was successfully
          registered in the source system and the payment was successful.
      order_total:
        type: long
        required: true
        primary: false
        unique: false
        description: Total amount in the smallest monetary unit (e.g., cents).
      customer_id:
        type: text
        required: false
        primary: false
        unique: false
        description: Unique identifier for the customer.
        minLength: 10
        maxLength: 20
      customer_email_address:
        type: text
        format: email
        required: true
        primary: false
        unique: false
        description: The email address, as entered by the customer. The email address
          was not verified.
      address:
        type: object
        required: false
        primary: false
        unique: false
        description: The delivery address of the customer.
        fields:
          street:
            type: text
            required: false
            primary: false
            unique: false
            description: The street name and house number.
          city:
            type: text
            required: false
            primary: false
            unique: false
            description: The city name.
          additional_lines:
            type: array
            required: false
            primary: false
            unique: false
            description: Additional address lines, such as floor, apartment, or company
              name.
            items:
              type: text
              required: false
              primary: false
              unique: false
              description: Additional line
      processed_timestamp:
        type: timestamp
        required: true
        primary: false
        unique: false
        description: The timestamp when the record was processed by the data platform.
  line_items:
    description: A single article that is part of an order.
    type: table
    fields:
      lines_item_id:
        type: text
        required: true
        primary: true
        unique: true
        description: Primary key of the line_items table
      order_id:
        ref: '#/definitions/order_id'
        title: Order ID
        type: text
        format: uuid
        required: false
        primary: false
        unique: false
        references: orders.order_id
        description: An internal ID that identifies an order in the online shop.
        pii: true
        classification: restricted
        example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
      sku:
        ref: '#/definitions/sku'
        title: Stock Keeping Unit
        type: text
        required: false
        primary: false
        unique: false
        description: The purchased article number
        pattern: ^[A-Za-z0-9]{8,14}$
        example: '96385074'
definitions:
  order_id:
    domain: checkout
    name: order_id
    title: Order ID
    description: An internal ID that identifies an order in the online shop.
    type: text
    format: uuid
    pii: true
    classification: restricted
    example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
  sku:
    domain: inventory
    name: sku
    title: Stock Keeping Unit
    description: "A Stock Keeping Unit (SKU) is an internal unique identifier for\
      \ an article. \nIt is typically associated with an article's barcode, such as\
      \ the EAN/GTIN.\n"
    type: text
    pattern: ^[A-Za-z0-9]{8,14}$
    example: '96385074'

================================================ FILE: examples/orders-latest-nested/datacontract.yaml ================================================
dataContractSpecification: 0.9.3
id: urn:orders-latest-nested
info:
  title: Orders Latest (Nested)
  version: 1.0.0
  description: |
    Successful customer orders in the webshop.
    All orders since 2020-01-01.
    Orders with their line items are in their current state (no history included).
  owner: Checkout Team
  contact:
    name: John Doe (Data Product Owner)
    url: https://teams.microsoft.com/l/channel/example/checkout
terms:
  usage: |
    Data can be used for reports, analytics and machine learning use cases.
    Orders may be linked and joined with other tables.
  limitations: |
    Not suitable for real-time use cases.
    Data may not be used to identify individual customers.
    Max data processing per day: 10 TiB
  billing: 5000 USD per month
  noticePeriod: P3M
models:
  orders:
    description: One record per order. Includes cancelled and deleted orders.
    type: table
    fields:
      order_id:
        $ref: '#/definitions/order_id'
        required: true
        unique: true
        primary: true
      order_timestamp:
        description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
        type: timestamp
        required: true
      order_total:
        description: Total amount in the smallest monetary unit (e.g., cents).
        type: long
        required: true
      customer_id:
        description: Unique identifier for the customer.
        type: text
        minLength: 10
        maxLength: 20
      customer_email_address:
        description: The email address, as entered by the customer. The email address was not verified.
        type: text
        format: email
        required: true
      address:
        type: object
        description: The delivery address of the customer.
        fields:
          street:
            description: The street name and house number.
            type: text
          city:
            description: The city name.
            type: text
          additional_lines:
            description: Additional address lines, such as floor, apartment, or company name.
            type: array
            items:
              type: text
              description: Additional line
      processed_timestamp:
        description: The timestamp when the record was processed by the data platform.
        type: timestamp
        required: true
  line_items:
    description: A single article that is part of an order.
    type: table
    fields:
      lines_item_id:
        type: text
        description: Primary key of the line_items table
        required: true
        unique: true
        primary: true
      order_id:
        $ref: '#/definitions/order_id'
        references: orders.order_id
      sku:
        description: The purchased article number
        $ref: '#/definitions/sku'
definitions:
  order_id:
    domain: checkout
    name: order_id
    title: Order ID
    type: text
    format: uuid
    description: An internal ID that identifies an order in the online shop.
    example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
    pii: true
    classification: restricted
  sku:
    domain: inventory
    name: sku
    title: Stock Keeping Unit
    type: text
    pattern: ^[A-Za-z0-9]{8,14}$
    example: "96385074"
    description: |
      A Stock Keeping Unit (SKU) is an internal unique identifier for an article.
      It is typically associated with an article's barcode, such as the EAN/GTIN.

================================================ FILE: examples/time-example/datacontract.html ================================================
Time Data Type Example - Data Contract Specification v1.2.1
Data Contract Specification v1.2.1

Time Data Type Example

Overview

This example demonstrates the time data type introduced in Data Contract Specification v1.2.1. The time data type stores time-of-day values without date information, which suits business hours, schedules, and other time-based data.

Note: The time data type may not be supported by all server types. Please check your specific data platform's documentation for compatibility.

Data Contract Details

  • ID: time-demo
  • Title: Time Data Type Example
  • Version: 1.0.0
  • Owner: Data Contract Team

Model: business_hours

This model demonstrates how to use the time data type for business hours and schedules.

location_id
type: string
Unique identifier for the business location
Examples:
"loc_001"
"loc_002"
opening_time
type: time
Daily opening time for the business
Examples:
"09:00:00"
"08:30:00"
closing_time
type: time
Daily closing time for the business
Examples:
"17:00:00"
"21:00:00"
lunch_start
type: time
Start time of lunch break
Examples:
"12:00:00"
"12:30:00"
lunch_end
type: time
End time of lunch break
Examples:
"13:00:00"
"13:30:00"

Model: shift_schedules

This model demonstrates time data type usage for employee shift schedules.

shift_start
type: time
Start time of the employee's shift
Examples:
"08:00:00"
"16:00:00"
shift_end
type: time
End time of the employee's shift
Examples:
"16:00:00"
"00:00:00"

Sample Data

Business Hours CSV Example:
location_id,location_name,opening_time,closing_time,lunch_start,lunch_end
"loc_001","Downtown Store","09:00:00","17:00:00","12:00:00","13:00:00"
"loc_002","Mall Branch","08:30:00","21:00:00","12:30:00","13:30:00"
"loc_003","24/7 Store","00:00:00","23:59:59",,
Shift Schedules CSV Example:
employee_id,shift_start,shift_end,break_start,break_end
"emp_001","08:00:00","16:00:00","12:00:00","13:00:00"
"emp_002","16:00:00","00:00:00","20:00:00","21:00:00"
"emp_003","00:00:00","08:00:00",,

Time Data Type Characteristics

  • Format: Typically follows ISO 8601 time format (HH:MM:SS)
  • No Date Information: Contains only time components, no date
  • 24-hour Format: Uses 24-hour clock format
  • Optional Seconds: Can include or exclude seconds based on precision needs
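
The characteristics above can be sketched in Python. This is a minimal illustration, not part of the specification or the Data Contract CLI: the helper name `parse_time_of_day` is hypothetical, and it relies on the standard library's `datetime.time.fromisoformat`, which accepts both `HH:MM` and `HH:MM:SS` on a 24-hour clock.

```python
from datetime import time


def parse_time_of_day(value: str) -> time:
    """Parse a time-only string such as '09:00' or '09:00:00'.

    Hypothetical helper for illustration. time.fromisoformat raises
    ValueError for out-of-range values such as '25:00:00'.
    """
    return time.fromisoformat(value)


# Seconds are optional; both forms yield the same time-of-day value.
print(parse_time_of_day("09:00"))     # 09:00:00
print(parse_time_of_day("09:00:00"))  # 09:00:00
```

The parsed value carries no date component, which matches the "No Date Information" characteristic.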

Use Cases

Business Operations

  • Store opening and closing hours
  • Service availability times
  • Break and lunch schedules
  • Operating hours for different days

Employee Management

  • Shift start and end times
  • Break periods
  • Work schedule definitions
  • Time tracking data
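
Because a time value has no date, a shift such as `emp_002` in the sample data (16:00:00 to 00:00:00) ends at an earlier clock value than it starts. The sketch below shows one way to compute a shift length under the assumption, not stated in the contract, that an end time at or before the start time means the shift ends on the following day. The function name `shift_duration` is hypothetical.

```python
from datetime import datetime, timedelta


def shift_duration(start: str, end: str) -> timedelta:
    """Return the shift length for HH:MM:SS start and end strings.

    Assumption (illustrative only): if the end time is not later than
    the start time, the shift ends on the next day.
    """
    fmt = "%H:%M:%S"
    start_dt = datetime.strptime(start, fmt)
    end_dt = datetime.strptime(end, fmt)
    if end_dt <= start_dt:
        end_dt += timedelta(days=1)  # wrap past midnight
    return end_dt - start_dt


print(shift_duration("08:00:00", "16:00:00"))  # 8:00:00
print(shift_duration("16:00:00", "00:00:00"))  # 8:00:00
```

Consumers that need unambiguous overnight intervals may prefer to agree on such a rule explicitly, since the time type alone cannot express it.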

Event Scheduling

  • Meeting start and end times
  • Event schedules
  • Appointment times
  • Class schedules

Transportation

  • Departure and arrival times
  • Bus and train schedules
  • Flight times
  • Delivery time windows

Comparison with Other Time-Related Types

| Data Type | Description | Example |
|-----------|-------------|---------|
| time | Time only, no date | "09:00:00" |
| date | Date only, no time | "2024-01-15" |
| timestamp | Date and time with timezone | "2024-01-15T09:00:00Z" |
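
As a rough analogy, the three types in the comparison correspond to `time`, `date`, and `datetime` in Python's standard library. The mapping is an illustration only; how each server type represents these values is platform-specific. The trailing `Z` is rewritten to `+00:00` because `datetime.fromisoformat` only accepts `Z` from Python 3.11 onward.

```python
from datetime import date, datetime, time

# The three example values from the comparison, parsed with the stdlib.
t = time.fromisoformat("09:00:00")    # time only, no date
d = date.fromisoformat("2024-01-15")  # date only, no time
ts = datetime.fromisoformat("2024-01-15T09:00:00Z".replace("Z", "+00:00"))

# A timestamp combines a date, a time, and a timezone.
combined = datetime.combine(d, t, tzinfo=ts.tzinfo)
print(combined == ts)  # True
```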

Testing

You can test this data contract using the Data Contract CLI:

datacontract test examples/time-example/datacontract.yaml
Important: When using the time data type, ensure your data processing tools and pipelines are compatible with time-only data. Some data platforms may require specific configurations or have limitations when working with time data types.
================================================ FILE: examples/time-example/datacontract.yaml ================================================ dataContractSpecification: 1.2.1 id: time-demo info: title: Time Data Type Example version: 1.0.0 description: | Example demonstrating the usage of the time data type introduced in Data Contract Specification v1.2.1. owner: Data Contract Team contact: name: Data Contract Team url: https://github.com/datacontract/datacontract-specification servers: production: type: s3 environment: prod location: s3://example-time-demo/{model}/*.json format: json delimiter: new_line description: "Example data with time fields" terms: usage: | This is an example demonstrating the new time data type. Data can be used for testing and educational purposes. limitations: | This is example data only and should not be used in production. models: business_hours: description: Business hours for different locations type: table fields: location_id: type: string description: Unique identifier for the business location required: true primaryKey: true examples: - "loc_001" - "loc_002" location_name: type: string description: Name of the business location required: true examples: - "Downtown Store" - "Mall Branch" opening_time: type: time description: Daily opening time for the business required: true examples: - "09:00:00" - "08:30:00" closing_time: type: time description: Daily closing time for the business required: true examples: - "17:00:00" - "21:00:00" lunch_start: type: time description: Start time of lunch break required: false examples: - "12:00:00" - "12:30:00" lunch_end: type: time description: End time of lunch break required: false examples: - "13:00:00" - "13:30:00" examples: - | location_id,location_name,opening_time,closing_time,lunch_start,lunch_end "loc_001","Downtown Store","09:00:00","17:00:00","12:00:00","13:00:00" "loc_002","Mall Branch","08:30:00","21:00:00","12:30:00","13:30:00" "loc_003","24/7 Store","00:00:00","23:59:59",, 
shift_schedules: description: Employee shift schedules type: table fields: employee_id: type: string description: Unique identifier for the employee required: true primaryKey: true examples: - "emp_001" - "emp_002" shift_start: type: time description: Start time of the employee's shift required: true examples: - "08:00:00" - "16:00:00" shift_end: type: time description: End time of the employee's shift required: true examples: - "16:00:00" - "00:00:00" break_start: type: time description: Start time of the break period required: false examples: - "12:00:00" - "20:00:00" break_end: type: time description: End time of the break period required: false examples: - "13:00:00" - "21:00:00" examples: - | employee_id,shift_start,shift_end,break_start,break_end "emp_001","08:00:00","16:00:00","12:00:00","13:00:00" "emp_002","16:00:00","00:00:00","20:00:00","21:00:00" "emp_003","00:00:00","08:00:00",, tags: - example - time - v1.2.1 links: datacontractCli: https://cli.datacontract.com ================================================ FILE: examples/variant-json-example/datacontract.yaml ================================================ dataContractSpecification: 1.2.1 id: variant-json-demo info: title: Variant and JSON Data Types Example version: 1.0.0 description: | Example demonstrating the usage of variant and json data types introduced in Data Contract Specification v1.2.1. owner: Data Contract Team contact: name: Data Contract Team url: https://github.com/datacontract/datacontract-specification servers: production: type: s3 environment: prod location: s3://example-variant-json-demo/{model}/*.json format: json delimiter: new_line description: "Example data with variant and json fields" terms: usage: | This is an example demonstrating the new variant and json data types. Data can be used for testing and educational purposes. limitations: | This is example data only and should not be used in production. 
models: user_profiles: description: User profiles with variant and JSON data fields type: table fields: user_id: type: string description: Unique identifier for the user required: true primaryKey: true examples: - "user_123" - "user_456" profile_data: type: variant description: Semi-structured profile data that can contain various types of information required: false examples: - "John Doe" - 25 - true - {"preferences": {"theme": "dark", "language": "en"}} metadata: type: json description: JSON-formatted metadata about the user profile required: false examples: - '{"created_at": "2024-01-15T10:30:00Z", "source": "web_form", "version": 1}' - '{"tags": ["premium", "verified"], "settings": {"notifications": true}}' preferences: type: json description: User preferences stored as JSON required: false examples: - '{"theme": "dark", "language": "en", "timezone": "UTC"}' - '{"notifications": {"email": true, "sms": false, "push": true}}' examples: - | user_id,profile_data,metadata,preferences "user_123","John Doe",'{"created_at": "2024-01-15T10:30:00Z", "source": "web_form"}','{"theme": "dark", "language": "en"}' "user_456",25,'{"tags": ["premium"], "version": 2}','{"notifications": {"email": true, "sms": false}}' "user_789",true,'{"source": "api", "verified": true}','{"theme": "light", "timezone": "EST"}' tags: - example - variant - json - v1.2.1 links: datacontractCli: https://cli.datacontract.com ================================================ FILE: gen-openapi-yaml ================================================ #!/bin/bash # INSTALL BEFORE # npm install -g @openapi-contrib/json-schema-to-openapi-schema # brew install yq json-schema-to-openapi-schema convert datacontract.schema.json > datacontract.schema.openapi-format.json yq --input-format=json --output-format=yaml --prettyPrint datacontract.schema.openapi-format.json > datacontract.schema.openapi-format.yaml echo "Compare 'datacontract.schema.openapi-format.yaml' with openapi.yaml of the Data Mesh Manager" echo 
"Prepend 'DataContract:\\n' and match the indendation correctly. Then, compare in IntelliJ" ================================================ FILE: versions/0.9.0/README.md ================================================ # Data Contract Specification ![datacontract.png](images/datacontract.png) Data contracts bring data providers and data consumers together. A _data contract_ is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. A data contract is implemented by a data product's output port or other data technologies. Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees. The _data contract specification_ defines a YAML format to describe attributes of provided data sets. It is data platform neutral, yet supports well-known formats to express schemas (e.g., dbt models, JSON Schema, Protobuf, SQL DDL) and quality tests (e.g., SodaCL, SQL queries) to avoid unnecessary abstractions. The data contract specification is an open initiative to define a common data contract format. Think of an [OpenAPI specification](https://www.openapis.org/), but for data sets. Data contracts come into play when data is exchanged between different teams or organizational units, such as in a [data mesh architecture](https://www.datamesh-architecture.com/). First, and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted. They make semantic and quality expectations explicit. They are often created in [workshops](/workshop). Later in development and production, they also serve as the basis for code generation, testing, schema validations, quality checks, monitoring, access control, and computational governance policies. 
_Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties. The term "contract" may be somewhat misleading, but it is how it is used in practice. The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract. Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes._ The specification is inspired by [AIDA User Group's Open Data Contract Standard](https://github.com/AIDAUserGroup/open-data-contract-standard), (formerly [PayPal's Data Contract Template](https://github.com/paypal/data-contract-template/blob/main/docs/README.md)) and Data Mesh Manager's [Data Contract API](https://www.datamesh-manager.com). It follows [OpenAPI](https://www.openapis.org/) and [AsyncAPI](https://www.asyncapi.com/) conventions. Version --- 0.9.0 Example --- [![Open in Data Contract Studio](https://img.shields.io/badge/open%20in-Data%20Contract%20Studio-blue)](https://studio.datacontract.com/) ```yaml dataContractSpecification: 0.9.0 id: urn:datacontract:checkout:orders-latest-npii info: title: Orders Latest NPII version: 1.0.0 description: Successful customer orders in the webshop. All orders since 2020-01-01. Orders with their line items are in their current state (no history included). PII data is removed. owner: Checkout Team contact: name: John Doe (Data Product Owner) email: john.doe@example.com servers: production: type: BigQuery project: acme_orders_prod dataset: bigquery_orders_latest_npii_v1 terms: usage: > Data can be used for reports, analytics and machine learning use cases. Order may be linked and joined by other tables limitations: > Not suitable for real-time use cases. Data may not be used to identify individual customers. 
Max data processing per day: 10 TiB billing: 5000 USD per month noticePeriod: P3M schema: type: dbt # the specification format: dbt, bigquery, avro, protobuf, sql, json-schema, custom specification: # expressed as string or inline yaml or via "$ref: model.yaml" version: 2 description: The subset of the output port's data model that we agree to use models: - name: orders description: > One record per order. Includes cancelled and deleted orders. columns: - name: order_id data_type: string description: Primary key of the orders table - name: order_timestamp data_type: timestamptz description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. - name: order_total data_type: integer description: "Total amount of the order in the smallest monetary unit (e.g., cents)." - name: line_items description: > The items that are part of an order columns: - name: lines_item_id data_type: string description: Primary key of the lines_item_id table - name: order_id data_type: string description: Foreign key to the orders table - name: sku data_type: string description: The purchased article number examples: - type: csv # csv, json, yaml, custom model: orders data: |- # expressed as string or inline yaml or via "$ref: data.csv" order_id,order_timestamp,order_total "1001","2023-09-09T08:30:00Z",2500 "1002","2023-09-08T15:45:00Z",1800 "1003","2023-09-07T12:15:00Z",3200 "1004","2023-09-06T19:20:00Z",1500 "1005","2023-09-05T10:10:00Z",4200 "1006","2023-09-04T14:55:00Z",2800 "1007","2023-09-03T21:05:00Z",1900 "1008","2023-09-02T17:40:00Z",3600 "1009","2023-09-01T09:25:00Z",3100 "1010","2023-08-31T22:50:00Z",2700 - type: csv model: line_items data: |- lines_item_id,order_id,sku "1","1001","5901234123457" "2","1001","4001234567890" "3","1002","5901234123457" "4","1002","2001234567893" "5","1003","4001234567890" "6","1003","5001234567892" "7","1004","5901234123457" "8","1005","2001234567893" "9","1005","5001234567892" 
"10","1005","6001234567891" quality: type: SodaCL # data quality check format: SodaCL, montecarlo, custom specification: # expressed as string or inline yaml or via "$ref: checks.yaml" checks for orders: - freshness(order_timestamp) < 24h - row_count > 500000 - duplicate_count(order_id) = 0 checks for line_items: - row_count > 500000 ``` Schema --- [JSON Schema](https://github.com/datacontract/datacontract-specification/blob/main/datacontract.schema.json) of the Data Contract Specification. ### Data Contract Object This is the root document. It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. | Field | Type | Description | |---------------------------|------------------------------------|-------------------------------------------------------------------------------------------------------| | dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | | id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | | info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | | servers | [Servers Object](#servers-object) | Specifies the servers of the data contract. | | terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | | schema | [Schema Object](#schema-object) | Specifies the data contract schema. The specification supports different schemas. | | examples | [Examples Object](#examples-object) | Specifies example data sets for the schema. The specification supports different example types. | | quality | [Quality Object](#quality-object) | Specifies the quality attributes and checks. The specification supports different quality check DSLs. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Info Object Metadata and life cycle information about the data contract. 
| Field | Type | Description | |---------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| | title | `string` | REQUIRED. The title of the data contract. | | version | `string` | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). | | description | `string` | A description of the data contract. | | owner | `string` | The owner or team responsible for managing the data contract and providing the data. | | dataProduct | `string` | The identifier of the data product that contains the output port providing the data. | | outputPort | `string` | DEPRECATED. The identifier of the output port that implements the data contract. | | contact | [Contact Object](#contact-object) | Contact information for the data contract. | ### Contact Object Contact information for the data contract. | Field | Type | Description | |-------|----------|-------------------------------------------------------------------------------------------------------| | name | `string` | The identifying name of the contact person/organization. | | url | `string` | The URL pointing to the contact information. This _MUST_ be in the form of a URL. | | email | `string` | The email address of the contact person/organization. This _MUST_ be in the form of an email address. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Servers Object Information about the servers. The Servers Object is a map of [Server Objects](#server-object). ### Server Object The fields are dependent on the defined type. 
| Field | Type | Description | |-------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `redshift`, `snowflake`, `databricks`, `kafka` | | description | `string` | An optional string describing the server. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### BigQuery Server Object | Field | Type | Description | |---------|----------|-------------| | type | `string` | `bigquery` | | project | `string` | | | dataset | `string` | | #### S3 Server Object | Field | Type | Description | |----------|----------|--------------------------------| | type | `string` | `s3` | | location | `string` | S3 URL, starting with `s3://` | Example: ```yaml servers: production: type: s3 location: s3://acme-orders-prod/orders/ ``` #### Redshift Server Object | Field | Type | Description | |----------|----------|-------------| | type | `string` | `redshift` | | account | `string` | | | database | `string` | | | schema | `string` | | #### Snowflake Server Object | Field | Type | Description | |----------|----------|-------------| | type | `string` | `snowflake` | | account | `string` | | | database | `string` | | | schema | `string` | | #### Databricks Server Object | Field | Type | Description | |----------|----------|--------------| | type | `string` | `databricks` | | share | `string` | | #### Kafka Server Object | Field | Type | Description | |-------|----------|-------------| | type | `string` | `kafka` | | host | `string` | | | topic | `string` | | ### Terms Object The terms and conditions of the data contract. 
| Field | Type | Description | |----------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | usage | `string` | The usage describes the way the data is expected to be used. Can contain business and technical information. | | limitations | `string` | The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for. | | billing | `string` | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. | | noticePeriod | `string` | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. | ### Schema Object The schema of the data contract describes the syntax and semantics of provided data sets. As the type of the output port depends on the data platform, multiple schema specifications are supported. A schema may define a single table, a collection of tables as a dataset, a file structure, or any arbitrary structure. To avoid unnecessary abstractions, the data contract specification supports existing well-known formats. Some schema types, such as `dbt`, also support defining tests and additional metadata. | Field | Type | Description | | ----- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | REQUIRED. The type of the schema.
Typical values are: `dbt`, `bigquery`, `json-schema`, `sql-ddl`, `avro`, `protobuf`, `custom` | | specification | [dbt Schema Object](#dbt-schema-object) \|
[BigQuery Schema Object](#bigquery-schema-object) \|
[JSON Schema Schema Object](#json-schema-schema-object) \|
[SQL DDL Schema Object](#sql-ddl-schema-object) \|
`string` | REQUIRED. The specification of the schema. The schema specification can be encoded as a string or as inline YAML. | #### dbt Schema Object https://docs.getdbt.com/reference/model-properties Example (inline YAML): ```yaml schema: type: dbt specification: version: 2 models: - name: "My Table" description: "My description" columns: - name: "My column" data_type: text description: "My description" ``` Example (string): ```yaml schema: type: dbt specification: |- version: 2 models: - name: "My Table" description: "My description" columns: - name: "My column" data_type: text description: "My description" ``` #### BigQuery Schema Object The schema structure is defined by the [Google BigQuery Table](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource:-table) object. You can extract such a Table object via the [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) endpoint. Instead of providing a single Table object, you can also provide an array of such objects. Be aware that [tables.list](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list) only returns a subset of the full Table object. You need to call every Table object via [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) to get the full Table object, including the actual schema. 
Learn more: [Google BigQuery REST Reference v2](https://cloud.google.com/bigquery/docs/reference/rest) Example: ```yaml schema: type: bigquery specification: |- { "tableReference": { "projectId": "my-project", "datasetId": "my_dataset", "tableId": "my_table" }, "description": "This is a description", "type": "TABLE", "schema": { "fields": [ { "name": "name", "type": "STRING", "mode": "NULLABLE", "description": "This is a description" } ] } } ``` #### JSON Schema Schema Object JSON Schema can be defined as JSON or rendered as YAML, following the [OpenAPI Schema Object dialect](https://spec.openapis.org/oas/v3.1.0#properties) Example (inline YAML): ```yaml schema: type: json-schema specification: orders: description: One record per order. Includes cancelled and deleted orders. type: object properties: order_id: type: string description: Primary key of the orders table order_timestamp: type: string format: date-time description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. order_total: type: integer description: Total amount of the order in the smallest monetary unit (e.g., cents). line_items: type: object properties: lines_item_id: type: string description: Primary key of the lines_item_id table order_id: type: string description: Foreign key to the orders table sku: type: string description: The purchased article number ``` Example (string): ```yaml schema: type: json-schema specification: |- { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "orders": { "type": "object", "description": "One record per order. Includes cancelled and deleted orders.", "properties": { "order_id": { "type": "string", "description": "Primary key of the orders table" }, "order_timestamp": { "type": "string", "format": "date-time", "description": "The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful." 
}, "order_total": { "type": "integer", "description": "Total amount of the order in the smallest monetary unit (e.g., cents)." } }, "required": ["order_id", "order_timestamp", "order_total"] }, "line_items": { "type": "object", "properties": { "lines_item_id": { "type": "string", "description": "Primary key of the lines_item_id table" }, "order_id": { "type": "string", "description": "Foreign key to the orders table" }, "sku": { "type": "string", "description": "The purchased article number" } }, "required": ["lines_item_id", "order_id", "sku"] } }, "required": ["orders", "line_items"] } ``` #### SQL DDL Schema Object Classical SQL DDLs can be used to describe the structure. Example (string): ```yaml schema: type: sql-ddl specification: |- -- One record per order. Includes cancelled and deleted orders. CREATE TABLE orders ( order_id TEXT PRIMARY KEY, -- Primary key of the orders table order_timestamp TIMESTAMPTZ NOT NULL, -- The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. order_total INTEGER NOT NULL -- Total amount of the order in the smallest monetary unit (e.g., cents) ); -- The items that are part of an order CREATE TABLE line_items ( lines_item_id TEXT PRIMARY KEY, -- Primary key of the lines_item_id table order_id TEXT REFERENCES orders(order_id), -- Foreign key to the orders table sku TEXT NOT NULL -- The purchased article number ); ``` ### Examples Object The Examples Object is an array of [Example Objects](#examples-object). ### Example Object | Field | Type | Description | |-------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | The type of the data product technology that implements the data contract. Well-known server types are: `csv`, `json`, `yaml`, `custom` | | description | `string` | An optional string describing the example. 
| | model | `string` | The reference to the model in the schema, e.g. a table name. | | data | `string` | Example data for this model. | Example: ```yaml examples: - type: csv model: orders data: |- order_id,order_timestamp,order_total "1001","2023-09-09T08:30:00Z",2500 "1002","2023-09-08T15:45:00Z",1800 "1003","2023-09-07T12:15:00Z",3200 "1004","2023-09-06T19:20:00Z",1500 "1005","2023-09-05T10:10:00Z",4200 "1006","2023-09-04T14:55:00Z",2800 "1007","2023-09-03T21:05:00Z",1900 "1008","2023-09-02T17:40:00Z",3600 "1009","2023-09-01T09:25:00Z",3100 "1010","2023-08-31T22:50:00Z",2700 ``` ### Quality Object The quality object contains quality attributes and checks. | Field | Type | Description | | ----- |-------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------| | type | `string` | REQUIRED. The type of the schema.
Typical values are: `SodaCL`, `montecarlo`, `custom` | | specification | [SodaCL Quality Object](#sodacl-quality-object) \|
[Monte Carlo Quality Object](#monte-carlo-quality-object) \|
`string` | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. |

#### SodaCL Quality Object

Quality attributes in [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html).

The `specification` represents the content of a `checks.yml` file.

Example (inline):

```yaml
quality:
  type: SodaCL  # data quality check format: SodaCL, montecarlo, dbt-tests, custom
  specification: # expressed as string or inline yaml or via "$ref: checks.yaml"
    checks for orders:
      - row_count > 0
      - duplicate_count(order_id) = 0
    checks for line_items:
      - row_count > 0
```

Example (string):

```yaml
quality:
  type: SodaCL
  specification: |-
    checks for search_queries:
      - freshness(search_timestamp) < 1d
      - row_count > 100000
      - missing_count(search_query) = 0
```

#### Monte Carlo Quality Object

Quality attributes defined as Monte Carlo's [Monitors as Code](https://docs.getmontecarlo.com/docs/monitors-as-code).

The `specification` represents the content of a `montecarlo.yml` file.

Example (string):

```yaml
quality:
  type: montecarlo
  specification: |-
    montecarlo:
      field_health:
        - table: project:dataset.table_name
          timestamp_field: created
      dimension_tracking:
        - table: project:dataset.table_name
          timestamp_field: created
          field: order_status
```

### Specification Extensions

While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points.

Custom fields can be added with any name. The value can be null, a primitive, an array or an object.
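
To illustrate specification extensions, the sketch below separates the top-level fields listed in the Data Contract Object table of this version from any custom fields. It is a hedged example, not how any existing tool works: the function `split_extensions` and the custom field `costCenter` are hypothetical, and the contract is given as an already-parsed Python dict.

```python
# Top-level fields from the Data Contract Object table of this version.
KNOWN_FIELDS = {
    "dataContractSpecification", "id", "info", "servers",
    "terms", "schema", "examples", "quality",
}


def split_extensions(contract: dict) -> tuple[dict, dict]:
    """Split a parsed contract into specified fields and custom fields."""
    known = {k: v for k, v in contract.items() if k in KNOWN_FIELDS}
    extensions = {k: v for k, v in contract.items() if k not in KNOWN_FIELDS}
    return known, extensions


contract = {
    "dataContractSpecification": "0.9.0",
    "id": "my-data-contract-id",
    "info": {"title": "My Data Contract", "version": "0.0.1"},
    "costCenter": "CC-1234",  # hypothetical custom field
}
known, extensions = split_extensions(contract)
print(extensions)  # {'costCenter': 'CC-1234'}
```
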
### Design Principles

The Data Contract Specification follows these design principles:

- Is an open standard and its serialization can be versioned in git
- Follows OpenAPI and AsyncAPI conventions so that it feels immediately familiar
- Supports tooling by being machine-readable
- Supports existing well-known formats to avoid unnecessary abstractions
- Supports contract-first approaches
- Supports code-first approaches

Tooling
---
- [Data Contract Studio](https://studio.datacontract.com/) is a free web tool to develop and share data contracts.
- [Data Contract CLI](https://github.com/datacontract/cli) is a free CLI tool to help you create, develop, and maintain your data contracts.
- [Data Mesh Manager](https://www.datamesh-manager.com/) is a commercial tool to manage data products and data contracts. It supports the data contract specification and allows the user to import or export data contracts using this specification.

Other Data Contract Specifications
---
- [AIDA User Group's Open Data Contract Standard](https://github.com/AIDAUserGroup/open-data-contract-standard)
- [PayPal's Data Contract Template](https://github.com/paypal/data-contract-template/blob/main/docs/README.md)

Literature
---
- [Driving Data Quality with Data Contracts](https://www.amazon.com/dp/B0C37FPH3D) by Andrew Jones

Authors
---
The Data Contract Specification was originally created by [Jochen Christ](https://www.linkedin.com/in/jochenchrist/) and [Dr. Simon Harrer](https://www.linkedin.com/in/simonharrer/), and is currently maintained by them.

Contributing
---
Contributions are welcome! Please open an issue or a pull request.

License --- [MIT License](LICENSE) ================================================ FILE: versions/0.9.0/datacontract.init.yaml ================================================ dataContractSpecification: 0.9.0 id: my-data-contract-id info: title: My Data Contract version: 0.0.1 # description: # owner: # contact: # name: # url: # email: ### servers #servers: # my-stage: # type: bigquery # project: # dataset: #servers: # my-stage: # type: s3 # location: s3:// #servers: # my-stage: # type: redshift # account: # database: # schema: #servers: # my-stage: # type: snowflake # account: # database: # schema: #servers: # my-stage: # type: databricks # share: #servers: # my-stage: # type: kafka # host: # topic: ### terms #terms: # usage: # limitations: # billing: # noticePeriod: ### schema #schema: # type: dbt # specification: # version: # models: # - name: # description: # columns: # - name: # type: # description: # tests: #schema: # type: dbt # specification: |- # version: # models: # - name: # description: # columns: # - name: # type: # description: # tests: #schema: # type: dbt # specification: "$ref: model.yaml" #schema: # type: bigquery # specification: |- # { # "tableReference": { # "projectId": "my-project", # "datasetId": "my_dataset", # "tableId": "my_table" # }, # "description": "This is a description", # "type": "TABLE", # "schema": { # "fields": [ # { # "name": "name", # "type": "STRING", # "mode": "NULLABLE", # "description": "This is a description" # } # ] # } # } #schema: # type: json-schema # specification: # my-table: # description: # type: object # properties: # id: # type: string # description: #schema: # type: json-schema # specification: |- # { # "$schema": "http://json-schema.org/draft-07/schema#", # "type": "object", # "properties": { # "my_table": { # "type": "object", # "description": "", # "properties": { # "id": { # "type": "string", # "description": "" # }, # "required": ["id"] # } # }, # "required": ["my-table"] # } #schema: # type: sql-ddl # 
specification: |- # CREATE TABLE my_table ( # id TEXT PRIMARY KEY # ); #schema: # type: avro # specification: # User: # type: record # name: MyTable # fields: # - name: id # type: string #schema: # type: avro # specification: |- # { # "type": "record", # "name": "MyTable", # "fields": [ # { # "name": "name", # "type": "string" # } # ] # } #schema: # type: protobuf # specification: |- # message MyTable { # string id = 1; # } #schema: # type: custom # specification: ### examples #examples: # - type: csv # model: my_table # data: |- # id,timestamp,amount # "1001","2023-09-09T08:30:00Z",2500 # "1002","2023-09-08T15:45:00Z",1800 # #examples: # - type: csv # model: my_table # data: "$ref: data.csv" #examples: # - type: json # model: my_table # data: |- # [ # { # "id": "1001", # "timestamp": "2023-09-09T08:30:00Z", # "amount": 2500 # }, # { # "id": "1002", # "timestamp": "2023-09-08T15:45:00Z", # "amount": 1800 # } # ] #examples: # - type: yaml # model: my_table # data: # - id: 1001 # timestamp: 2023-09-09T08:30:00Z # amount: 2500 # - id: 1002 # timestamp: 2023-09-08T15:45:00Z # amount: 1800 #examples: # - type: custom # model: my_table # data: |- ### quality #quality: # type: SodaCL # specification: # checks for my_table: # - duplicate_count(order_id) = 0 #quality: # type: SodaCL # specification: # checks for my_table: |- # - duplicate_count(id) = 0 #quality: # type: SodaCL # specification: # checks for my_table: "$ref: checks.yaml" #quality: # type: montecarlo # specification: |- # montecarlo: # field_health: # - table: my_project:my_dataset.my_table # fields: # - id # - timestamp # - amount # timestamp_field: timestamp #quality: # type: custom # specification: |- ================================================ FILE: versions/0.9.0/datacontract.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "dataContractSpecification": { "type": "string", "enum": [ "0.9.0" ], 
"description": "Specifies the Data Contract Specification being used." }, "id": { "type": "string", "description": "Specifies the identifier of the data contract." }, "info": { "type": "object", "properties": { "title": { "type": "string", "description": "The title of the data contract." }, "version": { "type": "string", "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." }, "description": { "type": "string", "description": "A description of the data contract." }, "owner": { "type": "string", "description": "The owner or team responsible for managing the data contract and providing the data." }, "dataProduct": { "type": "string", "description": "The data product that contains the output port providing the data." }, "outputPort": { "type": "string", "description": "The output port that implements the data contract." }, "contact": { "type": "object", "properties": { "name": { "type": "string", "description": "The identifying name of the contact person/organization." }, "url": { "type": "string", "format": "uri", "description": "The URL pointing to the contact information. This MUST be in the form of a URL." }, "email": { "type": "string", "format": "email", "description": "The email address of the contact person/organization. This MUST be in the form of an email address." } }, "description": "Contact information for the data contract." } }, "required": [ "title", "version" ], "description": "Metadata and life cycle information about the data contract." }, "servers": { "type": "object", "additionalProperties": { "anyOf": [ { "type": "object", "properties": { "type": { "type": "string", "enum": [ "bigquery", "BigQuery" ], "description": "The type of the data product technology that implements the data contract." }, "project": { "type": "string", "description": "An optional string describing the server." 
}, "dataset": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "type", "project", "dataset" ] }, { "type": "object", "properties": { "type": { "type": "string", "enum": [ "s3" ], "description": "The type of the data product technology that implements the data contract." }, "location": { "type": "string", "format": "uri", "description": "An optional string describing the server. Must be in the form of a URL." } }, "required": [ "type", "location" ] }, { "type": "object", "properties": { "type": { "type": "string", "enum": [ "redshift" ], "description": "The type of the data product technology that implements the data contract." }, "account": { "type": "string", "description": "An optional string describing the server." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "type", "account", "database", "schema" ] }, { "type": "object", "properties": { "type": { "type": "string", "enum": [ "snowflake" ], "description": "The type of the data product technology that implements the data contract." }, "account": { "type": "string", "description": "An optional string describing the server." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "type", "account", "database", "schema" ] }, { "type": "object", "properties": { "type": { "type": "string", "enum": [ "databricks" ], "description": "The type of the data product technology that implements the data contract." }, "share": { "type": "string", "description": "An optional string describing the server." 
} }, "required": [ "type", "share" ] }, { "type": "object", "properties": { "type": { "type": "string", "enum": [ "kafka" ], "description": "The type of the data product technology that implements the data contract." }, "host": { "type": "string", "description": "An optional string describing the server." }, "topic": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "type", "host", "topic" ] } ] }, "description": "Information about the servers." }, "terms": { "type": "object", "description": "The terms and conditions of the data contract.", "properties": { "usage": { "type": "string", "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." }, "limitations": { "type": "string", "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." }, "billing": { "type": "string", "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." }, "noticePeriod": { "type": "string", "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." } } }, "schema": { "type": "object", "properties": { "type": { "type": "string", "enum": [ "dbt", "bigquery", "json-schema", "sql-ddl", "avro", "protobuf", "custom" ], "description": "The type of the schema. Typical values are: dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom." }, "specification": { "anyOf": [ { "type": "string", "description": "The specification of the schema as a string." }, { "type": "object", "description": "The specification of the schema as an object." 
} ] } }, "required": [ "type", "specification" ], "description": "The schema of the data contract describes the syntax and semantics of provided data sets. It supports different schema types." }, "examples": { "type": "array", "items": { "type": "object", "properties": { "type": { "type": "string", "enum": [ "csv", "json", "yaml", "custom" ], "description": "The type of the example data. Well-known types are: csv, json, yaml, custom." }, "description": { "type": "string", "description": "An optional string describing the example." }, "model": { "type": "string", "description": "The reference to the model in the schema, e.g., a table name." }, "data": { "type": "string", "description": "Example data for this model." } }, "required": [ "type", "model", "data" ] }, "description": "The Examples Object is an array of Example Objects." }, "quality": { "type": "object", "properties": { "type": { "type": "string", "enum": [ "SodaCL", "montecarlo", "custom" ], "description": "The type of the quality check. Typical values are: SodaCL, montecarlo, custom." }, "specification": { "anyOf": [ { "type": "string", "description": "The specification of the quality attributes as a string." }, { "type": "object", "description": "The specification of the quality attributes as an object." } ] } }, "required": [ "type", "specification" ], "description": "The quality object contains quality attributes and checks." } }, "required": [ "dataContractSpecification", "id", "info" ] } ================================================ FILE: versions/0.9.1/README.md ================================================ # Data Contract Specification ![datacontract.png](images/datacontract.png) Data contracts bring data providers and data consumers together. A _data contract_ is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. 
A data contract is implemented by a data product's output port or other data technologies. Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees. The _data contract specification_ defines a YAML format to describe attributes of provided data sets. It is data platform neutral and can be used with any data platform, such as AWS S3, Google BigQuery, Microsoft Fabric, Databricks, and Snowflake. The data contract specification is an open initiative to define a common data contract format. It follows [OpenAPI](https://www.openapis.org/) and [AsyncAPI](https://www.asyncapi.com/) conventions. Data contracts come into play when data is exchanged between different teams or organizational units, such as in a [data mesh architecture](https://www.datamesh-architecture.com/). First, and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted. They make semantic and quality expectations explicit. They are often created collaboratively in [workshops](/workshop) together with data providers and data consumers. Later in development and production, they also serve as the basis for code generation, testing, schema validations, quality checks, monitoring, access control, and computational governance policies. The specification comes along with the [Data Contract CLI](https://github.com/datacontract/cli), an open-source tool to develop, validate, and enforce data contracts. _Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties. The term "contract" may be somewhat misleading, but it is how it is used in practice. The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract. 
Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes._ Version --- 0.9.1 ([Changelog](CHANGELOG.md)) Example --- [![Open in Data Contract Studio](https://img.shields.io/badge/open%20in-Data%20Contract%20Studio-blue)](https://studio.datacontract.com/) ```yaml dataContractSpecification: 0.9.1 id: urn:datacontract:checkout:orders-latest-npii info: title: Orders Latest NPII version: 1.0.0 description: Successful customer orders in the webshop. All orders since 2020-01-01. Orders with their line items are in their current state (no history included). PII data is removed. owner: Checkout Team contact: name: John Doe (Data Product Owner) email: john.doe@example.com servers: production: type: BigQuery project: acme_orders_prod dataset: bigquery_orders_latest_npii_v1 terms: usage: > Data can be used for reports, analytics and machine learning use cases. Order may be linked and joined by other tables limitations: > Not suitable for real-time use cases. Data may not be used to identify individual customers. Max data processing per day: 10 TiB billing: 5000 USD per month noticePeriod: P3M models: orders: description: One record per order. Includes cancelled and deleted orders. type: table fields: order_id: $ref: '#/definitions/order_id' order_timestamp: type: timestamp description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. order_total: type: long description: Total amount the smallest monetary unit (e.g., cents). line_items: description: A single article that is part of an order. 
type: table fields: lines_item_id: type: string description: Primary key of the lines_item_id table order_id: $ref: '#/definitions/order_id' sku: description: The purchased article number $ref: '#/definitions/sku' definitions: order_id: domain: checkout name: order_id title: Order ID type: string description: An internal ID that identifies an order in the online shop. example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 pii: true classification: restricted sku: domain: inventory name: sku title: Stock Keeping Unit type: string example: AC1212ME1 description: | A Stock Keeping Unit (SKU) is an internal unique identifier for an article. It is typically associated with an article's barcode, such as the EAN/GTIN. examples: - type: csv # csv, json, yaml, custom model: orders data: |- # expressed as string or inline yaml or via "$ref: data.csv" order_id,order_timestamp,order_total "1001","2023-09-09T08:30:00Z",2500 "1002","2023-09-08T15:45:00Z",1800 "1003","2023-09-07T12:15:00Z",3200 "1004","2023-09-06T19:20:00Z",1500 "1005","2023-09-05T10:10:00Z",4200 "1006","2023-09-04T14:55:00Z",2800 "1007","2023-09-03T21:05:00Z",1900 "1008","2023-09-02T17:40:00Z",3600 "1009","2023-09-01T09:25:00Z",3100 "1010","2023-08-31T22:50:00Z",2700 - type: csv model: line_items data: |- lines_item_id,order_id,sku "1","1001","5901234123457" "2","1001","4001234567890" "3","1002","5901234123457" "4","1002","2001234567893" "5","1003","4001234567890" "6","1003","5001234567892" "7","1004","5901234123457" "8","1005","2001234567893" "9","1005","5001234567892" "10","1005","6001234567891" quality: type: SodaCL # data quality check format: SodaCL, montecarlo, custom specification: # expressed as string or inline yaml or via "$ref: checks.yaml" checks for orders: - freshness(order_timestamp) < 24h - row_count > 500000 - duplicate_count(order_id) = 0 checks for line_items: - row_count > 500000 ``` Schema --- - [Data Contract Object](#data-contract-object) - [Info Object](#info-object) - [Contact 
Object](#contact-object) - [Server Object](#server-object) - [Terms Object](#terms-object) - [Model Object](#model-object) - [Field Object](#field-object) - [Definition Object](#definition-object) - [Schema Object](#schema-object) - [Example Object](#example-object) - [Quality Object](#quality-object) - [Data Types](#data-types) - [Specification Extensions](#specification-extensions) [JSON Schema](https://github.com/datacontract/datacontract-specification/blob/main/datacontract.schema.json) of the Data Contract Specification. ### Data Contract Object This is the root document. It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. | Field | Type | Description | |---------------------------|------------------------------------------------------|----------------------------------------------------------------------------------------------------------| | dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | | id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | | info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | | servers | Map[string, [Server Object](#server-object)] | Specifies the servers of the data contract. | | terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | | models | Map[string, [Model Object](#model-object)] | Specifies the logical data model. | | definitions | Map[string, [Definition Object](#definition-object)] | Specifies definitions. | | schema | [Schema Object](#schema-object) | Specifies the physical schema. The specification supports different schema format. | | examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. 
| | quality | [Quality Object](#quality-object) | Specifies the quality attributes and checks. The specification supports different quality check DSLs. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Info Object Metadata and life cycle information about the data contract. | Field | Type | Description | |---------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| | title | `string` | REQUIRED. The title of the data contract. | | version | `string` | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). | | description | `string` | A description of the data contract. | | owner | `string` | The owner or team responsible for managing the data contract and providing the data. | | contact | [Contact Object](#contact-object) | Contact information for the data contract. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Contact Object Contact information for the data contract. | Field | Type | Description | |-------|----------|-------------------------------------------------------------------------------------------------------| | name | `string` | The identifying name of the contact person/organization. | | url | `string` | The URL pointing to the contact information. This _MUST_ be in the form of a URL. | | email | `string` | The email address of the contact person/organization. This _MUST_ be in the form of an email address. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Server Object The fields are dependent on the defined type. 
| Field | Type | Description | |-------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `redshift`, `snowflake`, `databricks`, `kafka` | | description | `string` | An optional string describing the server. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### BigQuery Server Object | Field | Type | Description | |---------|----------|-------------| | type | `string` | `bigquery` | | project | `string` | | | dataset | `string` | | #### S3 Server Object | Field | Type | Description | |----------|----------|--------------------------------| | type | `string` | `s3` | | location | `string` | S3 URL, starting with `s3://` | Example: ```yaml servers: production: type: s3 location: s3://acme-orders-prod/orders/ ``` #### Redshift Server Object | Field | Type | Description | |----------|----------|-------------| | type | `string` | `redshift` | | account | `string` | | | database | `string` | | | schema | `string` | | #### Snowflake Server Object | Field | Type | Description | |----------|----------|-------------| | type | `string` | `snowflake` | | account | `string` | | | database | `string` | | | schema | `string` | | #### Databricks Server Object | Field | Type | Description | |----------|----------|--------------| | type | `string` | `databricks` | | share | `string` | | #### Kafka Server Object | Field | Type | Description | |-------|----------|-------------| | type | `string` | `kafka` | | host | `string` | | | topic | `string` | | ### Terms Object The terms and conditions of the data contract. 

| Field | Type | Description |
|-------|------|-------------|
| usage | `string` | The usage describes the way the data is expected to be used. Can contain business and technical information. |
| limitations | `string` | The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for. |
| billing | `string` | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. |
| noticePeriod | `string` | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. |

### Model Object

The Model Object describes the structure and semantics of a data model, such as tables, views, or structured files.

The name of the data model (table name) is defined by the key that refers to this Model Object.

| Field | Type | Description |
|-------|------|-------------|
| type | `string` | The type of the model. Examples: `table`, `object`. Default: `table`. |
| description | `string` | An optional string describing the data model. |
| fields | Map[`string`, [Field Object](#field-object)] | The fields (e.g. columns) of the data model. |

### Field Object

The Field Object describes one field (column, property, nested field) of a data model.


| Field | Type | Description |
|-------|------|-------------|
| type | [Data Type](#data-types) | The logical data type of the field. |
| description | `string` | An optional string describing the semantics of the data in this field. |
| pii | `boolean` | Indicates whether this field contains Personally Identifiable Information (PII). |
| classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. Examples may be: `sensitive`, `restricted`, `internal`, `public`. |
| tags | Array of `string` | Custom metadata to provide additional context. |
| $ref | `string` | A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition. |

### Definition Object

The Definition Object provides a clear and concise explanation of the syntax, semantics, and classification of a business object in a given domain. It serves as a reference for a common understanding of terminology, ensures consistent usage, and helps to identify joinable fields. Model fields can refer to definitions using the `$ref` field to link to existing definitions and avoid duplicate documentation.

| Field | Type | Description |
|-------|------|-------------|
| domain | `string` | The domain in which this definition is valid. Default: `global`. |
| name | `string` | The technical name of this definition. |
| title | `string` | The business name of this definition. |
| type | [Data Type](#data-types) | The logical data type |
| description | `string` | Clear and concise explanations related to the domain |
| example | `string` | An example value.
| | pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). | | classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. | | tags | Array of `string` | Custom metadata to provide additional context. | ### Schema Object The schema of the data contract describes the physical schema. The type of the schema depends on the data platform. | Field | Type | Description | | ----- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | REQUIRED. The type of the schema.
Typical values are: `dbt`, `bigquery`, `json-schema`, `sql-ddl`, `avro`, `protobuf`, `custom` | | specification | [dbt Schema Object](#dbt-schema-object) \|
[BigQuery Schema Object](#bigquery-schema-object) \|
[JSON Schema Schema Object](#json-schema-schema-object) \|
[SQL DDL Schema Object](#sql-ddl-schema-object) \|
`string` | REQUIRED. The specification of the schema. The schema specification can be encoded as a string or as inline YAML. |

#### dbt Schema Object

See the dbt [model properties](https://docs.getdbt.com/reference/model-properties) reference.

Example (inline YAML):

```yaml
schema:
  type: dbt
  specification:
    version: 2
    models:
      - name: "My Table"
        description: "My description"
        columns:
          - name: "My column"
            data_type: text
            description: "My description"
```

Example (string):

```yaml
schema:
  type: dbt
  specification: |-
    version: 2
    models:
      - name: "My Table"
        description: "My description"
        columns:
          - name: "My column"
            data_type: text
            description: "My description"
```

#### BigQuery Schema Object

The schema structure is defined by the [Google BigQuery Table](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource:-table) object. You can extract such a Table object via the [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) endpoint.

Instead of providing a single Table object, you can also provide an array of such objects. Be aware that [tables.list](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list) only returns a subset of the full Table object. To obtain the full Table object, including the actual schema, call [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) for each table.

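The array form mentioned above could look like the following minimal sketch. The project, dataset, table, and field names are hypothetical, and the exact way tooling interprets an array of Table objects is an assumption rather than something this specification spells out.

```yaml
schema:
  type: bigquery
  # Assumption: the specification string holds a JSON array with one
  # Table object per table, each in the shape returned by tables.get.
  specification: |-
    [
      {
        "tableReference": {
          "projectId": "my-project",
          "datasetId": "my_dataset",
          "tableId": "orders"
        },
        "type": "TABLE",
        "schema": {
          "fields": [
            { "name": "order_id", "type": "STRING", "mode": "REQUIRED" }
          ]
        }
      },
      {
        "tableReference": {
          "projectId": "my-project",
          "datasetId": "my_dataset",
          "tableId": "line_items"
        },
        "type": "TABLE",
        "schema": {
          "fields": [
            { "name": "lines_item_id", "type": "STRING", "mode": "REQUIRED" },
            { "name": "order_id", "type": "STRING", "mode": "NULLABLE" }
          ]
        }
      }
    ]
```

Keeping each entry identical to the tables.get response avoids hand-editing the JSON when the underlying table changes.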
Learn more: [Google BigQuery REST Reference v2](https://cloud.google.com/bigquery/docs/reference/rest) Example: ```yaml schema: type: bigquery specification: |- { "tableReference": { "projectId": "my-project", "datasetId": "my_dataset", "tableId": "my_table" }, "description": "This is a description", "type": "TABLE", "schema": { "fields": [ { "name": "name", "type": "STRING", "mode": "NULLABLE", "description": "This is a description" } ] } } ``` #### JSON Schema Schema Object JSON Schema can be defined as JSON or rendered as YAML, following the [OpenAPI Schema Object dialect](https://spec.openapis.org/oas/v3.1.0#properties) Example (inline YAML): ```yaml schema: type: json-schema specification: orders: description: One record per order. Includes cancelled and deleted orders. type: object properties: order_id: type: string description: Primary key of the orders table order_timestamp: type: string format: date-time description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. order_total: type: integer description: Total amount of the order in the smallest monetary unit (e.g., cents). line_items: type: object properties: lines_item_id: type: string description: Primary key of the lines_item_id table order_id: type: string description: Foreign key to the orders table sku: type: string description: The purchased article number ``` Example (string): ```yaml schema: type: json-schema specification: |- { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "orders": { "type": "object", "description": "One record per order. Includes cancelled and deleted orders.", "properties": { "order_id": { "type": "string", "description": "Primary key of the orders table" }, "order_timestamp": { "type": "string", "format": "date-time", "description": "The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful." 
}, "order_total": { "type": "integer", "description": "Total amount of the order in the smallest monetary unit (e.g., cents)." } }, "required": ["order_id", "order_timestamp", "order_total"] }, "line_items": { "type": "object", "properties": { "lines_item_id": { "type": "string", "description": "Primary key of the lines_item_id table" }, "order_id": { "type": "string", "description": "Foreign key to the orders table" }, "sku": { "type": "string", "description": "The purchased article number" } }, "required": ["lines_item_id", "order_id", "sku"] } }, "required": ["orders", "line_items"] } ``` #### SQL DDL Schema Object Standard SQL DDL statements can be used to describe the structure. Example (string): ```yaml schema: type: sql-ddl specification: |- -- One record per order. Includes cancelled and deleted orders. CREATE TABLE orders ( order_id TEXT PRIMARY KEY, -- Primary key of the orders table order_timestamp TIMESTAMPTZ NOT NULL, -- The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. order_total INTEGER NOT NULL -- Total amount of the order in the smallest monetary unit (e.g., cents) ); -- The items that are part of an order CREATE TABLE line_items ( lines_item_id TEXT PRIMARY KEY, -- Primary key of the lines_item_id table order_id TEXT REFERENCES orders(order_id), -- Foreign key to the orders table sku TEXT NOT NULL -- The purchased article number ); ``` ### Example Object | Field | Type | Description | |-------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | The type of the example data. Well-known types are: `csv`, `json`, `yaml`, `custom` | | description | `string` | An optional string describing the example. | | model | `string` | The reference to the model in the schema, e.g., a table name.
| | data | `string` | Example data for this model. | Example: ```yaml examples: - type: csv model: orders data: |- order_id,order_timestamp,order_total "1001","2023-09-09T08:30:00Z",2500 "1002","2023-09-08T15:45:00Z",1800 "1003","2023-09-07T12:15:00Z",3200 "1004","2023-09-06T19:20:00Z",1500 "1005","2023-09-05T10:10:00Z",4200 "1006","2023-09-04T14:55:00Z",2800 "1007","2023-09-03T21:05:00Z",1900 "1008","2023-09-02T17:40:00Z",3600 "1009","2023-09-01T09:25:00Z",3100 "1010","2023-08-31T22:50:00Z",2700 ``` ### Quality Object The quality object contains quality attributes and checks. | Field | Type | Description | | ----- |-------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------| | type | `string` | REQUIRED. The type of the quality check.
Typical values are: `SodaCL`, `montecarlo`, `custom` | | specification | [SodaCL Quality Object](#sodacl-quality-object) \|
[Monte Carlo Quality Object](#monte-carlo-quality-object) \|
`string` | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. | #### SodaCL Quality Object Quality attributes written in the [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html). The `specification` represents the content of a `checks.yml` file. Example (inline): ```yaml quality: type: SodaCL # data quality check format: SodaCL, montecarlo, custom specification: # expressed as string or inline yaml or via "$ref: checks.yaml" checks for orders: - row_count > 0 - duplicate_count(order_id) = 0 checks for line_items: - row_count > 0 ``` Example (string): ```yaml quality: type: SodaCL specification: |- checks for search_queries: - freshness(search_timestamp) < 1d - row_count > 100000 - missing_count(search_query) = 0 ``` #### Monte Carlo Quality Object Quality attributes defined as Monte Carlo's [Monitors as Code](https://docs.getmontecarlo.com/docs/monitors-as-code). The `specification` represents the content of a `montecarlo.yml` file.
Example (string): ```yaml quality: type: montecarlo specification: |- montecarlo: field_health: - table: project:dataset.table_name timestamp_field: created dimension_tracking: - table: project:dataset.table_name timestamp_field: created field: order_status ``` ### Data Types The following data types are supported for model fields and definitions: - Unicode character sequence: `string`, `text`, `varchar` - Any numeric type, either integers or floating-point numbers: `number`, `decimal`, `numeric` - 32-bit signed integer: `int`, `integer` - 64-bit signed integer: `long`, `bigint` - Single precision (32-bit) IEEE 754 floating-point number: `float` - Double precision (64-bit) IEEE 754 floating-point number: `double` - Boolean value (true or false): `boolean` - Timestamp with timezone: `timestamp`, `timestamp_tz` - Timestamp with no timezone: `timestamp_ntz` - Date with no time information: `date` - Array: `array` - Sequence of 8-bit unsigned bytes: `bytes` - Complex type: `object`, `record`, `struct` - No value: `null` ### Specification Extensions While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points. Custom fields can be added with any name. The value can be null, a primitive, an array or an object. ### Design Principles The Data Contract Specification follows these design principles: - A free, open, and open-sourced standard - Follow OpenAPI and AsyncAPI conventions so that it feels immediately familiar - Support contract-first approaches - Support code-first approaches - Support tooling by being machine-readable Tooling --- - [Data Contract CLI](https://github.com/datacontract/cli) is a free CLI tool to help you create, develop, and maintain your data contracts. - [Data Contract Studio](https://studio.datacontract.com/) is a free web tool to develop and share data contracts.
- [Data Mesh Manager](https://www.datamesh-manager.com/) is a commercial tool to manage data products and data contracts. It supports the data contract specification and allows the user to import or export data contracts using this specification. Other Data Contract Specifications --- - [AIDA User Group's Open Data Contract Standard](https://github.com/AIDAUserGroup/open-data-contract-standard) - [PayPal's Data Contract Template](https://github.com/paypal/data-contract-template/blob/main/docs/README.md) Literature --- - [Driving Data Quality with Data Contracts](https://www.amazon.com/dp/B0C37FPH3D) by Andrew Jones Authors --- The Data Contract Specification was originally created by [Jochen Christ](https://www.linkedin.com/in/jochenchrist/) and [Dr. Simon Harrer](https://www.linkedin.com/in/simonharrer/), and is currently maintained by them. Contributing --- Contributions are welcome! Please open an issue or a pull request. License --- [MIT License](LICENSE) ================================================ FILE: versions/0.9.1/datacontract.init.yaml ================================================ dataContractSpecification: 0.9.1 id: my-data-contract-id info: title: My Data Contract version: 0.0.1 # description: # owner: # contact: # name: # url: # email: ### servers #servers: # my-stage: # type: bigquery # project: # dataset: #servers: # my-stage: # type: s3 # location: s3:// #servers: # my-stage: # type: redshift # account: # database: # schema: #servers: # my-stage: # type: snowflake # account: # database: # schema: #servers: # my-stage: # type: databricks # share: #servers: # my-stage: # type: kafka # host: # topic: ### terms #terms: # usage: # limitations: # billing: # noticePeriod: ### models # models: # my_model: # description: # type: # fields: # my_field: # type: # description: ### definitions # definitions: # my_field: # domain: # name: # title: # type: # description: # example: # pii: # classification: ### schema #schema: # type: dbt # specification: # 
version: # models: # - name: # description: # columns: # - name: # type: # description: # tests: #schema: # type: dbt # specification: |- # version: # models: # - name: # description: # columns: # - name: # type: # description: # tests: #schema: # type: dbt # specification: "$ref: model.yaml" #schema: # type: bigquery # specification: |- # { # "tableReference": { # "projectId": "my-project", # "datasetId": "my_dataset", # "tableId": "my_table" # }, # "description": "This is a description", # "type": "TABLE", # "schema": { # "fields": [ # { # "name": "name", # "type": "STRING", # "mode": "NULLABLE", # "description": "This is a description" # } # ] # } # } #schema: # type: json-schema # specification: # my-table: # description: # type: object # properties: # id: # type: string # description: #schema: # type: json-schema # specification: |- # { # "$schema": "http://json-schema.org/draft-07/schema#", # "type": "object", # "properties": { # "my_table": { # "type": "object", # "description": "", # "properties": { # "id": { # "type": "string", # "description": "" # }, # "required": ["id"] # } # }, # "required": ["my-table"] # } #schema: # type: sql-ddl # specification: |- # CREATE TABLE my_table ( # id TEXT PRIMARY KEY # ); #schema: # type: avro # specification: # User: # type: record # name: MyTable # fields: # - name: id # type: string #schema: # type: avro # specification: |- # { # "type": "record", # "name": "MyTable", # "fields": [ # { # "name": "name", # "type": "string" # } # ] # } #schema: # type: protobuf # specification: |- # message MyTable { # string id = 1; # } #schema: # type: custom # specification: ### examples #examples: # - type: csv # model: my_table # data: |- # id,timestamp,amount # "1001","2023-09-09T08:30:00Z",2500 # "1002","2023-09-08T15:45:00Z",1800 # #examples: # - type: csv # model: my_table # data: "$ref: data.csv" #examples: # - type: json # model: my_table # data: |- # [ # { # "id": "1001", # "timestamp": "2023-09-09T08:30:00Z", # "amount": 
2500 # }, # { # "id": "1002", # "timestamp": "2023-09-08T15:45:00Z", # "amount": 1800 # } # ] #examples: # - type: yaml # model: my_table # data: # - id: 1001 # timestamp: 2023-09-09T08:30:00Z # amount: 2500 # - id: 1002 # timestamp: 2023-09-08T15:45:00Z # amount: 1800 #examples: # - type: custom # model: my_table # data: |- ### quality #quality: # type: SodaCL # specification: # checks for my_table: # - duplicate_count(order_id) = 0 #quality: # type: SodaCL # specification: # checks for my_table: |- # - duplicate_count(id) = 0 #quality: # type: SodaCL # specification: # checks for my_table: "$ref: checks.yaml" #quality: # type: montecarlo # specification: |- # montecarlo: # field_health: # - table: my_project:my_dataset.my_table # fields: # - id # - timestamp # - amount # timestamp_field: timestamp #quality: # type: custom # specification: |- ================================================ FILE: versions/0.9.1/datacontract.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "dataContractSpecification": { "type": "string", "enum": [ "0.9.1", "0.9.0" ], "description": "Specifies the Data Contract Specification being used." }, "id": { "type": "string", "description": "Specifies the identifier of the data contract." }, "info": { "type": "object", "properties": { "title": { "type": "string", "description": "The title of the data contract." }, "version": { "type": "string", "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." }, "description": { "type": "string", "description": "A description of the data contract." }, "owner": { "type": "string", "description": "The owner or team responsible for managing the data contract and providing the data." 
}, "contact": { "type": "object", "properties": { "name": { "type": "string", "description": "The identifying name of the contact person/organization." }, "url": { "type": "string", "format": "uri", "description": "The URL pointing to the contact information. This MUST be in the form of a URL." }, "email": { "type": "string", "format": "email", "description": "The email address of the contact person/organization. This MUST be in the form of an email address." } }, "description": "Contact information for the data contract." } }, "required": [ "title", "version" ], "description": "Metadata and life cycle information about the data contract." }, "servers": { "type": "object", "additionalProperties": { "anyOf": [ { "type": "object", "properties": { "type": { "type": "string", "enum": [ "bigquery", "BigQuery" ], "description": "The type of the data product technology that implements the data contract." }, "project": { "type": "string", "description": "An optional string describing the server." }, "dataset": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "type", "project", "dataset" ] }, { "type": "object", "properties": { "type": { "type": "string", "enum": [ "s3" ], "description": "The type of the data product technology that implements the data contract." }, "location": { "type": "string", "format": "uri", "description": "An optional string describing the server. Must be in the form of a URL." } }, "required": [ "type", "location" ] }, { "type": "object", "properties": { "type": { "type": "string", "enum": [ "redshift" ], "description": "The type of the data product technology that implements the data contract." }, "account": { "type": "string", "description": "An optional string describing the server." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." 
} }, "required": [ "type", "account", "database", "schema" ] }, { "type": "object", "properties": { "type": { "type": "string", "enum": [ "snowflake" ], "description": "The type of the data product technology that implements the data contract." }, "account": { "type": "string", "description": "An optional string describing the server." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "type", "account", "database", "schema" ] }, { "type": "object", "properties": { "type": { "type": "string", "enum": [ "databricks" ], "description": "The type of the data product technology that implements the data contract." }, "share": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "type", "share" ] }, { "type": "object", "properties": { "type": { "type": "string", "enum": [ "kafka" ], "description": "The type of the data product technology that implements the data contract." }, "host": { "type": "string", "description": "An optional string describing the server." }, "topic": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "type", "host", "topic" ] } ] }, "description": "Information about the servers." }, "terms": { "type": "object", "description": "The terms and conditions of the data contract.", "properties": { "usage": { "type": "string", "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." }, "limitations": { "type": "string", "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." 
}, "billing": { "type": "string", "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." }, "noticePeriod": { "type": "string", "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." } } }, "models": { "description": "Specifies the logical data model. Use the models name (e.g., the table name) as the key.", "type": "object", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "object", "properties": { "description": { "type": "string" }, "type": { "description": "The type of the model. Examples: table, object. Default: table.", "type": "string", "default": "table" }, "fields": { "description": "Specifies a field in the data model. Use the field name (e.g., the column name) as the key.", "type": "object", "additionalProperties": { "type": "object", "properties": { "$ref": { "type": "string", "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." }, "type": { "type": "string", "description": "The logical data type of the field.", "enum": [ "number", "decimal", "numeric", "int", "integer", "long", "bigint", "float", "double", "string", "text", "varchar", "boolean", "timestamp", "timestamp_tz", "timestamp_ntz", "date", "array", "object", "record", "struct", "bytes", "null" ] }, "description": { "type": "string", "description": "An optional string describing the semantic of the data in this field." }, "pii": { "type": "boolean", "description": "An indication, if this field contains Personal Identifiable Information (PII)." 
}, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field, according to the organization's classification scheme.", "examples": ["sensitive", "restricted", "internal", "public"] }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." } } } } } } }, "schema": { "type": "object", "properties": { "type": { "type": "string", "enum": [ "dbt", "bigquery", "json-schema", "sql-ddl", "avro", "protobuf", "custom" ], "description": "The type of the schema. Typical values are: dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom." }, "specification": { "anyOf": [ { "type": "string", "description": "The specification of the schema as a string." }, { "type": "object", "description": "The specification of the schema as an object." } ] } }, "required": [ "type", "specification" ], "description": "The schema of the data contract describes the syntax and semantics of provided data sets. It supports different schema types." }, "examples": { "type": "array", "items": { "type": "object", "properties": { "type": { "type": "string", "enum": [ "csv", "json", "yaml", "custom" ], "description": "The type of the example data. Well-known types are: csv, json, yaml, custom." }, "description": { "type": "string", "description": "An optional string describing the example." }, "model": { "type": "string", "description": "The reference to the model in the schema, e.g., a table name." }, "data": { "type": "string", "description": "Example data for this model." } }, "required": [ "type", "data" ] }, "description": "The Examples Object is an array of Example Objects." }, "quality": { "type": "object", "properties": { "type": { "type": "string", "enum": [ "SodaCL", "montecarlo", "custom" ], "description": "The type of the quality check. Typical values are: SodaCL, montecarlo, custom." 
}, "specification": { "anyOf": [ { "type": "string", "description": "The specification of the quality attributes as a string." }, { "type": "object", "description": "The specification of the quality attributes as an object." } ] } }, "required": [ "type", "specification" ], "description": "The quality object contains quality attributes and checks." } }, "required": [ "dataContractSpecification", "id", "info" ] } ================================================ FILE: versions/0.9.2/README.md ================================================ # Data Contract Specification Stars Slack Status ![datacontract.png](images/datacontract.png) Data contracts bring data providers and data consumers together. A _data contract_ is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. A data contract is implemented by a data product's output port or other data technologies. Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees. The _data contract specification_ defines a YAML format to describe attributes of provided data sets. It is data platform neutral and can be used with any data platform, such as AWS S3, Google BigQuery, Microsoft Fabric, Databricks, and Snowflake. The data contract specification is an open initiative to define a common data contract format. It follows [OpenAPI](https://www.openapis.org/) and [AsyncAPI](https://www.asyncapi.com/) conventions. Data contracts come into play when data is exchanged between different teams or organizational units, such as in a [data mesh architecture](https://www.datamesh-architecture.com/). First, and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted. They make semantic and quality expectations explicit. 
They are often created collaboratively in [workshops](/workshop) together with data providers and data consumers. Later in development and production, they also serve as the basis for code generation, testing, schema validations, quality checks, monitoring, access control, and computational governance policies. The specification comes along with the [Data Contract CLI](https://github.com/datacontract/cli), an open-source tool to develop, validate, and enforce data contracts. _Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties. The term "contract" may be somewhat misleading, but it is how it is used in practice. The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract. Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes._ Version --- 0.9.2 ([Changelog](CHANGELOG.md)) Example --- [![Open in Data Contract Studio](https://img.shields.io/badge/open%20in-Data%20Contract%20Studio-blue)](https://studio.datacontract.com/) ```yaml dataContractSpecification: 0.9.2 id: urn:datacontract:checkout:orders-latest info: title: Orders Latest version: 1.0.0 description: | Successful customer orders in the webshop. All orders since 2020-01-01. Orders with their line items are in their current state (no history included). owner: Checkout Team contact: name: John Doe (Data Product Owner) url: https://teams.microsoft.com/l/channel/example/checkout servers: production: type: s3 location: s3://datacontract-example-orders-latest/data/{model}/*.json format: json delimiter: new_line terms: usage: > Data can be used for reports, analytics and machine learning use cases. Order may be linked and joined by other tables limitations: > Not suitable for real-time use cases. 
Data may not be used to identify individual customers. Max data processing per day: 10 TiB billing: 5000 USD per month noticePeriod: P3M models: orders: description: One record per order. Includes cancelled and deleted orders. type: table fields: order_id: $ref: '#/definitions/order_id' required: true unique: true order_timestamp: description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. type: timestamp required: true order_total: description: Total amount the smallest monetary unit (e.g., cents). type: long required: true customer_id: description: Unique identifier for the customer. type: text minLength: 10 maxLength: 20 customer_email_address: description: The email address, as entered by the customer. The email address was not verified. type: text format: email required: true line_items: description: A single article that is part of an order. type: table fields: lines_item_id: type: text description: Primary key of the lines_item_id table required: true unique: true order_id: $ref: '#/definitions/order_id' sku: description: The purchased article number $ref: '#/definitions/sku' definitions: order_id: domain: checkout name: order_id title: Order ID type: text format: uuid description: An internal ID that identifies an order in the online shop. example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2 pii: true classification: restricted sku: domain: inventory name: sku title: Stock Keeping Unit type: text pattern: ^[A-Za-z0-9]{8,14}$ example: "96385074" description: | A Stock Keeping Unit (SKU) is an internal unique identifier for an article. It is typically associated with an article's barcode, such as the EAN/GTIN. 
examples: - type: csv # csv, json, yaml, custom model: orders data: |- # expressed as string or inline yaml or via "$ref: data.csv" order_id,order_timestamp,order_total,customer_id,customer_email_address "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com" "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com" "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com" "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com" "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com" "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com" "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com" "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com" "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com" "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com" - type: csv model: line_items data: |- lines_item_id,order_id,sku "LI-1","1001","5901234123457" "LI-2","1001","4001234567890" "LI-3","1002","5901234123457" "LI-4","1002","2001234567893" "LI-5","1003","4001234567890" "LI-6","1003","5001234567892" "LI-7","1004","5901234123457" "LI-8","1005","2001234567893" "LI-9","1005","5001234567892" "LI-10","1005","6001234567891" quality: type: SodaCL # data quality check format: SodaCL, montecarlo, custom specification: # expressed as string or inline yaml or via "$ref: checks.yaml" checks for orders: - freshness(order_timestamp) < 24h - row_count >= 5000 - duplicate_count(order_id) = 0 checks for line_items: - values in (order_id) must exist in orders (order_id) - row_count >= 5000 ``` Data Contract CLI --- The [Data Contract CLI](https://cli.datacontract.com) is a command line tool and Python library to lint, test, import and export data contracts. 
Here is a short example of how to verify that your actual data set matches the data contract: ```bash pip3 install datacontract-cli datacontract test https://datacontract.com/examples/orders-latest/datacontract.yaml ``` or, if you prefer Docker: ```bash docker run datacontract/cli test https://datacontract.com/examples/orders-latest/datacontract.yaml ``` The Data Contract contains all required information to verify data: - The _servers_ block has the connection details to the actual data set. - The _models_ block defines the syntax, formats, and constraints. - The _quality_ block defines further quality checks. Based on the server type, the Data Contract CLI chooses the appropriate engine, formulates test cases, connects to the server, and executes the tests. More information and configuration options are available on [cli.datacontract.com](https://cli.datacontract.com). Specification --- ![The eight major categories in the data contract specification](images/categories.png) - [Data Contract Object](#data-contract-object) - [Info Object](#info-object) - [Contact Object](#contact-object) - [Server Object](#server-object) - [Terms Object](#terms-object) - [Model Object](#model-object) - [Field Object](#field-object) - [Definition Object](#definition-object) - [Schema Object](#schema-object) - [Example Object](#example-object) - [Quality Object](#quality-object) - [Data Types](#data-types) - [Specification Extensions](#specification-extensions) [JSON Schema](https://github.com/datacontract/datacontract-specification/blob/main/datacontract.schema.json) of the Data Contract Specification. ### Data Contract Object This is the root document. It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. | Field | Type | Description | |---------------------------|------------------------------------------------------|----------------------------------------------------------------------------------------------------------| | dataContractSpecification | `string` | REQUIRED.
Specifies the Data Contract Specification being used. | | id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number. | | info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | | servers | Map[string, [Server Object](#server-object)] | Specifies the servers of the data contract. | | terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | | models | Map[string, [Model Object](#model-object)] | Specifies the logical data model. | | definitions | Map[string, [Definition Object](#definition-object)] | Specifies definitions. | | schema | [Schema Object](#schema-object) | Specifies the physical schema. The specification supports different schema formats. | | examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. | | quality | [Quality Object](#quality-object) | Specifies the quality attributes and checks. The specification supports different quality check DSLs. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Info Object Metadata and life cycle information about the data contract. | Field | Type | Description | |---------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| | title | `string` | REQUIRED. The title of the data contract. | | version | `string` | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). | | description | `string` | A description of the data contract. | | owner | `string` | The owner or team responsible for managing the data contract and providing the data.
| | contact | [Contact Object](#contact-object) | Contact information for the data contract. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Contact Object Contact information for the data contract. | Field | Type | Description | |-------|----------|-------------------------------------------------------------------------------------------------------| | name | `string` | The identifying name of the contact person/organization. | | url | `string` | The URL pointing to the contact information. This _MUST_ be in the form of a URL. | | email | `string` | The email address of the contact person/organization. This _MUST_ be in the form of an email address. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Server Object The fields are dependent on the defined type. | Field | Type | Description | |-------------|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `redshift`, `snowflake`, `databricks`, `postgres`, `kafka`, `pubsub`, `local` | | description | `string` | An optional string describing the server. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### BigQuery Server Object | Field | Type | Description | |---------|----------|-----------------------| | type | `string` | `bigquery` | | project | `string` | The GCP project name. 
| | dataset | `string` | | #### S3 Server Object | Field | Type | Description | |----------|----------|--------------------------------| | type | `string` | `s3` | | location | `string` | S3 URL, starting with `s3://` | | endpointUrl | `string` | The server endpoint for S3-compatible servers, such as `https://minio.example.com` | | format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` | | delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` | Example: ```yaml servers: production: type: s3 location: s3://acme-orders-prod/orders/ ``` #### Redshift Server Object | Field | Type | Description | |----------|----------|-------------| | type | `string` | `redshift` | | account | `string` | | | database | `string` | | | schema | `string` | | #### Snowflake Server Object | Field | Type | Description | |----------|----------|-------------| | type | `string` | `snowflake` | | account | `string` | | | database | `string` | | | schema | `string` | | #### Databricks Server Object | Field | Type | Description | |---------|----------|---------------------------------------------------------------------| | type | `string` | `databricks` | | host | `string` | The Databricks host, e.g., `dbc-abcdefgh-1234.cloud.databricks.com` | | catalog | `string` | The name of the Hive or Unity catalog | | schema | `string` | The schema name in the catalog | #### Postgres Server Object | Field | Type | Description | |----------|-----------|---------------------------------------------------------| | type | `string` | `postgres` | | host | `string` | The host to the database server | | port | `integer` | The port to the database server | | database | `string` | The name of the database, e.g., `postgres`. | | schema | `string` | The name of the schema in the database, e.g., `public`. 
| #### Kafka Server Object | Field | Type | Description | |--------|----------|---------------------------------------------------------------------------| | type | `string` | `kafka` | | host | `string` | The bootstrap server of the kafka cluster. | | topic | `string` | The topic name. | | format | `string` | The format of the message. Examples: json, avro, protobuf. Default: json. | #### Pub/Sub Server Object | Field | Type | Description | |---------|----------|-----------------------| | type | `string` | `pubsub` | | project | `string` | The GCP project name. | | topic | `string` | The topic name. | #### Local Server Object | Field | Type | Description | |--------|----------|-------------------------------------------------------------------------------------| | type | `string` | `local` | | path | `string` | The relative or absolute path to the data file(s), such as `./folder/data.parquet`. | | format | `string` | The format of the file(s), such as `parquet`, `csv`, or `json`. | ### Terms Object The terms and conditions of the data contract. | Field | Type | Description | |----------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | usage | `string` | The usage describes the way the data is expected to be used. Can contain business and technical information. | | limitations | `string` | The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for. | | billing | `string` | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. | | noticePeriod | `string` | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. 
| ### Model Object The Model Object describes the structure and semantics of a data model, such as tables, views, or structured files. The name of the data model (table name) is defined by the key that refers to this Model Object. | Field | Type | Description | |-------------|----------------------------------------------|-------------------------------------------------------------------------------| | type | `string` | The type of the model. Examples: `table`, `view`, `object`. Default: `table`. | | description | `string` | An optional string describing the data model. | | fields | Map[`string`, [Field Object](#field-object)] | The fields (e.g. columns) of the data model. | ### Field Object The Field Objects describes one field (column, property, nested field) of a data model. | Field | Type | Description | |------------------|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | description | `string` | An optional string describing the semantic of the data in this field. | | type | [Data Type](#data-types) | The logical data type of the field. | | required | `boolean` | An indication, if this field must contain a value and may not be null. Default: `false` | | primary | `boolean` | If this field is a primary key. Default: `false` | | references | `string` | The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship. | | unique | `boolean` | An indication, if the value must be unique within the model. 
Default: `false` | | format | `string` | `email`: A value must be compliant with [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).
`uri`: A value must be compliant with [RFC 3986](https://www.rfc-editor.org/info/rfc3986).
`uuid`: A value must be compliant with [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | | minLength | `number` | The length of a value must be greater than, or equal to, this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | | maxLength | `number` | The length of a value must be less than, or equal to, this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | | pattern | `string` | A value must match this regular expression, written in the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). | | minimum | `number` | A numeric value must be greater than, or equal to, this value. Only evaluated if the value is not null. Only applies to numeric values. | | exclusiveMinimum | `number` | A numeric value must be greater than this value. Only evaluated if the value is not null. Only applies to numeric values. | | maximum | `number` | A numeric value must be less than, or equal to, this value. Only evaluated if the value is not null. Only applies to numeric values. | | exclusiveMaximum | `number` | A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values. | | example | `string` | An example value. | | pii | `boolean` | Indicates whether this field contains Personally Identifiable Information (PII). | | classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. Examples may be: `sensitive`, `restricted`, `internal`, `public`. | | tags | Array of `string` | Custom metadata to provide additional context.
| | $ref | `string` | A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition. | ### Definition Object The Definition Object includes a clear and concise explanations of syntax, semantic, and classification of a business object in a given domain. It serves as a reference for a common understanding of terminology, ensure consistent usage and to identify join-able fields. Models fields can refer to definitions using the `$ref` field to link to existing definitions and avoid duplicate documentations. | Field | Type | Description | |------------------|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | domain | `string` | The domain in which this definition is valid. Default: `global`. | | name | `string` | The technical name of this definition. | | title | `string` | The business name of this definition. | | description | `string` | Clear and concise explanations related to the domain | | type | [Data Type](#data-types) | The logical data type | | enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. | | format | `string` | `email`: A value must be complaint to [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).
`uri`: A value must be complaint to [RFC 3986](https://www.rfc-editor.org/info/rfc3986).
`uuid`: A value must be complaint to [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | minLength | `number` | A value must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | maxLength | `number` | A value must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | pattern | `string` | A value must be valid according to the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | minimum | `number` | A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | exclusiveMinimum | `number` | A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | maximum | `number` | A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | exclusiveMaximum | `number` | A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | example | `string` | An example value. | | pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). | | classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. | | tags | Array of `string` | Custom metadata to provide additional context. 
| ### Schema Object The schema of the data contract describes the physical schema. The type of the schema depends on the data platform. | Field | Type | Description | | ----- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | REQUIRED. The type of the schema.
Typical values are: `dbt`, `bigquery`, `json-schema`, `sql-ddl`, `avro`, `protobuf`, `custom` | | specification | [dbt Schema Object](#dbt-schema-object) \|
[BigQuery Schema Object](#bigquery-schema-object) \|
[JSON Schema Schema Object](#json-schema-schema-object) \|
[SQL DDL Schema Object](#sql-ddl-schema-object) \|
`string` | REQUIRED. The specification of the schema. The schema specification can be encoded as a string or as inline YAML. | #### dbt Schema Object https://docs.getdbt.com/reference/model-properties Example (inline YAML): ```yaml schema: type: dbt specification: version: 2 models: - name: "My Table" description: "My description" columns: - name: "My column" data_type: text description: "My description" ``` Example (string): ```yaml schema: type: dbt specification: |- version: 2 models: - name: "My Table" description: "My description" columns: - name: "My column" data_type: text description: "My description" ``` #### BigQuery Schema Object The schema structure is defined by the [Google BigQuery Table](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource:-table) object. You can extract such a Table object via the [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) endpoint. Instead of providing a single Table object, you can also provide an array of such objects. Be aware that [tables.list](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list) only returns a subset of the full Table object. You need to call every Table object via [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) to get the full Table object, including the actual schema. 
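As a rough sketch of the array form mentioned above, the `specification` could hold several Table objects in one JSON array. The project, dataset, table, and field names here are placeholders, and how a given tool interprets the array is an assumption rather than something this specification defines.

```yaml
schema:
  type: bigquery
  specification: |-
    [
      {
        "tableReference": {
          "projectId": "my-project",
          "datasetId": "my_dataset",
          "tableId": "orders"
        },
        "type": "TABLE",
        "schema": {
          "fields": [
            { "name": "order_id", "type": "STRING", "mode": "REQUIRED" }
          ]
        }
      },
      {
        "tableReference": {
          "projectId": "my-project",
          "datasetId": "my_dataset",
          "tableId": "line_items"
        },
        "type": "TABLE",
        "schema": {
          "fields": [
            { "name": "sku", "type": "STRING", "mode": "NULLABLE" }
          ]
        }
      }
    ]
```

Each array element is expected to be a full Table object as returned by `tables.get`, not the abbreviated entry returned by `tables.list`.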
Learn more: [Google BigQuery REST Reference v2](https://cloud.google.com/bigquery/docs/reference/rest) Example: ```yaml schema: type: bigquery specification: |- { "tableReference": { "projectId": "my-project", "datasetId": "my_dataset", "tableId": "my_table" }, "description": "This is a description", "type": "TABLE", "schema": { "fields": [ { "name": "name", "type": "STRING", "mode": "NULLABLE", "description": "This is a description" } ] } } ``` #### JSON Schema Schema Object JSON Schema can be defined as JSON or rendered as YAML, following the [OpenAPI Schema Object dialect](https://spec.openapis.org/oas/v3.1.0#properties) Example (inline YAML): ```yaml schema: type: json-schema specification: orders: description: One record per order. Includes cancelled and deleted orders. type: object properties: order_id: type: string description: Primary key of the orders table order_timestamp: type: string format: date-time description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. order_total: type: integer description: Total amount of the order in the smallest monetary unit (e.g., cents). line_items: type: object properties: lines_item_id: type: string description: Primary key of the lines_item_id table order_id: type: string description: Foreign key to the orders table sku: type: string description: The purchased article number ``` Example (string): ```yaml schema: type: json-schema specification: |- { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "orders": { "type": "object", "description": "One record per order. Includes cancelled and deleted orders.", "properties": { "order_id": { "type": "string", "description": "Primary key of the orders table" }, "order_timestamp": { "type": "string", "format": "date-time", "description": "The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful." 
}, "order_total": { "type": "integer", "description": "Total amount of the order in the smallest monetary unit (e.g., cents)." } }, "required": ["order_id", "order_timestamp", "order_total"] }, "line_items": { "type": "object", "properties": { "lines_item_id": { "type": "string", "description": "Primary key of the lines_item_id table" }, "order_id": { "type": "string", "description": "Foreign key to the orders table" }, "sku": { "type": "string", "description": "The purchased article number" } }, "required": ["lines_item_id", "order_id", "sku"] } }, "required": ["orders", "line_items"] } ``` #### SQL DDL Schema Object Standard SQL DDL statements can be used to describe the structure. Example (string): ```yaml schema: type: sql-ddl specification: |- -- One record per order. Includes cancelled and deleted orders. CREATE TABLE orders ( order_id TEXT PRIMARY KEY, -- Primary key of the orders table order_timestamp TIMESTAMPTZ NOT NULL, -- The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. order_total INTEGER NOT NULL -- Total amount of the order in the smallest monetary unit (e.g., cents) ); -- The items that are part of an order CREATE TABLE line_items ( lines_item_id TEXT PRIMARY KEY, -- Primary key of the lines_item_id table order_id TEXT REFERENCES orders(order_id), -- Foreign key to the orders table sku TEXT NOT NULL -- The purchased article number ); ``` ### Example Object | Field | Type | Description | |-------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | The type of the example data. Well-known example types are: `csv`, `json`, `yaml`, `custom` | | description | `string` | An optional string describing the example. | | model | `string` | The reference to the model in the schema, e.g. a table name.
| | data | `string` | Example data for this model. | Example: ```yaml examples: - type: csv model: orders data: |- order_id,order_timestamp,order_total "1001","2023-09-09T08:30:00Z",2500 "1002","2023-09-08T15:45:00Z",1800 "1003","2023-09-07T12:15:00Z",3200 "1004","2023-09-06T19:20:00Z",1500 "1005","2023-09-05T10:10:00Z",4200 "1006","2023-09-04T14:55:00Z",2800 "1007","2023-09-03T21:05:00Z",1900 "1008","2023-09-02T17:40:00Z",3600 "1009","2023-09-01T09:25:00Z",3100 "1010","2023-08-31T22:50:00Z",2700 ``` ### Quality Object The quality object contains quality attributes and checks. | Field | Type | Description | | ----- |-------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------| | type | `string` | REQUIRED. The type of the quality check format.
Typical values are: `SodaCL`, `montecarlo`, `custom` | | specification | [SodaCL Quality Object](#sodacl-quality-object) \|
[Monte Carlo Quality Object](#monte-carlo-quality-object) \|
`string` | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. | #### SodaCL Quality Object Quality attributes in [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html). The `specification` represents the content of a `checks.yml` file. Example (inline): ```yaml quality: type: SodaCL # data quality check format: SodaCL, montecarlo, dbt-tests, custom specification: # expressed as string or inline yaml or via "$ref: checks.yaml" checks for orders: - row_count > 0 - duplicate_count(order_id) = 0 checks for line_items: - row_count > 0 ``` Example (string): ```yaml quality: type: SodaCL specification: |- checks for search_queries: - freshness(search_timestamp) < 1d - row_count > 100000 - missing_count(search_query) = 0 ``` #### Monte Carlo Quality Object Quality attributes defined as Monte Carlo's [Monitors as Code](https://docs.getmontecarlo.com/docs/monitors-as-code). The `specification` represents the content of a `montecarlo.yml` file.
Example (string): ```yaml quality: type: montecarlo specification: |- montecarlo: field_health: - table: project:dataset.table_name timestamp_field: created dimension_tracking: - table: project:dataset.table_name timestamp_field: created field: order_status ``` ### Data Types The following data types are supported for model fields and definitions: - Unicode character sequence: `string`, `text`, `varchar` - Any numeric type, either integers or floating-point numbers: `number`, `decimal`, `numeric` - 32-bit signed integer: `int`, `integer` - 64-bit signed integer: `long`, `bigint` - Single precision (32-bit) IEEE 754 floating-point number: `float` - Double precision (64-bit) IEEE 754 floating-point number: `double` - Boolean value: `boolean` - Timestamp with timezone: `timestamp`, `timestamp_tz` - Timestamp with no timezone: `timestamp_ntz` - Date with no time information: `date` - Array: `array` - Sequence of 8-bit unsigned bytes: `bytes` - Complex type: `object`, `record`, `struct` - No value: `null` ### Specification Extensions While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points. Custom fields can be added with any name. The value can be null, a primitive, an array or an object. ### Design Principles The Data Contract Specification follows these design principles: - A free, open, and open-sourced standard - Follow OpenAPI and AsyncAPI conventions so that it feels immediately familiar - Support contract-first approaches - Support code-first approaches - Support tooling by being machine-readable Tooling --- - [Data Contract CLI](https://github.com/datacontract/cli) is a free CLI tool to help you create, develop, and maintain your data contracts. - [Data Contract Studio](https://studio.datacontract.com/) is a free web tool to develop and share data contracts.
- [Data Mesh Manager](https://www.datamesh-manager.com/) is a commercial tool to manage data products and data contracts. It supports the data contract specification and allows the user to import or export data contracts using this specification. Other Data Contract Specifications --- - [AIDA User Group's Open Data Contract Standard](https://github.com/AIDAUserGroup/open-data-contract-standard) - [PayPal's Data Contract Template](https://github.com/paypal/data-contract-template/blob/main/docs/README.md) Literature --- - [Driving Data Quality with Data Contracts](https://www.amazon.com/dp/B0C37FPH3D) by Andrew Jones Authors --- The Data Contract Specification was originally created by [Jochen Christ](https://www.linkedin.com/in/jochenchrist/) and [Dr. Simon Harrer](https://www.linkedin.com/in/simonharrer/), and is currently maintained by them. Contributing --- Contributions are welcome! Please open an issue or a pull request. License --- [MIT License](LICENSE) ================================================ FILE: versions/0.9.2/datacontract.init.yaml ================================================ dataContractSpecification: 0.9.2 id: my-data-contract-id info: title: My Data Contract version: 0.0.1 # description: # owner: # contact: # name: # url: # email: ### servers #servers: # production: # type: s3 # location: s3:// # format: parquet # delimiter: new_line ### terms #terms: # usage: # limitations: # billing: # noticePeriod: ### models # models: # my_model: # description: # type: # fields: # my_field: # type: # description: ### definitions # definitions: # my_field: # domain: # name: # title: # type: # description: # example: # pii: # classification: ### examples #examples: # - type: csv # model: my_model # data: |- # id,timestamp,amount # "1001","2023-09-09T08:30:00Z",2500 # "1002","2023-09-08T15:45:00Z",1800 ### quality #quality: # type: SodaCL # specification: # checks for my_model: |- # - duplicate_count(id) = 0 
================================================ FILE: versions/0.9.2/datacontract.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "title": "DataContractSpecification", "properties": { "dataContractSpecification": { "type": "string", "title": "DataContractSpecificationVersion", "enum": [ "0.9.2", "0.9.1", "0.9.0" ], "description": "Specifies the Data Contract Specification being used." }, "id": { "type": "string", "description": "Specifies the identifier of the data contract." }, "info": { "type": "object", "properties": { "title": { "type": "string", "description": "The title of the data contract." }, "version": { "type": "string", "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." }, "description": { "type": "string", "description": "A description of the data contract." }, "owner": { "type": "string", "description": "The owner or team responsible for managing the data contract and providing the data." }, "contact": { "type": "object", "properties": { "name": { "type": "string", "description": "The identifying name of the contact person/organization." }, "url": { "type": "string", "format": "uri", "description": "The URL pointing to the contact information. This MUST be in the form of a URL." }, "email": { "type": "string", "format": "email", "description": "The email address of the contact person/organization. This MUST be in the form of an email address." } }, "description": "Contact information for the data contract." } }, "required": [ "title", "version" ], "description": "Metadata and life cycle information about the data contract." 
}, "servers": { "type": "object", "additionalProperties": { "oneOf": [ { "type": "object", "title": "BigQueryServer", "properties": { "type": { "type": "string", "enum": [ "bigquery", "BigQuery" ], "description": "The type of the data product technology that implements the data contract." }, "project": { "type": "string", "description": "An optional string describing the server." }, "dataset": { "type": "string", "description": "An optional string describing the server." } }, "additionalProperties": true, "required": [ "type", "project", "dataset" ] }, { "type": "object", "title": "S3Server", "properties": { "type": { "type": "string", "enum": [ "s3" ], "description": "The type of the data product technology that implements the data contract." }, "location": { "type": "string", "format": "uri", "description": "An optional string describing the server. Must be in the form of a URL." } }, "additionalProperties": true, "required": [ "type", "location" ] }, { "type": "object", "title": "RedshiftServer", "properties": { "type": { "type": "string", "enum": [ "redshift" ], "description": "The type of the data product technology that implements the data contract." }, "account": { "type": "string", "description": "An optional string describing the server." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." } }, "additionalProperties": true, "required": [ "type", "account", "database", "schema" ] }, { "type": "object", "title": "SnowflakeServer", "properties": { "type": { "type": "string", "enum": [ "snowflake" ], "description": "The type of the data product technology that implements the data contract." }, "account": { "type": "string", "description": "An optional string describing the server." }, "database": { "type": "string", "description": "An optional string describing the server." 
}, "schema": { "type": "string", "description": "An optional string describing the server." } }, "additionalProperties": true, "required": [ "type", "account", "database", "schema" ] }, { "type": "object", "title": "DatabricksServer", "properties": { "type": { "type": "string", "const": "databricks", "description": "The type of the data product technology that implements the data contract." }, "host": { "type": "string", "description": "The Databricks host", "examples": ["dbc-abcdefgh-1234.cloud.databricks.com"] }, "catalog": { "type": "string", "description": "The name of the Hive or Unity catalog" }, "schema": { "type": "string", "description": "The schema name in the catalog" } }, "additionalProperties": true, "required": [ "type", "host", "catalog", "schema" ] }, { "type": "object", "title": "PostgresServer", "properties": { "type": { "type": "string", "const": "postgres", "description": "The type of the data product technology that implements the data contract." }, "host": { "type": "string", "description": "The host to the database server", "examples": ["localhost"] }, "port": { "type": "integer", "description": "The port to the database server." }, "database": { "type": "string", "description": "The name of the database.", "examples": ["postgres"] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": ["public"] } }, "additionalProperties": true, "required": [ "type", "host", "port", "database", "schema" ] }, { "type": "object", "title": "KafkaServer", "description": "Kafka Server", "properties": { "type": { "type": "string", "enum": [ "kafka" ], "description": "The type of the data product technology that implements the data contract." }, "host": { "type": "string", "description": "The bootstrap server of the kafka cluster." }, "topic": { "type": "string", "description": "The topic name." }, "format": { "type": "string", "description": "The format of the message. Examples: json, avro, protobuf. 
Default: json.", "default": "json" } }, "additionalProperties": true, "required": [ "type", "host", "topic" ] }, { "type": "object", "title": "PubSubServer", "properties": { "type": { "type": "string", "enum": [ "pubsub" ], "description": "The type of the data product technology that implements the data contract." }, "project": { "type": "string", "description": "The GCP project name." }, "topic": { "type": "string", "description": "The topic name." } }, "additionalProperties": true, "required": [ "type", "project", "topic" ] }, { "type": "object", "title": "LocalServer", "properties": { "type": { "type": "string", "enum": [ "local" ], "description": "The type of the data product technology that implements the data contract." }, "path": { "type": "string", "description": "The relative or absolute path to the data file(s).", "examples": [ "./folder/data.parquet", "./folder/*.parquet" ] }, "format": { "type": "string", "description": "The format of the file(s)", "examples": ["json", "parquet", "csv"] } }, "additionalProperties": true, "required": [ "type", "path", "format" ] } ] }, "description": "Information about the servers." }, "terms": { "type": "object", "description": "The terms and conditions of the data contract.", "properties": { "usage": { "type": "string", "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." }, "limitations": { "type": "string", "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." }, "billing": { "type": "string", "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." }, "noticePeriod": { "type": "string", "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. 
Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." } } }, "models": { "description": "Specifies the logical data model. Use the models name (e.g., the table name) as the key.", "type": "object", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "object", "title": "Model", "properties": { "description": { "type": "string" }, "type": { "description": "The type of the model. Examples: table, view, object. Default: table.", "type": "string", "title": "ModelType", "default": "table", "enum": [ "table", "view", "object" ] }, "fields": { "description": "Specifies a field in the data model. Use the field name (e.g., the column name) as the key.", "type": "object", "additionalProperties": { "type": "object", "title": "Field", "properties": { "description": { "type": "string", "description": "An optional string describing the semantic of the data in this field." }, "type": { "type": "string", "title": "FieldType", "description": "The logical data type of the field.", "enum": [ "number", "decimal", "numeric", "int", "integer", "long", "bigint", "float", "double", "string", "text", "varchar", "boolean", "timestamp", "timestamp_tz", "timestamp_ntz", "date", "array", "object", "record", "struct", "bytes", "null" ] }, "required": { "type": "boolean", "default": false, "description": "An indication, if this field must contain a value and may not be null." }, "primary": { "type": "boolean", "default": false, "description": "If this field is a primary key." }, "unique": { "type": "boolean", "default": false, "description": "An indication, if the value must be unique within the model." }, "enum": { "type": "array", "items": { "type": "string" }, "uniqueItems": true, "description": "A value must be equal to one of the elements in this array value. Only evaluated if the value is not null." }, "minLength": { "type": "number", "description": "A value must greater than, or equal to, the value of this. 
Only applies to string types." }, "maxLength": { "type": "number", "description": "A value must less than, or equal to, the value of this. Only applies to string types." }, "format": { "type": "string", "description": "A specific format the value must comply with (e.g., 'email', 'uri', 'uuid')." }, "pattern": { "type": "string", "description": "A regular expression the value must match. Only applies to string types." }, "minimum": { "type": "number", "description": "A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMinimum": { "type": "number", "description": "A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "maximum": { "type": "number", "description": "A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMaximum": { "type": "number", "description": "A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "example": { "type": "string", "description": "An example value for this field." }, "pii": { "type": "boolean", "description": "An indication, if this field contains Personal Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field, according to the organization's classification scheme.", "examples": [ "sensitive", "restricted", "internal", "public" ] }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." }, "$ref": { "type": "string", "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." 
} } } } } } }, "definitions": { "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "type": "object", "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "object", "title": "Definition", "properties": { "domain": { "type": "string", "description": "The domain in which this definition is valid.", "default": "global" }, "name": { "type": "string", "description": "The technical name of this definition." }, "title": { "type": "string", "description": "The business name of this definition." }, "description": { "type": "string", "description": "Clear and concise explanations related to the domain." }, "type": { "type": "string", "description": "The logical data type." }, "minLength": { "type": "number", "description": "A value must be greater than or equal to this value. Applies only to string types." }, "maxLength": { "type": "number", "description": "A value must be less than or equal to this value. Applies only to string types." }, "format": { "type": "string", "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." }, "pattern": { "type": "string", "description": "A regular expression pattern the value must match. Applies only to string types." }, "example": { "type": "string", "description": "An example value." }, "pii": { "type": "boolean", "description": "Indicates if the field contains Personal Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field." }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." 
} }, "required": [ "name", "type" ] } }, "schema": { "type": "object", "properties": { "type": { "type": "string", "title": "SchemaType", "enum": [ "dbt", "bigquery", "json-schema", "sql-ddl", "avro", "protobuf", "custom" ], "description": "The type of the schema. Typical values are dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom." }, "specification": { "oneOf": [ { "type": "string", "description": "The specification of the schema as a string." }, { "type": "object", "description": "The specification of the schema as an object." } ] } }, "required": [ "type", "specification" ], "description": "The schema of the data contract describes the syntax and semantics of provided data sets. It supports different schema types." }, "examples": { "type": "array", "items": { "type": "object", "properties": { "type": { "type": "string", "title": "ExampleType", "enum": [ "csv", "json", "yaml", "custom" ], "description": "The type of the example data. Well-known types are csv, json, yaml, custom." }, "description": { "type": "string", "description": "An optional string describing the example." }, "model": { "type": "string", "description": "The reference to the model in the schema, e.g., a table name." }, "data": { "oneOf": [{ "type": "string", "description": "Example data for this model." },{ "type": "array", "description": "Example data for this model in a structured format. Use this for type json or yaml." }] } }, "required": [ "type", "data" ] }, "description": "The Examples Object is an array of Example Objects." }, "quality": { "type": "object", "properties": { "type": { "type": "string", "title": "QualityType", "enum": [ "SodaCL", "montecarlo", "custom" ], "description": "The type of the quality check. Typical values are SodaCL, montecarlo, custom." }, "specification": { "oneOf": [ { "type": "string", "description": "The specification of the quality attributes as a string." 
}, { "type": "object", "description": "The specification of the quality attributes as an object." } ] } }, "required": [ "type", "specification" ], "description": "The quality object contains quality attributes and checks." } }, "required": [ "dataContractSpecification", "id", "info" ] }

================================================
FILE: versions/0.9.3/README.md
================================================
# Data Contract Specification

![datacontract.png](images/datacontract.png)

Data contracts bring data providers and data consumers together.

A _data contract_ is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. Think of an API, but for data.
A data contract is implemented by a data product or other data technologies, even legacy data warehouses.
Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees.

The _data contract specification_ defines a YAML format to describe attributes of provided data sets.
It is data platform neutral and can be used with any data platform, such as AWS S3, Google BigQuery, Azure, Databricks, and Snowflake.
The data contract specification is an open initiative to define a common data contract format.
It follows [OpenAPI](https://www.openapis.org/) and [AsyncAPI](https://www.asyncapi.com/) conventions.

Data contracts come into play when data is exchanged between different teams or organizational units, such as in a [data mesh architecture](https://www.datamesh-architecture.com/).
First, and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted.
They make semantic and quality expectations explicit.
They are often created collaboratively in [workshops](./workshop.md) together with data providers and data consumers.
Later in development and production, they also serve as the basis for code generation, testing, schema validations, quality checks, monitoring, access control, and computational governance policies.

The specification comes along with the [Data Contract CLI](https://github.com/datacontract/datacontract-cli), an open-source tool to develop, validate, and enforce data contracts.

> _Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties.
> The term "contract" may be somewhat misleading, but it is how it is used by the industry.
> The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract.
> Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes._

Version
---

0.9.3 ([Changelog](CHANGELOG.md))

Example
---

[![Data Contract Catalog](https://img.shields.io/badge/Data%20Contract-Catalog-blue)](https://datacontract.com/examples/index.html)

```yaml
dataContractSpecification: 0.9.3
id: urn:datacontract:checkout:orders-latest
info:
  title: Orders Latest
  version: 1.0.0
  description: |
    Successful customer orders in the webshop.
    All orders since 2020-01-01.
    Orders with their line items are in their current state (no history included).
  owner: Checkout Team
  slackChannel: "#checkout"
  contact:
    name: John Doe (Data Product Owner)
    url: https://teams.microsoft.com/l/channel/example/checkout
tags:
  - checkout
  - orders
  - s3
links:
  datacontractCli: https://cli.datacontract.com
servers:
  production:
    type: s3
    environment: prod
    location: s3://datacontract-example-orders-latest/data/{model}/*.json
    format: json
    delimiter: new_line
    description: "One folder per model. One file per day."
terms:
  usage: |
    Data can be used for reports, analytics and machine learning use cases.
    Orders may be linked and joined with other tables.
  limitations: |
    Not suitable for real-time use cases.
    Data may not be used to identify individual customers.
    Max data processing per day: 10 TiB
  billing: 5000 USD per month
  noticePeriod: P3M
models:
  orders:
    description: One record per order. Includes cancelled and deleted orders.
    type: table
    fields:
      order_id:
        $ref: '#/definitions/order_id'
        required: true
        unique: true
        primary: true
      order_timestamp:
        description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
        type: timestamp
        required: true
        example: "2024-09-09T08:30:00Z"
      order_total:
        description: Total amount in the smallest monetary unit (e.g., cents).
        type: long
        required: true
        example: "9999"
      customer_id:
        description: Unique identifier for the customer.
        type: text
        minLength: 10
        maxLength: 20
      customer_email_address:
        description: The email address, as entered by the customer. The email address was not verified.
        type: text
        format: email
        required: true
        pii: true
        classification: sensitive
      processed_timestamp:
        description: The timestamp when the record was processed by the data platform.
        type: timestamp
        required: true
        config:
          jsonType: string
          jsonFormat: date-time
  line_items:
    description: A single article that is part of an order.
    type: table
    fields:
      lines_item_id:
        type: text
        description: Primary key of the line_items table
        required: true
        unique: true
        primary: true
      order_id:
        $ref: '#/definitions/order_id'
        references: orders.order_id
      sku:
        description: The purchased article number
        $ref: '#/definitions/sku'
definitions:
  order_id:
    domain: checkout
    name: order_id
    title: Order ID
    type: text
    format: uuid
    description: An internal ID that identifies an order in the online shop.
    example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
    pii: true
    classification: restricted
    tags:
      - orders
  sku:
    domain: inventory
    name: sku
    title: Stock Keeping Unit
    type: text
    pattern: ^[A-Za-z0-9]{8,14}$
    example: "96385074"
    description: |
      A Stock Keeping Unit (SKU) is an internal unique identifier for an article.
      It is typically associated with an article's barcode, such as the EAN/GTIN.
    links:
      wikipedia: https://en.wikipedia.org/wiki/Stock_keeping_unit
    tags:
      - inventory
examples:
  - type: csv # csv, json, yaml, custom
    model: orders
    description: An example list of order records.
    data: | # expressed as string or inline yaml or via "$ref: data.csv"
      order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp
      "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z"
      "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z"
      "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z"
      "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
      "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
      "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z"
      "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z"
      "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z"
      "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z"
      "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z"
  - type: csv
    model: line_items
    description: An example list of line items.
    data: |
      lines_item_id,order_id,sku
      "LI-1","1001","5901234123457"
      "LI-2","1001","4001234567890"
      "LI-3","1002","5901234123457"
      "LI-4","1002","2001234567893"
      "LI-5","1003","4001234567890"
      "LI-6","1003","5001234567892"
      "LI-7","1004","5901234123457"
      "LI-8","1005","2001234567893"
      "LI-9","1005","5001234567892"
      "LI-10","1005","6001234567891"
servicelevels:
  availability:
    description: The server is available during support hours
    percentage: 99.9%
  retention:
    description: Data is retained for one year
    period: P1Y
    unlimited: false
  latency:
    description: Data is available within 25 hours after the order was placed
    threshold: 25h
    sourceTimestampField: orders.order_timestamp
    processedTimestampField: orders.processed_timestamp
  freshness:
    description: The age of the youngest row in a table.
    threshold: 25h
    timestampField: orders.order_timestamp
  frequency:
    description: Data is delivered once a day
    type: batch # or streaming
    interval: daily # for batch, either or cron
    cron: 0 0 * * * # for batch, either or interval
  support:
    description: The data is available during typical business hours at headquarters
    time: 9am to 5pm in EST on business days
    responseTime: 1h
  backup:
    description: Data is backed up once a week, every Sunday at 0:00 UTC.
    interval: weekly
    cron: 0 0 * * 0
    recoveryTime: 24 hours
    recoveryPoint: 1 week
quality:
  type: SodaCL # data quality check format: SodaCL, montecarlo, custom
  specification: # expressed as string or inline yaml or via "$ref: checks.yaml"
    checks for orders:
      - row_count >= 5
      - duplicate_count(order_id) = 0
    checks for line_items:
      - values in (order_id) must exist in orders (order_id)
      - row_count >= 5
```

Data Contract CLI
---

The [Data Contract CLI](https://cli.datacontract.com) is a command line tool and Python library to lint, test, import and export data contracts.
Here is a short example of how to verify that your actual dataset matches the data contract:

```bash
pip3 install datacontract-cli
datacontract test https://datacontract.com/examples/orders-latest/datacontract.yaml
```

or, if you prefer Docker:

```bash
docker run datacontract/cli test https://datacontract.com/examples/orders-latest/datacontract.yaml
```

The Data Contract contains all required information to verify data:

- The _servers_ block has the connection details to the actual data set.
- The _models_ define the syntax, formats, and constraints.
- The _quality_ block defines further quality checks.

The Data Contract CLI chooses the appropriate engine, formulates test cases, connects to the server, and executes the tests, based on the server type.

More information and configuration options on [cli.datacontract.com](https://cli.datacontract.com).

Specification
---

![The eight major categories in the data contract specification](images/categories.png)

- [Data Contract Object](#data-contract-object)
- [Info Object](#info-object)
- [Contact Object](#contact-object)
- [Server Object](#server-object)
- [Terms Object](#terms-object)
- [Model Object](#model-object)
- [Field Object](#field-object)
- [Definition Object](#definition-object)
- [Schema Object (DEPRECATED)](#schema-object-deprecated)
- [Example Object](#example-object)
- [Service Levels Object](#service-levels-object)
- [Quality Object](#quality-object)
- [Data Types](#data-types)
- [Specification Extensions](#specification-extensions)

[JSON Schema](https://github.com/datacontract/datacontract-specification/blob/main/datacontract.schema.json) of the Data Contract Specification.

### Data Contract Object

This is the root document.

It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`.
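
As a minimal sketch, a root document that contains only the fields marked REQUIRED in this specification (`dataContractSpecification`, `id`, and `info` with its `title` and `version`) could look like the following. The `id`, `title`, and `version` values are hypothetical placeholders, not taken from a real contract.

```yaml
# Minimal sketch of a datacontract.yaml: only the REQUIRED root fields.
dataContractSpecification: 0.9.3
id: urn:datacontract:example:minimal # hypothetical identifier
info:
  title: Minimal Example             # hypothetical title
  version: 0.0.1                     # version of this contract document, not of the specification
```

All other root fields, such as `servers`, `models`, or `quality`, are optional and can be added as the contract matures.
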
| Field | Type | Description |
|---------------------------|------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. |
| id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number. |
| info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. |
| servers | Map[`string`, [Server Object](#server-object)] | Specifies the servers of the data contract. |
| terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. |
| models | Map[`string`, [Model Object](#model-object)] | Specifies the logical data model. |
| definitions | Map[`string`, [Definition Object](#definition-object)] | Specifies definitions. |
| schema | [Schema Object (DEPRECATED)](#schema-object-deprecated) | Specifies the physical schema. The specification supports different schema formats. |
| examples | Array of [Example Objects](#example-object) | Specifies example data sets for the data model. The specification supports different example types. |
| servicelevels | [Service Levels Object](#service-levels-object) | Specifies the service level of the provided data. |
| quality | [Quality Object](#quality-object) | Specifies the quality attributes and checks. The specification supports different quality check DSLs. |
| links | Map[`string`, `string`] | Additional external documentation links. |
| tags | Array of `string` | Custom metadata to provide additional context. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

### Info Object

Metadata and life cycle information about the data contract.
| Field | Type | Description |
|-------------|-----------------------------------|------------------------------------------------------------------------------------------|
| title | `string` | REQUIRED. The title of the data contract. |
| version | `string` | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). |
| status | `string` | The status of the data contract. Can be `proposed`, `in development`, `active`, `deprecated`, `retired`. |
| description | `string` | A description of the data contract. |
| owner | `string` | The owner or team responsible for managing the data contract and providing the data. |
| contact | [Contact Object](#contact-object) | Contact information for the data contract. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

### Contact Object

Contact information for the data contract.

| Field | Type | Description |
|-------|----------|-------------------------------------------------------------------------------------------------------|
| name | `string` | The identifying name of the contact person/organization. |
| url | `string` | The URL pointing to the contact information. This _MUST_ be in the form of a URL. |
| email | `string` | The email address of the contact person/organization. This _MUST_ be in the form of an email address. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

### Server Object

The fields are dependent on the defined type.
| Field | Type | Description |
|-------------|----------|------------------------------------------------------------------------------------------|
| type | `string` | REQUIRED. The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `glue`, `redshift`, `azure`, `sqlserver`, `snowflake`, `databricks`, `postgres`, `oracle`, `kafka`, `pubsub`, `sftp`, `kinesis`, `trino`, `local` |
| description | `string` | An optional string describing the server. |
| environment | `string` | An optional string describing the environment, e.g., prod, sit, stg. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### BigQuery Server Object

| Field | Type | Description |
|---------|----------|-----------------------|
| type | `string` | `bigquery` |
| project | `string` | The GCP project name. |
| dataset | `string` | |

#### S3 Server Object

| Field | Type | Description |
|-------------|----------|-------------------------------------------------------------------------------------------------------------------------|
| type | `string` | `s3` |
| location | `string` | S3 URL, starting with `s3://` |
| endpointUrl | `string` | The server endpoint for S3-compatible servers, such as MinIO or Google Cloud Storage, e.g., `https://minio.example.com` |
| format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` |
| delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` |

Example (AWS S3):

```yaml
servers:
  production:
    type: s3
    location: s3://acme-orders-prod/orders/
    format: json
    delimiter: new_line
```

Example (MinIO):

```yaml
servers:
  minio:
    type: s3
    endpointUrl: http://localhost:9000
    location: s3://my-bucket/path/
    format: delta
```

Example (Google Cloud Storage):

```yaml
servers:
  gcs:
    type: s3
    endpointUrl: https://storage.googleapis.com
    location: s3://my-bucket/path/*/*/*/*/*.parquet
    format: parquet
```

#### Redshift Server Object

| Field | Type | Description |
|-------------------|----------|---------------------------------------------------------------------------------------------------------------------|
| type | `string` | `redshift` |
| account | `string` | |
| database | `string` | |
| schema | `string` | |
| clusterIdentifier | `string` | Identifier of the cluster.<br>Example: `analytics-cluster` |
| host | `string` | Host of the cluster.<br>Example: `analytics-cluster.example.eu-west-1.redshift.amazonaws.com` |
| port | `number` | Port of the cluster.<br>Example: `5439` |
| endpoint | `string` | Endpoint of the cluster.<br>Example: `analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics` |

Example, specifying an endpoint:

```yaml
servers:
  analytics:
    type: redshift
    account: '123456789012'
    database: analytics
    schema: analytics
    endpoint: analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics
```

Example, specifying the cluster identifier:

```yaml
servers:
  analytics:
    type: redshift
    account: '123456789012'
    database: analytics
    schema: analytics
    clusterIdentifier: analytics-cluster
```

Example, specifying the cluster host:

```yaml
servers:
  analytics:
    type: redshift
    account: '123456789012'
    database: analytics
    schema: analytics
    host: analytics-cluster.example.eu-west-1.redshift.amazonaws.com
    port: 5439
```

#### Azure Server Object

| Field | Type | Description |
|-----------|----------|------------------------------------------------------------------------------------------|
| type | `string` | `azure` |
| location | `string` | Fully qualified path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs. Starting with `az://` or `abfss://`.<br>Examples: `az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet` or `abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet` |
| format | `string` | Format of files, such as `parquet`, `json`, `csv` |
| delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` |

#### SQL-Server Server Object

| Field | Type | Description |
|----------|-----------|------------------------------------------------------|
| type | `string` | `sqlserver` |
| host | `string` | The host to the database server |
| port | `integer` | The port to the database server, default: `1433` |
| database | `string` | The name of the database, e.g., `database`. |
| schema | `string` | The name of the schema in the database, e.g., `dbo`. |
| driver | `string` | The name of the supported driver, e.g., `ODBC Driver 18 for SQL Server`. |

#### Snowflake Server Object

| Field | Type | Description |
|----------|----------|-------------|
| type | `string` | `snowflake` |
| account | `string` | |
| database | `string` | |
| schema | `string` | |

#### Databricks Server Object

| Field | Type | Description |
|---------|----------|---------------------------------------------------------------------|
| type | `string` | `databricks` |
| host | `string` | The Databricks host, e.g., `dbc-abcdefgh-1234.cloud.databricks.com` |
| catalog | `string` | The name of the Hive or Unity catalog |
| schema | `string` | The schema name in the catalog |

#### Postgres Server Object

| Field | Type | Description |
|----------|-----------|---------------------------------------------------------|
| type | `string` | `postgres` |
| host | `string` | The host to the database server |
| port | `integer` | The port to the database server |
| database | `string` | The name of the database, e.g., `postgres`. |
| schema | `string` | The name of the schema in the database, e.g., `public`. |

#### Oracle Server Object

| Field | Type | Description |
|-------------|-----------|---------------------------------|
| type | `string` | `oracle` |
| host | `string` | The host to the Oracle server |
| port | `integer` | The port to the Oracle server |
| serviceName | `string` | The name of the service |

#### Kafka Server Object

| Field | Type | Description |
|--------|----------|---------------------------------------------------------------------------|
| type | `string` | `kafka` |
| host | `string` | The bootstrap server of the kafka cluster. |
| topic | `string` | The topic name. |
| format | `string` | The format of the message. Examples: json, avro, protobuf. Default: json. |

#### Pub/Sub Server Object

| Field | Type | Description |
|---------|----------|-----------------------|
| type | `string` | `pubsub` |
| project | `string` | The GCP project name. |
| topic | `string` | The topic name. |

#### sftp Server Object

| Field | Type | Description |
|-----------|----------|------------------------------------------------------------------------------------------------------------------|
| type | `string` | `sftp` |
| location | `string` | SFTP URL, starting with `sftp://` |
| format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` |
| delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` |

#### AWS Kinesis Data Streams Server Object

| Field | Type | Description |
|--------|----------|---------------------------------------------------------------------------|
| type | `string` | `kinesis` |
| stream | `string` | The name of the Kinesis data stream. |
| region | `string` | AWS region, e.g., `eu-west-1`. |
| format | `string` | The format of the records. Examples: json, avro, protobuf. |

#### Trino Server Object

| Field | Type | Description |
|----------|-----------|-----------------------------------------------------------|
| type | `string` | `trino` |
| host | `string` | The Trino host |
| port | `integer` | The Trino port |
| catalog | `string` | The name of the catalog, e.g., `my_catalog`. |
| schema | `string` | The name of the schema in the catalog, e.g., `my_schema`. |

#### Local Server Object

| Field | Type | Description |
|--------|----------|-------------------------------------------------------------------------------------|
| type | `string` | `local` |
| path | `string` | The relative or absolute path to the data file(s), such as `./folder/data.parquet`. |
| format | `string` | The format of the file(s), such as `parquet`, `delta`, `csv`, or `json`. |

### Terms Object

The terms and conditions of the data contract.

| Field | Type | Description |
|--------------|----------|------------------------------------------------------------------------------------------|
| usage | `string` | The usage describes the way the data is expected to be used. Can contain business and technical information. |
| limitations | `string` | The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for. |
| billing | `string` | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. |
| noticePeriod | `string` | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

### Model Object

The Model Object describes the structure and semantics of a data model, such as tables, views, or structured files.
The name of the data model (table name) is defined by the key that refers to this Model Object.

| Field | Type | Description |
|-------------|----------------------------------------------|-------------|
| type | `string` | The type of the model. Examples: `table`, `view`, `object`. Default: `table`. |
| description | `string` | An optional string describing the data model. |
| title | `string` | An optional string for the title of the data model. Especially useful if the name of the model is cryptic or contains abbreviations. |
| fields | Map[`string`, [Field Object](#field-object)] | The fields (e.g. columns) of the data model. |
| config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

### Field Object

The Field Object describes one field (column, property, nested field) of a data model.

| Field | Type | Description |
|------------------|----------------------------------------------|-------------|
| description | `string` | An optional string describing the semantics of the data in this field. |
| type | [Data Type](#data-types) | The logical data type of the field. |
| title | `string` | An optional string providing a human-readable name for the field. Especially useful if the field name is cryptic or contains abbreviations. |
| enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. |
| required | `boolean` | Indicates whether this field must contain a value and may not be null. Default: `false` |
| primary | `boolean` | Indicates whether this field is a primary key. Default: `false` |
| references | `string` | The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship. |
| unique | `boolean` | Indicates whether the value must be unique within the model. Default: `false` |
| format | `string` | `email`: A value must be compliant with [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).<br>`uri`: A value must be compliant with [RFC 3986](https://www.rfc-editor.org/info/rfc3986).<br>`uuid`: A value must be compliant with [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| precision | `number` | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. |
| scale | `number` | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. |
| minLength | `number` | The length of a value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| maxLength | `number` | The length of a value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| pattern | `string` | A value must be valid according to the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| minimum | `number` | A numeric value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values. |
| exclusiveMinimum | `number` | A numeric value must be greater than this value. Only evaluated if the value is not null. Only applies to numeric values. |
| maximum | `number` | A numeric value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values. |
| exclusiveMaximum | `number` | A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values. |
| example | `string` | An example value. |
| pii | `boolean` | Indicates whether this field contains Personally Identifiable Information (PII). |
| classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. Examples may be: `sensitive`, `restricted`, `internal`, `public`. |
| tags | Array of `string` | Custom metadata to provide additional context. |
| links | Map[`string`, `string`] | Additional external documentation links. |
| $ref | `string` | A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition. |
| fields | Map[`string`, [Field Object](#field-object)] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is `object`, `record`, or `struct`. |
| items | [Field Object](#field-object) | The type of the elements in the array. Use only when type is `array`. |
| keys | [Field Object](#field-object) | Describes the key structure of a map. Defaults to `type: string` if a map is defined as type. Not all server types support different key types. Use only when type is `map`. |
| values | [Field Object](#field-object) | Describes the value structure of a map. Use only when type is `map`. |
| config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

### Definition Object

The Definition Object provides a clear and concise explanation of the syntax, semantics, and classification of a business object in a given domain. It serves as a reference for a common understanding of terminology, to ensure consistent usage, and to identify joinable fields.

Model fields can refer to definitions using the `$ref` field to link to existing definitions and avoid duplicate documentation.
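
A minimal sketch of how a model field can reuse a definition might look like the following. The `$ref` value `#/definitions/order_id` assumes a JSON-pointer-style reference into a top-level `definitions` section of the same contract; the `orders` model, the `checkout` domain, and all values are hypothetical and only serve as illustration.

```yaml
models:
  orders:
    fields:
      order_id:
        # Assumption: properties such as type and format are inherited from the definition.
        $ref: '#/definitions/order_id'
definitions:
  order_id:
    name: order_id          # REQUIRED: the technical name of the definition
    type: text              # REQUIRED: the logical data type
    domain: checkout        # hypothetical domain; defaults to global when omitted
    title: Order ID
    description: An internal ID that identifies an order in the online shop.
    format: uuid
    pii: true
    classification: restricted
```

Keeping the documentation in the definition means that several models can point at the same entry, which also marks those fields as candidates for joins.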
| Field | Type | Description |
|------------------|----------------------------------------------|-------------|
| name | `string` | REQUIRED. The technical name of this definition. |
| type | [Data Type](#data-types) | REQUIRED. The logical data type. |
| domain | `string` | The domain in which this definition is valid. Default: `global`. |
| title | `string` | The business name of this definition. |
| description | `string` | Clear and concise explanations related to the domain. |
| enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. |
| format | `string` | `email`: A value must be compliant with [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).<br>`uri`: A value must be compliant with [RFC 3986](https://www.rfc-editor.org/info/rfc3986).<br>`uuid`: A value must be compliant with [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| precision | `number` | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. |
| scale | `number` | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. |
| minLength | `number` | The length of a value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| maxLength | `number` | The length of a value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| pattern | `string` | A value must be valid according to the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to Unicode character sequence types (`string`, `text`, `varchar`). |
| minimum | `number` | A numeric value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values. |
| exclusiveMinimum | `number` | A numeric value must be greater than this value. Only evaluated if the value is not null. Only applies to numeric values. |
| maximum | `number` | A numeric value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values. |
| exclusiveMaximum | `number` | A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values. |
| example | `string` | An example value. |
| pii | `boolean` | Indicates whether this field contains Personally Identifiable Information (PII). |
| classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. |
| tags | Array of `string` | Custom metadata to provide additional context. |
| links | Map[`string`, `string`] | Additional external documentation links. |
| fields | Map[`string`, [Field Object](#field-object)] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is `object`, `record`, or `struct`. |
| items | [Field Object](#field-object) | The type of the elements in the array. Use only when type is `array`. |
| keys | [Field Object](#field-object) | Describes the key structure of a map. Defaults to `type: string` if a map is defined as type. Not all server types support different key types. Use only when type is `map`. |
| values | [Field Object](#field-object) | Describes the value structure of a map. Use only when type is `map`. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

### Schema Object (DEPRECATED)

The schema of the data contract describes the physical schema.
The type of the schema depends on the data platform.

| Field | Type | Description |
|---------------|------|-------------|
| type | `string` | REQUIRED. The type of the schema.
Typical values are: `dbt`, `bigquery`, `json-schema`, `sql-ddl`, `avro`, `protobuf`, `custom` | | specification | [dbt Schema Object](#dbt-schema-object) \|
[BigQuery Schema Object](#bigquery-schema-object) \|
[JSON Schema Schema Object](#json-schema-schema-object) \|
[SQL DDL Schema Object](#sql-ddl-schema-object) \|
`string` | REQUIRED. The specification of the schema. The schema specification can be encoded as a string or as inline YAML. | #### dbt Schema Object https://docs.getdbt.com/reference/model-properties Example (inline YAML): ```yaml schema: type: dbt specification: version: 2 models: - name: "My Table" description: "My description" columns: - name: "My column" data_type: text description: "My description" ``` Example (string): ```yaml schema: type: dbt specification: |- version: 2 models: - name: "My Table" description: "My description" columns: - name: "My column" data_type: text description: "My description" ``` #### BigQuery Schema Object The schema structure is defined by the [Google BigQuery Table](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource:-table) object. You can extract such a Table object via the [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) endpoint. Instead of providing a single Table object, you can also provide an array of such objects. Be aware that [tables.list](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list) only returns a subset of the full Table object. You need to call every Table object via [tables.get](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) to get the full Table object, including the actual schema. 
Learn more: [Google BigQuery REST Reference v2](https://cloud.google.com/bigquery/docs/reference/rest) Example: ```yaml schema: type: bigquery specification: |- { "tableReference": { "projectId": "my-project", "datasetId": "my_dataset", "tableId": "my_table" }, "description": "This is a description", "type": "TABLE", "schema": { "fields": [ { "name": "name", "type": "STRING", "mode": "NULLABLE", "description": "This is a description" } ] } } ``` #### JSON Schema Schema Object JSON Schema can be defined as JSON or rendered as YAML, following the [OpenAPI Schema Object dialect](https://spec.openapis.org/oas/v3.1.0#properties) Example (inline YAML): ```yaml schema: type: json-schema specification: orders: description: One record per order. Includes cancelled and deleted orders. type: object properties: order_id: type: string description: Primary key of the orders table order_timestamp: type: string format: date-time description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. order_total: type: integer description: Total amount of the order in the smallest monetary unit (e.g., cents). line_items: type: object properties: lines_item_id: type: string description: Primary key of the lines_item_id table order_id: type: string description: Foreign key to the orders table sku: type: string description: The purchased article number ``` Example (string): ```yaml schema: type: json-schema specification: |- { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "orders": { "type": "object", "description": "One record per order. Includes cancelled and deleted orders.", "properties": { "order_id": { "type": "string", "description": "Primary key of the orders table" }, "order_timestamp": { "type": "string", "format": "date-time", "description": "The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful." 
}, "order_total": { "type": "integer", "description": "Total amount of the order in the smallest monetary unit (e.g., cents)." } }, "required": ["order_id", "order_timestamp", "order_total"] }, "line_items": { "type": "object", "properties": { "lines_item_id": { "type": "string", "description": "Primary key of the lines_item_id table" }, "order_id": { "type": "string", "description": "Foreign key to the orders table" }, "sku": { "type": "string", "description": "The purchased article number" } }, "required": ["lines_item_id", "order_id", "sku"] } }, "required": ["orders", "line_items"] } ``` #### SQL DDL Schema Object Classical SQL DDLs can be used to describe the structure. Example (string): ```yaml schema: type: sql-ddl specification: |- -- One record per order. Includes cancelled and deleted orders. CREATE TABLE orders ( order_id TEXT PRIMARY KEY, -- Primary key of the orders table order_timestamp TIMESTAMPTZ NOT NULL, -- The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful. order_total INTEGER NOT NULL -- Total amount of the order in the smallest monetary unit (e.g., cents) ); -- The items that are part of an order CREATE TABLE line_items ( lines_item_id TEXT PRIMARY KEY, -- Primary key of the lines_item_id table order_id TEXT REFERENCES orders(order_id), -- Foreign key to the orders table sku TEXT NOT NULL -- The purchased article number ); ``` ### Example Object | Field | Type | Description | |-------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | The type of the data product technology that implements the data contract. Well-known server types are: `csv`, `json`, `yaml`, `custom` | | description | `string` | An optional string describing the example. | | model | `string` | The reference to the model in the schema, e.g. a table name. 
| | data | `string` | Example data for this model. |

Example:

```yaml
examples:
  - type: csv
    model: orders
    data: |-
      order_id,order_timestamp,order_total
      "1001","2023-09-09T08:30:00Z",2500
      "1002","2023-09-08T15:45:00Z",1800
      "1003","2023-09-07T12:15:00Z",3200
      "1004","2023-09-06T19:20:00Z",1500
      "1005","2023-09-05T10:10:00Z",4200
      "1006","2023-09-04T14:55:00Z",2800
      "1007","2023-09-03T21:05:00Z",1900
      "1008","2023-09-02T17:40:00Z",3600
      "1009","2023-09-01T09:25:00Z",3100
      "1010","2023-08-31T22:50:00Z",2700
```

### Service Levels Object

A service level is defined as an agreed-upon, measurable level of performance for the provided data.
Data Contract Specification defines well-known service levels.
This list can be extended with custom service levels.

One can either describe each service level informally using the `description` field, or make use of the predefined fields for automation support, e.g., via the [Data Contract CLI](https://cli.datacontract.com).

| Field | Type | Description |
|--------------|---------------------------------------------|-------------|
| availability | [Availability Object](#availability-object) | The promised uptime of the system that provides the data. |
| retention | [Retention Object](#retention-object) | How long data will be available. |
| latency | [Latency Object](#latency-object) | The maximum amount of time from the source to its destination. |
| freshness | [Freshness Object](#freshness-object) | The maximum age of the youngest entry. |
| frequency | [Frequency Object](#frequency-object) | The update frequency. |
| support | [Support Object](#support-object) | The times when support is provided. |
| backup | [Backup Object](#backup-object) | The details about data backup procedures. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).
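
As an illustration of how these objects fit together, the following sketch shows a `servicelevels` section that uses a subset of the predefined fields. The key names come from this specification; the concrete values and the referenced `orders.order_timestamp` and `orders.processed_timestamp` fields are hypothetical and assume an `orders` model that defines them.

```yaml
servicelevels:
  availability:
    description: The server is available during support hours
    percentage: 99.9%
  retention:
    description: Data is retained for one year
    period: P1Y                  # ISO 8601 duration
    unlimited: false
  latency:
    description: Data is available within 25 hours after the order was placed
    threshold: 25h               # simple duration
    sourceTimestampField: orders.order_timestamp
    processedTimestampField: orders.processed_timestamp
  freshness:
    description: The age of the youngest row in a table
    threshold: 25h
    timestampField: orders.order_timestamp
  frequency:
    description: Data is delivered once a day
    type: batch
    interval: daily              # for batch, use either interval or cron
```

Each entry can also carry only a `description` when the service level is agreed informally and no automated check is intended.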
#### Availability Object

Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data.

| Field | Type | Description |
|-------------|----------|-------------|
| description | `string` | An optional string describing the availability service level. |
| percentage | `string` | An optional string describing the guaranteed uptime in percent (e.g., `99.9%`). |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Retention Object

Retention covers how long data will be available.

| Field | Type | Description |
|----------------|-----------|-------------|
| description | `string` | An optional string describing the retention service level. |
| period | `string` | An optional period of time that defines how long data is available. Supported formats: Simple duration (e.g., `1 year`, `30d`) and ISO 8601 duration (e.g., `P1Y`). |
| unlimited | `boolean` | An optional indicator that data is kept forever. |
| timestampField | `string` | An optional reference to the field that contains the timestamp that the period refers to. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Latency Object

Latency refers to the maximum amount of time from the source to its destination.

An example is the maximum duration it takes after an order has been recorded in the ecommerce shop until it is available in the orders table in the data analytics platform.
This includes the waiting time until the next batch run starts and the processing time of the pipeline.
| Field | Type | Description |
|-------------------------|----------|-------------|
| description | `string` | An optional string describing the latency service level. |
| threshold | `string` | An optional maximum duration between the source timestamp and the processed timestamp. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`). |
| sourceTimestampField | `string` | An optional reference to the field that contains the timestamp when the data was provided at the source. |
| processedTimestampField | `string` | An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Freshness Object

Freshness refers to the maximum age of the youngest entry.

| Field | Type | Description |
|----------------|----------|-------------|
| description | `string` | An optional string describing the freshness service level. |
| threshold | `string` | An optional maximum age of the youngest entry. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`). |
| timestampField | `string` | An optional reference to the field that contains the timestamp that the threshold refers to. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Frequency Object

Frequency describes how often data is updated.
| Field | Type | Description |
|-------------|----------|-------------|
| description | `string` | An optional string describing the frequency service level. |
| type | `string` | An optional type of data processing. Typical values are `batch`, `micro-batching`, `streaming`, `manual`. |
| interval | `string` | Optional. Only for batch: How often the pipeline is triggered, e.g., `daily`. |
| cron | `string` | Optional. Only for batch: A cron expression that defines when the pipeline is triggered, e.g., `0 0 * * *`. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Support Object

Support describes the times when support will be available for contact.

| Field | Type | Description |
|--------------|----------|-------------|
| description | `string` | An optional string describing the support service level. |
| time | `string` | An optional string describing the times when support will be available for contact such as `24/7` or `business hours only`. |
| responseTime | `string` | An optional string describing the time it takes for the support team to acknowledge a request. This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with. |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).

#### Backup Object

Backup specifies details about data backup procedures.
| Field | Type | Description | |---------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | description | `string` | An optional string describing the backup service level. | | interval | `string` | An optional interval that defines how often data will be backed up, e.g., `daily`. | | cron | `string` | An optional cron expression when data will be backed up, e.g., `0 0 * * *`. | | recoveryTime | `string` | An optional Recovery Time Objective (RTO) specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours). | | recoveryPoint | `string` | An optional Recovery Point Objective (RPO) defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours). | ### Quality Object The quality object contains quality attributes and checks. | Field | Type | Description | |---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------| | type | `string` | REQUIRED. The type of the schema.
Typical values are: `SodaCL`, `montecarlo`, `great-expectations`, `custom` | | specification | [SodaCL Quality Object](#sodacl-quality-object) \|
[Monte Carlo Quality Object](#monte-carlo-quality-object) \|
[Great Expectations Quality Object](#great-expectations-quality-object) \|
`string` | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. | #### SodaCL Quality Object Quality attributes in [Soda Checks Language](https://docs.soda.io/soda-cl/soda-cl-overview.html). The `specification` represents the content of a `checks.yml` file. Example (inline): ```yaml quality: type: SodaCL # data quality check format: SodaCL, montecarlo, dbt-tests, custom specification: # expressed as string or inline yaml or via "$ref: checks.yaml" checks for orders: - row_count > 0 - duplicate_count(order_id) = 0 checks for line_items: - row_count > 0 ``` Example (string): ```yaml quality: type: SodaCL specification: |- checks for search_queries: - freshness(search_timestamp) < 1d - row_count > 100000 - missing_count(search_query) = 0 ``` #### Monte Carlo Quality Object Quality attributes defined as Monte Carlos [Monitors as Code](https://docs.getmontecarlo.com/docs/monitors-as-code). The `specification` represents the content of a `montecarlo.yml` file. Example (string): ```yaml quality: type: montecarlo specification: |- montecarlo: field_health: - table: project:dataset.table_name timestamp_field: created dimension_tracking: - table: project:dataset.table_name timestamp_field: created field: order_status ``` #### Great Expectations Quality Object Quality attributes defined as Great Expectations [Expectations](https://greatexpectations.io/expectations/). The `specification` represents a list of expectations on a specific model. Example (string): ```yaml quality: type: great-expectations specification: orders: |- [ { "expectation_type": "expect_table_row_count_to_be_between", "kwargs": { "min_value": 10 }, "meta": { } } ] ``` ### Config Object The config field can be used to set additional metadata that may be used by tools, e.g. to define a namespace for code generation, specify physical data types, toggle tests, etc. A config field can be added with any name. 
The value can be null, a primitive, an array or an object.

For developer experience, a list of well-known field names is maintained here, as these fields are used in the Data Contract CLI:

| Field | Type | Description |
|-----------------|----------|-------------|
| avroNamespace | `string` | (Only on model level) The namespace to use when importing and exporting the data model from / to Apache Avro. |
| avroType | `string` | (Only on field level) Specify the field type to use when exporting the data model to Apache Avro. |
| avroLogicalType | `string` | (Only on field level) Specify the logical field type to use when exporting the data model to Apache Avro. |
| bigqueryType | `string` | (Only on field level) Specify the physical column type that is used in a BigQuery table, e.g., `NUMERIC(5, 2)` |
| snowflakeType | `string` | (Only on field level) Specify the physical column type that is used in a Snowflake table, e.g., `TIMESTAMP_LTZ` |
| redshiftType | `string` | (Only on field level) Specify the physical column type that is used in a Redshift table, e.g., `SMALLINT` |
| sqlserverType | `string` | (Only on field level) Specify the physical column type that is used in a SQL Server table, e.g., `DATETIME2` |
| databricksType | `string` | (Only on field level) Specify the physical column type that is used in a Databricks table |
| glueType | `string` | (Only on field level) Specify the physical column type that is used in an AWS Glue Data Catalog table |

This object _MAY_ be extended with [Specification Extensions](#specification-extensions).
Example:

```yaml
models:
  orders:
    config:
      avroNamespace: "my.namespace"
    fields:
      my_field_1:
        description: Example for AVRO with Timestamp (millisecond precision)
        type: timestamp
        config:
          avroType: long
          avroLogicalType: timestamp-millis
          snowflakeType: timestamp_tz
```

### Data Types

The following data types are supported for model fields and definitions:

- Unicode character sequence: `string`, `text`, `varchar`
- Any numeric type, either integers or floating point numbers: `number`, `decimal`, `numeric`
- 32-bit signed integer: `int`, `integer`
- 64-bit signed integer: `long`, `bigint`
- Single precision (32-bit) IEEE 754 floating-point number: `float`
- Double precision (64-bit) IEEE 754 floating-point number: `double`
- Boolean value: `boolean`
- Timestamp with timezone: `timestamp`, `timestamp_tz`
- Timestamp with no timezone: `timestamp_ntz`
- Date with no time information: `date`
- Array: `array`
- Map: `map` (may not be supported by some server types)
- Sequence of 8-bit unsigned bytes: `bytes`
- Complex type: `object`, `record`, `struct`
- No value: `null`

### Specification Extensions

While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points.

A custom field can be added with any name. The value can be null, a primitive, an array or an object.

Tooling
---

- [Data Contract CLI](https://github.com/datacontract/datacontract-cli) is an open-source CLI tool to help you create, develop, and maintain your data contracts.
- [Data Contract Manager](https://www.datamesh-manager.com/) is a commercial tool to manage data contracts. It includes a data contract catalog, a Web-Editor, and a request and approval workflow to automate access to data products for a full enterprise data marketplace.
- [Data Contract GPT](https://gpt.datacontract.com) is a custom GPT that can help you write data contracts.
- [Data Contract Editor](https://editor.datacontract.com) is an open-source editor for Data Contracts, including a live HTML preview.

Code Completion
---

The [JSON Schema](https://datacontract.com/datacontract.schema.json) of the current data contract specification is registered in [Schema Store](https://www.schemastore.org/), which brings code completion and syntax checks for all major IDEs.
IntelliJ comes with a built-in YAML plugin that shows autocompletions.
For VS Code, we recommend installing the [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) plugin. No additional configuration is required.

Autocompletion is then enabled for files following these patterns:

```
datacontract.yaml
datacontract.yml
*-datacontract.yaml
*-datacontract.yml
*.datacontract.yaml
*.datacontract.yml
datacontract-*.yaml
datacontract-*.yml
**/datacontract/*.yml
**/datacontract/*.yaml
**/datacontracts/*.yml
**/datacontracts/*.yaml
```

Authors
---

The Data Contract Specification was originally created by [Jochen Christ](https://www.linkedin.com/in/jochenchrist/) and [Dr. Simon Harrer](https://www.linkedin.com/in/simonharrer/), and is currently maintained by them.

Contributing
---

Contributions are welcome! Please open an issue or a pull request.
License --- [MIT License](LICENSE) ================================================ FILE: versions/0.9.3/datacontract.init.yaml ================================================ dataContractSpecification: 0.9.3 id: my-data-contract-id info: title: My Data Contract version: 0.0.1 # description: # owner: # contact: # name: # url: # email: ### servers #servers: # production: # type: s3 # location: s3:// # format: parquet # delimiter: new_line ### terms #terms: # usage: # limitations: # billing: # noticePeriod: ### models # models: # my_model: # description: # type: # fields: # my_field: # type: # description: ### definitions # definitions: # my_field: # domain: # name: # title: # type: # description: # example: # pii: # classification: ### examples #examples: # - type: csv # model: my_model # data: |- # id,timestamp,amount # "1001","2023-09-09T08:30:00Z",2500 # "1002","2023-09-08T15:45:00Z",1800 ### servicelevels #servicelevels: # availability: # description: The server is available during support hours # percentage: 99.9% # retention: # description: Data is retained for one year because! # period: P1Y # unlimited: false # latency: # description: Data is available within 25 hours after the order was placed # threshold: 25h # sourceTimestampField: orders.order_timestamp # processedTimestampField: orders.processed_timestamp # freshness: # description: The age of the youngest row in a table. # threshold: 25h # timestampField: orders.order_timestamp # frequency: # description: Data is delivered once a day # type: batch # or streaming # interval: daily # for batch, either or cron # cron: 0 0 * * * # for batch, either or interval # support: # description: The data is available during typical business hours at headquarters # time: 9am to 5pm in EST on business days # responseTime: 1h # backup: # description: Data is backed up once a week, every Sunday at 0:00 UTC. 
#    interval: weekly
#    cron: 0 0 * * 0
#    recoveryTime: 24 hours
#    recoveryPoint: 1 week


### quality

#quality:
#  type: SodaCL
#  specification:
#    checks for my_model: |-
#      - duplicate_count(id) = 0

================================================
FILE: versions/0.9.3/datacontract.schema.json
================================================
{ "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "title": "DataContractSpecification", "properties": { "dataContractSpecification": { "type": "string", "title": "DataContractSpecificationVersion", "enum": [ "0.9.3", "0.9.2", "0.9.1", "0.9.0" ], "description": "Specifies the Data Contract Specification being used." }, "id": { "type": "string", "description": "Specifies the identifier of the data contract." }, "info": { "type": "object", "properties": { "title": { "type": "string", "description": "The title of the data contract." }, "version": { "type": "string", "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." }, "status": { "type": "string", "description": "The status of the data contract. Can be proposed, in development, active, deprecated, retired.", "x-extensible-enum": [ "proposed", "in development", "active", "deprecated", "retired" ] }, "description": { "type": "string", "description": "A description of the data contract." }, "owner": { "type": "string", "description": "The owner or team responsible for managing the data contract and providing the data." }, "contact": { "type": "object", "properties": { "name": { "type": "string", "description": "The identifying name of the contact person/organization." }, "url": { "type": "string", "format": "uri", "description": "The URL pointing to the contact information. This MUST be in the form of a URL." }, "email": { "type": "string", "format": "email", "description": "The email address of the contact person/organization.
This MUST be in the form of an email address." } }, "description": "Contact information for the data contract.", "additionalProperties": true } }, "additionalProperties": true, "required": [ "title", "version" ], "description": "Metadata and life cycle information about the data contract." }, "servers": { "type": "object", "properties": { "description": { "type": "string", "description": "An optional string describing the servers." }, "environment": { "type": "string", "description": "The environment in which the servers are running. Examples: prod, sit, stg." } }, "additionalProperties": { "oneOf": [ { "type": "object", "title": "BigQueryServer", "properties": { "type": { "type": "string", "enum": [ "bigquery", "BigQuery" ], "description": "The type of the data product technology that implements the data contract." }, "project": { "type": "string", "description": "An optional string describing the server." }, "dataset": { "type": "string", "description": "An optional string describing the server." } }, "additionalProperties": true, "required": [ "type", "project", "dataset" ] }, { "type": "object", "title": "S3Server", "properties": { "type": { "type": "string", "enum": [ "s3" ], "description": "The type of the data product technology that implements the data contract." }, "location": { "type": "string", "format": "uri", "description": "An optional string describing the server. Must be in the form of a URL.", "examples": [ "s3://datacontract-example-orders-latest/data/{model}/*.json" ] }, "endpointUrl": { "type": "string", "format": "uri", "description": "The server endpoint for S3-compatible servers.", "examples": ["https://minio.example.com"] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. 
How multiple json documents are delimited within one file" } }, "additionalProperties": true, "required": [ "type", "location" ] }, { "type": "object", "title": "GcsServer", "properties": { "type": { "type": "string", "enum": [ "gcs" ], "description": "The type of the data product technology that implements the data contract." }, "location": { "type": "string", "format": "uri", "description": "The GS/GCS url to the data.", "examples": [ "gs://example-storage/data/*/*.json" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. How multiple json documents are delimited within one file" } }, "additionalProperties": true, "required": [ "type", "location" ] }, { "type": "object", "title": "SftpServer", "properties": { "type": { "type": "string", "enum": [ "sftp" ], "description": "The type of the data product technology that implements the data contract." }, "location": { "type": "string", "format": "uri", "description": "An optional string describing the server. Must be in the form of a sftp URL.", "examples": [ "sftp://123.123.12.123/{model}/*.json" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. How multiple json documents are delimited within one file" } }, "additionalProperties": true, "required": [ "type", "location" ] }, { "type": "object", "title": "RedshiftServer", "properties": { "type": { "type": "string", "enum": [ "redshift" ], "description": "The type of the data product technology that implements the data contract." }, "account": { "type": "string", "description": "An optional string describing the server." }, "host": { "type": "string", "description": "An optional string describing the host name." 
}, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." }, "clusterIdentifier": { "type": "string", "description": "An optional string describing the cluster's identifier.", "examples": [ "redshift-prod-eu", "analytics-cluster" ] }, "port": { "type": "integer", "description": "An optional string describing the cluster's port.", "examples": [ 5439 ] }, "endpoint": { "type": "string", "description": "An optional string describing the cluster's endpoint.", "examples": [ "analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics" ] } }, "additionalProperties": true, "required": [ "type", "account", "database", "schema" ] }, { "type": "object", "title": "AzureServer", "properties": { "type": { "type": "string", "enum": [ "azure" ], "description": "The type of the data product technology that implements the data contract." }, "location": { "type": "string", "format": "uri", "description": "Fully qualified path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs.", "examples": [ "az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet", "abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. How multiple json documents are delimited within one file" } }, "additionalProperties": true, "required": [ "type", "location", "format" ] }, { "type": "object", "title": "SqlserverServer", "properties": { "type": { "type": "string", "enum": [ "sqlserver" ], "description": "The type of the data product technology that implements the data contract." 
}, "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server.", "default": 1433, "examples": [ 1433 ] }, "database": { "type": "string", "description": "The name of the database.", "examples": [ "database" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "dbo" ] } }, "additionalProperties": true, "required": [ "type", "host", "database", "schema" ] }, { "type": "object", "title": "SnowflakeServer", "properties": { "type": { "type": "string", "enum": [ "snowflake" ], "description": "The type of the data product technology that implements the data contract." }, "account": { "type": "string", "description": "An optional string describing the server." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." } }, "additionalProperties": true, "required": [ "type", "account", "database", "schema" ] }, { "type": "object", "title": "DatabricksServer", "properties": { "type": { "type": "string", "const": "databricks", "description": "The type of the data product technology that implements the data contract." }, "host": { "type": "string", "description": "The Databricks host", "examples": [ "dbc-abcdefgh-1234.cloud.databricks.com" ] }, "catalog": { "type": "string", "description": "The name of the Hive or Unity catalog" }, "schema": { "type": "string", "description": "The schema name in the catalog" } }, "additionalProperties": true, "required": [ "type", "catalog", "schema" ] }, { "type": "object", "title": "DataframeServer", "properties": { "type": { "type": "string", "const": "dataframe", "description": "The type of the data product technology that implements the data contract." 
} }, "additionalProperties": true, "required": [ "type" ] }, { "type": "object", "title": "GlueServer", "properties": { "type": { "type": "string", "const": "glue", "description": "The type of the data product technology that implements the data contract." }, "account": { "type": "string", "description": "The AWS Glue account", "examples": [ "1234-5678-9012" ] }, "database": { "type": "string", "description": "The AWS Glue database name", "examples": [ "my_database" ] }, "location": { "type": "string", "format": "uri", "description": "The AWS S3 path. Must be in the form of a URL.", "examples": [ "s3://datacontract-example-orders-latest/data/{model}" ] }, "format": { "type": "string", "description": "The format of the files", "examples": [ "parquet", "csv", "json", "delta" ] } }, "additionalProperties": true, "required": [ "type", "account", "database" ] }, { "type": "object", "title": "PostgresServer", "properties": { "type": { "type": "string", "const": "postgres", "description": "The type of the data product technology that implements the data contract." }, "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server." }, "database": { "type": "string", "description": "The name of the database.", "examples": [ "postgres" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "public" ] } }, "additionalProperties": true, "required": [ "type", "host", "port", "database", "schema" ] }, { "type": "object", "title": "OracleServer", "properties": { "type": { "type": "string", "const": "oracle", "description": "The type of the data product technology that implements the data contract." 
}, "host": { "type": "string", "description": "The host to the oracle server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the oracle server.", "examples": [ 1523 ] }, "serviceName": { "type": "string", "description": "The name of the service.", "examples": [ "service" ] } }, "additionalProperties": true, "required": [ "type", "host", "port", "serviceName" ] }, { "type": "object", "title": "KafkaServer", "description": "Kafka Server", "properties": { "type": { "type": "string", "enum": [ "kafka" ], "description": "The type of the data product technology that implements the data contract." }, "host": { "type": "string", "description": "The bootstrap server of the kafka cluster." }, "topic": { "type": "string", "description": "The topic name." }, "format": { "type": "string", "description": "The format of the message. Examples: json, avro, protobuf. Default: json.", "default": "json" } }, "additionalProperties": true, "required": [ "type", "host", "topic" ] }, { "type": "object", "title": "PubSubServer", "properties": { "type": { "type": "string", "enum": [ "pubsub" ], "description": "The type of the data product technology that implements the data contract." }, "project": { "type": "string", "description": "The GCP project name." }, "topic": { "type": "string", "description": "The topic name." } }, "additionalProperties": true, "required": [ "type", "project", "topic" ] }, { "type": "object", "title": "KinesisDataStreamsServer", "description": "Kinesis Data Streams Server", "properties": { "type": { "type": "string", "enum": [ "kinesis" ], "description": "The type of the data product technology that implements the data contract." }, "stream": { "type": "string", "description": "The name of the Kinesis data stream." 
}, "region": { "type": "string", "description": "AWS region.", "examples": [ "eu-west-1" ] }, "format": { "type": "string", "description": "The format of the record", "examples": [ "json", "avro", "protobuf" ] } }, "additionalProperties": true, "required": [ "type", "stream" ] }, { "type": "object", "title": "TrinoServer", "properties": { "type": { "type": "string", "const": "trino", "description": "The type of the data product technology that implements the data contract." }, "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server." }, "catalog": { "type": "string", "description": "The name of the catalog.", "examples": [ "hive" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "my_schema" ] } }, "additionalProperties": true, "required": [ "type", "host", "port", "catalog", "schema" ] }, { "type": "object", "title": "LocalServer", "properties": { "type": { "type": "string", "enum": [ "local" ], "description": "The type of the data product technology that implements the data contract." }, "path": { "type": "string", "description": "The relative or absolute path to the data file(s).", "examples": [ "./folder/data.parquet", "./folder/*.parquet" ] }, "format": { "type": "string", "description": "The format of the file(s)", "examples": [ "json", "parquet", "delta", "csv" ] } }, "additionalProperties": true, "required": [ "type", "path", "format" ] } ] }, "description": "Information about the servers." }, "terms": { "type": "object", "description": "The terms and conditions of the data contract.", "properties": { "usage": { "type": "string", "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." 
}, "limitations": { "type": "string", "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." }, "billing": { "type": "string", "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." }, "noticePeriod": { "type": "string", "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." } }, "additionalProperties": true }, "models": { "description": "Specifies the logical data model. Use the models name (e.g., the table name) as the key.", "type": "object", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "object", "title": "Model", "properties": { "description": { "type": "string" }, "type": { "description": "The type of the model. Examples: table, view, object. Default: table.", "type": "string", "title": "ModelType", "default": "table", "enum": [ "table", "view", "object" ] }, "title": { "type": "string", "description": "An optional string providing a human readable name for the model. Especially useful if the model name is cryptic or contains abbreviations.", "examples": ["Purchase Orders", "Air Shipments"] }, "fields": { "description": "Specifies a field in the data model. Use the field name (e.g., the column name) as the key.", "type": "object", "additionalProperties": { "type": "object", "title": "Field", "properties": { "description": { "type": "string", "description": "An optional string describing the semantic of the data in this field." }, "title": { "type": "string", "description": "An optional string providing a human readable name for the field. Especially useful if the field name is cryptic or contains abbreviations." 
}, "type": { "$ref": "#/$defs/FieldType" }, "required": { "type": "boolean", "default": false, "description": "An indication, if this field must contain a value and may not be null." }, "fields": { "description": "The nested fields (e.g. columns) of the object, record, or struct.", "type": "object", "additionalProperties": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" } }, "items": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "keys": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "values": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "primary": { "type": "boolean", "default": false, "description": "If this field is a primary key." }, "references": { "type": "string", "description": "The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship.", "examples": [ "orders.order_id", "model.nested_field.field" ] }, "unique": { "type": "boolean", "default": false, "description": "An indication, if the value must be unique within the model." }, "enum": { "type": "array", "items": { "type": "string" }, "uniqueItems": true, "description": "A value must be equal to one of the elements in this array value. Only evaluated if the value is not null." }, "minLength": { "type": "integer", "description": "A value must be greater than, or equal to, the value of this. Only applies to string types." }, "maxLength": { "type": "integer", "description": "A value must be less than, or equal to, the value of this. Only applies to string types."
}, "format": { "type": "string", "description": "A specific format the value must comply with (e.g., 'email', 'uri', 'uuid').", "examples": [ "email", "uri", "uuid" ] }, "precision": { "type": "number", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "number", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression the value must match. Only applies to string types.", "examples": [ "^[a-zA-Z0-9_-]+$" ] }, "minimum": { "type": "number", "description": "A value of a number must be greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMinimum": { "type": "number", "description": "A value of a number must be greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "maximum": { "type": "number", "description": "A value of a number must be less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMaximum": { "type": "number", "description": "A value of a number must be less than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "example": { "type": "string", "description": "An example value for this field." }, "pii": { "type": "boolean", "description": "An indication, if this field contains Personally Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field, according to the organization's classification scheme.", "examples": [ "sensitive", "restricted", "internal", "public" ] }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context."
}, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } }, "$ref": { "type": "string", "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." }, "config": { "type": "object", "description": "Additional metadata for field configuration.", "additionalProperties": { "type": [ "string", "number", "boolean", "object", "array", "null" ] }, "properties": { "avroType": { "type": "string", "description": "Specify the field type to use when exporting the data model to Apache Avro." }, "avroLogicalType": { "type": "string", "description": "Specify the logical field type to use when exporting the data model to Apache Avro." }, "bigqueryType": { "type": "string", "description": "Specify the physical column type that is used in a BigQuery table, e.g., `NUMERIC(5, 2)`." }, "snowflakeType": { "type": "string", "description": "Specify the physical column type that is used in a Snowflake table, e.g., `TIMESTAMP_LTZ`." }, "redshiftType": { "type": "string", "description": "Specify the physical column type that is used in a Redshift table, e.g., `SMALLINT`." }, "sqlserverType": { "type": "string", "description": "Specify the physical column type that is used in a SQL Server table, e.g., `DATETIME2`." }, "databricksType": { "type": "string", "description": "Specify the physical column type that is used in a Databricks Unity Catalog table." }, "glueType": { "type": "string", "description": "Specify the physical column type that is used in an AWS Glue Data Catalog table." 
} } } } } }, "config": { "type": "object", "description": "Additional metadata for model configuration.", "additionalProperties": { "type": [ "string", "number", "boolean", "object", "array", "null" ] }, "properties": { "avroNamespace": { "type": "string", "description": "The namespace to use when importing and exporting the data model from / to Apache Avro." } } } } } }, "definitions": { "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "type": "object", "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "object", "title": "Definition", "properties": { "domain": { "type": "string", "description": "The domain in which this definition is valid.", "default": "global" }, "name": { "type": "string", "description": "The technical name of this definition." }, "title": { "type": "string", "description": "The business name of this definition." }, "description": { "type": "string", "description": "Clear and concise explanations related to the domain." }, "type": { "$ref": "#/$defs/FieldType" }, "fields": { "description": "The nested fields (e.g. columns) of the object, record, or struct.", "type": "object", "additionalProperties": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" } }, "items": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "keys": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "values": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "minLength": { "type": "integer", "description": "A value must be greater than or equal to this value. Applies only to string types." }, "maxLength": { "type": "integer", "description": "A value must be less than or equal to this value. Applies only to string types." 
}, "format": { "type": "string", "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." }, "precision": { "type": "integer", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "integer", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression pattern the value must match. Applies only to string types." }, "minimum": { "type": "number", "description": "A value of a number must be greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMinimum": { "type": "number", "description": "A value of a number must be greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "maximum": { "type": "number", "description": "A value of a number must be less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMaximum": { "type": "number", "description": "A value of a number must be less than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "example": { "type": "string", "description": "An example value." }, "pii": { "type": "boolean", "description": "Indicates if the field contains Personally Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field." }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context."
}, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } } }, "required": [ "name", "type" ] } }, "schema": { "type": "object", "properties": { "type": { "type": "string", "title": "SchemaType", "enum": [ "dbt", "bigquery", "json-schema", "sql-ddl", "avro", "protobuf", "custom" ], "description": "The type of the schema. Typical values are dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom." }, "specification": { "oneOf": [ { "type": "string", "description": "The specification of the schema as a string." }, { "type": "object", "description": "The specification of the schema as an object." } ] } }, "required": [ "type", "specification" ], "description": "The schema of the data contract describes the syntax and semantics of provided data sets. It supports different schema types." }, "examples": { "type": "array", "items": { "type": "object", "properties": { "type": { "type": "string", "title": "ExampleType", "enum": [ "csv", "json", "yaml", "custom" ], "description": "The type of the example data. Well-known types are csv, json, yaml, custom." }, "description": { "type": "string", "description": "An optional string describing the example." }, "model": { "type": "string", "description": "The reference to the model in the schema, e.g., a table name." }, "data": { "oneOf": [ { "type": "string", "description": "Example data for this model." }, { "type": "array", "description": "Example data for this model in a structured format. Use this for type json or yaml." } ] } }, "required": [ "type", "data" ] }, "description": "The Examples Object is an array of Example Objects." 
}, "servicelevels": { "type": "object", "description": "Specifies the service level agreements for the provided data, including availability, data retention policies, latency requirements, data freshness, update frequency, support availability, and backup policies.", "properties": { "availability": { "type": "object", "description": "Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data.", "properties": { "description": { "type": "string", "description": "An optional string describing the availability service level.", "example": "The server is available during support hours" }, "percentage": { "type": "string", "description": "An optional string describing the guaranteed uptime in percent (e.g., `99.9%`)", "pattern": "^\\d+(\\.\\d+)?%$", "example": "99.9%" } } }, "retention": { "type": "object", "description": "Retention covers the period how long data will be available.", "properties": { "description": { "type": "string", "description": "An optional string describing the retention service level.", "example": "Data is retained for one year." }, "period": { "type": "string", "description": "An optional period of time, how long data is available. Supported formats: Simple duration (e.g., `1 year`, `30d`) and ISO 8601 duration (e.g, `P1Y`).", "example": "P1Y" }, "unlimited": { "type": "boolean", "description": "An optional indicator that data is kept forever.", "example": false }, "timestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp that the period refers to.", "example": "orders.order_timestamp" } } }, "latency": { "type": "object", "description": "Latency refers to the maximum amount of time from the source to its destination.", "properties": { "description": { "type": "string", "description": "An optional string describing the latency service level.", "example": "Data is available within 25 hours after the order was placed." 
}, "threshold": { "type": "string", "description": "An optional maximum duration between the source timestamp and the processed timestamp. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g, `PT24H`).", "example": "25h" }, "sourceTimestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp when the data was provided at the source.", "example": "orders.order_timestamp" }, "processedTimestampField": { "type": "string", "description": "An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract.", "example": "orders.processed_timestamp" } } }, "freshness": { "type": "object", "description": "The maximum age of the youngest row in a table.", "properties": { "description": { "type": "string", "description": "An optional string describing the freshness service level.", "example": "The age of the youngest row in a table is within 25 hours." }, "threshold": { "type": "string", "description": "An optional maximum age of the youngest entry. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`).", "example": "25h" }, "timestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp that the threshold refers to.", "example": "orders.order_timestamp" } } }, "frequency": { "type": "object", "description": "Frequency describes how often data is updated.", "properties": { "description": { "type": "string", "description": "An optional string describing the frequency service level.", "example": "Data is delivered once a day." }, "type": { "type": "string", "enum": [ "batch", "micro-batching", "streaming", "manual" ], "description": "The method of data processing.", "example": "batch" }, "interval": { "type": "string", "description": "Optional. 
Only for batch: How often the pipeline is triggered, e.g., `daily`.", "example": "daily" }, "cron": { "type": "string", "description": "Optional. Only for batch: A cron expression when the pipelines is triggered. E.g., `0 0 * * *`.", "example": "0 0 * * *" } } }, "support": { "type": "object", "description": "Support describes the times when support will be available for contact.", "properties": { "description": { "type": "string", "description": "An optional string describing the support service level.", "example": "The data is available during typical business hours at headquarters." }, "time": { "type": "string", "description": "An optional string describing the times when support will be available for contact such as `24/7` or `business hours only`.", "example": "9am to 5pm in EST on business days" }, "responseTime": { "type": "string", "description": "An optional string describing the time it takes for the support team to acknowledge a request. This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with.", "example": "24 hours" } } }, "backup": { "type": "object", "description": "Backup specifies details about data backup procedures.", "properties": { "description": { "type": "string", "description": "An optional string describing the backup service level.", "example": "Data is backed up once a week, every Sunday at 0:00 UTC." 
}, "interval": { "type": "string", "description": "An optional interval that defines how often data will be backed up, e.g., `daily`.", "example": "weekly" }, "cron": { "type": "string", "description": "An optional cron expression when data will be backed up, e.g., `0 0 * * *`.", "example": "0 0 * * 0" }, "recoveryTime": { "type": "string", "description": "An optional Recovery Time Objective (RTO) specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours).", "example": "24 hours" }, "recoveryPoint": { "type": "string", "description": "An optional Recovery Point Objective (RPO) defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours).", "example": "1 week" } } } } }, "quality": { "type": "object", "properties": { "type": { "type": "string", "title": "QualityType", "enum": [ "SodaCL", "montecarlo", "great-expectations", "custom" ], "description": "The type of the quality check. Typical values are SodaCL, montecarlo, great-expectations, custom." }, "specification": { "oneOf": [ { "type": "string", "description": "The specification of the quality attributes as a string." }, { "type": "object", "description": "The specification of the quality attributes as an object." } ] } }, "required": [ "type", "specification" ], "description": "The quality object contains quality attributes and checks." 
}, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } }, "tags": { "type": "array", "items": { "type": "string", "description": "Tags to facilitate searching and filtering.", "examples": [ "databricks", "pii", "sensitive" ] }, "description": "Tags to facilitate searching and filtering." } }, "required": [ "dataContractSpecification", "id", "info" ], "$defs": { "FieldType": { "type": "string", "title": "FieldType", "description": "The logical data type of the field.", "enum": [ "number", "decimal", "numeric", "int", "integer", "long", "bigint", "float", "double", "string", "text", "varchar", "boolean", "timestamp", "timestamp_tz", "timestamp_ntz", "date", "array", "map", "object", "record", "struct", "bytes", "null" ] } } } ================================================ FILE: versions/0.9.3/definition.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "properties": { "domain": { "type": "string", "description": "The domain in which this definition is valid.", "default": "global" }, "name": { "type": "string", "description": "The technical name of this definition." }, "title": { "type": "string", "description": "The business name of this definition." }, "description": { "type": "string", "description": "Clear and concise explanations related to the domain." }, "type": { "type": "string", "description": "The logical data type." }, "minLength": { "type": "integer", "description": "A value must be greater than or equal to this value. Applies only to string types." 
}, "maxLength": { "type": "integer", "description": "A value must be less than or equal to this value. Applies only to string types." }, "format": { "type": "string", "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." }, "precision": { "type": "integer", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "integer", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression pattern the value must match. Applies only to string types." }, "example": { "type": "string", "description": "An example value." }, "pii": { "type": "boolean", "description": "Indicates if the field contains Personal Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field." }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." }, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } } }, "required": [ "name", "type" ] }

================================================
FILE: versions/1.1.0/README.md
================================================
# Data Contract Specification

![datacontract.png](images/datacontract.png)

Data contracts bring data providers and data consumers together.

A _data contract_ is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers.
Think of an API, but for data.
A data contract is implemented by a data product or other data technologies, even legacy data warehouses.
Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees.

The _data contract specification_ defines a YAML format to describe attributes of provided data sets.
It is data platform neutral and can be used with any data platform, such as AWS S3, Google BigQuery, Azure, Databricks, and Snowflake.
The data contract specification is an open initiative to define a common data contract format.
It follows [OpenAPI](https://www.openapis.org/) and [AsyncAPI](https://www.asyncapi.com/) conventions.

Data contracts come into play when data is exchanged between different teams or organizational units, such as in a [data mesh architecture](https://www.datamesh-architecture.com/).
First and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted.
They make semantic and quality expectations explicit.
They are often created collaboratively in [workshops](./workshop.md) together with data providers and data consumers.
Later in development and production, they also serve as the basis for code generation, testing, schema validations, quality checks, monitoring, access control, and computational governance policies.

The specification comes along with the [Data Contract CLI](https://github.com/datacontract/datacontract-cli), an open-source tool to develop, validate, and enforce data contracts.

> _Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties.
> The term "contract" may be somewhat misleading, but it is how it is used by the industry.
> The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract.
> Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes._

Version
---

1.1.0 ([Changelog](CHANGELOG.md))

Example
---

View in [Data Contract Catalog](https://datacontract.com/examples/index.html)

```yaml
dataContractSpecification: 1.1.0
id: urn:datacontract:checkout:orders-latest
info:
  title: Orders Latest
  version: 2.0.0
  description: |
    Successful customer orders in the webshop.
    All orders since 2020-01-01.
    Orders with their line items are in their current state (no history included).
  owner: Checkout Team
  status: active
  contact:
    name: John Doe (Data Product Owner)
    url: https://teams.microsoft.com/l/channel/example/checkout
servers:
  production:
    type: s3
    environment: prod
    location: s3://datacontract-example-orders-latest/v2/{model}/*.json
    format: json
    delimiter: new_line
    description: "One folder per model. One file per day."
    roles:
      - name: analyst_us
        description: Access to the data for US region
      - name: analyst_cn
        description: Access to the data for China region
terms:
  usage: |
    Data can be used for reports, analytics and machine learning use cases.
    Order may be linked and joined by other tables
  limitations: |
    Not suitable for real-time use cases.
    Data may not be used to identify individual customers.
    Max data processing per day: 10 TiB
  policies:
    - name: privacy-policy
      url: https://example.com/privacy-policy
    - name: license
      description: External data is licensed under agreement 1234.
      url: https://example.com/license/1234
  billing: 5000 USD per month
  noticePeriod: P3M
models:
  orders:
    description: One record per order. Includes cancelled and deleted orders.
    type: table
    fields:
      order_id:
        $ref: '#/definitions/order_id'
        required: true
        unique: true
        primaryKey: true
      order_timestamp:
        description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
        type: timestamp
        required: true
        examples:
          - "2024-09-09T08:30:00Z"
        tags: ["business-timestamp"]
      order_total:
        description: Total amount in the smallest monetary unit (e.g., cents).
        type: long
        required: true
        examples:
          - 9999
        quality:
          - type: sql
            description: 95% of all order total values are expected to be between 10 and 499 EUR.
            query: |
              SELECT quantile_cont(order_total, 0.95) AS percentile_95
              FROM orders
            mustBeBetween: [1000, 49900]
      customer_id:
        description: Unique identifier for the customer.
        type: text
        minLength: 10
        maxLength: 20
      customer_email_address:
        description: The email address, as entered by the customer.
        type: text
        format: email
        required: true
        pii: true
        classification: sensitive
        quality:
          - type: text
            description: The email address is not verified and may be invalid.
        lineage:
          inputFields:
            - namespace: com.example.service.checkout
              name: checkout_db.orders
              field: email_address
      processed_timestamp:
        description: The timestamp when the record was processed by the data platform.
        type: timestamp
        required: true
        config:
          jsonType: string
          jsonFormat: date-time
    quality:
      - type: sql
        description: The maximum duration between two orders should be less than 3600 seconds
        query: |
          SELECT MAX(duration) AS max_duration
          FROM (
            SELECT EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp))) AS duration
            FROM orders
          )
        mustBeLessThan: 3600
      - type: sql
        description: Row Count
        query: |
          SELECT count(*) as row_count
          FROM orders
        mustBeGreaterThan: 5
    examples:
      - |
        order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp
        "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z"
        "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z"
        "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z"
        "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
        "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
        "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z"
        "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z"
        "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z"
        "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z"
        "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z"
  line_items:
    description: A single article that is part of an order.
    type: table
    fields:
      line_item_id:
        type: text
        description: Primary key of the lines_item_id table
        required: true
      order_id:
        $ref: '#/definitions/order_id'
        references: orders.order_id
      sku:
        description: The purchased article number
        $ref: '#/definitions/sku'
    primaryKey: ["order_id", "line_item_id"]
    examples:
      - |
        line_item_id,order_id,sku
        "LI-1","1001","5901234123457"
        "LI-2","1001","4001234567890"
        "LI-3","1002","5901234123457"
        "LI-4","1002","2001234567893"
        "LI-5","1003","4001234567890"
        "LI-6","1003","5001234567892"
        "LI-7","1004","5901234123457"
        "LI-8","1005","2001234567893"
        "LI-9","1005","5001234567892"
        "LI-10","1005","6001234567891"
definitions:
  order_id:
    title: Order ID
    type: text
    format: uuid
    description: An internal ID that identifies an order in the online shop.
    examples:
      - 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
    pii: true
    classification: restricted
    tags:
      - orders
  sku:
    title: Stock Keeping Unit
    type: text
    pattern: ^[A-Za-z0-9]{8,14}$
    examples:
      - "96385074"
    description: |
      A Stock Keeping Unit (SKU) is an internal unique identifier for an article.
      It is typically associated with an article's barcode, such as the EAN/GTIN.
    links:
      wikipedia: https://en.wikipedia.org/wiki/Stock_keeping_unit
    tags:
      - inventory
servicelevels:
  availability:
    description: The server is available during support hours
    percentage: 99.9%
  retention:
    description: Data is retained for one year
    period: P1Y
    unlimited: false
  latency:
    description: Data is available within 25 hours after the order was placed
    threshold: 25h
    sourceTimestampField: orders.order_timestamp
    processedTimestampField: orders.processed_timestamp
  freshness:
    description: The age of the youngest row in a table.
    threshold: 25h
    timestampField: orders.order_timestamp
  frequency:
    description: Data is delivered once a day
    type: batch # or streaming
    interval: daily # for batch, either or cron
    cron: 0 0 * * * # for batch, either or interval
  support:
    description: The data is available during typical business hours at headquarters
    time: 9am to 5pm in EST on business days
    responseTime: 1h
  backup:
    description: Data is backed up once a week, every Sunday at 0:00 UTC.
    interval: weekly
    cron: 0 0 * * 0
    recoveryTime: 24 hours
    recoveryPoint: 1 week
tags:
  - checkout
  - orders
  - s3
links:
  datacontractCli: https://cli.datacontract.com
```

Data Contract CLI
---

The [Data Contract CLI](https://cli.datacontract.com) is a command line tool and Python library to lint, test, import and export data contracts.

Here is a short example of how to verify that your actual dataset matches the data contract:

```bash
pip3 install "datacontract-cli[all]"
datacontract test https://datacontract.com/examples/orders-latest/datacontract.yaml
```

or, if you prefer Docker:

```bash
docker run datacontract/cli test https://datacontract.com/examples/orders-latest/datacontract.yaml
```

The Data Contract contains all required information to verify data:

- The _servers_ block has the connection details to the actual data set.
- The _models_ define the syntax, formats, and constraints.
- The _quality_ block defines further quality checks.

The Data Contract CLI chooses the appropriate engine, formulates test cases, connects to the server, and executes the tests, based on the server type.

More information and configuration options on [cli.datacontract.com](https://cli.datacontract.com).
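
To make the idea of deriving test cases from the _models_ block more concrete, here is a minimal sketch in plain Python. It is not how the Data Contract CLI is implemented and does not use its API. The `ORDERS_FIELDS` dictionary, the `check_rows` helper, and the shortened CSV are assumptions made for illustration, hand-copied from the `orders` model in the example above rather than parsed from `datacontract.yaml`.

```python
import csv
import io

# Hypothetical, hand-copied subset of the `orders` model from the example above.
# A real tool would load these constraints from datacontract.yaml.
ORDERS_FIELDS = {
    "order_id": {"type": "text", "required": True, "unique": True},
    "order_total": {"type": "long", "required": True},
    "customer_id": {"type": "text", "minLength": 10, "maxLength": 20},
}

# Two rows adapted from the `examples` block of the `orders` model.
EXAMPLE_CSV = """order_id,order_total,customer_id
1001,2500,1000000001
1002,1800,1000000002
"""


def check_rows(fields, rows):
    """Return a list of constraint violations, one message per violation."""
    violations = []
    # Track the values already seen for every field declared `unique`.
    seen = {name: set() for name, spec in fields.items() if spec.get("unique")}
    for row_no, row in enumerate(rows, start=1):
        for name, spec in fields.items():
            value = row.get(name)
            if value is None or value == "":
                if spec.get("required"):
                    violations.append(f"row {row_no}: {name} is required")
                continue
            if spec.get("type") == "long":
                try:
                    int(value)
                except ValueError:
                    violations.append(f"row {row_no}: {name} is not an integer")
            if "minLength" in spec and len(value) < spec["minLength"]:
                violations.append(f"row {row_no}: {name} is shorter than {spec['minLength']}")
            if "maxLength" in spec and len(value) > spec["maxLength"]:
                violations.append(f"row {row_no}: {name} is longer than {spec['maxLength']}")
            if name in seen:
                if value in seen[name]:
                    violations.append(f"row {row_no}: {name} is not unique")
                seen[name].add(value)
    return violations


rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
print(check_rows(ORDERS_FIELDS, rows))  # prints [] because both rows satisfy the constraints
```

The sketch only covers `required`, `unique`, `minLength`, `maxLength`, and an integer check for `long`. The CLI additionally evaluates the _quality_ checks and connects to the configured server, which is out of scope here.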
Specification --- ![The eight major categories in the data contract specification](images/categories.png) - [Data Contract Object](#data-contract-object) - [Info Object](#info-object) - [Contact Object](#contact-object) - [Server Object](#server-object) - [Terms Object](#terms-object) - [Model Object](#model-object) - [Field Object](#field-object) - [Definition Object](#definition-object) - [Service Level Object](#service-levels-object) - [Quality Object](#quality-object) - [Lineage Object](#lineage-object) - [Data Types](#data-types) - [Specification Extensions](#specification-extensions) [JSON Schema](https://github.com/datacontract/datacontract-specification/blob/main/datacontract.schema.json) of the Data Contract Specification. ### Data Contract Object This is the root document. It is _RECOMMENDED_ that the root document be named: `datacontract.yaml`. | Field | Type | Description | |---------------------------|--------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------| | dataContractSpecification | `string` | REQUIRED. Specifies the Data Contract Specification being used. | | id | `string` | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number | | info | [Info Object](#info-object) | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. | | servers | Map[`string`, [Server Object](#server-object)] | Specifies the servers of the data contract. | | terms | [Terms Object](#terms-object) | Specifies the terms and conditions of the data contract. | | models | Map[`string`, [Model Object](#model-object)] | Specifies the logical data model. | | definitions | Map[`string`, [Definition Object](#definition-object)] | Specifies definitions. 
| | servicelevels | [Service Levels Object](#service-levels-object) | Specifies the service level of the provided data | | links | Map[`string`, `string`] | Additional external documentation links. | | tags | Array of `string` | Custom metadata to provide additional context. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Info Object Metadata and life cycle information about the data contract. | Field | Type | Description | |-------------|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| | title | `string` | REQUIRED. The title of the data contract. | | version | `string` | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). | | status | `string` | The status of the data contract. Can be `proposed`, `in development`, `active`, `deprecated`, `retired`. | | description | `string` | A description of the data contract. | | owner | `string` | The owner or team responsible for managing the data contract and providing the data. | | contact | [Contact Object](#contact-object) | Contact information for the data contract. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Contact Object Contact information for the data contract. | Field | Type | Description | |-------|----------|-------------------------------------------------------------------------------------------------------| | name | `string` | The identifying name of the contact person/organization. | | url | `string` | The URL pointing to the contact information. This _MUST_ be in the form of a URL. | | email | `string` | The email address of the contact person/organization. This _MUST_ be in the form of an email address. 
| This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Server Object The fields are dependent on the defined type. | Field | Type | Description | |-------------|----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | REQUIRED. The type of the data product technology that implements the data contract. Well-known server types are: `bigquery`, `s3`, `glue`, `redshift`, `azure`, `sqlserver`, `snowflake`, `databricks`, `postgres`, `oracle`, `kafka`, `pubsub`, `sftp`, `kinesis`, `trino`, `local` | | description | `string` | An optional string describing the server. | | environment | `string` | An optional string describing the environment, e.g., prod, sit, stg. | | roles | Array of [Server Role Object](#server-role-object) | An optional array of roles that are available and can be requested to access the server for role-based access control. E.g. separate roles for different regions or sensitive data. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### BigQuery Server Object | Field | Type | Description | |---------|----------|-----------------------| | type | `string` | `bigquery` | | project | `string` | The GCP project name. 
| | dataset | `string` | | #### S3 Server Object | Field | Type | Description | |-------------|----------|-------------------------------------------------------------------------------------------------------------------------| | type | `string` | `s3` | | location | `string` | S3 URL, starting with `s3://` | | endpointUrl | `string` | The server endpoint for S3-compatible servers, such as MioIO or Google Cloud Storage, e.g., `https://minio.example.com` | | format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` | | delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` | Example (AWS S3): ```yaml servers: production: type: s3 location: s3://acme-orders-prod/orders/ format: json delimiter: new_line ``` Example (MinIO): ```yaml servers: minio: type: s3 endpointUrl: http://localhost:9000 location: s3://my-bucket/path/ format: delta ``` Example (Google Cloud Storage): ```yaml servers: gcs: type: s3 endpointUrl: https://storage.googleapis.com location: s3://my-bucket/path/*/*/*/*/*.parquet format: parquet ``` #### Redshift Server Object | Field | Type | Description | |-------------------|----------|---------------------------------------------------------------------------------------------------------------------| | type | `string` | `redshift` | | account | `string` | | | database | `string` | | | schema | `string` | | | clusterIdentifier | `string` | Identifier of the cluster.
Example: `analytics-cluster` | | host | `string` | Host of the cluster.
Example: `analytics-cluster.example.eu-west-1.redshift.amazonaws.com` | | port | `number` | Port of the cluster.
Example: `5439` | | endpoint | `string` | Endpoint of the cluster
Example: `analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics` | Example, specifying an endpoint: ```yaml servers: analytics: type: redshift account: '123456789012' database: analytics schema: analytics endpoint: analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics ``` Example, specifying the cluster identifier: ```yaml servers: analytics: type: redshift account: '123456789012' database: analytics schema: analytics clusterIdentifier: analytics-cluster ``` Example, specifying the cluster host: ```yaml servers: analytics: type: redshift account: '123456789012' database: analytics schema: analytics host: analytics-cluster.example.eu-west-1.redshift.amazonaws.com port: 5439 ``` #### Azure Server Object | Field | Type | Description | |----------------|----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | `azure` | | storageAccount | `string` | The storage account name that contains the files | | location | `string` | Path to Azure Blob Storage or Azure Data Lake Storage (ADLS) in the storage account, supports globs. Starting with `az://` or `abfss`
Recommended pattern is `abfss:///`, Examples: `az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet` or `abfss://my_container_name/path/*.parquet` | | format | `string` | Format of files, such as `parquet`, `json`, `csv` | | delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` | #### SQL-Server Server Object | Field | Type | Description | |----------|-----------|--------------------------------------------------------------------------| | type | `string` | `sqlserver` | | host | `string` | The host to the database server | | port | `integer` | The port to the database server, default: `1433` | | database | `string` | The name of the database, e.g., `database`. | | schema | `string` | The name of the schema in the database, e.g., `dbo`. | | driver | `string` | The name of the supported driver, e.g., `ODBC Driver 18 for SQL Server`. | #### Snowflake Server Object | Field | Type | Description | |----------|----------|-------------| | type | `string` | `snowflake` | | account | `string` | | | database | `string` | | | schema | `string` | | #### Databricks Server Object | Field | Type | Description | |---------|----------|---------------------------------------------------------------------| | type | `string` | `databricks` | | host | `string` | The Databricks host, e.g., `dbc-abcdefgh-1234.cloud.databricks.com` | | catalog | `string` | The name of the Hive or Unity catalog | | schema | `string` | The schema name in the catalog | #### Postgres Server Object | Field | Type | Description | |----------|-----------|---------------------------------------------------------| | type | `string` | `postgres` | | host | `string` | The host to the database server | | port | `integer` | The port to the database server | | database | `string` | The name of the database, e.g., `postgres`. | | schema | `string` | The name of the schema in the database, e.g., `public`. 
| #### Oracle Server Object | Field | Type | Description | |-------------|-----------|---------------------------------| | type | `string` | `oracle` | | host | `string` | The host to the oracle server | | port | `integer` | The port to the oracle server | | serviceName | `string` | The name of the service | #### Kafka Server Object | Field | Type | Description | |--------|----------|---------------------------------------------------------------------------| | type | `string` | `kafka` | | host | `string` | The bootstrap server of the kafka cluster. | | topic | `string` | The topic name. | | format | `string` | The format of the message. Examples: json, avro, protobuf. Default: json. | #### Pub/Sub Server Object | Field | Type | Description | |---------|----------|-----------------------| | type | `string` | `pubsub` | | project | `string` | The GCP project name. | | topic | `string` | The topic name. | #### sftp Server Object | Field | Type | Description | |-----------|----------|------------------------------------------------------------------------------------------------------------------| | type | `string` | `sftp` | | location | `string` | S3 URL, starting with `sftp://` | | format | `string` | Format of files, such as `parquet`, `delta`, `json`, `csv` | | delimiter | `string` | (Only for format = `json`), how multiple json documents are delimited within one file, e.g., `new_line`, `array` | #### AWS Kinesis Data Streams Server Object | Field | Type | Description | |--------|----------|---------------------------------------------------------------------------| | type | `string` | `kinesis` | | stream | `string` | The name of the Kinesis data stream. | | region | `string` | AWS region, e.g., `eu-west-1`. | | format | `string` | The format of the records. Examples: json, avro, protobuf. 
| #### Trino Server Object | Field | Type | Description | |----------|-----------|-----------------------------------------------------------| | type | `string` | `trino` | | host | `string` | The Trino host | | port | `integer` | The Trino port | | catalog | `string` | The name of the catalog, e.g., `my_catalog`. | | schema | `string` | The name of the schema in the catalog, e.g., `my_schema`. | #### Local Server Object | Field | Type | Description | |--------|----------|-------------------------------------------------------------------------------------| | type | `string` | `local` | | path | `string` | The relative or absolute path to the data file(s), such as `./folder/data.parquet`. | | format | `string` | The format of the file(s), such as `parquet`, `delta`, `csv`, or `json`. | #### Server Role Object | Field | Type | Description | |-------------|----------|--------------------------------------------------------------| | name | `string` | Name of the role | | description | `string` | A description of the role and what access the role provides. | ### Terms Object The terms and conditions of the data contract. | Field | Type | Description | |--------------|------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | usage | `string` | The usage describes the way the data is expected to be used. Can contain business and technical information. | | limitations | `string` | The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for. | | policies | Array of [Policy Object](#policy-object) | A list of policies, licenses, standards, that are applicable for this data contract and that must be acknowledged by data consumers. 
| | billing | `string` | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. | | noticePeriod | `string` | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., `P3M` for a period of three months. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### Policy Object | Field | Type | Description | |-------------|----------|-----------------------------------| | name | `string` | Name of the policy. | | description | `string` | A description of the policy. | | url | `string` | An URL that refers to the policy. | ### Model Object The Model Object describes the structure and semantics of a data model, such as tables, views, or structured files. The name of the data model (table name) is defined by the key that refers to this Model Object. | Field | Type | Description | |-------------|----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | The type of the model. Examples: `table`, `view`, `object`. Default: `table`. | | description | `string` | An optional string describing the data model. | | title | `string` | An optional string for the title of the data model. Especially useful if the name of the model is cryptic or contains abbreviations. | | fields | Map[`string`, [Field Object](#field-object)] | The fields (e.g. columns) of the data model. | | primaryKey | Array of `string` | If the primary key is a compound key, list the field names that constitute the primary key. Alternative to field-level `primaryKey`. | | quality | Array of [Quality Object](#quality-object) | Specifies the quality attributes on model level. | | examples | Array of `Any` | Specifies example data sets for the model. 
| | config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Field Object The Field Objects describes one field (column, property, nested field) of a data model. | Field | Type | Description | |------------------|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | description | `string` | An optional string describing the semantic of the data in this field. | | type | [Data Type](#data-types) | The logical data type of the field. | | title | `string` | An optional string providing a human readable name for the field. Especially useful if the field name is cryptic or contains abbreviations. | | enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. | | required | `boolean` | An indication, if this field must contain a value and may not be null. Default: `false` | | primaryKey | `boolean` | If this field is a primary key. Default: `false` | | references | `string` | The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship. | | unique | `boolean` | An indication, if the value must be unique within the model. Default: `false` | | format | `string` | `email`: A value must be complaint to [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).
`uri`: A value must be compliant to [RFC 3986](https://www.rfc-editor.org/info/rfc3986).
`uuid`: A value must be complaint to [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | precision | `number` | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. | | scale | `number` | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. | | minLength | `number` | A value must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | maxLength | `number` | A value must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | pattern | `string` | A value must be valid according to the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | minimum | `number` | A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | exclusiveMinimum | `number` | A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | maximum | `number` | A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | exclusiveMaximum | `number` | A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | ~~example~~ | `string` | DEPRECATED, use examples. An example value. | | examples | Array of Any | A list of example values. 
| | pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). | | classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. Examples may be: `sensitive`, `restricted`, `internal`, `public`. | | tags | Array of `string` | Custom metadata to provide additional context. | | links | Map[`string`,`string`] | Additional external documentation links. | | $ref | `string` | A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition. | | fields | Map[`string`, [Field Object](#field-object)] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is `object`, `record`, or `struct`. | | items | [Field Object](#field-object) | The type of the elements in the array. Use only when type is `array`. | | keys | [Field Object](#field-object) | Describes the key structure of a map. Defaults to `type: string` if a map is defined as type. Not all server types support different key types. Use only when type is `map`. | | values | [Field Object](#field-object) | Describes the value structure of a map. Use only when type is `map`. | | quality | Array of [Quality Object](#quality-object) | Specifies the quality attributes on field level. | | lineage | [Lineage Object](#lineage-object) | Provides information where the data comes from. | | config | [Config Object](#config-object) | Any additional key-value pairs that might be useful for further tooling. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Definition Object The Definition Object includes a clear and concise explanations of syntax, semantic, and classification of a business object in a given domain. It serves as a reference for a common understanding of terminology, ensure consistent usage and to identify join-able fields. 
Models fields can refer to definitions using the `$ref` field to link to existing definitions and avoid duplicate documentations. | Field | Type | Description | |------------------|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | [Data Type](#data-types) | REQUIRED. The logical data type | | title | `string` | The business name of this definition. | | description | `string` | Clear and concise explanations related to the domain | | enum | array of `string` | A value must be equal to one of the elements in this array value. Only evaluated if the value is not null. | | format | `string` | `email`: A value must be complaint to [RFC 5321, section 4.1.2](https://www.rfc-editor.org/info/rfc5321).
`uri`: A value must be compliant to [RFC 3986](https://www.rfc-editor.org/info/rfc3986).
`uuid`: A value must be complaint to [RFC 4122](https://www.rfc-editor.org/info/rfc4122). Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | precision | `number` | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. | | scale | `number` | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. | | minLength | `number` | A value must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | maxLength | `number` | A value must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | pattern | `string` | A value must be valid according to the [ECMA-262](https://262.ecma-international.org/5.1/) regular expression dialect. Only evaluated if the value is not null. Only applies to unicode character sequences types (`string`, `text`, `varchar`). | | minimum | `number` | A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | exclusiveMinimum | `number` | A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | maximum | `number` | A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | exclusiveMaximum | `number` | A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values. | | examples | Array of Any | A list of example values. | | pii | `boolean` | An indication, if this field contains Personal Identifiable Information (PII). 
| | classification | `string` | The data class defining the sensitivity level for this field, according to the organization's classification scheme. | | tags | Array of `string` | Custom metadata to provide additional context. | | links | Map[`string`, `string`] | Additional external documentation links. | | fields | Map[`string`, [Field Object](#field-object)] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is `object`, `record`, or `struct`. | | items | [Field Object](#field-object) | The type of the elements in the array. Use only when type is `array`. | | keys | [Field Object](#field-object) | Describes the key structure of a map. Defaults to `type: string` if a map is defined as type. Not all server types support different key types. Use only when type is `map`. | | values | [Field Object](#field-object) | Describes the value structure of a map. Use only when type is `map`. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). ### Service Levels Object A service level is defined as an agreed-upon, measurable level of performance for provided the data. Data Contract Specification defines well-known service levels. This list can be extended with custom service levels. One can either describe each service level informally using the `description` field, or make use of the predefined fields for automation support, e.g., via the [Data Contract CLI](https://cli.datacontract.com). | Field | Type | Description | |--------------|-----------------------------------------------|-------------------------------------------------------------------------| | availability | [Availability Object](#availability-object) | The promised uptime of the system that provides the data | | retention | [Retention Object](#retention-object) | The period how long data will be available. | | latency | [Latency Object](#latency-object) | The maximum amount of time from the source to its destination. 
| | freshness | [Freshness Object](#freshness-object) | The maximum age of the youngest entry. | | frequency | [Frequency Object](#frequency-object) | The update frequency. | | support | [Support Object](#support-object) | The times when support is provided. | | backup | [Backup Object](#backup-object) | The details about data backup procedures. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### Availability Object Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data. | Field | Type | Description | |-------------|----------|--------------------------------------------------------------------------------| | description | `string` | An optional string describing the availability service level. | | percentage | `string` | An optional string describing the guaranteed uptime in percent (e.g., `99.9%`) | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### Retention Object Retention covers the period how long data will be available. | Field | Type | Description | |----------------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------| | description | `string` | An optional string describing the retention service level. | | period | `string` | An optional period of time, how long data is available. Supported formats: Simple duration (e.g., `1 year`, `30d`) and ISO 8601 duration (e.g, `P1Y`). | | unlimited | `boolean` | An optional indicator that data is kept forever. | | timestampField | `string` | An optional reference to the field that contains the timestamp that the period refers to. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### Latency Object Latency refers to the maximum amount of time from the source to its destination. 
Examples are the maximum duration it takes after an order has been recorded in the ecommerce shop until it is available in the orders table in the data analytics platform. This includes the waiting times until the next batch run is started and the processing time of the pipeline. | Field | Type | Description | |-------------------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | description | `string` | An optional string describing the latency service level. | | threshold | `string` | An optional maximum duration between the source timestamp and the processed timestamp. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g, `PT24H`). | | sourceTimestampField | `string` | An optional reference to the field that contains the timestamp when the data was provided at the source. | | processedTimestampField | `string` | An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### Freshness Object Freshness refers to the maximum age of the youngest entry. | Field | Type | Description | |-------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------| | description | `string` | An optional string describing the freshness service level. | | threshold | `string` | An optional maximum age of the youngest entry. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g, `PT24H`). | | timestampField | `string` | An optional reference to the field that contains the timestamp that the threshold refers to. 
| This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### Frequency Object Frequency describes how often data is updated. | Field | Type | Description | |-------------|----------|-----------------------------------------------------------------------------------------------------------| | description | `string` | An optional string describing the frequency service level. | | type | `string` | An optional type of data processing. Typical values are `batch`, `micro-batching`, `streaming`, `manual`. | | interval | `string` | Optional. Only for batch: How often the pipeline is triggered, e.g., `daily`. | | cron | `string` | Optional. Only for batch: A cron expression when the pipelines is triggered. E.g., `0 0 * * *`. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### Support Object Support describes the times when support will be available for contact. | Field | Type | Description | |--------------|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | description | `string` | An optional string describing the support service level. | | time | `string` | An optional string describing the times when support will be available for contact such as `24/7` or `business hours only`. | | responseTime | `string` | An optional string describing the time it takes for the support team to acknowledge a request. This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with. | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). #### Backup Object Backup specifies details about data backup procedures. 
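
As a minimal sketch, a backup service level could be declared as follows. The field names are the ones listed in the table for this object; the concrete values (a weekly schedule, a 24-hour recovery time, a one-week recovery point) are illustrative assumptions, not recommendations.

```yaml
servicelevels:
  backup:
    # Free-text summary of the backup promise (illustrative wording)
    description: Data is backed up once a week, every Sunday at 0:00 UTC.
    # Human-readable interval; the cron expression below states the same schedule
    interval: weekly
    cron: 0 0 * * 0
    # Recovery Time Objective (RTO): assumed value for illustration
    recoveryTime: 24 hours
    # Recovery Point Objective (RPO): assumed value for illustration
    recoveryPoint: 1 week
```

Keeping `interval` and `cron` consistent with each other avoids ambiguity for readers and tools that consume only one of the two.
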
| Field | Type | Description | |---------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | description | `string` | An optional string describing the backup service level. | | interval | `string` | An optional interval that defines how often data will be backed up, e.g., `daily`. | | cron | `string` | An optional cron expression when data will be backed up, e.g., `0 0 * * *`. | | recoveryTime | `string` | An optional Recovery Time Objective (RTO) specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours). | | recoveryPoint | `string` | An optional Recovery Point Objective (RPO) defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours). | ### Quality Object The quality object defines quality attributes. Quality attributes are checks that can be applied to the data to ensure its quality. Data can be verified by executing these checks through a data quality engine. Quality attributes can be: - A text in natural language that describes the quality of the data. - An individual SQL query that returns a single value that can be compared. - Engine-specific types: Pre-defined quality checks, as defined by data quality libraries. Currently, the engines `soda` and `great-expectations` are supported. A quality object can be specified on field level and on model level. The top-level quality object is deprecated. #### Description Text A description in natural language that defines the expected quality of the data. 
This is useful to express requirements or expectation when discussing the data contract with stakeholders. Later in the development process, these might be translated into an executable check (such as `sql`). It can also be used as a prompt to check the data with an AI engine. | Field | Type | Description | |-------------|----------|--------------------------------------------------------------------| | type | `string` | `text` | | description | `string` | A plain text describing the quality attribute in natural language. | Example: ```yaml models: my_table: fields: account_iban: quality: - type: text description: Must be a valid IBAN. Must not be empty. ``` #### SQL An individual SQL query that returns a single number that can be compared with a threshold. The SQL query must be in the SQL dialect of the provided server. > __Note:__ Establish a secure development process and use read-only connections, as the misuse of SQL queries can lead to SQL injection attacks. | Field | Type | Description | |----------------------------|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | `string` | `sql` | | description | `string` | A plain text describing the quality of the data. | | query | `string` | A SQL query that returns a single number to compare with the threshold. | | dialect | `string` | The SQL dialect that is used for the query. Should be compatible to the server type. Examples: `postgres`, `spark`, `bigquery`, `snowflake`, `duckdb`, ... 
|
| mustBe                     | `integer`             | The threshold to check the return value of the query |
| mustNotBe                  | `integer`             | The threshold to check the return value of the query |
| mustBeGreaterThan          | `integer`             | The threshold to check the return value of the query |
| mustBeGreaterThanOrEqualTo | `integer`             | The threshold to check the return value of the query |
| mustBeLessThan             | `integer`             | The threshold to check the return value of the query |
| mustBeLessThanOrEqualTo    | `integer`             | The threshold to check the return value of the query |
| mustBeBetween              | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. |
| mustNotBeBetween           | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive. |

In the query the following placeholders can be used:

| Placeholder | Description                                                                            |
|-------------|----------------------------------------------------------------------------------------|
| `{model}`   | The name of the model that is checked.                                                 |
| `{table}`   | Alias for `{model}`.                                                                   |
| `{field}`   | The name of the field that is checked (only if the quality is defined on field-level). |
| `{column}`  | Alias for `{field}`.                                                                   |

Example:

```yaml
models:
  orders:
    quality:
      - type: sql
        description: The maximum duration between two orders must be less than 3600 seconds
        query: |
          SELECT MAX(duration) AS max_duration
          FROM (
            SELECT EXTRACT(EPOCH FROM (order_timestamp - LAG(order_timestamp) OVER (ORDER BY order_timestamp))) AS duration
            FROM {model}
          )
        mustBeLessThan: 3600
```

SQL queries allow powerful checks for custom business logic.
A SQL query should not run longer than 10 minutes.

#### Custom

You can define custom quality attributes that are specific to a data quality engine.

#### Custom (Engine: Soda)

Soda has a number of predefined quality [checks](https://docs.soda.io/soda/data-contracts-checks.html) that can be referenced as quality attributes.
Soda checks can be applied on model and field level.
> Note: Soda Data contract check reference is experimental and may change in the future. Currently only supported by Postgres, Snowflake, and Spark (Databricks) | Field | Type | Description | |---------------|----------|-----------------------------------------------------------------------------------------------------------------------------| | type | `string` | `custom` | | description | `string` | Optional. A plain text describing the quality attribute in natural language. | | engine | `string` | `soda` | | implementation | `object` | A check type as defined in the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) | See the [Data contract check reference](https://docs.soda.io/soda/data-contracts-checks.html) for all possible types and configuration values. Example: ```yaml models: my_table: fields: order_id: type: string quality: - type: custom description: This is a check on field level engine: soda implementation: type: no_duplicate_values carrier: type: string shipment_numer: type: string quality: - type: custom description: This is a check on model level engine: soda implementation: type: duplicate_percent columns: - carrier - shipment_numer must_be_less_than: 1.0 - type: custom description: This is a check on model level engine: soda implementation: type: row_count must_be_greater_than: 500000 ``` #### Custom (Engine: Great Expectations) Quality attributes defined as Great Expectations [Expectation](https://greatexpectations.io/expectations/). Expectations are applied on model level. | Field | Type | Description | |---------------|----------|-----------------------------------------------------------------------------------------------------| | description | `string` | Optional. A plain text describing the quality attribute in natural language. | | engine | `string` | `great-expectations` | | implementation | `object` | An expectation type as listed in [Expectation](https://greatexpectations.io/expectations/) as YAML. 
|

Example:

```yaml
models:
  my_table:
    quality:
      - type: custom
        engine: great-expectations
        implementation:
          expectation_type: expect_table_row_count_to_be_between
          kwargs:
            min_value: 10000
            max_value: 50000
          meta:
            notes: "This expectation is crucial to avoid processing datasets that are too small or too large."
      - type: custom
        engine: great-expectations
        description: "Check that passenger_count values are between 1 and 6."
        implementation:
          expectation_type: expect_column_values_to_be_between
          kwargs:
            column: passenger_count
            max_value: 6
            min_value: 1
            mostly: 1.0
            strict_max: false
            strict_min: false
          meta:
            tags:
              - business-critical
              - range_check
```

### Lineage Object

Field level lineage provides optional fine-grained information about where the data comes from and how it was transformed.
The lineage object is based on the OpenLineage [Column Level Lineage Dataset Facet](https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet) to describe the input fields.

| Field       | Type                                             | Description |
|-------------|--------------------------------------------------|-------------|
| inputFields | Array of [InputField Object](#inputfield-object) | The input fields refer to specific fields, columns, or data points from source systems or other data contracts that feed into a particular transformation, calculation, or final result.
|

#### InputField Object

| Field           | Type                                                     | Description |
|-----------------|----------------------------------------------------------|-------------|
| namespace       | `string`                                                 | The input dataset namespace, such as the name of the source system or the domain of another data contract. Examples: `com.example.crm`, `checkout`, `snowflake://{account name}`. [More on namespace](https://openlineage.io/blog/whats-in-a-namespace/#namespaces-in-the-spec) |
| name            | `string`                                                 | The input dataset name, such as a reference to a data contract, a fully qualified table name, a Kafka topic. |
| field           | `string`                                                 | The input field name, such as the field in an upstream data contract, a table column or a JSON Path. |
| transformations | Array of [Transformation Object](#transformation-object) | Optional. This describes how the input field data was used to generate the final result. |

#### Transformation Object

| Field       | Type      | Description |
|-------------|-----------|-------------|
| type        | `string`  | Indicates how direct the relationship is, e.g., in a query. Allowed values are: `DIRECT` and `INDIRECT`. |
| subtype     | `string`  | Optional. Contains more specific information about the transformation.
Allowed values for type `DIRECT`: `IDENTITY`, `TRANSFORMATION`, `AGGREGATION`.
Allowed values for type `INDIRECT`: `JOIN`, `GROUP_BY`, `FILTER`, `SORT`, `WINDOW`, `CONDITIONAL`. |
| description | `string`  | Optional. A string representation of the transformation applied. |
| masking     | `boolean` | Optional. Boolean value indicating if the input value was obfuscated during the transformation. |

Example:

```yaml
models:
  orders:
    fields:
      order_id:
        type: string
        lineage:
          inputFields:
            - namespace: com.example.service.checkout
              name: checkout_db.orders
              field: order_id
              transformations:
                - type: DIRECT
                  subtype: IDENTITY
                  description: The order ID from the checkout order
            - namespace: com.example.service.checkout
              name: checkout_db.orders
              field: order_timestamp
              transformations:
                - type: INDIRECT
                  subtype: SORT
      customer_email_address_hash:
        type: string
        lineage:
          inputFields:
            - namespace: com.example.service.checkout
              name: checkout_db.orders
              field: email_address
              transformations:
                - type: DIRECT
                  subtype: TRANSFORMATION
                  description: The email address from the checkout order, hashed with SHA-256
                  masking: true
```

### Config Object

The config field can be used to set additional metadata that may be used by tools, e.g. to define a namespace for code generation, specify physical data types, toggle tests, etc.

A config field can be added with any name. The value can be null, a primitive, an array or an object.

For developer experience, a list of well-known field names is maintained here, as these fields are used in the Data Contract CLI:

| Field           | Type     | Description |
|-----------------|----------|-------------|
| avroNamespace   | `string` | (Only on model level) The namespace to use when importing and exporting the data model from / to Apache Avro. |
| avroType        | `string` | (Only on field level) Specify the field type to use when exporting the data model to Apache Avro.
| | avroLogicalType | `string` | (Only on field level) Specify the logical field type to use when exporting the data model to Apache Avro. | | bigqueryType | `string` | (Only on field level) Specify the physical column type that is used in a BigQuery table, e.g., `NUMERIC(5, 2)` | | snowflakeType | `string` | (Only on field level) Specify the physical column type that is used in a Snowflake table, e.g, `TIMESTAMP_LTZ` | | redshiftType | `string` | (Only on field level) Specify the physical column type that is used in a Redshift table, e.g, `SMALLINT` | | sqlserverType | `string` | (Only on field level) Specify the physical column type that is used in a SQL Server table, e.g, `DATETIME2` | | databricksType | `string` | (Only on field level) Specify the physical column type that is used in a Databricks table | | glueType | `string` | (Only on field level) Specify the physical column type that is used in a AWS Glue Data Catalog table | This object _MAY_ be extended with [Specification Extensions](#specification-extensions). 

Example:

```yaml
models:
  orders:
    config:
      avroNamespace: "my.namespace"
    fields:
      my_field_1:
        description: Example for AVRO with Timestamp (millisecond precision)
        type: timestamp
        config:
          avroType: long
          avroLogicalType: timestamp-millis
          snowflakeType: timestamp_tz
```

### Data Types

The following data types are supported for model fields and definitions:

- Unicode character sequence: `string`, `text`, `varchar`
- Any numeric type, either integers or floating point numbers: `number`, `decimal`, `numeric`
- 32-bit signed integer: `int`, `integer`
- 64-bit signed integer: `long`, `bigint`
- Single precision (32-bit) IEEE 754 floating-point number: `float`
- Double precision (64-bit) IEEE 754 floating-point number: `double`
- Binary value: `boolean`
- Timestamp with timezone: `timestamp`, `timestamp_tz`
- Timestamp with no timezone: `timestamp_ntz`
- Date with no time information: `date`
- Array: `array`
- Map: `map` (may not be supported by some server types)
- Sequence of 8-bit unsigned bytes: `bytes`
- Complex type: `object`, `record`, `struct`
- No value: `null`

### Specification Extensions

While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points.

A custom field can be added with any name. The value can be null, a primitive, an array or an object.

Tooling
---

- [Data Contract CLI](https://github.com/datacontract/datacontract-cli) is an open-source CLI tool to help you create, develop, and maintain your data contracts.
- [Data Contract Manager](https://www.datamesh-manager.com/) is a commercial tool to manage data contracts. It includes a data contract catalog, a Web-Editor, and a request and approval workflow to automate access to data products for a full enterprise data marketplace.
- [Data Contract GPT](https://gpt.datacontract.com) is a custom GPT that can help you write data contracts.
- [Data Contract Editor](https://editor.datacontract.com) is an open-source editor for Data Contracts, including a live HTML preview.

Code Completion
---

The [JSON Schema](https://datacontract.com/datacontract.schema.json) of the current data contract specification is registered in [Schema Store](https://www.schemastore.org/), which brings code completion and syntax checks to all major IDEs.
IntelliJ comes with a built-in YAML plugin that shows autocompletions.
For VS Code, we recommend installing the [YAML](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) plugin.
No additional configuration is required.

Autocompletion is then enabled for files following these patterns:

```
datacontract.yaml
datacontract.yml
*-datacontract.yaml
*-datacontract.yml
*.datacontract.yaml
*.datacontract.yml
datacontract-*.yaml
datacontract-*.yml
**/datacontract/*.yml
**/datacontract/*.yaml
**/datacontracts/*.yml
**/datacontracts/*.yaml
```

Authors
---

The Data Contract Specification was originally created by [Jochen Christ](https://www.linkedin.com/in/jochenchrist/) and [Dr. Simon Harrer](https://www.linkedin.com/in/simonharrer/), and is currently maintained by them.

Contributing
---

Contributions are welcome! Please open an issue or a pull request.
License --- [MIT License](LICENSE) ================================================ FILE: versions/1.1.0/datacontract.init.yaml ================================================ dataContractSpecification: 1.1.0 id: my-data-contract-id info: title: My Data Contract version: 0.0.1 # description: # owner: # contact: # name: # url: # email: ### servers #servers: # production: # type: s3 # location: s3:// # format: parquet # delimiter: new_line ### terms #terms: # usage: # limitations: # billing: # noticePeriod: ### models # models: # my_model: # description: # type: # fields: # my_field: # type: # description: ### definitions # definitions: # my_field: # domain: # name: # title: # type: # description: # example: # pii: # classification: ### servicelevels #servicelevels: # availability: # description: The server is available during support hours # percentage: 99.9% # retention: # description: Data is retained for one year because! # period: P1Y # unlimited: false # latency: # description: Data is available within 25 hours after the order was placed # threshold: 25h # sourceTimestampField: orders.order_timestamp # processedTimestampField: orders.processed_timestamp # freshness: # description: The age of the youngest row in a table. # threshold: 25h # timestampField: orders.order_timestamp # frequency: # description: Data is delivered once a day # type: batch # or streaming # interval: daily # for batch, either or cron # cron: 0 0 * * * # for batch, either or interval # support: # description: The data is available during typical business hours at headquarters # time: 9am to 5pm in EST on business days # responseTime: 1h # backup: # description: Data is backed up once a week, every Sunday at 0:00 UTC. 
# interval: weekly # cron: 0 0 * * 0 # recoveryTime: 24 hours # recoveryPoint: 1 week ================================================ FILE: versions/1.1.0/datacontract.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "title": "DataContractSpecification", "properties": { "dataContractSpecification": { "type": "string", "title": "DataContractSpecificationVersion", "enum": [ "1.1.0", "0.9.3", "0.9.2", "0.9.1", "0.9.0" ], "description": "Specifies the Data Contract Specification being used." }, "id": { "type": "string", "description": "Specifies the identifier of the data contract." }, "info": { "type": "object", "properties": { "title": { "type": "string", "description": "The title of the data contract." }, "version": { "type": "string", "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." }, "status": { "type": "string", "description": "The status of the data contract. Can be proposed, in development, active, retired.", "examples": [ "proposed", "in development", "active", "deprecated", "retired" ] }, "description": { "type": "string", "description": "A description of the data contract." }, "owner": { "type": "string", "description": "The owner or team responsible for managing the data contract and providing the data." }, "contact": { "type": "object", "properties": { "name": { "type": "string", "description": "The identifying name of the contact person/organization." }, "url": { "type": "string", "format": "uri", "description": "The URL pointing to the contact information. This MUST be in the form of a URL." }, "email": { "type": "string", "format": "email", "description": "The email address of the contact person/organization. This MUST be in the form of an email address." 
} }, "description": "Contact information for the data contract.", "additionalProperties": true } }, "additionalProperties": true, "required": [ "title", "version" ], "description": "Metadata and life cycle information about the data contract." }, "servers": { "type": "object", "description": "Information about the servers.", "additionalProperties": { "$ref": "#/$defs/BaseServer", "allOf": [ { "if": { "properties": { "type": { "const": "bigquery" } } }, "then": { "$ref": "#/$defs/BigQueryServer" } }, { "if": { "properties": { "type": { "const": "postgres" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PostgresServer" } }, { "if": { "properties": { "type": { "const": "s3" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/S3Server" } }, { "if": { "properties": { "type": { "const": "sftp" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SftpServer" } }, { "if": { "properties": { "type": { "const": "redshift" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/RedshiftServer" } }, { "if": { "properties": { "type": { "const": "azure" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/AzureServer" } }, { "if": { "properties": { "type": { "const": "sqlserver" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SqlserverServer" } }, { "if": { "properties": { "type": { "const": "snowflake" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SnowflakeServer" } }, { "if": { "properties": { "type": { "const": "databricks" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/DatabricksServer" } }, { "if": { "properties": { "type": { "const": "dataframe" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/DataframeServer" } }, { "if": { "properties": { "type": { "const": "glue" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/GlueServer" } }, { "if": { "properties": { "type": { "const": "postgres" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PostgresServer" } }, { "if": { "properties": { "type": 
{ "const": "oracle" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/OracleServer" } }, { "if": { "properties": { "type": { "const": "kafka" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/KafkaServer" } }, { "if": { "properties": { "type": { "const": "pubsub" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PubSubServer" } }, { "if": { "properties": { "type": { "const": "kinesis" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/KinesisDataStreamsServer" } }, { "if": { "properties": { "type": { "const": "trino" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/TrinoServer" } }, { "if": { "properties": { "type": { "const": "local" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/LocalServer" } } ] } }, "terms": { "type": "object", "description": "The terms and conditions of the data contract.", "properties": { "usage": { "type": "string", "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." }, "limitations": { "type": "string", "description": "The limitations describe the restrictions on how the data can be used. These can be technical restrictions or restrictions on what the data may not be used for." }, "policies": { "type": "array", "items": { "type": "object", "properties": { "type": { "type": "string", "description": "The type of the policy.", "examples": [ "privacy", "security", "retention", "compliance" ] }, "description": { "type": "string", "description": "A description of the policy." }, "url": { "type": "string", "format": "uri", "description": "A URL to the policy document." } }, "additionalProperties": true }, "description": "The policies that apply to the data, such as privacy, security, retention, or compliance policies." }, "billing": { "type": "string", "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use."
}, "noticePeriod": { "type": "string", "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." } }, "additionalProperties": true }, "models": { "description": "Specifies the logical data model. Use the models name (e.g., the table name) as the key.", "type": "object", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "object", "title": "Model", "properties": { "description": { "type": "string" }, "type": { "description": "The type of the model. Examples: table, view, object. Default: table.", "type": "string", "title": "ModelType", "default": "table", "enum": [ "table", "view", "object" ] }, "title": { "type": "string", "description": "An optional string providing a human readable name for the model. Especially useful if the model name is cryptic or contains abbreviations.", "examples": [ "Purchase Orders", "Air Shipments" ] }, "fields": { "description": "Specifies a field in the data model. Use the field name (e.g., the column name) as the key.", "type": "object", "additionalProperties": { "type": "object", "title": "Field", "properties": { "description": { "type": "string", "description": "An optional string describing the semantic of the data in this field." }, "title": { "type": "string", "description": "An optional string providing a human readable name for the field. Especially useful if the field name is cryptic or contains abbreviations." }, "type": { "$ref": "#/$defs/FieldType" }, "required": { "type": "boolean", "default": false, "description": "An indication, if this field must contain a value and may not be null." }, "fields": { "description": "The nested fields (e.g. 
columns) of the object, record, or struct.", "type": "object", "additionalProperties": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" } }, "items": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "keys": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "values": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "primary": { "type": "boolean", "deprecationMessage": "Use the primaryKey field instead." }, "primaryKey": { "type": "boolean", "default": false, "description": "Indicates whether this field is a primary key." }, "references": { "type": "string", "description": "The reference to a field in another model. E.g., use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship.", "examples": [ "orders.order_id", "model.nested_field.field" ] }, "unique": { "type": "boolean", "default": false, "description": "Indicates whether the value must be unique within the model." }, "enum": { "type": "array", "items": { "type": "string" }, "uniqueItems": true, "description": "A value must be equal to one of the elements in this array value. Only evaluated if the value is not null." }, "minLength": { "type": "integer", "description": "The length of the value must be greater than or equal to this value. Only applies to string types." }, "maxLength": { "type": "integer", "description": "The length of the value must be less than or equal to this value. Only applies to string types." }, "format": { "type": "string", "description": "A specific format the value must comply with (e.g., 'email', 'uri', 'uuid').", "examples": [ "email", "uri", "uuid" ] }, "precision": { "type": "number", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38."
}, "scale": { "type": "number", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression the value must match. Only applies to string types.", "examples": [ "^[a-zA-Z0-9_-]+$" ] }, "minimum": { "type": "number", "description": "The value must be greater than or equal to this number. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMinimum": { "type": "number", "description": "The value must be strictly greater than this number. Only evaluated if the value is not null. Only applies to numeric values." }, "maximum": { "type": "number", "description": "The value must be less than or equal to this number. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMaximum": { "type": "number", "description": "The value must be strictly less than this number. Only evaluated if the value is not null. Only applies to numeric values." }, "example": { "type": "string", "description": "An example value for this field.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "Example values for this field." }, "pii": { "type": "boolean", "description": "Indicates whether this field contains Personally Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field, according to the organization's classification scheme.", "examples": [ "sensitive", "restricted", "internal", "public" ] }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context."
}, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } }, "$ref": { "type": "string", "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." }, "quality": { "type": "array", "items": { "$ref": "#/$defs/Quality" } }, "lineage": { "$ref": "#/$defs/Lineage" }, "config": { "type": "object", "description": "Additional metadata for field configuration.", "additionalProperties": { "type": [ "string", "number", "boolean", "object", "array", "null" ] }, "properties": { "avroType": { "type": "string", "description": "Specify the field type to use when exporting the data model to Apache Avro." }, "avroLogicalType": { "type": "string", "description": "Specify the logical field type to use when exporting the data model to Apache Avro." }, "bigqueryType": { "type": "string", "description": "Specify the physical column type that is used in a BigQuery table, e.g., `NUMERIC(5, 2)`." }, "snowflakeType": { "type": "string", "description": "Specify the physical column type that is used in a Snowflake table, e.g., `TIMESTAMP_LTZ`." }, "redshiftType": { "type": "string", "description": "Specify the physical column type that is used in a Redshift table, e.g., `SMALLINT`." }, "sqlserverType": { "type": "string", "description": "Specify the physical column type that is used in a SQL Server table, e.g., `DATETIME2`." }, "databricksType": { "type": "string", "description": "Specify the physical column type that is used in a Databricks Unity Catalog table." }, "glueType": { "type": "string", "description": "Specify the physical column type that is used in an AWS Glue Data Catalog table." 
} } } } } }, "primaryKey": { "type": "array", "items": { "type": "string" }, "description": "The compound primary key of the model." }, "quality": { "type": "array", "items": { "$ref": "#/$defs/Quality" } }, "examples": { "type": "array" }, "config": { "type": "object", "description": "Additional metadata for model configuration.", "additionalProperties": { "type": [ "string", "number", "boolean", "object", "array", "null" ] }, "properties": { "avroNamespace": { "type": "string", "description": "The namespace to use when importing and exporting the data model from / to Apache Avro." } } } } } }, "definitions": { "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "type": "object", "propertyNames": { "pattern": "^[a-zA-Z0-9/_-]+$" }, "additionalProperties": { "type": "object", "title": "Definition", "properties": { "domain": { "type": "string", "description": "The domain in which this definition is valid.", "default": "global", "deprecationMessage": "This field is deprecated. Encode the domain into the ID using slashes." }, "name": { "type": "string", "description": "The technical name of this definition.", "deprecationMessage": "This field is deprecated. Encode the name into the ID using slashes." }, "title": { "type": "string", "description": "The business name of this definition." }, "description": { "type": "string", "description": "Clear and concise explanations related to the domain." }, "type": { "$ref": "#/$defs/FieldType" }, "fields": { "description": "The nested fields (e.g. 
columns) of the object, record, or struct.", "type": "object", "additionalProperties": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" } }, "items": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "keys": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "values": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "minLength": { "type": "integer", "description": "The length of the value must be greater than or equal to this value. Applies only to string types." }, "maxLength": { "type": "integer", "description": "The length of the value must be less than or equal to this value. Applies only to string types." }, "format": { "type": "string", "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." }, "precision": { "type": "integer", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "integer", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression pattern the value must match. Applies only to string types." }, "minimum": { "type": "number", "description": "The value must be greater than or equal to this number. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMinimum": { "type": "number", "description": "The value must be strictly greater than this number. Only evaluated if the value is not null. Only applies to numeric values." }, "maximum": { "type": "number", "description": "The value must be less than or equal to this number. Only evaluated if the value is not null. Only applies to numeric values."
}, "exclusiveMaximum": { "type": "number", "description": "The value must be strictly less than this number. Only evaluated if the value is not null. Only applies to numeric values." }, "example": { "type": "string", "description": "An example value.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "Example values." }, "pii": { "type": "boolean", "description": "Indicates whether the field contains Personally Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field." }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." }, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } } }, "required": [ "type" ] } }, "servicelevels": { "type": "object", "description": "Specifies the service level agreements for the provided data, including availability, data retention policies, latency requirements, data freshness, update frequency, support availability, and backup policies.", "properties": { "availability": { "type": "object", "description": "Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data.", "properties": { "description": { "type": "string", "description": "An optional string describing the availability service level.", "example": "The server is available during support hours" }, "percentage": { "type": "string", "description": "An optional string describing the guaranteed uptime in percent (e.g., `99.9%`)", "pattern": "^\\d+(\\.\\d+)?%$", "example": "99.9%" } } }, "retention": { "type": "object", "description": "Retention
covers how long data will be available.", "properties": { "description": { "type": "string", "description": "An optional string describing the retention service level.", "example": "Data is retained for one year." }, "period": { "type": "string", "description": "An optional period of time for which data is available. Supported formats: Simple duration (e.g., `1 year`, `30d`) and ISO 8601 duration (e.g., `P1Y`).", "example": "P1Y" }, "unlimited": { "type": "boolean", "description": "An optional indicator that data is kept forever.", "example": false }, "timestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp that the period refers to.", "example": "orders.order_timestamp" } } }, "latency": { "type": "object", "description": "Latency refers to the maximum amount of time it takes for data to get from the source to its destination.", "properties": { "description": { "type": "string", "description": "An optional string describing the latency service level.", "example": "Data is available within 25 hours after the order was placed." }, "threshold": { "type": "string", "description": "An optional maximum duration between the source timestamp and the processed timestamp.
Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`).", "example": "25h" }, "sourceTimestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp when the data was provided at the source.", "example": "orders.order_timestamp" }, "processedTimestampField": { "type": "string", "description": "An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract.", "example": "orders.processed_timestamp" } } }, "freshness": { "type": "object", "description": "The maximum age of the youngest row in a table.", "properties": { "description": { "type": "string", "description": "An optional string describing the freshness service level.", "example": "The age of the youngest row in a table is within 25 hours." }, "threshold": { "type": "string", "description": "An optional maximum age of the youngest entry. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`).", "example": "25h" }, "timestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp that the threshold refers to.", "example": "orders.order_timestamp" } } }, "frequency": { "type": "object", "description": "Frequency describes how often data is updated.", "properties": { "description": { "type": "string", "description": "An optional string describing the frequency service level.", "example": "Data is delivered once a day." }, "type": { "type": "string", "enum": [ "batch", "micro-batching", "streaming", "manual" ], "description": "The method of data processing.", "example": "batch" }, "interval": { "type": "string", "description": "Optional. Only for batch: How often the pipeline is triggered, e.g., `daily`.", "example": "daily" }, "cron": { "type": "string", "description": "Optional.
Only for batch: A cron expression that defines when the pipeline is triggered, e.g., `0 0 * * *`.", "example": "0 0 * * *" }, "support": { "type": "object", "description": "Support describes the times when support will be available for contact.", "properties": { "description": { "type": "string", "description": "An optional string describing the support service level.", "example": "The data is available during typical business hours at headquarters." }, "time": { "type": "string", "description": "An optional string describing the times when support will be available for contact, such as `24/7` or `business hours only`.", "example": "9am to 5pm in EST on business days" }, "responseTime": { "type": "string", "description": "An optional string describing the time it takes for the support team to acknowledge a request. This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with.", "example": "24 hours" } } }, "backup": { "type": "object", "description": "Backup specifies details about data backup procedures.", "properties": { "description": { "type": "string", "description": "An optional string describing the backup service level.", "example": "Data is backed up once a week, every Sunday at 0:00 UTC."
}, "interval": { "type": "string", "description": "An optional interval that defines how often data will be backed up, e.g., `daily`.", "example": "weekly" }, "cron": { "type": "string", "description": "An optional cron expression when data will be backed up, e.g., `0 0 * * *`.", "example": "0 0 * * 0" }, "recoveryTime": { "type": "string", "description": "An optional Recovery Time Objective (RTO) specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours).", "example": "24 hours" }, "recoveryPoint": { "type": "string", "description": "An optional Recovery Point Objective (RPO) defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours).", "example": "1 week" } } } } }, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } }, "tags": { "type": "array", "items": { "type": "string", "description": "Tags to facilitate searching and filtering.", "examples": [ "databricks", "pii", "sensitive" ] }, "description": "Tags to facilitate searching and filtering." 
} }, "required": [ "dataContractSpecification", "id", "info" ], "$defs": { "FieldType": { "type": "string", "title": "FieldType", "description": "The logical data type of the field.", "enum": [ "number", "decimal", "numeric", "int", "integer", "long", "bigint", "float", "double", "string", "text", "varchar", "boolean", "timestamp", "timestamp_tz", "timestamp_ntz", "date", "array", "map", "object", "record", "struct", "bytes", "null" ] }, "BaseServer": { "type": "object", "properties": { "description": { "type": "string", "description": "An optional string describing the servers." }, "environment": { "type": "string", "description": "The environment in which the servers are running. Examples: prod, sit, stg." }, "type": { "type": "string", "description": "The type of the data product technology that implements the data contract.", "enum": [ "bigquery", "BigQuery", "s3", "sftp", "redshift", "azure", "sqlserver", "snowflake", "databricks", "dataframe", "glue", "postgres", "oracle", "kafka", "pubsub", "kinesis", "trino", "local" ] }, "roles": { "description": " An optional array of roles that are available and can be requested to access the server for role-based access control. E.g. separate roles for different regions or sensitive data.", "type": "array", "items": { "type": "object", "properties": { "name": { "type": "string", "description": "The name of the role." }, "description": { "type": "string", "description": "A description of the role and what access the role provides." } }, "required": [ "name" ] } } }, "additionalProperties": true, "required": [ "type" ] }, "BigQueryServer": { "type": "object", "title": "BigQueryServer", "properties": { "project": { "type": "string", "description": "The GCP project name." }, "dataset": { "type": "string", "description": "The GCP dataset name." 
} }, "required": [ "project", "dataset" ] }, "S3Server": { "type": "object", "title": "S3Server", "properties": { "location": { "type": "string", "format": "uri", "description": "S3 URL, starting with `s3://`", "examples": [ "s3://datacontract-example-orders-latest/data/{model}/*.json" ] }, "endpointUrl": { "type": "string", "format": "uri", "description": "The server endpoint for S3-compatible servers.", "examples": [ "https://minio.example.com" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. How multiple json documents are delimited within one file" } }, "required": [ "location" ] }, "SftpServer": { "type": "object", "title": "SftpServer", "properties": { "location": { "type": "string", "format": "uri", "pattern": "^sftp://.*", "description": "SFTP URL, starting with `sftp://`", "examples": [ "sftp://123.123.12.123/{model}/*.json" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. How multiple json documents are delimited within one file" } }, "required": [ "location" ] }, "RedshiftServer": { "type": "object", "title": "RedshiftServer", "properties": { "account": { "type": "string", "description": "An optional string describing the server." }, "host": { "type": "string", "description": "An optional string describing the host name." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." 
}, "clusterIdentifier": { "type": "string", "description": "An optional string describing the cluster's identifier.", "examples": [ "redshift-prod-eu", "analytics-cluster" ] }, "port": { "type": "integer", "description": "An optional string describing the cluster's port.", "examples": [ 5439 ] }, "endpoint": { "type": "string", "description": "An optional string describing the cluster's endpoint.", "examples": [ "analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics" ] } }, "additionalProperties": true, "required": [ "account", "database", "schema" ] }, "AzureServer": { "type": "object", "title": "AzureServer", "properties": { "location": { "type": "string", "format": "uri", "description": "Path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs. Recommended pattern is 'abfss:///'", "examples": [ "abfss://my_container_name/path", "abfss://my_container_name/path/*.json", "az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet", "abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. 
How multiple json documents are delimited within one file" } }, "required": [ "location", "format" ] }, "SqlserverServer": { "type": "object", "title": "SqlserverServer", "properties": { "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server.", "default": 1433, "examples": [ 1433 ] }, "database": { "type": "string", "description": "The name of the database.", "examples": [ "database" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "dbo" ] } }, "required": [ "host", "database", "schema" ] }, "SnowflakeServer": { "type": "object", "title": "SnowflakeServer", "properties": { "account": { "type": "string", "description": "An optional string describing the server." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "account", "database", "schema" ] }, "DatabricksServer": { "type": "object", "title": "DatabricksServer", "properties": { "host": { "type": "string", "description": "The Databricks host", "examples": [ "dbc-abcdefgh-1234.cloud.databricks.com" ] }, "catalog": { "type": "string", "description": "The name of the Hive or Unity catalog" }, "schema": { "type": "string", "description": "The schema name in the catalog" } }, "required": [ "catalog", "schema" ] }, "DataframeServer": { "type": "object", "title": "DataframeServer", "required": [ "type" ] }, "GlueServer": { "type": "object", "title": "GlueServer", "properties": { "account": { "type": "string", "description": "The AWS Glue account", "examples": [ "1234-5678-9012" ] }, "database": { "type": "string", "description": "The AWS Glue database name", "examples": [ "my_database" ] }, "location": { "type": "string", "format": "uri", "description": "The AWS S3 path. 
Must be in the form of a URL.", "examples": [ "s3://datacontract-example-orders-latest/data/{model}" ] }, "format": { "type": "string", "description": "The format of the files", "examples": [ "parquet", "csv", "json", "delta" ] } }, "required": [ "account", "database" ] }, "PostgresServer": { "type": "object", "title": "PostgresServer", "properties": { "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server." }, "database": { "type": "string", "description": "The name of the database.", "examples": [ "postgres" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "public" ] } }, "required": [ "host", "port", "database", "schema" ] }, "OracleServer": { "type": "object", "title": "OracleServer", "properties": { "host": { "type": "string", "description": "The host to the oracle server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the oracle server.", "examples": [ 1523 ] }, "serviceName": { "type": "string", "description": "The name of the service.", "examples": [ "service" ] } }, "required": [ "host", "port", "serviceName" ] }, "KafkaServer": { "type": "object", "title": "KafkaServer", "description": "Kafka Server", "properties": { "host": { "type": "string", "description": "The bootstrap server of the kafka cluster." }, "topic": { "type": "string", "description": "The topic name." }, "format": { "type": "string", "description": "The format of the message. Examples: json, avro, protobuf.", "default": "json" } }, "required": [ "host", "topic" ] }, "PubSubServer": { "type": "object", "title": "PubSubServer", "properties": { "project": { "type": "string", "description": "The GCP project name." }, "topic": { "type": "string", "description": "The topic name." 
} }, "required": [ "project", "topic" ] }, "KinesisDataStreamsServer": { "type": "object", "title": "KinesisDataStreamsServer", "description": "Kinesis Data Streams Server", "properties": { "stream": { "type": "string", "description": "The name of the Kinesis data stream." }, "region": { "type": "string", "description": "AWS region.", "examples": [ "eu-west-1" ] }, "format": { "type": "string", "description": "The format of the record", "examples": [ "json", "avro", "protobuf" ] } }, "required": [ "stream" ] }, "TrinoServer": { "type": "object", "title": "TrinoServer", "properties": { "host": { "type": "string", "description": "The Trino host URL.", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The Trino port." }, "catalog": { "type": "string", "description": "The name of the catalog.", "examples": [ "hive" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "my_schema" ] } }, "required": [ "host", "port", "catalog", "schema" ] }, "LocalServer": { "type": "object", "title": "LocalServer", "properties": { "path": { "type": "string", "description": "The relative or absolute path to the data file(s).", "examples": [ "./folder/data.parquet", "./folder/*.parquet" ] }, "format": { "type": "string", "description": "The format of the file(s)", "examples": [ "json", "parquet", "delta", "csv" ] } }, "required": [ "path", "format" ] }, "Quality": { "allOf": [ { "type": "object", "properties": { "type": { "type": "string", "description": "The type of quality check", "enum": [ "text", "library", "sql", "custom" ] }, "description": { "type": "string", "description": "A plain text describing the quality attribute in natural language." 
} } }, { "if": { "properties": { "type": { "const": "text" } } }, "then": { "required": [ "description" ] } }, { "if": { "properties": { "type": { "const": "sql" } } }, "then": { "properties": { "query": { "type": "string", "description": "A SQL query that returns a single number to compare with the threshold." }, "dialect": { "type": "string", "description": "The SQL dialect that is used for the query. Should be compatible to the server.type.", "examples": [ "athena", "bigquery", "redshift", "snowflake", "trino", "postgres", "oracle" ] }, "mustBe": { "type": "number" }, "mustNotBe": { "type": "number" }, "mustBeGreaterThan": { "type": "number" }, "mustBeGreaterThanOrEqualTo": { "type": "number" }, "mustBeLessThan": { "type": "number" }, "mustBeLessThanOrEqualTo": { "type": "number" }, "mustBeBetween": { "type": "array", "items": { "type": "number" }, "minItems": 2, "maxItems": 2 }, "mustNotBeBetween": { "type": "array", "items": { "type": "number" }, "minItems": 2, "maxItems": 2 } }, "required": [ "query" ] } }, { "if": { "properties": { "type": { "const": "library" } } }, "then": { "properties": { "rule": { "type": "string", "description": "Define a data quality check based on the predefined rules as per ODCS.", "examples": ["duplicateCount", "validValues", "rowCount"] }, "mustBe": { "description": "Must be equal to the value to be valid. When using numbers, it is equivalent to '='." }, "mustNotBe": { "description": "Must not be equal to the value to be valid. When using numbers, it is equivalent to '!='." }, "mustBeGreaterThan": { "type": "number", "description": "Must be greater than the value to be valid. It is equivalent to '>'." }, "mustBeGreaterOrEqualTo": { "type": "number", "description": "Must be greater than or equal to the value to be valid. It is equivalent to '>='." }, "mustBeLessThan": { "type": "number", "description": "Must be less than the value to be valid. It is equivalent to '<'." 
}, "mustBeLessOrEqualTo": { "type": "number", "description": "Must be less than or equal to the value to be valid. It is equivalent to '<='." }, "mustBeBetween": { "type": "array", "description": "Must be between the two numbers to be valid. Smallest number first in the array.", "minItems": 2, "maxItems": 2, "uniqueItems": true, "items": { "type": "number" } }, "mustNotBeBetween": { "type": "array", "description": "Must not be between the two numbers to be valid. Smallest number first in the array.", "minItems": 2, "maxItems": 2, "uniqueItems": true, "items": { "type": "number" } } }, "required": [ "rule" ] } }, { "if": { "properties": { "type": { "const": "custom" } } }, "then": { "properties": { "description": { "type": "string", "description": "A plain text describing the quality attribute in natural language." }, "engine": { "type": "string", "examples": [ "soda", "great-expectations" ], "description": "The engine used for custom quality checks." }, "implementation": { "type": [ "object", "array", "string" ], "description": "Engine-specific quality checks and expectations." } }, "required": [ "engine" ] } } ] }, "Lineage": { "type": "object", "properties": { "inputFields": { "type": "array", "items": { "type": "object", "properties": { "namespace": { "type": "string", "description": "The input dataset namespace" }, "name": { "type": "string", "description": "The input dataset name" }, "field": { "type": "string", "description": "The input field" }, "transformations": { "type": "array", "items": { "type": "object", "properties": { "type": { "description": "The type of the transformation. 
Allowed values are: DIRECT, INDIRECT", "type": "string" }, "subtype": { "type": "string", "description": "The subtype of the transformation" }, "description": { "type": "string", "description": "a string representation of the transformation applied" }, "masking": { "type": "boolean", "description": "is transformation masking the data or not" } }, "required": [ "type" ], "additionalProperties": true } } }, "additionalProperties": true, "required": [ "namespace", "name", "field" ] } }, "transformationDescription": { "type": "string", "description": "a string representation of the transformation applied", "deprecated": true }, "transformationType": { "type": "string", "description": "IDENTITY|MASKED reflects a clearly defined behavior. IDENTITY: exact same as input; MASKED: no original data available (like a hash of PII for example)", "deprecated": true } }, "additionalProperties": true, "required": [ "inputFields" ] } } } ================================================ FILE: versions/1.1.0/definition.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "properties": { "id": { "type": "string", "description": "A unique identifier for this definition. Encode the domain into the ID, separated by slashes.", "examples": [ "checkout/order_id" ] }, "title": { "type": "string", "description": "The business name of this definition." }, "description": { "type": "string", "description": "Clear and concise explanations related to the domain." }, "type": { "type": "string", "description": "The logical data type." }, "minLength": { "type": "integer", "description": "A value must be greater than or equal to this value. Applies only to string types." }, "maxLength": { "type": "integer", "description": "A value must be less than or equal to this value. 
Applies only to string types." }, "format": { "type": "string", "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." }, "precision": { "type": "integer", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "integer", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression pattern the value must match. Applies only to string types." }, "example": { "type": "string", "description": "An example value for this field.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "Example values for this field." }, "pii": { "type": "boolean", "description": "Indicates if the field contains Personally Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field." }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." 
}, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } } }, "required": [ "type" ] } ================================================ FILE: versions/1.2.0/datacontract.init.yaml ================================================ dataContractSpecification: 1.2.0 id: my-data-contract-id info: title: My Data Contract version: 0.0.1 # description: # owner: # contact: # name: # url: # email: ### servers #servers: # production: # type: s3 # location: s3:// # format: parquet # delimiter: new_line ### terms #terms: # usage: # limitations: # billing: # noticePeriod: ### models # models: # my_model: # description: # type: # fields: # my_field: # type: # description: ### definitions # definitions: # my_field: # domain: # name: # title: # type: # description: # example: # pii: # classification: ### servicelevels #servicelevels: # availability: # description: The server is available during support hours # percentage: 99.9% # retention: # description: Data is retained for one year because! # period: P1Y # unlimited: false # latency: # description: Data is available within 25 hours after the order was placed # threshold: 25h # sourceTimestampField: orders.order_timestamp # processedTimestampField: orders.processed_timestamp # freshness: # description: The age of the youngest row in a table. 
# threshold: 25h # timestampField: orders.order_timestamp # frequency: # description: Data is delivered once a day # type: batch # or streaming # interval: daily # for batch, either interval or cron # cron: 0 0 * * * # for batch, either cron or interval # support: # description: The data is available during typical business hours at headquarters # time: 9am to 5pm in EST on business days # responseTime: 1h # backup: # description: Data is backed up once a week, every Sunday at 0:00 UTC. # interval: weekly # cron: 0 0 * * 0 # recoveryTime: 24 hours # recoveryPoint: 1 week ================================================ FILE: versions/1.2.0/datacontract.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "title": "DataContractSpecification", "properties": { "dataContractSpecification": { "type": "string", "title": "DataContractSpecificationVersion", "enum": [ "1.2.0", "1.1.0", "0.9.3", "0.9.2", "0.9.1", "0.9.0" ], "description": "Specifies the Data Contract Specification being used." }, "id": { "type": "string", "description": "Specifies the identifier of the data contract." }, "info": { "type": "object", "properties": { "title": { "type": "string", "description": "The title of the data contract." }, "version": { "type": "string", "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." }, "status": { "type": "string", "description": "The status of the data contract. Can be proposed, in development, active, deprecated, retired.", "examples": [ "proposed", "in development", "active", "deprecated", "retired" ] }, "description": { "type": "string", "description": "A description of the data contract." }, "owner": { "type": "string", "description": "The owner or team responsible for managing the data contract and providing the data." 
}, "contact": { "type": "object", "properties": { "name": { "type": "string", "description": "The identifying name of the contact person/organization." }, "url": { "type": "string", "format": "uri", "description": "The URL pointing to the contact information. This MUST be in the form of a URL." }, "email": { "type": "string", "format": "email", "description": "The email address of the contact person/organization. This MUST be in the form of an email address." } }, "description": "Contact information for the data contract.", "additionalProperties": true } }, "additionalProperties": true, "required": [ "title", "version" ], "description": "Metadata and life cycle information about the data contract." }, "servers": { "type": "object", "description": "Information about the servers.", "additionalProperties": { "$ref": "#/$defs/BaseServer", "allOf": [ { "if": { "properties": { "type": { "const": "bigquery" } } }, "then": { "$ref": "#/$defs/BigQueryServer" } }, { "if": { "properties": { "type": { "const": "postgres" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PostgresServer" } }, { "if": { "properties": { "type": { "const": "s3" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/S3Server" } }, { "if": { "properties": { "type": { "const": "sftp" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SftpServer" } }, { "if": { "properties": { "type": { "const": "redshift" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/RedshiftServer" } }, { "if": { "properties": { "type": { "const": "azure" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/AzureServer" } }, { "if": { "properties": { "type": { "const": "sqlserver" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SqlserverServer" } }, { "if": { "properties": { "type": { "const": "snowflake" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SnowflakeServer" } }, { "if": { "properties": { "type": { "const": "databricks" } }, "required": [ "type" ] }, "then": { "$ref": 
"#/$defs/DatabricksServer" } }, { "if": { "properties": { "type": { "const": "dataframe" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/DataframeServer" } }, { "if": { "properties": { "type": { "const": "glue" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/GlueServer" } }, { "if": { "properties": { "type": { "const": "postgres" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PostgresServer" } }, { "if": { "properties": { "type": { "const": "oracle" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/OracleServer" } }, { "if": { "properties": { "type": { "const": "kafka" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/KafkaServer" } }, { "if": { "properties": { "type": { "const": "pubsub" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PubSubServer" } }, { "if": { "properties": { "type": { "const": "kinesis" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/KinesisDataStreamsServer" } }, { "if": { "properties": { "type": { "const": "trino" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/TrinoServer" } }, { "if": { "properties": { "type": { "const": "clickhouse" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/ClickhouseServer" } }, { "if": { "properties": { "type": { "const": "local" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/LocalServer" } } ] } }, "terms": { "type": "object", "description": "The terms and conditions of the data contract.", "properties": { "usage": { "type": "string", "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." }, "limitations": { "type": "string", "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." 
}, "policies": { "type": "array", "items": { "type": "object", "properties": { "type": { "type": "string", "description": "The type of the policy.", "examples": [ "privacy", "security", "retention", "compliance" ] }, "description": { "type": "string", "description": "A description of the policy." }, "url": { "type": "string", "format": "uri", "description": "A URL to the policy document." } }, "additionalProperties": true }, "description": "The policies that apply to the data, such as privacy, security, retention, or compliance policies." }, "billing": { "type": "string", "description": "The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use." }, "noticePeriod": { "type": "string", "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." } }, "additionalProperties": true }, "models": { "description": "Specifies the logical data model. Use the model's name (e.g., the table name) as the key.", "type": "object", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "object", "title": "Model", "properties": { "description": { "type": "string" }, "type": { "description": "The type of the model. Examples: table, view, object. Default: table.", "type": "string", "title": "ModelType", "default": "table", "enum": [ "table", "view", "object" ] }, "title": { "type": "string", "description": "An optional string providing a human readable name for the model. Especially useful if the model name is cryptic or contains abbreviations.", "examples": [ "Purchase Orders", "Air Shipments" ] }, "fields": { "description": "Specifies a field in the data model. 
Use the field name (e.g., the column name) as the key.", "type": "object", "additionalProperties": { "type": "object", "title": "Field", "properties": { "description": { "type": "string", "description": "An optional string describing the semantic of the data in this field." }, "title": { "type": "string", "description": "An optional string providing a human readable name for the field. Especially useful if the field name is cryptic or contains abbreviations." }, "type": { "$ref": "#/$defs/FieldType" }, "required": { "type": "boolean", "default": false, "description": "An indication, if this field must contain a value and may not be null." }, "fields": { "description": "The nested fields (e.g. columns) of the object, record, or struct.", "type": "object", "additionalProperties": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" } }, "items": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "keys": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "values": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "primary": { "type": "boolean", "deprecationMessage": "Use the primaryKey field instead." }, "primaryKey": { "type": "boolean", "default": false, "description": "If this field is a primary key." }, "references": { "type": "string", "description": "The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship.", "examples": [ "orders.order_id", "model.nested_field.field" ] }, "unique": { "type": "boolean", "default": false, "description": "An indication, if the value must be unique within the model." }, "enum": { "type": "array", "items": { "type": "string" }, "uniqueItems": true, "description": "A value must be equal to one of the elements in this array value. 
Only evaluated if the value is not null." }, "minLength": { "type": "integer", "description": "A value must greater than, or equal to, the value of this. Only applies to string types." }, "maxLength": { "type": "integer", "description": "A value must less than, or equal to, the value of this. Only applies to string types." }, "format": { "type": "string", "description": "A specific format the value must comply with (e.g., 'email', 'uri', 'uuid').", "examples": [ "email", "uri", "uuid" ] }, "precision": { "type": "number", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "number", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression the value must match. Only applies to string types.", "examples": [ "^[a-zA-Z0-9_-]+$" ] }, "minimum": { "type": "number", "description": "A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMinimum": { "type": "number", "description": "A value of a number must greater than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "maximum": { "type": "number", "description": "A value of a number must less than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMaximum": { "type": "number", "description": "A value of a number must less than the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "example": { "type": "string", "description": "An example value for this field.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "A examples value for this field." 
}, "pii": { "type": "boolean", "description": "An indication, if this field contains Personal Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field, according to the organization's classification scheme.", "examples": [ "sensitive", "restricted", "internal", "public" ] }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." }, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } }, "$ref": { "type": "string", "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." }, "quality": { "type": "array", "items": { "$ref": "#/$defs/Quality" } }, "lineage": { "$ref": "#/$defs/Lineage" }, "config": { "type": "object", "description": "Additional metadata for field configuration.", "additionalProperties": { "type": [ "string", "number", "boolean", "object", "array", "null" ] }, "properties": { "avroType": { "type": "string", "description": "Specify the field type to use when exporting the data model to Apache Avro." }, "avroLogicalType": { "type": "string", "description": "Specify the logical field type to use when exporting the data model to Apache Avro." }, "bigqueryType": { "type": "string", "description": "Specify the physical column type that is used in a BigQuery table, e.g., `NUMERIC(5, 2)`." }, "snowflakeType": { "type": "string", "description": "Specify the physical column type that is used in a Snowflake table, e.g., `TIMESTAMP_LTZ`." 
}, "redshiftType": { "type": "string", "description": "Specify the physical column type that is used in a Redshift table, e.g., `SMALLINT`." }, "sqlserverType": { "type": "string", "description": "Specify the physical column type that is used in a SQL Server table, e.g., `DATETIME2`." }, "databricksType": { "type": "string", "description": "Specify the physical column type that is used in a Databricks Unity Catalog table." }, "glueType": { "type": "string", "description": "Specify the physical column type that is used in an AWS Glue Data Catalog table." } } } } } }, "primaryKey": { "type": "array", "items": { "type": "string" }, "description": "The compound primary key of the model." }, "quality": { "type": "array", "items": { "$ref": "#/$defs/Quality" } }, "examples": { "type": "array" }, "additionalFields": { "type": "boolean", "description": " Specify, if the model can have additional fields that are not defined in the contract. ", "default": false }, "config": { "type": "object", "description": "Additional metadata for model configuration.", "additionalProperties": { "type": [ "string", "number", "boolean", "object", "array", "null" ] }, "properties": { "avroNamespace": { "type": "string", "description": "The namespace to use when importing and exporting the data model from / to Apache Avro." } } } } } }, "definitions": { "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "type": "object", "propertyNames": { "pattern": "^[a-zA-Z0-9/_-]+$" }, "additionalProperties": { "type": "object", "title": "Definition", "properties": { "domain": { "type": "string", "description": "The domain in which this definition is valid.", "default": "global", "deprecationMessage": "This field is deprecated. Encode the domain into the ID using slashes." }, "name": { "type": "string", "description": "The technical name of this definition.", "deprecationMessage": "This field is deprecated. 
Encode the name into the ID using slashes." }, "title": { "type": "string", "description": "The business name of this definition." }, "description": { "type": "string", "description": "Clear and concise explanations related to the domain." }, "type": { "$ref": "#/$defs/FieldType" }, "fields": { "description": "The nested fields (e.g. columns) of the object, record, or struct.", "type": "object", "additionalProperties": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" } }, "items": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "keys": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "values": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "minLength": { "type": "integer", "description": "A value must be greater than or equal to this value. Applies only to string types." }, "maxLength": { "type": "integer", "description": "A value must be less than or equal to this value. Applies only to string types." }, "format": { "type": "string", "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." }, "precision": { "type": "integer", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "integer", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression pattern the value must match. Applies only to string types." }, "minimum": { "type": "number", "description": "A value of a number must greater than, or equal to, the value of this. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMinimum": { "type": "number", "description": "A value of a number must greater than the value of this. 
Only evaluated if the value is not null. Only applies to numeric values." }, "maximum": { "type": "number", "description": "A numeric value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMaximum": { "type": "number", "description": "A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values." }, "example": { "type": "string", "description": "An example value.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "Example values." }, "pii": { "type": "boolean", "description": "Indicates if the field contains Personally Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field." }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." 
}, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } } }, "required": [ "type" ] } }, "servicelevels": { "type": "object", "description": "Specifies the service level agreements for the provided data, including availability, data retention policies, latency requirements, data freshness, update frequency, support availability, and backup policies.", "properties": { "availability": { "type": "object", "description": "Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data.", "properties": { "description": { "type": "string", "description": "An optional string describing the availability service level.", "example": "The server is available during support hours" }, "percentage": { "type": "string", "description": "An optional string describing the guaranteed uptime in percent (e.g., `99.9%`)", "pattern": "^\\d+(\\.\\d+)?%$", "example": "99.9%" } } }, "retention": { "type": "object", "description": "Retention covers the period how long data will be available.", "properties": { "description": { "type": "string", "description": "An optional string describing the retention service level.", "example": "Data is retained for one year." }, "period": { "type": "string", "description": "An optional period of time, how long data is available. 
Supported formats: Simple duration (e.g., `1 year`, `30d`) and ISO 8601 duration (e.g, `P1Y`).", "example": "P1Y" }, "unlimited": { "type": "boolean", "description": "An optional indicator that data is kept forever.", "example": false }, "timestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp that the period refers to.", "example": "orders.order_timestamp" } } }, "latency": { "type": "object", "description": "Latency refers to the maximum amount of time from the source to its destination.", "properties": { "description": { "type": "string", "description": "An optional string describing the latency service level.", "example": "Data is available within 25 hours after the order was placed." }, "threshold": { "type": "string", "description": "An optional maximum duration between the source timestamp and the processed timestamp. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g, `PT24H`).", "example": "25h" }, "sourceTimestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp when the data was provided at the source.", "example": "orders.order_timestamp" }, "processedTimestampField": { "type": "string", "description": "An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract.", "example": "orders.processed_timestamp" } } }, "freshness": { "type": "object", "description": "The maximum age of the youngest row in a table.", "properties": { "description": { "type": "string", "description": "An optional string describing the freshness service level.", "example": "The age of the youngest row in a table is within 25 hours." }, "threshold": { "type": "string", "description": "An optional maximum age of the youngest entry. 
Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`).", "example": "25h" }, "timestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp that the threshold refers to.", "example": "orders.order_timestamp" } } }, "frequency": { "type": "object", "description": "Frequency describes how often data is updated.", "properties": { "description": { "type": "string", "description": "An optional string describing the frequency service level.", "example": "Data is delivered once a day." }, "type": { "type": "string", "enum": [ "batch", "micro-batching", "streaming", "manual" ], "description": "The method of data processing.", "example": "batch" }, "interval": { "type": "string", "description": "Optional. Only for batch: How often the pipeline is triggered, e.g., `daily`.", "example": "daily" }, "cron": { "type": "string", "description": "Optional. Only for batch: A cron expression for when the pipeline is triggered, e.g., `0 0 * * *`.", "example": "0 0 * * *" } } }, "support": { "type": "object", "description": "Support describes the times when support will be available for contact.", "properties": { "description": { "type": "string", "description": "An optional string describing the support service level.", "example": "The data is available during typical business hours at headquarters." }, "time": { "type": "string", "description": "An optional string describing the times when support will be available for contact such as `24/7` or `business hours only`.", "example": "9am to 5pm in EST on business days" }, "responseTime": { "type": "string", "description": "An optional string describing the time it takes for the support team to acknowledge a request. 
This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with.", "example": "24 hours" } } }, "backup": { "type": "object", "description": "Backup specifies details about data backup procedures.", "properties": { "description": { "type": "string", "description": "An optional string describing the backup service level.", "example": "Data is backed up once a week, every Sunday at 0:00 UTC." }, "interval": { "type": "string", "description": "An optional interval that defines how often data will be backed up, e.g., `daily`.", "example": "weekly" }, "cron": { "type": "string", "description": "An optional cron expression when data will be backed up, e.g., `0 0 * * *`.", "example": "0 0 * * 0" }, "recoveryTime": { "type": "string", "description": "An optional Recovery Time Objective (RTO) specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours).", "example": "24 hours" }, "recoveryPoint": { "type": "string", "description": "An optional Recovery Point Objective (RPO) defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours).", "example": "1 week" } } } } }, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } }, "tags": { "type": "array", "items": { "type": "string", "description": "Tags to facilitate searching and filtering.", "examples": [ "databricks", "pii", "sensitive" ] }, "description": "Tags to facilitate searching and filtering." 
} }, "required": [ "dataContractSpecification", "id", "info" ], "$defs": { "FieldType": { "type": "string", "title": "FieldType", "description": "The logical data type of the field.", "enum": [ "number", "decimal", "numeric", "int", "integer", "long", "bigint", "float", "double", "string", "text", "varchar", "boolean", "timestamp", "timestamp_tz", "timestamp_ntz", "date", "time", "array", "map", "object", "record", "struct", "bytes", "variant", "json", "null" ] }, "BaseServer": { "type": "object", "properties": { "description": { "type": "string", "description": "An optional string describing the servers." }, "environment": { "type": "string", "description": "The environment in which the servers are running. Examples: prod, sit, stg." }, "type": { "type": "string", "description": "The type of the data product technology that implements the data contract.", "examples": [ "azure", "bigquery", "BigQuery", "clickhouse", "databricks", "dataframe", "glue", "kafka", "kinesis", "local", "oracle", "postgres", "pubsub", "redshift", "sftp", "sqlserver", "snowflake", "s3", "trino" ] }, "roles": { "description": " An optional array of roles that are available and can be requested to access the server for role-based access control. E.g. separate roles for different regions or sensitive data.", "type": "array", "items": { "type": "object", "properties": { "name": { "type": "string", "description": "The name of the role." }, "description": { "type": "string", "description": "A description of the role and what access the role provides." } }, "required": [ "name" ] } } }, "additionalProperties": true, "required": [ "type" ] }, "BigQueryServer": { "type": "object", "title": "BigQueryServer", "properties": { "project": { "type": "string", "description": "The GCP project name." }, "dataset": { "type": "string", "description": "The GCP dataset name." 
} }, "required": [ "project", "dataset" ] }, "S3Server": { "type": "object", "title": "S3Server", "properties": { "location": { "type": "string", "format": "uri", "description": "S3 URL, starting with `s3://`", "examples": [ "s3://datacontract-example-orders-latest/data/{model}/*.json" ] }, "endpointUrl": { "type": "string", "format": "uri", "description": "The server endpoint for S3-compatible servers.", "examples": [ "https://minio.example.com" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. How multiple json documents are delimited within one file" } }, "required": [ "location" ] }, "SftpServer": { "type": "object", "title": "SftpServer", "properties": { "location": { "type": "string", "format": "uri", "pattern": "^sftp://.*", "description": "SFTP URL, starting with `sftp://`", "examples": [ "sftp://123.123.12.123/{model}/*.json" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. How multiple json documents are delimited within one file" } }, "required": [ "location" ] }, "RedshiftServer": { "type": "object", "title": "RedshiftServer", "properties": { "account": { "type": "string", "description": "An optional string describing the server." }, "host": { "type": "string", "description": "An optional string describing the host name." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." 
}, "clusterIdentifier": { "type": "string", "description": "An optional string describing the cluster's identifier.", "examples": [ "redshift-prod-eu", "analytics-cluster" ] }, "port": { "type": "integer", "description": "An optional integer describing the cluster's port.", "examples": [ 5439 ] }, "endpoint": { "type": "string", "description": "An optional string describing the cluster's endpoint.", "examples": [ "analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics" ] } }, "additionalProperties": true, "required": [ "account", "database", "schema" ] }, "AzureServer": { "type": "object", "title": "AzureServer", "properties": { "location": { "type": "string", "format": "uri", "description": "Path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs. Recommended pattern is 'abfss:///'", "examples": [ "abfss://my_container_name/path", "abfss://my_container_name/path/*.json", "az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet", "abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. 
How multiple json documents are delimited within one file" } }, "required": [ "location", "format" ] }, "SqlserverServer": { "type": "object", "title": "SqlserverServer", "properties": { "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server.", "default": 1433, "examples": [ 1433 ] }, "database": { "type": "string", "description": "The name of the database.", "examples": [ "database" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "dbo" ] } }, "required": [ "host", "database", "schema" ] }, "SnowflakeServer": { "type": "object", "title": "SnowflakeServer", "properties": { "account": { "type": "string", "description": "An optional string describing the server." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "account", "database", "schema" ] }, "DatabricksServer": { "type": "object", "title": "DatabricksServer", "properties": { "host": { "type": "string", "description": "The Databricks host", "examples": [ "dbc-abcdefgh-1234.cloud.databricks.com" ] }, "catalog": { "type": "string", "description": "The name of the Hive or Unity catalog" }, "schema": { "type": "string", "description": "The schema name in the catalog" } }, "required": [ "catalog", "schema" ] }, "DataframeServer": { "type": "object", "title": "DataframeServer", "required": [ "type" ] }, "GlueServer": { "type": "object", "title": "GlueServer", "properties": { "account": { "type": "string", "description": "The AWS Glue account", "examples": [ "1234-5678-9012" ] }, "database": { "type": "string", "description": "The AWS Glue database name", "examples": [ "my_database" ] }, "location": { "type": "string", "format": "uri", "description": "The AWS S3 path. 
Must be in the form of a URL.", "examples": [ "s3://datacontract-example-orders-latest/data/{model}" ] }, "format": { "type": "string", "description": "The format of the files", "examples": [ "parquet", "csv", "json", "delta" ] } }, "required": [ "account", "database" ] }, "PostgresServer": { "type": "object", "title": "PostgresServer", "properties": { "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server." }, "database": { "type": "string", "description": "The name of the database.", "examples": [ "postgres" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "public" ] } }, "required": [ "host", "port", "database", "schema" ] }, "OracleServer": { "type": "object", "title": "OracleServer", "properties": { "host": { "type": "string", "description": "The host to the oracle server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the oracle server.", "examples": [ 1523 ] }, "serviceName": { "type": "string", "description": "The name of the service.", "examples": [ "service" ] } }, "required": [ "host", "port", "serviceName" ] }, "KafkaServer": { "type": "object", "title": "KafkaServer", "description": "Kafka Server", "properties": { "host": { "type": "string", "description": "The bootstrap server of the kafka cluster." }, "topic": { "type": "string", "description": "The topic name." }, "format": { "type": "string", "description": "The format of the message. Examples: json, avro, protobuf.", "default": "json" } }, "required": [ "host", "topic" ] }, "PubSubServer": { "type": "object", "title": "PubSubServer", "properties": { "project": { "type": "string", "description": "The GCP project name." }, "topic": { "type": "string", "description": "The topic name." 
} }, "required": [ "project", "topic" ] }, "KinesisDataStreamsServer": { "type": "object", "title": "KinesisDataStreamsServer", "description": "Kinesis Data Streams Server", "properties": { "stream": { "type": "string", "description": "The name of the Kinesis data stream." }, "region": { "type": "string", "description": "AWS region.", "examples": [ "eu-west-1" ] }, "format": { "type": "string", "description": "The format of the record", "examples": [ "json", "avro", "protobuf" ] } }, "required": [ "stream" ] }, "TrinoServer": { "type": "object", "title": "TrinoServer", "properties": { "host": { "type": "string", "description": "The Trino host URL.", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The Trino port." }, "catalog": { "type": "string", "description": "The name of the catalog.", "examples": [ "hive" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "my_schema" ] } }, "required": [ "host", "port", "catalog", "schema" ] }, "ClickhouseServer": { "type": "object", "title": "ClickhouseServer", "properties": { "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server." 
}, "database": { "type": "string", "description": "The name of the database.", "examples": [ "postgres" ] } }, "required": [ "host", "port", "database" ] }, "LocalServer": { "type": "object", "title": "LocalServer", "properties": { "path": { "type": "string", "description": "The relative or absolute path to the data file(s).", "examples": [ "./folder/data.parquet", "./folder/*.parquet" ] }, "format": { "type": "string", "description": "The format of the file(s)", "examples": [ "json", "parquet", "delta", "csv" ] } }, "required": [ "path", "format" ] }, "Quality": { "allOf": [ { "type": "object", "properties": { "type": { "type": "string", "description": "The type of quality check", "enum": [ "text", "library", "sql", "custom" ] }, "description": { "type": "string", "description": "A plain text describing the quality attribute in natural language." } } }, { "if": { "properties": { "type": { "const": "text" } } }, "then": { "required": [ "description" ] } }, { "if": { "properties": { "type": { "const": "sql" } } }, "then": { "properties": { "query": { "type": "string", "description": "A SQL query that returns a single number to compare with the threshold." }, "dialect": { "type": "string", "description": "The SQL dialect that is used for the query. 
Should be compatible to the server.type.", "examples": [ "athena", "bigquery", "redshift", "snowflake", "trino", "postgres", "oracle" ] }, "mustBe": { "type": "integer" }, "mustNotBe": { "type": "integer" }, "mustBeGreaterThan": { "type": "integer" }, "mustBeGreaterThanOrEqualTo": { "type": "integer" }, "mustBeLessThan": { "type": "integer" }, "mustBeLessThanOrEqualTo": { "type": "integer" }, "mustBeBetween": { "type": "array", "items": { "type": "integer" }, "minItems": 2, "maxItems": 2 }, "mustNotBeBetween": { "type": "array", "items": { "type": "integer" }, "minItems": 2, "maxItems": 2 } }, "required": [ "query" ] } }, { "if": { "properties": { "type": { "const": "library" } } }, "then": { "properties": { "rule": { "type": "string", "description": "Define a data quality check based on the predefined rules as per ODCS.", "examples": ["duplicateCount", "validValues", "rowCount"] }, "mustBe": { "description": "Must be equal to the value to be valid. When using numbers, it is equivalent to '='." }, "mustNotBe": { "description": "Must not be equal to the value to be valid. When using numbers, it is equivalent to '!='." }, "mustBeGreaterThan": { "type": "number", "description": "Must be greater than the value to be valid. It is equivalent to '>'." }, "mustBeGreaterOrEqualTo": { "type": "number", "description": "Must be greater than or equal to the value to be valid. It is equivalent to '>='." }, "mustBeLessThan": { "type": "number", "description": "Must be less than the value to be valid. It is equivalent to '<'." }, "mustBeLessOrEqualTo": { "type": "number", "description": "Must be less than or equal to the value to be valid. It is equivalent to '<='." }, "mustBeBetween": { "type": "array", "description": "Must be between the two numbers to be valid. 
Smallest number first in the array.", "minItems": 2, "maxItems": 2, "uniqueItems": true, "items": { "type": "number" } }, "mustNotBeBetween": { "type": "array", "description": "Must not be between the two numbers to be valid. Smallest number first in the array.", "minItems": 2, "maxItems": 2, "uniqueItems": true, "items": { "type": "number" } } }, "required": [ "rule" ] } }, { "if": { "properties": { "type": { "const": "custom" } } }, "then": { "properties": { "description": { "type": "string", "description": "A plain text describing the quality attribute in natural language." }, "engine": { "type": "string", "examples": [ "soda", "great-expectations" ], "description": "The engine used for custom quality checks." }, "implementation": { "type": [ "object", "array", "string" ], "description": "Engine-specific quality checks and expectations." } }, "required": [ "engine" ] } } ] }, "Lineage": { "type": "object", "properties": { "inputFields": { "type": "array", "items": { "type": "object", "properties": { "namespace": { "type": "string", "description": "The input dataset namespace" }, "name": { "type": "string", "description": "The input dataset name" }, "field": { "type": "string", "description": "The input field" }, "transformations": { "type": "array", "items": { "type": "object", "properties": { "type": { "description": "The type of the transformation. 
Allowed values are: DIRECT, INDIRECT", "type": "string" }, "subtype": { "type": "string", "description": "The subtype of the transformation" }, "description": { "type": "string", "description": "a string representation of the transformation applied" }, "masking": { "type": "boolean", "description": "is transformation masking the data or not" } }, "required": [ "type" ], "additionalProperties": true } } }, "additionalProperties": true, "required": [ "namespace", "name", "field" ] } }, "transformationDescription": { "type": "string", "description": "a string representation of the transformation applied", "deprecated": true }, "transformationType": { "type": "string", "description": "IDENTITY|MASKED reflects a clearly defined behavior. IDENTITY: exact same as input; MASKED: no original data available (like a hash of PII for example)", "deprecated": true } }, "additionalProperties": true, "required": [ "inputFields" ] } } } ================================================ FILE: versions/1.2.0/definition.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "properties": { "id": { "type": "string", "description": "A unique identifier for this definition. Encode the domain into the ID, separated by slashes.", "examples": [ "checkout/order_id" ] }, "title": { "type": "string", "description": "The business name of this definition." }, "description": { "type": "string", "description": "Clear and concise explanations related to the domain." }, "type": { "type": "string", "description": "The logical data type." }, "minLength": { "type": "integer", "description": "A value must be greater than or equal to this value. Applies only to string types." }, "maxLength": { "type": "integer", "description": "A value must be less than or equal to this value. 
Applies only to string types." }, "format": { "type": "string", "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." }, "precision": { "type": "integer", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "integer", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression pattern the value must match. Applies only to string types." }, "example": { "type": "string", "description": "An example value for this field.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "Example values for this field." }, "pii": { "type": "boolean", "description": "Indicates if the field contains Personally Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field." }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." 
}, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } } }, "required": [ "type" ] } ================================================ FILE: versions/1.2.1/datacontract.init.yaml ================================================ dataContractSpecification: 1.2.0 id: my-data-contract-id info: title: My Data Contract version: 0.0.1 # description: # owner: # contact: # name: # url: # email: ### servers #servers: # production: # type: s3 # location: s3:// # format: parquet # delimiter: new_line ### terms #terms: # usage: # limitations: # billing: # noticePeriod: ### models # models: # my_model: # description: # type: # fields: # my_field: # type: # description: ### definitions # definitions: # my_field: # domain: # name: # title: # type: # description: # example: # pii: # classification: ### servicelevels #servicelevels: # availability: # description: The server is available during support hours # percentage: 99.9% # retention: # description: Data is retained for one year because! # period: P1Y # unlimited: false # latency: # description: Data is available within 25 hours after the order was placed # threshold: 25h # sourceTimestampField: orders.order_timestamp # processedTimestampField: orders.processed_timestamp # freshness: # description: The age of the youngest row in a table. 
# threshold: 25h # timestampField: orders.order_timestamp # frequency: # description: Data is delivered once a day # type: batch # or streaming # interval: daily # for batch, either or cron # cron: 0 0 * * * # for batch, either or interval # support: # description: The data is available during typical business hours at headquarters # time: 9am to 5pm in EST on business days # responseTime: 1h # backup: # description: Data is backed up once a week, every Sunday at 0:00 UTC. # interval: weekly # cron: 0 0 * * 0 # recoveryTime: 24 hours # recoveryPoint: 1 week ================================================ FILE: versions/1.2.1/datacontract.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "title": "DataContractSpecification", "properties": { "dataContractSpecification": { "type": "string", "title": "DataContractSpecificationVersion", "enum": [ "1.2.1", "1.2.0", "1.1.0", "0.9.3", "0.9.2", "0.9.1", "0.9.0" ], "description": "Specifies the Data Contract Specification being used." }, "id": { "type": "string", "description": "Specifies the identifier of the data contract." }, "info": { "type": "object", "properties": { "title": { "type": "string", "description": "The title of the data contract." }, "version": { "type": "string", "description": "The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version)." }, "status": { "type": "string", "description": "The status of the data contract. Can be proposed, in development, active, retired.", "examples": [ "proposed", "in development", "active", "deprecated", "retired" ] }, "description": { "type": "string", "description": "A description of the data contract." }, "owner": { "type": "string", "description": "The owner or team responsible for managing the data contract and providing the data." 
}, "contact": { "type": "object", "properties": { "name": { "type": "string", "description": "The identifying name of the contact person/organization." }, "url": { "type": "string", "format": "uri", "description": "The URL pointing to the contact information. This MUST be in the form of a URL." }, "email": { "type": "string", "format": "email", "description": "The email address of the contact person/organization. This MUST be in the form of an email address." } }, "description": "Contact information for the data contract.", "additionalProperties": true } }, "additionalProperties": true, "required": [ "title", "version" ], "description": "Metadata and life cycle information about the data contract." }, "servers": { "type": "object", "description": "Information about the servers.", "additionalProperties": { "$ref": "#/$defs/BaseServer", "allOf": [ { "if": { "properties": { "type": { "const": "bigquery" } } }, "then": { "$ref": "#/$defs/BigQueryServer" } }, { "if": { "properties": { "type": { "const": "postgres" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PostgresServer" } }, { "if": { "properties": { "type": { "const": "s3" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/S3Server" } }, { "if": { "properties": { "type": { "const": "sftp" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SftpServer" } }, { "if": { "properties": { "type": { "const": "redshift" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/RedshiftServer" } }, { "if": { "properties": { "type": { "const": "azure" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/AzureServer" } }, { "if": { "properties": { "type": { "const": "sqlserver" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SqlserverServer" } }, { "if": { "properties": { "type": { "const": "snowflake" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/SnowflakeServer" } }, { "if": { "properties": { "type": { "const": "databricks" } }, "required": [ "type" ] }, "then": { "$ref": 
"#/$defs/DatabricksServer" } }, { "if": { "properties": { "type": { "const": "dataframe" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/DataframeServer" } }, { "if": { "properties": { "type": { "const": "glue" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/GlueServer" } }, { "if": { "properties": { "type": { "const": "postgres" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PostgresServer" } }, { "if": { "properties": { "type": { "const": "oracle" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/OracleServer" } }, { "if": { "properties": { "type": { "const": "kafka" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/KafkaServer" } }, { "if": { "properties": { "type": { "const": "pubsub" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/PubSubServer" } }, { "if": { "properties": { "type": { "const": "kinesis" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/KinesisDataStreamsServer" } }, { "if": { "properties": { "type": { "const": "trino" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/TrinoServer" } }, { "if": { "properties": { "type": { "const": "clickhouse" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/ClickhouseServer" } }, { "if": { "properties": { "type": { "const": "local" } }, "required": [ "type" ] }, "then": { "$ref": "#/$defs/LocalServer" } } ] } }, "terms": { "type": "object", "description": "The terms and conditions of the data contract.", "properties": { "usage": { "type": "string", "description": "The usage describes the way the data is expected to be used. Can contain business and technical information." }, "limitations": { "type": "string", "description": "The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for." 
}, "policies": { "type": "array", "items": { "type": "object", "properties": { "type": { "type": "string", "description": "The type of the policy.", "examples": [ "privacy", "security", "retention", "compliance" ] }, "description": { "type": "string", "description": "A description of the policy." }, "url": { "type": "string", "format": "uri", "description": "A URL to the policy document." } }, "additionalProperties": true }, "description": "The policies that apply to the data, such as privacy, security, retention, or compliance policies." }, "billing": { "type": "string", "description": "The billing describes the pricing model for using the data, such as whether it is free, has a monthly fee, or is metered pay-per-use." }, "noticePeriod": { "type": "string", "description": "The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., 'P3M' for a period of three months." } }, "additionalProperties": true }, "models": { "description": "Specifies the logical data model. Use the model's name (e.g., the table name) as the key.", "type": "object", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "object", "title": "Model", "properties": { "description": { "type": "string" }, "type": { "description": "The type of the model. Examples: table, view, object. Default: table.", "type": "string", "title": "ModelType", "default": "table", "enum": [ "table", "view", "object" ] }, "title": { "type": "string", "description": "An optional string providing a human-readable name for the model. Especially useful if the model name is cryptic or contains abbreviations.", "examples": [ "Purchase Orders", "Air Shipments" ] }, "fields": { "description": "Specifies a field in the data model.
Use the field name (e.g., the column name) as the key.", "type": "object", "additionalProperties": { "type": "object", "title": "Field", "properties": { "description": { "type": "string", "description": "An optional string describing the semantic of the data in this field." }, "title": { "type": "string", "description": "An optional string providing a human readable name for the field. Especially useful if the field name is cryptic or contains abbreviations." }, "type": { "$ref": "#/$defs/FieldType" }, "required": { "type": "boolean", "default": false, "description": "An indication, if this field must contain a value and may not be null." }, "fields": { "description": "The nested fields (e.g. columns) of the object, record, or struct.", "type": "object", "additionalProperties": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" } }, "items": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "keys": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "values": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "primary": { "type": "boolean", "deprecationMessage": "Use the primaryKey field instead." }, "primaryKey": { "type": "boolean", "default": false, "description": "If this field is a primary key." }, "references": { "type": "string", "description": "The reference to a field in another model. E.g. use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship.", "examples": [ "orders.order_id", "model.nested_field.field" ] }, "unique": { "type": "boolean", "default": false, "description": "An indication, if the value must be unique within the model." }, "enum": { "type": "array", "items": { "type": "string" }, "uniqueItems": true, "description": "A value must be equal to one of the elements in this array value. 
Only evaluated if the value is not null." }, "minLength": { "type": "integer", "description": "The length of a value must be greater than or equal to this value. Only applies to string types." }, "maxLength": { "type": "integer", "description": "The length of a value must be less than or equal to this value. Only applies to string types." }, "format": { "type": "string", "description": "A specific format the value must comply with (e.g., 'email', 'uri', 'uuid').", "examples": [ "email", "uri", "uuid" ] }, "precision": { "type": "number", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "number", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression the value must match. Only applies to string types.", "examples": [ "^[a-zA-Z0-9_-]+$" ] }, "minimum": { "type": "number", "description": "A numeric value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMinimum": { "type": "number", "description": "A numeric value must be greater than this value. Only evaluated if the value is not null. Only applies to numeric values." }, "maximum": { "type": "number", "description": "A numeric value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMaximum": { "type": "number", "description": "A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values." }, "example": { "type": "string", "description": "An example value for this field.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "Example values for this field."
}, "pii": { "type": "boolean", "description": "An indication, if this field contains Personal Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field, according to the organization's classification scheme.", "examples": [ "sensitive", "restricted", "internal", "public" ] }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." }, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } }, "$ref": { "type": "string", "description": "A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition." }, "quality": { "type": "array", "items": { "$ref": "#/$defs/Quality" } }, "lineage": { "$ref": "#/$defs/Lineage" }, "config": { "type": "object", "description": "Additional metadata for field configuration.", "additionalProperties": { "type": [ "string", "number", "boolean", "object", "array", "null" ] }, "properties": { "avroType": { "type": "string", "description": "Specify the field type to use when exporting the data model to Apache Avro." }, "avroLogicalType": { "type": "string", "description": "Specify the logical field type to use when exporting the data model to Apache Avro." }, "bigqueryType": { "type": "string", "description": "Specify the physical column type that is used in a BigQuery table, e.g., `NUMERIC(5, 2)`." }, "snowflakeType": { "type": "string", "description": "Specify the physical column type that is used in a Snowflake table, e.g., `TIMESTAMP_LTZ`." 
}, "redshiftType": { "type": "string", "description": "Specify the physical column type that is used in a Redshift table, e.g., `SMALLINT`." }, "sqlserverType": { "type": "string", "description": "Specify the physical column type that is used in a SQL Server table, e.g., `DATETIME2`." }, "databricksType": { "type": "string", "description": "Specify the physical column type that is used in a Databricks Unity Catalog table." }, "glueType": { "type": "string", "description": "Specify the physical column type that is used in an AWS Glue Data Catalog table." } } } } } }, "primaryKey": { "type": "array", "items": { "type": "string" }, "description": "The compound primary key of the model." }, "quality": { "type": "array", "items": { "$ref": "#/$defs/Quality" } }, "examples": { "type": "array" }, "additionalFields": { "type": "boolean", "description": " Specify, if the model can have additional fields that are not defined in the contract. ", "default": false }, "config": { "type": "object", "description": "Additional metadata for model configuration.", "additionalProperties": { "type": [ "string", "number", "boolean", "object", "array", "null" ] }, "properties": { "avroNamespace": { "type": "string", "description": "The namespace to use when importing and exporting the data model from / to Apache Avro." } } } } } }, "definitions": { "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "type": "object", "propertyNames": { "pattern": "^[a-zA-Z0-9/_-]+$" }, "additionalProperties": { "type": "object", "title": "Definition", "properties": { "domain": { "type": "string", "description": "The domain in which this definition is valid.", "default": "global", "deprecationMessage": "This field is deprecated. Encode the domain into the ID using slashes." }, "name": { "type": "string", "description": "The technical name of this definition.", "deprecationMessage": "This field is deprecated. 
Encode the name into the ID using slashes." }, "title": { "type": "string", "description": "The business name of this definition." }, "description": { "type": "string", "description": "Clear and concise explanations related to the domain." }, "type": { "$ref": "#/$defs/FieldType" }, "fields": { "description": "The nested fields (e.g. columns) of the object, record, or struct.", "type": "object", "additionalProperties": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" } }, "items": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "keys": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "values": { "$ref": "#/properties/models/additionalProperties/properties/fields/additionalProperties" }, "minLength": { "type": "integer", "description": "The length of a value must be greater than or equal to this value. Applies only to string types." }, "maxLength": { "type": "integer", "description": "The length of a value must be less than or equal to this value. Applies only to string types." }, "format": { "type": "string", "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." }, "precision": { "type": "integer", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "integer", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression pattern the value must match. Applies only to string types." }, "minimum": { "type": "number", "description": "A numeric value must be greater than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMinimum": { "type": "number", "description": "A numeric value must be greater than this value.
Only evaluated if the value is not null. Only applies to numeric values." }, "maximum": { "type": "number", "description": "A numeric value must be less than or equal to this value. Only evaluated if the value is not null. Only applies to numeric values." }, "exclusiveMaximum": { "type": "number", "description": "A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values." }, "example": { "type": "string", "description": "An example value.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "Example values." }, "pii": { "type": "boolean", "description": "Indicates if the field contains Personally Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field." }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context."
}, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } } }, "required": [ "type" ] } }, "servicelevels": { "type": "object", "description": "Specifies the service level agreements for the provided data, including availability, data retention policies, latency requirements, data freshness, update frequency, support availability, and backup policies.", "properties": { "availability": { "type": "object", "description": "Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data.", "properties": { "description": { "type": "string", "description": "An optional string describing the availability service level.", "example": "The server is available during support hours" }, "percentage": { "type": "string", "description": "An optional string describing the guaranteed uptime in percent (e.g., `99.9%`)", "pattern": "^\\d+(\\.\\d+)?%$", "example": "99.9%" } } }, "retention": { "type": "object", "description": "Retention covers the period how long data will be available.", "properties": { "description": { "type": "string", "description": "An optional string describing the retention service level.", "example": "Data is retained for one year." }, "period": { "type": "string", "description": "An optional period of time, how long data is available. 
Supported formats: Simple duration (e.g., `1 year`, `30d`) and ISO 8601 duration (e.g., `P1Y`).", "example": "P1Y" }, "unlimited": { "type": "boolean", "description": "An optional indicator that data is kept forever.", "example": false }, "timestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp that the period refers to.", "example": "orders.order_timestamp" } } }, "latency": { "type": "object", "description": "Latency refers to the maximum amount of time it takes for data to travel from the source to its destination.", "properties": { "description": { "type": "string", "description": "An optional string describing the latency service level.", "example": "Data is available within 25 hours after the order was placed." }, "threshold": { "type": "string", "description": "An optional maximum duration between the source timestamp and the processed timestamp. Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`).", "example": "25h" }, "sourceTimestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp when the data was provided at the source.", "example": "orders.order_timestamp" }, "processedTimestampField": { "type": "string", "description": "An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract.", "example": "orders.processed_timestamp" } } }, "freshness": { "type": "object", "description": "The maximum age of the youngest row in a table.", "properties": { "description": { "type": "string", "description": "An optional string describing the freshness service level.", "example": "The age of the youngest row in a table is within 25 hours." }, "threshold": { "type": "string", "description": "An optional maximum age of the youngest entry.
Supported formats: Simple duration (e.g., `24 hours`, `5s`) and ISO 8601 duration (e.g., `PT24H`).", "example": "25h" }, "timestampField": { "type": "string", "description": "An optional reference to the field that contains the timestamp that the threshold refers to.", "example": "orders.order_timestamp" } } }, "frequency": { "type": "object", "description": "Frequency describes how often data is updated.", "properties": { "description": { "type": "string", "description": "An optional string describing the frequency service level.", "example": "Data is delivered once a day." }, "type": { "type": "string", "enum": [ "batch", "micro-batching", "streaming", "manual" ], "description": "The method of data processing.", "example": "batch" }, "interval": { "type": "string", "description": "Optional. Only for batch: How often the pipeline is triggered, e.g., `daily`.", "example": "daily" }, "cron": { "type": "string", "description": "Optional. Only for batch: A cron expression describing when the pipeline is triggered, e.g., `0 0 * * *`.", "example": "0 0 * * *" } } }, "support": { "type": "object", "description": "Support describes the times when support will be available for contact.", "properties": { "description": { "type": "string", "description": "An optional string describing the support service level.", "example": "The data is available during typical business hours at headquarters." }, "time": { "type": "string", "description": "An optional string describing the times when support will be available for contact such as `24/7` or `business hours only`.", "example": "9am to 5pm in EST on business days" }, "responseTime": { "type": "string", "description": "An optional string describing the time it takes for the support team to acknowledge a request.
This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with.", "example": "24 hours" } } }, "backup": { "type": "object", "description": "Backup specifies details about data backup procedures.", "properties": { "description": { "type": "string", "description": "An optional string describing the backup service level.", "example": "Data is backed up once a week, every Sunday at 0:00 UTC." }, "interval": { "type": "string", "description": "An optional interval that defines how often data will be backed up, e.g., `daily`.", "example": "weekly" }, "cron": { "type": "string", "description": "An optional cron expression when data will be backed up, e.g., `0 0 * * *`.", "example": "0 0 * * 0" }, "recoveryTime": { "type": "string", "description": "An optional Recovery Time Objective (RTO) specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours).", "example": "24 hours" }, "recoveryPoint": { "type": "string", "description": "An optional Recovery Point Objective (RPO) defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours).", "example": "1 week" } } } } }, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } }, "tags": { "type": "array", "items": { "type": "string", "description": "Tags to facilitate searching and filtering.", "examples": [ "databricks", "pii", "sensitive" ] }, "description": "Tags to facilitate searching and filtering." 
} }, "required": [ "dataContractSpecification", "id", "info" ], "$defs": { "FieldType": { "type": "string", "title": "FieldType", "description": "The logical data type of the field.", "enum": [ "number", "decimal", "numeric", "int", "integer", "long", "bigint", "float", "double", "string", "text", "varchar", "boolean", "timestamp", "timestamp_tz", "timestamp_ntz", "date", "time", "array", "map", "object", "record", "struct", "bytes", "variant", "json", "null" ] }, "BaseServer": { "type": "object", "properties": { "description": { "type": "string", "description": "An optional string describing the servers." }, "environment": { "type": "string", "description": "The environment in which the servers are running. Examples: prod, sit, stg." }, "type": { "type": "string", "description": "The type of the data product technology that implements the data contract.", "examples": [ "azure", "bigquery", "BigQuery", "clickhouse", "databricks", "dataframe", "glue", "kafka", "kinesis", "local", "oracle", "postgres", "pubsub", "redshift", "sftp", "sqlserver", "snowflake", "s3", "trino" ] }, "roles": { "description": " An optional array of roles that are available and can be requested to access the server for role-based access control. E.g. separate roles for different regions or sensitive data.", "type": "array", "items": { "type": "object", "properties": { "name": { "type": "string", "description": "The name of the role." }, "description": { "type": "string", "description": "A description of the role and what access the role provides." } }, "required": [ "name" ] } } }, "additionalProperties": true, "required": [ "type" ] }, "BigQueryServer": { "type": "object", "title": "BigQueryServer", "properties": { "project": { "type": "string", "description": "The GCP project name." }, "dataset": { "type": "string", "description": "The GCP dataset name." 
} }, "required": [ "project", "dataset" ] }, "S3Server": { "type": "object", "title": "S3Server", "properties": { "location": { "type": "string", "format": "uri", "description": "S3 URL, starting with `s3://`", "examples": [ "s3://datacontract-example-orders-latest/data/{model}/*.json" ] }, "endpointUrl": { "type": "string", "format": "uri", "description": "The server endpoint for S3-compatible servers.", "examples": [ "https://minio.example.com" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. How multiple json documents are delimited within one file" } }, "required": [ "location" ] }, "SftpServer": { "type": "object", "title": "SftpServer", "properties": { "location": { "type": "string", "format": "uri", "pattern": "^sftp://.*", "description": "SFTP URL, starting with `sftp://`", "examples": [ "sftp://123.123.12.123/{model}/*.json" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. How multiple json documents are delimited within one file" } }, "required": [ "location" ] }, "RedshiftServer": { "type": "object", "title": "RedshiftServer", "properties": { "account": { "type": "string", "description": "An optional string describing the server." }, "host": { "type": "string", "description": "An optional string describing the host name." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." 
}, "clusterIdentifier": { "type": "string", "description": "An optional string describing the cluster's identifier.", "examples": [ "redshift-prod-eu", "analytics-cluster" ] }, "port": { "type": "integer", "description": "An optional string describing the cluster's port.", "examples": [ 5439 ] }, "endpoint": { "type": "string", "description": "An optional string describing the cluster's endpoint.", "examples": [ "analytics-cluster.example.eu-west-1.redshift.amazonaws.com:5439/analytics" ] } }, "additionalProperties": true, "required": [ "account", "database", "schema" ] }, "AzureServer": { "type": "object", "title": "AzureServer", "properties": { "location": { "type": "string", "format": "uri", "description": "Path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs. Recommended pattern is 'abfss:///'", "examples": [ "abfss://my_container_name/path", "abfss://my_container_name/path/*.json", "az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet", "abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet" ] }, "format": { "type": "string", "enum": [ "parquet", "delta", "json", "csv" ], "description": "File format." }, "delimiter": { "type": "string", "enum": [ "new_line", "array" ], "description": "Only for format = json. 
How multiple json documents are delimited within one file" } }, "required": [ "location", "format" ] }, "SqlserverServer": { "type": "object", "title": "SqlserverServer", "properties": { "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server.", "default": 1433, "examples": [ 1433 ] }, "database": { "type": "string", "description": "The name of the database.", "examples": [ "database" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "dbo" ] } }, "required": [ "host", "database", "schema" ] }, "SnowflakeServer": { "type": "object", "title": "SnowflakeServer", "properties": { "account": { "type": "string", "description": "An optional string describing the server." }, "database": { "type": "string", "description": "An optional string describing the server." }, "schema": { "type": "string", "description": "An optional string describing the server." } }, "required": [ "account", "database", "schema" ] }, "DatabricksServer": { "type": "object", "title": "DatabricksServer", "properties": { "host": { "type": "string", "description": "The Databricks host", "examples": [ "dbc-abcdefgh-1234.cloud.databricks.com" ] }, "catalog": { "type": "string", "description": "The name of the Hive or Unity catalog" }, "schema": { "type": "string", "description": "The schema name in the catalog" } }, "required": [ "catalog", "schema" ] }, "DataframeServer": { "type": "object", "title": "DataframeServer", "required": [ "type" ] }, "GlueServer": { "type": "object", "title": "GlueServer", "properties": { "account": { "type": "string", "description": "The AWS Glue account", "examples": [ "1234-5678-9012" ] }, "database": { "type": "string", "description": "The AWS Glue database name", "examples": [ "my_database" ] }, "location": { "type": "string", "format": "uri", "description": "The AWS S3 path. 
Must be in the form of a URL.", "examples": [ "s3://datacontract-example-orders-latest/data/{model}" ] }, "format": { "type": "string", "description": "The format of the files", "examples": [ "parquet", "csv", "json", "delta" ] } }, "required": [ "account", "database" ] }, "PostgresServer": { "type": "object", "title": "PostgresServer", "properties": { "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server." }, "database": { "type": "string", "description": "The name of the database.", "examples": [ "postgres" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "public" ] } }, "required": [ "host", "port", "database", "schema" ] }, "OracleServer": { "type": "object", "title": "OracleServer", "properties": { "host": { "type": "string", "description": "The host to the oracle server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the oracle server.", "examples": [ 1523 ] }, "serviceName": { "type": "string", "description": "The name of the service.", "examples": [ "service" ] } }, "required": [ "host", "port", "serviceName" ] }, "KafkaServer": { "type": "object", "title": "KafkaServer", "description": "Kafka Server", "properties": { "host": { "type": "string", "description": "The bootstrap server of the kafka cluster." }, "topic": { "type": "string", "description": "The topic name." }, "format": { "type": "string", "description": "The format of the message. Examples: json, avro, protobuf.", "default": "json" } }, "required": [ "host", "topic" ] }, "PubSubServer": { "type": "object", "title": "PubSubServer", "properties": { "project": { "type": "string", "description": "The GCP project name." }, "topic": { "type": "string", "description": "The topic name." 
} }, "required": [ "project", "topic" ] }, "KinesisDataStreamsServer": { "type": "object", "title": "KinesisDataStreamsServer", "description": "Kinesis Data Streams Server", "properties": { "stream": { "type": "string", "description": "The name of the Kinesis data stream." }, "region": { "type": "string", "description": "AWS region.", "examples": [ "eu-west-1" ] }, "format": { "type": "string", "description": "The format of the record", "examples": [ "json", "avro", "protobuf" ] } }, "required": [ "stream" ] }, "TrinoServer": { "type": "object", "title": "TrinoServer", "properties": { "host": { "type": "string", "description": "The Trino host URL.", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The Trino port." }, "catalog": { "type": "string", "description": "The name of the catalog.", "examples": [ "hive" ] }, "schema": { "type": "string", "description": "The name of the schema in the database.", "examples": [ "my_schema" ] } }, "required": [ "host", "port", "catalog", "schema" ] }, "ClickhouseServer": { "type": "object", "title": "ClickhouseServer", "properties": { "host": { "type": "string", "description": "The host to the database server", "examples": [ "localhost" ] }, "port": { "type": "integer", "description": "The port to the database server." 
}, "database": { "type": "string", "description": "The name of the database.", "examples": [ "postgres" ] } }, "required": [ "host", "port", "database" ] }, "LocalServer": { "type": "object", "title": "LocalServer", "properties": { "path": { "type": "string", "description": "The relative or absolute path to the data file(s).", "examples": [ "./folder/data.parquet", "./folder/*.parquet" ] }, "format": { "type": "string", "description": "The format of the file(s)", "examples": [ "json", "parquet", "delta", "csv" ] } }, "required": [ "path", "format" ] }, "Quality": { "allOf": [ { "type": "object", "properties": { "type": { "type": "string", "description": "The type of quality check", "enum": [ "text", "library", "sql", "custom" ] }, "description": { "type": "string", "description": "A plain text describing the quality attribute in natural language." } } }, { "if": { "properties": { "type": { "const": "text" } } }, "then": { "required": [ "description" ] } }, { "if": { "properties": { "type": { "const": "sql" } } }, "then": { "properties": { "query": { "type": "string", "description": "A SQL query that returns a single number to compare with the threshold." }, "dialect": { "type": "string", "description": "The SQL dialect that is used for the query. 
Should be compatible to the server.type.", "examples": [ "athena", "bigquery", "redshift", "snowflake", "trino", "postgres", "oracle" ] }, "mustBe": { "type": "number" }, "mustNotBe": { "type": "number" }, "mustBeGreaterThan": { "type": "number" }, "mustBeGreaterOrEqualTo": { "type": "number" }, "mustBeGreaterThanOrEqualTo": { "type": "number", "deprecated": true }, "mustBeLessThan": { "type": "number" }, "mustBeLessThanOrEqualTo": { "type": "number", "deprecated": true }, "mustBeLessOrEqualTo": { "type": "number" }, "mustBeBetween": { "type": "array", "items": { "type": "number" }, "minItems": 2, "maxItems": 2 }, "mustNotBeBetween": { "type": "array", "items": { "type": "number" }, "minItems": 2, "maxItems": 2 } }, "required": [ "query" ] } }, { "if": { "properties": { "type": { "const": "library" } } }, "then": { "properties": { "metric": { "type": "string", "description": "The DataQualityLibrary metric to use for the quality check.", "examples": ["nullValues", "missingValues", "invalidValues", "duplicateValues", "rowCount"] }, "rule": { "type": "string", "deprecated": true, "description": "Deprecated. Use metric instead" }, "arguments": { "type": "object", "description": "Additional metric-specific parameters for the quality check.", "additionalProperties": { "type": ["string", "number", "boolean", "array", "object"] } }, "mustBe": { "description": "Must be equal to the value to be valid. When using numbers, it is equivalent to '='." }, "mustNotBe": { "description": "Must not be equal to the value to be valid. When using numbers, it is equivalent to '!='." }, "mustBeGreaterThan": { "type": "number", "description": "Must be greater than the value to be valid. It is equivalent to '>'." }, "mustBeGreaterOrEqualTo": { "type": "number", "description": "Must be greater than or equal to the value to be valid. It is equivalent to '>='." }, "mustBeLessThan": { "type": "number", "description": "Must be less than the value to be valid. It is equivalent to '<'." 
}, "mustBeLessOrEqualTo": { "type": "number", "description": "Must be less than or equal to the value to be valid. It is equivalent to '<='." }, "mustBeBetween": { "type": "array", "description": "Must be between the two numbers to be valid. Smallest number first in the array.", "minItems": 2, "maxItems": 2, "uniqueItems": true, "items": { "type": "number" } }, "mustNotBeBetween": { "type": "array", "description": "Must not be between the two numbers to be valid. Smallest number first in the array.", "minItems": 2, "maxItems": 2, "uniqueItems": true, "items": { "type": "number" } } }, "required": [ "metric" ] } }, { "if": { "properties": { "type": { "const": "custom" } } }, "then": { "properties": { "description": { "type": "string", "description": "A plain text describing the quality attribute in natural language." }, "engine": { "type": "string", "examples": [ "soda", "great-expectations" ], "description": "The engine used for custom quality checks." }, "implementation": { "type": [ "object", "array", "string" ], "description": "Engine-specific quality checks and expectations." } }, "required": [ "engine" ] } } ] }, "Lineage": { "type": "object", "properties": { "inputFields": { "type": "array", "items": { "type": "object", "properties": { "namespace": { "type": "string", "description": "The input dataset namespace" }, "name": { "type": "string", "description": "The input dataset name" }, "field": { "type": "string", "description": "The input field" }, "transformations": { "type": "array", "items": { "type": "object", "properties": { "type": { "description": "The type of the transformation. 
Allowed values are: DIRECT, INDIRECT", "type": "string" }, "subtype": { "type": "string", "description": "The subtype of the transformation" }, "description": { "type": "string", "description": "a string representation of the transformation applied" }, "masking": { "type": "boolean", "description": "is transformation masking the data or not" } }, "required": [ "type" ], "additionalProperties": true } } }, "additionalProperties": true, "required": [ "namespace", "name", "field" ] } }, "transformationDescription": { "type": "string", "description": "a string representation of the transformation applied", "deprecated": true }, "transformationType": { "type": "string", "description": "IDENTITY|MASKED reflects a clearly defined behavior. IDENTITY: exact same as input; MASKED: no original data available (like a hash of PII for example)", "deprecated": true } }, "additionalProperties": true, "required": [ "inputFields" ] } } } ================================================ FILE: versions/1.2.1/definition.schema.json ================================================ { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "description": "Clear and concise explanations of syntax, semantic, and classification of business objects in a given domain.", "properties": { "id": { "type": "string", "description": "A unique identifier for this definition. Encode the domain into the ID, separated by slashes.", "examples": [ "checkout/order_id" ] }, "title": { "type": "string", "description": "The business name of this definition." }, "description": { "type": "string", "description": "Clear and concise explanations related to the domain." }, "type": { "type": "string", "description": "The logical data type." }, "minLength": { "type": "integer", "description": "A value must be greater than or equal to this value. Applies only to string types." }, "maxLength": { "type": "integer", "description": "A value must be less than or equal to this value. 
Applies only to string types." }, "format": { "type": "string", "description": "Specific format requirements for the value (e.g., 'email', 'uri', 'uuid')." }, "precision": { "type": "integer", "examples": [ 38 ], "description": "The maximum number of digits in a number. Only applies to numeric values. Defaults to 38." }, "scale": { "type": "integer", "examples": [ 0 ], "description": "The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0." }, "pattern": { "type": "string", "description": "A regular expression pattern the value must match. Applies only to string types." }, "example": { "type": "string", "description": "An example value for this field.", "deprecationMessage": "Use the examples field instead." }, "examples": { "type": "array", "description": "A examples value for this field." }, "pii": { "type": "boolean", "description": "Indicates if the field contains Personal Identifiable Information (PII)." }, "classification": { "type": "string", "description": "The data class defining the sensitivity level for this field." }, "tags": { "type": "array", "items": { "type": "string" }, "description": "Custom metadata to provide additional context." }, "links": { "type": "object", "description": "Links to external resources.", "minProperties": 1, "propertyNames": { "pattern": "^[a-zA-Z0-9_-]+$" }, "additionalProperties": { "type": "string", "title": "Link", "description": "A URL to an external resource.", "format": "uri", "examples": [ "https://example.com" ] } } }, "required": [ "type" ] } ================================================ FILE: workshop.md ================================================ # Data Contract Workshop Bring data producers and consumers together to define data contracts in a facilitated workshop. ## Goal A defined and agreed upon data contract between data producers and consumers. 
## Participants

- Facilitator
  - Neutral moderator and typist
  - Should know the data contract format in use ([Data Contract Specification](https://datacontract.com) or [ODCS](https://bitol-io.github.io/open-data-contract-standard/latest/)) and its tools well
  - Get the [authors of the Data Contract Specification](https://datacontract.com/#authors) as facilitators for your workshop.
- Data producer
  - Product Owner
  - Software Engineers
- Data consumers
  - Product Owner
  - Data Engineers / Scientists / Analysts

Recommendation: keep the group small (no more than 5 people)

## Settings

- Show the data contract on the screen for the whole workshop (projector, screenshare, ...)
- Facilitator is the typist
- Facilitator is moderator
- Data Producer and Data Consumers discuss and give commands to the facilitator

## Guidelines for the Data Contract Specification

### Recommended Order of Completion (Data Contract Specification)

1. Info (get the context)
2. Examples (example-driven facilitation)
3. Model (you will spend most of your time here)
   - Use the [Data Contract CLI](https://cli.datacontract.com) to test the model against the previously created examples:\\
     `datacontract test --examples datacontract.yaml`
4. Quality
5. Terms
6. Servers (if already applicable)
   - Start with a "local" server with actual, real data you downloaded
   - Use the [Data Contract CLI](https://cli.datacontract.com) to test the model against the actual data on a specific server:\\
     `datacontract test datacontract.yaml`
   - Switch to the actual remote server, if applicable

### Tooling (Data Contract Specification)

- Open the [starter template](https://datacontract.com/datacontract.init.yaml) in the [Data Contract Editor](https://editor.datacontract.com) and get going. If you lack an experienced facilitator, ignore any validation errors and warnings within the editor.
- Use the [Data Contract Editor](https://editor.datacontract.com) to share the results of the workshop afterward with the participants and other stakeholders.
- Use the [Data Contract CLI](https://cli.datacontract.com) to validate the data contract after the workshop.
- Use the [Data Mesh Manager](https://www.datamesh-manager.com) to publish the data contract and have it in a central place

## Guidelines for ODCS

We recommend using the [Excel template](https://github.com/datacontract/open-data-contract-standard-excel-template) for workshops, as its visualization makes it easier to work with in such a setting.

### Recommended Order of Completion (ODCS)

1. Fundamentals (get the context)
   - **[Fill in the fundamentals](https://bitol-io.github.io/open-data-contract-standard/latest/#fundamentals)** consisting of id, name, version, status, and description.
2. Schema (you will spend most of your time here)
   - **[Fill in the schemas](https://bitol-io.github.io/open-data-contract-standard/latest/#schema)** (tables) and their properties (columns), starting with their name and logicalType.
   - After that, add information like `description`, `classification`, ...
   - Use tags or customProperties to add additional metadata where there is no direct support by ODCS
3. Quality
   - **[Add quality checks](https://bitol-io.github.io/open-data-contract-standard/latest/#data-quality)** at the schema or the property level. Start with quality checks of type text to capture the requirements.
   - OPTIONAL: Convert the text-based requirements into automated sql-based quality checks
4. SLAs
   - **[Add SLAs](https://bitol-io.github.io/open-data-contract-standard/latest/#service-level-agreement-sla)** that the data provider guarantees towards all data consumers.
5. Team & Support
   - **[Add the team members](https://bitol-io.github.io/open-data-contract-standard/latest/#team)** so that the data consumer knows who is part of the team that owns the data protected by the data contracts.
   - **[Add a support channel](https://bitol-io.github.io/open-data-contract-standard/latest/#support-and-communication-channels)** so (potential) data consumers know how to get support and reach the data owners.
6. Servers (if already applicable)
   - **[Add the server information](https://bitol-io.github.io/open-data-contract-standard/latest/#infrastructure-and-servers)** on where the data is available
   - Use the [Data Contract CLI](https://cli.datacontract.com) to test the schema against the actual data on a specific server:\\
     `datacontract test datacontract.yaml`

### Tooling (ODCS)

- Use the [Excel template](https://github.com/datacontract/open-data-contract-standard-excel-template) for the workshop
- Use the [Data Contract CLI](https://cli.datacontract.com) to validate the data contract after the workshop.
- Use the [Data Mesh Manager](https://www.datamesh-manager.com) to publish the data contract and have it in a central place

## Related

- This data contract workshop could be a follow-up to a data product design workshop using the [Data Product Canvas](https://www.datamesh-architecture.com/data-product-canvas), making the offered contract at the output port of the designed data product more concrete.
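
As an illustration of the Quality step in the Data Contract Specification guidelines above, the following sketch shows how a requirement captured as plain text during the workshop might later be turned into an automated SQL check. The model name `orders`, the requirement, and the query are hypothetical. The `type`, `description`, `query`, and `mustBeGreaterThan` keys follow the quality definition in `datacontract.schema.json`; placing the checks in a model-level `quality` list is an assumption of this sketch.

```yaml
# Hypothetical fragment of a datacontract.yaml; names and values are illustrative only.
models:
  orders:                      # hypothetical model name
    quality:
      # During the workshop: capture the requirement in natural language.
      - type: text
        description: The orders table should never be empty.
      # Optionally, after the workshop: automate the same requirement.
      - type: sql
        description: The orders table should never be empty.
        query: |
          SELECT COUNT(*) FROM orders
        mustBeGreaterThan: 0   # the query must return a single number
```

Once a server is configured, the `datacontract test datacontract.yaml` command from the guidelines above could then be used to check the contract against the actual data.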