Repository: common-voice/cv-dataset
Branch: main
Commit: d93bea708ce1
Files: 72
Total size: 5.3 MB
Directory structure:
gitextract_44_0gd_b/
├── CHANGELOG.md
├── LICENSE
├── README.md
├── datasets/
│   ├── code-switching/
│   │   └── README.md
│   ├── scripted-speech/
│   │   ├── CHANGELOG.md
│   │   ├── README.md
│   │   ├── cv-corpus-1.json
│   │   ├── cv-corpus-10.0-2022-07-04.json
│   │   ├── cv-corpus-10.0-delta-2022-07-04.json
│   │   ├── cv-corpus-11.0-2022-09-21.json
│   │   ├── cv-corpus-11.0-delta-2022-09-21.json
│   │   ├── cv-corpus-12.0-2022-12-07.json
│   │   ├── cv-corpus-12.0-delta-2022-12-07.json
│   │   ├── cv-corpus-13.0-2023-03-09.json
│   │   ├── cv-corpus-13.0-delta-2023-03-09.json
│   │   ├── cv-corpus-14.0-2023-06-23.json
│   │   ├── cv-corpus-14.0-delta-2023-06-23.json
│   │   ├── cv-corpus-15.0-2023-09-08.json
│   │   ├── cv-corpus-15.0-delta-2023-09-08.json
│   │   ├── cv-corpus-16.0-2023-12-06.json
│   │   ├── cv-corpus-16.0-delta-2023-12-06.json
│   │   ├── cv-corpus-16.1-2023-12-06.json
│   │   ├── cv-corpus-16.1-delta-2023-12-06.json
│   │   ├── cv-corpus-17.0-2024-03-15.json
│   │   ├── cv-corpus-17.0-delta-2024-03-15.json
│   │   ├── cv-corpus-18.0-2024-06-14.json
│   │   ├── cv-corpus-18.0-delta-2024-06-14.json
│   │   ├── cv-corpus-19.0-2024-09-13.json
│   │   ├── cv-corpus-19.0-delta-2024-09-13.json
│   │   ├── cv-corpus-2.json
│   │   ├── cv-corpus-20.0-2024-12-06.json
│   │   ├── cv-corpus-20.0-delta-2024-12-06.json
│   │   ├── cv-corpus-21.0-2025-03-14.json
│   │   ├── cv-corpus-21.0-delta-2025-03-14.json
│   │   ├── cv-corpus-22.0-2025-06-20.json
│   │   ├── cv-corpus-22.0-delta-2025-06-20.json
│   │   ├── cv-corpus-23.0-2025-09-05.json
│   │   ├── cv-corpus-23.0-delta-2025-09-05.json
│   │   ├── cv-corpus-24.0-2025-12-05.json
│   │   ├── cv-corpus-24.0-delta-2025-12-05.json
│   │   ├── cv-corpus-25.0-2026-03-09.json
│   │   ├── cv-corpus-25.0-delta-2026-03-09.json
│   │   ├── cv-corpus-3.json
│   │   ├── cv-corpus-4-2019-12-10.json
│   │   ├── cv-corpus-5-2020-06-22.json
│   │   ├── cv-corpus-5-singleword.json
│   │   ├── cv-corpus-5.1-2020-06-22.json
│   │   ├── cv-corpus-5.1-singleword.json
│   │   ├── cv-corpus-6.0-2020-12-11.json
│   │   ├── cv-corpus-6.0-singleword.json
│   │   ├── cv-corpus-6.1-2020-12-11.json
│   │   ├── cv-corpus-6.1-singleword.json
│   │   ├── cv-corpus-7.0-2021-07-21.json
│   │   ├── cv-corpus-7.0-singleword.json
│   │   ├── cv-corpus-8.0-2022-01-19.json
│   │   └── cv-corpus-9.0-2022-04-27.json
│   └── spontaneous-speech/
│       ├── .gitkeep
│       ├── CHANGELOG.md
│       ├── README.md
│       ├── sps-corpus-1.0-2025-09-05.json
│       ├── sps-corpus-2.0-2025-12-05.json
│       ├── sps-corpus-2.0-delta-2025-12-05.json
│       ├── sps-corpus-3.0-2026-03-09.json
│       └── sps-corpus-3.0-delta-2026-03-09.json
└── helpers/
    ├── .eslintrc.json
    ├── README.md
    ├── common.js
    ├── compareReleases.js
    ├── createDeltaStatistics.js
    ├── createStats.js
    ├── jsconfig.json
    └── recalculateStats.js
================================================
FILE CONTENTS
================================================
================================================
FILE: CHANGELOG.md
================================================
# Changelog
Changelogs are maintained per dataset type:
- [Scripted Speech (SCS)](datasets/scripted-speech/CHANGELOG.md) -- 25 releases (v1 through v25.0)
- [Spontaneous Speech (SPS)](datasets/spontaneous-speech/CHANGELOG.md) -- 3 releases (v1.0 through v3.0)
- [Code Switching (CS)](datasets/code-switching/README.md) -- planned, no releases yet
## Major Changes with March 2026 Releases
The March 2026 release cycle (SCS v25.0 / SPS v3.0) introduces significant infrastructure and tooling changes across the Common Voice dataset ecosystem. Below is a summary; see each dataset type's changelog for details relevant to dataset consumers.
- **Multi-modality dataset statistics.** This repository (`cv-dataset`) now tracks release statistics for all dataset types (SCS, SPS, CS). Helper scripts (`createStats.js`, `compareReleases.js`, `createDeltaStatistics.js`, `recalculateStats.js`) were refactored to handle both SCS and SPS data formats, with per-type handlers and recursive comparison for nested SPS objects.
- **SCS & SPS bundler changes.** The Scripted Speech bundler gains a new `variant` option and the ability to handle licensed datasets. The Spontaneous Speech bundler reached its first production release, matching its SCS counterpart where possible. It supports four release types (`full`, `delta`, `variants`, `statistics`) and handles delta releases gracefully by passively skipping locales with zero new activity.
- **Embedded QA pipeline.** The SPS bundler now embeds the quality-control-data-pipeline as a `PostProcessCorpus` step. This applies disfluency standardization, quality tagging, and generates a per-locale QA summary JSON included in each release archive.
- **Datasheets integration.** Both SCS and SPS bundlers now generate per-locale datasheets (Markdown documentation) as part of the release pipeline. Templates and community-contributed content are sourced from `cv-datasheets` (schema v2.0.0), and the bundler fills in auto-generated statistics at build time. Datasheets are included in full release archives and are also shown on the dataset pages of the MDC platform. They are designed to be human-readable summaries of the dataset for each locale.
- **SCS-SPS data bridge.** The SPS bundler can cross-reference the SCS database to provide demographics data. This enables accent, age, and gender data from SCS profiles to appear in SPS releases when available. SPS was not initially connected to SCS user profiles, so older data may lack demographics.
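
As a rough illustration of the "recursive comparison for nested SPS objects" mentioned above, the sketch below diffs the numeric leaves of two nested statistics objects. It is not the code in `helpers/compareReleases.js`. The function name `diffStats` and the sample field names (`clips`, `splits`, `gender`) are assumptions chosen for the example.

```javascript
// Hypothetical sketch: recursively diff the numeric leaves of two nested
// stats objects. Names and object shapes are assumptions for illustration.
function diffStats(older, newer) {
  const result = {};
  const keys = new Set([...Object.keys(older ?? {}), ...Object.keys(newer ?? {})]);
  for (const key of keys) {
    const a = older?.[key];
    const b = newer?.[key];
    const aIsObj = a !== null && typeof a === "object";
    const bIsObj = b !== null && typeof b === "object";
    if (aIsObj || bIsObj) {
      // Recurse into nested objects; a missing side is treated as empty.
      result[key] = diffStats(aIsObj ? a : {}, bIsObj ? b : {});
    } else if (typeof a === "number" || typeof b === "number") {
      // Numeric leaf: a missing value counts as 0.
      result[key] = (typeof b === "number" ? b : 0) - (typeof a === "number" ? a : 0);
    }
    // Non-numeric leaves (strings, booleans) are skipped.
  }
  return result;
}

// Made-up sample data, not taken from any release file.
const previous = { clips: 100, splits: { gender: { female: 40, male: 60 } } };
const current = { clips: 130, splits: { gender: { female: 55, male: 70, other: 5 } } };
console.log(JSON.stringify(diffStats(previous, current)));
// {"clips":30,"splits":{"gender":{"female":15,"male":10,"other":5}}}
```

Treating a missing key as zero lets a category that first appears in the newer release (here `other`) show up as a positive difference instead of being dropped.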
================================================
FILE: LICENSE
================================================
Mozilla Public License Version 2.0
==================================
1. Definitions
--------------
1.1. "Contributor"
means each individual or legal entity that creates, contributes to
the creation of, or owns Covered Software.
1.2. "Contributor Version"
means the combination of the Contributions of others (if any) used
by a Contributor and that particular Contributor's Contribution.
1.3. "Contribution"
means Covered Software of a particular Contributor.
1.4. "Covered Software"
means Source Code Form to which the initial Contributor has attached
the notice in Exhibit A, the Executable Form of such Source Code
Form, and Modifications of such Source Code Form, in each case
including portions thereof.
1.5. "Incompatible With Secondary Licenses"
means
(a) that the initial Contributor has attached the notice described
in Exhibit B to the Covered Software; or
(b) that the Covered Software was made available under the terms of
version 1.1 or earlier of the License, but not also under the
terms of a Secondary License.
1.6. "Executable Form"
means any form of the work other than Source Code Form.
1.7. "Larger Work"
means a work that combines Covered Software with other material, in
a separate file or files, that is not Covered Software.
1.8. "License"
means this document.
1.9. "Licensable"
means having the right to grant, to the maximum extent possible,
whether at the time of the initial grant or subsequently, any and
all of the rights conveyed by this License.
1.10. "Modifications"
means any of the following:
(a) any file in Source Code Form that results from an addition to,
deletion from, or modification of the contents of Covered
Software; or
(b) any new file in Source Code Form that contains any Covered
Software.
1.11. "Patent Claims" of a Contributor
means any patent claim(s), including without limitation, method,
process, and apparatus claims, in any patent Licensable by such
Contributor that would be infringed, but for the grant of the
License, by the making, using, selling, offering for sale, having
made, import, or transfer of either its Contributions or its
Contributor Version.
1.12. "Secondary License"
means either the GNU General Public License, Version 2.0, the GNU
Lesser General Public License, Version 2.1, the GNU Affero General
Public License, Version 3.0, or any later versions of those
licenses.
1.13. "Source Code Form"
means the form of the work preferred for making modifications.
1.14. "You" (or "Your")
means an individual or a legal entity exercising rights under this
License. For legal entities, "You" includes any entity that
controls, is controlled by, or is under common control with You. For
purposes of this definition, "control" means (a) the power, direct
or indirect, to cause the direction or management of such entity,
whether by contract or otherwise, or (b) ownership of more than
fifty percent (50%) of the outstanding shares or beneficial
ownership of such entity.
2. License Grants and Conditions
--------------------------------
2.1. Grants
Each Contributor hereby grants You a world-wide, royalty-free,
non-exclusive license:
(a) under intellectual property rights (other than patent or trademark)
Licensable by such Contributor to use, reproduce, make available,
modify, display, perform, distribute, and otherwise exploit its
Contributions, either on an unmodified basis, with Modifications, or
as part of a Larger Work; and
(b) under Patent Claims of such Contributor to make, use, sell, offer
for sale, have made, import, and otherwise transfer either its
Contributions or its Contributor Version.
2.2. Effective Date
The licenses granted in Section 2.1 with respect to any Contribution
become effective for each Contribution on the date the Contributor first
distributes such Contribution.
2.3. Limitations on Grant Scope
The licenses granted in this Section 2 are the only rights granted under
this License. No additional rights or licenses will be implied from the
distribution or licensing of Covered Software under this License.
Notwithstanding Section 2.1(b) above, no patent license is granted by a
Contributor:
(a) for any code that a Contributor has removed from Covered Software;
or
(b) for infringements caused by: (i) Your and any other third party's
modifications of Covered Software, or (ii) the combination of its
Contributions with other software (except as part of its Contributor
Version); or
(c) under Patent Claims infringed by Covered Software in the absence of
its Contributions.
This License does not grant any rights in the trademarks, service marks,
or logos of any Contributor (except as may be necessary to comply with
the notice requirements in Section 3.4).
2.4. Subsequent Licenses
No Contributor makes additional grants as a result of Your choice to
distribute the Covered Software under a subsequent version of this
License (see Section 10.2) or under the terms of a Secondary License (if
permitted under the terms of Section 3.3).
2.5. Representation
Each Contributor represents that the Contributor believes its
Contributions are its original creation(s) or it has sufficient rights
to grant the rights to its Contributions conveyed by this License.
2.6. Fair Use
This License is not intended to limit any rights You have under
applicable copyright doctrines of fair use, fair dealing, or other
equivalents.
2.7. Conditions
Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted
in Section 2.1.
3. Responsibilities
-------------------
3.1. Distribution of Source Form
All distribution of Covered Software in Source Code Form, including any
Modifications that You create or to which You contribute, must be under
the terms of this License. You must inform recipients that the Source
Code Form of the Covered Software is governed by the terms of this
License, and how they can obtain a copy of this License. You may not
attempt to alter or restrict the recipients' rights in the Source Code
Form.
3.2. Distribution of Executable Form
If You distribute Covered Software in Executable Form then:
(a) such Covered Software must also be made available in Source Code
Form, as described in Section 3.1, and You must inform recipients of
the Executable Form how they can obtain a copy of such Source Code
Form by reasonable means in a timely manner, at a charge no more
than the cost of distribution to the recipient; and
(b) You may distribute such Executable Form under the terms of this
License, or sublicense it under different terms, provided that the
license for the Executable Form does not attempt to limit or alter
the recipients' rights in the Source Code Form under this License.
3.3. Distribution of a Larger Work
You may create and distribute a Larger Work under terms of Your choice,
provided that You also comply with the requirements of this License for
the Covered Software. If the Larger Work is a combination of Covered
Software with a work governed by one or more Secondary Licenses, and the
Covered Software is not Incompatible With Secondary Licenses, this
License permits You to additionally distribute such Covered Software
under the terms of such Secondary License(s), so that the recipient of
the Larger Work may, at their option, further distribute the Covered
Software under the terms of either this License or such Secondary
License(s).
3.4. Notices
You may not remove or alter the substance of any license notices
(including copyright notices, patent notices, disclaimers of warranty,
or limitations of liability) contained within the Source Code Form of
the Covered Software, except that You may alter any license notices to
the extent required to remedy known factual inaccuracies.
3.5. Application of Additional Terms
You may choose to offer, and to charge a fee for, warranty, support,
indemnity or liability obligations to one or more recipients of Covered
Software. However, You may do so only on Your own behalf, and not on
behalf of any Contributor. You must make it absolutely clear that any
such warranty, support, indemnity, or liability obligation is offered by
You alone, and You hereby agree to indemnify every Contributor for any
liability incurred by such Contributor as a result of warranty, support,
indemnity or liability terms You offer. You may include additional
disclaimers of warranty and limitations of liability specific to any
jurisdiction.
4. Inability to Comply Due to Statute or Regulation
---------------------------------------------------
If it is impossible for You to comply with any of the terms of this
License with respect to some or all of the Covered Software due to
statute, judicial order, or regulation then You must: (a) comply with
the terms of this License to the maximum extent possible; and (b)
describe the limitations and the code they affect. Such description must
be placed in a text file included with all distributions of the Covered
Software under this License. Except to the extent prohibited by statute
or regulation, such description must be sufficiently detailed for a
recipient of ordinary skill to be able to understand it.
5. Termination
--------------
5.1. The rights granted under this License will terminate automatically
if You fail to comply with any of its terms. However, if You become
compliant, then the rights granted under this License from a particular
Contributor are reinstated (a) provisionally, unless and until such
Contributor explicitly and finally terminates Your grants, and (b) on an
ongoing basis, if such Contributor fails to notify You of the
non-compliance by some reasonable means prior to 60 days after You have
come back into compliance. Moreover, Your grants from a particular
Contributor are reinstated on an ongoing basis if such Contributor
notifies You of the non-compliance by some reasonable means, this is the
first time You have received notice of non-compliance with this License
from such Contributor, and You become compliant prior to 30 days after
Your receipt of the notice.
5.2. If You initiate litigation against any entity by asserting a patent
infringement claim (excluding declaratory judgment actions,
counter-claims, and cross-claims) alleging that a Contributor Version
directly or indirectly infringes any patent, then the rights granted to
You by any and all Contributors for the Covered Software under Section
2.1 of this License shall terminate.
5.3. In the event of termination under Sections 5.1 or 5.2 above, all
end user license agreements (excluding distributors and resellers) which
have been validly granted by You or Your distributors under this License
prior to termination shall survive termination.
************************************************************************
* *
* 6. Disclaimer of Warranty *
* ------------------------- *
* *
* Covered Software is provided under this License on an "as is" *
* basis, without warranty of any kind, either expressed, implied, or *
* statutory, including, without limitation, warranties that the *
* Covered Software is free of defects, merchantable, fit for a *
* particular purpose or non-infringing. The entire risk as to the *
* quality and performance of the Covered Software is with You. *
* Should any Covered Software prove defective in any respect, You *
* (not any Contributor) assume the cost of any necessary servicing, *
* repair, or correction. This disclaimer of warranty constitutes an *
* essential part of this License. No use of any Covered Software is *
* authorized under this License except under this disclaimer. *
* *
************************************************************************
************************************************************************
* *
* 7. Limitation of Liability *
* -------------------------- *
* *
* Under no circumstances and under no legal theory, whether tort *
* (including negligence), contract, or otherwise, shall any *
* Contributor, or anyone who distributes Covered Software as *
* permitted above, be liable to You for any direct, indirect, *
* special, incidental, or consequential damages of any character *
* including, without limitation, damages for lost profits, loss of *
* goodwill, work stoppage, computer failure or malfunction, or any *
* and all other commercial damages or losses, even if such party *
* shall have been informed of the possibility of such damages. This *
* limitation of liability shall not apply to liability for death or *
* personal injury resulting from such party's negligence to the *
* extent applicable law prohibits such limitation. Some *
* jurisdictions do not allow the exclusion or limitation of *
* incidental or consequential damages, so this exclusion and *
* limitation may not apply to You. *
* *
************************************************************************
8. Litigation
-------------
Any litigation relating to this License may be brought only in the
courts of a jurisdiction where the defendant maintains its principal
place of business and such litigation shall be governed by laws of that
jurisdiction, without reference to its conflict-of-law provisions.
Nothing in this Section shall prevent a party's ability to bring
cross-claims or counter-claims.
9. Miscellaneous
----------------
This License represents the complete agreement concerning the subject
matter hereof. If any provision of this License is held to be
unenforceable, such provision shall be reformed only to the extent
necessary to make it enforceable. Any law or regulation which provides
that the language of a contract shall be construed against the drafter
shall not be used to construe this License against a Contributor.
10. Versions of the License
---------------------------
10.1. New Versions
Mozilla Foundation is the license steward. Except as provided in Section
10.3, no one other than the license steward has the right to modify or
publish new versions of this License. Each version will be given a
distinguishing version number.
10.2. Effect of New Versions
You may distribute the Covered Software under the terms of the version
of the License under which You originally received the Covered Software,
or under the terms of any subsequent version published by the license
steward.
10.3. Modified Versions
If you create software not governed by this License, and you want to
create a new license for such software, you may create and use a
modified version of this License if you rename the license and remove
any references to the name of the license steward (except to note that
such modified license differs from this License).
10.4. Distributing Source Code Form that is Incompatible With Secondary
Licenses
If You choose to distribute Source Code Form that is Incompatible With
Secondary Licenses under the terms of this version of the License, the
notice described in Exhibit B of this License must be attached.
Exhibit A - Source Code Form License Notice
-------------------------------------------
This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, You can obtain one at http://mozilla.org/MPL/2.0/.
If it is not possible or desirable to put the notice in a particular
file, then You may include the notice in a location (such as a LICENSE
file in a relevant directory) where a recipient would be likely to look
for such a notice.
You may add additional accurate notices of copyright ownership.
Exhibit B - "Incompatible With Secondary Licenses" Notice
---------------------------------------------------------
This Source Code Form is "Incompatible With Secondary Licenses", as
defined by the Mozilla Public License, v. 2.0.
================================================
FILE: README.md
================================================
# Common Voice Datasets
This repo contains release details and metadata for the [Common Voice](https://commonvoice.mozilla.org) datasets. Please visit the [Mozilla Data Collective Common Voice section](https://mozilladatacollective.com/organization/cmfh0j9o10006ns07jq45h7xk) to download the latest datasets.
## Dataset Types
Common Voice collects voice data through multiple modalities. Each dataset type has its own release information, data structure, and documentation.
| Type | Alias | Status | Releases | Latest (2026-03) | Languages |
| -------------------------------------------------- | ----- | ------ | -------: | :--------------: | --------: |
| [Scripted Speech](datasets/scripted-speech/) | SCS | Active | 25 | v25.0 | 290 |
| [Spontaneous Speech](datasets/spontaneous-speech/) | SPS | Active | 3 | v3.0 | 72 |
| [Code Switching](datasets/code-switching/) | CS | Alpha | -- | -- | -- |
See each dataset type's documentation for detailed information about data structures, fields in metadata files (`.tsv`), archive contents, and release changelogs. Note that the "date" in releases represents the cut-off date for data collection and validation, not the actual release date of the dataset.
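
The statistics files in this repository carry the version and the cut-off date in their names. The sketch below parses such a name. The pattern is inferred only from the file names in the directory listing, so it is an assumption and not a documented naming spec. The function name `parseReleaseFile` and the `full` label for files without a suffix are also hypothetical.

```javascript
// Illustrative parser for statistics file names such as
// "cv-corpus-25.0-delta-2026-03-09.json". The pattern is an assumption
// inferred from the file names in this repository.
const RELEASE_FILE =
  /^(cv|sps)-corpus-(\d+(?:\.\d+)?)(?:-(delta|singleword))?(?:-(\d{4}-\d{2}-\d{2}))?\.json$/;

function parseReleaseFile(fileName) {
  const match = RELEASE_FILE.exec(fileName);
  if (!match) return null;
  const [, prefix, version, kind, cutoffDate] = match;
  return {
    datasetType: prefix === "cv" ? "SCS" : "SPS",
    version,
    // "full" is a label chosen here for files without a suffix.
    kind: kind ?? "full",
    // The date is the data cut-off date, not the publication date.
    cutoffDate: cutoffDate ?? null,
  };
}

console.log(parseReleaseFile("cv-corpus-25.0-delta-2026-03-09.json"));
console.log(parseReleaseFile("sps-corpus-1.0-2025-09-05.json"));
console.log(parseReleaseFile("cv-corpus-5-singleword.json"));
```

Early releases such as `cv-corpus-1.json` have no date in the name, so `cutoffDate` is `null` for them.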
## Data Pipeline
```mermaid
flowchart LR
subgraph SCS["Scripted Speech (SCS)"]
SCS_DB[("DB")]
SCS_GCS["GCS"]
end
subgraph SCS_BUN["SCS Bundler"]
CC["CorporaCreator"]
end
subgraph SCS_BUN2["SCS Bundler"]
UP["Uploader"]
end
DSH["cv-datasheets"]
subgraph SPS["Spontaneous Speech (SPS)"]
SPS_DB[("DB")]
SPS_GCS["GCS"]
end
subgraph SPS_BUN["SPS Bundler"]
QA["QA Pipeline"]
end
BUN_GCS["GCS
datasets
datasheets
stats"]
MDC[["MDC
downloads"]]
CDS[["cv-dataset ◀"]]
SCS_DB -->|data| SCS_BUN
SCS_GCS -->|clips| SCS_BUN
DSH -->|JSON| SCS_BUN
DSH -->|JSON| SPS_BUN
SPS_DB -->|data| SPS_BUN
SPS_GCS -->|audio| SPS_BUN
SCS_BUN --> BUN_GCS
SPS_BUN --> BUN_GCS
BUN_GCS -->|datasets| UP
BUN_GCS -->|datasheets| UP
UP -->|API| MDC
BUN_GCS -->|stats| CDS
style CDS fill:#1a73e8,color:#ffffff,stroke:#1558b0,stroke-width:2px
```
## Overview
### Scripted Speech (SCS)
```mermaid
---
config:
  xyChart:
    width: 900
    height: 400
---
xychart-beta
title "Scripted Speech: Total & Validated Hours"
x-axis ["1","2","3","4","5.1","6.1","7","8","9","10","11","12","13","14","15","16.1","17","18","19","20","21","22","23","24","25"]
y-axis "Hours" 0 --> 42000
bar [1368,2366,2454,4257,7226,9283,13905,18243,20217,20817,24231,26119,27141,28117,28750,30328,31175,32121,32584,33154,33534,33815,35921,38932,41792]
bar [1096,1872,1979,3401,5671,7335,11192,14122,14973,15234,16429,17127,17689,18651,19159,19915,20408,20943,21593,22106,22344,22640,24600,25886,28377]
```
For details see: [Scripted Speech documentation](datasets/scripted-speech/)
### Spontaneous Speech (SPS)
```mermaid
---
config:
  xyChart:
    width: 600
    height: 350
---
xychart-beta
title "Spontaneous Speech: Total vs Validated Hours"
x-axis ["v1.0","v2.0","v3.0"]
y-axis "Hours" 0 --> 600
bar [428,454,508]
bar [263,268,269]
```
For details see: [Spontaneous Speech documentation](datasets/spontaneous-speech/)
## Dataset Access
You can download the Common Voice datasets from the [Mozilla Data Collective](https://mozilladatacollective.com/) (MDC) platform:
- [Directly from the browser](https://mozilladatacollective.com/organization/cmfh0j9o10006ns07jq45h7xk)
- [Using the MDC API](https://mozilladatacollective.com/api-reference)
- [Using the MDC Python SDK](https://github.com/Mozilla-Data-Collective/datacollective-python) to load the datasets directly as a pandas DataFrame in your Python codebase
## Generating Dataset Statistics
Helper scripts are available in the [helpers/](helpers/) directory for processing bundler output into dataset statistics. See [helpers/README.md](helpers/README.md) for detailed usage and examples.
All helper scripts support multiple dataset types via the first argument:
```bash
node helpers/createStats.js <dataset-type> <stats-folder>
node helpers/compareReleases.js <dataset-type> <dataset-1> <dataset-2>
node helpers/createDeltaStatistics.js <dataset-type> <dataset-1> <dataset-2>
node helpers/recalculateStats.js <dataset-type> <dataset>
```
## Citation
If you use the data in a published academic work, we would appreciate it if you cited the following article:
- Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M. and Weber, G. (2020) "[Common Voice: A Massively-Multilingual Speech Corpus](https://arxiv.org/abs/1912.06670)". _Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)._ pp. 4211--4215
```bibtex
@inproceedings{commonvoice:2020,
author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
title = {Common Voice: A Massively-Multilingual Speech Corpus},
booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
pages = {4211--4215},
year = 2020
}
```
## Feedback
Please only use this repo to provide feedback on **technical issues** with the dataset, such as file corruptions, problems with the partitions, and so on. For more expansive discussions, please join us in [Discourse](https://discourse.mozilla.org/c/voice) or [Matrix](https://chat.mozilla.org/#/room/#common-voice:mozilla.org).
================================================
FILE: datasets/code-switching/README.md
================================================
# Code Switching (CS)
Code Switching is an upcoming Common Voice modality where contributors produce speech that naturally switches between two or more languages within a single utterance. This is a subproject within the [Spontaneous Speech](https://github.com/common-voice/spontaneous-speech) repository, currently gated to alpha testers.
**Status**: Alpha test phase -- no releases yet.
This directory will contain release statistics once the first Code Switching dataset is published.
================================================
FILE: datasets/scripted-speech/CHANGELOG.md
================================================
# Scripted Speech (SCS) Changelog
## Current Release
### [Corpus 25.0](cv-corpus-25.0-2026-03-09.json)
Regularly scheduled dataset release Q1 2026.
- **Date released**: 18 March 2026
- **Clip cut-off date**: 09 March 2026
- **Total hours**: 41,792
- **Total validated hours**: 28,377
- **Number of languages**: 290
**New languages since last major release**: Croatian (`hr`)
#### Dataset Changes in Corpus 25.0
- added `README.md` datasheet per locale -- a Markdown document with language description, dataset statistics, demographic breakdowns, and community context (generated from [cv-datasheets](https://github.com/common-voice/cv-datasheets), schema v2.0.0)
- added `variant` column to `validated_sentences.tsv` (after `sentence`), containing the language variant token for the sentence (empty if none)
- added `variant` column to `unvalidated_sentences.tsv` (after `sentence`)
- added `up_votes`, `down_votes`, and `status` columns to `unvalidated_sentences.tsv`
  - `status` is `pending` (not yet validated) or `rejected` (when `down_votes` >= 2 and `down_votes` > `up_votes`)
- the `unvalidated_sentences.tsv` description is corrected: it contains sentences that have not reached the validated threshold, not only sentences without any votes
- added `variant` and `locale` columns to [Corpora Creator](https://github.com/common-voice/CorporaCreator) clip files
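
The `status` rule for `unvalidated_sentences.tsv` described above can be written as a small function. This is a sketch of the stated rule only, under the assumption that the row already belongs in the unvalidated file (it has not reached the validated threshold). The function name is hypothetical and is not taken from the bundler.

```javascript
// Sketch of the status rule for rows in unvalidated_sentences.tsv.
// Only the rule itself comes from the changelog; the name is hypothetical.
function unvalidatedSentenceStatus(upVotes, downVotes) {
  const rejected = downVotes >= 2 && downVotes > upVotes;
  return rejected ? "rejected" : "pending";
}

console.log(unvalidatedSentenceStatus(0, 0)); // pending
console.log(unvalidatedSentenceStatus(0, 1)); // pending (fewer than 2 down votes)
console.log(unvalidatedSentenceStatus(1, 2)); // rejected
console.log(unvalidatedSentenceStatus(2, 2)); // pending (down votes do not exceed up votes)
```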
## Past Releases
### [Corpus 24.0](cv-corpus-24.0-2025-12-05.json)
Regularly scheduled dataset release Q4 2025.
- **Date released**: 17 December 2025
- **Clip cut-off date**: 05 December 2025
- **Total hours**: 38,932
- **Total validated hours**: 25,886
- **Number of languages**: 289
**New languages since last major release**: Sorbian, Lower (`dsb`), Alsatian (`gsw`), Laz (`lzz`)
### [Corpus 23.0](cv-corpus-23.0-2025-09-05.json)
Regularly scheduled dataset release Q3 2025.
- **Date released**: 17 September 2025
- **Clip cut-off date**: 05 September 2025
- **Total hours**: 35,921
- **Total validated hours**: 24,600
- **Number of languages**: 286
**New languages since last major release**: Adamawa Fulfulde (`fub`), Adja (`ajg`), Adyghe (`ady`), Aragonese (`an`), Asheninka Perene, Asheninka South Ucayali, Atayal (`tay`), Baatonum (`bba`), Bafia (`ksf`), Bafut (`bfd`), Bakoko, Balti (`bft`), Bamun (`bax`), Bamvele (`beb`), Bankon (`abb`), Baoule (`bci`), Batanga (`bnm`), Bateri (`btv`), Borgu Fulfulde (`fue`), Brahui, Brushaski (`bsk`), Bulu (`bum`), Bunun (`bnn`), Cameroon Pidgin (`wes`), Central Alaskan Yup'ik (`esu`), Central Puebla Nahuatl (`ncx`), Central Tarahumara, Chokwe, Copainalá Zoque (`zoc`), Cornish (`kw`), Dagbani (`dag`), Dameli (`dml`), Dargwa (`dar`), Dawoodi (`dmk`), Dhatki, Duala (`dua`), Eastern Balochi (`bgp`), Ebrie (`ebr`), Ekoti, Eton (`eto`), Ewondo (`ewo`), Fang (`fan`), Fe'efe'e (`fmp`), Gawarbaiti (`gwt`), Gawri (`gwc`), Ghomala (`bbj`), Goaria, Guidar, Guiziga, Gujari (`gju`), Gurgula (`ggg`), Hazargi, Huarijio (`var`), Huautla Mazatec (`mau`), Ibibio (`ibb`), Indus Kohistani (`mvy`), Iñupiaq (`ipk`), Jaqaru (`jqr`), Kabardian (`kbd`), Kachhi, Kalasha (`kls`), Kalkoti (`xka`), Kateviri (`bsh`), Khetrani (`xhe`), Khowar (`khw`), Kichwa (`qvi`), Kihemba, Kirombo, Kohistani Shina (`plk`), Kom (`bkm`), Kotokoli, Kunabembe, Kwasio, Lassi (`lss`), Loarki, Loja Highland Kichwa, Losso, Mada (`mxu`), Malay (`ms`), Manx (`gv`), Massa, Matses, Mbo (`mbo`), Mbum, Medumba (`byv`), Mengambo, Mina, Mingrelian (`xmf`), Mokpwe (`bri`), Moussey, Mpiemo, Mundang, Mungaka, Musgum, Ngiembon (`nnh`), Ngomba, Ngombale, Nigerian Pidgin English (`pcm`), Northern Hindko (`hno`), Northwest Gbaya (`gya`), Nuasue, Nyungwe, Nüpode Huitoto, Oadki, Orizaba Nahuatl, Ormuri (`oru`), Ouldémé, Pahari-Pothwari, Paiwan (`pwn`), Pakistani Marwari, Palula (`phl`), Parkari Koli, Puno Quechua (`qxp`), Quechua Ambo-Pasco (`qva`), Quechua Arequipa-La Unión (`qxu`), Quechua Cajatambo (`qvl`), Quechua Chiquián (`qxa`), Quechua Corongo Ancash (`qwa`), Quechua Jauja Wanka (`qxw`), Quechua Pasco Santa Ana de Tusi (`qxt`), 
Quechua Santiago del Estero, Quechua Sihuas Ancash (`qws`), Quechua Yanahuanca, Quechua Yauyos (`qux`), Rukai (`dru`), Sakizaya (`szy`), Sansi, Seediq (`trv`), Seri (`sei`), Shina (`scl`), Sindhi Bhili, Siswati (`ss`), Southern Pastaza Quechua (`qup`), Svan (`sva`), Tepeuxila Cuicatec (`cux`), Teutila Cuicatec (`cut`), Tlingit, Torwali (`trw`), Tshiluba, Tuki, Tunen (`tvu`), Tupuri (`tui`), Tush (`bbl`), Ushojo (`ush`), Wadiyara Koli, Wakhi (`wbl`), Western Highland Purepecha (`pua`), Yadgha, Yaqui (`yaq`)
### [Corpus 22.0](cv-corpus-22.0-2025-06-20.json)
Regularly scheduled dataset release Q2 2025.
- **Date released**: 25 June 2025
- **Clip cut-off date**: 20 June 2025
- **Total hours**: 33,815
- **Total validated hours**: 22,640
- **Number of languages**: 137
**New languages since last major release**: Aromanian (`rup`), Tajik (`tg`), Tshivenda (`ve`)
### [Corpus 21.0](cv-corpus-21.0-2025-03-14.json)
Regularly scheduled dataset release Q1 2025.
- **Date released**: 19 March 2025
- **Clip cut-off date**: 14 March 2025
- **Total hours**: 33,534
- **Total validated hours**: 22,344
- **Number of languages**: 134
**New languages since last major release**: Norwegian Bokmål (`nb-NO`)
### [Corpus 20.0](cv-corpus-20.0-2024-12-06.json)
Regularly scheduled dataset release Q4 2024.
- **Date released**: 11 December 2024
- **Clip cut-off date**: 06 December 2024
- **Total hours**: 33,154
- **Total validated hours**: 22,106
- **Number of languages**: 133
**New languages since last major release**: IsiNdebele (South) (`nr`), Southern Sotho (`st`)
### [Corpus 19.0](cv-corpus-19.0-2024-09-13.json)
Regularly scheduled dataset release Q3 2024.
- **Date released**: 18 September 2024
- **Clip cut-off date**: 13 September 2024
- **Total hours**: 32,584
- **Total validated hours**: 21,593
- **Number of languages**: 131
**New languages since last major release**: Sindhi (`sd`), Xitsonga (`ts`)
### [Corpus 18.0](cv-corpus-18.0-2024-06-14.json)
#### Dataset Changes in Corpus 18.0
- the `sentence_domain` column now contains up to three domains separated by commas, e.g. `general,finance,news_current_affairs`
- the domains `agriculture`, `automotive` and `food_service_retail` have been renamed to `agriculture_food`, `automotive_transport` and `service_retail`, respectively
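A minimal sketch of how a dataset consumer might handle both changes when reading `sentence_domain` values. It assumes the renames map one-to-one as listed above; the helper `parse_sentence_domain` and the `RENAMED_DOMAINS` table are illustrative names, not part of any Common Voice tooling.

```python
# Hypothetical helper for illustration; not a Common Voice API.
RENAMED_DOMAINS = {
    "agriculture": "agriculture_food",
    "automotive": "automotive_transport",
    "food_service_retail": "service_retail",
}


def parse_sentence_domain(value: str) -> list[str]:
    """Split a comma-separated `sentence_domain` cell and apply the 18.0 renames."""
    domains = [part.strip() for part in value.split(",") if part.strip()]
    return [RENAMED_DOMAINS.get(domain, domain) for domain in domains]


print(parse_sentence_domain("general,finance,news_current_affairs"))
print(parse_sentence_domain("agriculture"))
```

Applying the rename table while parsing lets rows from releases before and after 18.0 be compared on the same domain names.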
### [Corpus 17.0](cv-corpus-17.0-2024-03-15.json)
#### Dataset Changes in Corpus 17.0
- added `unvalidated_sentences.tsv` and `validated_sentences.tsv`
- `unvalidated_sentences.tsv` contains sentences that have not reached the validated threshold (including those with votes but insufficient up votes); its columns are `sentence_id`, `sentence`, `sentence_domain` and `source`
- `validated_sentences.tsv` contains sentences that have two or more up votes; it has two additional columns: `is_used` and `clips_count`
- `is_used`: indicates whether or not the sentence is used on the speak page
- `clips_count`: the number of clips that are associated with the sentence
- added `sentence_id` and `sentence_domain` to the [Corpora Creator](https://github.com/common-voice/CorporaCreator) files
- the following [sentence domains](https://github.com/common-voice/common-voice/blob/f820e0fa3ec00fc6d49dae7e31bcebf9eb24878b/common/taxonomies.ts#L35) are supported
### [Corpus 16.1](cv-corpus-16.1-2023-12-06.json)
#### Dataset Changes in Corpus 16.1
- changed `times.txt` to `clip_durations.tsv` for consistency
- `clip_durations.tsv` contains two columns: `clip` and `duration[ms]`
### [Corpus 14.0](cv-corpus-14.0-2023-06-23.json)
#### Dataset Changes in Corpus 14.0
- added `times.txt` containing mp3 filename and duration in ms
### [Corpus 13.0](cv-corpus-13.0-2023-03-09.json)
#### Dataset Changes in Corpus 13.0
- added `variant` column to [Corpora Creator](https://github.com/common-voice/CorporaCreator) files
### [Corpus 10.0](cv-corpus-10.0-2022-07-04.json)
#### Dataset Changes in Corpus 10.0
- introduced delta segments
- delta segment tar file naming is `cv-corpus-{version}-delta-{YYYY-MM-DD}-{locale}.tar.gz`
- delta segments contain the same files except for the training splits, i.e. `dev.tsv`, `test.tsv`, `train.tsv`
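The naming scheme can be parsed with a regular expression, as sketched below. The pattern is inferred from the file names described in this changelog (the `-delta` part is treated as optional so that full archives match too); the function name `parse_archive_name` is an assumption made for this example.

```python
import re

# Pattern inferred from the documented naming; `-delta` is optional.
ARCHIVE_RE = re.compile(
    r"^cv-corpus-(?P<version>\d+(?:\.\d+)*)"
    r"(?P<delta>-delta)?"
    r"-(?P<date>\d{4}-\d{2}-\d{2})"
    r"-(?P<locale>.+)\.tar\.gz$"
)


def parse_archive_name(name: str) -> dict:
    """Split an archive file name into version, delta flag, date and locale."""
    match = ARCHIVE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized archive name: {name!r}")
    return {
        "version": match["version"],
        "is_delta": match["delta"] is not None,
        "date": match["date"],
        "locale": match["locale"],
    }


print(parse_archive_name("cv-corpus-10.0-delta-2022-07-04-tr.tar.gz"))
```

The locale group is matched last and greedily because locale codes such as `ga-IE` contain hyphens themselves.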
### [Corpus 9.0](cv-corpus-9.0-2022-04-27.json)
Regularly scheduled dataset release Q1 2022.
- **Date released**: 27 April 2022
- **Clip cut-off date**: 07 April 2022
- **Total hours**: 20,217
- **Total validated hours**: 14,973
- **Number of languages**: 93
**New languages since last major release**: Tigre (`tig`), Taiwanese (Minnan) (`nan-tw`), Meadow Mari (`mhr`), Bengali (`bn`), Toki Pona (`tok`), Cantonese (`yue`)
### [Corpus 8.0](cv-corpus-8.0-2022-01-19.json)
Regularly scheduled dataset release.
- **Date released**: 26 January 2022
- **Clip cut-off date**: 19 January 2022
- **Total hours**: 18,243
- **Total validated hours**: 14,122
- **Number of languages**: 87
**New languages since last major release**: Igbo (`ig`), Marathi (`mr`), Danish (`da`), Norwegian Nynorsk (`nn-NO`), Central Kurdish (`ckb`), Malayalam (`ml`), Swahili (`sw`), Erzya (`myv`), Moksha (`mdf`), Macedonian (`mk`), Santali (Ol Chiki) (`sat`)
Note: minor variations in the validated hours of minor dot releases reflect the fact that labeling/validation happens on a different schedule than recording. In the timespan between dot releases the community will usually have performed additional validations, even if the clip cut-off date remains the same.
### [Corpus 7.0](cv-corpus-7.0-2021-07-21.json)
Regularly scheduled dataset release for H1 of 2021.
- **Date released**: 28 July 2021
- **Clip cut-off date**: 21 July 2021
- **Total hours**: 13,905
- **Total validated hours**: 11,192
- **Number of languages**: 76
**New languages since last major release**: Basaa (`bas`), Slovak (`sk`), Kurmanji Kurdish (`kmr`), Bulgarian (`bg`), Kazakh (`kk`), Bashkir (`ba`), Galician (`gl`), Uyghur (`ug`), Armenian (`hy-AM`), Belarusian (`be`), Urdu (`ur`), Guarani (`gn`), Serbian (`sr`), Uzbek (`uz`), Azerbaijani (`az`), Hausa (`ha`)
#### Dataset Changes in Corpus 7.0
- changed tar file naming to `cv-corpus-{version}-{YYYY-MM-DD}-{locale}.tar.gz`
### [Singleword Segment 7.0](cv-corpus-7.0-singleword.json)
Update to Singleword Segment 6.1.
- **Date released**: 28 July 2021
- **Clip cut-off date**: 21 July 2021
- **Total hours**: 141
- **Total validated hours**: 82
- **Number of languages**: 34
### [Corpus 6.1](cv-corpus-6.1-2020-12-11.json)
Correction to Corpus 6.0, which had a bug that did not properly attribute demographics information.
- **Date released**: 22 Dec 2020
- **Clip cut-off date**: 11 Dec 2020
- **Total hours**: 9,283
- **Total validated hours**: 7,335
- **Number of languages**: 60
### [Singleword Segment 6.1](cv-corpus-6.1-singleword.json)
Correction to Singleword Segment 6.0, which had a bug that did not properly attribute demographics information.
- **Date released**: 22 Dec 2020
- **Clip cut-off date**: 11 Dec 2020
- **Total hours**: 131
- **Total validated hours**: 77
- **Number of languages**: 31
### [Corpus 6.0](cv-corpus-6.0-2020-12-11.json)
Regularly scheduled dataset release for H2 of 2020.
- **Date released**: 22 Dec 2020
- **Clip cut-off date**: 11 Dec 2020
- **Total hours**: 9,261
- **Total validated hours**: 7,327
- **Number of languages**: 60
**New languages since last major release**: Hindi (`hi`), Lithuanian (`lt`), Luganda (`lg`), Thai (`th`), Finnish (`fi`), Hungarian (`hu`)
### [Singleword Segment 6.0](cv-corpus-6.0-singleword.json)
Update to Singleword Segment 5.1.
- **Date released**: 22 Dec 2020
- **Clip cut-off date**: 11 Dec 2020
- **Total hours**: 131
- **Total validated hours**: 77
- **Number of languages**: 31
### [Corpus 5.1](cv-corpus-5.1-2020-06-22.json)
Correction to Corpus 5.0, which unintentionally altered the column order of the test/train/dev sets, and included some redundant metadata entries for clips that didn't actually have valid audio.
- **Date released**: 14 July 2020
- **Clip cut-off date**: 22 June 2020
- **Total hours**: 7,226
- **Total validated hours**: 5,671
- **Number of languages**: 54
### [Singleword Segment 5.1](cv-corpus-5.1-singleword.json)
Correction to Singleword Segment 5.0, which was still optimizing for no repeated sentences during segmentation and thus resulted in disproportionately small test/dev/train sets.
- **Date released**: 16 September 2020
- **Clip cut-off date**: 22 June 2020
- **Total hours**: 120
- **Total validated hours**: 64
- **Number of languages**: 18
### [Corpus 5.0](cv-corpus-5-2020-06-22.json)
Regularly scheduled dataset release for H1 of 2020. This release introduced sha256 checksum values for each dataset.
- **Date released**: 30 June 2020
- **Clip cut-off date**: 22 June 2020
- **Total hours**: 7,226
- **Total validated hours**: 5,591
- **Number of languages**: 54
**New languages since last major release**: Sorbian, Upper (`hsb`), Romanian (`ro`), Frisian (`fy-NL`), Czech (`cs`), Greek (`el`), Romansh Vallader (`rm-vallader`), Polish (`pl`), Assamese (`as`), Ukrainian (`uk`), Maltese (`mt`), Georgian (`ka`), Punjabi (`pa-IN`), Odia (`or`), Vietnamese (`vi`)
#### Dataset Changes in Corpus 5.0
- changed archive folder structure: dataset release archive now contains a locale folder
```txt
cv-corpus-5.1-2020-06-22/
└── tr/
    ├── clips/
    ├── dev.tsv
    ├── invalidated.tsv
    ├── other.tsv
    ├── reported.tsv
    ├── test.tsv
    ├── train.tsv
    └── validated.tsv
```
- added `reported.tsv` containing sentences that have been reported by the community
- added `locale` and `segment` columns to the [Corpora Creator](https://github.com/common-voice/CorporaCreator) files
### [Singleword Segment 5.0](cv-corpus-5-singleword.json)
Contains all voice data collected during the Common Voice pilot target segment effort, which gathered single-word utterances for a benchmark experiment.
- **Date released**: 30 June 2020
- **Clip cut-off date**: 22 June 2020
- **Total hours**: 120
- **Total validated hours**: 64
- **Number of languages**: 18
### [Corpus 4](cv-corpus-4-2019-12-10.json)
Regularly scheduled dataset release for H2 of 2019.
- **Date released**: 14 Jan 2020
- **Clip cut-off date**: 10 Dec 2019
- **Total hours**: 4,257
- **Total validated hours**: 3,401
- **Number of languages**: 40
**New languages since last major release**: Abkhaz (`ab`), Arabic (`ar`), Chinese (Hong Kong) (`zh-HK`), Indonesian (`id`), Interlingua (`ia`), Japanese (`ja`), Latvian (`lv`), Portuguese (`pt`), Romansh Sursilvan (`rm-sursilv`), Tamil (`ta`), Votic (`vot`)
#### Dataset Changes in Corpus 4.0
- changed tar file naming from `cv-corpus-{version}_{locale}.tar.tar` to `cv-corpus-{version}-{YYYY-MM-DD}_{locale}.tar.tar`
### [Corpus 3](cv-corpus-3.json)
Minor update to Corpus 2 to correct an issue with file-naming.
- **Date released**: 24 June 2019
- **Clip cut-off date**: 24 June 2019 (est)
- **Total hours**: 2,454
- **Total validated hours**: 1,979
- **Number of languages**: 29
**New languages since last major release**: Persian (`fa`)
### [Corpus 2](cv-corpus-2.json)
Regularly scheduled dataset release for H1 of 2019.
- **Date released**: 11 June 2019
- **Clip cut-off date**: 11 June 2019 (est)
- **Total hours**: 2,366
- **Total validated hours**: 1,872
- **Number of languages**: 28
**New languages since last major release**: Basque (`eu`), Spanish (`es`), Chinese (China) (`zh-CN`), Mongolian (`mn`), Sakha/Yakut (`sah`), Dhivehi (`dv`), Kinyarwanda (`rw`), Swedish (`sv-SE`), Russian (`ru`)
### [Corpus 1](cv-corpus-1.json)
First multilingual release.
- **Date released**: 25 February 2019
- **Clip cut-off date**: 25 February 2019 (est)
- **Total hours**: 1,368
- **Total validated hours**: 1,096
- **Number of languages**: 19
**New languages**: German (`de`), French (`fr`), Welsh (`cy`), Breton (`br`), Chuvash (`cv`), Turkish (`tr`), Tatar (`tt`), Kyrgyz (`ky`), Irish (`ga-IE`), Kabyle (`kab`), Catalan (`ca`), Chinese (Taiwan) (`zh-TW`), Slovenian (`sl`), Italian (`it`), Dutch (`nl`), Hakha Chin (`cnh`), Esperanto (`eo`), Estonian (`et`)
#### Initial Dataset Structure
- the initial dataset release folder structure:
```txt
cv-corpus-1_{locale}/
├── clips/
├── dev.tsv
├── invalidated.tsv
├── other.tsv
├── test.tsv
├── train.tsv
└── validated.tsv
```
- to get more information about the files included in the dataset release, please see [Corpora Creator](https://github.com/common-voice/CorporaCreator)
- columns: `client_id`, `path`, `sentence`, `up_votes`, `down_votes`, `age`, `gender`, `accent`
================================================
FILE: datasets/scripted-speech/README.md
================================================
# Scripted Speech (SCS)
Scripted Speech is the classic Common Voice dataset. Contributors read pre-written sentences aloud, and the community validates the recordings. New datasets are released approximately every quarter.
All voice contributions are released as part of datasets, regardless of validation status. From v25.0 on, clips that fail quality checks (over-length, corrupted, or missing audio files) are excluded during bundling; a per-locale problem clip report is included with each release for transparency. The clips are currently bundled using the embedded bundler in the public repo [Common Voice - Bundler](https://github.com/common-voice/common-voice/tree/main/bundler).
## Release History
See the full [Changelog](CHANGELOG.md) for detailed release notes and new languages per release.
### Total and Validated Hours
```mermaid
---
config:
  xyChart:
    width: 900
    height: 400
---
xychart-beta
title "Scripted Speech: Total & Validated Hours"
x-axis ["1","2","3","4","5.1","6.1","7","8","9","10","11","12","13","14","15","16.1","17","18","19","20","21","22","23","24","25"]
y-axis "Hours" 0 --> 42000
bar [1368,2366,2454,4257,7226,9283,13905,18243,20217,20817,24231,26119,27141,28117,28750,30328,31175,32121,32584,33154,33534,33815,35921,38932,41792]
bar [1096,1872,1979,3401,5671,7335,11192,14122,14973,15234,16429,17127,17689,18651,19159,19915,20408,20943,21593,22106,22344,22640,24600,25886,28377]
```
### Contributors
```mermaid
---
config:
  xyChart:
    width: 900
    height: 400
---
xychart-beta
title "Scripted Speech: Total Contributors"
x-axis ["1","2","3","4","5.1","6.1","7","8","9","10","11","12","13","14","15","16.1","17","18","19","20","21","22","23","24","25"]
y-axis "Users" 0 --> 500000
bar [42109,56059,57420,95798,138225,151434,191622,207602,252576,263879,271817,281069,288617,298724,302232,319703,330323,335780,338378,345996,350098,356074,361614,371058,375673]
```
_Counts are summed per language — contributors active in multiple languages are counted once per language._
### Language Count
```mermaid
---
config:
  xyChart:
    width: 900
    height: 400
---
xychart-beta
title "Scripted Speech: Languages per Release"
x-axis ["1","2","3","4","5.1","6.1","7","8","9","10","11","12","13","14","15","16.1","17","18","19","20","21","22","23","24","25"]
y-axis "Languages" 0 --> 310
line [19,28,29,40,54,60,76,87,93,96,100,104,108,112,114,120,124,129,131,133,134,137,286,289,290]
```
### Release Summary
<div align="center">
| Release | Date | Languages | Total Hours | Validated Hours |
| ------- | ---------- | --------: | ----------: | --------------: |
| v1 | 2019-02-25 | 19 | 1,368 | 1,096 |
| v2 | 2019-06-11 | 28 | 2,366 | 1,872 |
| v3 | 2019-06-24 | 29 | 2,454 | 1,979 |
| v4 | 2019-12-10 | 40 | 4,257 | 3,401 |
| v5.1 | 2020-06-22 | 54 | 7,226 | 5,671 |
| v6.1 | 2020-12-11 | 60 | 9,283 | 7,335 |
| v7.0 | 2021-07-21 | 76 | 13,905 | 11,192 |
| v8.0 | 2022-01-19 | 87 | 18,243 | 14,122 |
| v9.0 | 2022-04-27 | 93 | 20,217 | 14,973 |
| v10.0 | 2022-07-04 | 96 | 20,817 | 15,234 |
| v11.0 | 2022-09-21 | 100 | 24,231 | 16,429 |
| v12.0 | 2022-12-07 | 104 | 26,119 | 17,127 |
| v13.0 | 2023-03-09 | 108 | 27,141 | 17,689 |
| v14.0 | 2023-06-23 | 112 | 28,117 | 18,651 |
| v15.0 | 2023-09-08 | 114 | 28,750 | 19,159 |
| v16.1 | 2023-12-06 | 120 | 30,328 | 19,915 |
| v17.0 | 2024-03-15 | 124 | 31,175 | 20,408 |
| v18.0 | 2024-06-14 | 129 | 32,121 | 20,943 |
| v19.0 | 2024-09-13 | 131 | 32,584 | 21,593 |
| v20.0 | 2024-12-06 | 133 | 33,154 | 22,106 |
| v21.0 | 2025-03-14 | 134 | 33,534 | 22,344 |
| v22.0 | 2025-06-20 | 137 | 33,815 | 22,640 |
| v23.0 | 2025-09-05 | 286 | 35,921 | 24,600 |
| v24.0 | 2025-12-05 | 289 | 38,932 | 25,886 |
| v25.0 | 2026-03-09 | 290 | 41,792 | 28,377 |
</div>
## About the Statistics
Statistics for each release are stored as JSON files in this directory. The JSON structure may have changed slightly from release to release, so if you plan on doing any comparisons you may need to normalize them between versions.
Any demographic split (i.e. gender, age, accent) is applied to **the entire dataset**, not just the validated set. Unless otherwise indicated, durations are measured in milliseconds, and file sizes are measured in bytes.
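As one example of such normalization, the per-locale entries in `cv-corpus-10.0-2022-07-04.json` carry a `totalHrs` field, while those in `cv-corpus-1.json` only provide `duration` in milliseconds. The sketch below reads whichever is available; the function name `locale_hours` is an assumption made for illustration, and other releases may differ in ways this does not cover.

```python
MS_PER_HOUR = 3_600_000  # durations in the statistics files are in milliseconds


def locale_hours(stats: dict, locale: str) -> float:
    """Return total hours for a locale, deriving them from `duration` if needed."""
    entry = stats["locales"][locale]
    if "totalHrs" in entry:
        return float(entry["totalHrs"])
    return round(entry["duration"] / MS_PER_HOUR, 2)


# Trimmed excerpts of the two files mentioned above.
v1 = {"locales": {"en": {"duration": 2893916688}}}
v10 = {"locales": {"fa": {"duration": 1371060000, "totalHrs": 380.85}}}

print(locale_hours(v1, "en"))
print(locale_hours(v10, "fa"))
```

In a real comparison the dictionaries would come from `json.load` on the release files rather than from inline excerpts.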
## Archive Structure
Each downloaded `.tar.gz` file has the following structure, where `{lang}` represents the [BCP 47](https://en.wikipedia.org/wiki/IETF_language_tag) locale code for that language:
```txt
cv-corpus-{version}-{YYYY-MM-DD}-{lang}.tar.gz/
  cv-corpus-{version}-{YYYY-MM-DD}/
  └── {lang}/
      ├── README.md                    (datasheet, since Corpus 25.0)
      ├── clips/
      │   └── *.mp3
      ├── dev.tsv
      ├── invalidated.tsv
      ├── other.tsv
      ├── test.tsv
      ├── train.tsv
      ├── validated.tsv
      ├── reported.tsv
      ├── clip_durations.tsv
      ├── validated_sentences.tsv
      └── unvalidated_sentences.tsv
```
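The layout above can be turned into a list of expected member paths, for instance to sanity-check an archive after download. This is a sketch only: `expected_tsv_members` is a hypothetical helper, and it lists the TSV files of recent releases (older releases lack some of them).

```python
from pathlib import PurePosixPath

# TSV files shown in the archive layout above.
TSV_FILES = [
    "dev.tsv",
    "invalidated.tsv",
    "other.tsv",
    "test.tsv",
    "train.tsv",
    "validated.tsv",
    "reported.tsv",
    "clip_durations.tsv",
    "validated_sentences.tsv",
    "unvalidated_sentences.tsv",
]


def expected_tsv_members(version: str, date: str, lang: str) -> list[str]:
    """Build the in-archive paths of the TSV files for one locale."""
    root = PurePosixPath(f"cv-corpus-{version}-{date}") / lang
    return [str(root / name) for name in TSV_FILES]


for member in expected_tsv_members("25.0", "2026-03-09", "tr"):
    print(member)
```

The resulting paths could be compared against `tarfile.open(path).getnames()` for a downloaded archive.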
## TSV Fields
Each row of a clip TSV file (`validated.tsv`, `invalidated.tsv`, `other.tsv`, `train.tsv`, `dev.tsv`, `test.tsv`) represents a single audio clip:
- `client_id` -- hashed UUID of a given user
- `path` -- relative path of the audio file
- `sentence` -- text that the contributor was prompted to read aloud (the expected transcription of the audio)
- `sentence_id` -- unique identifier for the sentence (since Corpus 17.0)
- `sentence_domain` -- domain classification(s) of the sentence (since Corpus 17.0)
- `up_votes` -- number of people who said the audio matches the sentence
- `down_votes` -- number of people who said the audio does not match the sentence
- `age` -- age bracket of the speaker - if provided\*
- `gender` -- gender of the speaker - if provided\*
- `accents` -- accent(s) of the speaker - if provided\* (previously named `accent` but renamed to reflect multiple selections, since Corpus 17.0)
- `variant` -- language variant - if provided (since Corpus 13.0)
- `locale` -- locale code of the language (since Corpus 5.0)
- `segment` -- custom dataset segment, if applicable (since Corpus 5.0)
The `train.tsv`, `dev.tsv`, and `test.tsv` splits are produced by [CorporaCreator](https://github.com/common-voice/CorporaCreator) and contain the same columns as `validated.tsv`.
\*For a full list of age, gender, and accent options, see the [demographics spec](https://github.com/common-voice/common-voice/blob/main/web/src/stores/demographics.ts). These are only reported if the speaker opted in.
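A clip TSV can be read with Python's standard `csv` module, as in the sketch below. Disabling quote handling (`csv.QUOTE_NONE`) is an assumption made here so that quotation marks inside sentences are kept literally; the sample row is made up, and `read_clip_rows` is an illustrative name.

```python
import csv
import io


def read_clip_rows(handle):
    """Yield one dict per clip, with the vote counts converted to integers."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        yield row


# Made-up sample standing in for a file such as validated.tsv.
sample = io.StringIO(
    "client_id\tpath\tsentence\tup_votes\tdown_votes\tage\tgender\taccents\n"
    "abc123\tclip_0001.mp3\tHello world\t2\t0\ttwenties\t\t\n"
)
rows = list(read_clip_rows(sample))
print(rows[0]["sentence"], rows[0]["up_votes"])
```

Demographic fields that the speaker did not provide come back as empty strings in this sketch. For a real file, pass `open(path, encoding="utf-8", newline="")` instead of the in-memory sample.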
### Additional TSV Files
**`clip_durations.tsv`** (since Corpus 16.1) -- clip filename and duration:
- `clip` -- clip filename
- `duration[ms]` -- duration of the clip in milliseconds
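For example, the total audio duration of a locale can be computed from this file. The sketch assumes the header names given above; the two sample rows are made up, and `total_hours` is an illustrative name.

```python
import csv
import io

MS_PER_HOUR = 3_600_000


def total_hours(handle) -> float:
    """Sum the `duration[ms]` column and convert milliseconds to hours."""
    reader = csv.DictReader(handle, delimiter="\t")
    total_ms = sum(float(row["duration[ms]"]) for row in reader)
    return total_ms / MS_PER_HOUR


# Made-up sample standing in for clip_durations.tsv.
sample = io.StringIO(
    "clip\tduration[ms]\n"
    "clip_0001.mp3\t5000\n"
    "clip_0002.mp3\t4000\n"
)
hours = total_hours(sample)
print(hours)
```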
**`validated_sentences.tsv`** (since Corpus 17.0) -- sentences that have reached the validated threshold (two or more up votes):
- `sentence_id` -- unique identifier for the sentence
- `sentence` -- the sentence itself
- `variant` -- language variant token for the sentence, if provided (since Corpus 25.0)
- `sentence_domain` -- domain classification(s) of the sentence, if provided
- `source` -- origin of the sentence (user provided or from old files under server/data)
- `is_used` -- whether the sentence is still eligible for recording (sentences may be retired if they are incorrect, outdated, too similar to other sentences, or for other reasons via database migrations)
- `clips_count` -- number of clips recorded for this sentence
**`unvalidated_sentences.tsv`** (since Corpus 17.0) -- sentences that have not reached the validated threshold or have been rejected:
- `sentence_id` -- unique identifier for the sentence
- `sentence` -- the sentence itself
- `variant` -- language variant token for the sentence, if provided (since Corpus 25.0)
- `sentence_domain` -- domain classification(s) of the sentence, if provided
- `source` -- origin of the sentence (user provided or from old files under server/data)
- `up_votes` -- number of approving votes (since Corpus 25.0)
- `down_votes` -- number of rejecting votes (since Corpus 25.0)
- `status` -- `pending` (not yet decided) or `rejected` (2+ down votes exceeding up votes) (since Corpus 25.0)
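Because `status` only exists from Corpus 25.0 onwards, code that summarizes this file across releases may need to tolerate its absence. A minimal sketch, with made-up sample rows and the hypothetical helper `status_counts`:

```python
import csv
import io
from collections import Counter


def status_counts(handle) -> Counter:
    """Count sentences per `status`; rows without the column count as 'unknown'."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row.get("status") or "unknown" for row in reader)


# Made-up sample standing in for unvalidated_sentences.tsv.
sample = io.StringIO(
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tup_votes\tdown_votes\tstatus\n"
    "s1\tFirst example\t\tgeneral\tdemo\t1\t0\tpending\n"
    "s2\tSecond example\t\t\tdemo\t0\t2\trejected\n"
    "s3\tThird example\t\t\tdemo\t0\t0\tpending\n"
)
counts = status_counts(sample)
print(counts["pending"], counts["rejected"])
```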
### Validation Categories
- `validated` -- clips with two or more validations where `up_votes` > `down_votes`
- `invalidated` -- clips with two or more validations where `down_votes` > `up_votes`, or three or more where `down_votes` = `up_votes`
- `other` -- clips without sufficient validations to determine their status
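The three rules can be written as a small function. This is a sketch of the rules as stated above, assuming that the number of validations equals `up_votes + down_votes`; it is not the code used by the bundler.

```python
def clip_category(up_votes: int, down_votes: int) -> str:
    """Classify a clip from its votes, following the rules listed above."""
    total = up_votes + down_votes  # assumption: every validation is one vote
    if total >= 2 and up_votes > down_votes:
        return "validated"
    if total >= 2 and down_votes > up_votes:
        return "invalidated"
    if total >= 3 and down_votes == up_votes:
        return "invalidated"
    return "other"


for votes in [(2, 0), (0, 2), (1, 1), (2, 2), (1, 0)]:
    print(votes, clip_category(*votes))
```

Under this reading, a clip with one up vote and one down vote stays in `other` until further validations arrive.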
**`reported.tsv`** (since Corpus 5.0) -- sentences flagged by contributors:
- `sentence` -- text of the reported sentence
- `sentence_id` -- unique identifier for the sentence
- `locale` -- locale code
- `reason` -- report reason: `offensive-language`, `grammar-or-spelling`, `different-language`, `difficult-pronounce`
Note: reporting a sentence does not remove it from circulation. Reported sentences remain available for recording and validation. The `reported.tsv` file is provided for post-processing by dataset consumers.
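One possible post-processing step is to drop clips whose sentence was reported. The sketch below assumes both files have been loaded into lists of dicts and that the clip rows carry `sentence_id` (available since Corpus 17.0); `drop_reported` is a hypothetical helper, and whether to exclude reported sentences at all is a decision for the dataset consumer.

```python
def drop_reported(clip_rows, reported_rows, reasons=None):
    """Return the clips whose sentence_id was not reported.

    If `reasons` is given, only reports with one of those reasons count.
    """
    flagged = {
        report["sentence_id"]
        for report in reported_rows
        if reasons is None or report["reason"] in reasons
    }
    return [clip for clip in clip_rows if clip["sentence_id"] not in flagged]


# Made-up rows standing in for validated.tsv and reported.tsv.
clips = [
    {"path": "clip_0001.mp3", "sentence_id": "s1"},
    {"path": "clip_0002.mp3", "sentence_id": "s2"},
]
reports = [{"sentence_id": "s2", "reason": "grammar-or-spelling"}]

print(drop_reported(clips, reports))
print(drop_reported(clips, reports, reasons={"offensive-language"}))
```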
## Use for Machine Learning
We use the [Corpora Creator](https://github.com/common-voice/CorporaCreator) tool to parse the metadata and generate [train, dev, and test](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) sets. The Corpora Creator eliminates duplication in clips and maximizes speaker diversity.
The train/dev/test sets are generated non-deterministically, so they vary from release to release, even for minor updates. This avoids reproducing and perpetuating demographic skews in each subsequent set.
Note that the clips in these sets will most likely not add up to the total number of validated clips because of this deduplication. If you want to get as close as possible to the total, see the Corpora Creator repository for how to include multiple recordings per sentence (the `-s` flag).
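The gap can be checked from the `buckets` object in a release's statistics JSON. The sketch below uses the English entry of `cv-corpus-1.json`; the function name `split_coverage` is illustrative.

```python
def split_coverage(buckets: dict) -> tuple[int, float]:
    """Return the number of clips in train/dev/test and their share of validated."""
    in_splits = buckets["train"] + buckets["dev"] + buckets["test"]
    return in_splits, in_splits / buckets["validated"]


# `buckets` for locale "en" in cv-corpus-1.json.
en_v1 = {"dev": 7016, "test": 7016, "train": 12135, "validated": 490483}

clips_in_splits, share = split_coverage(en_v1)
print(clips_in_splits, f"{share:.1%}")
```

For this entry the three splits together hold only a small share of the validated clips, which is the effect described above.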
================================================
FILE: datasets/scripted-speech/cv-corpus-1.json
================================================
{
"date": "2019-02-25",
"locales": {
"en": {
"clips": 677020,
"splits": {
"accent": {
"": 0.55,
"canada": 0.02,
"england": 0.08,
"us": 0.21,
"indian": 0.04,
"australia": 0.02,
"malaysia": 0,
"newzealand": 0.01,
"african": 0.01,
"ireland": 0.01,
"philippines": 0,
"singapore": 0,
"scotland": 0.01,
"hongkong": 0,
"bermuda": 0,
"southatlandtic": 0,
"wales": 0,
"other": 0.03
},
"age": {
"": 0.48,
"twenties": 0.18,
"sixties": 0.02,
"thirties": 0.14,
"teens": 0.04,
"seventies": 0.01,
"fourties": 0.08,
"fifties": 0.05,
"eighties": 0,
"nineties": 0
},
"gender": { "": 0.48, "male": 0.41, "female": 0.1, "other": 0.01 }
},
"users": 33541,
"duration": 2893916688,
"buckets": {
"dev": 7016,
"invalidated": 61200,
"other": 125337,
"test": 7016,
"train": 12135,
"validated": 490483
},
"size": 22487893709
},
"de": {
"clips": 133646,
"splits": {
"accent": {
"germany": 0.71,
"": 0.23,
"austria": 0.02,
"liechtenstein": 0,
"switzerland": 0.03,
"france": 0,
"other": 0.01,
"poland": 0,
"united_kingdom": 0,
"hungary": 0,
"netherlands": 0,
"namibia": 0
},
"age": {
"twenties": 0.31,
"fourties": 0.17,
"": 0.19,
"thirties": 0.14,
"teens": 0.04,
"sixties": 0.03,
"fifties": 0.12,
"seventies": 0
},
"gender": { "male": 0.76, "": 0.19, "female": 0.05, "other": 0 }
},
"users": 2249,
"duration": 526772160,
"buckets": {
"dev": 2269,
"invalidated": 5487,
"other": 0,
"test": 2269,
"train": 2629,
"validated": 128159
},
"size": 4151335731
},
"fr": {
"clips": 75022,
"splits": {
"accent": {
"": 0.22,
"france": 0.74,
"germany": 0,
"belgium": 0.02,
"switzerland": 0.01,
"guadeloupe": 0,
"reunion": 0,
"monaco": 0,
"tunisia": 0,
"canada": 0.01,
"other": 0,
"mayotte": 0,
"algeria": 0,
"netherlands": 0,
"senegal": 0,
"martinique": 0,
"portugal": 0,
"united_states": 0,
"cote_d_ivoire": 0,
"st_pierre_et_miquelon": 0
},
"age": {
"twenties": 0.23,
"thirties": 0.2,
"": 0.21,
"teens": 0.06,
"fourties": 0.25,
"fifties": 0.02,
"sixties": 0.02,
"seventies": 0
},
"gender": { "male": 0.72, "": 0.21, "female": 0.07, "other": 0 }
},
"users": 1697,
"duration": 284516280,
"buckets": {
"dev": 8857,
"invalidated": 4770,
"other": 0,
"test": 8858,
"train": 18941,
"validated": 70252
},
"size": 2245754155
},
"cy": {
"clips": 19412,
"splits": {
"accent": { "united_kingdom": 0.6, "": 0.36, "other": 0.03 },
"age": {
"fourties": 0.16,
"twenties": 0.13,
"sixties": 0.18,
"fifties": 0.09,
"": 0.32,
"thirties": 0.11,
"seventies": 0.01,
"eighties": 0
},
"gender": { "male": 0.43, "female": 0.26, "": 0.31 }
},
"users": 365,
"duration": 79378296,
"buckets": {
"dev": 484,
"invalidated": 672,
"other": 0,
"test": 484,
"train": 500,
"validated": 18731
},
"size": 622806292
},
"br": {
"clips": 9306,
"splits": {
"accent": { "other": 0.01, "": 0.99 },
"age": {
"twenties": 0.16,
"": 0.56,
"fifties": 0.01,
"fourties": 0.13,
"thirties": 0.14,
"sixties": 0.01
},
"gender": { "male": 0.43, "": 0.56, "female": 0.02 }
},
"users": 82,
"duration": 26068056,
"buckets": {
"dev": 1022,
"invalidated": 364,
"other": 5147,
"test": 1054,
"train": 1458,
"validated": 3795
},
"size": 201554829
},
"cv": {
"clips": 2299,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.03,
"": 0.54,
"fourties": 0.39,
"thirties": 0.03,
"teens": 0.01
},
"gender": { "male": 0.46, "": 0.54 }
},
"users": 33,
"duration": 9802458,
"buckets": {
"dev": 49,
"invalidated": 628,
"other": 915,
"test": 187,
"train": 414,
"validated": 756
},
"size": 77597058
},
"tr": {
"clips": 6226,
"splits": {
"accent": { "": 0.87, "other": 0.13 },
"age": {
"": 0.18,
"thirties": 0.41,
"twenties": 0.37,
"teens": 0.02,
"fourties": 0.01,
"fifties": 0
},
"gender": { "": 0.18, "male": 0.75, "female": 0.07 }
},
"users": 203,
"duration": 23086560,
"buckets": {
"dev": 1039,
"invalidated": 843,
"other": 331,
"test": 1112,
"train": 1265,
"validated": 5052
},
"size": 182107529
},
"tt": {
"clips": 20663,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.17,
"thirties": 0.78,
"twenties": 0.04,
"sixties": 0,
"fifties": 0.01,
"teens": 0,
"fourties": 0
},
"gender": { "": 0.17, "male": 0.8, "female": 0.02 }
},
"users": 117,
"duration": 73813776,
"buckets": {
"dev": 1882,
"invalidated": 199,
"other": 17,
"test": 3437,
"train": 7425,
"validated": 20447
},
"size": 555266991
},
"ky": {
"clips": 4766,
"splits": {
"accent": { "other": 0.11, "": 0.89 },
"age": {
"thirties": 0.19,
"": 0.13,
"fourties": 0.05,
"twenties": 0.64
},
"gender": { "male": 0.8, "": 0.15, "female": 0.06 }
},
"users": 63,
"duration": 22124544,
"buckets": {
"dev": 782,
"invalidated": 249,
"other": 0,
"test": 1206,
"train": 1888,
"validated": 4517
},
"size": 152729320
},
"ga-IE": {
"clips": 2007,
"splits": {
"accent": {
"": 0.24,
"connachta": 0.45,
"other": 0.18,
"ulaidh": 0.13
},
"age": {
"twenties": 0.03,
"": 0.21,
"thirties": 0.57,
"fourties": 0.13,
"sixties": 0.01,
"teens": 0.04
},
"gender": { "male": 0.57, "": 0.21, "female": 0.22 }
},
"users": 30,
"duration": 6320352,
"buckets": {
"dev": 284,
"invalidated": 134,
"other": 0,
"test": 415,
"train": 644,
"validated": 1873
},
"size": 48777677
},
"kab": {
"clips": 101313,
"splits": {
"accent": { "": 0.73, "other": 0.27 },
"age": {
"fourties": 0.16,
"thirties": 0.21,
"": 0.33,
"twenties": 0.17,
"fifties": 0.04,
"eighties": 0.08,
"teens": 0,
"sixties": 0
},
"gender": { "male": 0.53, "": 0.29, "female": 0.17, "other": 0.01 }
},
"users": 382,
"duration": 353173920,
"buckets": {
"dev": 4073,
"invalidated": 6111,
"other": 0,
"test": 4073,
"train": 5395,
"validated": 95202
},
"size": 2787486601
},
"ca": {
"clips": 77137,
"splits": {
"accent": {
"valencian": 0.07,
"central": 0.69,
"": 0.18,
"other": 0.01,
"balearic": 0.01,
"northwestern": 0.04,
"northern": 0
},
"age": {
"thirties": 0.13,
"fifties": 0.26,
"fourties": 0.28,
"twenties": 0.08,
"": 0.18,
"sixties": 0.05,
"teens": 0.02,
"seventies": 0,
"eighties": 0
},
"gender": { "male": 0.38, "": 0.18, "female": 0.44, "other": 0 }
},
"users": 1639,
"duration": 353712432,
"buckets": {
"dev": 8375,
"invalidated": 4302,
"other": 21,
"test": 8374,
"train": 16870,
"validated": 72814
},
"size": 2780768799
},
"zh-TW": {
"clips": 36369,
"splits": {
"accent": { "": 0.79, "other": 0.21 },
"age": {
"thirties": 0.3,
"twenties": 0.43,
"teens": 0.01,
"": 0.24,
"fifties": 0.01,
"seventies": 0,
"fourties": 0.02
},
"gender": { "male": 0.38, "": 0.23, "female": 0.35, "other": 0.04 }
},
"users": 695,
"duration": 101606832,
"buckets": {
"dev": 1154,
"invalidated": 1765,
"other": 9315,
"test": 1154,
"train": 1240,
"validated": 25289
},
"size": 800988779
},
"sl": {
"clips": 3286,
"splits": {
"accent": { "other": 0.02, "": 0.98 },
"age": {
"twenties": 0.83,
"teens": 0.01,
"": 0.01,
"sixties": 0,
"fifties": 0.15
},
"gender": { "female": 0.17, "male": 0.82, "": 0.01 }
},
"users": 18,
"duration": 12475584,
"buckets": {
"dev": 291,
"invalidated": 97,
"other": 1399,
"test": 320,
"train": 762,
"validated": 1790
},
"size": 98867237
},
"it": {
"clips": 16048,
"splits": {
"accent": { "": 0.7, "other": 0.3 },
"age": {
"thirties": 0.13,
"twenties": 0.37,
"": 0.34,
"fifties": 0.08,
"fourties": 0.06,
"seventies": 0,
"sixties": 0,
"teens": 0.02
},
"gender": { "female": 0.07, "male": 0.67, "": 0.26 }
},
"users": 313,
"duration": 70795560,
"buckets": {
"dev": 3085,
"invalidated": 3061,
"other": 2,
"test": 3082,
"train": 3812,
"validated": 12985
},
"size": 556736370
},
"nl": {
"clips": 13385,
"splits": {
"accent": { "": 0.22, "netherlands": 0.68, "belgium": 0.1, "other": 0 },
"age": {
"": 0.18,
"twenties": 0.35,
"fourties": 0.05,
"thirties": 0.13,
"teens": 0.02,
"fifties": 0.26,
"sixties": 0
},
"gender": { "": 0.24, "male": 0.74, "female": 0.02, "other": 0 }
},
"users": 373,
"duration": 48954768,
"buckets": {
"dev": 1542,
"invalidated": 700,
"other": 243,
"test": 1542,
"train": 1701,
"validated": 12442
},
"size": 382910541
},
"cnh": {
"clips": 4289,
"splits": {
"accent": { "": 0.8, "other": 0.2 },
"age": {
"": 0.52,
"twenties": 0.3,
"fourties": 0.02,
"teens": 0.02,
"thirties": 0.11,
"fifties": 0.03
},
"gender": { "": 0.53, "male": 0.26, "female": 0.22 }
},
"users": 253,
"duration": 15737520,
"buckets": {
"dev": 641,
"invalidated": 452,
"other": 1689,
"test": 659,
"train": 733,
"validated": 2148
},
"size": 124559394
},
"eo": {
"clips": 5882,
"splits": {
"accent": { "": 0.76, "internacia": 0.19, "other": 0.05 },
"age": {
"twenties": 0.78,
"thirties": 0.04,
"": 0.1,
"fourties": 0.05,
"fifties": 0.01,
"seventies": 0,
"teens": 0.02
},
"gender": { "male": 0.21, "": 0.68, "female": 0.1, "other": 0.01 }
},
"users": 53,
"duration": 23382864,
"buckets": {
"dev": 526,
"invalidated": 238,
"other": 1872,
"test": 1057,
"train": 1992,
"validated": 3772
},
"size": 184514284
},
"et": {
"clips": 35,
"splits": {
"accent": { "": 1 },
"age": { "": 0.14, "thirties": 0.86 },
"gender": { "": 0.14, "male": 0.86 }
},
"users": 3,
"duration": 230328,
"buckets": {
"dev": 5,
"invalidated": 12,
"other": 0,
"test": 3,
"train": 15,
"validated": 23
},
"size": 1822156
}
},
"totalDuration": 4925868978,
"totalValidDurationSecs": 3946252,
"totalHrs": 1368,
"totalValidHrs": 1096,
"totalClips": 1208111
}
================================================
FILE: datasets/scripted-speech/cv-corpus-10.0-2022-07-04.json
================================================
{
"date": "2022-07-04",
"locales": {
"en": {
"duration": 10981597567,
"buckets": {
"dev": 16345,
"invalidated": 248337,
"other": 293021,
"reported": 4158,
"test": 16345,
"train": 921404,
"validated": 1589008
},
"reportedSentences": 4095,
"clips": 2130366,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.37,
"twenties": 0.24,
"sixties": 0.04,
"thirties": 0.13,
"teens": 0.06,
"seventies": 0.01,
"fourties": 0.1,
"fifties": 0.05,
"eighties": 0,
"nineties": 0
},
"gender": { "": 0.37, "male": 0.45, "female": 0.16, "other": 0.02 }
},
"users": 83790,
"size": 78797545779,
"checksum": "b82354bf4ff7a62568e071dbba3a48160f7368ed94890fd57f466a85c27e0511",
"avgDurationSecs": 5.155,
"validDurationSecs": 8191008.675,
"totalHrs": 3050.44,
"validHrs": 2275.28
},
"fa": {
"duration": 1371060000,
"buckets": {
"dev": 9937,
"invalidated": 13320,
"other": 34431,
"reported": 2117,
"test": 9937,
"train": 24672,
"validated": 294426
},
"reportedSentences": 2109,
"clips": 342177,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.24,
"twenties": 0.31,
"thirties": 0.37,
"fifties": 0.02,
"fourties": 0.03,
"teens": 0.03,
"sixties": 0
},
"gender": { "": 0.21, "male": 0.72, "female": 0.07, "other": 0 }
},
"users": 4088,
"size": 10118159391,
"checksum": "c298ceacbe35edbc0ed948c068afbd87c2076768a5e70ebe7f4166f7e053e4a8",
"avgDurationSecs": 4.0,
"validDurationSecs": 1177704,
"totalHrs": 380.85,
"validHrs": 327.14
},
"fr": {
"buckets": {
"dev": 16058,
"invalidated": 55708,
"other": 10156,
"reported": 6384,
"test": 16058,
"train": 458935,
"validated": 625586
},
"reportedSentences": 6308,
"duration": 3455188498,
"clips": 691450,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.18,
"thirties": 0.17,
"": 0.35,
"teens": 0.03,
"fourties": 0.14,
"fifties": 0.1,
"sixties": 0.03,
"seventies": 0.01,
"eighties": 0,
"nineties": 0
},
"gender": { "male": 0.62, "": 0.27, "female": 0.1, "other": 0.01 }
},
"users": 16510,
"size": 24964187263,
"checksum": "a7f6596f3f679fca1a9ae24f319b5feda67bcea5d1a514c6f6f32ae65f88aa27",
"avgDurationSecs": 4.997,
"validDurationSecs": 3126064.866,
"totalHrs": 959.77,
"validHrs": 868.35
},
"es": {
"buckets": {
"dev": 15459,
"invalidated": 48321,
"other": 211249,
"reported": 1773,
"test": 15459,
"train": 217774,
"validated": 293025
},
"reportedSentences": 1759,
"duration": 2792421414,
"clips": 552595,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.11,
"": 0.32,
"fifties": 0.08,
"twenties": 0.25,
"teens": 0.03,
"fourties": 0.07,
"sixties": 0.15,
"eighties": 0,
"seventies": 0,
"nineties": 0
},
"gender": { "male": 0.51, "": 0.32, "other": 0.01, "female": 0.17 }
},
"users": 23627,
"size": 20310025597,
"checksum": "29921567c0b8f98953295ff53d69bd3c0b6c6beb746791472daba1c87199ae66",
"avgDurationSecs": 5.053,
"validDurationSecs": 1480739.574,
"totalHrs": 775.67,
"validHrs": 411.31
},
"sl": {
"buckets": {
"dev": 1143,
"invalidated": 247,
"other": 1596,
"reported": 34,
"test": 1233,
"train": 1460,
"validated": 9525
},
"reportedSentences": 35,
"duration": 43362226,
"clips": 11368,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.51,
"teens": 0.08,
"": 0.2,
"sixties": 0.07,
"fifties": 0.07,
"fourties": 0.02,
"thirties": 0.05
},
"gender": { "female": 0.16, "male": 0.64, "": 0.2, "other": 0 }
},
"users": 137,
"size": 308301224,
"checksum": "a461623bfc1deed48ce7ef2ec4b64fd724006472b1e481e8c3dd47e3290c23f3",
"avgDurationSecs": 3.814,
"validDurationSecs": 36332.266,
"totalHrs": 12.04,
"validHrs": 10.09
},
"kab": {
"buckets": {
"dev": 14889,
"invalidated": 19217,
"other": 102012,
"reported": 4844,
"test": 14891,
"train": 141620,
"validated": 597994
},
"reportedSentences": 4838,
"duration": 2393581624,
"clips": 719223,
"splits": {
"accent": { "": 1 },
"age": {
"fourties": 0.09,
"thirties": 0.3,
"": 0.27,
"fifties": 0.19,
"twenties": 0.12,
"eighties": 0,
"teens": 0,
"sixties": 0.03,
"seventies": 0
},
"gender": { "male": 0.54, "": 0.25, "female": 0.2, "other": 0 }
},
"users": 1451,
"size": 18030751474,
"checksum": "c0b4cec3a040eaf0abf4ab5cc434aef152939f901cb8fd5e7a46f0dc03d280cd",
"avgDurationSecs": 3.328,
"validDurationSecs": 1990130.251,
"totalHrs": 664.88,
"validHrs": 552.81
},
"cy": {
"buckets": {
"dev": 5200,
"invalidated": 4294,
"other": 18134,
"reported": 152,
"test": 5209,
"train": 7594,
"validated": 87294
},
"reportedSentences": 153,
"duration": 531427401,
"clips": 109722,
"splits": {
"accent": { "": 1 },
"age": {
"fourties": 0.16,
"twenties": 0.13,
"sixties": 0.06,
"fifties": 0.09,
"": 0.42,
"thirties": 0.09,
"seventies": 0.01,
"eighties": 0,
"teens": 0.02
},
"gender": { "male": 0.33, "female": 0.25, "": 0.41, "other": 0.01 }
},
"users": 1715,
"size": 3921370932,
"checksum": "83f81318cf77b1f9835762b0f4dc06af083d387bf0524200b8dce8492a16fb56",
"avgDurationSecs": 4.843,
"validDurationSecs": 422799.653,
"totalHrs": 147.61,
"validHrs": 117.44
},
"ca": {
"buckets": {
"dev": 16277,
"invalidated": 65015,
"other": 560080,
"reported": 4809,
"test": 16277,
"train": 744272,
"validated": 907704
},
"reportedSentences": 4761,
"duration": 8449005306,
"clips": 1532799,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.07,
"fifties": 0.17,
"fourties": 0.1,
"twenties": 0.06,
"": 0.34,
"sixties": 0.22,
"teens": 0.01,
"seventies": 0.03,
"eighties": 0,
"nineties": 0
},
"gender": { "male": 0.42, "": 0.35, "female": 0.23, "other": 0 }
},
"users": 28648,
"size": 52904759182,
"checksum": "e6df7a73ffa2f9b61615c4f8199b44fb7a599cbbbd74a56789758842a6d77f54",
"avgDurationSecs": 5.512,
"validDurationSecs": 5003393.082,
"totalHrs": 2346.94,
"validHrs": 1389.83
},
"de": {
"buckets": {
"dev": 16067,
"invalidated": 47081,
"other": 5646,
"reported": 7953,
"test": 16067,
"train": 466189,
"validated": 793068
},
"reportedSentences": 7929,
"duration": 4363152833,
"clips": 845795,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.19,
"fourties": 0.17,
"": 0.32,
"thirties": 0.15,
"teens": 0.03,
"sixties": 0.03,
"fifties": 0.1,
"seventies": 0,
"eighties": 0,
"nineties": 0
},
"gender": { "male": 0.59, "": 0.32, "female": 0.09, "other": 0.01 }
},
"users": 16944,
"size": 31181120747,
"checksum": "a293c3e341f6aaf25019bc852d6475d0ef2f85c2835f8481f7cc65dbd0bde2fa",
"avgDurationSecs": 5.159,
"validDurationSecs": 4091153.165,
"totalHrs": 1211.98,
"validHrs": 1136.43
},
"tt": {
"duration": 108365246,
"buckets": {
"dev": 3062,
"invalidated": 385,
"other": 7,
"reported": 3,
"test": 5119,
"train": 9783,
"validated": 28531
},
"reportedSentences": 4,
"clips": 28923,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.2,
"thirties": 0.73,
"twenties": 0.05,
"sixties": 0,
"fifties": 0.01,
"teens": 0,
"fourties": 0,
"seventies": 0.01
},
"gender": { "": 0.2, "male": 0.79, "female": 0.02 }
},
"users": 219,
"size": 802652172,
"checksum": "e5b1670372444451dbd146a2bb911144eca59a843f59c347c98f75ee5c5ac507",
"avgDurationSecs": 3.747,
"validDurationSecs": 106896.547,
"totalHrs": 30.1,
"validHrs": 29.69
},
"ta": {
"duration": 1370209560,
"buckets": {
"dev": 11781,
"invalidated": 5557,
"other": 85369,
"reported": 3296,
"test": 11820,
"train": 40987,
"validated": 129693
},
"reportedSentences": 3296,
"clips": 220619,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.08,
"thirties": 0.09,
"": 0.72,
"fourties": 0.03,
"seventies": 0.02,
"fifties": 0.03,
"teens": 0.03,
"sixties": 0,
"eighties": 0
},
"gender": { "male": 0.16, "": 0.71, "other": 0, "female": 0.13 }
},
"users": 761,
"size": 8212452630,
"checksum": "8d8427ca7d2735131f5b77afc4ecd01342f617487f76a64c5eb2a597bc74f9a2",
"avgDurationSecs": 6.211,
"validDurationSecs": 805490.862,
"totalHrs": 380.61,
"validHrs": 223.74
},
"ru": {
"duration": 753692040,
"buckets": {
"dev": 9495,
"invalidated": 6751,
"other": 20572,
"reported": 320,
"test": 9494,
"train": 22117,
"validated": 118707
},
"reportedSentences": 314,
"clips": 146030,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.4,
"teens": 0.09,
"": 0.21,
"fourties": 0.14,
"thirties": 0.13,
"fifties": 0.03,
"sixties": 0,
"seventies": 0
},
"gender": { "male": 0.62, "": 0.21, "other": 0, "female": 0.16 }
},
"users": 2688,
"size": 5297836761,
"checksum": "78ea4fa2de776edc8ecae52440d0e4fc4669eb962ae8150f91078c907c5d049a",
"avgDurationSecs": 5.161,
"validDurationSecs": 612672.197,
"totalHrs": 209.35,
"validHrs": 170.18
},
"nl": {
"duration": 394176457,
"buckets": {
"dev": 10634,
"invalidated": 5098,
"other": 2412,
"reported": 318,
"test": 10641,
"train": 29521,
"validated": 83821
},
"reportedSentences": 318,
"clips": 91331,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.41,
"twenties": 0.22,
"fourties": 0.14,
"thirties": 0.11,
"teens": 0.02,
"fifties": 0.08,
"sixties": 0.01,
"nineties": 0,
"eighties": 0,
"seventies": 0
},
"gender": { "": 0.42, "male": 0.47, "female": 0.11, "other": 0 }
},
"users": 1520,
"size": 2713214861,
"checksum": "d9053d64a7e5fc2d3853c0d920c73229650cb4d88130c9c9b9939f1fa582fe4a",
"avgDurationSecs": 4.316,
"validDurationSecs": 361763.966,
"totalHrs": 109.49,
"validHrs": 100.48
},
"it": {
"duration": 1249550976,
"buckets": {
"dev": 14964,
"invalidated": 17188,
"other": 27,
"reported": 5279,
"test": 14973,
"train": 149590,
"validated": 216074
},
"reportedSentences": 5275,
"clips": 233289,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.16,
"twenties": 0.21,
"": 0.3,
"fifties": 0.16,
"fourties": 0.14,
"seventies": 0,
"sixties": 0.02,
"teens": 0.01,
"eighties": 0,
"nineties": 0
},
"gender": { "female": 0.12, "male": 0.59, "": 0.29, "other": 0 }
},
"users": 6735,
"size": 8670197349,
"checksum": "ae01f1b6fd93a65d964c274c4aee182ea00d861f0a4f9e98bc5c06fe55a4a1b5",
"avgDurationSecs": 5.356,
"validDurationSecs": 1157343.371,
"totalHrs": 347.09,
"validHrs": 321.48
},
"eu": {
"duration": 528391635,
"buckets": {
"dev": 6560,
"invalidated": 5790,
"other": 26855,
"reported": 61,
"test": 6560,
"train": 10829,
"validated": 69142
},
"reportedSentences": 61,
"clips": 101787,
"splits": {
"accent": { "": 1 },
"age": {
"fourties": 0.13,
"thirties": 0.07,
"fifties": 0.14,
"twenties": 0.35,
"": 0.25,
"teens": 0.03,
"sixties": 0.02,
"seventies": 0
},
"gender": { "male": 0.47, "female": 0.26, "": 0.25, "other": 0.02 }
},
"users": 1209,
"size": 3983306491,
"checksum": "a0b8c72bdcce0d23e9a58ba71681e5a1c5093504159c5f69d699a4b9da42cb85",
"avgDurationSecs": 5.191,
"validDurationSecs": 358926.527,
"totalHrs": 146.77,
"validHrs": 99.7
},
"tr": {
"duration": 285540831,
"buckets": {
"dev": 9095,
"invalidated": 3503,
"other": 151,
"reported": 332,
"test": 9124,
"train": 20228,
"validated": 74486
},
"reportedSentences": 333,
"clips": 78140,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.33,
"thirties": 0.09,
"twenties": 0.27,
"teens": 0.02,
"fourties": 0.04,
"fifties": 0.09,
"sixties": 0.12,
"eighties": 0,
"seventies": 0.03
},
"gender": { "": 0.33, "male": 0.46, "female": 0.21, "other": 0 }
},
"users": 1299,
"size": 1777072023,
"checksum": "2d12424877b65b3b39e6031c368a3d60579e347d31593558c57e0a3dc11b3791",
"avgDurationSecs": 3.654,
"validDurationSecs": 272188.307,
"totalHrs": 79.31,
"validHrs": 75.6
},
"ar": {
"duration": 523410000,
"buckets": {
"dev": 10354,
"invalidated": 14919,
"other": 34918,
"reported": 2062,
"test": 10435,
"train": 28078,
"validated": 75862
},
"reportedSentences": 2049,
"clips": 125699,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.11,
"": 0.56,
"twenties": 0.28,
"fourties": 0.01,
"teens": 0.03,
"fifties": 0,
"sixties": 0,
"nineties": 0
},
"gender": { "female": 0.18, "": 0.56, "male": 0.27, "other": 0 }
},
"users": 1272,
"size": 3110063691,
"checksum": "4f81fec5272134b6e7de8195fc94e629975c3142691587ad3be961c3cb12a686",
"avgDurationSecs": 4.164,
"validDurationSecs": 315889.368,
"totalHrs": 145.4,
"validHrs": 87.747
},
"zh-TW": {
"duration": 396217642,
"buckets": {
"dev": 4670,
"invalidated": 4556,
"other": 39955,
"reported": 139,
"test": 4670,
"train": 6494,
"validated": 76358
},
"reportedSentences": 140,
"clips": 120869,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.2,
"twenties": 0.33,
"teens": 0.05,
"": 0.27,
"fifties": 0.04,
"seventies": 0,
"fourties": 0.1,
"sixties": 0
},
"gender": { "male": 0.47, "": 0.27, "female": 0.25, "other": 0.02 }
},
"users": 2061,
"size": 2797955208,
"checksum": "c4c488f69eadae226b056396e3f6dc40baf89674240afb101e19d01ee2922614",
"avgDurationSecs": 3.278,
"validDurationSecs": 250307.248,
"totalHrs": 110.06,
"validHrs": 69.52
},
"br": {
"duration": 72258667,
"buckets": {
"dev": 2158,
"invalidated": 759,
"other": 11499,
"reported": 221,
"test": 2157,
"train": 2559,
"validated": 11168
},
"reportedSentences": 221,
"clips": 23426,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.25,
"": 0.33,
"fifties": 0.06,
"fourties": 0.07,
"thirties": 0.08,
"sixties": 0.17,
"seventies": 0.02,
"teens": 0.01
},
"gender": { "male": 0.64, "": 0.33, "female": 0.02, "other": 0 }
},
"users": 177,
"size": 531272372,
"checksum": "6275bfb2f17e9857a61d25185e47d83f9486f3b5a08e67219f269e0559410f3a",
"avgDurationSecs": 3.085,
"validDurationSecs": 34448.254,
"totalHrs": 20.07,
"validHrs": 9.56
},
"pt": {
"duration": 521906587,
"buckets": {
"dev": 8606,
"invalidated": 4634,
"other": 16276,
"reported": 2390,
"test": 8611,
"train": 17852,
"validated": 103678
},
"reportedSentences": 2385,
"clips": 124588,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.2,
"twenties": 0.41,
"teens": 0.03,
"thirties": 0.22,
"fourties": 0.1,
"sixties": 0.01,
"fifties": 0.03,
"seventies": 0
},
"gender": { "": 0.2, "male": 0.74, "female": 0.04, "other": 0.02 }
},
"users": 2562,
"size": 3400999932,
"checksum": "3ceeb91b12a07cbf8cf983b367573a35fdb89b39d925c086b595d1d32ca807cb",
"avgDurationSecs": 4.189,
"validDurationSecs": 434313.346,
"totalHrs": 144.97,
"validHrs": 120.64
},
"eo": {
"duration": 6740710000,
"buckets": {
"dev": 14907,
"invalidated": 127293,
"other": 135058,
"reported": 2127,
"test": 14907,
"train": 143988,
"validated": 848511
},
"reportedSentences": 2126,
"clips": 1110862,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.56,
"thirties": 0.12,
"": 0.2,
"fourties": 0.04,
"fifties": 0.02,
"seventies": 0,
"teens": 0.05,
"sixties": 0,
"eighties": 0
},
"gender": { "male": 0.69, "": 0.2, "female": 0.11, "other": 0 }
},
"users": 1541,
"size": 40260737095,
"checksum": "2179bad54bb2b69cd12964bc2f6533b9538b7a3f943f9e65f8f9a463796fd901",
"avgDurationSecs": 6.068,
"validDurationSecs": 5148764,
"totalHrs": 1872.42,
"validHrs": 1430.21
},
"zh-CN": {
"duration": 1614763392,
"buckets": {
"dev": 9760,
"invalidated": 6889,
"other": 293461,
"reported": 510,
"test": 9783,
"train": 23764,
"validated": 50432
},
"reportedSentences": 501,
"clips": 350782,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.87,
"teens": 0.02,
"twenties": 0.08,
"thirties": 0.02,
"fourties": 0.01,
"nineties": 0,
"fifties": 0,
"sixties": 0
},
"gender": { "": 0.87, "male": 0.11, "female": 0.02, "other": 0 }
},
"users": 5061,
"size": 9873490748,
"checksum": "5862d3e55aaa507b62c6e81343bb17ae784a32c1a36e9facf4b2713ecb5da4ce",
"avgDurationSecs": 4.603,
"validDurationSecs": 232154.864,
"totalHrs": 448.54,
"validHrs": 64.48
},
"id": {
"duration": 200152524,
"buckets": {
"dev": 3219,
"invalidated": 2460,
"other": 23682,
"reported": 268,
"test": 3621,
"train": 5046,
"validated": 23219
},
"reportedSentences": 269,
"clips": 49361,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.26,
"twenties": 0.39,
"thirties": 0.07,
"teens": 0.26,
"fifties": 0,
"fourties": 0.02
},
"gender": { "": 0.26, "male": 0.41, "female": 0.29, "other": 0.04 }
},
"users": 438,
"size": 1273536247,
"checksum": "33d49848b4e5341d166642ce175dc7c6128114307c4f3cab881be4e34b0703f7",
"avgDurationSecs": 4.055,
"validDurationSecs": 94150.067,
"totalHrs": 55.59,
"validHrs": 26.15
},
"ia": {
"duration": 60377448,
"buckets": {
"dev": 1789,
"invalidated": 328,
"other": 2744,
"reported": 264,
"test": 1731,
"train": 5049,
"validated": 11361
},
"reportedSentences": 260,
"clips": 14433,
"splits": {
"accent": { "": 1 },
"age": {
"seventies": 0.22,
"fourties": 0.3,
"": 0.39,
"twenties": 0.05,
"thirties": 0.02,
"teens": 0,
"fifties": 0.03,
"sixties": 0
},
"gender": { "male": 0.61, "": 0.39, "female": 0.01 }
},
"users": 60,
"size": 409324563,
"checksum": "581ed384b3498194710ae266a83bfc220df575893f57db49907db29cd8fcfbdf",
"avgDurationSecs": 4.183,
"validDurationSecs": 47526.376,
"totalHrs": 16.77,
"validHrs": 13.2
},
"lv": {
"duration": 30849209,
"buckets": {
"dev": 1857,
"invalidated": 167,
"other": 1267,
"reported": 27,
"test": 2161,
"train": 3103,
"validated": 7608
},
"reportedSentences": 28,
"clips": 9042,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.48,
"fourties": 0.03,
"": 0.18,
"twenties": 0.28,
"teens": 0.03,
"fifties": 0
},
"gender": { "male": 0.7, "female": 0.13, "": 0.17 }
},
"users": 117,
"size": 226830766,
"checksum": "6b245bc30f4a09415c0234545068084692a919b657df7281b414931e331fdf2f",
"avgDurationSecs": 3.412,
"validDurationSecs": 25956.733,
"totalHrs": 8.56,
"validHrs": 7.21
},
"ja": {
"duration": 185458929,
"buckets": {
"dev": 4312,
"invalidated": 2262,
"other": 370,
"reported": 153,
"test": 4489,
"train": 6352,
"validated": 36021
},
"reportedSentences": 153,
"clips": 38653,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.32,
"": 0.23,
"teens": 0.04,
"fifties": 0.01,
"thirties": 0.1,
"fourties": 0.29,
"sixties": 0,
"seventies": 0
},
"gender": { "male": 0.53, "": 0.21, "female": 0.25, "other": 0 }
},
"users": 652,
"size": 1125802593,
"checksum": "7196d23c02058a545c921539aa553bc6655c692274bca88cf0941f7e30018826",
"avgDurationSecs": 4.798,
"validDurationSecs": 172830.468,
"totalHrs": 51.51,
"validHrs": 48
},
"rw": {
"duration": 8580574229,
"buckets": {
"dev": 15987,
"invalidated": 227746,
"other": 47302,
"reported": 629,
"test": 16213,
"train": 1003021,
"validated": 1438408
},
"reportedSentences": 630,
"clips": 1713456,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.05,
"twenties": 0.61,
"thirties": 0.12,
"teens": 0.2,
"fourties": 0.02,
"fifties": 0
},
"gender": { "": 0.1, "male": 0.57, "female": 0.33, "other": 0 }
},
"users": 1076,
"size": 60998084828,
"checksum": "824d4a62cc4ce8a5e3fe0b4c24bd5a191a286dff50d4f4ccc4c724b342413a4b",
"avgDurationSecs": 5.008,
"validDurationSecs": 7203200.208,
"totalHrs": 2383.49,
"validHrs": 2000.88
},
"sv-SE": {
"duration": 180457431,
"buckets": {
"dev": 5055,
"invalidated": 1338,
"other": 5701,
"reported": 575,
"test": 5057,
"train": 7275,
"validated": 38581
},
"reportedSentences": 576,
"clips": 45620,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.25,
"": 0.18,
"teens": 0.03,
"fifties": 0.03,
"twenties": 0.12,
"fourties": 0.38,
"sixties": 0,
"seventies": 0
},
"gender": { "male": 0.48, "": 0.18, "female": 0.33, "other": 0 }
},
"users": 752,
"size": 1145881820,
"checksum": "ee19ce93f376d4980e4afb32f7b7ac04b74fd257547718a43110d530789a1e95",
"avgDurationSecs": 3.956,
"validDurationSecs": 152613.506,
"totalHrs": 50.12,
"validHrs": 42.39
},
"cnh": {
"duration": 20675832,
"buckets": {
"dev": 761,
"invalidated": 436,
"other": 2908,
"reported": 8,
"test": 763,
"train": 817,
"validated": 2458
},
"reportedSentences": 9,
"clips": 5802,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.51,
"twenties": 0.36,
"fourties": 0.01,
"teens": 0.02,
"thirties": 0.08,
"fifties": 0.02
},
"gender": { "": 0.51, "male": 0.33, "female": 0.16 }
},
"users": 299,
"size": 161394167,
"checksum": "76a3e555e9503e94799077f48b1ef84acfef3f1f19fd9cf1f6a30dc7c10b48fa",
"avgDurationSecs": 3.564,
"validDurationSecs": 8759.255,
"totalHrs": 5.74,
"validHrs": 2.43
},
"et": {
"duration": 195218498,
"buckets": {
"dev": 2637,
"invalidated": 6610,
"other": 651,
"reported": 476,
"test": 2637,
"train": 3136,
"validated": 21633
},
"reportedSentences": 473,
"clips": 28894,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.2,
"thirties": 0.08,
"twenties": 0.68,
"fourties": 0.04,
"fifties": 0,
"seventies": 0,
"teens": 0
},
"gender": { "": 0.2, "male": 0.54, "female": 0.26, "other": 0 }
},
"users": 808,
"size": 1332576701,
"checksum": "895f05b75825d23da787a223cd1c84f44b07bba18bec1e34a8a46b7afc642e56",
"avgDurationSecs": 6.756,
"validDurationSecs": 146160.51,
"totalHrs": 54.22,
"validHrs": 40.6
},
"ky": {
"duration": 161722308,
"buckets": {
"dev": 1613,
"invalidated": 5588,
"other": 309,
"reported": 36,
"test": 1613,
"train": 1787,
"validated": 29711
},
"reportedSentences": 37,
"clips": 35608,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.08,
"": 0.07,
"fourties": 0.01,
"twenties": 0.66,
"teens": 0.18
},
"gender": { "male": 0.54, "": 0.11, "female": 0.35, "other": 0 }
},
"users": 247,
"size": 1045163001,
"checksum": "b23c82a73e969b2bb26b7ffe4e4dc327b4bea59514fcc054be492192bc8ea493",
"avgDurationSecs": 4.542,
"validDurationSecs": 134939.662,
"totalHrs": 44.92,
"validHrs": 37.48
},
"ro": {
"duration": 138589267,
"buckets": {
"dev": 3792,
"invalidated": 860,
"other": 19079,
"reported": 304,
"test": 3841,
"train": 5080,
"validated": 14777
},
"reportedSentences": 305,
"clips": 34716,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.14,
"teens": 0.02,
"": 0.11,
"fourties": 0.06,
"sixties": 0,
"twenties": 0.66,
"fifties": 0.01,
"eighties": 0
},
"gender": { "male": 0.73, "": 0.11, "female": 0.15, "other": 0.01 }
},
"users": 361,
"size": 870416684,
"checksum": "c9da340807a83058beea735c8e3290a327886933ac6b158744de9ecf6d44c87f",
"avgDurationSecs": 3.992,
"validDurationSecs": 58991.059,
"totalHrs": 38.49,
"validHrs": 16.38
},
"hsb": {
"duration": 10207332,
"buckets": {
"dev": 172,
"invalidated": 243,
"other": 8,
"reported": 71,
"test": 440,
"train": 808,
"validated": 1420
},
"reportedSentences": 72,
"clips": 1671,
"splits": {
"accent": { "": 1 },
"age": {
"fourties": 0.55,
"": 0.18,
"thirties": 0.1,
"sixties": 0,
"seventies": 0.03,
"twenties": 0.11,
"fifties": 0.03
},
"gender": { "male": 0.82, "": 0.18, "other": 0 }
},
"users": 20,
"size": 79769846,
"checksum": "f3cb738b99ef8700809e4787c7877dbd90942b6b197375d212c3c5951ad0b32b",
"avgDurationSecs": 6.109,
"validDurationSecs": 8674.094,
"totalHrs": 2.83,
"validHrs": 2.4
},
"el": {
"duration": 103011982,
"buckets": {
"dev": 1704,
"invalidated": 792,
"other": 9222,
"reported": 64,
"test": 1695,
"train": 1910,
"validated": 14919
},
"reportedSentences": 65,
"clips": 24933,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.38,
"fourties": 0.13,
"": 0.33,
"twenties": 0.13,
"fifties": 0.03,
"teens": 0.01,
"sixties": 0
},
"gender": { "male": 0.63, "": 0.32, "other": 0.02, "female": 0.03 }
},
"users": 341,
"size": 694147628,
"checksum": "19d90367ab2be112a8400b5c39bdc718d45c07c7fc311ca29966145bb75931bd",
"avgDurationSecs": 4.132,
"validDurationSecs": 61638.622,
"totalHrs": 28.61,
"validHrs": 17.12
},
"cs": {
"duration": 253963263,
"buckets": {
"dev": 7257,
"invalidated": 1275,
"other": 9169,
"reported": 700,
"test": 7585,
"train": 14413,
"validated": 48252
},
"reportedSentences": 697,
"clips": 58696,
"splits": {
"accent": { "": 1 },
"age": {
"fourties": 0.19,
"": 0.36,
"thirties": 0.14,
"teens": 0.01,
"twenties": 0.27,
"fifties": 0.02,
"sixties": 0,
"seventies": 0
},
"gender": { "male": 0.62, "": 0.35, "female": 0.02 }
},
"users": 567,
"size": 1788385110,
"checksum": "f3e5120b45c0c1a469cfea6008fcee7b102c245b3e772128e2ae189a2799feca",
"avgDurationSecs": 4.327,
"validDurationSecs": 208774.625,
"totalHrs": 70.54,
"validHrs": 57.99
},
"pl": {
"duration": 599130506,
"buckets": {
"dev": 8223,
"invalidated": 6061,
"other": 5576,
"reported": 535,
"test": 8223,
"train": 16261,
"validated": 122188
},
"reportedSentences": 535,
"clips": 133825,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.28,
"": 0.24,
"teens": 0.02,
"thirties": 0.32,
"fourties": 0.12,
"fifties": 0.01,
"nineties": 0.01,
"sixties": 0
},
"gender": { "male": 0.6, "": 0.25, "female": 0.14, "other": 0.01 }
},
"users": 3100,
"size": 4305564674,
"checksum": "f9d491272e90ef9a10451779b9a4ceed52be45e9ceee80e6a3fc2b9689dc348c",
"avgDurationSecs": 4.477,
"validDurationSecs": 547032.006,
"totalHrs": 166.42,
"validHrs": 151.95
},
"rm-sursilv": {
"duration": 38747237,
"buckets": {
"dev": 1344,
"invalidated": 674,
"other": 2177,
"reported": 11,
"test": 1329,
"train": 1544,
"validated": 4220
},
"reportedSentences": 12,
"clips": 7071,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.03,
"twenties": 0.1,
"": 0.64,
"teens": 0.06,
"fourties": 0.17
},
"gender": { "male": 0.17, "female": 0.19, "": 0.64, "other": 0 }
},
"users": 85,
"size": 292787901,
"checksum": "e3bfa984c4cd61b2b9cf3dca1051ae64e49815b5e5786acfbf9a56242bfc9f1a",
"avgDurationSecs": 5.48,
"validDurationSecs": 23124.5,
"totalHrs": 10.76,
"validHrs": 6.42
},
"rm-vallader": {
"duration": 15074402,
"buckets": {
"dev": 376,
"invalidated": 392,
"other": 720,
"reported": 31,
"test": 437,
"train": 664,
"validated": 1484
},
"reportedSentences": 30,
"clips": 2596,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.36,
"fourties": 0.41,
"twenties": 0.14,
"thirties": 0.06,
"fifties": 0,
"sixties": 0.03
},
"gender": { "": 0.36, "male": 0.44, "female": 0.19, "other": 0.01 }
},
"users": 51,
"size": 115234506,
"checksum": "1e12ab4b075f336e5937333ef9b48bd61b0bbdb81b19ad1c43f94b73aee1c693",
"avgDurationSecs": 5.807,
"validDurationSecs": 8617.262,
"totalHrs": 4.18,
"validHrs": 2.39
},
"mn": {
"duration": 68212232,
"buckets": {
"dev": 1853,
"invalidated": 754,
"other": 3449,
"reported": 18,
"test": 1881,
"train": 2170,
"validated": 8258
},
"reportedSentences": 19,
"clips": 12461,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.23,
"": 0.27,
"twenties": 0.41,
"fourties": 0.01,
"teens": 0.02,
"nineties": 0.06,
"fifties": 0
},
"gender": { "male": 0.36, "": 0.27, "female": 0.31, "other": 0.06 }
},
"users": 472,
"size": 519169015,
"checksum": "7b8861f86e4414d444b8c1e2a64db45da8b82eeeac034ec426615ad4cdbbd822",
"avgDurationSecs": 5.474,
"validDurationSecs": 45204.768,
"totalHrs": 18.94,
"validHrs": 12.55
},
"zh-HK": {
"duration": 464720136,
"buckets": {
"dev": 5587,
"invalidated": 4166,
"other": 17036,
"reported": 638,
"test": 5587,
"train": 8414,
"validated": 89103
},
"reportedSentences": 627,
"clips": 110305,
"splits": {
"accent": { "": 1 },
"age": {
"fourties": 0.12,
"thirties": 0.11,
"": 0.4,
"teens": 0.02,
"fifties": 0.02,
"seventies": 0,
"sixties": 0,
"twenties": 0.31
},
"gender": { "male": 0.42, "": 0.36, "female": 0.21, "other": 0.01 }
},
"users": 2907,
"size": 3364974920,
"checksum": "dbd0a4254447b5de319be1be5d4262ec9bfe68da0d7235430469c33d9298c985",
"avgDurationSecs": 4.213,
"validDurationSecs": 375395.116,
"totalHrs": 129.08,
"validHrs": 104.27
},
"ab": {
"duration": 301806756,
"buckets": {
"dev": 9152,
"invalidated": 5271,
"other": 11662,
"reported": 220,
"test": 9122,
"train": 21027,
"validated": 41930
},
"reportedSentences": 219,
"clips": 58863,
"splits": {
"accent": { "": 1 },
"age": {
"seventies": 0.01,
"thirties": 0.13,
"": 0.19,
"teens": 0.28,
"twenties": 0.18,
"fifties": 0.06,
"sixties": 0.05,
"fourties": 0.09,
"eighties": 0.01
},
"gender": { "male": 0.18, "female": 0.64, "": 0.18 }
},
"users": 397,
"size": 1723981967,
"checksum": "404ea029bc6cfca120fe9c1b181cee4ad23957621ab18c0401a8dc732877b053",
"avgDurationSecs": 5.127,
"validDurationSecs": 214986.618,
"totalHrs": 83.83,
"validHrs": 59.71
},
"cv": {
"duration": 88743504,
"buckets": {
"dev": 1140,
"invalidated": 1960,
"other": 990,
"reported": 143,
"test": 1267,
"train": 1566,
"validated": 14664
},
"reportedSentences": 139,
"clips": 17614,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.49,
"": 0.21,
"fourties": 0.06,
"thirties": 0.01,
"teens": 0.21,
"fifties": 0.01
},
"gender": { "male": 0.52, "": 0.19, "female": 0.29 }
},
"users": 104,
"size": 613669627,
"checksum": "689f89b2db0b97c1fbf5cae853b438d22d151f66fe2b1c9b88dd187462f5df28",
"avgDurationSecs": 5.038,
"validDurationSecs": 73880.705,
"totalHrs": 24.65,
"validHrs": 20.52
},
"uk": {
"duration": 302901648,
"buckets": {
"dev": 6786,
"invalidated": 2410,
"other": 8616,
"reported": 587,
"test": 6785,
"train": 11463,
"validated": 52269
},
"reportedSentences": 588,
"clips": 63295,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.25,
"teens": 0.1,
"": 0.26,
"fourties": 0.13,
"thirties": 0.26,
"fifties": 0,
"sixties": 0
},
"gender": { "male": 0.58, "female": 0.16, "": 0.26 }
},
"users": 734,
"size": 2061567072,
"checksum": "b5bd9d8d49d96e5ef865a59e5ea00e16473be254cbbcc3874e8b58d7162600cb",
"avgDurationSecs": 4.786,
"validDurationSecs": 250136.128,
"totalHrs": 84.13,
"validHrs": 69.48
},
"mt": {
"duration": 61216920,
"buckets": {
"dev": 1594,
"invalidated": 320,
"other": 6252,
"reported": 9,
"test": 1636,
"train": 1948,
"validated": 6350
},
"reportedSentences": 10,
"clips": 12922,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.19,
"": 0.26,
"fourties": 0.17,
"thirties": 0.09,
"teens": 0.03,
"fifties": 0.26,
"sixties": 0.01
},
"gender": { "male": 0.25, "": 0.26, "female": 0.48, "other": 0.01 }
},
"users": 205,
"size": 455461817,
"checksum": "b5ef5b1715cc505e3202161a72ed16371856e133d8d61ee17d996bb3a29ff3f6",
"avgDurationSecs": 4.737,
"validDurationSecs": 30082.607,
"totalHrs": 17,
"validHrs": 8.35
},
"as": {
"duration": 11681021,
"buckets": {
"dev": 448,
"invalidated": 92,
"other": 605,
"reported": 9,
"test": 307,
"train": 604,
"validated": 1359
},
"reportedSentences": 10,
"clips": 2056,
"splits": {
"accent": { "": 1 },
"age": { "twenties": 0.37, "": 0.59, "thirties": 0.04, "teens": 0 },
"gender": { "male": 0.41, "": 0.59, "female": 0 }
},
"users": 42,
"size": 73049488,
"checksum": "4c6eca577436845cf0fe990dfc5a396c5ccfb9df89c0dd3a17825604d6ee320c",
"avgDurationSecs": 5.681,
"validDurationSecs": 7721.064,
"totalHrs": 3.24,
"validHrs": 2.14
},
"ka": {
"duration": 29847816,
"buckets": {
"dev": 1353,
"invalidated": 367,
"other": 5,
"reported": 40,
"test": 1365,
"train": 1686,
"validated": 5232
},
"reportedSentences": 41,
"clips": 5604,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.38,
"thirties": 0.24,
"": 0.34,
"fourties": 0.02,
"fifties": 0,
"teens": 0.01
},
"gender": { "male": 0.46, "female": 0.2, "": 0.35 }
},
"users": 136,
"size": 196989034,
"checksum": "042994d0aad43cd28261476019f9c00aa704c9d6738102beeda3345e9741b04f",
"avgDurationSecs": 5.326,
"validDurationSecs": 27866.483,
"totalHrs": 8.29,
"validHrs": 7.74
},
"fy-NL": {
"duration": 459512019,
"buckets": {
"dev": 3025,
"invalidated": 2913,
"other": 53343,
"reported": 425,
"test": 3025,
"train": 3700,
"validated": 36057
},
"reportedSentences": 423,
"clips": 92313,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.59,
"fifties": 0.12,
"thirties": 0.04,
"twenties": 0.02,
"fourties": 0.07,
"sixties": 0.15,
"seventies": 0.01,
"teens": 0,
"eighties": 0
},
"gender": { "": 0.6, "male": 0.1, "female": 0.3 }
},
"users": 1143,
"size": 2876553843,
"checksum": "93281e617fbfe22a4f677bc2039ac91b343b43fc4fb952fb0e6d8a477878820d",
"avgDurationSecs": 4.978,
"validDurationSecs": 179483.116,
"totalHrs": 127.64,
"validHrs": 49.85
},
"dv": {
"duration": 212224635,
"buckets": {
"dev": 2253,
"invalidated": 1545,
"other": 14616,
"reported": 49,
"test": 2249,
"train": 2611,
"validated": 25883
},
"reportedSentences": 50,
"clips": 42044,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.22,
"twenties": 0.18,
"thirties": 0.36,
"fourties": 0.23,
"teens": 0.01,
"nineties": 0
},
"gender": { "": 0.22, "male": 0.28, "female": 0.5 }
},
"users": 315,
"size": 1362758958,
"checksum": "467a1a1cf204e4b8a2f713e694f83db9705c35131f1ef66f0a90f597858be0c7",
"avgDurationSecs": 5.048,
"validDurationSecs": 130649.087,
"totalHrs": 58.95,
"validHrs": 36.29
},
"pa-IN": {
"duration": 13144634,
"buckets": {
"dev": 280,
"invalidated": 75,
"other": 1285,
"reported": 249,
"test": 399,
"train": 685,
"validated": 1364
},
"reportedSentences": 244,
"clips": 2724,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.25,
"fourties": 0.04,
"fifties": 0.05,
"thirties": 0.4,
"twenties": 0.25,
"sixties": 0,
"teens": 0
},
"gender": { "": 0.25, "male": 0.75, "female": 0 }
},
"users": 58,
"size": 94525450,
"checksum": "f397560fdf6a0d61d756dc94c2ebe03f4b9f02336561e5e80f3117f1a8a6c8a3",
"avgDurationSecs": 4.825,
"validDurationSecs": 6581.968,
"totalHrs": 3.65,
"validHrs": 1.82
},
"vi": {
"duration": 63903104,
"buckets": {
"dev": 227,
"invalidated": 334,
"other": 11313,
"reported": 179,
"test": 1195,
"train": 2559,
"validated": 4460
},
"reportedSentences": 178,
"clips": 16107,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.02,
"twenties": 0.18,
"": 0.25,
"teens": 0.21,
"seventies": 0,
"fourties": 0.02,
"sixties": 0.31
},
"gender": { "male": 0.52, "": 0.25, "female": 0.21, "other": 0.02 }
},
"users": 229,
"size": 371346215,
"checksum": "a4a0f2d2dfc35ef4317c334713762aaa51b53f7b26e1c9f0e860b5ed0cc1f31a",
"avgDurationSecs": 3.967,
"validDurationSecs": 17694.657,
"totalHrs": 17.75,
"validHrs": 4.91
},
"or": {
"duration": 35312604,
"buckets": {
"dev": 309,
"invalidated": 163,
"other": 5689,
"reported": 9,
"test": 218,
"train": 477,
"validated": 1143
},
"reportedSentences": 10,
"clips": 6995,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.16,
"": 0.08,
"thirties": 0.75,
"fourties": 0,
"teens": 0
},
"gender": { "male": 0.92, "": 0.08, "female": 0 }
},
"users": 86,
"size": 255371109,
"checksum": "00672631bee5854227d87b0f84ba5f19b8263477d34028dc086d9eeb2514e6a8",
"avgDurationSecs": 5.048,
"validDurationSecs": 5770.165,
"totalHrs": 9.8,
"validHrs": 1.6
},
"ga-IE": {
"duration": 33018939,
"buckets": {
"dev": 511,
"invalidated": 815,
"other": 3899,
"reported": 14,
"test": 512,
"train": 535,
"validated": 4608
},
"reportedSentences": 15,
"clips": 9322,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.25,
"": 0.37,
"thirties": 0.26,
"fourties": 0.04,
"sixties": 0.01,
"teens": 0.02,
"fifties": 0.05
},
"gender": { "male": 0.49, "": 0.37, "female": 0.13, "other": 0 }
},
"users": 162,
"size": 229204889,
"checksum": "e25bcd05f22041bbd2e8ebb4136f1c7a38ba3a55d59d6f9698ec2f97e7b7fc18",
"avgDurationSecs": 3.542,
"validDurationSecs": 16321.741,
"totalHrs": 9.17,
"validHrs": 4.53
},
"fi": {
"duration": 60106935,
"buckets": {
"dev": 1584,
"invalidated": 194,
"other": 5679,
"reported": 45,
"test": 1726,
"train": 2205,
"validated": 7232
},
"reportedSentences": 46,
"clips": 13105,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.15,
"": 0.37,
"twenties": 0.1,
"fourties": 0.34,
"teens": 0.01,
"fifties": 0.04,
"seventies": 0
},
"gender": { "male": 0.26, "": 0.37, "female": 0.37, "other": 0 }
},
"users": 192,
"size": 358878983,
"checksum": "9ca4fb4ca2bfb9eb0d10dd46469dd3a5ce0cc1a3595e61a82dfd44be67c0e971",
"avgDurationSecs": 4.587,
"validDurationSecs": 33170.038,
"totalHrs": 16.69,
"validHrs": 9.21
},
"hu": {
"duration": 93883429,
"buckets": {
"dev": 4634,
"invalidated": 831,
"other": 2080,
"reported": 95,
"test": 4627,
"train": 6870,
"validated": 16172
},
"reportedSentences": 96,
"clips": 19083,
"splits": {
"accent": { "": 1 },
"age": {
"teens": 0.08,
"": 0.29,
"thirties": 0.15,
"twenties": 0.39,
"fifties": 0.06,
"fourties": 0.02,
"sixties": 0.01
},
"gender": { "male": 0.6, "": 0.29, "female": 0.11 }
},
"users": 223,
"size": 606662040,
"checksum": "5ea1a62667d68a8e33a6758083a3fdb722fb22781996d5a7826f23a4d69c89e4",
"avgDurationSecs": 4.92,
"validDurationSecs": 79562.061,
"totalHrs": 26.07,
"validHrs": 22.1
},
"th": {
"duration": 1394333760,
"buckets": {
"dev": 10868,
"invalidated": 8420,
"other": 196481,
"reported": 3943,
"test": 10868,
"train": 31331,
"validated": 129415
},
"reportedSentences": 3943,
"clips": 334316,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.21,
"": 0.43,
"thirties": 0.07,
"fourties": 0.04,
"teens": 0.05,
"fifties": 0.21,
"eighties": 0,
"sixties": 0
},
"gender": { "male": 0.38, "": 0.43, "female": 0.18, "other": 0.01 }
},
"users": 7616,
"size": 8111064630,
"checksum": "b56ce794693feb2a79ef294a25f9def1bda407a06e0fe209d46a58e13621212d",
"avgDurationSecs": 4.171,
"validDurationSecs": 539751.922,
"totalHrs": 387.31,
"validHrs": 149.93
},
"lt": {
"duration": 73757780,
"buckets": {
"dev": 3505,
"invalidated": 554,
"other": 1423,
"reported": 128,
"test": 3667,
"train": 5157,
"validated": 12332
},
"reportedSentences": 128,
"clips": 14309,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.33,
"": 0.24,
"thirties": 0.28,
"fifties": 0.05,
"sixties": 0.01,
"teens": 0.03,
"fourties": 0.05
},
"gender": { "male": 0.62, "": 0.24, "female": 0.14 }
},
"users": 260,
"size": 453243619,
"checksum": "10a1fe7a63972122c308ffb4cb37e6d1fc7a888ffb667aad9fa9e9993fe79bcc",
"avgDurationSecs": 5.155,
"validDurationSecs": 63567.052,
"totalHrs": 20.48,
"validHrs": 17.65
},
"lg": {
"duration": 1720048347,
"buckets": {
"dev": 12660,
"invalidated": 38089,
"other": 5922,
"reported": 6039,
"test": 12717,
"train": 55020,
"validated": 252222
},
"reportedSentences": 6034,
"clips": 296233,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.27,
"thirties": 0.22,
"twenties": 0.41,
"fourties": 0.05,
"fifties": 0.03,
"teens": 0.01,
"nineties": 0,
"sixties": 0.01
},
"gender": { "": 0.27, "female": 0.4, "male": 0.34 }
},
"users": 487,
"size": 10083327282,
"checksum": "48451a611b86562bda710fd350b5fcf767921ea9bb1e3d6e78e51caf42a6efa5",
"avgDurationSecs": 5.806,
"validDurationSecs": 1464502.72,
"totalHrs": 477.79,
"validHrs": 406.8
},
"hi": {
"duration": 63918132,
"buckets": {
"dev": 2178,
"invalidated": 670,
"other": 3280,
"reported": 110,
"test": 2839,
"train": 4321,
"validated": 9367
},
"reportedSentences": 111,
"clips": 13317,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.34,
"fourties": 0.03,
"": 0.32,
"thirties": 0.27,
"teens": 0.01,
"fifties": 0.01,
"sixties": 0
},
"gender": { "male": 0.64, "female": 0.04, "": 0.33 }
},
"users": 318,
"size": 377781828,
"checksum": "21364b7526eb32b58502f3ba897009127b5375beda9a1b8e4558a9aa0bbdb08e",
"avgDurationSecs": 4.8,
"validDurationSecs": 44959.161,
"totalHrs": 17.75,
"validHrs": 12.48
},
"bas": {
"duration": 9991980,
"buckets": {
"dev": 457,
"invalidated": 483,
"other": 109,
"reported": 7,
"test": 444,
"train": 763,
"validated": 1664
},
"reportedSentences": 8,
"clips": 2256,
"splits": {
"accent": { "": 1 },
"age": { "": 0.98, "fourties": 0.01, "teens": 0.01 },
"gender": { "": 0.98, "female": 0.02 }
},
"users": 32,
"size": 55578662,
"checksum": "b7c92d2be66bbe18fc49a4e64e22c643a62483d25644208c9b6d45c2216e076d",
"avgDurationSecs": 4.429,
"validDurationSecs": 7369.971,
"totalHrs": 2.77,
"validHrs": 2.04
},
"sk": {
"duration": 69713676,
"buckets": {
"dev": 2240,
"invalidated": 713,
"other": 183,
"reported": 30,
"test": 2241,
"train": 3029,
"validated": 16544
},
"reportedSentences": 31,
"clips": 17440,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.53,
"thirties": 0.22,
"twenties": 0.04,
"fourties": 0.1,
"teens": 0.11
},
"gender": { "": 0.52, "male": 0.37, "female": 0.09, "other": 0.01 }
},
"users": 143,
"size": 390501164,
"checksum": "119e70bd5b08d7ead030f9a50cdf70122df080975d53bf1794137b030328a6ff",
"avgDurationSecs": 3.997,
"validDurationSecs": 66132.056,
"totalHrs": 19.36,
"validHrs": 18.37
},
"kmr": {
"duration": 194365584,
"buckets": {
"dev": 2375,
"invalidated": 1570,
"other": 2728,
"reported": 637,
"test": 2398,
"train": 2838,
"validated": 39641
},
"reportedSentences": 638,
"clips": 43939,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.5,
"twenties": 0.3,
"thirties": 0.06,
"fourties": 0.04,
"fifties": 0.09,
"teens": 0.02,
"sixties": 0
},
"gender": { "": 0.5, "male": 0.34, "female": 0.17 }
},
"users": 309,
"size": 1036731522,
"checksum": "134ca34be64ae928ac451ca33d09342227b0ba6ff8b8e5499f51c5f31b41ebc3",
"avgDurationSecs": 4.424,
"validDurationSecs": 175353.242,
"totalHrs": 53.99,
"validHrs": 48.7
},
"bg": {
"duration": 46476000,
"buckets": {
"dev": 915,
"invalidated": 390,
"other": 2062,
"reported": 145,
"test": 1900,
"train": 3161,
"validated": 5987
},
"reportedSentences": 146,
"clips": 8439,
"splits": {
"accent": { "": 1 },
"age": {
"fourties": 0.33,
"thirties": 0.07,
"": 0.42,
"twenties": 0.17,
"teens": 0.01,
"sixties": 0
},
"gender": { "male": 0.52, "female": 0.06, "": 0.42 }
},
"users": 69,
"size": 271306749,
"checksum": "f532e435bd3df3422db84b13311068a819dc4685e4973ea847c2c8b41fa0d968",
"avgDurationSecs": 5.507,
"validDurationSecs": 32972.131,
"totalHrs": 12.91,
"validHrs": 9.15
},
"kk": {
"duration": 6733260,
"buckets": {
"dev": 379,
"invalidated": 195,
"other": 0,
"reported": 22,
"test": 384,
"train": 401,
"validated": 1169
},
"reportedSentences": 23,
"clips": 1364,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.51,
"thirties": 0.03,
"twenties": 0.31,
"teens": 0.06,
"fifties": 0.1
},
"gender": { "": 0.52, "male": 0.46, "female": 0.03 }
},
"users": 80,
"size": 38650960,
"checksum": "6cc9593b426d0ed39a7e8e10e576031ee3436b6f00bd8327a57da2653fa32fdc",
"avgDurationSecs": 4.936,
"validDurationSecs": 5770.661,
"totalHrs": 1.87,
"validHrs": 1.6
},
"ba": {
"duration": 958413996,
"buckets": {
"dev": 14559,
"invalidated": 7892,
"other": 45,
"reported": 866,
"test": 14526,
"train": 118983,
"validated": 208602
},
"reportedSentences": 863,
"clips": 216539,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.17,
"": 0.3,
"fourties": 0.06,
"fifties": 0.05,
"twenties": 0.17,
"sixties": 0.2,
"seventies": 0,
"teens": 0.04
},
"gender": { "male": 0.3, "": 0.3, "female": 0.4 }
},
"users": 888,
"size": 5376316576,
"checksum": "834d2433e1fb541a7b3310f5f1cde1161bd7653e5f7d88338fbc4757b06f2818",
"avgDurationSecs": 4.426,
"validDurationSecs": 923284.38,
"totalHrs": 266.22,
"validHrs": 256.46
},
"gl": {
"duration": 60356232,
"buckets": {
"dev": 2396,
"invalidated": 317,
"other": 3748,
"reported": 190,
"test": 2556,
"train": 3402,
"validated": 8413
},
"reportedSentences": 191,
"clips": 12478,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.38,
"thirties": 0.39,
"fifties": 0.09,
"twenties": 0.07,
"fourties": 0.06,
"teens": 0,
"sixties": 0.01
},
"gender": { "": 0.39, "male": 0.4, "female": 0.21, "other": 0.01 }
},
"users": 162,
"size": 347443062,
"checksum": "05d320fa03a7003f26952392d9a17da6ea3252679d0c08e4110c1268157d0f98",
"avgDurationSecs": 4.837,
"validDurationSecs": 40693.779,
"totalHrs": 16.76,
"validHrs": 11.3
},
"ug": {
"duration": 261275040,
"buckets": {
"dev": 2748,
"invalidated": 1969,
"other": 2907,
"reported": 184,
"test": 2747,
"train": 3293,
"validated": 38878
},
"reportedSentences": 185,
"clips": 43754,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.58,
"fifties": 0.02,
"twenties": 0.13,
"thirties": 0.11,
"fourties": 0.15,
"teens": 0,
"eighties": 0.01
},
"gender": { "": 0.58, "male": 0.33, "female": 0.08, "other": 0 }
},
"users": 396,
"size": 1521167832,
"checksum": "de56839d05aab1d73099bbebb8ec54c11e88bb54191d97397f7618bf70276f1a",
"avgDurationSecs": 5.971,
"validDurationSecs": 232158.226,
"totalHrs": 72.57,
"validHrs": 64.48
},
"hy-AM": {
"duration": 16193592,
"buckets": {
"dev": 352,
"invalidated": 91,
"other": 1224,
"reported": 27,
"test": 382,
"train": 599,
"validated": 1334
},
"reportedSentences": 28,
"clips": 2649,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.37,
"thirties": 0.14,
"twenties": 0.36,
"fifties": 0.04,
"teens": 0.08
},
"gender": { "": 0.37, "male": 0.23, "female": 0.4 }
},
"users": 60,
"size": 95058888,
"checksum": "37aa8f887538d7bfe54f1ba6554b27c7f1996d4d96e8e74c891aedf1c8598b4a",
"avgDurationSecs": 6.113,
"validDurationSecs": 8154.87,
"totalHrs": 4.49,
"validHrs": 2.26
},
"be": {
"duration": 4045811760,
"buckets": {
"dev": 15875,
"invalidated": 26057,
"other": 27,
"reported": 3124,
"test": 15879,
"train": 347012,
"validated": 824875
},
"reportedSentences": 3123,
"clips": 850959,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.8,
"fourties": 0.06,
"thirties": 0.07,
"twenties": 0.05,
"teens": 0.01,
"fifties": 0,
"sixties": 0,
"seventies": 0
},
"gender": { "": 0.79, "male": 0.09, "female": 0.12, "other": 0 }
},
"users": 6408,
"size": 22942027342,
"checksum": "96f071638a5360172908b6573907284bceaa25b3bde66cb950e2e63a5a61488a",
"avgDurationSecs": 4.754,
"validDurationSecs": 3921797.614,
"totalHrs": 1123.83,
"validHrs": 1089.38
},
"ur": {
"duration": 500892012,
"buckets": {
"dev": 3303,
"invalidated": 3185,
"other": 84521,
"reported": 48,
"test": 3298,
"train": 4128,
"validated": 41591
},
"reportedSentences": 48,
"clips": 129297,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.89,
"": 0.1,
"fourties": 0.01,
"thirties": 0,
"teens": 0,
"fifties": 0
},
"gender": { "male": 0.69, "": 0.1, "female": 0.2 }
},
"users": 183,
"size": 2912285969,
"checksum": "72c662684967be9aafb3c1f42231fcdad4926b23bee7898d2ca8e5a5afcfb316",
"avgDurationSecs": 3.874,
"validDurationSecs": 161122.065,
"totalHrs": 139.13,
"validHrs": 44.75
},
"gn": {
"duration": 11959812,
"buckets": {
"dev": 201,
"invalidated": 82,
"other": 1815,
"reported": 25,
"test": 267,
"train": 356,
"validated": 824
},
"reportedSentences": 26,
"clips": 2721,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.52,
"twenties": 0.35,
"thirties": 0.12,
"sixties": 0.01
},
"gender": { "": 0.52, "male": 0.35, "female": 0.13 }
},
"users": 69,
"size": 66523649,
"checksum": "752ec145f7ea7a6f7b676c0c0cfd24336fe2628f24b85113419961280ebfecd0",
"avgDurationSecs": 4.395,
"validDurationSecs": 3621.788,
"totalHrs": 3.32,
"validHrs": 1
},
"sr": {
"duration": 6711228,
"buckets": {
"dev": 623,
"invalidated": 40,
"other": 14,
"reported": 18,
"test": 659,
"train": 1037,
"validated": 2321
},
"reportedSentences": 19,
"clips": 2375,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.68,
"": 0.15,
"fifties": 0.01,
"fourties": 0.11,
"thirties": 0.04,
"teens": 0
},
"gender": { "male": 0.39, "": 0.15, "female": 0.46 }
},
"users": 56,
"size": 37169416,
"checksum": "9fc5a4dd4885de3044303e0d356d727831da46cc4c2f770bd98c4bea3dbe4410",
"avgDurationSecs": 2.826,
"validDurationSecs": 6558.636,
"totalHrs": 1.86,
"validHrs": 1.82
},
"uz": {
"duration": 904201128,
"buckets": {
"dev": 11570,
"invalidated": 13134,
"other": 123659,
"reported": 1750,
"test": 12242,
"train": 47082,
"validated": 83316
},
"reportedSentences": 1733,
"clips": 220109,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.39,
"": 0.41,
"thirties": 0.01,
"teens": 0.18,
"fifties": 0,
"fourties": 0.01,
"nineties": 0
},
"gender": { "male": 0.44, "": 0.41, "female": 0.15, "other": 0 }
},
"users": 1932,
"size": 5040011949,
"checksum": "5d7887f2d36f891e02bd70858549752e8556c70b98855f759e94596572321253",
"avgDurationSecs": 4.108,
"validDurationSecs": 342259.613,
"totalHrs": 251.16,
"validHrs": 95.07
},
"mr": {
"duration": 95987052,
"buckets": {
"dev": 1689,
"invalidated": 2204,
"other": 2649,
"reported": 43,
"test": 1761,
"train": 2284,
"validated": 10670
},
"reportedSentences": 44,
"clips": 15523,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.13,
"sixties": 0,
"twenties": 0.28,
"": 0.05,
"teens": 0.54
},
"gender": { "male": 0.19, "female": 0.76, "": 0.05 }
},
"users": 77,
"size": 562491867,
"checksum": "2ceb9169d113c3354a3cc97907b3f36acdbc43c8ebe78c24f77a73afac216787",
"avgDurationSecs": 6.184,
"validDurationSecs": 65978.345,
"totalHrs": 26.66,
"validHrs": 18.32
},
"da": {
"duration": 35201952,
"buckets": {
"dev": 1905,
"invalidated": 310,
"other": 301,
"reported": 208,
"test": 1912,
"train": 2273,
"validated": 7645
},
"reportedSentences": 208,
"clips": 8256,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.34,
"thirties": 0.31,
"twenties": 0.2,
"sixties": 0,
"fourties": 0.11,
"fifties": 0.04,
"teens": 0
},
"gender": { "": 0.34, "female": 0.08, "male": 0.58 }
},
"users": 194,
"size": 201862714,
"checksum": "6a3338dd0bfec1e945c8f0ea5cf92fee517c48004dbf7389076d20a5abb4dcc4",
"avgDurationSecs": 4.264,
"validDurationSecs": 32596.769,
"totalHrs": 9.77,
"validHrs": 9.05
},
"myv": {
"duration": 11097324,
"buckets": {
"dev": 498,
"invalidated": 18,
"other": 236,
"reported": 19,
"test": 488,
"train": 684,
"validated": 1676
},
"reportedSentences": 20,
"clips": 1930,
"splits": {
"accent": { "": 1 },
"age": {
"sixties": 0.27,
"": 0.38,
"thirties": 0.25,
"twenties": 0.09,
"teens": 0.01
},
"gender": { "male": 0.55, "": 0.38, "female": 0.08 }
},
"users": 12,
"size": 64916998,
"checksum": "be220f8186d52f8c866c84bb6fec0c2094333dfb525ad44bd607159b6b702100",
"avgDurationSecs": 5.75,
"validDurationSecs": 9636.847,
"totalHrs": 3.08,
"validHrs": 2.67
},
"nn-NO": {
"duration": 3267000,
"buckets": {
"dev": 193,
"invalidated": 13,
"other": 79,
"reported": 14,
"test": 195,
"train": 240,
"validated": 633
},
"reportedSentences": 15,
"clips": 725,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.4,
"thirties": 0.36,
"twenties": 0.21,
"fourties": 0.04
},
"gender": { "": 0.4, "female": 0.2, "male": 0.37, "other": 0.03 }
},
"users": 25,
"size": 18486398,
"checksum": "d848f0d5bdedb577c8b56aab1f396bdf896bfcc7ce687052aa3fa4ce25163a61",
"avgDurationSecs": 4.506,
"validDurationSecs": 2852.429,
"totalHrs": 0.9,
"validHrs": 0.79
},
"ha": {
"duration": 39436776,
"buckets": {
"dev": 532,
"invalidated": 161,
"other": 5936,
"reported": 17,
"test": 469,
"train": 1945,
"validated": 2973
},
"reportedSentences": 17,
"clips": 9070,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.18,
"thirties": 0.75,
"twenties": 0.04,
"fourties": 0,
"fifties": 0.03
},
"gender": { "": 0.18, "male": 0.55, "female": 0.28 }
},
"users": 30,
"size": 230742490,
"checksum": "484e9ec6bdbe21875f5b91c74bdfabb38ad90d5a4d6453c7579b2a3ed56b2232",
"avgDurationSecs": 4.348,
"validDurationSecs": 12926.74,
"totalHrs": 10.95,
"validHrs": 3.59
},
"ckb": {
"duration": 405826848,
"buckets": {
"dev": 4524,
"invalidated": 4973,
"other": 17429,
"reported": 2261,
"test": 4526,
"train": 6225,
"validated": 81567
},
"reportedSentences": 2261,
"clips": 103969,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.37,
"thirties": 0.13,
"twenties": 0.44,
"fourties": 0.03,
"teens": 0.02,
"fifties": 0.02
},
"gender": { "": 0.35, "male": 0.58, "female": 0.07, "other": 0 }
},
"users": 1151,
"size": 2190261738,
"checksum": "742eace3283d4d50f771862f4366e72df5a626493a7bcb0efc38f87bf5d43af2",
"avgDurationSecs": 3.903,
"validDurationSecs": 318384.119,
"totalHrs": 112.72,
"validHrs": 88.44
},
"ml": {
"duration": 10157040,
"buckets": {
"dev": 0,
"invalidated": 6,
"other": 1964,
"reported": 112,
"test": 80,
"train": 414,
"validated": 494
},
"reportedSentences": 113,
"clips": 2464,
"splits": {
"accent": { "": 1 },
"age": { "": 0.45, "twenties": 0.49, "thirties": 0.05, "fourties": 0 },
"gender": { "": 0.45, "male": 0.55 }
},
"users": 22,
"size": 58792967,
"checksum": "cf6a3595a571d780db0c9e6b986e519ab047b56bd8b097fb9094dd8fd197682a",
"avgDurationSecs": 4.122,
"validDurationSecs": 2036.355,
"totalHrs": 2.82,
"validHrs": 0.56
},
"mdf": {
"duration": 1791720,
"buckets": {
"dev": 48,
"invalidated": 6,
"other": 77,
"reported": 9,
"test": 78,
"train": 130,
"validated": 256
},
"reportedSentences": 10,
"clips": 339,
"splits": {
"accent": { "": 1 },
"age": { "sixties": 0.06, "": 0.59, "fourties": 0.35 },
"gender": { "male": 0.06, "": 0.59, "female": 0.35 }
},
"users": 10,
"size": 10526142,
"checksum": "936191d697b2820af25c3a7b71baf1716090510467a45c4c286167f3dde8094c",
"avgDurationSecs": 5.285,
"validDurationSecs": 1353.039,
"totalHrs": 0.49,
"validHrs": 0.37
},
"sw": {
"duration": 2636057412,
"buckets": {
"dev": 9196,
"invalidated": 10371,
"other": 340061,
"reported": 1832,
"test": 9288,
"train": 21141,
"validated": 144219
},
"reportedSentences": 1827,
"clips": 494651,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.3,
"twenties": 0.44,
"thirties": 0.14,
"teens": 0,
"fifties": 0.06,
"fourties": 0.05,
"sixties": 0.01
},
"gender": { "": 0.27, "male": 0.39, "female": 0.34, "other": 0 }
},
"users": 663,
"size": 15409263795,
"checksum": "fbd27fae537fd3c843635a4518979c678c523b1bda5f109e050ec483af3a66fc",
"avgDurationSecs": 5.329,
"validDurationSecs": 768561.195,
"totalHrs": 732.23,
"validHrs": 213.48
},
"sat": {
"duration": 3025764,
"buckets": {
"dev": 0,
"invalidated": 11,
"other": 281,
"reported": 6,
"test": 118,
"train": 275,
"validated": 393
},
"reportedSentences": 7,
"clips": 685,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.44,
"twenties": 0.41,
"fourties": 0.01,
"fifties": 0.01,
"teens": 0.01,
"thirties": 0.11
},
"gender": { "": 0.42, "male": 0.57, "female": 0.01 }
},
"users": 9,
"size": 16534167,
"checksum": "37ea92dc8d330ab09c7f0ce8aedabd867d81530741b5be98a577aaa8c4e08d45",
"avgDurationSecs": 4.417,
"validDurationSecs": 1735.949,
"totalHrs": 0.84,
"validHrs": 0.48
},
"tig": {
"duration": 103284,
"buckets": {
"dev": 0,
"invalidated": 8,
"other": 5,
"reported": 0,
"test": 0,
"train": 10,
"validated": 10
},
"reportedSentences": 1,
"clips": 23,
"splits": {
"accent": { "": 1 },
"age": { "": 0.78, "twenties": 0.22 },
"gender": { "": 0.78, "male": 0.22 }
},
"users": 5,
"size": 603415,
"checksum": "17d9f9d54f00aa556bd4cc4daf4cdda80c345fee78b53a54befdfc640b027b8d",
"avgDurationSecs": 4.491,
"validDurationSecs": 44.906,
"totalHrs": 0.02,
"validHrs": 0.01
},
"ig": {
"duration": 31109796,
"buckets": {
"dev": 2,
"invalidated": 2,
"other": 5673,
"reported": 6,
"test": 4,
"train": 8,
"validated": 14
},
"reportedSentences": 6,
"clips": 5689,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.56,
"twenties": 0.32,
"teens": 0.05,
"eighties": 0,
"thirties": 0.04,
"sixties": 0.02,
"fourties": 0
},
"gender": { "": 0.56, "male": 0.13, "female": 0.3 }
},
"users": 104,
"size": 181718397,
"checksum": "23a28c2c5ab68230a17f531213e7010df1263108ae813ccf99503985ecce4813",
"avgDurationSecs": 5.468,
"validDurationSecs": 76.558,
"totalHrs": 8.64,
"validHrs": 0.02
},
"nan-tw": {
"duration": 33604848,
"buckets": {
"dev": 936,
"invalidated": 257,
"other": 9422,
"reported": 118,
"test": 892,
"train": 1040,
"validated": 2875
},
"reportedSentences": 119,
"clips": 12554,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.2,
"": 0.12,
"twenties": 0.4,
"fourties": 0.23,
"teens": 0.03,
"fifties": 0.01,
"sixties": 0
},
"gender": { "male": 0.53, "": 0.12, "other": 0.12, "female": 0.23 }
},
"users": 97,
"size": 188163531,
"checksum": "fb567620a6f4b449503f2a46a278f858cf2c62c9f817c769620f6e0000f06a4e",
"avgDurationSecs": 2.677,
"validDurationSecs": 7695.869,
"totalHrs": 9.33,
"validHrs": 2.13
},
"mhr": {
"duration": 424034172,
"buckets": {
"dev": 12583,
"invalidated": 2701,
"other": 0,
"reported": 35,
"test": 12797,
"train": 59242,
"validated": 86370
},
"reportedSentences": 36,
"clips": 89071,
"splits": {
"accent": { "": 1 },
"age": {
"fifties": 0.09,
"": 0.11,
"sixties": 0.08,
"thirties": 0.22,
"fourties": 0.18,
"twenties": 0.26,
"teens": 0.03,
"seventies": 0.01
},
"gender": { "male": 0.18, "": 0.11, "female": 0.7 }
},
"users": 239,
"size": 2402678566,
"checksum": "2291217507995a8741511c3e90fbbe69954046442210d50442e8e95aa462ae1e",
"avgDurationSecs": 4.761,
"validDurationSecs": 411175.707,
"totalHrs": 117.78,
"validHrs": 114.21
},
"bn": {
"duration": 1608861096,
"buckets": {
"dev": 8226,
"invalidated": 6312,
"other": 218729,
"reported": 970,
"test": 8226,
"train": 16271,
"validated": 35981
},
"reportedSentences": 965,
"clips": 261022,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.02,
"twenties": 0.24,
"": 0.69,
"teens": 0.04,
"fourties": 0
},
"gender": { "male": 0.26, "": 0.69, "female": 0.05, "other": 0 }
},
"users": 20630,
"size": 9255007781,
"checksum": "1430962f663e16ed42869096afca6378cf6194f96979b8194f5b170c2474e5f2",
"avgDurationSecs": 6.164,
"validDurationSecs": 221776.061,
"totalHrs": 446.9,
"validHrs": 61.6
},
"tok": {
"buckets": {
"dev": 1756,
"invalidated": 177,
"other": 2203,
"reported": 89,
"test": 1616,
"train": 2214,
"validated": 6782
},
"reportedSentences": 90,
"duration": 33801120,
"clips": 9162,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.44,
"twenties": 0.19,
"teens": 0.26,
"thirties": 0.12,
"fourties": 0
},
"gender": { "": 0.44, "male": 0.45, "other": 0.03, "female": 0.08 }
},
"users": 73,
"size": 196707194,
"checksum": "7c1e45005bcdc3a63e25cb4cfb9809ea2263868eeb2051435449c451dd50251f",
"avgDurationSecs": 3.689,
"validDurationSecs": 25020.65,
"totalHrs": 9.38,
"validHrs": 6.95
},
"yue": {
"duration": 139717368,
"buckets": {
"dev": 2157,
"invalidated": 1332,
"other": 17354,
"reported": 601,
"test": 2168,
"train": 2547,
"validated": 14772
},
"reportedSentences": 594,
"clips": 33458,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.18,
"": 0.42,
"twenties": 0.32,
"fourties": 0.04,
"sixties": 0,
"fifties": 0,
"teens": 0.03
},
"gender": { "male": 0.34, "": 0.48, "female": 0.16, "other": 0.02 }
},
"users": 584,
"size": 790089657,
"checksum": "10b86810d19f041c13f22b188872df968443aa382e8bf86a6a6a53ca9ecded17",
"avgDurationSecs": 4.176,
"validDurationSecs": 61686.442,
"totalHrs": 38.81,
"validHrs": 17.13
},
"sah": {
"duration": 24460932,
"buckets": {
"dev": 1083,
"invalidated": 101,
"other": 1,
"reported": 2,
"test": 1249,
"train": 1575,
"validated": 3975
},
"reportedSentences": 3,
"clips": 4077,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.37,
"twenties": 0.03,
"fourties": 0.07,
"thirties": 0.43,
"teens": 0.1,
"fifties": 0
},
"gender": { "": 0.37, "male": 0.53, "female": 0.1 }
},
"users": 53,
"size": 186022863,
"checksum": "0d424943facc126253edf6a8a459a52f5b3986a4ddec251eebd46a8777a16cb9",
"avgDurationSecs": 6,
"validDurationSecs": 23848.959,
"totalHrs": 6.79,
"validHrs": 6.62
},
"mk": {
"duration": 979452,
"buckets": {
"dev": 0,
"invalidated": 7,
"other": 48,
"reported": 4,
"test": 15,
"train": 114,
"validated": 129
},
"reportedSentences": 5,
"clips": 184,
"splits": {
"accent": { "": 1 },
"age": { "thirties": 0.48, "": 0.33, "twenties": 0.11, "teens": 0.08 },
"gender": { "male": 0.67, "": 0.33 }
},
"users": 6,
"size": 5763649,
"checksum": "c319ef45575aaad82e1d9d5629da486dbf789f21dbc04fe0ad43b44d9fbd1af5",
"avgDurationSecs": 5.323,
"validDurationSecs": 686.681,
"totalHrs": 0.27,
"validHrs": 0.19
},
"sc": {
"duration": 3140316,
"buckets": {
"dev": 80,
"invalidated": 21,
"other": 227,
"reported": 0,
"test": 99,
"train": 201,
"validated": 388
},
"reportedSentences": 1,
"clips": 636,
"splits": {
"accent": { "": 1 },
"age": { "": 0.29, "thirties": 0.57, "twenties": 0.14 },
"gender": { "": 0.29, "female": 0.65, "male": 0.06 }
},
"users": 8,
"size": 17925900,
"checksum": "6d87934eface17484f592b21ac0944ff5eb08c07c8606041f92771e044a61dbb",
"avgDurationSecs": 4.938,
"validDurationSecs": 1915.79,
"totalHrs": 0.87,
"validHrs": 0.53
},
"vot": {
"duration": 1025976,
"buckets": {
"dev": 0,
"invalidated": 324,
"other": 0,
"test": 6,
"train": 96,
"validated": 102
},
"clips": 426,
"splits": {
"accent": { "": 1 },
"age": { "": 0.25, "twenties": 0.73, "teens": 0.01 },
"gender": { "": 0.25, "male": 0.75 }
},
"users": 5,
"size": 7892427,
"checksum": "97ab1ad8cfa2c0526abe359ae53d94d873af374755366b430959faafe62967b4",
"avgDurationSecs": 2.408,
"validDurationSecs": 245.656,
"totalHrs": 0.28,
"validHrs": 0.06
},
"az": {
"duration": 611820,
"buckets": {
"dev": 20,
"invalidated": 28,
"other": 1,
"test": 22,
"train": 39,
"validated": 81
},
"clips": 110,
"splits": {
"accent": { "": 1 },
"age": { "": 0.42, "twenties": 0.55, "fourties": 0.03 },
"gender": { "": 0.42, "male": 0.58 }
},
"users": 14,
"size": 3573693,
"checksum": "3e571071448c82c67b14a3a8fd9ddacb67152542f3b40159fac1da25a9c48fcd",
"avgDurationSecs": 5.562,
"validDurationSecs": 450.522,
"totalHrs": 0.16,
"validHrs": 0.12
},
"ast": {
"duration": 921168,
"buckets": {
"dev": 0,
"invalidated": 0,
"other": 214,
"test": 0,
"train": 0,
"validated": 0
},
"clips": 214,
"splits": { "accent": { "": 1 }, "age": { "": 1 }, "gender": { "": 1 } },
"users": 2,
"size": 5399531,
"checksum": "2e13bf7a4f09d4b89ef269360faf1a30fa918c8dd81f28ee7e935e668177decb",
"avgDurationSecs": 4.305,
"validDurationSecs": 0,
"totalHrs": 0.25,
"validHrs": 0
},
"ne-NP": {
"duration": 515232,
"buckets": {
"dev": 0,
"invalidated": 1,
"other": 122,
"test": 1,
"train": 4,
"validated": 5
},
"clips": 128,
"splits": {
"accent": { "": 1 },
"age": { "thirties": 0.04, "": 0.96 },
"gender": { "male": 0.04, "": 0.96 }
},
"users": 4,
"size": 2927867,
"checksum": "5c571f22daeb4c27cb4c668790d769bb47f4a5dd45d18abf08c4667458ef77dc",
"avgDurationSecs": 4.025,
"validDurationSecs": 20.126,
"totalHrs": 0.14,
"validHrs": 0
}
},
"totalDuration": 74941200000,
"totalValidDurationSecs": 54841011,
"totalHrs": 20817,
"totalValidHrs": 15234
}
================================================
FILE: datasets/scripted-speech/cv-corpus-10.0-delta-2022-07-04.json
================================================
{
"locales": {
"en": {
"duration": 348687468,
"reportedSentences": 253,
"clips": 63939,
"users": 2705,
"size": 2029487644,
"checksum": "b82354bf4ff7a62568e071dbba3a48160f7368ed94890fd57f466a85c27e0511",
"avgDurationSecs": 5.155,
"validDurationSecs": 183220.9,
"totalHrs": 96.86,
"validHrs": 50.9
},
"fa": {
"duration": 14904748,
"reportedSentences": 39,
"clips": 3957,
"users": 30,
"size": 85189844,
"checksum": "c298ceacbe35edbc0ed948c068afbd87c2076768a5e70ebe7f4166f7e053e4a8",
"avgDurationSecs": 4,
"validDurationSecs": 14033.62,
"totalHrs": 4.15,
"validHrs": 3.9
},
"fr": {
"reportedSentences": 179,
"duration": 120206124,
"clips": 22871,
"users": 219,
"size": 703389081,
"checksum": "a7f6596f3f679fca1a9ae24f319b5feda67bcea5d1a514c6f6f32ae65f88aa27",
"avgDurationSecs": 4.997,
"validDurationSecs": 72958.71,
"totalHrs": 33.39,
"validHrs": 20.27
},
"es": {
"reportedSentences": 82,
"duration": 87761484,
"clips": 16163,
"users": 614,
"size": 509558554,
"checksum": "29921567c0b8f98953295ff53d69bd3c0b6c6beb746791472daba1c87199ae66",
"avgDurationSecs": 5.053,
"validDurationSecs": 13675.14,
"totalHrs": 24.38,
"validHrs": 3.8
},
"sl": {
"reportedSentences": 6,
"duration": 3233196,
"clips": 947,
"users": 9,
"size": 18764069,
"checksum": "a461623bfc1deed48ce7ef2ec4b64fd724006472b1e481e8c3dd47e3290c23f3",
"avgDurationSecs": 3.814,
"validDurationSecs": 554.62,
"totalHrs": 0.9,
"validHrs": 0.16
},
"kab": {
"reportedSentences": 36,
"duration": 10593576,
"clips": 2494,
"users": 10,
"size": 55516139,
"checksum": "c0b4cec3a040eaf0abf4ab5cc434aef152939f901cb8fd5e7a46f0dc03d280cd",
"avgDurationSecs": 3.328,
"validDurationSecs": 7479.3,
"totalHrs": 2.94,
"validHrs": 2.08
},
"cy": {
"reportedSentences": 0,
"duration": 2239596,
"clips": 373,
"users": 10,
"size": 13145485,
"checksum": "83f81318cf77b1f9835762b0f4dc06af083d387bf0524200b8dce8492a16fb56",
"avgDurationSecs": 4.843,
"validDurationSecs": 1255.49,
"totalHrs": 0.62,
"validHrs": 0.35
},
"ca": {
"reportedSentences": 1075,
"duration": 1086014664,
"clips": 216035,
"users": 2595,
"size": 5809675725,
"checksum": "e6df7a73ffa2f9b61615c4f8199b44fb7a599cbbbd74a56789758842a6d77f54",
"avgDurationSecs": 5.512,
"validDurationSecs": 510777.98,
"totalHrs": 301.67,
"validHrs": 141.89
},
"de": {
"reportedSentences": 717,
"duration": 165497976,
"clips": 29099,
"users": 299,
"size": 960858837,
"checksum": "a293c3e341f6aaf25019bc852d6475d0ef2f85c2835f8481f7cc65dbd0bde2fa",
"avgDurationSecs": 5.159,
"validDurationSecs": 155145.01,
"totalHrs": 45.97,
"validHrs": 43.1
},
"tt": {
"duration": 422100,
"reportedSentences": 0,
"clips": 93,
"users": 4,
"size": 2408354,
"checksum": "e5b1670372444451dbd146a2bb911144eca59a843f59c347c98f75ee5c5ac507",
"avgDurationSecs": 3.747,
"validDurationSecs": 368.68,
"totalHrs": 0.12,
"validHrs": 0.1
},
"ta": {
"duration": 68076324,
"reportedSentences": 67,
"clips": 10547,
"users": 31,
"size": 399769191,
"checksum": "8d8427ca7d2735131f5b77afc4ecd01342f617487f76a64c5eb2a597bc74f9a2",
"avgDurationSecs": 6.211,
"validDurationSecs": 11889.52,
"totalHrs": 18.91,
"validHrs": 3.3
},
"ru": {
"duration": 28183392,
"reportedSentences": 30,
"clips": 5741,
"users": 103,
"size": 165197914,
"checksum": "78ea4fa2de776edc8ecae52440d0e4fc4669eb962ae8150f91078c907c5d049a",
"avgDurationSecs": 5.161,
"validDurationSecs": 17910.15,
"totalHrs": 7.82,
"validHrs": 4.97
},
"nl": {
"duration": 5887440,
"reportedSentences": 7,
"clips": 1127,
"users": 32,
"size": 34580925,
"checksum": "d9053d64a7e5fc2d3853c0d920c73229650cb4d88130c9c9b9939f1fa582fe4a",
"avgDurationSecs": 4.316,
"validDurationSecs": 4588.35,
"totalHrs": 1.64,
"validHrs": 1.27
},
"it": {
"duration": 22953780,
"reportedSentences": 75,
"clips": 4010,
"users": 95,
"size": 133391324,
"checksum": "ae01f1b6fd93a65d964c274c4aee182ea00d861f0a4f9e98bc5c06fe55a4a1b5",
"avgDurationSecs": 5.356,
"validDurationSecs": 21773.05,
"totalHrs": 6.37,
"validHrs": 6.05
},
"eu": {
"duration": 1302084,
"reportedSentences": 17,
"clips": 223,
"users": 8,
"size": 7592880,
"checksum": "a0b8c72bdcce0d23e9a58ba71681e5a1c5093504159c5f69d699a4b9da42cb85",
"avgDurationSecs": 5.191,
"validDurationSecs": 259.22,
"totalHrs": 0.36,
"validHrs": 0.08
},
"tr": {
"duration": 28867104,
"reportedSentences": 27,
"clips": 8649,
"users": 35,
"size": 168127776,
"checksum": "2d12424877b65b3b39e6031c368a3d60579e347d31593558c57e0a3dc11b3791",
"avgDurationSecs": 3.654,
"validDurationSecs": 28120.92,
"totalHrs": 8.02,
"validHrs": 7.81
},
"ar": {
"duration": 71301762451,
"reportedSentences": 16,
"clips": 5202,
"users": 35,
"size": 114366388,
"checksum": "4f81fec5272134b6e7de8195fc94e629975c3142691587ad3be961c3cb12a686",
"avgDurationSecs": 4.164,
"validDurationSecs": 1872.37,
"totalHrs": 5.87,
"validHrs": 0.52
},
"zh-TW": {
"duration": 16178400,
"reportedSentences": 1,
"clips": 3900,
"users": 59,
"size": 78208489,
"checksum": "c4c488f69eadae226b056396e3f6dc40baf89674240afb101e19d01ee2922614",
"avgDurationSecs": 3.278,
"validDurationSecs": 6972.18,
"totalHrs": 4.5,
"validHrs": 1.93
},
"br": {
"duration": 933840,
"reportedSentences": 38,
"clips": 235,
"users": 5,
"size": 5392101,
"checksum": "6275bfb2f17e9857a61d25185e47d83f9486f3b5a08e67219f269e0559410f3a",
"avgDurationSecs": 3.085,
"validDurationSecs": 149.84,
"totalHrs": 0.26,
"validHrs": 0.04
},
"pt": {
"duration": 37053540,
"reportedSentences": 80,
"clips": 9099,
"users": 109,
"size": 213021717,
"checksum": "3ceeb91b12a07cbf8cf983b367573a35fdb89b39d925c086b595d1d32ca807cb",
"avgDurationSecs": 4.189,
"validDurationSecs": 17959.14,
"totalHrs": 10.29,
"validHrs": 4.99
},
"eo": {
"duration": -6705988902,
"reportedSentences": 13,
"clips": 4593,
"users": 43,
"size": 173814790,
"checksum": "2179bad54bb2b69cd12964bc2f6533b9538b7a3f943f9e65f8f9a463796fd901",
"avgDurationSecs": 6.068,
"validDurationSecs": 467,
"totalHrs": -434.44,
"validHrs": 442.34
},
"zh-CN": {
"duration": 1255890492,
"reportedSentences": 56,
"clips": 281511,
"users": 828,
"size": 7236025294,
"checksum": "5862d3e55aaa507b62c6e81343bb17ae784a32c1a36e9facf4b2713ecb5da4ce",
"avgDurationSecs": 4.603,
"validDurationSecs": -19192.42,
"totalHrs": 348.86,
"validHrs": -5.33
},
"id": {
"duration": 3512736,
"reportedSentences": 8,
"clips": 988,
"users": 22,
"size": 20488039,
"checksum": "33d49848b4e5341d166642ce175dc7c6128114307c4f3cab881be4e34b0703f7",
"avgDurationSecs": 4.055,
"validDurationSecs": 116.79,
"totalHrs": 0.97,
"validHrs": 0.03
},
"ia": {
"duration": 327852,
"reportedSentences": 1,
"clips": 60,
"users": 1,
"size": 1932757,
"checksum": "581ed384b3498194710ae266a83bfc220df575893f57db49907db29cd8fcfbdf",
"avgDurationSecs": 4.183,
"validDurationSecs": 156.84,
"totalHrs": 0.09,
"validHrs": 0.05
},
"lv": {
"duration": 360684,
"reportedSentences": 2,
"clips": 102,
"users": 2,
"size": 2103521,
"checksum": "6b245bc30f4a09415c0234545068084692a919b657df7281b414931e331fdf2f",
"avgDurationSecs": 3.412,
"validDurationSecs": 140.39,
"totalHrs": 0.1,
"validHrs": 0.04
},
"ja": {
"duration": 7523784,
"reportedSentences": 24,
"clips": 1433,
"users": 33,
"size": 44204310,
"checksum": "7196d23c02058a545c921539aa553bc6655c692274bca88cf0941f7e30018826",
"avgDurationSecs": 4.798,
"validDurationSecs": 6870.81,
"totalHrs": 2.09,
"validHrs": 1.91
},
"rw": {
"duration": 189180,
"reportedSentences": 6,
"clips": 58,
"users": 8,
"size": 1206683,
"checksum": "824d4a62cc4ce8a5e3fe0b4c24bd5a191a286dff50d4f4ccc4c724b342413a4b",
"avgDurationSecs": 5.008,
"validDurationSecs": 240.49,
"totalHrs": 0.05,
"validHrs": 0.06
},
"sv-SE": {
"duration": 2524896,
"reportedSentences": 5,
"clips": 530,
"users": 14,
"size": 14829390,
"checksum": "ee19ce93f376d4980e4afb32f7b7ac04b74fd257547718a43110d530789a1e95",
"avgDurationSecs": 3.956,
"validDurationSecs": 1088.7,
"totalHrs": 0.7,
"validHrs": 0.3
},
"cnh": {
"duration": 2700,
"reportedSentences": 0,
"clips": 1,
"users": 1,
"size": 18315,
"checksum": "76a3e555e9503e94799077f48b1ef84acfef3f1f19fd9cf1f6a30dc7c10b48fa",
"avgDurationSecs": 3.564,
"validDurationSecs": -0.37,
"totalHrs": 0,
"validHrs": 0
},
"et": {
"duration": 17878644,
"reportedSentences": 12,
"clips": 2658,
"users": 15,
"size": 105537453,
"checksum": "895f05b75825d23da787a223cd1c84f44b07bba18bec1e34a8a46b7afc642e56",
"avgDurationSecs": 6.756,
"validDurationSecs": 10884.46,
"totalHrs": 4.96,
"validHrs": 3.03
},
"ky": {
"duration": 442404,
"reportedSentences": 2,
"clips": 102,
"users": 8,
"size": 2545815,
"checksum": "b23c82a73e969b2bb26b7ffe4e4dc327b4bea59514fcc054be492192bc8ea493",
"avgDurationSecs": 4.542,
"validDurationSecs": 205.12,
"totalHrs": 0.13,
"validHrs": 0.06
},
"ro": {
"duration": 4459500,
"reportedSentences": 7,
"clips": 855,
"users": 15,
"size": 26220871,
"checksum": "c9da340807a83058beea735c8e3290a327886933ac6b158744de9ecf6d44c87f",
"avgDurationSecs": 3.992,
"validDurationSecs": 947.78,
"totalHrs": 1.24,
"validHrs": 0.26
},
"hsb": {
"duration": 104004,
"reportedSentences": 15,
"clips": 15,
"users": 1,
"size": 413576,
"checksum": "f3cb738b99ef8700809e4787c7877dbd90942b6b197375d212c3c5951ad0b32b",
"avgDurationSecs": 6.109,
"validDurationSecs": 65.52,
"totalHrs": 0.03,
"validHrs": 0.01
},
"el": {
"duration": 7146972,
"reportedSentences": 0,
"clips": 1697,
"users": 16,
"size": 41609522,
"checksum": "19d90367ab2be112a8400b5c39bdc718d45c07c7fc311ca29966145bb75931bd",
"avgDurationSecs": 4.132,
"validDurationSecs": 2958.64,
"totalHrs": 1.99,
"validHrs": 0.83
},
"cs": {
"duration": 3449736,
"reportedSentences": 4,
"clips": 871,
"users": 33,
"size": 20010919,
"checksum": "f3e5120b45c0c1a469cfea6008fcee7b102c245b3e772128e2ae189a2799feca",
"avgDurationSecs": 4.327,
"validDurationSecs": 3061.11,
"totalHrs": 0.96,
"validHrs": 0.85
},
"pl": {
"duration": 5775120,
"reportedSentences": 7,
"clips": 1059,
"users": 38,
"size": 33484538,
"checksum": "f9d491272e90ef9a10451779b9a4ceed52be45e9ceee80e6a3fc2b9689dc348c",
"avgDurationSecs": 4.477,
"validDurationSecs": 12307.82,
"totalHrs": 1.6,
"validHrs": 3.42
},
"rm-sursilv": {
"duration": 326844,
"reportedSentences": 2,
"clips": 65,
"users": 0,
"size": 1930334,
"checksum": "e3bfa984c4cd61b2b9cf3dca1051ae64e49815b5e5786acfbf9a56242bfc9f1a",
"avgDurationSecs": 5.48,
"validDurationSecs": 223.62,
"totalHrs": 0.09,
"validHrs": 0.06
},
"rm-vallader": {
"duration": 0,
"reportedSentences": 1,
"clips": 0,
"users": 0,
"size": 1717,
"checksum": "1e12ab4b075f336e5937333ef9b48bd61b0bbdb81b19ad1c43f94b73aee1c693",
"avgDurationSecs": 5.807,
"validDurationSecs": 17.42,
"totalHrs": 0,
"validHrs": 0.01
},
"mn": {
"duration": 529920,
"reportedSentences": 0,
"clips": 100,
"users": 7,
"size": 2996787,
"checksum": "7b8861f86e4414d444b8c1e2a64db45da8b82eeeac034ec426615ad4cdbbd822",
"avgDurationSecs": 5.474,
"validDurationSecs": 262.09,
"totalHrs": 0.14,
"validHrs": 0.07
},
"zh-HK": {
"duration": 13566780,
"reportedSentences": 12,
"clips": 3504,
"users": 38,
"size": 77800236,
"checksum": "dbd0a4254447b5de319be1be5d4262ec9bfe68da0d7235430469c33d9298c985",
"avgDurationSecs": 4.213,
"validDurationSecs": 6238.54,
"totalHrs": 3.76,
"validHrs": 1.73
},
"ab": {
"duration": 669816,
"reportedSentences": 1,
"clips": 109,
"users": 0,
"size": 3971809,
"checksum": "404ea029bc6cfca120fe9c1b181cee4ad23957621ab18c0401a8dc732877b053",
"avgDurationSecs": 5.127,
"validDurationSecs": 217.56,
"totalHrs": 0.19,
"validHrs": 0.06
},
"cv": {
"duration": 3923136,
"reportedSentences": 1,
"clips": 671,
"users": 2,
"size": 22877938,
"checksum": "689f89b2db0b97c1fbf5cae853b438d22d151f66fe2b1c9b88dd187462f5df28",
"avgDurationSecs": 5.038,
"validDurationSecs": 4829.93,
"totalHrs": 1.09,
"validHrs": 1.34
},
"uk": {
"duration": 15494112,
"reportedSentences": 9,
"clips": 3750,
"users": 38,
"size": 90538999,
"checksum": "b5bd9d8d49d96e5ef865a59e5ea00e16473be254cbbcc3874e8b58d7162600cb",
"avgDurationSecs": 4.786,
"validDurationSecs": 10889.69,
"totalHrs": 4.3,
"validHrs": 3.03
},
"mt": {
"duration": 171108,
"reportedSentences": 2,
"clips": 35,
"users": 1,
"size": 1006833,
"checksum": "b5ef5b1715cc505e3202161a72ed16371856e133d8d61ee17d996bb3a29ff3f6",
"avgDurationSecs": 4.737,
"validDurationSecs": 92.61,
"totalHrs": 0.05,
"validHrs": 0.02
},
"as": {
"duration": 6479856,
"reportedSentences": 0,
"clips": 1076,
"users": 1,
"size": 37214046,
"checksum": "4c6eca577436845cf0fe990dfc5a396c5ccfb9df89c0dd3a17825604d6ee320c",
"avgDurationSecs": 5.681,
"validDurationSecs": 2822.42,
"totalHrs": 1.8,
"validHrs": 0.78
},
"ka": {
"duration": 120132,
"reportedSentences": 4,
"clips": 35,
"users": 4,
"size": 705711,
"checksum": "042994d0aad43cd28261476019f9c00aa704c9d6738102beeda3345e9741b04f",
"avgDurationSecs": 5.326,
"validDurationSecs": 71.18,
"totalHrs": 0.04,
"validHrs": 0.02
},
"fy-NL": {
"duration": 3485916,
"reportedSentences": 23,
"clips": 688,
"users": 5,
"size": 20094683,
"checksum": "93281e617fbfe22a4f677bc2039ac91b343b43fc4fb952fb0e6d8a477878820d",
"avgDurationSecs": 4.978,
"validDurationSecs": 128.61,
"totalHrs": 0.97,
"validHrs": 0.03
},
"dv": {
"duration": 1411884,
"reportedSentences": 0,
"clips": 236,
"users": 7,
"size": 7113857,
"checksum": "467a1a1cf204e4b8a2f713e694f83db9705c35131f1ef66f0a90f597858be0c7",
"avgDurationSecs": 5.048,
"validDurationSecs": 282.82,
"totalHrs": 0.4,
"validHrs": 0.08
},
"pa-IN": {
"duration": 332316,
"reportedSentences": 12,
"clips": 69,
"users": 7,
"size": 1927149,
"checksum": "f397560fdf6a0d61d756dc94c2ebe03f4b9f02336561e5e80f3117f1a8a6c8a3",
"avgDurationSecs": 4.825,
"validDurationSecs": 125.14,
"totalHrs": 0.1,
"validHrs": 0.03
},
"vi": {
"duration": 1050624,
"reportedSentences": 0,
"clips": 277,
"users": 10,
"size": 6133120,
"checksum": "a4a0f2d2dfc35ef4317c334713762aaa51b53f7b26e1c9f0e860b5ed0cc1f31a",
"avgDurationSecs": 3.967,
"validDurationSecs": 752.68,
"totalHrs": 0.3,
"validHrs": 0.21
},
"or": {
"duration": 1341288,
"reportedSentences": 5,
"clips": 264,
"users": 1,
"size": 7830367,
"checksum": "00672631bee5854227d87b0f84ba5f19b8263477d34028dc086d9eeb2514e6a8",
"avgDurationSecs": 5.048,
"validDurationSecs": 36.78,
"totalHrs": 0.37,
"validHrs": 0.01
},
"ga-IE": {
"duration": 795492,
"reportedSentences": 5,
"clips": 186,
"users": 5,
"size": 4672055,
"checksum": "e25bcd05f22041bbd2e8ebb4136f1c7a38ba3a55d59d6f9698ec2f97e7b7fc18",
"avgDurationSecs": 3.542,
"validDurationSecs": 358.16,
"totalHrs": 0.22,
"validHrs": 0.1
},
"fi": {
"duration": 4769604,
"reportedSentences": 7,
"clips": 941,
"users": 14,
"size": 28009942,
"checksum": "9ca4fb4ca2bfb9eb0d10dd46469dd3a5ce0cc1a3595e61a82dfd44be67c0e971",
"avgDurationSecs": 4.587,
"validDurationSecs": 1648.14,
"totalHrs": 1.32,
"validHrs": 0.46
},
"hu": {
"duration": 2222676,
"reportedSentences": 9,
"clips": 411,
"users": 14,
"size": 13054968,
"checksum": "5ea1a62667d68a8e33a6758083a3fdb722fb22781996d5a7826f23a4d69c89e4",
"avgDurationSecs": 4.92,
"validDurationSecs": 3021.01,
"totalHrs": 0.61,
"validHrs": 0.84
},
"th": {
"duration": 90219312,
"reportedSentences": 82,
"clips": 22522,
"users": 156,
"size": 520392367,
"checksum": "b56ce794693feb2a79ef294a25f9def1bda407a06e0fe209d46a58e13621212d",
"avgDurationSecs": 4.171,
"validDurationSecs": 15975.52,
"totalHrs": 25.06,
"validHrs": 4.44
},
"lt": {
"duration": 273888,
"reportedSentences": 6,
"clips": 57,
"users": 8,
"size": 1591278,
"checksum": "10a1fe7a63972122c308ffb4cb37e6d1fc7a888ffb667aad9fa9e9993fe79bcc",
"avgDurationSecs": 5.155,
"validDurationSecs": 720.07,
"totalHrs": 0.07,
"validHrs": 0.2
},
"lg": {
"duration": 24624,
"reportedSentences": 0,
"clips": 5,
"users": 0,
"size": 290226,
"checksum": "48451a611b86562bda710fd350b5fcf767921ea9bb1e3d6e78e51caf42a6efa5",
"avgDurationSecs": 5.806,
"validDurationSecs": 83.34,
"totalHrs": 0.01,
"validHrs": 0.02
},
"hi": {
"duration": 1342548,
"reportedSentences": 7,
"clips": 244,
"users": 19,
"size": 7905284,
"checksum": "21364b7526eb32b58502f3ba897009127b5375beda9a1b8e4558a9aa0bbdb08e",
"avgDurationSecs": 4.8,
"validDurationSecs": 1209.38,
"totalHrs": 0.37,
"validHrs": 0.33
},
"bas": {
"duration": 0,
"reportedSentences": 0,
"clips": 0,
"users": 0,
"size": -893,
"checksum": "b7c92d2be66bbe18fc49a4e64e22c643a62483d25644208c9b6d45c2216e076d",
"avgDurationSecs": 4.429,
"validDurationSecs": 0,
"totalHrs": 0,
"validHrs": 0
},
"sk": {
"duration": 1360404,
"reportedSentences": 0,
"clips": 358,
"users": 7,
"size": 7875631,
"checksum": "119e70bd5b08d7ead030f9a50cdf70122df080975d53bf1794137b030328a6ff",
"avgDurationSecs": 3.997,
"validDurationSecs": 1388.12,
"totalHrs": 0.38,
"validHrs": 0.39
},
"kmr": {
"duration": 2453760,
"reportedSentences": 60,
"clips": 546,
"users": 10,
"size": 13959328,
"checksum": "134ca34be64ae928ac451ca33d09342227b0ba6ff8b8e5499f51c5f31b41ebc3",
"avgDurationSecs": 4.424,
"validDurationSecs": 5311.4,
"totalHrs": 0.69,
"validHrs": 1.47
},
"bg": {
"duration": 464940,
"reportedSentences": 0,
"clips": 80,
"users": 4,
"size": 2745094,
"checksum": "f532e435bd3df3422db84b13311068a819dc4685e4973ea847c2c8b41fa0d968",
"avgDurationSecs": 5.507,
"validDurationSecs": 633.94,
"totalHrs": 0.13,
"validHrs": 0.17
},
"kk": {
"duration": 177444,
"reportedSentences": 15,
"clips": 54,
"users": 1,
"size": 1036939,
"checksum": "6cc9593b426d0ed39a7e8e10e576031ee3436b6f00bd8327a57da2653fa32fdc",
"avgDurationSecs": 4.936,
"validDurationSecs": 205.72,
"totalHrs": 0.05,
"validHrs": 0.06
},
"ba": {
"duration": 208260,
"reportedSentences": 0,
"clips": 36,
"users": 0,
"size": 1275103,
"checksum": "834d2433e1fb541a7b3310f5f1cde1161bd7653e5f7d88338fbc4757b06f2818",
"avgDurationSecs": 4.426,
"validDurationSecs": 237.45,
"totalHrs": 0.06,
"validHrs": 0.06
},
"gl": {
"duration": 3843756,
"reportedSentences": 25,
"clips": 763,
"users": 22,
"size": 22456454,
"checksum": "05d320fa03a7003f26952392d9a17da6ea3252679d0c08e4110c1268157d0f98",
"avgDurationSecs": 4.837,
"validDurationSecs": 1590.91,
"totalHrs": 1.07,
"validHrs": 0.44
},
"ug": {
"duration": 25854804,
"reportedSentences": 4,
"clips": 4722,
"users": 11,
"size": 148822135,
"checksum": "de56839d05aab1d73099bbebb8ec54c11e88bb54191d97397f7618bf70276f1a",
"avgDurationSecs": 5.971,
"validDurationSecs": 8764.74,
"totalHrs": 7.18,
"validHrs": 2.43
},
"hy-AM": {
"duration": 31536,
"reportedSentences": 2,
"clips": 5,
"users": 0,
"size": 192363,
"checksum": "37aa8f887538d7bfe54f1ba6554b27c7f1996d4d96e8e74c891aedf1c8598b4a",
"avgDurationSecs": 6.113,
"validDurationSecs": 153.31,
"totalHrs": 0.01,
"validHrs": 0.04
},
"be": {
"duration": 247822380,
"reportedSentences": 26,
"clips": 53268,
"users": 92,
"size": 1411816938,
"checksum": "96f071638a5360172908b6573907284bceaa25b3bde66cb950e2e63a5a61488a",
"avgDurationSecs": 4.754,
"validDurationSecs": 247319.27,
"totalHrs": 68.84,
"validHrs": 68.7
},
"ur": {
"duration": 452521440,
"reportedSentences": 13,
"clips": 117862,
"users": 75,
"size": 2629186798,
"checksum": "72c662684967be9aafb3c1f42231fcdad4926b23bee7898d2ca8e5a5afcfb316",
"avgDurationSecs": 3.874,
"validDurationSecs": 117083.06,
"totalHrs": 125.7,
"validHrs": 32.52
},
"gn": {
"duration": 2981592,
"reportedSentences": 5,
"clips": 671,
"users": 7,
"size": 17087849,
"checksum": "752ec145f7ea7a6f7b676c0c0cfd24336fe2628f24b85113419961280ebfecd0",
"avgDurationSecs": 4.395,
"validDurationSecs": 1177.96,
"totalHrs": 0.83,
"validHrs": 0.33
},
"sr": {
"duration": 94248,
"reportedSentences": 0,
"clips": 40,
"users": 3,
"size": 540976,
"checksum": "9fc5a4dd4885de3044303e0d356d727831da46cc4c2f770bd98c4bea3dbe4410",
"avgDurationSecs": 2.826,
"validDurationSecs": 49.34,
"totalHrs": 0.03,
"validHrs": 0.02
},
"uz": {
"duration": 70874604,
"reportedSentences": 69,
"clips": 13865,
"users": 429,
"size": 395971356,
"checksum": "5d7887f2d36f891e02bd70858549752e8556c70b98855f759e94596572321253",
"avgDurationSecs": 4.108,
"validDurationSecs": 25335.81,
"totalHrs": 19.69,
"validHrs": 7.04
},
"mr": {
"duration": 7931412,
"reportedSentences": 6,
"clips": 1202,
"users": 7,
"size": 46288478,
"checksum": "2ceb9169d113c3354a3cc97907b3f36acdbc43c8ebe78c24f77a73afac216787",
"avgDurationSecs": 6.184,
"validDurationSecs": 2456.05,
"totalHrs": 2.21,
"validHrs": 0.68
},
"da": {
"duration": 4251636,
"reportedSentences": 94,
"clips": 1096,
"users": 25,
"size": 24689961,
"checksum": "6a3338dd0bfec1e945c8f0ea5cf92fee517c48004dbf7389076d20a5abb4dcc4",
"avgDurationSecs": 4.264,
"validDurationSecs": 4460.51,
"totalHrs": 1.18,
"validHrs": 1.24
},
"myv": {
"duration": 2514744,
"reportedSentences": 5,
"clips": 429,
"users": 7,
"size": 14753085,
"checksum": "be220f8186d52f8c866c84bb6fec0c2094333dfb525ad44bd607159b6b702100",
"avgDurationSecs": 5.75,
"validDurationSecs": 2929.74,
"totalHrs": 0.7,
"validHrs": 0.81
},
"nn-NO": {
"duration": 947808,
"reportedSentences": 10,
"clips": 211,
"users": 7,
"size": 4891549,
"checksum": "d848f0d5bdedb577c8b56aab1f396bdf896bfcc7ce687052aa3fa4ce25163a61",
"avgDurationSecs": 4.506,
"validDurationSecs": 664.09,
"totalHrs": 0.26,
"validHrs": 0.19
},
"ha": {
"duration": 964692,
"reportedSentences": 3,
"clips": 204,
"users": 3,
"size": 5667061,
"checksum": "484e9ec6bdbe21875f5b91c74bdfabb38ad90d5a4d6453c7579b2a3ed56b2232",
"avgDurationSecs": 4.348,
"validDurationSecs": 616.19,
"totalHrs": 0.27,
"validHrs": 0.18
},
"ckb": {
"duration": 216675072,
"reportedSentences": 511,
"clips": 54073,
"users": 901,
"size": 1174860843,
"checksum": "742eace3283d4d50f771862f4366e72df5a626493a7bcb0efc38f87bf5d43af2",
"avgDurationSecs": 3.903,
"validDurationSecs": 147610.73,
"totalHrs": 60.18,
"validHrs": 41.01
},
"ml": {
"duration": 712368,
"reportedSentences": 6,
"clips": 159,
"users": 6,
"size": 4039995,
"checksum": "cf6a3595a571d780db0c9e6b986e519ab047b56bd8b097fb9094dd8fd197682a",
"avgDurationSecs": 4.122,
"validDurationSecs": 438.34,
"totalHrs": 0.2,
"validHrs": 0.12
},
"mdf": {
"duration": 0,
"reportedSentences": 4,
"clips": 0,
"users": 0,
"size": 580,
"checksum": "936191d697b2820af25c3a7b71baf1716090510467a45c4c286167f3dde8094c",
"avgDurationSecs": 5.285,
"validDurationSecs": 105.71,
"totalHrs": 0,
"validHrs": 0.03
},
"sw": {
"duration": 46233252,
"reportedSentences": 1464,
"clips": 10616,
"users": 94,
"size": 269705348,
"checksum": "fbd27fae537fd3c843635a4518979c678c523b1bda5f109e050ec483af3a66fc",
"avgDurationSecs": 5.329,
"validDurationSecs": 125962.06,
"totalHrs": 12.84,
"validHrs": 34.99
},
"sat": {
"duration": 982836,
"reportedSentences": 0,
"clips": 275,
"users": 1,
"size": 5722323,
"checksum": "37ea92dc8d330ab09c7f0ce8aedabd867d81530741b5be98a577aaa8c4e08d45",
"avgDurationSecs": 4.417,
"validDurationSecs": 958.64,
"totalHrs": 0.28,
"validHrs": 0.27
},
"tig": {
"duration": 29268,
"reportedSentences": 0,
"clips": 5,
"users": 1,
"size": 167260,
"checksum": "17d9f9d54f00aa556bd4cc4daf4cdda80c345fee78b53a54befdfc640b027b8d",
"avgDurationSecs": 4.491,
"validDurationSecs": 3.79,
"totalHrs": 0,
"validHrs": 0
},
"ig": {
"duration": 1185372,
"reportedSentences": 1,
"clips": 200,
"users": 8,
"size": 6926095,
"checksum": "23a28c2c5ab68230a17f531213e7010df1263108ae813ccf99503985ecce4813",
"avgDurationSecs": 5.468,
"validDurationSecs": 38.4,
"totalHrs": 0.33,
"validHrs": 0.01
},
"nan-tw": {
"duration": 22239108,
"reportedSentences": 99,
"clips": 8311,
"users": 42,
"size": 122814958,
"checksum": "fb567620a6f4b449503f2a46a278f858cf2c62c9f817c769620f6e0000f06a4e",
"avgDurationSecs": 2.677,
"validDurationSecs": 5151.1,
"totalHrs": 6.18,
"validHrs": 1.43
},
"mhr": {
"duration": 187157268,
"reportedSentences": 4,
"clips": 39849,
"users": 59,
"size": 1060765720,
"checksum": "2291217507995a8741511c3e90fbbe69954046442210d50442e8e95aa462ae1e",
"avgDurationSecs": 4.761,
"validDurationSecs": 213414.15,
"totalHrs": 51.99,
"validHrs": 59.28
},
"bn": {
"duration": 170748288,
"reportedSentences": 272,
"clips": 29902,
"users": 767,
"size": 992617275,
"checksum": "1430962f663e16ed42869096afca6378cf6194f96979b8194f5b170c2474e5f2",
"avgDurationSecs": 6.164,
"validDurationSecs": 17968.74,
"totalHrs": 47.43,
"validHrs": 4.99
},
"tok": {
"reportedSentences": 2,
"duration": 8310852,
"clips": 1996,
"users": 23,
"size": 48459960,
"checksum": "7c1e45005bcdc3a63e25cb4cfb9809ea2263868eeb2051435449c451dd50251f",
"avgDurationSecs": 3.689,
"validDurationSecs": 6345.81,
"totalHrs": 2.3,
"validHrs": 1.77
},
"yue": {
"duration": 81398376,
"reportedSentences": 498,
"clips": 19828,
"users": 252,
"size": 458600224,
"checksum": "10b86810d19f041c13f22b188872df968443aa382e8bf86a6a6a53ca9ecded17",
"avgDurationSecs": 4.176,
"validDurationSecs": 32240.27,
"totalHrs": 22.62,
"validHrs": 8.96
},
"sah": {
"duration": 253188,
"reportedSentences": 2,
"clips": 40,
"users": 1,
"size": 1503446,
"checksum": "0d424943facc126253edf6a8a459a52f5b3986a4ddec251eebd46a8777a16cb9",
"avgDurationSecs": 6,
"validDurationSecs": 390.77,
"totalHrs": 0.07,
"validHrs": 0.11
},
"mk": {
"duration": 250452,
"clips": 39,
"users": 1,
"size": 1477107,
"checksum": "c319ef45575aaad82e1d9d5629da486dbf789f21dbc04fe0ad43b44d9fbd1af5",
"avgDurationSecs": 5.323,
"validDurationSecs": 38.12,
"totalHrs": 0.07,
"validHrs": 0.01
},
"vot": {
"duration": 0,
"clips": 0,
"users": 0,
"size": 121,
"checksum": "97ab1ad8cfa2c0526abe359ae53d94d873af374755366b430959faafe62967b4",
"avgDurationSecs": 2.408,
"validDurationSecs": 0,
"totalHrs": 0,
"validHrs": 0
},
"az": {
"duration": 7344,
"clips": 2,
"users": 1,
"size": 43779,
"checksum": "3e571071448c82c67b14a3a8fd9ddacb67152542f3b40159fac1da25a9c48fcd",
"avgDurationSecs": 5.562,
"validDurationSecs": 2.76,
"totalHrs": 0,
"validHrs": 0
}
},
"totalDuration": 2159111903,
"totalValidDurationSecs": 936568,
"totalHrs": 600,
"totalValidHrs": 261
}
================================================
FILE: datasets/scripted-speech/cv-corpus-11.0-2022-09-21.json
================================================
{
"locales": {
"en": {
"duration": 11152496587,
"buckets": {
"dev": 16354,
"invalidated": 252599,
"other": 290846,
"reported": 4366,
"test": 16354,
"train": 948736,
"validated": 1618225
},
"reportedSentences": 4294,
"clips": 2161670,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.37,
"twenties": 0.24,
"sixties": 0.04,
"thirties": 0.13,
"teens": 0.06,
"seventies": 0.01,
"fourties": 0.1,
"fifties": 0.05,
"eighties": 0,
"nineties": 0
},
"gender": { "": 0.37, "male": 0.45, "female": 0.15, "other": 0.02 }
},
"users": 84673,
"size": 79751937788,
"checksum": "0efd86ca6b40641b55d1411b7d3b1f1ab8626de4b207504953706df201d198a5",
"avgDurationSecs": 5.159,
"validDurationSecs": 8348752.95,
"totalHrs": 3097.91,
"validHrs": 2319.09
},
"fa": {
"buckets": {
"dev": 10288,
"invalidated": 13793,
"other": 24401,
"reported": 2168,
"test": 10288,
"train": 26951,
"validated": 309996
},
"reportedSentences": 2159,
"duration": 1392397316,
"clips": 348190,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.25,
"twenties": 0.31,
"thirties": 0.37,
"fifties": 0.02,
"fourties": 0.03,
"teens": 0.03,
"sixties": 0
},
"gender": { "": 0.22, "male": 0.71, "female": 0.07, "other": 0 }
},
"users": 4124,
"size": 10237548588,
"checksum": "e40247da130302d1dd71e5f25742a0f2f61e8627e7b674c13294967a23f6cf47",
"avgDurationSecs": 3.999,
"validDurationSecs": 1239661.1,
"totalHrs": 386.77,
"validHrs": 344.35
},
"fr": {
"buckets": {
"dev": 16089,
"invalidated": 57607,
"other": 14359,
"reported": 6586,
"test": 16089,
"train": 485034,
"validated": 652051
},
"reportedSentences": 6510,
"duration": 3623103766,
"clips": 724017,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.17,
"thirties": 0.16,
"": 0.36,
"teens": 0.03,
"fourties": 0.14,
"fifties": 0.1,
"sixties": 0.03,
"seventies": 0.01,
"eighties": 0,
"nineties": 0
},
"gender": { "male": 0.61, "": 0.28, "female": 0.1, "other": 0.01 }
},
"users": 16785,
"size": 25947906602,
"checksum": "f763f9b1817280cd37db4d4161a9afc76257024d5bb54951a5987464e1e2ebb4",
"avgDurationSecs": 5.004,
"validDurationSecs": 3262973.706,
"totalHrs": 1006.41,
"validHrs": 906.38
},
"es": {
"buckets": {
"dev": 15520,
"invalidated": 52095,
"other": 1180383,
"reported": 2033,
"test": 15520,
"train": 230467,
"validated": 305875
},
"reportedSentences": 2019,
"duration": 7475216238,
"clips": 1538353,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.1,
"": 0.13,
"fifties": 0.04,
"twenties": 0.56,
"teens": 0.08,
"fourties": 0.03,
"sixties": 0.06,
"eighties": 0,
"seventies": 0,
"nineties": 0
},
"gender": { "male": 0.54, "": 0.13, "other": 0, "female": 0.33 }
},
"users": 24516,
"size": 47288212406,
"checksum": "319ae22d17dc2158322bb189a4938faba0debe653b611a325ee80c672be277a1",
"avgDurationSecs": 4.859,
"validDurationSecs": 1486318.008,
"totalHrs": 2076.44,
"validHrs": 412.86
},
"sl": {
"buckets": {
"dev": 1206,
"invalidated": 251,
"other": 1562,
"reported": 34,
"test": 1207,
"train": 1423,
"validated": 9590
},
"reportedSentences": 35,
"duration": 43522354,
"clips": 11403,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.51,
"teens": 0.08,
"": 0.2,
"sixties": 0.07,
"fifties": 0.07,
"fourties": 0.02,
"thirties": 0.05
},
"gender": { "female": 0.16, "male": 0.64, "": 0.2, "other": 0 }
},
"users": 138,
"size": 309245843,
"checksum": "acbbccd20450efbfb100ab1f5fd0484756d761a3300d1e5eaf8fd403a56f5bbf",
"avgDurationSecs": 3.817,
"validDurationSecs": 36602.594,
"totalHrs": 12.08,
"validHrs": 10.16
},
"kab": {
"buckets": {
"dev": 14994,
"invalidated": 19492,
"other": 110003,
"reported": 8947,
"test": 14994,
"train": 151534,
"validated": 608713
},
"reportedSentences": 8942,
"duration": 2462033464,
"clips": 738208,
"splits": {
"accent": { "": 1 },
"age": {
"fourties": 0.09,
"thirties": 0.29,
"": 0.28,
"fifties": 0.19,
"twenties": 0.12,
"eighties": 0,
"teens": 0,
"sixties": 0.03,
"seventies": 0
},
"gender": { "male": 0.54, "": 0.26, "female": 0.2, "other": 0 }
},
"users": 1496,
"size": 18395506381,
"checksum": "09e61eddb933a73606af153a5ed9394390f531093d79b004d27635ee79ecd95b",
"avgDurationSecs": 3.335,
"validDurationSecs": 2030148.381,
"totalHrs": 683.89,
"validHrs": 563.93
},
"cy": {
"buckets": {
"dev": 5247,
"invalidated": 4337,
"other": 18730,
"reported": 156,
"test": 5266,
"train": 7726,
"validated": 88378
},
"reportedSentences": 157,
"duration": 541540989,
"clips": 111445,
"splits": {
"accent": { "": 1 },
"age": {
"fourties": 0.16,
"twenties": 0.13,
"sixties": 0.06,
"fifties": 0.09,
"": 0.43,
"thirties": 0.09,
"seventies": 0.01,
"eighties": 0,
"teens": 0.02
},
"gender": { "male": 0.33, "female": 0.24, "": 0.41, "other": 0.01 }
},
"users": 1723,
"size": 3980577087,
"checksum": "b1a5d115e0b65bcab23e1fbbc170ed3b61d74aeba506720202d8b732089136cd",
"avgDurationSecs": 4.859,
"validDurationSecs": 429452.282,
"totalHrs": 150.42,
"validHrs": 119.29
},
"ca": {
"buckets": {
"dev": 16340,
"invalidated": 76690,
"other": 481402,
"reported": 5357,
"test": 16340,
"train": 905243,
"validated": 1111949
},
"reportedSentences": 5312,
"duration": 9194234502,
"clips": 1670041,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.06,
"fifties": 0.17,
"fourties": 0.1,
"twenties": 0.05,
"": 0.35,
"sixties": 0.22,
"teens": 0.01,
"seventies": 0.04,
"eighties": 0,
"nineties": 0
},
"gender": { "male": 0.42, "": 0.35, "female": 0.23, "other": 0 }
},
"users": 30225,
"size": 56931368598,
"checksum": "3ae9f3c82dee5102dfd8a3319b4339262980236f1b85336700ba5e7d3dcb4aae",
"avgDurationSecs": 5.505,
"validDurationSecs": 6121717.886,
"totalHrs": 2553.95,
"validHrs": 1700.47
},
"de": {
"buckets": {
"dev": 16082,
"invalidated": 47953,
"other": 5329,
"reported": 8204,
"test": 16082,
"train": 479008,
"validated": 805962
},
"reportedSentences": 8180,
"duration": 4439964557,
"clips": 859244,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.19,
"fourties": 0.17,
"": 0.32,
"thirties": 0.15,
"teens": 0.03,
"sixties": 0.03,
"fifties": 0.11,
"seventies": 0,
"eighties": 0,
"nineties": 0
},
"gender": { "male": 0.59, "": 0.32, "female": 0.09, "other": 0.01 }
},
"users": 17226,
"size": 31626133806,
"checksum": "94a0c7aeb0d18a280380e5a568d21251ed421f093bc164c9f67d8b28dfbecaaf",
"avgDurationSecs": 5.167,
"validDurationSecs": 4164640.91,
"totalHrs": 1233.32,
"validHrs": 1156.84
},
"tt": {
"buckets": {
"dev": 3062,
"invalidated": 388,
"other": 252,
"reported": 3,
"test": 5124,
"train": 9778,
"validated": 28538
},
"reportedSentences": 4,
"duration": 109538738,
"clips": 29178,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.2,
"thirties": 0.73,
"twenties": 0.05,
"sixties": 0,
"fifties": 0.01,
"teens": 0,
"fourties": 0,
"seventies": 0.01
},
"gender": { "": 0.2, "male": 0.78, "female": 0.02 }
},
"users": 223,
"size": 809543535,
"checksum": "e56d549c0aa66f6df596350347dd38d76778da732d3393d4b2d0281ff68cc8dc",
"avgDurationSecs": 3.754,
"validDurationSecs": 107136.079,
"totalHrs": 30.42,
"validHrs": 29.76
},
"ta": {
"buckets": {
"dev": 11758,
"invalidated": 5575,
"other": 87993,
"reported": 3315,
"test": 11815,
"train": 41710,
"validated": 130461
},
"reportedSentences": 3315,
"duration": 1392279684,
"clips": 224029,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.08,
"thirties": 0.09,
"": 0.72,
"fourties": 0.03,
"seventies": 0.02,
"fifties": 0.03,
"teens": 0.03,
"sixties": 0,
"eighties": 0
},
"gender": { "male": 0.16, "": 0.71, "other": 0, "female": 0.13 }
},
"users": 792,
"size": 8341540337,
"checksum": "d23d087efee1ba3c0c9ce93789d77f8a659e0469643a0de73a0b6586735adccc",
"avgDurationSecs": 6.215,
"validDurationSecs": 810779.854,
"totalHrs": 386.74,
"validHrs": 225.21
},
"ru": {
"buckets": {
"dev": 9629,
"invalidated": 7159,
"other": 16865,
"reported": 356,
"test": 9630,
"train": 22862,
"validated": 125553
},
"reportedSentences": 350,
"duration": 771816312,
"clips": 149577,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.39,
"teens": 0.09,
"": 0.21,
"fourties": 0.15,
"thirties": 0.13,
"fifties": 0.03,
"sixties": 0,
"seventies": 0
},
"gender": { "male": 0.62, "": 0.21, "other": 0, "female": 0.16 }
},
"users": 2731,
"size": 5403479932,
"checksum": "c5e32c22b2bda21dbded3f20fbcf77910e7a63932da8138058c2e71c13ffd5bd",
"avgDurationSecs": 5.16,
"validDurationSecs": 647852.634,
"totalHrs": 214.39,
"validHrs": 179.95
},
"nl": {
"buckets": {
"dev": 10736,
"invalidated": 5161,
"other": 2157,
"reported": 328,
"test": 10743,
"train": 30318,
"validated": 84823
},
"reportedSentences": 328,
"duration": 397871425,
"clips": 92141,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.41,
"twenties": 0.21,
"fourties": 0.14,
"thirties": 0.11,
"teens": 0.02,
"fifties": 0.08,
"sixties": 0.01,
"nineties": 0,
"eighties": 0,
"seventies": 0
},
"gender": { "": 0.42, "male": 0.47, "female": 0.11, "other": 0 }
},
"users": 1530,
"size": 2734842015,
"checksum": "03c65fb0d4964d23286337aca8200dfbec44e4c63361bedb0e0adc1b7f1f5758",
"avgDurationSecs": 4.318,
"validDurationSecs": 366271.778,
"totalHrs": 110.51,
"validHrs": 101.74
},
"it": {
"buckets": {
"dev": 14997,
"invalidated": 17476,
"other": 27,
"reported": 5329,
"test": 15003,
"train": 152609,
"validated": 219211
},
"reportedSentences": 5325,
"duration": 1268981904,
"clips": 236714,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.17,
"twenties": 0.2,
"": 0.29,
"fifties": 0.16,
"fourties": 0.14,
"seventies": 0,
"sixties": 0.03,
"teens": 0.01,
"eighties": 0,
"nineties": 0
},
"gender": { "female": 0.11, "male": 0.59, "": 0.29, "other": 0 }
},
"users": 6767,
"size": 8784083327,
"checksum": "a976f2e7ab10c7dbee95f1271c1221bb2d42ab52589bfefd28cf63b3a4fae520",
"avgDurationSecs": 5.361,
"validDurationSecs": 1175151.415,
"totalHrs": 352.49,
"validHrs": 326.43
},
"eu": {
"buckets": {
"dev": 6561,
"invalidated": 5791,
"other": 26899,
"reported": 72,
"test": 6561,
"train": 10832,
"validated": 69159
},
"reportedSentences": 72,
"duration": 528767151,
"clips": 101849,
"splits": {
"accent": { "": 1 },
"age": {
"fourties": 0.13,
"thirties": 0.07,
"fifties": 0.14,
"twenties": 0.35,
"": 0.25,
"teens": 0.03,
"sixties": 0.02,
"seventies": 0
},
"gender": { "male": 0.47, "female": 0.26, "": 0.26, "other": 0.02 }
},
"users": 1213,
"size": 3985536335,
"checksum": "58fc92fc7c4e2c8874c4e6ae9f58cbd418740e1573c181ac284b0782acc977b0",
"avgDurationSecs": 5.192,
"validDurationSecs": 359051.217,
"totalHrs": 146.87,
"validHrs": 99.73
},
"tr": {
"buckets": {
"dev": 10127,
"invalidated": 3593,
"other": 151,
"reported": 339,
"test": 10143,
"train": 25998,
"validated": 82351
},
"reportedSentences": 340,
"duration": 314932815,
"clips": 86095,
"splits": {
"accent": { "": 1 },
"age": {
"": 0.32,
"thirties": 0.09,
"twenties": 0.25,
"teens": 0.02,
"fourties": 0.04,
"fifties": 0.09,
"sixties": 0.15,
"eighties": 0.02,
"seventies": 0.03
},
"gender": { "": 0.32, "male": 0.43, "female": 0.25, "other": 0 }
},
"users": 1328,
"size": 1948927227,
"checksum": "1e9499bf233e6668d5e34802f3ec704c3fffc271380eef16733629e673092610",
"avgDurationSecs": 3.658,
"validDurationSecs": 301237.38,
"totalHrs": 87.48,
"validHrs": 83.67
},
"ar": {
"buckets": {
"dev": 10438,
"invalidated": 14959,
"other": 35514,
"reported": 2074,
"test": 10440,
"train": 28043,
"validated": 76208
},
"reportedSentences": 2066,
"clips": 126681,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.11,
"": 0.56,
"twenties": 0.28,
"fourties": 0.01,
"teens": 0.03,
"fifties": 0,
"sixties": 0,
"nineties": 0
},
"gender": { "female": 0.18, "": 0.55, "male": 0.27, "other": 0 }
},
"users": 1309,
"duration": 528133089,
"size": 3133096686,
"checksum": "c5122a5fcb393f6091e81ef60ae8cfc8e1a80451eabee0c31fa600f7c92e99f2",
"avgDurationSecs": 4.169,
"validDurationSecs": 317711.152,
"totalHrs": 146.704,
"validHrs": 88.253
},
"zh-TW": {
"buckets": {
"dev": 4709,
"invalidated": 4596,
"other": 40630,
"reported": 139,
"test": 4709,
"train": 6568,
"validated": 77357
},
"reportedSentences": 140,
"duration": 404438206,
"clips": 122583,
"splits": {
"accent": { "": 1 },
"age": {
"thirties": 0.2,
"twenties": 0.32,
"teens": 0.05,
"": 0.27,
"fifties": 0.05,
"seventies": 0,
"fourties": 0.1,
"sixties": 0
},
"gender": { "male": 0.47, "": 0.26, "female": 0.24, "other": 0.02 }
},
"users": 2082,
"size": 2830133882,
"checksum": "6666d38ac9095833ee88da1ec2df7917ee43a6b7a4e1d60e2148e9fbf2f36c37",
"avgDurationSecs": 3.299,
"validDurationSecs": 255224.022,
"totalHrs": 112.34,
"validHrs": 70.89
},
"br": {
"buckets": {
"dev": 2122,
"invalidated": 785,
"other": 12352,
"reported": 267,
"test": 2119,
"train": 2645,
"validated": 11334
},
"reportedSentences": 267,
"duration": 76458895,
"clips": 24471,
"splits": {
"accent": { "": 1 },
"age": {
"twenties": 0.25,
"": 0.34,
"fifties": 0.06,
"fourties": 0.06,
"thirties": 0.08,
"sixties": 0.17,
"seventies": 0.02,
"teens": 0.01
},
"gender": { "male": 0.63, "": 0.35, "female": 0.02, "other": 0 }
},
"users": 180,
"size": 555823615,
"checksum": "e5dda67bebcf968fd81e43fb0c0a5789deae2574504b4d74446cac9cd3565559",
"avgDurationSecs": 3.124,
"validDurationSecs": 35412.738,
"totalHrs": 21.23,
"validHrs": 9.83
},
"pt": {
"buckets": {
"dev": 8688,
"invalidated": 4870,
"other": 16751,
"reported": 240
SYMBOL INDEX (6 symbols across 5 files)

FILE: helpers/common.js
  constant DATASET_TYPES (line 3) | const DATASET_TYPES = [

FILE: helpers/compareReleases.js
  constant USAGE (line 15) | const USAGE = "Usage: node helpers/compareReleases.js <dataset-type> <da...
  constant NON_ADDITIVE_KEYS (line 43) | const NON_ADDITIVE_KEYS = new Set([

FILE: helpers/createDeltaStatistics.js
  constant USAGE (line 8) | const USAGE = "Usage: node helpers/createDeltaStatistics.js <dataset-typ...

FILE: helpers/createStats.js
  constant USAGE (line 11) | const USAGE = "Usage: node helpers/createStats.js <dataset-type> <stats-...

FILE: helpers/recalculateStats.js
  constant USAGE (line 5) | const USAGE = "Usage: node helpers/recalculateStats.js <dataset-type> <d...

Condensed preview of 72 files, each showing path, character count, and a content snippet (full structured content: 6,123K chars).
[
{
"path": "CHANGELOG.md",
"chars": 2601,
"preview": "# Changelog\n\nChangelogs are maintained per dataset type:\n\n- [Scripted Speech (SCS)](datasets/scripted-speech/CHANGELOG.m"
},
{
"path": "LICENSE",
"chars": 16725,
"preview": "Mozilla Public License Version 2.0\n==================================\n\n1. Definitions\n--------------\n\n1.1. \"Contributor\""
},
{
"path": "README.md",
"chars": 5753,
"preview": "# Common Voice Datasets\n\nThis repo contains release details and metadata for the [Common Voice](https://commonvoice.mozi"
},
{
"path": "datasets/code-switching/README.md",
"chars": 491,
"preview": "# Code Switching (CS)\n\nCode Switching is an upcoming Common Voice modality where contributors produce speech that natura"
},
{
"path": "datasets/scripted-speech/CHANGELOG.md",
"chars": 16565,
"preview": "# Scripted Speech (SCS) Changelog\n\n## Current Release\n\n### [Corpus 25.0](cv-corpus-25.0-2026-03-09.json)\n\nRegularly sche"
},
{
"path": "datasets/scripted-speech/README.md",
"chars": 10401,
"preview": "# Scripted Speech (SCS)\n\nScripted Speech is the classic Common Voice dataset. Contributors read pre-written sentences al"
},
{
"path": "datasets/scripted-speech/cv-corpus-1.json",
"chars": 13470,
"preview": "{\n \"date\": \"2019-02-25\",\n \"locales\": {\n \"en\": {\n \"clips\": 677020,\n \"splits\": {\n \"accent\": {\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-10.0-2022-07-04.json",
"chars": 82812,
"preview": "{\n \"date\": \"2022-07-04\",\n \"locales\": {\n \"en\": {\n \"duration\": 10981597567,\n \"buckets\": {\n \"dev\": 16"
},
{
"path": "datasets/scripted-speech/cv-corpus-10.0-delta-2022-07-04.json",
"chars": 31637,
"preview": "{\n \"locales\": {\n \"en\": {\n \"duration\": 348687468,\n \"reportedSentences\": 253,\n \"clips\": 63939,\n \"u"
},
{
"path": "datasets/scripted-speech/cv-corpus-11.0-2022-09-21.json",
"chars": 85893,
"preview": "{\n \"locales\": {\n \"en\": {\n \"duration\": 11152496587,\n \"buckets\": {\n \"dev\": 16354,\n \"invalidate"
},
{
"path": "datasets/scripted-speech/cv-corpus-11.0-delta-2022-09-21.json",
"chars": 32474,
"preview": "{\n \"locales\": {\n \"en\": {\n \"duration\": 170899020,\n \"reportedSentences\": 199,\n \"clips\": 31304,\n \"u"
},
{
"path": "datasets/scripted-speech/cv-corpus-12.0-2022-12-07.json",
"chars": 89062,
"preview": "{\n \"locales\": {\n \"en\": {\n \"duration\": 11378329699,\n \"buckets\": {\n \"dev\": 16365,\n \"invalidate"
},
{
"path": "datasets/scripted-speech/cv-corpus-12.0-delta-2022-12-07.json",
"chars": 33876,
"preview": "{\n \"locales\": {\n \"en\": {\n \"duration\": 225833112,\n \"reportedSentences\": 192,\n \"clips\": 40707,\n \"u"
},
{
"path": "datasets/scripted-speech/cv-corpus-13.0-2023-03-09.json",
"chars": 92213,
"preview": "{\n \"locales\": {\n \"de\": {\n \"duration\": 4821107393,\n \"buckets\": {\n \"dev\": 16143,\n \"invalidated"
},
{
"path": "datasets/scripted-speech/cv-corpus-13.0-delta-2023-03-09.json",
"chars": 35137,
"preview": "{\n \"locales\": {\n \"de\": {\n \"duration\": 202351608,\n \"reportedSentences\": 518,\n \"clips\": 34540,\n \"u"
},
{
"path": "datasets/scripted-speech/cv-corpus-14.0-2023-06-23.json",
"chars": 102667,
"preview": "{\n \"locales\": {\n \"en\": {\n \"buckets\": {\n \"dev\": 16380,\n \"invalidated\": 272017,\n \"other\": 27"
},
{
"path": "datasets/scripted-speech/cv-corpus-14.0-delta-2023-06-23.json",
"chars": 36515,
"preview": "{\n \"locales\": {\n \"en\": {\n \"duration\": 252536976,\n \"reportedSentences\": 1711,\n \"clips\": 43378,\n \""
},
{
"path": "datasets/scripted-speech/cv-corpus-15.0-2023-09-08.json",
"chars": 108337,
"preview": "{\n \"locales\": {\n \"en\": {\n \"buckets\": {\n \"clip_durations\": 2317263,\n \"dev\": 16386,\n \"invali"
},
{
"path": "datasets/scripted-speech/cv-corpus-15.0-delta-2023-09-08.json",
"chars": 37394,
"preview": "{\n \"locales\": {\n \"en\": {\n \"duration\": 244618020,\n \"reportedSentences\": 140,\n \"clips\": 40553,\n \"u"
},
{
"path": "datasets/scripted-speech/cv-corpus-16.0-2023-12-06.json",
"chars": 111050,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"validated\": 41982,\n \"invalidated\": 5278,\n \"dev\": "
},
{
"path": "datasets/scripted-speech/cv-corpus-16.0-delta-2023-12-06.json",
"chars": 35715,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"duration\": 12852,\n \"reportedSentences\": 0,\n \"clips\": 5,\n \"users\": 1,\n"
},
{
"path": "datasets/scripted-speech/cv-corpus-16.1-2023-12-06.json",
"chars": 111036,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"validated\": 41982,\n \"invalidated\": 5278,\n \"dev\": "
},
{
"path": "datasets/scripted-speech/cv-corpus-16.1-delta-2023-12-06.json",
"chars": 38390,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"duration\": 12852,\n \"reportedSentences\": 2,\n \"clips\": 5,\n \"users\": 1,\n"
},
{
"path": "datasets/scripted-speech/cv-corpus-17.0-2024-03-15.json",
"chars": 195390,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"validated\": 41992,\n \"invalidated\": 5279,\n \"dev\": "
},
{
"path": "datasets/scripted-speech/cv-corpus-17.0-delta-2024-03-15.json",
"chars": 40572,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"duration\": 37008,\n \"reportedSentences\": 0,\n \"clips\": 10,\n \"users\": 2,"
},
{
"path": "datasets/scripted-speech/cv-corpus-18.0-2024-06-14.json",
"chars": 204693,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"dev\": 9160,\n \"test\": 9117,\n \"train\": 21027,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-18.0-delta-2024-06-14.json",
"chars": 49912,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"duration\": 101268,\n \"reportedSentences\": 0,\n \"validatedSentences\": 2,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-19.0-2024-09-13.json",
"chars": 207910,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"dev\": 9160,\n \"test\": 9117,\n \"train\": 21027,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-19.0-delta-2024-09-13.json",
"chars": 51840,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"duration\": 21893832,\n \"reportedSentences\": 0,\n \"validatedSentences\": 0,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-2.json",
"chars": 20968,
"preview": "{\n \"date\": \"2019-06-11\",\n \"locales\": {\n \"en\": {\n \"clips\": 895794,\n \"splits\": {\n \"accent\": {\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-20.0-2024-12-06.json",
"chars": 211168,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"dev\": 9160,\n \"test\": 9117,\n \"train\": 21027,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-20.0-delta-2024-12-06.json",
"chars": 51858,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"duration\": 5433228,\n \"reportedSentences\": 0,\n \"validatedSentences\": 0,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-21.0-2025-03-14.json",
"chars": 212975,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"dev\": 9150,\n \"test\": 9117,\n \"train\": 21037,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-21.0-delta-2025-03-14.json",
"chars": 53402,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"duration\": 53820,\n \"reportedSentences\": 0,\n \"validatedSentences\": 0,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-22.0-2025-06-20.json",
"chars": 217647,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"dev\": 9152,\n \"test\": 9132,\n \"train\": 21037,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-22.0-delta-2025-06-20.json",
"chars": 53647,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"duration\": 974304,\n \"reportedSentences\": 0,\n \"validatedSentences\": 1,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-23.0-2025-09-05.json",
"chars": 450864,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"dev\": 9153,\n \"test\": 9133,\n \"train\": 21038,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-23.0-delta-2025-09-05.json",
"chars": 54611,
"preview": "{\n \"locale\": {\n \"ab\": {\n \"duration\": 95292,\n \"reportedSentences\": 0,\n \"validatedSentences\": 0,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-24.0-2025-12-05.json",
"chars": 455723,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"dev\": 9329,\n \"test\": 9230,\n \"train\": 21331,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-24.0-delta-2025-12-05.json",
"chars": 112023,
"preview": "{\n \"locale\": {\n \"ab\": {\n \"duration\": 239328,\n \"reportedSentences\": 0,\n \"validatedSentences\": 1,\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-25.0-2026-03-09.json",
"chars": 316235,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"dev\": 14152,\n \"invalidated\": 22429,\n \"other\": 330"
},
{
"path": "datasets/scripted-speech/cv-corpus-25.0-delta-2026-03-09.json",
"chars": 146878,
"preview": "{\n \"locales\": {\n \"ab\": {\n \"buckets\": {\n \"dev\": 12419,\n \"invalidated\": 14978,\n \"other\": 345"
},
{
"path": "datasets/scripted-speech/cv-corpus-3.json",
"chars": 21575,
"preview": "{\n \"date\": \"2019-06-24\",\n \"locales\": {\n \"en\": {\n \"clips\": 896823,\n \"splits\": {\n \"accent\": {\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-4-2019-12-10.json",
"chars": 32981,
"preview": "{\n \"date\": \"2019-12-10\",\n \"locales\": {\n \"en\": {\n \"clips\": 1137300,\n \"splits\": {\n \"accent\": {\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-5-2020-06-22.json",
"chars": 51444,
"preview": "{\n \"date\": \"2020-06-22\",\n \"locales\": {\n \"en\": {\n \"clips\": 1429041,\n \"splits\": {\n \"accent\": {\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-5-singleword.json",
"chars": 15352,
"preview": "{\n \"date\": \"2020-06-22\",\n \"locales\": {\n \"es\": {\n \"clips\": 69284,\n \"splits\": {\n \"accent\": {\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-5.1-2020-06-22.json",
"chars": 51425,
"preview": "{\n \"date\": \"2020-06-22\",\n \"locales\": {\n \"en\": {\n \"size\": 53753543765,\n \"checksum\": \"cb5903dc0775f96de81cd"
},
{
"path": "datasets/scripted-speech/cv-corpus-5.1-singleword.json",
"chars": 14393,
"preview": "{\n \"date\": \"2020-06-22\",\n \"locales\": {\n \"es\": {\n \"clips\": 68817,\n \"splits\": {\n \"accent\": {\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-6.0-2020-12-11.json",
"chars": 58568,
"preview": "{\n \"date\": \"2020-12-11\",\n \"locales\": {\n \"en\": {\n \"clips\": 1582837,\n \"splits\": {\n \"accent\": {\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-6.0-singleword.json",
"chars": 24950,
"preview": "{\n \"date\": \"2020-12-11\",\n \"locales\": {\n \"es\": {\n \"clips\": 70038,\n \"splits\": {\n \"accent\": {\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-6.1-2020-12-11.json",
"chars": 60721,
"preview": "{\n \"date\": \"2020-12-11\",\n \"locales\": {\n \"en\": {\n \"reportedSentences\": 1762,\n \"size\": 60613063630,\n \""
},
{
"path": "datasets/scripted-speech/cv-corpus-6.1-singleword.json",
"chars": 25905,
"preview": "{\n \"date\": \"2021-07-21\",\n \"locales\": {\n \"es\": {\n \"clips\": 70038,\n \"splits\": {\n \"accent\": {\n "
},
{
"path": "datasets/scripted-speech/cv-corpus-7.0-2021-07-21.json",
"chars": 70855,
"preview": "{\n \"date\": \"2021-07-21\",\n \"locales\": {\n \"en\": {\n \"buckets\": {\n \"dev\": 16284,\n \"invalidated\": 220"
},
{
"path": "datasets/scripted-speech/cv-corpus-7.0-singleword.json",
"chars": 24979,
"preview": "{\n \"locales\": {\n \"es\": {\n \"clips\": 70038,\n \"splits\": {\n \"accent\": {\n \"surpeninsular\": 0.01"
},
{
"path": "datasets/scripted-speech/cv-corpus-8.0-2022-01-19.json",
"chars": 107177,
"preview": "{\n \"date\": \"2022-01-19\",\n \"locales\": {\n \"en\": {\n \"duration\": 10390463635,\n \"buckets\":"
},
{
"path": "datasets/scripted-speech/cv-corpus-9.0-2022-04-27.json",
"chars": 115118,
"preview": "{\n \"date\": \"2022-04-27\",\n \"locales\": {\n \"en\": {\n \"duration\": 10632910099,\n \"buckets\":"
},
{
"path": "datasets/spontaneous-speech/.gitkeep",
"chars": 0,
"preview": ""
},
{
"path": "datasets/spontaneous-speech/CHANGELOG.md",
"chars": 3945,
"preview": "# Spontaneous Speech (SPS) Changelog\n\n## Dataset Changes in Corpus 3.0\n\nThe following changes affect SPS datasets starti"
},
{
"path": "datasets/spontaneous-speech/README.md",
"chars": 9192,
"preview": "# Spontaneous Speech (SPS)\n\nSpontaneous Speech is a newer Common Voice modality where contributors respond to open-ended"
},
{
"path": "datasets/spontaneous-speech/sps-corpus-1.0-2025-09-05.json",
"chars": 132553,
"preview": "{\n \"locales\": {\n \"aat\": {\n \"locale\": \"aat\",\n \"clips\": 334,\n \"users\": 5,\n \"questions\": {\n "
},
{
"path": "datasets/spontaneous-speech/sps-corpus-2.0-2025-12-05.json",
"chars": 142042,
"preview": "{\n \"locales\": {\n \"aat\": {\n \"locale\": \"aat\",\n \"clips\": 334,\n \"users\": 5,\n \"questions\": {\n "
},
{
"path": "datasets/spontaneous-speech/sps-corpus-2.0-delta-2025-12-05.json",
"chars": 54888,
"preview": "{\n \"locales\": {\n \"ady\": {\n \"locale\": \"ady\",\n \"clips\": 26,\n \"users\": 6,\n \"questions\": {\n \""
},
{
"path": "datasets/spontaneous-speech/sps-corpus-3.0-2026-03-09.json",
"chars": 163952,
"preview": "{\n \"locales\": {\n \"aat\": {\n \"locale\": \"aat\",\n \"clips\": 334,\n \"users\": 5,\n \"questions\": {\n "
},
{
"path": "datasets/spontaneous-speech/sps-corpus-3.0-delta-2026-03-09.json",
"chars": 69492,
"preview": "{\n \"locales\": {\n \"ady\": {\n \"locale\": \"ady\",\n \"clips\": 71,\n \"users\": 13,\n \"questions\": {\n "
},
{
"path": "helpers/.eslintrc.json",
"chars": 254,
"preview": "{\n \"env\": {\n \"node\": true,\n \"commonjs\": false,\n \"es2021\": true\n },\n \"extends\": [\n \""
},
{
"path": "helpers/README.md",
"chars": 4356,
"preview": "# CV Dataset Helper Scripts\n\nThis directory contains helper scripts for processing and analyzing Common Voice dataset st"
},
{
"path": "helpers/common.js",
"chars": 901,
"preview": "const path = require(\"path\");\n\nconst DATASET_TYPES = [\n \"scripted-speech\",\n \"spontaneous-speech\",\n \"code-switching\",\n"
},
{
"path": "helpers/compareReleases.js",
"chars": 4805,
"preview": "const fs = require(\"fs\");\nconst path = require(\"path\");\nconst args = process.argv.slice(2);\nconst { DATASET_TYPES, build"
},
{
"path": "helpers/createDeltaStatistics.js",
"chars": 5129,
"preview": "const fs = require(\"fs\");\nconst path = require(\"path\");\nconst args = process.argv.slice(2);\nconst { DATASET_TYPES, build"
},
{
"path": "helpers/createStats.js",
"chars": 4406,
"preview": "const fs = require(\"fs/promises\");\nconst path = require(\"path\");\nconst process = require(\"node:process\");\nconst {\n DATA"
},
{
"path": "helpers/jsconfig.json",
"chars": 138,
"preview": "{\n \"compilerOptions\": {\n \"module\": \"UMD\",\n \"target\": \"es6\"\n },\n \"exclude\": [\"node_modules\", \"**/node_"
},
{
"path": "helpers/recalculateStats.js",
"chars": 2574,
"preview": "const fs = require(\"fs\");\nconst args = process.argv.slice(2);\nconst { DATASET_TYPES, buildFilePath, validateDatasetType "
}
]
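The per-locale entries in the release files (for example the "tr" entry, with its "buckets", "clips", "duration", "totalHrs" and "validHrs" fields) appear to be internally consistent. The sketch below checks those relationships for the "tr" values. It is not taken from the repository's helpers. The function name checkLocale, the tolerance, and the reading of "duration" as milliseconds are assumptions inferred from the sample values rather than from a documented schema.

```javascript
// Hypothetical sanity check for one locale entry of a release stats file.
// Values are copied from the "tr" entry; field semantics are inferred, not documented.
const tr = {
  buckets: {
    dev: 10127,
    invalidated: 3593,
    other: 151,
    reported: 339,
    test: 10143,
    train: 25998,
    validated: 82351,
  },
  duration: 314932815, // assumed to be milliseconds
  clips: 86095,
  avgDurationSecs: 3.658,
  validDurationSecs: 301237.38,
  totalHrs: 87.48,
  validHrs: 83.67,
};

// Compare the derived values with the stored ones, within a small tolerance,
// because the stored hours are rounded or truncated to a few decimals.
function checkLocale(entry, tolerance = 0.01) {
  const { validated, invalidated, other } = entry.buckets;
  const totalHrs = entry.duration / 1000 / 3600;
  const validHrs = entry.validDurationSecs / 3600;
  const avgSecs = entry.duration / 1000 / entry.clips;
  return {
    clipsMatchBuckets: validated + invalidated + other === entry.clips,
    totalHrsMatches: Math.abs(totalHrs - entry.totalHrs) <= tolerance,
    validHrsMatches: Math.abs(validHrs - entry.validHrs) <= tolerance,
    avgDurationMatches: Math.abs(avgSecs - entry.avgDurationSecs) <= tolerance,
  };
}

const result = checkLocale(tr);
console.log(result);
```

A tolerance is used instead of exact equality because the sample mixes two-decimal and three-decimal hour values, so the exact rounding rule cannot be determined from the data alone.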
About this extraction
This page contains the full source code of the common-voice/cv-dataset GitHub repository, extracted and formatted as plain text. The extraction includes 72 files (5.3 MB), approximately 1.4M tokens, and a symbol index with 6 extracted symbols.