Repository: CUNY-CL/wikipron
Branch: master
Commit: ded15d0522ac
Files: 612
Total size: 145.1 MB

Directory structure:
gitextract_29_vgov9/

├── .circleci/
│   └── config.yml
├── .github/
│   └── pull_request_template.md
├── .gitignore
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE.txt
├── README.md
├── data/
│   ├── README.md
│   ├── covering_grammar/
│   │   ├── README.md
│   │   ├── lib/
│   │   │   ├── README.md
│   │   │   ├── covering_grammar.py
│   │   │   ├── error_analysis.py
│   │   │   └── make_test_file.py
│   │   └── tsv/
│   │       ├── ady_cyrl_narrow.tsv
│   │       ├── bul_cyrl_narrow.tsv
│   │       ├── fre_latn_broad.tsv
│   │       ├── geo_geor_broad.tsv
│   │       ├── gre_grek_broad.tsv
│   │       ├── ice_latn_broad.tsv
│   │       ├── ita_latn_broad.tsv
│   │       └── jpn_hira_narrow.tsv
│   ├── frequencies/
│   │   ├── README.md
│   │   ├── grab_wortschatz_data.py
│   │   ├── merge.py
│   │   ├── shared_tasks/
│   │   │   ├── README.md
│   │   │   ├── SIGMORPHON_2021.json
│   │   │   └── SIGMORPHON_2022.json
│   │   └── wortschatz_languages.json
│   ├── morphology/
│   │   ├── README.md
│   │   ├── grab_unimorph_data.py
│   │   ├── shared_tasks/
│   │   │   ├── README.md
│   │   │   ├── SIGMORPHON_2021.json
│   │   │   └── SIGMORPHON_2022.json
│   │   └── unimorph_languages.json
│   ├── phones/
│   │   ├── HOWTO.md
│   │   ├── README.md
│   │   ├── lib/
│   │   │   ├── generate_summary.py
│   │   │   ├── list_phones.py
│   │   │   └── normalize.py
│   │   ├── phones/
│   │   │   ├── ady_narrow.phones
│   │   │   ├── afr_broad.phones
│   │   │   ├── aze_narrow.phones
│   │   │   ├── ben_dhaka_broad.phones
│   │   │   ├── ben_rarh_broad.phones
│   │   │   ├── bul_broad.phones
│   │   │   ├── cym_nw_broad.phones
│   │   │   ├── cym_sw_broad.phones
│   │   │   ├── deu_broad.phones
│   │   │   ├── ell_broad.phones
│   │   │   ├── eng_uk_broad.phones
│   │   │   ├── eng_us_broad.phones
│   │   │   ├── fra_broad.phones
│   │   │   ├── hbs_broad.phones
│   │   │   ├── hin_broad.phones
│   │   │   ├── hun_narrow.phones
│   │   │   ├── hye_e_narrow.phones
│   │   │   ├── hye_w_narrow.phones
│   │   │   ├── isl_broad.phones
│   │   │   ├── ita_broad.phones
│   │   │   ├── jpn_narrow.phones
│   │   │   ├── kat_broad.phones
│   │   │   ├── khm_broad.phones
│   │   │   ├── kor_narrow.phones
│   │   │   ├── lat_clas_broad.phones
│   │   │   ├── lav_narrow.phones
│   │   │   ├── mlt_broad.phones
│   │   │   ├── mya_broad.phones
│   │   │   ├── nld_broad.phones
│   │   │   ├── nob_broad.phones
│   │   │   ├── por_bz_broad.phones
│   │   │   ├── por_po_broad.phones
│   │   │   ├── ron_narrow.phones
│   │   │   ├── slv_broad.phones
│   │   │   ├── spa_ca_broad.phones
│   │   │   ├── spa_la_broad.phones
│   │   │   ├── tur_narrow.phones
│   │   │   ├── vie_hanoi_narrow.phones
│   │   │   ├── vie_hue_narrow.phones
│   │   │   └── vie_saigon_narrow.phones
│   │   ├── postprocess
│   │   └── summary.tsv
│   └── scrape/
│       ├── README.md
│       ├── lib/
│       │   ├── README.md
│       │   ├── codes.py
│       │   ├── common_characters.py
│       │   ├── generate_summary.py
│       │   ├── languages.json
│       │   ├── languages_update.py
│       │   ├── scrape.py
│       │   ├── split.py
│       │   └── unmatched_languages.json
│       ├── postprocess
│       ├── scrape
│       ├── summary.tsv
│       └── tsv/
│           ├── aar_latn_broad.tsv
│           ├── aar_latn_narrow.tsv
│           ├── abk_cyrl_broad.tsv
│           ├── abk_cyrl_narrow.tsv
│           ├── acw_arab_broad.tsv
│           ├── acw_arab_narrow.tsv
│           ├── ady_cyrl_narrow.tsv
│           ├── ady_cyrl_narrow_filtered.tsv
│           ├── afb_arab_broad.tsv
│           ├── afr_latn_broad.tsv
│           ├── afr_latn_broad_filtered.tsv
│           ├── afr_latn_narrow.tsv
│           ├── aii_syrc_narrow.tsv
│           ├── ajp_arab_broad.tsv
│           ├── ajp_arab_narrow.tsv
│           ├── akk_latn_broad.tsv
│           ├── ale_latn_broad.tsv
│           ├── amh_ethi_broad.tsv
│           ├── ang_latn_broad.tsv
│           ├── ang_latn_narrow.tsv
│           ├── aot_latn_broad.tsv
│           ├── apw_latn_narrow.tsv
│           ├── ara_arab_broad.tsv
│           ├── ara_arab_narrow.tsv
│           ├── arc_hebr_broad.tsv
│           ├── arg_latn_broad.tsv
│           ├── ary_arab_broad.tsv
│           ├── arz_arab_broad.tsv
│           ├── asm_beng_broad.tsv
│           ├── ast_latn_broad.tsv
│           ├── ast_latn_narrow.tsv
│           ├── ayl_arab_broad.tsv
│           ├── aze_latn_broad.tsv
│           ├── aze_latn_narrow.tsv
│           ├── aze_latn_narrow_filtered.tsv
│           ├── bak_cyrl_broad.tsv
│           ├── bak_cyrl_narrow.tsv
│           ├── ban_bali_broad.tsv
│           ├── bar_latn_broad.tsv
│           ├── bbl_geor_broad.tsv
│           ├── bbn_latn_broad.tsv
│           ├── bcl_latn_broad.tsv
│           ├── bcl_latn_narrow.tsv
│           ├── bdq_latn_broad.tsv
│           ├── bel_cyrl_narrow.tsv
│           ├── ben_beng_broad.tsv
│           ├── ben_beng_dhaka_broad.tsv
│           ├── ben_beng_dhaka_broad_filtered.tsv
│           ├── ben_beng_dhaka_narrow.tsv
│           ├── ben_beng_narrow.tsv
│           ├── ben_beng_rarh_broad.tsv
│           ├── ben_beng_rarh_broad_filtered.tsv
│           ├── ben_beng_rarh_narrow.tsv
│           ├── bjb_latn_broad.tsv
│           ├── blt_tavt_narrow.tsv
│           ├── bod_tibt_broad.tsv
│           ├── bre_latn_broad.tsv
│           ├── bua_cyrl_broad.tsv
│           ├── bua_cyrl_narrow.tsv
│           ├── bul_cyrl_narrow.tsv
│           ├── cat_latn_broad.tsv
│           ├── cat_latn_narrow.tsv
│           ├── cbn_thai_broad.tsv
│           ├── ceb_latn_broad.tsv
│           ├── ceb_latn_narrow.tsv
│           ├── ces_latn_narrow.tsv
│           ├── chb_latn_broad.tsv
│           ├── che_cyrl_broad.tsv
│           ├── cho_latn_broad.tsv
│           ├── chr_cher_broad.tsv
│           ├── cic_latn_broad.tsv
│           ├── ckb_arab_broad.tsv
│           ├── cnk_latn_broad.tsv
│           ├── cop_copt_broad.tsv
│           ├── cor_latn_broad.tsv
│           ├── cor_latn_narrow.tsv
│           ├── cos_latn_broad.tsv
│           ├── crk_latn_broad.tsv
│           ├── crk_latn_narrow.tsv
│           ├── crx_cans_broad.tsv
│           ├── csb_latn_broad.tsv
│           ├── cym_latn_nw_broad.tsv
│           ├── cym_latn_nw_broad_filtered.tsv
│           ├── cym_latn_nw_narrow.tsv
│           ├── cym_latn_sw_broad.tsv
│           ├── cym_latn_sw_broad_filtered.tsv
│           ├── cym_latn_sw_narrow.tsv
│           ├── dan_latn_broad.tsv
│           ├── dan_latn_narrow.tsv
│           ├── deu_latn_broad.tsv
│           ├── deu_latn_broad_filtered.tsv
│           ├── deu_latn_narrow.tsv
│           ├── div_thaa_broad.tsv
│           ├── div_thaa_narrow.tsv
│           ├── dlm_latn_broad.tsv
│           ├── dng_cyrl_broad.tsv
│           ├── dsb_latn_broad.tsv
│           ├── dsb_latn_narrow.tsv
│           ├── dum_latn_broad.tsv
│           ├── dzo_tibt_broad.tsv
│           ├── egy_latn_broad.tsv
│           ├── ell_grek_broad.tsv
│           ├── ell_grek_broad_filtered.tsv
│           ├── ell_grek_narrow.tsv
│           ├── eng_latn_uk_broad.tsv
│           ├── eng_latn_uk_broad_filtered.tsv
│           ├── eng_latn_uk_narrow.tsv
│           ├── eng_latn_us_broad.tsv
│           ├── eng_latn_us_broad_filtered.tsv
│           ├── eng_latn_us_narrow.tsv
│           ├── enm_latn_broad.tsv
│           ├── epo_latn_broad.tsv
│           ├── epo_latn_narrow.tsv
│           ├── est_latn_broad.tsv
│           ├── est_latn_narrow.tsv
│           ├── ett_ital_broad.tsv
│           ├── eus_latn_broad.tsv
│           ├── eus_latn_narrow.tsv
│           ├── evn_cyrl_broad.tsv
│           ├── ewe_latn_broad.tsv
│           ├── fao_latn_broad.tsv
│           ├── fao_latn_narrow.tsv
│           ├── fas_arab_broad.tsv
│           ├── fas_arab_narrow.tsv
│           ├── fax_latn_broad.tsv
│           ├── fin_latn_broad.tsv
│           ├── fin_latn_narrow.tsv
│           ├── fra_latn_broad.tsv
│           ├── fra_latn_broad_filtered.tsv
│           ├── fra_latn_narrow.tsv
│           ├── fro_latn_broad.tsv
│           ├── frr_latn_broad.tsv
│           ├── fry_latn_broad.tsv
│           ├── gla_latn_broad.tsv
│           ├── gla_latn_narrow.tsv
│           ├── gle_latn_broad.tsv
│           ├── gle_latn_narrow.tsv
│           ├── glg_latn_broad.tsv
│           ├── glg_latn_narrow.tsv
│           ├── glv_latn_broad.tsv
│           ├── glv_latn_narrow.tsv
│           ├── gml_latn_broad.tsv
│           ├── goh_latn_broad.tsv
│           ├── got_goth_broad.tsv
│           ├── got_goth_narrow.tsv
│           ├── grc_grek_broad.tsv
│           ├── grn_latn_broad.tsv
│           ├── gsw_latn_broad.tsv
│           ├── guj_gujr_broad.tsv
│           ├── gur_latn_broad.tsv
│           ├── guw_latn_broad.tsv
│           ├── hat_latn_broad.tsv
│           ├── hau_latn_broad.tsv
│           ├── hau_latn_narrow.tsv
│           ├── haw_latn_broad.tsv
│           ├── haw_latn_narrow.tsv
│           ├── hbs_cyrl_broad.tsv
│           ├── hbs_cyrl_broad_filtered.tsv
│           ├── hbs_latn_broad.tsv
│           ├── hbs_latn_broad_filtered.tsv
│           ├── heb_hebr_broad.tsv
│           ├── heb_hebr_narrow.tsv
│           ├── hil_latn_broad.tsv
│           ├── hil_latn_narrow.tsv
│           ├── hin_deva_broad.tsv
│           ├── hin_deva_broad_filtered.tsv
│           ├── hin_deva_narrow.tsv
│           ├── hrx_latn_broad.tsv
│           ├── hsb_latn_broad.tsv
│           ├── hsb_latn_narrow.tsv
│           ├── hts_latn_broad.tsv
│           ├── hun_latn_narrow.tsv
│           ├── hun_latn_narrow_filtered.tsv
│           ├── huu_latn_narrow.tsv
│           ├── hye_armn_e_broad.tsv
│           ├── hye_armn_e_narrow.tsv
│           ├── hye_armn_e_narrow_filtered.tsv
│           ├── hye_armn_w_broad.tsv
│           ├── hye_armn_w_narrow.tsv
│           ├── hye_armn_w_narrow_filtered.tsv
│           ├── iba_latn_broad.tsv
│           ├── iba_latn_narrow.tsv
│           ├── ido_latn_broad.tsv
│           ├── ilo_latn_broad.tsv
│           ├── ilo_latn_narrow.tsv
│           ├── ina_latn_broad.tsv
│           ├── ind_latn_broad.tsv
│           ├── ind_latn_narrow.tsv
│           ├── inh_cyrl_broad.tsv
│           ├── isl_latn_broad.tsv
│           ├── isl_latn_broad_filtered.tsv
│           ├── isl_latn_narrow.tsv
│           ├── ita_latn_broad.tsv
│           ├── ita_latn_broad_filtered.tsv
│           ├── izh_latn_broad.tsv
│           ├── izh_latn_narrow.tsv
│           ├── jam_latn_broad.tsv
│           ├── jav_java_broad.tsv
│           ├── jje_hang_broad.tsv
│           ├── jpn_hira_narrow.tsv
│           ├── jpn_hira_narrow_filtered.tsv
│           ├── jpn_kana_narrow.tsv
│           ├── jpn_kana_narrow_filtered.tsv
│           ├── kal_latn_broad.tsv
│           ├── kal_latn_narrow.tsv
│           ├── kan_knda_broad.tsv
│           ├── kas_arab_broad.tsv
│           ├── kas_arab_narrow.tsv
│           ├── kas_deva_broad.tsv
│           ├── kat_geor_broad.tsv
│           ├── kat_geor_broad_filtered.tsv
│           ├── kat_geor_narrow.tsv
│           ├── kaw_latn_broad.tsv
│           ├── kaz_cyrl_broad.tsv
│           ├── kaz_cyrl_narrow.tsv
│           ├── kbd_cyrl_narrow.tsv
│           ├── kgp_latn_broad.tsv
│           ├── khb_talu_broad.tsv
│           ├── khm_khmr_broad.tsv
│           ├── khm_khmr_broad_filtered.tsv
│           ├── kik_latn_broad.tsv
│           ├── kir_cyrl_broad.tsv
│           ├── kir_cyrl_narrow.tsv
│           ├── kix_latn_broad.tsv
│           ├── kld_latn_broad.tsv
│           ├── klj_latn_narrow.tsv
│           ├── kmr_latn_broad.tsv
│           ├── koi_cyrl_broad.tsv
│           ├── koi_cyrl_narrow.tsv
│           ├── kok_deva_broad.tsv
│           ├── kok_deva_narrow.tsv
│           ├── kor_hang_narrow.tsv
│           ├── kor_hang_narrow_filtered.tsv
│           ├── kpv_cyrl_broad.tsv
│           ├── kpv_cyrl_narrow.tsv
│           ├── krl_latn_broad.tsv
│           ├── ksw_mymr_broad.tsv
│           ├── ktz_latn_broad.tsv
│           ├── kwk_latn_broad.tsv
│           ├── kxd_latn_broad.tsv
│           ├── kyu_kali_broad.tsv
│           ├── lad_latn_broad.tsv
│           ├── lao_laoo_narrow.tsv
│           ├── lat_latn_clas_narrow.tsv
│           ├── lat_latn_eccl_narrow.tsv
│           ├── lav_latn_narrow.tsv
│           ├── lav_latn_narrow_filtered.tsv
│           ├── lif_limb_broad.tsv
│           ├── lij_latn_broad.tsv
│           ├── lim_latn_broad.tsv
│           ├── lim_latn_narrow.tsv
│           ├── lit_latn_broad.tsv
│           ├── lit_latn_narrow.tsv
│           ├── liv_latn_broad.tsv
│           ├── lmo_latn_broad.tsv
│           ├── lmo_latn_narrow.tsv
│           ├── lmy_latn_narrow.tsv
│           ├── lou_latn_broad.tsv
│           ├── lsi_latn_broad.tsv
│           ├── ltg_latn_narrow.tsv
│           ├── ltz_latn_broad.tsv
│           ├── ltz_latn_narrow.tsv
│           ├── lut_latn_broad.tsv
│           ├── lwl_thai_broad.tsv
│           ├── lzz_geor_broad.tsv
│           ├── mah_latn_broad.tsv
│           ├── mah_latn_narrow.tsv
│           ├── mai_deva_narrow.tsv
│           ├── mak_latn_narrow.tsv
│           ├── mal_mlym_broad.tsv
│           ├── mal_mlym_narrow.tsv
│           ├── mar_deva_broad.tsv
│           ├── mar_deva_narrow.tsv
│           ├── mdf_cyrl_broad.tsv
│           ├── mfe_latn_broad.tsv
│           ├── mfe_latn_narrow.tsv
│           ├── mga_latn_broad.tsv
│           ├── mic_latn_broad.tsv
│           ├── mic_latn_narrow.tsv
│           ├── mkd_cyrl_narrow.tsv
│           ├── mlg_latn_broad.tsv
│           ├── mlt_latn_broad.tsv
│           ├── mlt_latn_broad_filtered.tsv
│           ├── mnc_mong_narrow.tsv
│           ├── mnw_mymr_broad.tsv
│           ├── mon_cyrl_broad.tsv
│           ├── mon_cyrl_narrow.tsv
│           ├── mqs_latn_broad.tsv
│           ├── msa_arab_ara_broad.tsv
│           ├── msa_arab_ara_narrow.tsv
│           ├── msa_arab_broad.tsv
│           ├── msa_arab_narrow.tsv
│           ├── msa_latn_broad.tsv
│           ├── msa_latn_narrow.tsv
│           ├── mtq_latn_broad.tsv
│           ├── mww_latn_broad.tsv
│           ├── mya_mymr_broad.tsv
│           ├── mya_mymr_broad_filtered.tsv
│           ├── nap_latn_broad.tsv
│           ├── nap_latn_narrow.tsv
│           ├── nav_latn_broad.tsv
│           ├── nci_latn_broad.tsv
│           ├── nci_latn_narrow.tsv
│           ├── nds_latn_broad.tsv
│           ├── nep_deva_narrow.tsv
│           ├── new_deva_narrow.tsv
│           ├── nhg_latn_narrow.tsv
│           ├── nhn_latn_broad.tsv
│           ├── nhx_latn_broad.tsv
│           ├── niv_cyrl_broad.tsv
│           ├── nld_latn_broad.tsv
│           ├── nld_latn_broad_filtered.tsv
│           ├── nld_latn_narrow.tsv
│           ├── nmy_latn_narrow.tsv
│           ├── nno_latn_broad.tsv
│           ├── nno_latn_narrow.tsv
│           ├── nob_latn_broad.tsv
│           ├── nob_latn_broad_filtered.tsv
│           ├── nob_latn_narrow.tsv
│           ├── non_latn_broad.tsv
│           ├── nor_latn_broad.tsv
│           ├── nrf_latn_broad.tsv
│           ├── nup_latn_broad.tsv
│           ├── nya_latn_broad.tsv
│           ├── oci_latn_broad.tsv
│           ├── oci_latn_narrow.tsv
│           ├── ofs_latn_broad.tsv
│           ├── okm_hang_broad.tsv
│           ├── okm_hang_narrow.tsv
│           ├── olo_latn_broad.tsv
│           ├── orv_cyrl_broad.tsv
│           ├── osp_latn_broad.tsv
│           ├── osx_latn_broad.tsv
│           ├── ota_arab_broad.tsv
│           ├── ota_arab_narrow.tsv
│           ├── pag_latn_broad.tsv
│           ├── pag_latn_narrow.tsv
│           ├── pam_latn_broad.tsv
│           ├── pam_latn_narrow.tsv
│           ├── pan_arab_broad.tsv
│           ├── pan_guru_broad.tsv
│           ├── pan_guru_narrow.tsv
│           ├── pbv_latn_broad.tsv
│           ├── pcc_latn_broad.tsv
│           ├── pdc_latn_broad.tsv
│           ├── phl_latn_broad.tsv
│           ├── pjt_latn_narrow.tsv
│           ├── pms_latn_broad.tsv
│           ├── pol_latn_broad.tsv
│           ├── por_latn_bz_broad.tsv
│           ├── por_latn_bz_broad_filtered.tsv
│           ├── por_latn_bz_narrow.tsv
│           ├── por_latn_po_broad.tsv
│           ├── por_latn_po_broad_filtered.tsv
│           ├── por_latn_po_narrow.tsv
│           ├── pox_latn_broad.tsv
│           ├── ppl_latn_broad.tsv
│           ├── pqm_latn_broad.tsv
│           ├── pqm_latn_narrow.tsv
│           ├── pus_arab_broad.tsv
│           ├── rgn_latn_broad.tsv
│           ├── rgn_latn_narrow.tsv
│           ├── rom_latn_broad.tsv
│           ├── ron_latn_broad.tsv
│           ├── ron_latn_narrow.tsv
│           ├── ron_latn_narrow_filtered.tsv
│           ├── rup_latn_narrow.tsv
│           ├── rus_cyrl_narrow.tsv
│           ├── sah_cyrl_broad.tsv
│           ├── san_deva_broad.tsv
│           ├── san_deva_narrow.tsv
│           ├── sce_latn_broad.tsv
│           ├── scn_latn_broad.tsv
│           ├── scn_latn_narrow.tsv
│           ├── sco_latn_broad.tsv
│           ├── sco_latn_narrow.tsv
│           ├── sdc_latn_broad.tsv
│           ├── sga_latn_broad.tsv
│           ├── sga_latn_narrow.tsv
│           ├── shn_mymr_broad.tsv
│           ├── sia_cyrl_broad.tsv
│           ├── sid_latn_broad.tsv
│           ├── sin_sinh_broad.tsv
│           ├── sin_sinh_narrow.tsv
│           ├── sjd_cyrl_broad.tsv
│           ├── skr_arab_broad.tsv
│           ├── slk_latn_broad.tsv
│           ├── slk_latn_narrow.tsv
│           ├── slr_latn_broad.tsv
│           ├── slr_latn_narrow.tsv
│           ├── slv_latn_broad.tsv
│           ├── slv_latn_broad_filtered.tsv
│           ├── slv_latn_narrow.tsv
│           ├── sme_latn_broad.tsv
│           ├── sms_latn_broad.tsv
│           ├── snd_arab_broad.tsv
│           ├── spa_latn_ca_broad.tsv
│           ├── spa_latn_ca_broad_filtered.tsv
│           ├── spa_latn_ca_narrow.tsv
│           ├── spa_latn_la_broad.tsv
│           ├── spa_latn_la_broad_filtered.tsv
│           ├── spa_latn_la_narrow.tsv
│           ├── sqi_latn_broad.tsv
│           ├── sqi_latn_narrow.tsv
│           ├── srd_latn_broad.tsv
│           ├── srd_latn_narrow.tsv
│           ├── srn_latn_broad.tsv
│           ├── srs_latn_broad.tsv
│           ├── stq_latn_broad.tsv
│           ├── swa_latn_broad.tsv
│           ├── swe_latn_broad.tsv
│           ├── swe_latn_narrow.tsv
│           ├── syc_syrc_narrow.tsv
│           ├── syl_sylo_broad.tsv
│           ├── szl_latn_broad.tsv
│           ├── tam_taml_broad.tsv
│           ├── tam_taml_narrow.tsv
│           ├── tby_latn_narrow.tsv
│           ├── tel_telu_broad.tsv
│           ├── tel_telu_narrow.tsv
│           ├── tft_latn_broad.tsv
│           ├── tft_latn_narrow.tsv
│           ├── tgk_cyrl_broad.tsv
│           ├── tgk_cyrl_narrow.tsv
│           ├── tgl_latn_broad.tsv
│           ├── tgl_latn_narrow.tsv
│           ├── tha_thai_broad.tsv
│           ├── tkl_latn_narrow.tsv
│           ├── ton_latn_broad.tsv
│           ├── tpw_latn_broad.tsv
│           ├── tru_syrc_broad.tsv
│           ├── tuk_latn_broad.tsv
│           ├── tur_latn_broad.tsv
│           ├── tur_latn_narrow.tsv
│           ├── tur_latn_narrow_filtered.tsv
│           ├── twf_latn_broad.tsv
│           ├── tyv_cyrl_broad.tsv
│           ├── tzm_tfng_broad.tsv
│           ├── tzm_tfng_narrow.tsv
│           ├── uby_cyrl_narrow.tsv
│           ├── uig_arab_ara_broad.tsv
│           ├── uig_arab_broad.tsv
│           ├── ukr_cyrl_narrow.tsv
│           ├── urd_arab_broad.tsv
│           ├── urd_arab_narrow.tsv
│           ├── urk_thai_broad.tsv
│           ├── urk_thai_narrow.tsv
│           ├── uzb_latn_broad.tsv
│           ├── uzb_latn_narrow.tsv
│           ├── vie_latn_hanoi_narrow.tsv
│           ├── vie_latn_hanoi_narrow_filtered.tsv
│           ├── vie_latn_hue_narrow.tsv
│           ├── vie_latn_hue_narrow_filtered.tsv
│           ├── vie_latn_saigon_narrow.tsv
│           ├── vie_latn_saigon_narrow_filtered.tsv
│           ├── vol_latn_broad.tsv
│           ├── vol_latn_narrow.tsv
│           ├── vot_latn_broad.tsv
│           ├── vot_latn_narrow.tsv
│           ├── wau_latn_broad.tsv
│           ├── wbk_latn_broad.tsv
│           ├── wiy_latn_broad.tsv
│           ├── wlm_latn_broad.tsv
│           ├── wln_latn_broad.tsv
│           ├── xal_cyrl_broad.tsv
│           ├── xho_latn_narrow.tsv
│           ├── xsl_latn_narrow.tsv
│           ├── ybi_deva_broad.tsv
│           ├── ycl_latn_narrow.tsv
│           ├── yid_hebr_broad.tsv
│           ├── yid_hebr_narrow.tsv
│           ├── yor_latn_broad.tsv
│           ├── yrk_cyrl_narrow.tsv
│           ├── yue_hani_broad.tsv
│           ├── yue_latn_broad.tsv
│           ├── yux_cyrl_narrow.tsv
│           ├── zha_latn_broad.tsv
│           ├── zho_hani_broad.tsv
│           ├── zho_latn_broad.tsv
│           ├── zom_latn_broad.tsv
│           ├── zul_latn_broad.tsv
│           └── zza_latn_narrow.tsv
├── pyproject.toml
├── requirements.txt
├── src/
│   └── wikipron/
│       ├── __init__.py
│       ├── cli.py
│       ├── config.py
│       ├── extract/
│       │   ├── __init__.py
│       │   ├── blt.py
│       │   ├── cmn.py
│       │   ├── core.py
│       │   ├── default.py
│       │   ├── eng.py
│       │   ├── jpn.py
│       │   ├── khb.py
│       │   ├── khm.py
│       │   ├── lat.py
│       │   ├── shn.py
│       │   ├── tha.py
│       │   ├── vie.py
│       │   └── yue.py
│       ├── html_utils.py
│       ├── languagecodes.py
│       ├── py.typed
│       ├── scrape.py
│       └── typing.py
└── tests/
    ├── __init__.py
    ├── test_data/
    │   ├── __init__.py
    │   ├── test_scrape.py
    │   ├── test_split.py
    │   └── test_summary.py
    └── test_wikipron/
        ├── __init__.py
        ├── test_cli.py
        ├── test_config.py
        ├── test_extract.py
        ├── test_languagecodes.py
        ├── test_scrape.py
        └── test_version.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .circleci/config.yml
================================================
version: 2.1

orbs:
  win: circleci/windows@5.0


jobs:
  pre-build:
    description: A check that needs to be done on only one supported Python version
    parameters:
      command-run:
        type: string
    docker:
      # Use the latest Python 3.x image from CircleCI that WikiPron supports.
      # See: https://circleci.com/developer/images/image/cimg/python
      - image: cimg/python:3.14
        auth:
          username: $DOCKERHUB_USERNAME
          password: $DOCKERHUB_PASSWORD
    steps:
      - checkout
      - run:
          command: pip install -r requirements.txt
      - run:
          command: << parameters.command-run >>

  build-python:
    parameters:
      python-version:
        type: string
    docker:
      - image: cimg/python:<< parameters.python-version >>
        auth:
          username: $DOCKERHUB_USERNAME
          password: $DOCKERHUB_PASSWORD
    steps:
      - checkout
      - run:
          name: Build source distribution and install package from it
          command: |
              pip install -r requirements.txt && \
              python -m build && \
              pip install dist/`ls dist/ | grep .whl`
      - run:
          name: Show installed Python packages
          command: pip list -v
      - run:
          name: Run Python tests
          command: |
              pytest -vv tests --junitxml /tmp/testxml/report.xml
      - store_test_results:
          path: /tmp/testxml/

  build-python-win:
    executor:
      name: win/default
      shell: powershell.exe
    steps:
      - checkout
      - run: systeminfo
      - run:
          name: Run tests on Windows
          shell: bash.exe
          command: |
            python --version && \
            pip install -r requirements.txt && \
            pip install . && \
            pip list && \
            pytest -vv tests

workflows:
  version: 2
  build-and-test:
    jobs:
      - pre-build:
          name: flake8
          command-run: flake8 --extend-ignore E203 data src tests
      - pre-build:
          name: black
          command-run: black --line-length 79 --check data src tests
      - pre-build:
          name: mypy
          command-run: mypy src tests
      - pre-build:
          name: twine
          command-run: |
            python -m build && \
            twine check dist/`ls dist/ | grep .tar.gz` && \
            twine check dist/`ls dist/ | grep .whl`
      - build-python:
          requires:
            - flake8
            - black
            - mypy
            - twine
          matrix:
            parameters:
              python-version: ["3.11", "3.12", "3.13", "3.14"]
      - build-python-win:
          requires:
            - flake8
            - black
            - mypy
            - twine


================================================
FILE: .github/pull_request_template.md
================================================
- [ ] Updated `Unreleased` in `CHANGELOG.md` to reflect the changes in code or data.


================================================
FILE: .gitignore
================================================
__pycache__/
.ipynb_checkpoints
.mypy_cache/
*.py[cdo]
*.egg-info/
*.log
env/
.idea/
.DS_Store
.vscode/

# Temporary data files.
**/tars
**/freq_tsvs
data/frequencies/tsv/*
data/frequencies/tgz/*
data/morphology/tsv/*
data/scrape/unscraped.json


================================================
FILE: CHANGELOG.md
================================================
Changelog
=========

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic
Versioning](http://semver.org/spec/v2.0.0.html).

Unreleased
----------

### Under `data/`

-  Sort `languages.json` and `unmatched_languages.json` by keys alphabetically. (\#582)
-  Updates Latin (`lat`). (\#581)
-  Implements multi-config scraping, avoiding redundant HTTP requests for the same
   Wiktionary page for all combinations of broad/narrow transcriptions and dialects. (\#581)
-  Updates English (`eng`). (\#568)
-  Adds Uzbek (`uzb`). (\#565)
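
The multi-config scraping entry above can be illustrated with a minimal sketch. It assumes each configuration (for example a broad/narrow and dialect combination) reduces to a function from page HTML to pronunciations. The names `scrape_all`, `fetch`, and `extractors` are hypothetical and are not WikiPron's actual API; the point is only that each page is fetched once and shared by every configuration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical type for illustration: page HTML -> list of pronunciations.
Extractor = Callable[[str], List[str]]


def scrape_all(
    words: Iterable[str],
    extractors: Dict[str, Extractor],
    fetch: Callable[[str], str],
) -> Dict[str, List[Tuple[str, str]]]:
    """Fetches each word's page once and applies every extractor to it."""
    results: Dict[str, List[Tuple[str, str]]] = {
        name: [] for name in extractors
    }
    for word in words:
        # One request per page, regardless of how many configs there are.
        html = fetch(word)
        for name, extract in extractors.items():
            for pron in extract(html):
                results[name].append((word, pron))
    return results
```

With N configurations and W words, this issues W requests rather than N × W, which is the saving the entry describes.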

#### Changed

-   Adds custom English extractor to increase the consistency of English r. (\#561)
-   Updated English data and made corrections to the pronunciation transcriptions, 
    removing ɾ, tʰ, y where needed. (\#555)
-   Rescrapes General American and UK Received Pronunciation data. (\#555)

### Under `src/` and elsewhere

-   Optimizes scraping with exponential backoff, session reuse, and API timeout. (\#579)
-   Replaces the unmaintained requests_html package with requests and lxml. (\#579)
-   Drops Python 3.10 support. (\#579)
-   Updates Latin XPath selectors. (\#578)
-   Replaces `--skip-parens`/`--no-skip-parens` with `--parens` accepting
    `skip`, `show`, or `expand`. The `expand` option generates all
    pronunciation variants from parenthesized content and is the new
    default. This is a breaking change for a v2 release. (\#577)
-   Drops Python 3.9 support and adds Python 3.14 support. (\#576)
-   Drops Python 3.8 support and adds Python 3.13 support. (\#562)
-   Updated dialect selector to use contains logic rather than exact match. (\#557)
-   Updated XPath selectors to use `contains` for the `IPA` class to accommodate Wikipedia CSS changes.
-   Updated Received Pronunciation for "Likert" in tests to match current Wiktionary data.
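
As a rough illustration of what the `expand` setting for `--parens` describes, the sketch below generates every variant of a transcription by keeping or dropping each parenthesized group. The function name `expand_parens` and the handling of only ASCII, non-nested parentheses are assumptions made for this example; WikiPron's own implementation may differ.

```python
import itertools
import re
from typing import List

# Matches one non-nested parenthesized group and captures its content.
_PARENS = re.compile(r"\(([^()]*)\)")


def expand_parens(pron: str) -> List[str]:
    """Returns all variants with each parenthesized group kept or dropped."""
    # With one capturing group, re.split alternates literal text (even
    # indices) with the captured group contents (odd indices).
    parts = _PARENS.split(pron)
    literals = parts[0::2]
    optionals = parts[1::2]
    variants: List[str] = []
    for choice in itertools.product((False, True), repeat=len(optionals)):
        pieces = [literals[0]]
        for keep, optional, literal in zip(choice, optionals, literals[1:]):
            if keep:
                pieces.append(optional)
            pieces.append(literal)
        variant = "".join(pieces)
        # Skip duplicates while preserving first-seen order.
        if variant not in variants:
            variants.append(variant)
    return variants
```

A transcription with k parenthesized groups yields up to 2^k variants, so a single optional segment such as `k(ə)n` produces two entries.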

[1.3.3] - 2024-07-27
--------------------

### Under `data/`

#### Changed

-   Updated Spanish configuration to include "Spain" in Castilian dialect selection. (\#553)
-   Rescrapes Spanish dialect data. (\#553)

### Under `src/` and elsewhere

#### Changed

-   Updated dialect selector to account for dialects without links. (\#553)

#### Added

-   Added test for Spanish dialect selection. (\#553)


[1.3.2] - 2024-07-17
--------------------

### Under `data/`

#### Changed

-   Rescrapes dialect data after \#548. (\#551)
-   Fixes dialect XPath selector. (\#548)
-   Fixes table alignment. (\#539)
-   Repeats big scrape after \#523. (\#536)
-   Fixes excessive line wrapping. (\#529)
-   Big scrape for 2024. (\#514)
-   Upstream cleaning for Bengali data. (\#547)

### Under `src/` and elsewhere

-   Upgrades `requests` for Dependabot. (\#541, \#544)
-   Upgrades `black` for Dependabot. (\#530)
-   Removes Min Nan (`nan`) custom selector. (\#529)

#### Added

-   Remove the case-folding attributes for the big scrape. (\#469)

#### Changed

-   Removed the case-folding test for the big scrape. (\#469)

[1.3.1] - 2024-03-02
--------------------

### Under `data/`

#### Added

-   Added KPI computation to `generate_summary.py`. (\#465)
-   Added "ː"-suffixed characters to list of valid IPAs. (\#497)
-   Added Bengali (`ben`) phonelist. (\#526)

#### Changed

-   Updated JSON to introduce Bengali dialect (Rarh and Dhaka). (\#526)
-   Updated Maltese (`mlt`) phonelist. (\#517)
-   Fixed path bug in `generate_summary.py`. (\#517)
-   Fixed CLI arg bug in `list_phones.py`. (\#516)
-   Big scrape for 2023. (\#512)
-   Moved IPAs of words with tildes to multiple lines. (\#379)
-   Caught `iso639.language.LanguageNotFoundError` error in `codes.py`. (\#498)
-   Renamed the two TSV summaries to `summary.tsv`. (\#494)
-   Renamed `generate_tsv_summary.py` to `generate_summary.py`. (\#492)
-   Upstream cleaning wrt English tie bar. (\#491)
-   Upstream cleaning wrt English high vowel and schwa. (\#493)
-   Fixed Georgian (`kat`) phones and rescrapes. (\#488)

### Under `src/` and elsewhere

#### Added

-   Added not-already-mentioned language names. (\#478)

#### Fixed

-   Fixed dialect selector. (\#513)

[1.3.0] - 2022-11-28
--------------------

### Under `data/`

#### Added

-   Big scrape for 2023. (\#512)
-   Moved IPAs of words with tildes to multiple lines. (\#379)
-   Caught `iso639.language.LanguageNotFoundError` error in `codes.py`. (\#498)
-   Added KPI computation to `generate_summary.py`. (\#465)
-   Added "ː"-suffixed characters to list of valid IPAs. (\#497)
-   Renamed the two TSV summaries to `summary.tsv`. (\#494)
-   Renamed `generate_tsv_summary.py` to `generate_summary.py`. (\#492)
-   Upstream cleaning wrt English tie bar. (\#491)
-   Upstream cleaning wrt English high vowel and schwa. (\#493)
-   Fixed Georgian (`kat`) phones and rescrapes. (\#488)
-   Big scrape for 2022. (\#464)
-   Added the `--fresh` flag to `data/scrape/scrape.py` to facilitate running the big scrape in batches. (\#464)
-   Added the `--exclude` flag for excluding one or more languages in `data/scrape/scrape.py`. (\#460)
-   Added `data/src/normalize.py`. (\#356)
-   Updated `README.md`. (\#360)
-   Added `data/cg/tsv/geo.tsv`. (\#367)
-   Added `data/morphology`. (\#369)
-   Added SIGMORPHON 2021 morphology data. (\#375)
-   Added `data/cg/tsv/jpn_hira.tsv`. (\#384)
-   Enforced final newlines. (\#387)
-   Adds all UniMorph languages to morphology. (\#393)
-   Added `data/covering_grammar/tsv/fre_latn_phonemic.tsv`. (\#398)
-   Added `data/covering_grammar/lib/make_test_file.py`. (\#396, \#399)
-   Added Komi-Zyrian (`kpv`). (\#400)
-   Added Makasar (`mak`). (\#415, \#419)
-   Added Zou (`zom`). (\#421)
-   Added Wiyot (`wiy`). (\#422)
-   Added Sidamo (`sid`). (\#423)
-   Added Central Atlas Tamazight (`tzm`). (\#429)
-   Added Chibcha (`chb`). (\#430)
-   Added Kashmiri (`kas`). (\#431)
-   Added Malayalam (`mal`). (\#434)
-   Added Dhivehi (`div`). (\#437)
-   Added Akkadian (`akk`). (\#441)
-   Added Central Nahuatl (`nhn`). (\#443)
-   Added Etruscan (`ett`). (\#444)
-   Added Gujarati (`guj`). (\#445)
-   Added Kannada (`kan`). (\#446)
-   Added Karelian (`krl`). (\#447)
-   Added Romagnol (`rgn`). (\#448)
-   Added Southern Yukaghir (`yux`). (\#449)
-   Added Urak Lawoi' (`urk`). (\#451)
-   Added Hausa (`ha`). (\#452)
-   Added Kashubian (`csb`). (\#453)
-   Added Tabaru (`tby`). (\#455)
-   Added West Makian (`mqs`). (\#457)
-   Added Amharic (`amh`). (\#458)
-   Added Livvi (`olo`). (\#459)
-   Added Kalmyk (`xal`). (\#472)
-   Added Ternate (`tft`). (\#473)
-   Added Abkhaz (`abk`). (\#474)
-   Added Farefare (`gur`). (\#475)
-   Added Iban (`iba`). (\#476)
-   Added Laz (`lzz`). (\#477)

#### Changed

-   Switched to ISO 639-3 language codes. (\#468)
-   Updated scraped data in preparation for the SIGMORPHON 2022 shared task:
    `swe nno ger dut ita rum ukr bel tgl ceb ben asm per pus tha lwl`. (\#461)
-   Made scripts under `data/frequencies/` and `data/morphology/` more flexible,
    especially for the purposes of preparing data for a shared task. (\#461)
-   Fixed the `--restriction` flag for specifying multiple languages in `data/scrape/scrape.py`. (\#460)
-   Added covering grammar coverage error log and specified error_type in error_analysis.py. (\#424)
-   Added error log writing in error_analysis.py. (\#420)
-   Added new columns in summary tables. (\#365)
-   Fixed broken paths in `data/src/generate_phones_summary.py` and in
    `data/phones/HOWTO.md`. (\#352)
-   Added Atong (India) (`aot`). (\#353)
-   Added Egyptian Arabic (`arz`). (\#354)
-   Added Lolopo (`ycl`). (\#355)
-   Fixed Unicode normalization in `data/phones/slv_phonemic.phones` and
    re-scraped Slovenian data. (\#356)
-   Updated `data/phones/HOWTO.md` to include instructions on applying the
    NFC Unicode normalization (\#357)
-   Updated `data/src/normalize.py` to be more efficient. (\#358)
-   Fixed inaccuracies in `data/phones/geo_phonemic.phones`. (\#367)
-   Fixed typo in `data/cg/tsv/geo.tsv` and added missing character. (\#370)
-   Morphology URLs are now provided as a list. (\#376)
-   Configured and scraped Yamphu (`ybi`). (\#380)
-   Configured and scraped Khumi Chin (`cnk`). (\#381)
-   Made summary generation in `common_characters.py` optional. (\#382)
-   Fixed phone counting in `data/src/generate_phones_summary.py` (\#390, \#392)
-   Reorganizes scraping scripts under `data/scrape` (\#394)
-   Reorganizes `.phones` files and related scripts under `data/phones` (\#395)
-   Reorganizes CG files and related scripts under `data/covering_grammar` (\#395)
-   Reorganized `data/phones/phones/fre_phonemic.phones` (\#398)
-   Removed `data/src/` (\#401)
-   Renamed TSV files and phonelists to use the terms "broad"/"narrow" instead
    of "phonemic"/"phonetic" (\#389, \#402, \#405)
-   Fixed typo in `README.md` (\#407)
-   Fixed column ordering of the test file read by the script in
    `data/covering_grammar/lib/error_analysis.py` (\#411)
-   Fixed Common character collection in `common_characters.py` (\#419)
-   Scraping test fixed for `blt`. (\#436)
-   Changed URLs to point at CUNY-CL repo, where applicable. (\#438)

### Under `src/` and elsewhere

#### Added

-  Adds Python 3.12 support. (\#520)
-  Temporarily disables Latin testing in light of \#514. (\#519)
-  Fixed dialect selectors for languages other than Latin. (\#511)
-  Moved `wikipron/` directory under `src/` and adjusted package finding. (\#508)
-  Added documentation about selecting transcription level. (\#502)
-  Added `ckb` in `languagecodes.py`. (\#464)
-  Added support for Python 3.10. (\#462)
-  Added test of phones list generation in `test_data/test_summary.py` (\#363)
-  Added Min Nan extraction function. (\#397)
-  Added Tai Dam extraction function, configuration and initial scrape. (\#435)
-  Added test of `casefold` value for languages in `data/scrape/lib/languages.json` (\#442)
-  Added support for Python 3.11. (\#479)
-  Added checks for the Python source distribution and wheel on CI. (\#479)
-  Turned on tests for Windows on CI. (\#479)

#### Removed

-  Dropped support for Python 3.6. (\#462)
-  Dropped support for Python 3.7. (\#479)

#### Changed

-   Fixed missing logging for proto-languages. (\#505)
-   Switched to ISO 639-3 language codes. (\#468)
-   Converted `setup.py` to `pyproject.toml`. (\#479)

[1.2.0] - 2021-01-30
--------------------

### Under `data/`

#### Added

-   Added generate_phones_summary.py, generating `./phones/README.md` and `./phones/phones_summary.tsv`. (\#344)
-   Added Afrikaans whitelists, filtered TSV file, rescraped phonemic and phonetic TSV files. (\#311)
-   Added German whitelists and filtered TSV file. (\#285)
-   Added whitelisting capabilities to `postprocess`. (\#152)
-   Added whitelists for Dutch, English, Greek, Latin, Korean, and Spanish.
    (\#158, etc.)
-   Logged dialect configuration if specified. (\#133)
-   Added typing to big scrape code. (\#140)
-   Added argparse to allow limiting 'big scrape' to individual languages
    with `--restriction` flag. (\#154)
-   Added Manchu (`mnc`). (\#185)
-   Added Polabian (`pox`). (\#186)
-   Added `aar`, `bdq`, `jje`, and `lsi`. (\#202)
-   Added `tyv` to `languagecodes.py` (\#203, \#205)
-   Added `bcl`, `egl`, `izh`, `ltg`, `azg`, `kir` and `mga` to `languagecodes.py`. (\#205)
-   Added `nep` to `languagecodes.py`. (\#206)
-   Added Ingrian (`izh`). (\#215)
-   Added French phoneme list and filtered TSV file. (\#213, \#217)
-   Added Corsican (`cos`). (\#222)
-   Added Middle Korean (`okm`). (\#223)
-   Added Middle Irish (`mga`). (\#224)
-   Added Old Portuguese (`opt`). (\#225)
-   Added Serbo-Croatian phoneme list and filtered TSV files. (\#227)
-   Added Tuvan (`tyv`). (\#228)
-   Added Shan (`shn`) with custom extraction. (\#229)
-   Added Northern Kurdish (`kmr`). (\#243)
-   Added a script to facilitate the creation of a `.phones` file. (\#246)
-   Added IPA validity checks for phonemes. (\#248)
-   Split multiple pronunciations joined by tilde in `eng_us_phonetic`.
-   Added Italian phoneme list and filtered TSV file. (\#260, \#261)
-   Added Adyghe phone list and filtered TSV file. (\#262, \#263)
-   Added Bulgarian phoneme list and filtered TSV file. (\#264, \#267)
-   Added Icelandic phoneme list and filtered TSV file. (\#269, \#270)
-   Added Slovenian phoneme list and filtered TSV file. (\#271, \#273)
-   Added normalization to `list_phones.py`. Corrected errors relating to
    `ipapy` (\#275)
-   Added Welsh .phones lists and filtered TSV files. (\#274, \#276)
-   Added draft of covering grammar script. (\#297)
-   Updated `data/phones/README.md` with instructions to re-scrape. (\#279, \#281)
-   Added Vietnamese `.phones` files and re-scraped and filtered `.tsv` files.
    (\#278, \#283)
-   Added Hindi `.phones` files and the re-scraped and filtered `.tsv` files.
    (\#282, \#284)
-   Added Old Frisian (`ofs`). (\#294)
-   Added Dungan (`dng`). (\#293)
-   Added Latgalian (`ltg`). (\#296)
-   Added Portuguese `.phones` files and re-scraped data. (\#290, \#304)
-   Added Japanese `.phones` files and re-scraped data. (\#230, \#307)
-   Added Moksha (`mdf`). (\#295)
-   Added Azerbaijani `.phones` files and re-scraped data. (\#306, \#312)
-   Added Turkish `.phones` file and re-scraped data. (\#313, \#314)
-   Added Maltese `.phones` file and re-scraped data. (\#317, \#318)
-   Added Latvian `.phones` file and re-scraped data. (\#321, \#322)
-   Added Khmer `.phones` file and re-scraped data. (\#324, \#327)
-   Added Østnorsk (Bokmål) `.phones` file and re-scraped data. (\#324, \#327)
-   Added SIGMORPHON 2021 frequencies JSON. (\#332)
-   Several languages added to `languagecodes.py`. (\#334)
-   Configured scripts for Kazakh (`kaz`). (\#345)
-   Added Eastern Lawa (`lwl`). (\#346)
-   Configuration for Western Lawa (`lcp`). (\#347)
-   Added Nyahkur (`cbn`). (\#348)
-   Split Tagalog (`tgl`) scripts into Latin and Baybayin, rescraped. (\#351)

#### Changed

-   Changed the name of the existing `./phones/README.md` to `./phones/HOWTO.md`. (\#344)
-   Edited the name of `generate_summary.py` to `generate_tsv_summary.py`. (\#344)
-   Edited the output file name of `generate_tsv_summary.py` to `tsv_summary.tsv`. (\#344)
-   Edited the arm_e_phonetic.phones and arm_w_phonetic.phones files. (\#298)
-   Improved printing in the README table. (\#145)
-   Renamed data directory `data`. (\#147)
-   Split `may` into Latin and Arabic files. (\#164)
-   Split `pan` into Gurmukhi and Shahmukhī. (\#169)
-   Split `uig` into Perso-Arabic and Cyrillic. (\#173)
-   Only allowed Latin spellings in Maltese lexicon. (\#166).
-   Split `mon` into Cyrillic and Mongol Bichig (\#179).
-   Merged whitelist.py into 'big scrape' script. src scrape.py now checks for
    existence of whitelist file during scrape to create second filtered TSV.
    New TSV placed under `tsv/\*\_filtered.tsv`. (\#154).
-   Updated `generate_summary.py` to reflect presence of 'filtered' tsv. (\#154)
-   Imperial Aramaic (`arc`) split into three scripts properly. (\#187)
-   Flattened data directory structure. (\#194)
-   Updated Georgian (`geo`) to take advantage of upstream bot-based
    consistency fixes. (\#138)
-   Split `arm` into Eastern and Western dialects. (\#197)
-   Rescraped files with new whitelists. (\#199)
-   Updated logging statements for consistency. (\#196)
-   Renamed `.whitelist` file extension name as `.phones`. (\#207)
-   Split `ban` into Latin and Balinese scripts. (\#214)
-   Split `kir` into Cyrillic and Arabic. (\#216)
-   Split Latin (`lat`) into its dialects. (\#233)
-   Added MyPy coverage for `wikipron`, `tests` and `data` directories. (\#247)
-   Modified paths in `codes.py`, `scrape.py`, and `split.py`. (\#251, \#256)
-   Modified config flags in `languages.json` and `scrape.py`. (\#258)
-   Edited Serbo-Croatian `.phones` file to list all vowel/pitch accent
    combinations. Re-scraped Serbo-Croatian data. (\#288)
-   Moved `list_phones.py` to parent directory. (\#265, \#266)
-   Moved `list_phones.py` to `src` directory. (\#297)
-   Frequencies code no longer overwrites TSV files. (\#320)
-   Updated `data/phones/README.md` to specify that `.phones` files should be
    in NFC normalization form. (\#333)
-   Kurdish (`kur`) and Opata (`opt`) removed from `languages.json`. (\#334)
-   Re-scraped Armenian data. Fixed an error in West Armenian phone list.
    (\#338)

#### Fixed

-   Fixed path issue with phonetic whitelisted files. (\#195)

### Under `wikipron/` and elsewhere

#### Added

-   Added positive flags for stress, syllable boundaries, tones, segment to `cli.py`. (\#141)
-   Added positive flags for space skipping to `cli.py`. (\#257)
-   Added two Vietnamese dialects to `languages.json`. (\#139)
-   Handled additional language codes. (\#132, \#148)
-   Added `--no-skip-spaces-word` and `--no-skip-spaces-pron` flag. (\#135)
-   Allowed ASCII apostrophes (0x27) in spellings. (\#172).
-   Added Vietnamese extraction function. (\#181).
-   Modified pron selector in Latin extraction function. (\#183).
-   Added `--no-tone` flag. (\#188)
-   Customized extractor and new scraped prons for `khb`. (\#219)
-   Added `tests/test_data` directory containing two tests. (\#226, \#251)
-   Added HTTP User-Agent header to API calls to Wiktionary. (\#234)
-   Added support for python 3.9 (\#240)
-   Added black style formatting to `.circleci/config.yml`. (\#242)
-   Added logging for scraping a language with `--dialect` specified
    that requires its custom extraction logic. (\#245)
-   Improved CircleCI workflow with orbs. (\#249)
-   Added `test_split.py` to `tests/test_data`. (\#256)
-   Handled Cantonese for scraping. (\#277)
-   Added exclusion for reconstructions. (\#302)
-   Added Vietnamese contour tone grouping test in `tests/test_config.py` (\#308)
-   Added restart functionality. (\#340)
-   Added very basic API for script detection. (\#341)
-   Added `--skip-parens` and `--no-skip-parens` flags. (\#343)

#### Changed

-   Renamed arguments to positive statements in `wikipron/config.py` and edited `_get_process_pron` function accordingly. (\#141, \#257)
-   Changed testing values used in `tests/test_config.py` in order to accommodate the addition of positive flags. (\#141)
-   Specified UTF-8 encoding in handling text files. (\#221)
-   Moved previous contents of `tests` into `tests/test_wikipron` (\#226)
-   Updated the packages version numbers in requirements.txt to their latest according to PyPI (\#239)
-   Updated the default pron selector to also look for IPA strings under paragraphs in addition to list items. (\#295)
-   Updated segments package version to 2.2.0 (\#308)

#### Removed

-   Moved Wiktionary querying functions from `test_languagecodes.py` to `codes.py` (\#205)

[1.1.0] - 2020-03-03
--------------------

#### Added

-   Added the extraction function for Mandarin Chinese and its scraped data. (\#124)
-   Integrated Wortschatz frequencies. (\#122)

#### Changed

-   Updated the Japanese extraction function and Japanese data. (\#129)
-   Updated all scraped Wiktionary data and frequency data. (\#127, \#128)
-   Generalized the splitting script in the big scrape. (\#123)
-   Moved small file removal to `generate_summary.py`. (\#119)
-   Updated Russian data. (\#115)

#### Fixed

-   Avoided and logged error in case of pron processing failure. (\#130)

[1.0.0] - 2019-11-29
----------------------

#### Added

-   Handled Japanese. (\#109, \#114)
-   Handled Latin, for which the actual graphemes cannot be the Wiktionary
    page titles and have to come from within the page. (\#92, \#93)
-   Handled Thai, whose pronunciations are embedded in HTML tables. (\#90)
-   Handled Khmer, whose pronunciations are embedded in HTML tables. (\#88)
-   IPA segmentation using spaces by default, with the `--no-segment` flag to
    optionally turn it off. (\#69, \#79, \#83, \#89, \#100)
-   Added TSV files for all Wiktionary languages with over 100 entries.
    (\#61, \#76, \#95, \#97, \#103, \#104)
-   Resolved Wiktionary language names for languages with at least 100
    pronunciation entries. (\#52, \#55)

#### Changed

-   Removed duplicate <word, pronunciation> pairs in the persisted data. (\#85, \#111, \#116)
-   Split Welsh into Northern Wales and Southern dialects in the persisted data. (\#110)
-   Factored out casefolding. (\#102)
-   Split Serbo-Croatian into Cyrillic and Latin TSVs. (\#96)
-   Generalized word and pronunciation extraction. (\#88)

#### Removed

-   Removed the timeout in smoke tests. (\#107)
-   Removed the `output` option. (\#82)
-   Removed the `require_dialect_label` option. (\#77)

#### Fixed

-   Skipped pronunciations with a dash. (\#106)
-   Skipped empty pronunciations in scraping. (\#59)
-   Updated the `<li>` XPath selector for an optional layer of `<span>` to cover
    previously unhandled languages (e.g., Korean). (\#50)
-   Updated the `<li>` XPath selector for
    `title="wikipedia:<language> phonology"` to cover previously unhandled
    languages (e.g., Estonian and Slovak). (\#49)

#### Security

-   Avoided using `exec` to retrieve the version string. Used `pkg_resources`
    instead. (\#63)

[0.1.1] - 2019-08-14
----------------------

#### Fixed

-   Fixed import bug. (\#45)

[0.1.0] - 2019-08-14
----------------------

First release.


================================================
FILE: CONTRIBUTING.md
================================================
# Contributing

Thank you for your interest in contributing to the `wikipron` codebase!

This page assumes that you have already created a fork of the `wikipron` repo
under your GitHub account and have the codebase available locally for
development work. If you have followed
[these steps](https://github.com/CUNY-CL/wikipron#development),
then you are all set.

## Working on a Feature or Bug Fix

The development steps below assume that your local Git repo has a remote
`upstream` link to `CUNY-CL/wikipron`:
   
```bash
git remote add upstream https://github.com/CUNY-CL/wikipron.git
```

After this step (which you only have to do once),
running `git remote -v` should show your local Git repo
has links to both "origin"
(pointing to your fork `<your-github-username>/wikipron`)
and "upstream" (pointing to `CUNY-CL/wikipron`).

To work on a feature or bug fix, here are the development steps: 

1. Before doing any work, check out the master branch and
   make sure that your local master branch is up-to-date with upstream master:
   
   ```bash
   git checkout master
   git pull upstream master
   ``` 
   
2. Create a new branch.
   This branch is where you will make commits of your work.
   (As best practice, never make commits while on a master branch.
   Running `git branch` tells you which branch you are on.)
   
   ```bash
   git checkout -b new-branch-name
   ```
   
3. Make as many commits as needed for your work.
4. When you feel your work is ready for a pull request,
   push your branch to your fork.

   ```bash
   git push origin new-branch-name
   ```
5. Go to your fork `https://github.com/<your-github-username>/wikipron` and
   create a pull request off of your branch against the `CUNY-CL/wikipron`
   repo.

6. Add an entry to
   [CHANGELOG.md](https://github.com/CUNY-CL/wikipron/blob/master/CHANGELOG.md),
   commit this change, and push this commit to your branch.

## Documentation

* If relevant, please update the top-level
  [README](https://github.com/CUNY-CL/wikipron/blob/master/README.md)
  for your changes.

* To document functions and class methods, please name them transparently and
  type them. If it helps, please add a one-liner docstring immediately
  under the function signature, in the form of `"""Docstring here"""` with
  triple double quotes. For more elaborate docstrings, please follow the
  [numpydoc docstring format](https://numpydoc.readthedocs.io/en/latest/format.html).

## Running Tests

The `wikipron` repo has continuous integration (CI) turned on,
with autobuilds running `pytest` for the test suite
(in the [`tests/`](tests) directory) and `flake8` and `mypy` for code checks.
If an autobuild at a pending pull request fails because of `pytest`, `flake8` or
`mypy` errors, then the errors must be fixed by further commits pushed to the
branch by the author.

If you would like to help avoid wasting free Internet resources
(every push triggers a new CI autobuild),
you can run the following checks locally before pushing commits:
* `uv run mypy src/wikipron/ tests/`
* `uv run flake8 src/wikipron/ tests/`
* `uv run black --line-length=79 --check src/wikipron/ tests/ data/`
    * You can fix any errors by running the same command without `--check`
* `uv run pytest tests/`


================================================
FILE: LICENSE.txt
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   Copyright 2019 Kyle Gorman, Jackson Lee, and contributors

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: README.md
================================================
# WikiPron

[![PyPI
version](https://badge.fury.io/py/wikipron.svg)](https://pypi.org/project/wikipron)
[![Supported Python
versions](https://img.shields.io/pypi/pyversions/wikipron.svg)](https://pypi.org/project/wikipron)
[![CircleCI](https://circleci.com/gh/CUNY-CL/wikipron/tree/master.svg?style=shield)](https://circleci.com/gh/CUNY-CL/wikipron/tree/master)
[![Paper](http://img.shields.io/badge/paper-ACL:2020.lrec--1.521-B31B1B.svg)](https://www.aclweb.org/anthology/2020.lrec-1.521/)
[![Conference](http://img.shields.io/badge/LREC-2020-4b44ce.svg)](https://lrec2020.lrec-conf.org/en/)

WikiPron is a command-line tool and Python API for mining multilingual
pronunciation data from Wiktionary, as well as a database of pronunciation
dictionaries mined using this tool.

-   [Command-line tool](#command-line-tool)
-   [Python API](#python-api)
-   [Data](#data)
-   [Models](#models)
-   [Development](#development)

If you use WikiPron in your research, please cite the following:

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean
Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman (2020). [Massively
multilingual pronunciation mining with
WikiPron](https://www.aclweb.org/anthology/2020.lrec-1.521/). In *Proceedings of
the 12th Language Resources and Evaluation Conference*, pages 4223-4228.
\[[bibtex](https://www.aclweb.org/anthology/2020.lrec-1.521.bib)\]

## Command-line tool

### Installation

```bash
pip install wikipron
```

### Usage

#### Quick start

After installation, the terminal command `wikipron` will be available. As a
basic example, the following command scrapes G2P data for French:

```bash
wikipron fra
```

#### Specifying the language

The language is indicated by a three-letter [ISO
639-3](https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes) language code,
e.g., `fra` for French. To see which languages can be scraped, consult the
[complete list](https://en.wiktionary.org/wiki/Category:Terms_with_IPA_pronunciation_by_language)
of languages on Wiktionary that have pronunciation entries.

#### Specifying the dialect

One can optionally specify dialects to target using the `--dialect` flag. The
dialect name can be found together with the transcription on Wiktionary, as in
"(UK, US) IPA: /təˈmɑːtəʊ/". To select the union of several dialects, separate
them with the pipe character '\|': e.g., `--dialect='General American | US'`.
Transcriptions which lack a dialect specification are selected regardless of the
value of this flag.
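
The selection rule described above can be sketched in a few lines of Python.
This is only an illustration of the stated behavior, not WikiPron's own
implementation; the function names `parse_dialects` and `is_selected` are
hypothetical.

```python
def parse_dialects(spec: str) -> set[str]:
    """Splits a pipe-delimited dialect specification into a set of names."""
    return {name.strip() for name in spec.split("|") if name.strip()}


def is_selected(entry_dialects: set[str], wanted: set[str]) -> bool:
    """Keeps entries with no dialect label, or with any requested dialect.

    Hypothetical helper: it mirrors the rule stated in the text, nothing more.
    """
    return not entry_dialects or bool(entry_dialects & wanted)


wanted = parse_dialects("General American | US")
print(is_selected({"UK", "US"}, wanted))  # Labeled with a requested dialect.
print(is_selected(set(), wanted))  # No dialect label at all.
print(is_selected({"UK"}, wanted))  # Labeled, but not with a requested one.
```

Under these assumptions, the first two entries are kept and the third is
dropped.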

#### Specifying the transcription level

By default, WikiPron selects broad transcriptions written between slashes /like
this/. One can instead select narrow transcriptions written in square brackets
\[like this\] using the `--narrow` flag. Note that some languages only have
broad or only narrow transcriptions (e.g., Russian only has the latter).

#### Segmentation

By default, the [`segments`](https://github.com/cldf/segments) library is used
to split the transcription into whitespace-delimited segments. The segmentation
tends to place IPA diacritics and modifiers on the "parent" symbol. For
instance, \[kʰæt\] is rendered `kʰ æ t`. This can be disabled using the
`--no-segment` flag.
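
To give a rough idea of what "placing diacritics on the parent symbol" means,
here is a minimal sketch using only the standard library. It is not the
algorithm used by the `segments` library; `naive_segment` is a hypothetical
helper that attaches combining marks (Unicode category `Mn`) and modifier
letters (`Lm`) to the preceding symbol, and keeps tie-barred symbols together.

```python
import unicodedata

# Combining double inverted breve and combining double breve below (tie bars).
TIE_BARS = {"\u0361", "\u035c"}


def naive_segment(pron: str) -> str:
    """Groups marks and modifier letters with the symbol they follow."""
    segments: list[str] = []
    for char in pron:
        is_mark = unicodedata.category(char) in ("Mn", "Lm")
        follows_tie = bool(segments) and segments[-1][-1] in TIE_BARS
        if segments and (is_mark or follows_tie):
            # Extends the current segment instead of starting a new one.
            segments[-1] += char
        else:
            segments.append(char)
    return " ".join(segments)


print(naive_segment("kʰæt"))  # kʰ æ t
```

This sketch does not treat stress marks or other special symbols separately, so
it should not be expected to reproduce WikiPron's output in general.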

#### Parentheses

Some transcriptions contain parentheses to indicate optional sounds (e.g.,
English *A&E* `/eɪ.ən(d)ˈiː/`). The `--parens` flag controls how they are handled:
`expand` (default) generates all variants, `skip` removes parentheses and
their content, and `show` keeps parentheses as-is in the output.
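
The three modes can be illustrated with a small stand-alone sketch. It assumes
non-nested parentheses and is not WikiPron's implementation; the function name
`handle_parens` and the order in which variants are returned are assumptions
made for this example.

```python
import itertools
import re

# Matches one parenthesized (optional) stretch without nested parentheses.
OPTIONAL = re.compile(r"\(([^()]*)\)")


def handle_parens(pron: str, mode: str = "expand") -> list[str]:
    """Returns the pronunciation variants implied by the chosen mode."""
    if mode == "show":
        return [pron]
    if mode == "skip":
        return [OPTIONAL.sub("", pron)]
    if mode == "expand":
        # With one capturing group, re.split alternates fixed and optional
        # stretches: fixed, optional, fixed, optional, ..., fixed.
        parts = OPTIONAL.split(pron)
        fixed, optional = parts[0::2], parts[1::2]
        variants = []
        for keep in itertools.product((False, True), repeat=len(optional)):
            pieces = [fixed[0]]
            for kept, opt, fix in zip(keep, optional, fixed[1:]):
                pieces.append(opt if kept else "")
                pieces.append(fix)
            variants.append("".join(pieces))
        return variants
    raise ValueError(f"unknown mode: {mode}")


print(handle_parens("eɪ.ən(d)ˈiː"))  # ['eɪ.ənˈiː', 'eɪ.əndˈiː']
```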

#### Output

The scraped data is organized with each \<word, pronunciation\> pair on its own
line, where the word and pronunciation are separated by a tab. Note that the
pronunciation is in [International Phonetic Alphabet
(IPA)](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet), segmented
by spaces such that combining and modifier diacritics stay attached to their
base symbol, which is convenient for modeling purposes; e.g., we have `kʰ æ t`
with the aspirated k instead of `k ʰ æ t`.

For illustration, here is a snippet of French data scraped by WikiPron:

```tsv
accrémentitielle    a k ʁ e m ɑ̃ t i t j ɛ l
accrescent  a k ʁ ɛ s ɑ̃
accrétion   a k ʁ e s j ɔ̃
accrétions  a k ʁ e s j ɔ̃
```

By default, the scraped data appears in the terminal. To save the data in a TSV
file, please redirect the standard output to a filename of your choice:

```bash
wikipron fra > fra.tsv
```
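
A saved TSV file can be read back with the standard library alone. The sketch
below assumes the tab-separated, space-segmented format described above;
`read_pron_tsv` is a hypothetical helper, and an in-memory sample stands in for
a file such as `fra.tsv`.

```python
import collections
import csv
import io


def read_pron_tsv(lines) -> dict[str, list[list[str]]]:
    """Maps each word to its pronunciations, each a list of segments."""
    lexicon = collections.defaultdict(list)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # Skips blank or malformed lines.
        word, pron = row[0], row[1]
        lexicon[word].append(pron.split())
    return dict(lexicon)


# In practice one would pass an open file object instead of a StringIO.
sample = io.StringIO(
    "accrescent\ta k ʁ ɛ s ɑ̃\n"
    "accrétion\ta k ʁ e s j ɔ̃\n"
)
lexicon = read_pron_tsv(sample)
print(lexicon["accrescent"])
```

A word may have several pronunciations, which is why each entry maps to a list
of segment lists.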

#### Advanced options

The `wikipron` terminal command has an array of options to configure your
scraping run. For a full list of the options, please run `wikipron -h`.

## Python API

The underlying module can also be used from Python. A standard workflow looks
like:

```python
import wikipron

config = wikipron.Config(key="fra")  # French, with default options.
for word, pron in wikipron.scrape(config):
    ...
```

## Data

We also make available [a database of over 3 million word/pronunciation
pairs](https://github.com/CUNY-CL/wikipron/tree/master/data) mined using
WikiPron.

## Models

We host grapheme-to-phoneme models and modeling software [in a separate
repository](https://github.com/kylebgorman/wikipron-modeling).

## Development

### Repository

The source code of WikiPron is hosted on GitHub at
[`https://github.com/CUNY-CL/wikipron`](https://github.com/CUNY-CL/wikipron),
where development also happens.

To get the latest changes not yet released through `pip`, or to work on the
codebase yourself, you may obtain the latest source code through GitHub and `git`:

1.  Create a fork of the `wikipron` repo on your GitHub account.

2.  Clone from your fork:

    ```bash
    git clone https://github.com/<your-github-username>/wikipron.git
    cd wikipron
    ```

3.  Set up a Python virtual environment. We recommend using [uv](https://docs.astral.sh/uv/):

    ```bash
    uv python install 3.14
    uv venv --python 3.14
    source .venv/bin/activate
    ```

4.  Install WikiPron in the "editable" mode together with the
    core and dev dependencies:

    ```bash
    uv pip install -e ".[dev]"
    ```

We keep track of notable changes in
[`CHANGELOG.md`](https://github.com/CUNY-CL/wikipron/blob/master/CHANGELOG.md).

### Contributing

For questions, bug reports, and feature requests, please [file an
issue](https://github.com/CUNY-CL/wikipron/issues).

If you would like to contribute to the `wikipron` codebase, please see
[CONTRIBUTING.md](https://github.com/CUNY-CL/wikipron/blob/master/CONTRIBUTING.md).

### License

WikiPron is released under an Apache 2.0 license. Please see
[LICENSE.txt](https://github.com/CUNY-CL/wikipron/blob/master/LICENSE.txt) for
details.

Please note that Wiktionary data in the
[`data/`](https://github.com/CUNY-CL/wikipron/tree/master/data) directory has
[its own licensing terms](https://en.wiktionary.org/wiki/Wiktionary:Copyrights).


================================================
FILE: data/README.md
================================================
Data directories
================

- The [scrape](./scrape) directory contains our "Big Scrape" scripts, the
  [TSVs](./scrape/tsv) created by those scripts, and two tables (a
  [README](./scrape/README.md) and a [TSV](./scrape/tsv_summary.tsv)) describing
  those TSVs.
  - More information on the "Big Scrape" scripts, including instructions on how
    to run your own scrape, can be found [here](./scrape/lib/README.md).
- The [phones](./phones) directory contains the [`.phones`](./phones/phones)
  files used to filter the TSVs produced by the "Big Scrape", scripts that
  facilitate the creation of `.phones` files, and two tables (a
  [README](./phones/README.md) and a [TSV](./phones/phones_summary.tsv))
  describing those `.phones` files.
  - More information on the files within the [phones](./phones) directory,
    including instructions on how to create your own `.phones` file, can be
    found [here](./phones/HOWTO.md).
- The [frequencies](./frequencies) directory contains scripts used to merge word
  frequencies into the TSVs produced by the "Big Scrape".
  - Details on the specific function of each script and how we acquire the
    frequencies can be found [here](./frequencies/README.md).
- The [morphology](./morphology) directory contains scripts that download
  UniMorph data for all languages covered by both UniMorph and the "Big Scrape".
  - Details can be found [here](./morphology/README.md).


================================================
FILE: data/covering_grammar/README.md
================================================
(TEMPORARY)

================================================
FILE: data/covering_grammar/lib/README.md
================================================
# Error analysis tool for grapheme-to-phoneme (G2P) conversion

This tool performs a fine-grained error analysis of a G2P model. It prints a
"performance matrix" comparing the gold and hypothesized results of a G2P model.

The performance matrix is a 2x2 table where the dimensions are:

-   whether the hypothesized pronunciation matches the gold pronunciation and
-   whether the hypothesized pronunciation adheres to the spelling rules of the
    language and script, according to a user-provided covering grammar.
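
As a rough sketch of how such a matrix can be tallied, the following
stand-alone example counts the four cells and converts them to percentages. It
does not use Pynini; the covering grammar check is passed in as a callable, and
the toy predicate used here (equal string lengths) is purely a placeholder.
The name `performance_matrix` is hypothetical.

```python
import collections


def performance_matrix(records, cg_matches) -> dict[tuple[bool, bool], float]:
    """Returns percentages keyed by (pronunciation match, CG match).

    `records` yields (orthography, gold, hypothesis) triples; `cg_matches`
    is any callable taking (orthography, hypothesis) and returning a bool.
    """
    counts = collections.Counter()
    for ortho, gold, hypo in records:
        counts[(gold == hypo, bool(cg_matches(ortho, hypo)))] += 1
    total = sum(counts.values())
    if not total:
        raise ValueError("no records to analyze")
    cells = [(True, True), (True, False), (False, True), (False, False)]
    return {cell: 100 * counts[cell] / total for cell in cells}


# Placeholder predicate standing in for a real covering grammar.
def toy_cg(ortho: str, hypo: str) -> bool:
    return len(ortho) == len(hypo)


records = [
    ("ab", "ab", "ab"),
    ("ab", "ab", "xy"),
    ("ab", "ab", "abc"),
    ("ab", "abc", "abc"),
]
matrix = performance_matrix(records, toy_cg)
for cell, percent in matrix.items():
    print(cell, f"{percent:.2f}")
```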

## Prerequisites

The script requires [PrettyTable](https://pypi.org/project/prettytable/) and
[Pynini](http://www.opengrm.org/twiki/bin/view/GRM/Pynini).

```bash
conda install -c conda-forge pynini
pip install prettytable
```

### Data

Two input files are required:

1.  Covering grammar: a two-column TSV file in which the left column contains
    zero or more graphemes, and the right contains zero or more phones they can
    correspond to.
2.  Test output: a three-column TSV file in which the columns are the graphemic
    form, the gold pronunciation, and the hypothesized pronunciation.
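
To illustrate what it means for a pronunciation to be licensed by a covering
grammar, here is a pure-Python stand-in that checks whether a word can be
rewritten as a pronunciation by concatenating grapheme-to-phone rules. The tool
itself relies on Pynini; the function `licensed` and the toy rules below are
assumptions made only for this sketch.

```python
def licensed(word: str, pron: str, rules: list[tuple[str, str]]) -> bool:
    """Returns True if `pron` can be built from `word` by chaining rules."""
    # reachable[i][j] is True if word[:i] can be rewritten as pron[:j].
    reachable = [[False] * (len(pron) + 1) for _ in range(len(word) + 1)]
    reachable[0][0] = True
    for i in range(len(word) + 1):
        for j in range(len(pron) + 1):
            if not reachable[i][j]:
                continue
            for grapheme, phone in rules:
                if not grapheme and not phone:
                    continue  # An empty-to-empty rule makes no progress.
                if word.startswith(grapheme, i) and pron.startswith(phone, j):
                    reachable[i + len(grapheme)][j + len(phone)] = True
    return reachable[len(word)][len(pron)]


# Toy rules; an empty right-hand side models a silent grapheme.
rules = [("c", "k"), ("c", "s"), ("a", "a"), ("t", "t"), ("e", "")]
print(licensed("cat", "kat", rules))
print(licensed("cate", "kat", rules))
print(licensed("cat", "kit", rules))
```

Here pronunciations are compared without spaces between phones, and matching
proceeds one Unicode code point at a time.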

## Example workflow

```bash
cd data/covering_grammar/lib
./error_analysis.py --cg_path=cg.tsv --test_path=test.tsv
```


================================================
FILE: data/covering_grammar/lib/covering_grammar.py
================================================
#!/usr/bin/env python
"""Creates covering grammar FST from TSV of correspondences."""

import argparse

import pynini

TOKEN_TYPES = ["byte", "utf8"]


def main(args: argparse.Namespace) -> None:
    input_token_type = (
        args.input_token_type
        if args.input_token_type in TOKEN_TYPES
        else pynini.SymbolTable.read_text(args.input_token_type)
    )
    output_token_type = (
        args.output_token_type
        if args.output_token_type in TOKEN_TYPES
        else pynini.SymbolTable.read_text(args.output_token_type)
    )
    cg = pynini.string_file(
        args.tsv_path,
        input_token_type=input_token_type,
        output_token_type=output_token_type,
    )
    cg.closure().optimize()
    cg.write(args.fst_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--input_token_type",
        default="utf8",
        help="input token type or path to symbol table (default: %(default)s)",
    )
    parser.add_argument(
        "--output_token_type",
        default="utf8",
        help="output token type or path to symbol table "
        "(default: %(default)s)",
    )
    parser.add_argument("tsv_path", help="path to input TSV")
    parser.add_argument("fst_path", help="path to output FST")
    main(parser.parse_args())


================================================
FILE: data/covering_grammar/lib/error_analysis.py
================================================
#!/usr/bin/env python
"""Error analysis tool for G2P.

Two input files are required:

1.  Covering grammar: a two-column TSV file in which the left column contains
    zero or more graphemes, and the right contains zero or more phones it can
    correspond to.
2.  Test output: a three-column TSV file in which the columns are the graphemic
    form, the gold pronunciation, and the hypothesized pronunciation.

Example:

espresso	ɛ s p ɹ ɛ s ə ʊ	ɛ k s p ɹ ɛ s ə ʊ
"""

__author__ = "Arundhati Sengupta"


import argparse
import csv
import datetime
import os

import prettytable  # type: ignore
import pynini
from pynini.lib import rewrite


def get_current_timestamp():
    return datetime.datetime.now().strftime("%m%d%Y_%H%M")


def log() -> str:
    error_log_dir = os.path.join(os.getcwd(), "logs")
    os.makedirs(error_log_dir, exist_ok=True)
    return error_log_dir


def match_pronunciation_rule(ortho, pron, cg_fst):
    try:
        return rewrite.matches(ortho, pron, cg_fst)
    except Exception:
        return False


def main(args: argparse.Namespace) -> None:
    with pynini.default_token_type("utf8"):
        cg_fst = pynini.string_file(args.cg_path).closure().optimize()
        rulematch_predmatch = 0
        rulematch_pred_notmatch = 0
        not_rulematch_predmatch = 0
        not_rulematch_pred_notmatch = 0
        total_records = 0
        error_log_dir = log()
        today_timestamp = get_current_timestamp()
        with open(args.test_path, "r", encoding="utf8") as source:
            with open(
                os.path.join(error_log_dir, today_timestamp + ".log"),
                "w",
                encoding="utf8",
            ) as log_file:
                fieldnames = ["Error_type", "Orthography", "Gold", "Hypo"]
                tsv_writer_object = csv.DictWriter(
                    log_file,
                    fieldnames=fieldnames,
                    delimiter="\t",
                    lineterminator="\n",
                )
                tsv_writer_object.writeheader()
                for line in source:
                    total_records += 1
                    ortho, gold_p, hypo_p = line.rstrip().split("\t", 2)
                    hypo_p = hypo_p.replace(" ", "")
                    gold_p = gold_p.replace(" ", "")
                    if match_pronunciation_rule(ortho, hypo_p, cg_fst):
                        if gold_p == hypo_p:
                            rulematch_predmatch += 1
                        else:
                            rulematch_pred_notmatch += 1
                            tsv_writer_object.writerow(
                                {
                                    "Error_type": "CG_match_Pron_non_match",
                                    "Orthography": ortho,
                                    "Gold": gold_p,
                                    "Hypo": hypo_p,
                                }
                            )
                    elif gold_p == hypo_p:
                        not_rulematch_predmatch += 1
                        tsv_writer_object.writerow(
                            {
                                "Error_type": "CG_non_match_pron_match",
                                "Orthography": ortho,
                                "Gold": gold_p,
                                "Hypo": hypo_p,
                            }
                        )
                    else:
                        not_rulematch_pred_notmatch += 1
                        tsv_writer_object.writerow(
                            {
                                "Error_type": "CG_non_match_pron_non_match",
                                "Orthography": ortho,
                                "Gold": gold_p,
                                "Hypo": hypo_p,
                            }
                        )
        # Collects percentages.
        rule_m_pred_nm = 100 * rulematch_pred_notmatch / total_records
        rule_m_pred_m = 100 * rulematch_predmatch / total_records
        rule_nm_pred_m = 100 * not_rulematch_predmatch / total_records
        rule_nm_pred_nm = 100 * not_rulematch_pred_notmatch / total_records
        # Builds and prints the table.
        print_table = prettytable.PrettyTable()
        print_table.field_names = ["", "CG match", "CG non-match"]
        print_table.add_row(
            ["Pron match", f"{rule_m_pred_m:.2f}", f"{rule_nm_pred_m:.2f}"]
        )
        print_table.add_row(
            [
                "Pron non-match",
                f"{rule_m_pred_nm:.2f}",
                f"{rule_nm_pred_nm:.2f}",
            ]
        )
        print(print_table)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--cg_path", required=True, help="path to TSV covering grammar file"
    )
    parser.add_argument(
        "--test_path", required=True, help="path to test TSV file"
    )
    main(parser.parse_args())


================================================
FILE: data/covering_grammar/lib/make_test_file.py
================================================
#!/usr/bin/env python
"""Makes test file.

Using gold data and the model output, this script creates a three-column TSV
file in which each row contains a word, its gold pronunciation, and the
predicted pronunciation, assuming that the input files have the words listed
in the same order."""

import argparse
import contextlib
import logging

from data.scrape.lib.codes import LOGGING_PATH


def main(args: argparse.Namespace) -> None:
    with contextlib.ExitStack() as stack:
        gf = stack.enter_context(open(args.gold, "r", encoding="utf-8"))
        pf = stack.enter_context(open(args.pred, "r", encoding="utf-8"))
        wf = stack.enter_context(open(args.out, "w", encoding="utf-8"))
        for lineno, (g_line, p_line) in enumerate(zip(gf, pf), 1):
            g_word, g_pron = g_line.rstrip().split("\t", 2)
            p_word, p_pron = p_line.rstrip().split("\t", 2)
            # Ensures the gold data and predictions have the same words.
            if g_word != p_word:
                logging.error("%s != %s (line %d)", g_word, p_word, lineno)
                raise SystemExit(1)
            print(f"{g_word}\t{g_pron}\t{p_pron}", file=wf)


if __name__ == "__main__":
    logging.basicConfig(
        format="%(filename)s %(levelname)s: %(message)s",
        handlers=[
            logging.FileHandler(LOGGING_PATH, mode="a"),
            logging.StreamHandler(),
        ],
        level="INFO",
    )
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "gold", help="TSV with words and correct pronunciations"
    )
    parser.add_argument(
        "pred", help="TSV with words and predicted pronunciations"
    )
    parser.add_argument("out", help="output file")
    main(parser.parse_args())


================================================
FILE: data/covering_grammar/tsv/ady_cyrl_narrow.tsv
================================================
а	aː
а	a
а	ʔa
а	ə
б	b
б	p
в	v
в	f
г	ɣ
г	ɡʷ
г	ɡ
гу	ɡʷ
гъ	ʁ
гъ	ʁʷ
гъ	ɡʷ
гъу	ʁʷ
гь	ɡʲ
д	d
дж	d͡ʒ
дж	d͡ʒʷ
дз	d͡z
дз	d͡ʒ
дз	d͡ʒʷ
дзу	d͡zʷ
е	ja
е	aj
е	a
е	j
е	
ё	jo
ж	ʒ
ж	ʑ
жъ	ʐ
жъ	ʐʷ
жъ	ʒʷ
жъу	ʐʷ
жъу	ʒʷ
жь	ʑ
з	z
и	jə
и	əj
и	ə
и	
й	j
й	
к	kʷ
к	k
ку	kʷ
кхъ	qʷ
кхъ	χʷ
къ	q
къ	qʷ
къу	qʷ
кӏ	t͡ʃʼ
кӏ	t͡ʃ
кӏ	kʼ
кӏ	kʷʼ
кӏ	kʲʼ
кӏ	kʲ
кӏу	kʷʼ
кӏь	kʲʼ
кӏь	kʲ
кӏь	kʼ
кь	kʲ
л	l
л	ɮ
л	ɬ
лъ	ɬ
лъ	l
лӏ	ɬʼ
лӏ	ɬ
м	m
н	n
о	a
о	wa
о	aw
п	p
пӏ	pʼ
пӏ	pʷʼ
пӏ	p
пӏу	pʷʼ
р	r
р	rʲ
с	s
сӏ	sʼ
т	t
тӏ	tʼ
тӏ	tʷʼ
тӏ	t
тӏу	tʷʼ
у	w
у	əw
у	wə
у	
у	ə
ф	f
фӏ	fʼ
х	x
ху	xʷ
хъ	χ
хъ	χʷ
хъ	ħ
хъу	χʷ
хь	ħ
ц	t͡s
ц	t͡sʷ
ц	t͡ʃʷ
ц	t͡ʃ
цу	t͡sʷ
цу	t͡ʃʷ
цӏ	t͡sʼ
цӏ	sʼ
цӏ	t͡s
цӏу	ʃʷʼ
ч	t͡ʃ
ч	t͡ʃʼ
ч	kʲ
чъ	t͡ʂ
чъ	t͡ʃ
чъ	t͡ʂʼ
чъу	t͡ʃʷ
чӏ	t͡ʂʼ
чӏ	t͡ʃʼ
чӏ	t͡ʃ
чӏу	ʈ͡ʂʷʼ
чу	tʃʷ
ш	ʃ
ш	ɕ
шъ	ʂ
шъ	ʂʷ
шъ	ʃʷ
шъу	ʂʷ
шъу	ʃʷ
шъу	ʃʷʼ
шъу	ɕʷ
шӏ	ʃʼ
шӏ	ʃʷʼ
шӏ	ʃʷ
шӏ	ʃ
шӏу	ʃʷʼ
шӏу	ʃʷ
щ	ɕ
щ	ʃ
щу	ɕʷ
щӀ	ɕʼ
ъ	
ы	ə
ы	əː
ы	
ь	′
ь	
э	a
э	ə
э	ɜ
э	aː
ю	ju
я	jaː
я	aː
ӏ	ʔ
ӏ	ʔʷ
ӏ	
ӏу	ʔʷ
ӏь	ʔʲ
	ə
	ʔ


================================================
FILE: data/covering_grammar/tsv/bul_cyrl_narrow.tsv
================================================
а	ɤ
а	a
а	ɐ
а	ə
а	
б	b
б	bʲ
б	p
в	f
в	v
в	vʲ
г	ɡ
г	ɡʲ
г	k
г	ɟ
д	d̪
д	d
д	dʲ
д	t̪
д	t
д	
е	ɛ
е	
ж	ʒ
ж	ʃ
ж	d͡ʒ
з	z
з	zʲ
з	s
и	i
и	
й	j
к	k
к	kʲ
к	c
л	l
л	ɫ
л	lʲ
л	ʎ
м	m
м	ɱ
м	mʲ
н	n
н	ɲ
н	ŋ
н	nʲ
н	ɱ
о	ɔ
о	o
о	o̝
о	
п	p
п	pʲ
р	r
р	ɾ
р	rʲ
с	s
с	sʲ
т	t̪
т	t
т	tʲ
у	u
у	o
у	o̝
ф	f
ф	fʲ
х	x
х	xʲ
ц	t͡s
ц	ts
ч	t͡ʃ
ч	t͡sʲ
ч	tʃ
ч	t͡ʃʃ
ш	ʃ
щ	ʃt̪
щ	ʃtʲ
щ	ʃt
ъ	ɤ
ъ	ɐ
ъ	ə
ь	
ю	ju
ю	u
я	jɤ
я	ɤ
я	jɐ
я	ɐ
я	jə
я	ə
я	ja
я	a
дз	d͡z
дз	d͡zʲ
дж	d͡ʒ
нн	nː


================================================
FILE: data/covering_grammar/tsv/fre_latn_broad.tsv
================================================
# CONSONANT LETTERS
b	b
b	p  # Before a voiceless consonant.
b	
c	s  # Before <e, i, y>. Before <æ, œ> in some Greek and Latin loans.
c	k
ç	s
cc	ks  # Before <e, i, y>.
cc	k
ch	ʃ
ch	k  # Mainly in loanwords.
ch	tʃ  # Mainly in loanwords.
ct	kt
ct	
d	d
d	
f	f
ff	f
g	ʒ  # Before <e, i, y>
g	ɡ
g	
g	dʒ  # Mainly in loanwords.
gg	ɡʒ  # Before <e, i, y>
gg	ɡ
gn	ɲ
h	
j	ʒ
j	dʒ  # Mainly in loanwords.
k	k
l	l
l	
ll	l
m	m
mm	m
n	n
ng	ŋ  # Mainly in loanwords.
nn	n
p	p
p	
ph	f
pp	p
pt	pt
pt	t
pt
q	k
r	ʁ
rr	ʁ
r	
s	s
s	z
s	
sc	s  # Before <e, i, y>.
sc	sk
sch	ʃ
sch	sk
st	st
st	
t	t
t	
tch	tʃ
th	t
v	v
w	w
w	v
x	ks
x	ɡz
x	s
x	z
x	
xc	ks  # Before <e, i, y>.
xc	ksk
z	z
z	
# VOWEL LETTERS
a	a
a	ɑ
à	a
â	ɑ
â	a  # Parisian French.
æ	e
aa	a
ae	e
ae	a
aë	aɛ
ai	ɛ
ai	e
ai	ə  # In conjugations of <faire>.
aî	ɛ
aï	ai
aï	aj
aie	ɛ
aie	ɛj
ao	ao
ao	aɔ
aô	ao
aô	aɔ
aou	au
aou	u
aoû	au
aoû	u
au	o
au	ɔ
ay	ɛj
ay	aj
ay	ɛ
aye	ɛi
aye	ɛj
e	ə
e	ɛ
e	e
e	
é	e
é	ɛ  # I don't think this pronunciation occurs in the reformed orthography.
ée	e
è	ɛ
ê	ɛ
ê	e
ê	ɛː  # Regional.
ea	i  # In loanwords.
eau	o
ee	i  # In loanwords.
ei	ɛ
eî	ɛ
eoi	wa
eu	ø
eu	œ
eu	y  # In conjugations of <avoir>.
eû	ø
ey	ɛj
ey	ɛ  # In loanwords.
i	i
i	j
î	i
ï	i
ï	j
o	o
o	ɔ
ô	o
œ	œ
œ	e
œ	ɛ
oê	wa
œu	ø
œu	œ
oi	wa
oi	wɑ
oî	wa
oie	wa
oie	wɑ
oo	ɔɔ
oo	u  # In loanwords.
ou	u
ou	w
où	u
oû	u
oue	u
oy	waj
oy	wa
oy	ɔy  # In conjugations of <ouïr>.
u	y
u	ɥ
û	y
ue	ɥɛ
ue	œ
ue	y
ue	
uy	ɥij
uy	yj
y	i
y	j
ÿ	i
# VOWELS/CONSONANT COMBINATIONS
am	ɑ̃
am[EOS]	am
an	ɑ̃
an	an
aen	ɑ̃
aën	ɑ̃
aim	ɛ̃
ain	ɛ̃
aon	ɑ̃
aw	o
cqu	k
cte[EOS]	t
em	ɑ̃
en	ɑ̃
en	ɛ̃
eim	ɛ̃
ein	ɛ̃
ent	ɑ̃
ent	
er	e
er	ɛʁ
es	
es	e
es	ɛ
eun	œ̃
ew	ju  # In loanwords.
ge	ʒ
gu	ɡ
gu	ɡy
gu	ɡɥ
il[EOS]	j
il[EOS]	il
il[EOS]	i
ilh	ij
ilh	j
ill	j
ill	ij
ill	il
im	ɛ̃
in	ɛ̃
în	ɛ̃
oin	wɛ̃
oën	wɛ̃
om	ɔ̃
on	ɔ̃
ow	o
qu	k
qu	kɥ
qu	kw
ti	ti
ti	tj
ti	si
ti	sj
um	œ̃
um	ɔm
un	œ̃
ym	ɛ̃
ym	im
yn	ɛ̃



================================================
FILE: data/covering_grammar/tsv/geo_geor_broad.tsv
================================================
ა	ɑ
ბ	b
გ	ɡ
დ	d
ე	ɛ
ვ	v
ზ	z
თ	tʰ
ი	i
კ	kʼ
ლ	l
მ	m
ნ	n
ო	ɔ
პ	pʼ
ჟ	ʒ
რ	r
ს	s
ტ	tʼ
უ	u
ფ	pʰ
ქ	kʰ
ღ	ɣ
ყ	qʼ
შ	ʃ
ჩ	tʃ
ც	ts
ძ	dz
წ	tsʼ
ჭ	tʃʼ
ხ	x
ჯ	dʒ
ჰ	h



================================================
FILE: data/covering_grammar/tsv/gre_grek_broad.tsv
================================================
ά	a
λά	al
β	v
λ	ʎ
λ	l
λλ	l
α	o
α	a
ο	a
ο	o
σ	s
σ	z
σσ	s
ς	s
π	p
μπ	b
τ	t
τ	d
ττ	t
φ	f
γ	ɣ
γ	ʝ
γγ	ŋɟ
γγ	ŋɡ
γγ	ŋɣ
λι	ʎ
ε	e
ι	ɲ
ι	ʝ
ι	i
ι	ç
ω	o
ν	n
ν	ɲ
νν	n
νι	ɲ
ννι	ɲ
λλι	ʎ
νν	ɲ
κ	k
κ	c
κ	ç
κ	ɟ
κκ	κ
κκ	c
γκ	g
ρ	ɾ
ρ	r
ρρ	r
η	i
χ	ç
χ	x
θ	θ
μ	ɱ
μ	m
μμ	m
υ	u
υ	i
υ	f
υ	ɲ
έ	e
έ	i
ή	i
ί	i
ό	o
ύ	v
ύ	f
ύ	u
ύ	i
ώ	o
ϊ	i
ϋ	i
ει	i
εί	i
γι	ʝ
λί	i
ου	u
ού	u
ού	iu
οι	i
οί	i
αί	e
αι	e
αυ	av
ἀ	a
ἄ	a
ἐ	e
ἡ	i
ἤ	i
ἱ	i
ὑ	i
ξ	ks
ξ	sk
ψ	ps
ζ	d
ζ	z
δ	v
δ	ð


================================================
FILE: data/covering_grammar/tsv/ice_latn_broad.tsv
================================================
a	aː
a	a
a	au
á	auː
á	au
æ	ai
æ	aiː
au	øy
au	øyː
b	p
d	tː
d	t
ð	ð
dd	tː
e	ɛ
e	ɛː
e	e
é	jɛː
é	jɛ
é	ɛ
é	ɛː
ei	ei
ei	eiː
ey	ei
ey	eiː
f	f
ff	f
ff	fː
f	p
f	v
g	k
g	cː
g	kː
g	c
g	ɣ
g	j
g	x
gg	kː
gg	cː
gg	k
gj	c
ggj	cː
ggj	c
h	h
h
h	ç
hj	ç
hl	l̥
hn	n̥
hr	r̥
i	i
i	ɪ
i	ɪː
í	i
í	iː
ía	iːja
j	j
j
j	c
k	kʰ
k	k
k	cʰ
k	c
kj	cʰ
kj	c
kk	c
kk	hc
kk	hk
kl	kʰ
l	l
l	lː
l	l̥
ll	tl
ll	tl̥
ll	l:
m	m
m	mː
m	m̥
mm	m
mm	mː
n	n
n	n̥
n	nː
ng	ŋk
ng	ŋc
ng	ŋ
nk	ŋ̊k
nn	n
nn	nː
nn	tn
o	ɔː
o	ɔ
ó	ou
ó	ouː
ö	œ
ö	œː
p	p
p	pʰ
pp	hp
r	r̥
r	rː
r	r
rl	rtl
rl	rtl̥ 
rn	rtn
s	s
s	sː
ss	sː
sl	stl
sn	stn
t	t
t	tʰ
t	ht
þ	θ
tt	ht
u	ʏ
u	ʏː
u	u
ú	uː
ú	u
v	v
x	s
x	ks
x	xs
y	i
y	ɪ
y	ɪː
ý	i
ý	iː
	 



================================================
FILE: data/covering_grammar/tsv/ita_latn_broad.tsv
================================================
a	a
à	a
b	b
c	k
c	t͡ʃ
d	d
e	e
e	ɛ
è	ɛ
é	e
f	f
g	ɡ
g	d͡ʒ
h	
i
i	i
i	j
i	i̯
ì	i
í	i
l	l
m	m
n	n
o	o
o	ɔ
ò	ɔ
ó	o
p	p
q	k
r	r
s	s
s	z
t	t
u	u
u	w
u	u̯
ù	u
ú	u
v	v
z	t͡s
z	d͡z
# "Foreign" letters
j	j
j	d͡ʒ
k	k
w	w
w	v
x	ks
y	i
y	j
# Digraphs/Trigraphs
ch	k
ci	t͡ʃ
ci	t͡ʃi
gh	ɡ
gi	d͡ʒ
gi	d͡ʒi
gli	ʎ
gli	ʎʎ
gli	ɡli
gn	ɲ
gn	ɲɲ
sc	ʃ
sc	ʃʃ
sc	sk
sci	ʃ
sci	ʃʃ
# Idiosyncratic to Wiktionary transcription style
cc	tt͡ʃ
cc	kk
gg	dd͡ʒ
gg	ɡɡ
zz	tt͡s
zz	dd͡z


================================================
FILE: data/covering_grammar/tsv/jpn_hira_narrow.tsv
================================================
あ	a̠
か	ka̠
きゃ	kʲa̠
さ	sa̠
しゃ	ɕa̠
た	ta̠
ちゃ	t͡ɕa̠
な	na̠
にゃ	ɲ̟a̠
は	ha̠
ひゃ	ça̠
ま	ma̠
みゃ	mʲa̠
や	ja̠
ら	ɾa̠
りゃ	ɾʲa̠
わ	ɰᵝa̠
が	ɡa̠
ぎゃ	ɡʲa̠
ざ	za̠
ざ	d͡za̠
じゃ	ʑa̠
じゃ	d͡ʑa̠
だ	da̠
ぢゃ	ʑa̠
ぢゃ	d͡ʑa̠
ば	ba̠
びゃ	bʲa̠
ぱ	pa̠
ぴゃ	pʲa̠
い	i
き	kʲi
し	ɕi
ち	t͡ɕi
に	ɲ̟i
ひ	çi
み	mʲi
り	ɾʲi
ぎ	ɡʲi
じ	ʑi
じ	d͡ʑi
ぢ	ʑi
ぢ	d͡ʑi
び	bi
ぴ	pi
う	ɯ̟ᵝ
う	ɨᵝ
く	kɯ̟ᵝ
く	kɨᵝ
きゅ	kʲɯ̟ᵝ
きゅ	kʲɨᵝ
す	sɯ̟ᵝ
す	sɨᵝ
しゅ	ɕɯ̟ᵝ
しゅ	ɕɨᵝ
つ	t͡sɯ̟ᵝ
つ	t͡sɨᵝ
ちゅ	t͡ɕɯ̟ᵝ
ちゅ	t͡ɕɨᵝ
ぬ	nɯ̟ᵝ
ぬ	nɨᵝ
にゅ	ɲ̟ɯ̟ᵝ
にゅ	ɲ̟ɨᵝ
ふ	ɸɯ̟ᵝ
ふ	ɸɨᵝ
ひゅ	çɯ̟ᵝ
ひゅ	çɨᵝ
む	mɯ̟ᵝ
む	mɨᵝ
みゅ	mʲɯ̟ᵝ
みゅ	mʲɨᵝ
ゆ	jɯ̟ᵝ
ゆ	jɨᵝ
る	ɾɯ̟ᵝ
る	ɾɨᵝ
りゅ	ɾʲɯ̟ᵝ
りゅ	ɾʲɨᵝ
ぐ	ɡɯ̟ᵝ
ぐ	ɡɨᵝ
ぎゅ	ɡʲɯ̟ᵝ
ぎゅ	ɡʲɨᵝ
ず	zɯ̟ᵝ
ず	zɨᵝ
ず	d͡zɯ̟ᵝ
ず	d͡zɨᵝ
じゅ	ʑɯ̟ᵝ
じゅ	ʑɨᵝ
じゅ	d͡ʑɯ̟ᵝ
じゅ	d͡ʑɨᵝ
づ	zɯ̟ᵝ
づ	zɨᵝ
づ	d͡zɯ̟ᵝ
づ	d͡zɨᵝ
ぢゅ	ʑɯ̟ᵝ
ぢゅ	ʑɨᵝ
ぢゅ	d͡ʑɯ̟ᵝ
ぢゅ	d͡zɨᵝ
ぶ	bɯ̟ᵝ
ぶ	bɨᵝ
びゅ	bʲɯ̟ᵝ
びゅ	bʲɨᵝ
ぷ	pɯ̟ᵝ
ぷ	pɨᵝ
ぴゅ	pʲɯ̟ᵝ
ぴゅ	pʲɨᵝ
え	e̞
け	ke̞
せ	se̞
て	te̞
ね	ne̞
へ	he̞
め	me̞
れ	ɾe̞
げ	ɡe̞
ぜ	ze̞
ぜ	d͡ze̞
で	de̞
べ	be̞
ぺ	pe̞
お	o̞
こ	ko̞
きょ	kʲo̞
そ	so̞
しょ	ɕo̞
と	to̞
ちょ	t͡ɕo̞
の	no̞
にょ	ɲ̟o̞
ほ	ho̞
ひょ	ço̞
も	mo̞
みょ	mʲo̞
よ	jo̞
ろ	ɾo̞
りょ	ɾʲo̞
を	o̞
ご	ɡo̞
ぎょ	ɡʲo̞
ぞ	zo̞
ぞ	d͡zo̞
じょ	ʑo̞
じょ	d͡ʑo̞
ど	do̞
ぢょ	ʑo̞
ぢょ	d͡ʑo̞
ぼ	bo̞
びょ	bʲo̞
ぽ	po̞
ぴょ	pʲo̞
# WITH NASALIZED VOWELS (i.e., before <ん>)
あ	ã̠
か	kã̠
きゃ	kʲã̠
さ	sã̠
しゃ	ɕã̠
た	tã̠
ちゃ	t͡ɕã̠
な	nã̠
にゃ	ɲ̟ã̠
は	hã̠
ひゃ	çã̠
ま	mã̠
みゃ	mʲã̠
や	jã̠
ら	ɾã̠
りゃ	ɾʲã̠
わ	ɰᵝã̠
が	ɡã̠
ぎゃ	ɡʲã̠
ざ	zã̠
ざ	d͡zã̠
じゃ	ʑã̠
じゃ	d͡ʑã̠
だ	dã̠
ぢゃ	ʑã̠
ぢゃ	d͡ʑã̠
ば	bã̠
びゃ	bʲã̠
ぱ	pã̠
ぴゃ	pʲã̠
い	ĩ
き	kʲĩ
し	ɕĩ
ち	t͡ɕĩ
に	ɲ̟ĩ
ひ	çĩ
み	mʲĩ
り	ɾʲĩ
ぎ	ɡʲĩ
じ	ʑĩ
じ	d͡ʑĩ
ぢ	ʑĩ
ぢ	d͡ʑĩ
び	bĩ
ぴ	pĩ
う	ɯ̟̃ᵝ
う	ɨ̃ᵝ
く	kɯ̟̃ᵝ
く	kɨ̃ᵝ
きゅ	kʲɯ̟̃ᵝ
きゅ	kʲɨ̃ᵝ
す	sɯ̟̃ᵝ
す	sɨ̃ᵝ
しゅ	ɕɯ̟̃ᵝ
しゅ	ɕɨ̃ᵝ
つ	t͡sɯ̟̃ᵝ
つ	t͡sɨ̃ᵝ
ちゅ	t͡ɕɯ̟̃ᵝ
ちゅ	t͡ɕɨ̃ᵝ
ぬ	nɯ̟̃ᵝ
ぬ	nɨ̃ᵝ
にゅ	ɲ̟ɯ̟̃ᵝ
にゅ	ɲ̟ɨ̃ᵝ
ふ	ɸɯ̟̃ᵝ
ふ	ɸɨ̃ᵝ
ひゅ	çɯ̟̃ᵝ
ひゅ	çɨ̃ᵝ
む	mɯ̟̃ᵝ
む	mɨ̃ᵝ
みゅ	mʲɯ̟̃ᵝ
みゅ	mʲɨ̃ᵝ
ゆ	jɯ̟̃ᵝ
ゆ	jɨ̃ᵝ
る	ɾɯ̟̃ᵝ
る	ɾɨ̃ᵝ
りゅ	ɾʲɯ̟̃ᵝ
りゅ	ɾʲɨ̃ᵝ
ぐ	ɡɯ̟̃ᵝ
ぐ	ɡɨ̃ᵝ
ぎゅ	ɡʲɯ̟̃ᵝ
ぎゅ	ɡʲɨ̃ᵝ
ず	zɯ̟̃ᵝ
ず	zɨ̃ᵝ
ず	d͡zɯ̟̃ᵝ
ず	d͡zɨ̃ᵝ
じゅ	ʑɯ̟̃ᵝ
じゅ	ʑɨ̃ᵝ
じゅ	d͡ʑɯ̟̃ᵝ
じゅ	d͡ʑɨ̃ᵝ
づ	zɯ̟̃ᵝ
づ	zɨ̃ᵝ
づ	d͡zɯ̟̃ᵝ
づ	d͡zɨ̃ᵝ
ぢゅ	ʑɯ̟̃ᵝ
ぢゅ	ʑɨ̃ᵝ
ぢゅ	d͡ʑɯ̟̃ᵝ
ぢゅ	d͡zɨ̃ᵝ
ぶ	bɯ̟̃ᵝ
ぶ	bɨ̃ᵝ
びゅ	bʲɯ̟̃ᵝ
びゅ	bʲɨ̃ᵝ
ぷ	pɯ̟̃ᵝ
ぷ	pɨ̃ᵝ
ぴゅ	pʲɯ̟̃ᵝ
ぴゅ	pʲɨ̃ᵝ
え	ẽ̞
け	kẽ̞
せ	sẽ̞
て	tẽ̞
ね	nẽ̞
へ	hẽ̞
め	mẽ̞
れ	ɾẽ̞
げ	ɡẽ̞
ぜ	zẽ̞
ぜ	d͡zẽ̞
で	dẽ̞
べ	bẽ̞
ぺ	pẽ̞
お	õ̞
こ	kõ̞
きょ	kʲõ̞
そ	sõ̞
しょ	ɕõ̞
と	tõ̞
ちょ	t͡ɕõ̞
の	nõ̞
にょ	ɲ̟õ̞
ほ	hõ̞
ひょ	çõ̞
も	mõ̞
みょ	mʲõ̞
よ	jõ̞
ろ	ɾõ̞
りょ	ɾʲõ̞
を	õ̞
ご	ɡõ̞
ぎょ	ɡʲõ̞
ぞ	zõ̞
ぞ	d͡zõ̞
じょ	ʑõ̞
じょ	d͡ʑõ̞
ど	dõ̞
ぢょ	ʑõ̞
ぢょ	d͡ʑõ̞
ぼ	bõ̞
びょ	bʲõ̞
ぽ	põ̞
ぴょ	pʲõ̞
# WITH VOICELESS VOWELS
い	i̥
き	kʲi̥
し	ɕi̥
ち	t͡ɕi̥
ひ	çi̥
う	ɯ̟̊ᵝ
う	ɨ̥ᵝ
く	kɯ̟̊ᵝ
く	kɨ̥ᵝ
きゅ	kʲɯ̟̊ᵝ
きゅ	kʲɨ̥ᵝ
す	sɯ̟̊ᵝ
す	sɨ̥ᵝ
しゅ	ɕɯ̟̊ᵝ
しゅ	ɕɨ̥ᵝ
つ	t͡sɯ̟̊ᵝ
つ	t͡sɨ̥ᵝ
ちゅ	t͡ɕɯ̟̊ᵝ
ちゅ	t͡ɕɨ̥ᵝ
ふ	ɸɯ̟̊ᵝ
ふ	ɸɨ̥ᵝ
ひゅ	çɯ̟̊ᵝ
ひゅ	çɨ̥ᵝ
# WITH LONG VOWELS
ああ	a̠ː
かあ	ka̠ː
きゃあ	kʲa̠ː
さあ	sa̠ː
しゃあ	ɕa̠ː
たあ	ta̠ː
ちゃあ	t͡ɕa̠ː
なあ	na̠ː
にゃあ	ɲ̟a̠ː
はあ	ha̠ː
ひゃあ	ça̠ː
まあ	ma̠ː
みゃあ	mʲa̠ː
やあ	ja̠ː
らあ	ɾa̠ː
りゃあ	ɾʲa̠ː
わあ	ɰᵝa̠ː
があ	ɡa̠ː
ぎゃあ	ɡʲa̠ː
ざあ	za̠ː
ざあ	d͡za̠ː
じゃあ	ʑa̠ː
じゃあ	d͡ʑa̠ː
だあ	da̠ː
ぢゃあ	ʑa̠ː
ぢゃあ	d͡ʑa̠ː
ばあ	ba̠ː
びゃあ	bʲa̠ː
ぱあ	pa̠ː
ぴゃあ	pʲa̠ː
いい	iː
きい	kʲiː
しい	ɕiː
ちい	t͡ɕiː
にい	ɲ̟iː
ひい	çiː
みい	mʲiː
りい	ɾʲiː
ぎい	ɡʲiː
じい	ʑiː
じい	d͡ʑiː
ぢい	ʑiː
ぢい	d͡ʑiː
びい	biː
ぴい	piː
うう	ɯ̟ᵝː
うう	ɨᵝː
くう	kɯ̟ᵝː
くう	kɨᵝː
きゅう	kʲɯ̟ᵝː
きゅう	kʲɨᵝː
すう	sɯ̟ᵝː
すう	sɨᵝː
しゅう	ɕɯ̟ᵝː
しゅう	ɕɨᵝː
つう	t͡sɯ̟ᵝː
つう	t͡sɨᵝː
ちゅう	t͡ɕɯ̟ᵝː
ちゅう	t͡ɕɨᵝː
ぬう	nɯ̟ᵝː
ぬう	nɨᵝː
にゅう	ɲ̟ɯ̟ᵝː
にゅう	ɲ̟ɨᵝː
ふう	ɸɯ̟ᵝː
ふう	ɸɨᵝː
ひゅう	çɯ̟ᵝː
ひゅう	çɨᵝː
むう	mɯ̟ᵝː
むう	mɨᵝː
みゅう	mʲɯ̟ᵝː
みゅう	mʲɨᵝː
ゆう	jɯ̟ᵝː
ゆう	jɨᵝː
るう	ɾɯ̟ᵝː
るう	ɾɨᵝː
りゅう	ɾʲɯ̟ᵝː
りゅう	ɾʲɨᵝː
ぐう	ɡɯ̟ᵝː
ぐう	ɡɨᵝː
ぎゅう	ɡʲɯ̟ᵝː
ぎゅう	ɡʲɨᵝː
ずう	zɯ̟ᵝː
ずう	zɨᵝː
ずう	d͡zɯ̟ᵝː
ずう	d͡zɨᵝː
じゅう	ʑɯ̟ᵝː
じゅう	ʑɨᵝː
じゅう	d͡ʑɯ̟ᵝː
じゅう	d͡ʑɨᵝː
づう	zɯ̟ᵝː
づう	zɨᵝː
づう	d͡zɯ̟ᵝː
づう	d͡zɨᵝː
ぢゅう	ʑɯ̟ᵝː
ぢゅう	ʑɨᵝː
ぢゅう	d͡ʑɯ̟ᵝː
ぢゅう	d͡zɨᵝː
ぶう	bɯ̟ᵝː
ぶう	bɨᵝː
びゅう	bʲɯ̟ᵝː
びゅう	bʲɨᵝː
ぷう	pɯ̟ᵝː
ぷう	pɨᵝː
ぴゅう	pʲɯ̟ᵝː
ぴゅう	pʲɨᵝː
ええ	e̞ː
えい	e̞ː
けえ	ke̞ː
けい	ke̞ː
せえ	se̞ː
せい	se̞ː
てえ	te̞ː
てい	te̞ː
ねえ	ne̞ː
ねい	ne̞ː
へえ	he̞ː
へい	he̞ː
めえ	me̞ː
めい	me̞ː
れえ	ɾe̞ː
れい	ɾe̞ː
げえ	ɡe̞ː
げい	ɡe̞ː
ぜえ	ze̞ː
ぜい	ze̞ː
ぜえ	d͡ze̞ː
ぜい	d͡ze̞ː
でえ	de̞ː
でい	de̞ː
べえ	be̞ː
べい	be̞ː
ぺえ	pe̞ː
ぺい	pe̞ː
おお	o̞ː
おう	o̞ː
こお	ko̞ː
こう	ko̞ː
きょお	kʲo̞ː
きょう	kʲo̞ː
そお	so̞ː
そう	so̞ː
しょお	ɕo̞ː
しょう	ɕo̞ː
とお	to̞ː
とう	to̞ː
ちょお	t͡ɕo̞ː
ちょう	t͡ɕo̞ː
のお	no̞ː
のう	no̞ː
にょお	ɲ̟o̞ː
にょう	ɲ̟o̞ː
ほお	ho̞ː
ほう	ho̞ː
ひょお	ço̞ː
ひょう	ço̞ː
もお	mo̞ː
もう	mo̞ː
みょお	mʲo̞ː
みょう	mʲo̞ː
よお	jo̞ː
よう	jo̞ː
ろお	ɾo̞ː
ろう	ɾo̞ː
りょお	ɾʲo̞ː
りょう	ɾʲo̞ː
をお	o̞ː
をう	o̞ː
ごお	ɡo̞ː
ごう	ɡo̞ː
ぎょお	ɡʲo̞ː
ぎょう	ɡʲo̞ː
ぞお	zo̞ː
ぞう	zo̞ː
ぞお	d͡zo̞ː
ぞう	d͡zo̞ː
じょお	ʑo̞ː
じょう	ʑo̞ː
じょお	d͡ʑo̞ː
じょう	d͡ʑo̞ː
どお	do̞ː
どう	do̞ː
ぢょお	ʑo̞ː
ぢょう	ʑo̞ː
ぢょお	d͡ʑo̞ː
ぢょう	d͡ʑo̞ː
ぼお	bo̞ː
ぼう	bo̞ː
びょお	bʲo̞ː
びょう	bʲo̞ː
ぽお	po̞ː
ぽう	po̞ː
ぴょお	pʲo̞ː
ぴょう	pʲo̞ː
# CODA NASAL
ん	m
ん	n
ん	ɲ̟
ん	ŋ
ん	ŋʲ
ん	ɴ
ん	ɰ̃
んな	nːa̠
んな	nːã̠
んなあ	nːa̠ː
んま	mːa̠
んま	mːã̠
んまあ	mːa̠ː
んみゃ	mʲːa̠
んみゃ	mʲːã̠
んみゃあ	mʲːa̠ː
# /ɲ̟ː/ does not occur in the data (reason unknown)
んみ	mʲːi
んみ	mʲːĩ
んみい	mʲːiː
んぬ	nːɯ̟ᵝ
んぬ	nːɨᵝ
んぬ	nːɯ̟̃ᵝ
んぬ	nːɨ̃ᵝ
んぬう	nːɯ̟ᵝː
んぬう	nːɨᵝː
んむ	mːɯ̟ᵝ
んむ	mːɨᵝ
んむ	mːɯ̟̃ᵝ
んむ	mːɨ̃ᵝ
んむう	mːɯ̟ᵝː
んむう	mːɨᵝː
んみゅ	mʲːɯ̟ᵝ
んみゅ	mʲːɨᵝ
んみゅ	mʲːɯ̟̃ᵝ
んみゅ	mʲːɨ̃ᵝ
んみゅう	mʲːɯ̟ᵝː
んみゅう	mʲːɨᵝː
んね	nːe̞
んね	nːẽ̞
んねえ	nːe̞ː
んねい	nːe̞ː
んめ	mːe̞
んめ	mːẽ̞
んめえ	mːe̞ː
んめい	mːe̞ː
んの	nːo̞
んの	nːõ̞
んのお	nːo̞ː
んのお	nːo̞ː
んも	mːo̞
んも	mːõ̞
んもお	mːo̞ː
んもう	mːo̞ː
んみょ	mʲːo̞
んみょ	mʲːõ̞
んみょお	mʲːo̞ː
んみょう	mʲːo̞ː
# CODA GEMINATE
っ	k̚
っ	k̚ʲ
っ	s
っ	ɕ
っ	t̚
っ	p̚
っ	p̚ʲ
# PARTICLES WITH IDIOSYNCRATIC PRONUNCIATIONS
は	ɰᵝa̠
へ	e̞
# (NEARLY) OBSOLETE
ゐ	i
ゑ	e̞



================================================
FILE: data/frequencies/README.md
================================================
# Frequencies scripts

The scripts in this directory are responsible for downloading word frequency
counts from the [Leipzig Corpora
Collection](https://wortschatz.uni-leipzig.de/en/download/) and merging those
counts into [our corresponding TSVs](../tsv/).

## How to use

[`grab_wortschatz_data.py`](grab_wortschatz_data.py) downloads and unpacks the
TARs provided by the aforementioned Corpora Collection. [`merge.py`](merge.py)
merges in the word frequency counts with our TSVs such that, for the languages
covered by the Corpora Collection, we end up with three-column TSVs:

    bashkë  b a ʃ k ə   1005
    bashkëfajtor    b a ʃ k f a j t ɔ ɹ 2
    bashkëfajtor    b a ʃ k ə f a j t ɔ ɹ   2
    bashkëjetesë    b a ʃ k ə j ɛ t ɛ s ə   9

We generally choose to download the largest available News corpus for each
language, though in some cases other sources are used.
[`wortschatz_languages.json`](wortschatz_languages.json) contains a dictionary
of all the languages for which we download frequencies. The `"data_url"` key for
each language links to the particular corpus we download.

After successful merging, the user can delete the temporary `tgz` and `tsv`
subdirectories.
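
The merge step can be pictured with the following minimal sketch, which appends
a count to each word/pronunciation pair and falls back to 0 for words without a
count. It works on in-memory data rather than files and is not a replacement
for `merge.py`; `merge_frequencies` and the last sample word are hypothetical.

```python
def merge_frequencies(pron_rows, frequencies):
    """Yields (word, pronunciation, count) triples.

    Rows may already carry a third column from an earlier merge; any such
    column is ignored and replaced by the current count.
    """
    for word, pron, *_ in pron_rows:
        yield word, pron, frequencies.get(word, 0)


frequencies = {"bashkë": 1005, "bashkëfajtor": 2}
pron_rows = [
    ("bashkë", "b a ʃ k ə"),
    ("bashkëfajtor", "b a ʃ k f a j t ɔ ɹ"),
    ("zzz", "z"),  # Hypothetical word absent from the frequency list.
]
for word, pron, count in merge_frequencies(pron_rows, frequencies):
    print(f"{word}\t{pron}\t{count}")
```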

## Shared tasks

Specific configurations for shared tasks are stored [here](shared_tasks).


================================================
FILE: data/frequencies/grab_wortschatz_data.py
================================================
#!/usr/bin/env python
"""Downloads and decompresses Wortschatz frequency data."""

import argparse
import json
import logging
import os
import tarfile
import time

from typing import Any

import requests

_THIS_DIR = os.path.dirname(__file__)
WORTSCHATZ_DICT_PATH = os.path.join(_THIS_DIR, "wortschatz_languages.json")


def download(data_to_grab: dict[str, Any]) -> dict[str, Any]:
    to_retry = {}
    os.makedirs("tgz", exist_ok=True)
    for language in data_to_grab:
        url = data_to_grab[language]["data_url"]
        with requests.get(url, stream=True) as response:
            target_path = url.split("/")[-1]
            logging.info("Downloading: %s", target_path)
            if response.status_code == 200:
                with open(f"tgz/{target_path}", "wb") as sink:
                    sink.write(response.raw.read())
            else:
                logging.info(
                    "Status code %d while downloading %s",
                    response.status_code,
                    target_path,
                )
                to_retry[language] = data_to_grab[language]
        # 30 seconds appears to not be enough, 60-70 seconds works well
        # but takes a long time.
        time.sleep(45)
    return to_retry


def unpack() -> None:
    os.makedirs("tsv", exist_ok=True)
    for tarball in os.listdir("tgz"):
        logging.info("Unpacking: %s", tarball)
        with tarfile.open(name=f"tgz/{tarball}", mode="r:gz") as tar_data:
            for file_entry in tar_data:
                if file_entry.name.endswith("words.txt"):
                    # Removes inconsistent path in tarballs
                    # so tsv has uniform contents.
                    file_entry.name = os.path.basename(file_entry.name)
                    tar_data.extract(file_entry, "tsv")


def main(args: argparse.Namespace) -> None:
    with open(args.freq_json_path, "r", encoding="utf-8") as langs:
        languages = json.load(langs)
    # Hack for repeatedly attempting to download Wortschatz data
    # as a way of getting around 404 response from their server.
    langs_to_retry = download(languages)
    while langs_to_retry:
        langs_to_retry = download(langs_to_retry)
    unpack()


if __name__ == "__main__":
    logging.basicConfig(
        format="%(filename)s %(levelname)s: %(message)s", level="INFO"
    )
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--freq-json-path",
        default=WORTSCHATZ_DICT_PATH,
        help="path to JSON file for the Wortschatz frequency download URLs",
    )
    main(parser.parse_args())


================================================
FILE: data/frequencies/merge.py
================================================
#!/usr/bin/env python
"""Merges downloaded frequency data with pronunciation data."""

import argparse
import csv
import json
import logging
import os


from grab_wortschatz_data import WORTSCHATZ_DICT_PATH


def write_frequency_tsv(
    wiki_tsv_prefix: str,
    level: str,
    frequencies_dict: dict[str, int],
    args: argparse.Namespace,
) -> None:
    # Complete WikiPron TSV paths.
    basename = f"{wiki_tsv_prefix}_{level}.tsv"
    source_path = os.path.join(args.source_pron_dir, basename)
    sink_path = os.path.join(args.dest_dir, basename)
    # Will try to overwrite narrow and broad WikiPron TSVs for all Wortschatz
    # languages. WikiPron may not have both a narrow and broad TSV for all
    # languages.
    try:
        # This is written to be run after remove_duplicates_and_split.sh
        # and retain sorted order.
        with open(source_path, "r", encoding="utf-8") as wiki_file:
            wiki_tsv = csv.reader(
                wiki_file, delimiter="\t", quoting=csv.QUOTE_NONE
            )
            with open(sink_path, "w", encoding="utf-8") as sink:
                # Our TSVs may have two or three columns
                # depending on whether merge.py has already been run;
                # any previous count is discarded.
                for word, pron, *prev_count in wiki_tsv:
                    # Check if WikiPron word is in Wortschatz frequencies
                    # else set frequency to 0.
                    if word in frequencies_dict:
                        print(
                            f"{word}\t{pron}\t{frequencies_dict[word]}",
                            file=sink,
                        )
                    else:
                        print(f"{word}\t{pron}\t0", file=sink)
    except FileNotFoundError as err:
        logging.info("File not found: %s", err)


def main(args: argparse.Namespace) -> None:
    with open(WORTSCHATZ_DICT_PATH, "r", encoding="utf-8") as langs:
        languages = json.load(langs)
    levels = [
        "narrow",
        "broad",
        "narrow_filtered",
        "broad_filtered",
    ]
    os.makedirs(args.dest_dir, exist_ok=True)
    for freq_file in os.listdir(args.source_freq_dir):
        word_freq_dict = {}
        # For accessing correct language in wortschatz_languages.json.
        file_to_match = freq_file.rsplit("-", 1)[0]
        logging.info("Currently working on: %s", file_to_match)

        freq_path = os.path.join(args.source_freq_dir, freq_file)
        with open(freq_path, "r", encoding="utf-8") as tsv:
            frequencies_tsv = csv.reader(
                tsv, delimiter="\t", quoting=csv.QUOTE_NONE
            )
            for row in frequencies_tsv:
                # Wortschatz TSVs are not uniformly formatted.
                # Some have 3 columns, some have 4.
                try:
                    word = row[2].casefold()
                    freq = int(row[3])
                except IndexError:
                    word = row[1].casefold()
                    freq = int(row[2])
                if word not in word_freq_dict:
                    word_freq_dict[word] = freq
                else:
                    word_freq_dict[word] = word_freq_dict[word] + freq
        for wiki_tsv_prefix in languages[file_to_match]["file_prefixes"]:
            for level in levels:
                write_frequency_tsv(
                    wiki_tsv_prefix, level, word_freq_dict, args
                )


if __name__ == "__main__":
    logging.basicConfig(
        format="%(filename)s %(levelname)s: %(message)s", level="INFO"
    )
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--dest-dir",
        required=True,
        help="destination directory where the merged data is created",
    )
    parser.add_argument(
        "--source-pron-dir",
        required=True,
        help="source directory of pronunciation data as TSV files",
    )
    parser.add_argument(
        "--source-freq-dir",
        default="tsv",
        help=(
            "source directory of frequency data "
            "unpacked from the downloaded Wortschatz files"
        ),
    )
    main(parser.parse_args())


================================================
FILE: data/frequencies/shared_tasks/README.md
================================================
This directory contains JSON files that, when supplied to the scripts under
`data/frequencies/` in place of the default `../wortschatz_languages.json`,
generate frequencies for only the targeted languages.
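
As a rough illustration of what such a file implies, the sketch below loads a
configuration of the shape used in this directory and lists the WikiPron TSV
basenames that would be looked up for it. The `expected_basenames` helper and
the inline one-entry sample are hypothetical; only the `file_prefixes` key and
the four level suffixes are taken from `merge.py`.

```python
import json

# Level suffixes mirrored from merge.py.
LEVELS = ["narrow", "broad", "narrow_filtered", "broad_filtered"]

# Inline sample in the same shape as SIGMORPHON_2021.json (one entry only).
SAMPLE = """
{
    "bul_newscrawl_2017_1M": {
        "file_prefixes": ["bul"],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/bul_newscrawl_2017_1M.tar.gz"
    }
}
"""


def expected_basenames(config: dict) -> list[str]:
    """Lists the TSV basenames implied by each entry's file prefixes."""
    return [
        f"{prefix}_{level}.tsv"
        for entry in config.values()
        for prefix in entry["file_prefixes"]
        for level in LEVELS
    ]


if __name__ == "__main__":
    for name in expected_basenames(json.loads(SAMPLE)):
        print(name)
```

In the scripts as listed here, `grab_wortschatz_data.py` accepts the override
via `--freq-json-path`; `merge.py` reads the default path and has no such flag.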


================================================
FILE: data/frequencies/shared_tasks/SIGMORPHON_2021.json
================================================
{
    "hye-am_web_2017_1M": {
        "file_prefixes": [
            "arm_e"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/hye-am_web_2017_1M.tar.gz"
    },
    "bul_newscrawl_2017_1M": {
        "file_prefixes": [
            "bul"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/bul_newscrawl_2017_1M.tar.gz"
    },
    "nld_mixed_2012_1M": {
        "file_prefixes": [
            "dut"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/nld_mixed_2012_1M.tar.gz"
    },
    "fra_news_2010_1M": {
        "file_prefixes": [
            "fre"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/fra_news_2010_1M.tar.gz"
    },
    "kat_newscrawl_2016_1M": {
        "file_prefixes": [
            "geo"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/kat_newscrawl_2016_1M.tar.gz"
    },
    "ell-gr_web_2015_1M": {
        "file_prefixes": [
            "gre"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ell-gr_web_2015_1M.tar.gz"
    },
    "hbs_mixed_2014_1M": {
        "file_prefixes": [
            "hbs_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/hbs_mixed_2014_1M.tar.gz"
    },
    "hun_mixed_2012_1M": {
        "file_prefixes": [
            "hun"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/hun_mixed_2012_1M.tar.gz"
    },
    "isl-is_web_2017_1M": {
        "file_prefixes": [
            "ice"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/isl-is_web_2017_1M.tar.gz"
    },
    "ita_news_2010_1M": {
        "file_prefixes": [
            "ita"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ita_news_2010_1M.tar.gz"
    },
    "jpn_news_2011_1M": {
        "file_prefixes": [
            "jpn_hira"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/jpn_news_2011_1M.tar.gz"
    },
    "khm_community_2017": {
        "file_prefixes": [
            "khm"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/khm_community_2017.tar.gz"
    },
    "kor_news_2007_1M": {
        "file_prefixes": [
            "kor"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/kor_news_2007_1M.tar.gz"
    },
    "lav-lv_web_2015_1M": {
        "file_prefixes": [
            "lav"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/lav-lv_web_2015_1M.tar.gz"
    },
    "mlt_web_2012_300K": {
        "file_prefixes": [
            "mlt_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/mlt_web_2012_300K.tar.gz"
    },
    "ron_news_2015_1M": {
        "file_prefixes": [
            "rum"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ron_news_2015_1M.tar.gz"
    },
    "slv-si_web_2014_1M": {
        "file_prefixes": [
            "slv"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/slv-si_web_2014_1M.tar.gz"
    },
    "vie_newscrwal_2011_1M": {
        "file_prefixes": [
            "vie_hanoi"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/vie_newscrwal_2011_1M.tar.gz"
    },
    "cym_wikipedia_2016_100K": {
        "file_prefixes": [
            "wel_sw"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/cym_wikipedia_2016_100K.tar.gz"
    }
}


================================================
FILE: data/frequencies/shared_tasks/SIGMORPHON_2022.json
================================================
{
    "swe_news_2020_1M": {
        "file_prefixes": [
            "swe_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/swe_news_2020_1M.tar.gz"
    },
    "nno-no_web_2020_1M": {
        "file_prefixes": [
            "nno_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/nno-no_web_2020_1M.tar.gz"
    },
    "deu_news_2021_1M": {
        "file_prefixes": [
            "ger_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/deu_news_2021_1M.tar.gz"
    },
    "nld_news_2020_1M": {
        "file_prefixes": [
            "dut_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/nld_news_2020_1M.tar.gz"
    },
    "ita_news_2020_1M": {
        "file_prefixes": [
            "ita_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ita_news_2020_1M.tar.gz"
    },
    "ron_news_2020_1M": {
        "file_prefixes": [
            "rum_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ron_news_2020_1M.tar.gz"
    },
    "ukr_news_2020_1M": {
        "file_prefixes": [
            "ukr_cyrl"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ukr_news_2020_1M.tar.gz"
    },
    "bel_news_2011_300K": {
        "file_prefixes": [
            "bel_cyrl"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/bel_news_2011_300K.tar.gz"
    },
    "tgl_newscrwal_2011_300K": {
        "file_prefixes": [
            "tgl_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/tgl_newscrwal_2011_300K.tar.gz"
    },
    "ceb_wikipedia_2021_1M": {
        "file_prefixes": [
            "ceb_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ceb_wikipedia_2021_1M.tar.gz"
    },
    "ben_news_2020_300K": {
        "file_prefixes": [
            "ben_beng"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ben_news_2020_300K.tar.gz"
    },
    "asm_wikipedia_2021_100K": {
        "file_prefixes": [
            "asm_beng"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/asm_wikipedia_2021_100K.tar.gz"
    },
    "fas_newscrawl_2019_1M": {
        "file_prefixes": [
            "per_arab"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/fas_newscrawl_2019_1M.tar.gz"
    },
    "tha-th_web_2018_1M": {
        "file_prefixes": [
            "tha_thai"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/tha-th_web_2018_1M.tar.gz"
    }
}


================================================
FILE: data/frequencies/wortschatz_languages.json
================================================
{
    "sqi_wikipedia_2016_300K": {
        "file_prefixes": [
            "alb"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/sqi_wikipedia_2016_300K.tar.gz"
    },
    "ara_news_2017_1M": {
        "file_prefixes": [
            "ara"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ara_news_2017_1M.tar.gz"
    },
    "hye-am_web_2017_1M": {
        "file_prefixes": [
            "arm_e",
            "arm_w"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/hye-am_web_2017_1M.tar.gz"
    },
    "asm_wikipedia_2021_100K": {
        "file_prefixes": [
            "asm_beng"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/asm_wikipedia_2021_100K.tar.gz"
    },
    "aze_newscrawl_2013_1M": {
        "file_prefixes": [
            "aze_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/aze_newscrawl_2013_1M.tar.gz"
    },
    "bak_news_2016_300K": {
        "file_prefixes": [
            "bak"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/bak_news_2016_300K.tar.gz"
    },
    "bel_news_2011_300K": {
        "file_prefixes": [
            "bel_cyrl"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/bel_news_2011_300K.tar.gz"
    },
    "bul_newscrawl_2017_1M": {
        "file_prefixes": [
            "bul"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/bul_newscrawl_2017_1M.tar.gz"
    },
    "cat_newscrawl_2016_1M": {
        "file_prefixes": [
            "cat"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/cat_newscrawl_2016_1M.tar.gz"
    },
    "ces_news_2005-2007_1M": {
        "file_prefixes": [
            "cze"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ces_news_2005-2007_1M.tar.gz"
    },
    "dan_news_2007_1M": {
        "file_prefixes": [
            "dan"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/dan_news_2007_1M.tar.gz"
    },
    "nld_news_2020_1M": {
        "file_prefixes": [
            "dut_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/nld_news_2020_1M.tar.gz"
    },
    "eng_news_2016_1M": {
        "file_prefixes": [
            "eng_uk",
            "eng_us"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/eng_news_2016_1M.tar.gz"
    },
    "epo_mixed_2012_1M": {
        "file_prefixes": [
            "epo"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/epo_mixed_2012_1M.tar.gz"
    },
    "fao-fo_web_2015_1M": {
        "file_prefixes": [
            "fao"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/fao-fo_web_2015_1M.tar.gz"
    },
    "fin_web_2002_1M": {
        "file_prefixes": [
            "fin"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/fin_web_2002_1M.tar.gz"
    },
    "fra_news_2010_1M": {
        "file_prefixes": [
            "fre"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/fra_news_2010_1M.tar.gz"
    },
    "glg_wikipedia_2016_300K": {
        "file_prefixes": [
            "glg"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/glg_wikipedia_2016_300K.tar.gz"
    },
    "kat_newscrawl_2016_1M": {
        "file_prefixes": [
            "geo"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/kat_newscrawl_2016_1M.tar.gz"
    },
    "deu_news_2021_1M": {
        "file_prefixes": [
            "ger_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/deu_news_2021_1M.tar.gz"
    },
    "ell-gr_web_2015_1M": {
        "file_prefixes": [
            "gre"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ell-gr_web_2015_1M.tar.gz"
    },
    "hbs_mixed_2014_1M": {
        "file_prefixes": [
            "hbs_cyrl",
            "hbs_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/hbs_mixed_2014_1M.tar.gz"
    },
    "hin_news_2011_1M": {
        "file_prefixes": [
            "hin"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/hin_news_2011_1M.tar.gz"
    },
    "hun_mixed_2012_1M": {
        "file_prefixes": [
            "hun"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/hun_mixed_2012_1M.tar.gz"
    },
    "isl-is_web_2017_1M": {
        "file_prefixes": [
            "ice"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/isl-is_web_2017_1M.tar.gz"
    },
    "ido_wikipedia_2016_100K": {
        "file_prefixes": [
            "ido"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ido_wikipedia_2016_100K.tar.gz"
    },
    "ind_mixed_2012_1M": {
        "file_prefixes": [
            "ind"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ind_mixed_2012_1M.tar.gz"
    },
    "gle_newscrawl_2014_300K": {
        "file_prefixes": [
            "gle"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/gle_newscrawl_2014_300K.tar.gz"
    },
    "ita_news_2020_1M": {
        "file_prefixes": [
            "ita_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ita_news_2020_1M.tar.gz"
    },
    "jpn_news_2011_1M": {
        "file_prefixes": [
            "jpn_hira",
            "jpn_kana"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/jpn_news_2011_1M.tar.gz"
    },
    "kor_news_2007_1M": {
        "file_prefixes": [
            "kor"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/kor_news_2007_1M.tar.gz"
    },
    "kur_newscrawl_2011_30K": {
        "file_prefixes": [
            "kur"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/kur_newscrawl_2011_30K.tar.gz"
    },
    "lat_wikipedia_2018_100K": {
        "file_prefixes": [
            "lat"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/lat_wikipedia_2018_100K.tar.gz"
    },
    "lav-lv_web_2015_1M": {
        "file_prefixes": [
            "lav"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/lav-lv_web_2015_1M.tar.gz"
    },
    "lit-lt_web_2016_1M": {
        "file_prefixes": [
            "lit"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/lit-lt_web_2016_1M.tar.gz"
    },
    "dsb_wikipedia_2016_10K": {
        "file_prefixes": [
            "dsb"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/dsb_wikipedia_2016_10K.tar.gz"
    },
    "ltz-lu_web_2013_1M": {
        "file_prefixes": [
            "ltz"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ltz-lu_web_2013_1M.tar.gz"
    },
    "mkd-mk_web_2015_1M": {
        "file_prefixes": [
            "mac"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/mkd-mk_web_2015_1M.tar.gz"
    },
    "msa_newscrawl_2016_300K": {
        "file_prefixes": [
            "may"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/msa_newscrawl_2016_300K.tar.gz"
    },
    "mlt_web_2012_300K": {
        "file_prefixes": [
            "mlt"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/mlt_web_2012_300K.tar.gz"
    },
    "mon_wikipedia_2016_100K": {
        "file_prefixes": [
            "mon"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/mon_wikipedia_2016_100K.tar.gz"
    },
    "sme-no_news_2015_10K": {
        "file_prefixes": [
            "sme"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/sme-no_news_2015_10K.tar.gz"
    },
    "nor_wikipedia_2016_1M": {
        "file_prefixes": [
            "nor"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/nor_wikipedia_2016_1M.tar.gz"
    },
    "nob_news_2013_1M": {
        "file_prefixes": [
            "nob"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/nob_news_2013_1M.tar.gz"
    },
    "nno-no_web_2020_1M": {
        "file_prefixes": [
            "nno_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/nno-no_web_2020_1M.tar.gz"
    },
    "fas_newscrawl_2019_1M": {
        "file_prefixes": [
            "per_arab"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/fas_newscrawl_2019_1M.tar.gz"
    },
    "pol_news_2008_1M": {
        "file_prefixes": [
            "pol"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/pol_news_2008_1M.tar.gz"
    },
    "por_newscrawl_2016_1M": {
        "file_prefixes": [
            "por_bz",
            "por_po"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/por_newscrawl_2016_1M.tar.gz"
    },
    "ron_news_2020_1M": {
        "file_prefixes": [
            "rum_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ron_news_2020_1M.tar.gz"
    },
    "rus_news_2010_1M": {
        "file_prefixes": [
            "rus"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/rus_news_2010_1M.tar.gz"
    },
    "san_wikipedia_2016_100K": {
        "file_prefixes": [
            "san"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/san_wikipedia_2016_100K.tar.gz"
    },
    "slk-sk_web_2016_1M": {
        "file_prefixes": [
            "slo"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/slk-sk_web_2016_1M.tar.gz"
    },
    "slv-si_web_2014_1M": {
        "file_prefixes": [
            "slv"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/slv-si_web_2014_1M.tar.gz"
    },
    "spa_news_2011_1M": {
        "file_prefixes": [
            "spa_ca",
            "spa_la"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/spa_news_2011_1M.tar.gz"
    },
    "swe_news_2020_1M": {
        "file_prefixes": [
            "swe_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/swe_news_2020_1M.tar.gz"
    },
    "tgl_newscrwal_2011_300K": {
        "file_prefixes": [
            "tgl_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/tgl_newscrwal_2011_300K.tar.gz"
    },
    "tam_newscrawl_2011_1M": {
        "file_prefixes": [
            "tam"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/tam_newscrawl_2011_1M.tar.gz"
    },
    "tur_news_2005_1M": {
        "file_prefixes": [
            "tur"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/tur_news_2005_1M.tar.gz"
    },
    "ukr_news_2020_1M": {
        "file_prefixes": [
            "ukr_cyrl"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ukr_news_2020_1M.tar.gz"
    },
    "vie_newscrwal_2011_1M": {
        "file_prefixes": [
            "vie_hanoi",
            "vie_hcmc",
            "vie_hue"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/vie_newscrwal_2011_1M.tar.gz"
    },
    "cym_wikipedia_2016_100K": {
        "file_prefixes": [
            "wel_nw",
            "wel_sw"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/cym_wikipedia_2016_100K.tar.gz"
    },
    "zho-cn_web_2015_1M": {
        "file_prefixes": [
            "cmn_hani"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/zho-cn_web_2015_1M.tar.gz"
    },
    "zul_mixed_2014_100K": {
        "file_prefixes": [
            "zul"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/zul_mixed_2014_100K.tar.gz"
    },
    "khm_community_2017": {
        "file_prefixes": [
            "khm"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/khm_community_2017.tar.gz"
    },
    "ceb_wikipedia_2021_1M": {
        "file_prefixes": [
            "ceb_latn"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ceb_wikipedia_2021_1M.tar.gz"
    },
    "ben_news_2020_300K": {
        "file_prefixes": [
            "ben_beng"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ben_news_2020_300K.tar.gz"
    },
    "tha-th_web_2018_1M": {
        "file_prefixes": [
            "tha_thai"
        ],
        "data_url": "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/tha-th_web_2018_1M.tar.gz"
    }
}


================================================
FILE: data/morphology/README.md
================================================
# Morphology scripts

The scripts in this directory are responsible for downloading morphological
paradigms from [UniMorph](https://unimorph.github.io/).

## How to use

[`grab_unimorph_data.py`](grab_unimorph_data.py) downloads UniMorph data as
three-column TSV files:

    кошка   кошек   N;GEN;PL
    кошка   кошка   N;NOM;SG
    кошка   кошкам  N;DAT;PL
    кошка   кошками N;INS;PL
    кошка   кошках  N;ESS;PL
    кошка   кошке   N;DAT;SG
    кошка   кошке   N;ESS;SG
    кошка   кошки   N;GEN;SG
    кошка   кошки   N;NOM;PL
    кошка   кошкой  N;INS;SG
    кошка   кошку   N;ACC;SG
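
A minimal sketch of reading such a file, assuming tab-separated columns in the
order lemma, inflected form, feature bundle (as the sample above suggests). The
`read_paradigms` helper is hypothetical and not part of this repository.

```python
import collections
import csv
import io


def read_paradigms(lines) -> dict[str, list[tuple[str, list[str]]]]:
    """Groups (form, features) pairs by lemma.

    Assumes three tab-separated columns: lemma, form, and a
    semicolon-delimited feature bundle. Empty lines are skipped.
    """
    paradigms = collections.defaultdict(list)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        lemma, form, features = row
        paradigms[lemma].append((form, features.split(";")))
    return dict(paradigms)


if __name__ == "__main__":
    sample = io.StringIO("кошка\tкошек\tN;GEN;PL\nкошка\tкошка\tN;NOM;SG\n")
    for lemma, forms in read_paradigms(sample).items():
        print(lemma, forms)
```

For a downloaded file, the same function could be given a handle opened with
`open("tsv/<language>.tsv", encoding="utf-8", newline="")`.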

## Shared tasks

Specific configurations for shared tasks are stored [here](shared_tasks).


================================================
FILE: data/morphology/grab_unimorph_data.py
================================================
#!/usr/bin/env python
"""Downloads UniMorph morphological paradigms data."""

import argparse
import json
import logging
import os
import time


import requests

_THIS_DIR = os.path.dirname(__file__)
UNIMORPH_DICT_PATH = os.path.join(_THIS_DIR, "unimorph_languages.json")


def download(data_to_grab: dict[str, list[str]]) -> dict[str, list[str]]:
    to_retry = {}
    os.makedirs("tsv", exist_ok=True)
    for language, urls in data_to_grab.items():
        with open(f"tsv/{language}.tsv", "wb") as sink:
            for url in urls:
                with requests.get(url, stream=True) as response:
                    logging.info("Downloading: %s", language)
                    if response.status_code == 200:
                        sink.write(response.content)
                    else:
                        logging.info(
                            "Status code %d while downloading %s",
                            response.status_code,
                            language,
                        )
                        to_retry[language] = data_to_grab[language]
        # 30 seconds appears to not be enough, 60-70 seconds works well
        # but takes a long time.
        time.sleep(45)
    return to_retry


def main(args: argparse.Namespace) -> None:
    with open(args.unimorph_json_path, "r", encoding="utf-8") as langs:
        languages = json.load(langs)
    # Hack: repeatedly attempt to download UniMorph data
    # as a way of getting around non-200 responses from the server.
    langs_to_retry = download(languages)
    while langs_to_retry:
        langs_to_retry = download(langs_to_retry)


if __name__ == "__main__":
    logging.basicConfig(
        format="%(filename)s %(levelname)s: %(message)s", level="INFO"
    )
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--unimorph-json-path",
        default=UNIMORPH_DICT_PATH,
        help="path to the JSON file for the UniMorph download URLs",
    )
    main(parser.parse_args())


================================================
FILE: data/morphology/shared_tasks/README.md
================================================
This directory contains JSON files that, when supplied to the scripts under
`data/morphology/` in place of the default `../unimorph_languages.json`,
generate morphological paradigms for only the targeted languages.
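
Each file maps an output name to a list of download URLs; the contents of all
URLs for a name are written to one TSV. The sketch below checks that shape
before a long download run. The `check_config` helper is hypothetical; the
inline sample borrows the `bul` entry from `SIGMORPHON_2021.json` and the
two-URL `fin` entry from the default `unimorph_languages.json`.

```python
import json

SAMPLE = """
{
    "bul": ["https://raw.githubusercontent.com/unimorph/bul/master/bul"],
    "fin": [
        "https://raw.githubusercontent.com/unimorph/fin/master/fin.1",
        "https://raw.githubusercontent.com/unimorph/fin/master/fin.2"
    ]
}
"""


def check_config(config: dict) -> list[str]:
    """Returns the keys whose value is not a non-empty list of strings."""
    bad = []
    for language, urls in config.items():
        is_valid = (
            isinstance(urls, list)
            and bool(urls)
            and all(isinstance(url, str) for url in urls)
        )
        if not is_valid:
            bad.append(language)
    return bad


if __name__ == "__main__":
    problems = check_config(json.loads(SAMPLE))
    print("malformed entries:", problems)
```

A file that passes this check can be supplied to `grab_unimorph_data.py` via
`--unimorph-json-path`.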


================================================
FILE: data/morphology/shared_tasks/SIGMORPHON_2021.json
================================================
{
    "ady": [
        "https://raw.githubusercontent.com/unimorph/ady/master/ady"
    ],
    "arm_e": [
        "https://raw.githubusercontent.com/unimorph/hye/master/hye"
    ],
    "bul": [
        "https://raw.githubusercontent.com/unimorph/bul/master/bul"
    ],
    "dut": [
        "https://raw.githubusercontent.com/unimorph/nld/master/nld"
    ],
    "eng_us": [
        "https://raw.githubusercontent.com/unimorph/eng/master/eng"
    ],
    "fre": [
        "https://raw.githubusercontent.com/unimorph/fra/master/fra"
    ],
    "geo": [
        "https://raw.githubusercontent.com/unimorph/kat/master/kat"
    ],
    "gre": [
        "https://raw.githubusercontent.com/unimorph/ell/master/ell"
    ],
    "hbs_latn": [
        "https://raw.githubusercontent.com/unimorph/hbs/master/hbs"
    ],
    "hun": [
        "https://raw.githubusercontent.com/unimorph/hun/master/hun"
    ],
    "ice": [
        "https://raw.githubusercontent.com/unimorph/isl/master/isl"
    ],
    "ita": [
        "https://raw.githubusercontent.com/unimorph/ita/master/ita"
    ],
    "lav": [
        "https://raw.githubusercontent.com/unimorph/lav/master/lav"
    ],
    "mlt_latn": [
        "https://raw.githubusercontent.com/unimorph/mlt/master/mlt"
    ],
    "rum": [
        "https://raw.githubusercontent.com/unimorph/ron/master/ron"
    ],
    "slv": [
        "https://raw.githubusercontent.com/unimorph/slv/master/slv"
    ],
    "wel_sw": [
        "https://raw.githubusercontent.com/unimorph/cym/master/cym"
    ]
}


================================================
FILE: data/morphology/shared_tasks/SIGMORPHON_2022.json
================================================
{
    "asm": [
        "https://raw.githubusercontent.com/unimorph/asm/master/asm"
    ],
    "bel": [
        "https://raw.githubusercontent.com/unimorph/bel/master/bel"
    ],
    "ben": [
        "https://raw.githubusercontent.com/unimorph/ben/master/ben"
    ],
    "ceb": [
        "https://raw.githubusercontent.com/unimorph/ceb/master/ceb"
    ],
    "dut": [
        "https://raw.githubusercontent.com/unimorph/nld/master/nld"
    ],
    "ger": [
        "https://raw.githubusercontent.com/unimorph/deu/master/deu"
    ],
    "ita": [
        "https://raw.githubusercontent.com/unimorph/ita/master/ita"
    ],
    "nno": [
        "https://raw.githubusercontent.com/unimorph/nno/master/nno"
    ],
    "per": [
        "https://raw.githubusercontent.com/unimorph/fas/master/fas"
    ],
    "pus": [
        "https://raw.githubusercontent.com/unimorph/pus/master/pus"
    ],
    "rum": [
        "https://raw.githubusercontent.com/unimorph/ron/master/ron"
    ],
    "swe": [
        "https://raw.githubusercontent.com/unimorph/swe/master/swe"
    ],
    "tgl": [
        "https://raw.githubusercontent.com/unimorph/tgl/master/tgl"
    ],
    "ukr": [
        "https://raw.githubusercontent.com/unimorph/ukr/master/ukr"
    ]
}


================================================
FILE: data/morphology/unimorph_languages.json
================================================
{
    "ady": [
        "https://raw.githubusercontent.com/unimorph/ady/master/ady"
    ],
    "ang": [
        "https://raw.githubusercontent.com/unimorph/ang/master/ang"
    ],
    "ara": [
        "https://raw.githubusercontent.com/unimorph/ara/master/ara"
    ],
    "asm": [
        "https://raw.githubusercontent.com/unimorph/asm/master/asm"
    ],
    "ast": [
        "https://raw.githubusercontent.com/unimorph/ast/master/ast"
    ],
    "aze": [
        "https://raw.githubusercontent.com/unimorph/aze/master/aze"
    ],
    "bak": [
        "https://raw.githubusercontent.com/unimorph/bak/master/bak"
    ],
    "bel": [
        "https://raw.githubusercontent.com/unimorph/bel/master/bel"
    ],
    "ben": [
        "https://raw.githubusercontent.com/unimorph/ben/master/ben"
    ],
    "bre": [
        "https://raw.githubusercontent.com/unimorph/bre/master/bre"
    ],
    "bul": [
        "https://raw.githubusercontent.com/unimorph/bul/master/bul"
    ],
    "cat": [
        "https://raw.githubusercontent.com/unimorph/cat/master/cat"
    ],
    "ceb": [
        "https://raw.githubusercontent.com/unimorph/ceb/master/ceb"
    ],
    "ces": [
        "https://raw.githubusercontent.com/unimorph/ces/master/ces"
    ],
    "cor": [
        "https://raw.githubusercontent.com/unimorph/cor/master/cor"
    ],
    "cym": [
        "https://raw.githubusercontent.com/unimorph/cym/master/cym"
    ],
    "dan": [
        "https://raw.githubusercontent.com/unimorph/dan/master/dan"
    ],
    "deu": [
        "https://raw.githubusercontent.com/unimorph/deu/master/deu"
    ],
    "dsb": [
        "https://raw.githubusercontent.com/unimorph/dsb/master/dsb"
    ],
    "ell": [
        "https://raw.githubusercontent.com/unimorph/ell/master/ell"
    ],
    "eng": [
        "https://raw.githubusercontent.com/unimorph/eng/master/eng"
    ],
    "est": [
        "https://raw.githubusercontent.com/unimorph/est/master/est"
    ],
    "eus": [
        "https://raw.githubusercontent.com/unimorph/eus/master/eus"
    ],
    "fao": [
        "https://raw.githubusercontent.com/unimorph/fao/master/fao"
    ],
    "fas": [
        "https://raw.githubusercontent.com/unimorph/fas/master/fas"
    ],
    "fin": [
        "https://raw.githubusercontent.com/unimorph/fin/master/fin.1",
        "https://raw.githubusercontent.com/unimorph/fin/master/fin.2"
    ],
    "fra": [
        "https://raw.githubusercontent.com/unimorph/fra/master/fra"
    ],
    "fro": [
        "https://raw.githubusercontent.com/unimorph/fro/master/fro"
    ],
    "frr": [
        "https://raw.githubusercontent.com/unimorph/frr/master/frr"
    ],
    "gla": [
        "https://raw.githubusercontent.com/unimorph/gla/master/gla"
    ],
    "gle": [
        "https://raw.githubusercontent.com/unimorph/gle/master/gle"
    ],
    "glv": [
        "https://raw.githubusercontent.com/unimorph/glv/master/glv"
    ],
    "got": [
        "https://raw.githubusercontent.com/unimorph/got/master/got"
    ],
    "heb": [
        "https://raw.githubusercontent.com/unimorph/heb/master/heb"
    ],
    "hin": [
        "https://raw.githubusercontent.com/unimorph/hin/master/hin"
    ],
    "hun": [
        "https://raw.githubusercontent.com/unimorph/hun/master/hun"
    ],
    "hye": [
        "https://raw.githubusercontent.com/unimorph/hye/master/hye"
    ],
    "isl": [
        "https://raw.githubusercontent.com/unimorph/isl/master/isl"
    ],
    "ita": [
        "https://raw.githubusercontent.com/unimorph/ita/master/ita"
    ],
    "kal": [
        "https://raw.githubusercontent.com/unimorph/kal/master/kal"
    ],
    "kat": [
        "https://raw.githubusercontent.com/unimorph/kat/master/kat"
    ],
    "kaz": [
        "https://raw.githubusercontent.com/unimorph/kaz/master/kaz"
    ],
    "kbd": [
        "https://raw.githubusercontent.com/unimorph/kbd/master/kbd"
    ],
    "lat": [
        "https://raw.githubusercontent.com/unimorph/lat/master/lat"
    ],
    "lav": [
        "https://raw.githubusercontent.com/unimorph/lav/master/lav"
    ],
    "lit": [
        "https://raw.githubusercontent.com/unimorph/lit/master/lit"
    ],
    "mkd": [
        "https://raw.githubusercontent.com/unimorph/mkd/master/mkd"
    ],
    "mlt": [
        "https://raw.githubusercontent.com/unimorph/mlt/master/mlt"
    ],
    "nap": [
        "https://raw.githubusercontent.com/unimorph/nap/master/nap"
    ],
    "nav": [
        "https://raw.githubusercontent.com/unimorph/nav/master/nav"
    ],
    "nds": [
        "https://raw.githubusercontent.com/unimorph/nds/master/nds"
    ],
    "nld": [
        "https://raw.githubusercontent.com/unimorph/nld/master/nld"
    ],
    "nno": [
        "https://raw.githubusercontent.com/unimorph/nno/master/nno"
    ],
    "nob": [
        "https://raw.githubusercontent.com/unimorph/nob/master/nob"
    ],
    "oci": [
        "https://raw.githubusercontent.com/unimorph/oci/master/oci"
    ],
    "pol": [
        "https://raw.githubusercontent.com/unimorph/pol/master/pol"
    ],
    "por": [
        "https://raw.githubusercontent.com/unimorph/por/master/por"
    ],
    "pus": [
        "https://raw.githubusercontent.com/unimorph/pus/master/pus"
    ],
    "que": [
        "https://raw.githubusercontent.com/unimorph/que/master/que"
    ],
    "ron": [
        "https://raw.githubusercontent.com/unimorph/ron/master/ron"
    ],
    "rus": [
        "https://raw.githubusercontent.com/unimorph/rus/master/rus"
    ],
    "san": [
        "https://raw.githubusercontent.com/unimorph/san/master/san"
    ],
    "sga": [
        "https://raw.githubusercontent.com/unimorph/sga/master/sga"
    ],
    "slv": [
        "https://raw.githubusercontent.com/unimorph/slv/master/slv"
    ],
    "sme": [
        "https://raw.githubusercontent.com/unimorph/sme/master/sme"
    ],
    "spa": [
        "https://raw.githubusercontent.com/unimorph/spa/master/spa"
    ],
    "sqi": [
        "https://raw.githubusercontent.com/unimorph/sqi/master/sqi"
    ],
    "swe": [
        "https://raw.githubusercontent.com/unimorph/swe/master/swe"
    ],
    "syc": [
        "https://raw.githubusercontent.com/unimorph/syc/master/syc"
    ],
    "tgl": [
        "https://raw.githubusercontent.com/unimorph/tgl/master/tgl"
    ],
    "tur": [
        "https://raw.githubusercontent.com/unimorph/tur/master/tur"
    ],
    "ukr": [
        "https://raw.githubusercontent.com/unimorph/ukr/master/ukr"
    ],
    "urd": [
        "https://raw.githubusercontent.com/unimorph/urd/master/urd"
    ]
}

================================================
FILE: data/phones/HOWTO.md
================================================
Phones
======

A `.phones` file is a list of permitted phones; any pronunciation that is not
composed entirely of permitted phones is filtered out as a postprocessing
step.
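
As a rough sketch of that filtering step (not the repository's actual
postprocessing code), the check amounts to a set-membership test over the
space-delimited segments of a pronunciation. The function name `is_permitted`
and the sample inventory are hypothetical.

```python
def is_permitted(pron: str, phones: frozenset[str]) -> bool:
    """Returns True if every space-delimited segment is a permitted phone."""
    return all(segment in phones for segment in pron.split())


# Hypothetical inventory, for illustration only.
inventory = frozenset({"b", "a", "x"})
print(is_permitted("b a x", inventory))
print(is_permitted("b aː x", inventory))
```

The second call fails the check because `aː` is treated as a single segment
that is absent from the sample inventory.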

What they filter
----------------

There are several types of segments which should be filtered by a `.phones`
file:

-   Typos
-   Invalid IPA transcriptions (e.g., extra length indicated with /ːː/)
-   Non-native segments (e.g., a transcription of *Bach* as ending in the
    voiceless velar fricative /x/)

When creating a `.phones` file for broadly transcribed data, the goal is often
to create something that approximates a list of phonemes. However, some
segments that are properly considered pure allophones nonetheless appear in
broad transcriptions. Such segments may be quite frequent in the data, and
removing all pronunciations that contain them would greatly reduce the amount
of available data. Therefore, we prefer to simply add a comment of the form
`# Allophone of ...`; such annotations will ultimately be used to improve
Wiktionary itself.

Wiktionary has [transcription
guidelines](https://en.wiktionary.org/wiki/Appendix:English_pronunciation) for
many languages, which you can reference when building a `.phones` file; you can
also refer to the [Phoible](https://phoible.org/) inventories, but these often
line up poorly with the language-specific Wiktionary transcription guidelines.

How to submit a `.phones` file
------------------------------

We welcome user submissions for `.phones` files from linguists. Note that we use
the [fork and pull](../../CONTRIBUTING.md) model for contributions.

1.  Make a list of all phones or phonemes, in descending-frequency order, using
    the appropriate file in [`../scrape/tsv`](../scrape/tsv). The script
    [`list_phones.py`](lib/list_phones.py) is available to facilitate this
    step. From [`lib`](lib), running
    `./list_phones.py ../../scrape/tsv/<some-TSV-file> > foo.phones`
    generates a `foo.phones` file, which you can then edit in the following
    steps.
2.  Remove typos, invalid IPA transcriptions, and non-native segments. The
    `.phones` file generated by `list_phones.py` shows (as comments signaled by
    `#` for each phone/phoneme) the frequency of each phone/phoneme and several
    of its example word-pronunciation pairs, which should help you decide which
    phones/phonemes to remove. For the phones or phonemes to retain, remove the
    comments of counts and example word-pronunciation pairs.
3.  For a broadly transcribed list, add comments about allophony.
4.  Run [`postprocess`](postprocess).
5.  In [`../scrape`](../scrape) run `./scrape --restriction=<your-lang> &&
    ./postprocess`. This may take a while.
6.  Add the `.phones` file, the filtered `.tsv` file(s), and the summary files
    using `git add`. The `.phones` file must use the [NFC Unicode
    normalization](https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization).
    If you used [`list_phones.py`](lib/list_phones.py) to create the `.phones`
    file, then it should be in this form already. Otherwise, in [`lib`](lib),
    you can run `./normalize.py --norm NFC <your-file>` to put your file in the
    correct form.
7.  Commit using `git commit`, push to your branch using `git push`, and then
    file a pull request.
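
Before committing, it may help to confirm that the file is already in NFC. The
following is a minimal sketch using only the standard library; the helper name
`is_nfc` is hypothetical and is not part of this repository.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Returns True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


decomposed = "a\u0303"  # "a" followed by a combining tilde.
print(is_nfc(decomposed))  # False
print(is_nfc(unicodedata.normalize("NFC", decomposed)))  # True
```

Reading the whole `.phones` file as UTF-8 and passing its contents to such a
check shows whether running `normalize.py` is necessary.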

The `.phones` file format is a UTF-8 encoded file with one segment per line,
with optional comments formatted as two spaces, `#`, one space, and then a
sentence or sentence fragment with appropriate punctuation (e.g.,
`tʰ  # Allophone of /t/.`). Please include a blank line at the end of the file.
The `.phones` file should have the same name as the corresponding TSV file, but
with a `.phones` extension instead of `.tsv`.
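
The line format described above can be read with a few lines of Python. This is
only an illustrative sketch, assuming that `#` never occurs inside a segment;
the function name `parse_phones_line` is hypothetical.

```python
def parse_phones_line(line: str) -> tuple[str, str]:
    """Splits a line into (segment, comment); either part may be empty."""
    segment, _, comment = line.rstrip("\n").partition("#")
    return segment.strip(), comment.strip()


print(parse_phones_line("tʰ  # Allophone of /t/.\n"))
print(parse_phones_line("a\n"))
print(parse_phones_line("# Based on\n"))
```

Lines whose segment part is empty are comment-only or blank and do not
contribute a phone.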


================================================
FILE: data/phones/README.md
================================================
See the [HOWTO](HOWTO.md) for the steps to generate phone lists.
| Link | ISO 639-3 Code | ISO 639 Language Name | Wiktionary Language Name | Narrow/broad | # of phones |
| :---- | :---- | :---- | :---- | :---- | ----: |
| [phone](phones/ady_narrow.phones) | ady | Adyghe | Adyghe | Narrow | 67 |
| [phone](phones/afr_broad.phones) | afr | Afrikaans | Afrikaans | Broad | 61 |
| [phone](phones/aze_narrow.phones) | aze | Azerbaijani | Azerbaijani | Narrow | 56 |
| [phone](phones/ben_dhaka_broad.phones) | ben | Bengali | Bengali (Dhaka) | Broad | 98 |
| [phone](phones/ben_rarh_broad.phones) | ben | Bengali | Bengali (Rarh, Standard Bengali) | Broad | 99 |
| [phone](phones/bul_broad.phones) | bul | Bulgarian | Bulgarian | Broad | 52 |
| [phone](phones/cym_nw_broad.phones) | cym | Welsh | Welsh (North Wales) | Broad | 63 |
| [phone](phones/cym_sw_broad.phones) | cym | Welsh | Welsh (South Wales) | Broad | 55 |
| [phone](phones/deu_broad.phones) | deu | German | German | Broad | 83 |
| [phone](phones/ell_broad.phones) | ell | Modern Greek (1453-) | Greek | Broad | 33 |
| [phone](phones/eng_uk_broad.phones) | eng | English | English (UK, Received Pronunciation) | Broad | 64 |
| [phone](phones/eng_us_broad.phones) | eng | English | English (US, General American) | Broad | 63 |
| [phone](phones/fra_broad.phones) | fra | French | French | Broad | 40 |
| [phone](phones/hbs_broad.phones) | hbs | Serbo-Croatian | Serbo-Croatian | Broad | 65 |
| [phone](phones/hin_broad.phones) | hin | Hindi | Hindi | Broad | 64 |
| [phone](phones/hun_narrow.phones) | hun | Hungarian | Hungarian | Narrow | 86 |
| [phone](phones/hye_e_narrow.phones) | hye | Armenian | Armenian (Eastern Armenian) | Narrow | 74 |
| [phone](phones/hye_w_narrow.phones) | hye | Armenian | Armenian (Western Armenian) | Narrow | 75 |
| [phone](phones/isl_broad.phones) | isl | Icelandic | Icelandic | Broad | 71 |
| [phone](phones/ita_broad.phones) | ita | Italian | Italian | Broad | 32 |
| [phone](phones/jpn_narrow.phones) | jpn | Japanese | Japanese | Narrow | 64 |
| [phone](phones/kat_broad.phones) | kat | Georgian | Georgian | Broad | 36 |
| [phone](phones/khm_broad.phones) | khm | Khmer | Khmer | Broad | 73 |
| [phone](phones/kor_narrow.phones) | kor | Korean | Korean | Narrow | 61 |
| [phone](phones/lat_clas_broad.phones) | lat | Latin | Latin (Classical) | Broad | 36 |
| [phone](phones/lav_narrow.phones) | lav | Latvian | Latvian | Narrow | 89 |
| [phone](phones/mlt_broad.phones) | mlt | Maltese | Maltese | Broad | 61 |
| [phone](phones/mya_broad.phones) | mya | Burmese | Burmese | Broad | 70 |
| [phone](phones/nld_broad.phones) | nld | Dutch | Dutch | Broad | 50 |
| [phone](phones/nob_broad.phones) | nob | Norwegian Bokmål | Norwegian Bokmål | Broad | 72 |
| [phone](phones/por_bz_broad.phones) | por | Portuguese | Portuguese (Brazil) | Broad | 45 |
| [phone](phones/por_po_broad.phones) | por | Portuguese | Portuguese (Portugal) | Broad | 44 |
| [phone](phones/ron_narrow.phones) | ron | Romanian | Romanian | Narrow | 51 |
| [phone](phones/slv_broad.phones) | slv | Slovenian | Slovene | Broad | 48 |
| [phone](phones/spa_ca_broad.phones) | spa | Spanish | Spanish (Castilian) | Broad | 29 |
| [phone](phones/spa_la_broad.phones) | spa | Spanish | Spanish (Latin America) | Broad | 28 |
| [phone](phones/tur_narrow.phones) | tur | Turkish | Turkish | Narrow | 51 |
| [phone](phones/vie_hanoi_narrow.phones) | vie | Vietnamese | Vietnamese (Hà Nội) | Narrow | 54 |
| [phone](phones/vie_hue_narrow.phones) | vie | Vietnamese | Vietnamese (Huế) | Narrow | 54 |
| [phone](phones/vie_saigon_narrow.phones) | vie | Vietnamese | Vietnamese (Saigon) | Narrow | 54 |


================================================
FILE: data/phones/lib/generate_summary.py
================================================
#!/usr/bin/env python

import csv
import json
import logging
import operator
import os

from typing import Any

LIB_DIRECTORY = os.path.dirname(os.path.realpath(__file__))
PHONES_DIRECTORY = os.path.normpath(os.path.join(LIB_DIRECTORY, os.pardir))
PHONES_README_PATH = os.path.join(PHONES_DIRECTORY, "README.md")
PHONES_SUMMARY_PATH = os.path.join(PHONES_DIRECTORY, "summary.tsv")
PHONES_PHONES_DIRECTORY = os.path.join(PHONES_DIRECTORY, "phones")
LANGUAGES_PATH = os.path.normpath(
    os.path.join(PHONES_DIRECTORY, os.pardir, "scrape/lib/languages.json")
)


def _handle_wiki_name(language: dict[str, Any], file_path: str) -> str:
    name = language["wiktionary_name"]
    if "dialect" in language:
        key = file_path[file_path.index("_") + 1 : file_path.rindex("_")]
        if not key:
            logging.info(
                "Failed to isolate key for dialect modifier in %r",
                file_path,
            )
        values = language["dialect"][key]
        if "|" in values:
            values = values.replace(" |", ",")
        name += f" ({values})"
    return name


def main() -> None:
    with open(LANGUAGES_PATH, "r", encoding="utf-8") as source:
        languages = json.load(source)
    readme_list = []
    phones_summaries = []
    for file_path in os.listdir(PHONES_PHONES_DIRECTORY):
        with open(
            f"{PHONES_PHONES_DIRECTORY}/{file_path}", "r", encoding="utf-8"
        ) as phone_list:
            # We exclude blank lines and comments.
            num_of_entries = sum(
                1
                for line in phone_list
                if line.strip() and not line.startswith("#")
            )
        iso639_code = file_path[: file_path.index("_")]
        if "broad" in file_path:
            transcription_level = "Broad"
        else:
            transcription_level = "Narrow"
        wiki_name = _handle_wiki_name(languages[iso639_code], file_path)
        row = [
            iso639_code,
            languages[iso639_code]["iso639_name"],
            wiki_name,
            transcription_level,
            num_of_entries,
        ]
        phones_summaries.append([file_path] + row)
        readme_list.append([f"[phone](phones/{file_path})"] + row)
    # Sorts by path to the `.phones` file.
    phones_summaries.sort(key=operator.itemgetter(0))
    readme_list.sort(key=operator.itemgetter(0))
    with open(PHONES_SUMMARY_PATH, "w", encoding="utf-8") as sink:
        tsv_writer_object = csv.writer(
            sink, delimiter="\t", lineterminator="\n"
        )
        tsv_writer_object.writerows(phones_summaries)
    # Writes the README.
    with open(PHONES_README_PATH, "w", encoding="utf-8") as sink:
        print(
            "See the [HOWTO](HOWTO.md) for the steps to generate phone lists.",
            file=sink,
        )
        print(
            "| Link | ISO 639-3 Code | ISO 639 Language Name "
            "| Wiktionary Language Name "
            "| Narrow/broad | # of phones |",
            file=sink,
        )
        print("| :---- " * 5 + "| ----: |", file=sink)
        for link, code, iso_name, wiki_name, ph, count in readme_list:
            print(
                f"| {link} | {code} | {iso_name} | {wiki_name} | {ph} "
                f"| {count:,} |",
                file=sink,
            )


if __name__ == "__main__":
    logging.basicConfig(
        format="%(filename)s %(levelname)s: %(message)s", level="INFO"
    )
    main()


================================================
FILE: data/phones/lib/list_phones.py
================================================
#!/usr/bin/env python
"""This script prints a tally of the phones/phonemes of a WikiPron TSV file.

For each phone/phoneme, this script prints:
- the phone/phoneme
- the number of words that have this phone/phoneme
- a few example word-pronunciation pairs for this phone/phoneme
"""

import argparse
import collections
import logging
import random
import unicodedata


import ipapy

_other_valid_ipa = frozenset(
    phone
    for phone in ipapy.UNICODE_TO_IPA.keys()
    if not ipapy.is_valid_ipa(unicodedata.normalize("NFD", phone))
)

_suffixed_other_valid_ipa = frozenset(
    phone + "ː" for phone in _other_valid_ipa
)

OTHER_VALID_IPA = _other_valid_ipa | _suffixed_other_valid_ipa


def _count_phones(filepath: str) -> dict[str, set[str]]:
    """Count the phones in the given TSV file.

    phone_to_examples as dict[str, set[str]] is the most straightforward
    data structure for our purposes. It's not memory-efficient
    (with the same word-pron pair appearing in different phones' sets),
    but anything fancier doesn't seem worth the work.
    """
    phone_to_examples = collections.defaultdict(set)
    with open(filepath, encoding="utf-8") as source:
        for line in source:
            line = line.strip()
            if not line:
                continue
            word, pron = line.split("\t", maxsplit=1)
            example = f"({word} | {pron})"
            phones = pron.split()
            for phone in phones:
                phone_to_examples[phone].add(example)
    return phone_to_examples


def _pick_examples_for_display(examples: set[str]) -> list[str]:
    """Pick examples of word-pron pairs for display.

    We could have exposed the maximum number of examples to display
    (set to be 3 now) for each phone to the command-line interface,
    but it doesn't seem worth it for the time being.
    """
    n_examples = min(len(examples), 3)
    # Using list() here because Python 3.9 has deprecated the use
    # of an _unordered_ set as the input to random.sample.
    return random.sample(list(examples), n_examples)


def _check_ipa_phonemes(phone_to_examples: dict[str, set[str]], filepath: str):
    """Checks whether the given phonemes are valid IPA.

    This will catch problematic phonemes, according to the current IPA standard
    supported by `ipapy`. In addition, it is likely to complain about highly
    specific allophones, which are likely to be present in languages which have
    highly phonetic representation of their phoneme inventory. For a current
    IPA chart, please see:

        https://www.internationalphoneticassociation.org/IPAcharts/IPA_chart_orig/IPA_charts_E.html
    """
    bad_ipa_phonemes = frozenset(
        phone
        for phone in phone_to_examples.keys()
        if not (
            ipapy.is_valid_ipa(unicodedata.normalize("NFD", phone))
            or phone in OTHER_VALID_IPA
        )
    )
    if len(bad_ipa_phonemes) and filepath.endswith("broad.tsv"):
        logging.warning("Found %d invalid IPA phones:", len(bad_ipa_phonemes))
        phoneme_id = 1
        for phoneme in bad_ipa_phonemes:
            bad_chars = [
                "[%d %04x %s %s]"
                % (i, ord(c), unicodedata.category(c), unicodedata.name(c))
                for i, c in enumerate(ipapy.invalid_ipa_characters(phoneme))
            ]
            logging.warning(
                "[%d] Non-IPA transcription: %s (%s)",
                phoneme_id,
                phoneme,
                " ".join(bad_chars),
            )
            phoneme_id += 1


def main(args: argparse.Namespace):
    phone_to_examples: dict[str, set[str]] = _count_phones(args.tsv_path)
    for phone, examples in sorted(
        phone_to_examples.items(), key=lambda x: len(x[1]), reverse=True
    ):
        print(
            f"{phone}\t# {len(examples):10,}: "
            f"{', '.join(_pick_examples_for_display(examples))}"
        )
    print(f"\n# unique phones: {len(phone_to_examples)}")
    _check_ipa_phonemes(phone_to_examples, args.tsv_path)


if __name__ == "__main__":
    logging.basicConfig(format="%(levelname)s: %(message)s", level="INFO")
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("tsv_path", help="path to TSV file")
    main(parser.parse_args())


================================================
FILE: data/phones/lib/normalize.py
================================================
#!/usr/bin/env python
"""In-place Unicode normalization.

Takes a file and applies the specified Unicode normalization "in place." In
order to avoid the issues of reading and writing to the same file at the same
time, this script puts the normalized version of the file argument in a
tempfile, then uses that tempfile to rewrite the original file."""

import argparse
import shutil
import tempfile
import unicodedata


def main(args: argparse.Namespace) -> None:
    with (
        open(args.path, "r", encoding="utf-8") as source,
        tempfile.NamedTemporaryFile(
            mode="w+", encoding="utf-8", delete=False
        ) as sink,
    ):
        for line in source:
            print(unicodedata.normalize(args.norm, line), end="", file=sink)
    shutil.move(sink.name, args.path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("path", help="file to modify")
    parser.add_argument(
        "--norm",
        choices=["NFC", "NFD", "NFKC", "NFKD"],
        required=True,
        help="desired Unicode normalization form",
    )
    main(parser.parse_args())


================================================
FILE: data/phones/phones/ady_narrow.phones
================================================
# Based on
#
# https://en.wikipedia.org/wiki/Adyghe_language
# https://en.wikipedia.org/wiki/Adyghe_phonology
# https://en.wikipedia.org/wiki/Help:IPA/Adyghe
a
ə
n
aː
p
j
t
b
w
ʁ
ħ
z
m
r
ʁʷ
s
ɬ
x
d
q
t͡ʃʼ
ɕ
t͡ʃ
l
ɡʷ
ʃ
f
d͡ʒ
χʷ
ʂ
ʔ
qʷ
ʑ
kʷʼ
χ
ʔʷ
ʐ
ʃʷ
t͡sʼ
pʼ
ʃʼ
ʒ
t͡s
kʲʼ  # Not found in all varieties.
kʷ
ɮ  # Allophone of /l/.
ʃʷʼ
tʼ
ʂʷ  # Allophone of /ʃʷ/.
kʲ  # Not found in all varieties.
d͡z
ɬʼ
t͡ʂ
ɣ
ɡʲ  # Not found in all varieties.
ʐʷ  # Allophone of /ʒʷ/.
sʼ  # Not found in all varieties.
t͡ʂʼ
t͡ʃʷ  # Not found in all varieties.
t͡sʷ
tʷʼ
ʒʷ
pʷʼ
k  # Restricted to loanwords.
kʼ
d͡zʷ
ʔʲ  # Not found in all varieties.



================================================
FILE: data/phones/phones/afr_broad.phones
================================================
# Based on:
# https://taalportaal.org/taalportaal/topic/pid/topic-14566637229843775
# https://en.wikipedia.org/wiki/Afrikaans_phonology
# https://en.wikipedia.org/wiki/Help:IPA/Afrikaans
# https://phoible.org/inventories/view/1395
ə
r
s
t
l
k
n
a
f
ɑː
i
χ
ɔ
d
ə̯  # Part of falling diphthong [ɪə̯].
b
m
p
ɪ  # Part of diphthong /ɪø/.
ʊ  # Part of diphthong /ʊə/.
ɛ
œ
v
u
i̯  # Part of diphthong [œi̯],[əi̯].
ŋ
ʒ
ɦ
j
ø  # Part of diphthong /ɪø/. 
u̯  # Part of diphthong [œu̯].
oː
y
ɑ  # Transcriptive variant of /a/.
x  # Allophone of /χ/.
ɛː
iː
ʋ  # Allophone or approximant of /v/.
o
e
aː  # Transcriptive variant of /ɑː/.
eː
ɔː
uː
ʃ
ɪ̯  # Part of diphthong [ɪ̯ø].
æː
ɨ  # Probably allophone of /i/.
ɡ  # Mostly used in loanwords but also used as allophone of /χ/.
əː
w
ɑ̃ː
ɛ̃ː
ɔ̃ː  # Not found in Wiktionary scraped data.
ø̯  # Part of diphthong /ɪø̯/.
yː  # Allophone of /y/.
œː
ʊ̯
ɫ
c  # Allophone of /k/, not found in Wiktionary scraped data.
ɐ  # Phone of /a/, used in older sources.



================================================
FILE: data/phones/phones/aze_narrow.phones
================================================
# Based on
# https://en.wikipedia.org/wiki/Azerbaijani_language
# https://en.wikipedia.org/wiki/Help%3AIPA%2FAzerbaijani
#####
# NOTE:
# Wikipedia had very limited information on Azerbaijani phonology, and there
# are a number of points I am unsure about. The transcriptions did not always
# match the orthography in the expected way, and due to the lack of data, I 
# could not tell if this was due to allophony, inconsistent orthography, or
# errors in the transcription. I'm also unsure whether all vowels and consonants 
# have long counterparts, or only some.
#####
# CONSONANTS
p
b
m
mː
f
v
t
tː
d
t͡s  # The orthography in the examples suggests that this is an allophone of /t͡ʃ/, but I could not find confirmation in the sources.
d͡z  # The orthography in the examples suggests that this is an allophone of both /g/ and /dʒ/, but I could not find confirmation in the sources.
n
nː
r  # Allophone of /ɾ/? Not mentioned in sources, but very common in the data.
rː
ɾ
s
sː
z
l
lː
ɫ  # Allophone of /l/? Not mentioned in sources, but very common in the data.
ɫː
t͡ʃ
d͡ʒ
ʃ
ʒ
ç
j
c
ɟ
k  # Apparently, only occurs in loanwords. Nevertheless, it is very common in the data.
kː
ɡ
ŋ
x
ɣ
χ  # Allophone of /x/.
h
# VOWELS
i
iː
y
e
eː
œ
œː
æ
æː
a  # Narrow transcription of /æ/.
aː
ɯ
u
uː
o
ɑ
ɑː



================================================
FILE: data/phones/phones/ben_dhaka_broad.phones
================================================
# References:
# 
#      Thompson, H.-R. 2020. Bengali: A Comprehensive Grammar. Routledge.
#      Khan, S. u. D. 2010. Bengali (Bangladeshi Standard). Journal of the International Phonetic Association 40: 221-225.
# 
## Vowels.
a
ɔ
o
i
u
e
æ  # Indian SBC.
ɛ  # Bangladeshi standard Bangla.
## Nasal vowels.
ã
ɔ̃
õ
ĩ
ũ
ẽ
æ̃
ɛ̃
## Offglides.
i̯
e̯
o̯
u̯
w
j
## Consonants.
n
nː
ɹ  # Bangladeshi standard Bangla.
f  # Bangladeshi standard Bangla.
k
kː
ʃ
ʃː
m
mː
t̪
t̪ː
b
bː
l
lː
ɡ
ɡː
p
pː
d̪
h
kʰ
kʰː
ŋ
d̪ʱ
d̪ʱː
ʈ
ʈː
ʈʰ
ʈʰː
t̪ʰ
t̪ʰː
gʱ
ɖ
ɖː
ɽ
pʰ
d͡ʒ
d͡ʒː
t͡ʃ
t͡ʃː
ʲ  # A glide.
kː
t͡ʃʰ
t͡ʃʰː
ɡʱ
t̪ʰː
ɖʱ
bʱ
d͡ʒʱ
d͡ʒʱː
s
ɽʱ
bʱː
ɡʱː
## Bangladeshi standard Bangla equivalents for: 
t̠ɕ  # [t͡ʃ].
t̠ɕː
t̠ɕʰ  # [t͡ʃʰ].
t̠ɕʰː
d̠ʑ  # [d͡ʒ].
d̠ʑː 
d̠ʑʱ  # [d͡ʒʱ].
d̠ʑʱː 
t  # [ʈ].
tː
tʰ  # [ʈʰ].
tʰː
d  # [ɖ].
dː
dʱ  # [ɖʱ].
ɸ  # [f].
## Added to cover Wiktionary pronunciation.
r
ɾ
ɦ
v



================================================
FILE: data/phones/phones/ben_rarh_broad.phones
================================================
# References:
# 
#      Thompson, H.-R. 2020. Bengali: A Comprehensive Grammar. Routledge.
#      Khan, S. u. D. 2010. Bengali (Bangladeshi Standard). Journal of the International Phonetic Association 40: 221-225.
# 
## Vowels.
a
ɔ
o
i
u
e
æ  # Indian SBC.
ɛ  # Bangladeshi standard Bangla.
## Nasal vowels.
ã
ɔ̃
õ
ĩ
ũ
ẽ
æ̃
ɛ̃
## Offglides.
i̯
e̯
o̯
u̯
w
j
## Consonants.
n
nː
ɹ  # Bangladeshi standard Bangla.
f  # Bangladeshi standard Bangla.
k
kː
ʃ
ʃː
m
mː
t̪
t̪ː
b
bː
l
lː
ɡ
ɡː
p
pː
d̪
d̪ː
h
kʰ
kʰː
ŋ
d̪ʱ
d̪ʱː
ʈ
ʈː
ʈʰ
ʈʰː
t̪ʰ
t̪ʰː
gʱ
ɖ
ɖː
ɽ
pʰ
d͡ʒ
d͡ʒː
t͡ʃ
t͡ʃː
ʲ  # A glide.
kː
t͡ʃʰ
t͡ʃʰː
ɡʱ
t̪ʰː
ɖʱ
bʱ
d͡ʒʱ
d͡ʒʱː
s
ɽʱ
bʱː
ɡʱː
## Bangladeshi standard Bangla equivalents for: 
t̠ɕ  # [t͡ʃ].
t̠ɕː
t̠ɕʰ  # [t͡ʃʰ].
t̠ɕʰː
d̠ʑ  # [d͡ʒ].
d̠ʑː 
d̠ʑʱ  # [d͡ʒʱ].
d̠ʑʱː 
t  # [ʈ].
tː
tʰ  # [ʈʰ].
tʰː
d  # [ɖ].
dː
dʱ  # [ɖʱ].
ɸ  # [f].
## Added to cover Wiktionary pronunciation.
r
ɾ
ɦ
v



================================================
FILE: data/phones/phones/bul_broad.phones
================================================
# Based on
# https://en.wikipedia.org/wiki/Bulgarian_phonology
# http://www.personal.rdg.ac.uk/~llsroach/phon2/b_phon/b_phon.htm
ə  # Allophone of /a/ and /ɤ/ in unstressed syllables.
ɛ
t̪
i
n
a
v
p
ʃ
o  # Allophone of /u/ and /ɔ/ in unstressed syllables.
r
s
o̝  # Allophone of /u/ and /ɔ/ in unstressed syllables.
z
ɫ
ɐ  # Allophone of /a/ and /ɤ/ in unstressed syllables.
t  # Variously used as either a variant of /t̪/ or as part of an affricate.
d̪
l  # Allophone of /ɫ/.
j
x
b
m
k
ɡ
u
ɾ  # Allophone of /r/.
ʒ
ɔ
sʲ
ɤ
rʲ
nʲ
lʲ
vʲ
tʲ
f
kʲ
dʲ
t͡s
bʲ
d  # Variously used as either a variant of /d̪/ or as part of an affricate.
pʲ
ɡʲ
ɱ  # Allophone of coda nasal before labiodental consonant. Surprisingly common in data.
zʲ
mʲ
t͡ʃ  # Only appears 3 times in data, presumably because it is being transcribed without the tie bar.
fʲ  # Never appears in the data, but both sources list it as a phoneme of Bulgarian.
d͡z  # Never appears in the data, presumably because it is being transcribed without the tie bar. Phonemic status debated.
t͡sʲ  # Never appears in the data, presumably because it is being transcribed without the tie bar.
d͡ʒ  # Never appears in the data, presumably because it is being transcribed without the tie bar.



================================================
FILE: data/phones/phones/cym_nw_broad.phones
================================================
# Based on
# https://en.wikipedia.org/wiki/Welsh_phonology
# https://en.wikipedia.org/wiki/Help:IPA/Welsh
# https://en.wikipedia.org/wiki/Welsh_orthography
#####
a
r
ɛ
n
d
l
ɔ
ɡ
s
ə
k
ɨ̯  # Used to transcribe diphthongs.
v
j
ɨ̞  # Alternate transcription of /ɨ/.
t
ð
ɪ
m
b
i̯  # Used to transcribe diphthongs.
e  # Used to transcribe diphthongs, sometimes in places where /ɛ/ should be used.
θ
w
ʊ
χ
p
ɬ
u̯  # Used to transcribe diphthongs.
h
ŋ
f
ɨ
aː
i  # Alternate transcription of /ɪ/.
ɨː
r̥
oː
iː
ʃ
uː
eː
u  # Alternate transcription of /ʊ/. Also used to transcribe diphthongs.
ŋ̊
n̥
m̥
ʒ  # Seems only to occur in the data when writing /dʒ/ without a tie.
d͡ʒ  # Dialectal, corresponding to /dj/ in other varieties. Otherwise, mainly occurs in loanwords.
t͡ʃ  # Dialectal, corresponding to /tj/ in other varieties. Otherwise, mainly occurs in loanwords.
# DIPHTHONGS
# Tie bars were not used to transcribe diphthongs in the data, so the 
# following do not actually appear as such in the data.
a͡i̯
a͡ɨ̯
aː͡ɨ̯
a͡u̯
ɛ͡u̯
ɛ͡i̯
ə͡ɨ̯
e͡ɨ̯  # Allophone of ə͡ɨ̯.
ə͡u̯
ɪ͡u̯
ɨ͡u̯
ɔ͡i̯
ɔ͡ɨ̯
ʊ̯͡ɨ



================================================
FILE: data/phones/phones/cym_sw_broad.phones
================================================
# Based on
# https://en.wikipedia.org/wiki/Welsh_phonology
# https://en.wikipedia.org/wiki/Help:IPA/Welsh
# https://en.wikipedia.org/wiki/Welsh_orthography
#####
a
r
ɛ
n
d
i̯  # Used to transcribe diphthongs.
l
ɔ
ɪ
ɡ
s
k
ə
v
t
ð
j
ʊ
m
b
w
θ
e  # Commonly used to transcribe diphthongs where /ɛ/ should be used.
aː
i  # Alternate transcription of /ɪ/.
χ
iː
eː
ɬ
h
p
u̯
oː
ŋ
f
r̥
ʃ
u  # Alternate transcription of /ʊ/.
uː
ŋ̊
z  # Mainly occurs in loanwords.
n̥
m̥
ʍ  # Allophonic; corresponds to /χw/.
ʒ  # Seems only to occur in the data when writing /dʒ/ without a tie.
d͡ʒ  # Dialectal, corresponding to /dj/ in other varieties. Otherwise, mainly occurs in loanwords.
t͡ʃ  # Dialectal, corresponding to /tj/ in other varieties. Otherwise, mainly occurs in loanwords.
# DIPHTHONGS
a͡i̯
a͡u̯
ɛ͡u̯
ɛ͡i̯
ə͡u̯
ɪ͡u̯
ɔ͡i̯
ʊ̯͡i



================================================
FILE: data/phones/phones/deu_broad.phones
================================================
# Based on:
# https://en.wiktionary.org/wiki/Appendix:German_pronunciation
# https://en.wikipedia.org/wiki/Standard_German_phonology
# https://en.wikipedia.org/wiki/Help:IPA/Standard_German
# https://en.wikipedia.org/wiki/Swiss_German#Phonology
# https://phoible.org/inventories/view/161
#
# Diphthongs are mostly transcribed without using tie bars.
t
n
a
ə
l
ɪ
s
k
ʁ
f
ɛ
ʃ
m
b
ɡ
p
ʊ
ç
ɪ̯  # Check whether it is a part of a diphthong /aɪ̯/.
d
aː
ɐ
v
z
ɔ
ɐ̯
iː
ʀ  # Allophone of /ʁ/.
eː
h
ŋ
oː
t͡s
n̩  # Probably allophone of /n/.
i
ʊ̯  # Part of a diphthong, /aʊ̯/.
ʔ
o
ɛː
uː
ʏ
e
yː
x
r  # Probably allophone of /ʁ/.
l̩  # Probably allophone of /l/.
u
øː
j
œ
χ  # Allophone of /x/.
ʏ̯  # Part of a diphthong /ɔʏ̯/.
y
p͡f
ŋ̩
m̩  # Probably allophone of /m/.
t͡ʃ
u̯
ɱ̩  # Allophone of /m/.
ø
tʰ  # Probably allophone of /t/.
ɔː  # Probably allophone of /ɔ/.
ŋ̍  # Probably allophone of /ŋ/.
t͜s  # A transcriptive variant of /t͡s/.
kʰ  # Allophone of /k/ in some northern varieties of German.
œː  # Swiss German.
ɘ  # Allophone of /ə/. 
ɑː  # Allophone of /aː/ in Standard Austrian pronunciation or northern German varieties influenced by Low German.
b̥  # Swiss German.
ɡ̊  # Allophone of /ɡ/ in German Standard German, phoneme of Swiss and Austrian Standard German.
ɛ̃
õ
d̥  # Allophone of /d/ in German Standard German, phoneme of Swiss and Austrian Standard German.
ɒː  # Low German.
ɾ  # Allophone of /ʁ/ in German-speaking Europe.
pʰ  # Allophone of /p/ in some northern varieties of German.
ɐ̯̯
y̯
z̥  # Allophone of /z/ in southern varieties of German.
a͡ʊ
œ
p͜f  # A transcriptive variant of /p͡f/.
ʋ  # Allophone of /v/ in southern varieties of German.



================================================
FILE: data/phones/phones/ell_broad.phones
================================================
# Based on:
# https://en.wiktionary.org/wiki/Appendix:Greek_pronunciation
#
# Most of the "Occasionally" forms are found in a few dozen words, so they'd
# be worth correcting at some point.
#
# There are also some cases of clear allophony that could be corrected.
a
e  # Occasionally /e̞, ɛ/.
i
o  # Occasionally /o̞, ɔ/.
u
s
n
ɲ  # Allophone of /n/.
ŋ  # Allophone of /n/.
m
t
p
k
c  # Allophone of /k/.
ɾ
l
ʎ  # Allophone of /l/.
r
f
z
ð
v
θ
ɣ  # Allophone of /ʝ/ or vice versa.
ʝ  # Occasionally /j/; allophone of /ɣ/ or vice versa.
x  # Allophone of /ç/ or vice versa.
ç  # Allophone of /x/ or vice versa.
d
b
ɡ
ɟ
ɱ  # Allophone of /m/.
i̯  # Used for the two diphthongs.



================================================
FILE: data/phones/phones/eng_uk_broad.phones
================================================
# TODO: This is not yet based closely on the guidelines:
# https://en.wiktionary.org/wiki/Appendix:English_pronunciation
## Vowels.
i
iː  #  Check.
ɪ
ɪː  #  Check.
ɪ̯   # Used for offglides.
u
uː  #  Check.
ʊ
ʊ̯   # Used for offglides.
ʌ
ɛ
ɛː  # Check.
ɜ
ɜː  # Check.
æ
æː  # Check.
e
eː  # Check.
o
oː  # Check.
ɔ
ɔː  # Check.
a
aː  # Check.
ɒ
ɑ
ɑː  # Check.
ɒ
ə
əː  # Check; used to spell <er, or, ur>.
ɚ  # Check; r-less?
ɝ  # Check; r-less?
ɝː  # Check; r-less?
## Consonants.
p
t
k
b
d
ɡ  # The IPA one.
ʔ  # Marginal.
m
n
ŋ
f
s
ʃ
θ
t͡ʃ
h
v
ð
z
ʒ
d͡ʒ
w
j
ʍ  # Marginal.
l
ɫ  # Allophone of /l/.
ɹ
r  # Common mistake for /ɹ/.
m̩  # Probably just allophone of /m/.
n̩  # Probably just allophone of /n/.
l̩  # Probably just allophone of /l/.



================================================
FILE: data/phones/phones/eng_us_broad.phones
================================================
# TODO: This is not yet based closely on the guidelines:
# https://en.wiktionary.org/wiki/Appendix:English_pronunciation
## Vowels.
i
iː  #  Check.
ɪ
ɪː  #  Check.
ɪ̯   # Used for offglides.
u
uː  #  Check.
ʊ
ʊ̯   # Used for offglides.
ʌ
ɛ
ɛː  # Check.
ɜ
ɜː  # Check.
æ
æː  # Check.
e
eː  # Check.
o
oː  # Check.
ɔ
ɔː  # Check.
a
aː  # Check.
ɒ
ɑ
ɑː  # Check.
ɒ
ə
ɚ  # Check.
ɝ  # Check.
ɝː  # Check.
## Consonants.
p
t
k
b
d
ɡ  # The IPA one.
ʔ  # Marginal.
m
n
ŋ
f
s
ʃ
θ
t͡ʃ
h
v
ð
z
ʒ
d͡ʒ
w
j
ʍ  # Marginal.
l
ɫ  # Allophone of /l/.
ɹ
r  # Common mistake for /ɹ/.
m̩  # Probably just allophone of /m/.
n̩  # Probably just allophone of /n/.
l̩  # Probably just allophone of /l/.



================================================
FILE: data/phones/phones/fra_broad.phones
================================================
# CONSONANTS
p
b
m
f
v
w
t
d
n
r  # Allophone of /ʁ/.
s
z
l
ʃ
ʒ
ɲ
j
ɥ
k
ɡ
ŋ  # Only occurs in loanwords.
ʁ
# ORAL VOWELS
i
y
e
ø
ɛ
ɛː  # Allophone of /ɛ/ in European French.
œ
a
ə
u
o
ɔ
ɑ  # Allophone of /a/ in European French.
# NASALIZED VOWELS
ɛ̃
œ̃  # Allophone of /ɛ̃/ in European French.
ɔ̃
ɑ̃
# The low tie bar is used in transcriptions to indicate that there is a 
# morpheme boundary within a syllable as in 
# *les uns les autres* /le.z‿œ̃ le.z‿otʁ/ 'each other (pl)' or 
# *d'habitude* /d‿a.bi.tyd/ 'usually.' Strictly speaking, it should not be
# included in a phonemic transcription, since syllabification is not part of
# the underlying form. However, for many entries, the only transcription given
# includes tie bars, so the tie bar is necessary if we don't want to filter 
# them out. Additionally, the tie bar is used when indicating the 
# pronunciation of a normally silent consonant in liaison, as in 
# "grand" /ɡʁɑ̃.t‿/.
‿



================================================
FILE: data/phones/phones/hbs_broad.phones
================================================
# Based on:
#
# https://en.wikipedia.org/wiki/Serbo-Croatian_phonology
# https://en.wikipedia.org/wiki/Help:IPA/Serbo-Croatian
#####
a
t
r
n
s
k
ʋ
l
p
d
m
j
ʃ
z
b
ɡ
ʎ
x
ʒ
f
ɲ
t͡s
t͡ʃ
ɕ
ʑ
t͡ɕ
d͡ʑ
d͡ʒ
v  # Allophone of /ʋ/.
# VOWELS
i
iː
ǐ
ǐː
î
îː
u
uː
ǔ
ǔː
û
ûː
e
eː
ě
ěː
ê
êː
o
oː
ǒ
ǒː
ô
ôː
a
aː
ǎ
ǎː
â
âː
r̩  # Often transcribed without the diacritic indicating that it is syllabic.
rː
ř
řː
r̂
r̂ː



================================================
FILE: data/phones/phones/hin_broad.phones
================================================
# Based on
#
# https://en.wikipedia.org/wiki/Hindustani_phonology
#####
ə
ɑː
ɾ
n
iː
t̪
k
ɪ
s
m
l
d̪
p
ʊ
ʋ
b
j
ɡ
d͡ʒ
eː
ʃ
uː
oː
ɦ
t͡ʃ
ʈ
kʰ
ɳ
d̪ʱ
bʱ
ŋ
ʂ
z
pʰ
ɽ
ɛː
ɖ
t̪ʰ
f
ɔː
x  # Mainly occurs in loanwords from Persian.
q  # Mainly occurs in loanwords from Persian.
t͡ʃʰ
ʈʰ
ɡʱ
d͡ʒʱ
ɣ  # Mainly occurs in loanwords from Persian.
ɽʱ
ɑ̃ː
ɖʱ
aː  # Alternate transcription of /ɑː/.
ə̃
õː
ẽː
r
ĩː
ũː
ʊ̃
ɪ̃
ɛ̃ː
# NASAL VOWELS
# There was conflicting information on the status of nasalization in Hindi,
# but it seems that in principle all of the oral vowels have a nasal
# counterpart. The following are the nasal vowels that did not appear in the data.
ĩː
ẽː
ũː
ɔ̃ː



================================================
FILE: data/phones/phones/hun_narrow.phones
================================================
# Based on https://en.wiktionary.org/wiki/Appendix:Hungarian_pronunciation
# Vowels.
# I have no information about the palatal glides so I've just introduced them
# where they're present in the data; I believe they're all the result of hiatus.
ɒ
ɒʲ
uʲ
ɒː
aː
aːʲ
a  # Does not occur in the data. "Mostly used in foreign words."
ɛ
ɛʲ
ɛː
eː
eːʲ
i
iʲ
iː
iːʲ
o
oʲ
oː
oːʲ
ø
øː
øːʲ
u
uʲ
uː
uːʲ
y
yː
yːʲ
# Consonants.
b
bː
t͡s
t͡sː
t͡ʃ
t͡ʃː
d
dː
d͡z
d͡zː
d͡ʒ
d͡ʒː
f
fː
ɡ
ɡː
ɟ
ɟː
h  # No geminate form in the data.
ɦ  # No geminate form in the data.
x
xː
j
jː
ç  # No geminate form in the data.
ʝ  # No geminate form in the data.
k
kː
l
lː
m
mː
ɱ  # No geminate form in the data.
n
nː
ɲ
ɲː
ŋ  # No geminate form in the data.
p
pː
r
rː
ʃ
ʃː
s
sː
t
tː
c
cː
v
vː
z
zː
ʒ
ʒː



================================================
FILE: data/phones/phones/hye_e_narrow.phones
================================================
# Based on https://en.wikipedia.org/wiki/Armenian_language#Phonology
# And based on the pronunciation script from Wiktionary: https://en.wiktionary.org/wiki/Module:hy-pronunciation
#
# Vowels; these have been recently reworked upstream to use acute accents.
ɑ
ɑ́
e
é
ə
ə́
o
ó
i
í
u
ú
ʏ  # Allophone that is automatically transcribed from word-medial յու /ju/.
ʏ́
# Consonants.
m
n
ŋ
pʰ
p
b
tʰ
t
d
kʰ
k
ɡ
t͡sʰ
t͡s
d͡z
t͡ʃʰ
t͡ʃ
d͡ʒ
f
v
s
z
ʃ
ʒ
χ
ʁ
h
l
j
r
ɾ
#
# Past errors: some instances of [χ] were incorrectly transcribed as [x].
# Past errors: some affricates were missing a tie-bar.
#  * <ց> [tʃʰ]
#  * <չ> [tsʰ]
# These were fixed.
#
# Long consonants: A sequence of identical consonants is phonetically pronounced as a
# long consonant. Armenian does not have phonemic consonant length or geminates.
mː
nː
pʰː
pː
bː
tʰː
tː
dː
kʰː
kː
ɡː
t͡sʰː
fː
vː
sː
zː
ʃː
χː
ʁː
lː
ɾː
#
# As an accidental gap, as of Jan 2021, there are no Wiktionary entries that have the
# following long consonants:
t͡sː
d͡zː
t͡ʃʰː
t͡ʃː
d͡ʒː
ʒː
hː
jː
# This is just an accidental gap. The above long consonants can exist in Armenian.
# At some point in the future, the Armenian users might add some entries which contain
# these sequences.


================================================
FILE: data/phones/phones/hye_w_narrow.phones
================================================
# Based on https://en.wikipedia.org/wiki/Armenian_language#Phonology
# And based on the pronunciation script from Wiktionary: https://en.wiktionary.org/wiki/Module:hy-pronunciation
#
# Vowels; these have been recently reworked upstream to use acute accents.
ɑ
ɑ́
e
é
ə
ə́
o
ó
i
í
u
ú
ʏ  # Allophone that is automatically transcribed from word-medial յու /ju/.
ʏ́
# Consonants.
m
n
ŋ
pʰ
b
tʰ
d
kʰ
ɡ
t͡sʰ
d͡z
t͡ʃʰ
d͡ʒ
f
v
s
z
ʃ
ʒ
χ
ʁ
h
l
j
ɾ
r
# Western Armenian doesn't have phonemic voiceless unaspirated stops. But the hy-pron
# script generates unaspirated consonants after the /s/ segment.
p
t
k
t͡s
t͡ʃ
#
# Past errors: some instances of [χ] were incorrectly transcribed as [x].
# Past errors: some affricates were missing a tie-bar.
#  * <ց> [tʃʰ]
#  * <չ> [tsʰ]
# These were fixed.
#
# Long consonants: A sequence of identical consonants is phonetically pronounced as a
# long consonant. Armenian does not have phonemic consonant length or geminates.
mː
nː
pʰː
bː
tʰː
dː
kʰː
ɡː
t͡sʰː
fː
vː
sː
zː
ʃː
χː
ʁː
lː
ɾː
r
# As an accidental gap, as of Jan 2021, there are no Wiktionary entries that have the
# following long consonants:
t͡ʃʰː
d͡zː
d͡ʒː
ʒː
hː
jː
# Note that the voiceless unaspirated consonants are only possible as allophones
# in very restricted contexts:
pː
tː
kː
t͡sː
t͡ʃː
# These are all just accidental gaps. The above long consonants can in theory
# exist in Armenian. At some point in the future, the Armenian users might add some
# entries which contain these sequences.
#
# Western Armenian's phoneme inventory differs from Eastern Armenian in the following ways:
# 1) Eastern Armenian has a trill grapheme ռ and a flap grapheme ր. In Western Armenian,
# both graphemes are pronounced as a flap. Western Armenian does not have a trill.
# 2) The graphemes for voiceless unaspirated stops+affricates are pronounced as voiceless
# unaspirated in Eastern Armenian, but as voiced in Western Armenian: կապ [kap] in
# Eastern, [gab] in Western.
# 3) The graphemes for voiced stops+affricates are pronounced as voiced in Eastern
# Armenian, but as voiceless aspirated in Western Armenian: գահ [gah] in Eastern, [kʰah]
# in Western.
# * The graphemes for voiceless aspirated stops+affricates are pronounced as voiceless
# aspirated in both dialects: քուն [kʰun] in both.



================================================
FILE: data/phones/phones/isl_broad.phones
================================================
# Based on:
#
# https://en.wikipedia.org/wiki/Icelandic_phonology
# https://en.wikipedia.org/wiki/Help:IPA/Icelandic
# https://en.wikipedia.org/wiki/Icelandic_orthography
#####
# There is apparently a good deal of disagreement on the phonemic inventory of
# Icelandic. This file largely reflects the table labeled "Consonant phonemes"
# in the first link (including phonemes in parentheses), which gives an
# "orthographic" and "maximalist" approach to the analysis of the phonemic
# inventory of Icelandic.
#####
a
r
t
l
s
ɪ
n
ʏ
k
iː
p
uː
v
ð  # Allophone of /θ/.
h
m
ɛ
j
f
i
aː
o  # Used to transcribe diphthongs /ou/ and /ouː/.
e  # Used to transcribe diphthongs /ei/ and /eiː/.
c
u
ɛː
kʰ
ɔ
r̥
tʰ
ŋ
ɣ  # Allophone of /k/ and/or /kʰ/.
ɪː
œ
θ
ø  # Used as part of [øy] or [øyː], which are narrow transcriptions of /œ͡i/ and /œ͡iː/, respectively (see below).
nː
ɔː
pʰ
cʰ
œː
l̥
yː  # Used as part of [øyː], which is a narrow transcription of /œ͡iː/ (see below).
ʏː
n̥
sː
x  # Allophone of /kʰ/.
y  # Used as part of [øy], which is a narrow transcription of /œ͡i/ (see below).
ɲ  # Allophone of /ŋ/.
ç
kː  
cː
tː
mː
xʷ  # Dialectal; corresponds to standard /kʰv/.
rː
m̥  # Allophone of /m/.
lː
ŋ̊  # Allophone of /ŋ/.
pː
fː
#####
# Diphthongs
# The Wiktionary convention for Icelandic seems to be to transcribe diphthongs
# without using tie bars, so the following did not actually appear in the data.
a͡u
a͡uː
œ͡i  # Commonly transcribed as [øy] on Wiktionary.
œ͡iː  # Commonly transcribed as [øyː] on Wiktionary.
e͡i
e͡iː
o͡u
o͡uː
a͡i
a͡iː



================================================
FILE: data/phones/phones/ita_broad.phones
================================================
# Based on:
#
# https://en.wikipedia.org/wiki/Italian_phonology
# https://en.wiktionary.org/wiki/Appendix:Italian_pronunciation
# Loporcaro, M., & Bertinetto, P.M. (2005). The sound pattern of Standard
#   Italian, as compared with the varieties spoken in Florence. Journal of the
#   International Phonetic Association, 35(2): 131-151.
a
o
r
e
i
t
n
l
k
s
d
m
p
u
j
ɛ
b
f
v
ɔ
ɡ
t͡ʃ
d͡ʒ
z
t͡s
w
ʃ
u̯  # Allophone of /u/ in diphthongs.
d͡z
ɲ
ʎ
i̯  # Allophone of /i/ in diphthongs.



================================================
FILE: data/phones/phones/jpn_narrow.phones
================================================
# Based on
# https://en.wikipedia.org/wiki/Japanese_phonology
### 
a̠
i
k
ɯ̟ᵝ
o̞
o̞ː
s
ɕ
e̞
t
ɾ
ɨᵝ
m
n
ɡ
kʲ
ɴ
t͡s
ẽ̞
b
d
ã̠
h
j
ɾʲ
e̞ː
t͡ɕ
ĩ
ɰ̃
ɲ̟
ɨᵝː
ʑ
z
mʲ
d͡ʑ
ç
ŋ
ɰᵝ
õ̞
ɸ
ɡʲ
bʲ
d͡z
ɯ̟ᵝː
i̥
ɯ̟̊ᵝ
ɯ̟̃ᵝ
k̚  # Used in transcribing geminates, as in /k̚k/.
p
t̚  # Used in transcribing geminates, as in /t̚t/.
ŋʲ
iː
ɨ̥ᵝ
mː
p̚  # Used in transcribing geminates, as in /p̚p/.
nː
sː
ɕː
ɨ̃ᵝ
k̚ʲ  # Used in transcribing geminates, as in /k̚ʲkʲ/.
pʲ
mʲː
a̠ː
p̚ʲ  # Used in transcribing geminates, as in [p̚ʲpʲ].



================================================
FILE: data/phones/phones/kat_broad.phones
================================================
# Based on:
#
#     Robins, R. H., and Natalie Waterson. 1952. Notes on the phonetics of the
#     Georgian word. Bulletin of the School of Oriental and African Studies
#     141:55–72.
#
a  # Wikipedia has ɑ, but R&W say [ɑ] is a back allophone of /a/.
i
e  # Wikipedia has ɛ.
o  # Wikipedia has ɔ.
u
# CONSONANTS
r
l
s
d
m
n
b
tʼ
v
t
kʼ
ɡ
tʰ
x
ʃ
z
pʼ
pʰ
kʰ
qʼ
ɣ
ʒ
h
# */sʼ/ and */ʃʼ/ are not phonemes of Georgian, but are used to transcribe the
# ejective affricates /tsʼ/ and /tʃʼ/.
sʼ
ʃʼ
# AFFRICATES
t͡s
t͡sʼ
d͡z
t͡ʃ
t͡ʃʼ
d͡ʒ



================================================
FILE: data/phones/phones/khm_broad.phones
================================================
# Based on
# https://en.wikipedia.org/wiki/Khmer_language
# https://en.wikipedia.org/wiki/Khmer_script
### CONSONANTS ###
pʰ
p
ɓ
m
f  # Only occurs in loanwords.
ʋ
w  # Allophone of /ʋ/? Also used to transcribe diphthongs.
tʰ
t
ɗ
n
r
s
z  # Only occurs in loanwords.
l
cʰ
c
ɲ
j
kʰ
k
ɡ  # Only occurs in loanwords.
ŋ
ʔ
h
### VOWELS ###
i
iː
ĕ  # Part of diphthong /ĕ͡ə/.
e
eː
ɛː
a
aː
ɨ
ɨː
ə
əː
ŭ  # Part of diphthong /ŭ͡ə/.
u
uː
ŏ  # Part of diphthong /ŏ͡ə/.
o
oː
ɔː
ɑ
ɑː
# Supposedly, there is no /ɔ/ in Khmer, only /ɔː/. Nevertheless, the short
# vowel occurs slightly more often than the long vowel in the data, and 
# sometimes both appear in the same word, as in (ជ្រលង | c r ɔ l ɔː ŋ). I
# do not know why, although it might be allophonic shortening in pretonic
# syllables. Some (but by no means all) instances of /ɔ/ are part of the /ɔə/
# diphthong.
ɔ
# DIPHTHONGS
# Diphthongs were not transcribed with tie bars, so the following do not
# actually appear in the data.
i͡ə
e͡i
a͡e
a͡ə
a͡o
ɨ͡ə
ə͡ɨ
u͡ə
o͡u
ɔ͡ə
# Short diphthongs
ĕ͡ə
ŭ͡ə
ŏ͡ə
# Maybe-diphthongs (arguably vowel-glide sequences, but some of them seem to
# get their own graphemes).
a͡j
aː͡j
a͡w
ɨ͡w
ə͡w
eː͡j
u͡j
iəj
iəw
ɨəj
aoj
aəj
uəj



================================================
FILE: data/phones/phones/kor_narrow.phones
================================================
e̞
n
d͡ʑ  # Allophone of /j/.
i
o̞
kʰ    
pʰ
ɕ͈  # Allophone of /s͈/.
b
a̠
ŋ
d
ɭ
k͈
t͡ɕ͈  # The diacritic location seems to be incorrect. Probably, it should be [t͈͡ɕ].
m
p
j
ʌ̹
ɕʰ  # Allophone of /s/.
ɡ
ɾ
ɯ
t͡ɕ
w
sʰ
k
o
u
ɰ
a̠ː
k̚  # Allophone of /k/.
ɦ  # Allophone of /h/.
p̚  # Allophone of /p/.
t͈ 
ʝ  # Allophone of /h/.
s͈
t̚  # Allophone of most of the alveolar and alveolo-palatal consonants.
ʎ  # Sources other than Wiktionary use [l] rather than [ʎ] in onset position followed by [l].
ɲ  # Allophone of /n/.
t͡ɕʰ
ç  # Allophone of /h/.
p͈
ɥ
tʰ
β  # Allophone of /h/.
e̞ː
ɘː
o̞ː
ɣ  # Allophone of /h/.
x  # Allophone of /h/.
oː
uː
iː
ɯː
t
ɛː
ʃʰ  # Allophone of /s/.
h
ɸʷ  # Allophone of /h/.
ɸ  # Allophone of /h/.


================================================
FILE: data/phones/phones/lat_clas_broad.phones
================================================
# The upstream Latin data from Wiktionary no longer has broad transcription,
# though this file is being kept just in case it becomes relevant again.
b
d
f
ɡ
h
j
k
l
m
n
r
p
s
t
w
a
aː
e
eː
i
iː
o
oː
u
uː
u̯  # Offglide of the <au, eu> diphthongs.
e̯  # Offglide of the <ae, oe> diphthongs.
h
kʷ
ɡʷ
# And in Greek borrowings only:
pʰ
tʰ
kʰ
y
yː
z


================================================
FILE: data/phones/phones/lav_narrow.phones
================================================
# Based on
#
# https://en.wikipedia.org/wiki/Latvian_phonology
# https://en.wikipedia.org/wiki/Help%3AIPA%2FLatvian
# https://en.wikipedia.org/wiki/Latvian_orthography
### CONSONANTS ###
p
b
m
f  # Only occurs in loanwords.
v
w
t
d
n
r
ɾ  # Not mentioned in sources. Presumably /r/.
s
z
l
ʃ
ʒ
c
ɟ
ɲ
j
ʎ
k
ɡ
ŋ  # Allophone of nasals before velar consonants.
# Affricates
# Latvian has several affricates which were very consistently transcribed
# without tie bars, so the following do not actually appear as such in the 
# data.
t͡s
d͡z
t͡ʃ
d͡ʒ
### VOWELS ###
#    Vowels can be short or long. Stressed heavy syllables (consisting of a
# long vowel, a diphthong, or a vowel+sonorant sequence) have one of three
# possible tones.
#    Some vowel-suprasegmental combinations occurred very rarely in the data.
# I'm not sure if this is due to accidental gaps, transcription errors, or
# phonological restrictions that I'm not aware of.
#    /ɛ/ (and its variants with tone and length) does not occur in the sources,
# except as part of diphthongs. Based on character frequencies, it seems to
# correspond to /e/ in the sources, rather than /æ/ (both of which are written
# <e>). /ɛ/ occurs much more frequently than /e/ in the Wiktionary data 
# (although I have included /e/ and its variants as well).
#    /o/ (and its variants with tone) are not native sounds. However, they are
# used here to transcribe the /uo/ diphthong.
i
ī
ì
î
iː
īː
ìː
îː
e
ē
è
ê
eː
ēː
èː  # Did not occur in data.
êː
ɛ
ɛ̄
ɛ̀
ɛ̂
ɛː
ɛ̄ː
ɛ̀ː
ɛ̂ː
æ
ǣ
æ̀
æ̂
æː
ǣː
æ̀ː
æ̂ː
a
ā
â
à
aː
āː
àː
âː
u
ū
ù
û
uː
ūː
ùː
ûː
o
ō
ò
ô
oː  # Only occurs in loanwords.
# Diphthongs
# Latvian supposedly has the following diphthongs, although tie bars were
# generally not used in these data, so the following do not actually occur.
i͡ɛ
i͡u
ɛ͡i
ɛ͡u
a͡i
a͡u
u͡i
u͡ɔ


================================================
FILE: data/phones/phones/mlt_broad.phones
================================================
# Based on
# https://en.wikipedia.org/wiki/Maltese_language
# https://en.wikipedia.org/wiki/Help:IPA/Maltese
#####
# CONSONANTS
# I am not sure whether all consonants have geminate counterparts. I only 
# included those that occurred more than once.
p
pː
b
bː
m
mː
f
fː
v
w
t
tː
d
t͡s
n
nː
r
s
sː
z
zː
l
lː
t͡ʃ
d͡z
d͡ʒ
ʃ
ʃː
j
jː
k
kː
ɡ
ɣ
ħ
ʔ
# VOWELS
iː
ɪ
ɪː
ɛ
ɛː
# As far as I can tell, /ɐ ɐː/ and /a aː/ are used for the same pair of 
# vowels. The latter pair is more frequent, so /ɐ/ should probably be changed
# to /a/.
ɐ
ɐː
a
aː
u
uː
ɔ
ɔː
# PHARYNGEALIZED VOWELS
# The digraph <għ> is in many cases realized as pharyngealization on an
# adjacent vowel. I don't know if all vowels have pharyngeal counterparts; I
# included only those that occurred more than once.
ɛˤː
əˤ
aˤ
aˤː
ɔˤː
# DIPHTHONGS
# Maltese supposedly has seven diphthongs, although I am not sure if they are
# phonemic. They were not transcribed with tie bars in the data, so the 
# following do not actually appear as such.
a͡ɪ
a͡ʊ
ɛ͡ɪ
ɛ͡ʊ
ɪ͡ʊ
ɔ͡ɪ
ɔ͡ʊ



================================================
FILE: data/phones/phones/mya_broad.phones
================================================
# Based on
# https://en.wikipedia.org/wiki/Burmese_phonology
# https://en.wikipedia.org/wiki/Help%3AIPA%2FBurmese
### CONSONANTS
pʰ
p
b
m̥
m
θ
ð  # Allophone of /θ/ according to <https://en.wikipedia.org/wiki/Burmese_phonology#ref_1>, but no citation is provided.
tʰ
t
d
n̥
n
sʰ
s
z
ɹ  # Only occurs in loanwords.
l̥
l
t͡ɕʰ
t͡ɕ
d͡ʑ
ɲ̊
ɲ
ʃ
j
kʰ
k
ɡ
ŋ̊
ŋ
ʍ
w
ɴ
ʔ
h
###
# VOWELS
ì
í
ḭ
ɪ  # Allophonic, see note. Also occurs when transcribing /aɪ/ and [eɪ].
ɪ̀  # Allophonic, see note.
ɪ́  # Allophonic, see note.
ɪ̰  # Allophonic, see note.
e  # See note.
è
é
ḛ
ɛ
ɛ̀
ɛ́
ɛ̰
ə
a
à
á
a̰
ù
ú
ṵ
ʊ  # Allophonic, see note. Also occurs when transcribing [oʊ] and [aʊ].
ʊ̀  # Allophonic, see note.
ʊ́  # Allophonic, see note.
ʊ̰  # Allophonic, see note.
o  # See note.
ò
ó
o̰
ɔ̀
ɔ́
ɔ̰
a͡ɪ  # Does not occur in data, because tie bars were not used for diphthongs.
###
# A NOTE ON RIMES
###
# [ɪ eɪ ʊ oʊ aʊ] are allophones of /i e u o ɔ/ that occur in closed syllables.
# /N ʔ/ are the only consonants that occur in coda position. When /ʔ/ is the 
# coda, the preceding vowel lacks phonemic tone (note that <̰> counts as a
# "tone"). Additionally, when a diphthong occurs with tone, it seems that the
# tone is generally marked on the first symbol in the diphthong (e.g. /àɪ/).
# Thus, as far as I can tell, the only cases in which a vowel symbol (other than
# /ə/, which is always toneless) will occur without a tone diacritic are when 
# it occurs before /ʔ/ and when it is the second element in a diphthong.
# Finally, diphthongs were never transcribed with tie bars.



================================================
FILE: data/phones/phones/nld_broad.phones
================================================
ə
r
t
n
s
l
k
ɑ
ɛ
aː
b
d
m
oː
eː
i
p
ɪ
x
ɔ
v
f
ɣ
i̯
z
ʋ
ɦ
ŋ
u
ʏ
y̯
œ
j
u̯
y
øː
# Non-native obstruents: https://en.wikipedia.org/wiki/Dutch_phonology#Obstruents
ʃ
ʒ
ɡ
# Non-native vowels: https://en.wikipedia.org/wiki/Dutch_phonology#Monophthongs
iː
yː
uː
ɛː
œː
ɔː
# TODO(kbg): research these.
a
e
ʌ
o
ɪ̯



================================================
FILE: data/phones/phones/nob_broad.phones
================================================
# Based on
#
### CONSONANTS ###
p
pː
b
bː
m
mː
f
fː
v  # Allophone of /ʋ/.
ʋ
t
tː
d
dː
n
nː
r
rː
ɾ  # Less common (but apparently, more phonetically accurate) transcription of /r/.
s
sː
l
lː
ʃ  # Less common transcription of /ʂ/.
ʃː
ʈ  # Underlyingly /rt/.
ʈː
ɳ  # Underlyingly /rn/.
ʂ
ʂː
ɽ
ç
j
jː
k
kː
ɡ
ɡː
ŋ
ŋː
h
# Syllabic consonants
n̩
### VOWELS ###
i  # Less common transcription of /ɪ/.
iː
y
yː
ɪ
ʏ  # Slightly less common transcription of /y/.
e
eː
ø
øː
ɛ  # Less common transcription of /e/.
œ  # Less common transcription of /ø/.
æ  # Possibly an allophone of /e/. Also used in transcribing diphthongs (see below).
æː
a  # Less common transcription of /ɑ/.
aː  # Less common transcription of /ɑː/.
ʉ
ʉː
ə
u
uː
ʊ
o  # Less common transcription of /ɔ/.
oː
ɔ
ɑ
ɑː
# DIPHTHONGS
# Norwegian has several diphthongs, which were not transcribed with tie bars.
# As such, the following do not appear in the data.
œ͡ʏ
æ͡ɪ
æ͡ʉ
### PITCH ACCENT ###
# Norwegian has pitch accent. However, it is either very rare, or is not 
# transcribed very often on Wiktionary. Additionally, it is transcribed with
# non-standard IPA (<¹> and <²>), so it is not included here. The pitch accent
# is not reflected in the orthography.



================================================
FILE: data/phones/phones/por_bz_broad.phones
================================================
# Based on:
#
# https://en.wikipedia.org/wiki/Portuguese_phonology
#
# Ordinary vowels and glides.
a
i
e
o
u
ɐ
ɔ
ɛ
ø
j
w
# Nasalized vowels and glides.
# Note the upstream template occasionally generates "double-nasalization",
# i.e., vowels followed by two combining tildes. This is ignored since 
# it's not clear what this means.
ɐ̃
ẽ
ĩ
õ
ũ
ɔ̃
ɛ̃
ã
j̃
w̃
# Consonants.
ɾ
s
k
t
d
m
l
n
ʁ
t͡ʃ
p
z
ʃ
b
f
ɡ
kʷ
ɡʷ  # This is <gu> in European, apparently.
d͡ʒ
v
ɻ
ʒ
ɲ
ʎ


================================================
FILE: data/phones/phones/por_po_broad.phones
================================================
# Based on:
#
# https://en.wikipedia.org/wiki/Portuguese_phonology
#
# Ordinary vowels and glides.
a
i
e
o
u
ɐ
ɔ
ɛ
ø
j
w
# Nasalized vowels and glides.
# Note the upstream template occasionally generates "double-nasalization",
# i.e., vowels followed by two combining tildes. This is ignored since 
# it's not clear what this means.
ɐ̃
ẽ
ĩ
õ
ũ
ɔ̃
ɛ̃
ã
j̃
w̃
# Consonants.
ɾ
s
k
t
d
m
l
n
ʁ
t͡ʃ
p
z
ʃ
b
f
ɡ
kʷ
d͡ʒ
v
ɻ
ʒ
ɲ
ʎ


================================================
FILE: data/phones/phones/ron_narrow.phones
================================================
# Based on https://en.wikipedia.org/wiki/Help:IPA/Romanian
# Vowels.
a
e
ə
i
iː  # This is not mentioned but seems to be <ii>. TODO: check this.
ɨ
o
u
e̯  # Glide portion of a diphthong.
o̯  # Glide portion of a diphthong.
# Consonants.
# I don't have any information on the palatal glides so I'm just listing the
# ones present in the data.
b
bʲ
d
dʲ
d͡ʒ
d͡ʒʲ
f
fʲ
ɡ   # No palatal variant.
h   # No palatal variant.
k
kʲ
l
lʲ
m
mʲ
n
nʲ
ŋ  # No palatal variant.
p
pʲ
r
rʲ
s
sʲ
ʃ
ʃʲ
t
tʲ
t͡s
t͡sʲ
t͡ʃ
t͡ʃʲ
v
vʲ
z
zʲ
ʒ
ʒʲ
j
w  # No palatal variant.


================================================
FILE: data/phones/phones/slv_broad.phones
================================================
# Based on
# https://en.wikipedia.org/wiki/Slovene_phonology
#####
t
a
i
r
n
s
k
l
ɔ
j
ʋ
ɛ
p
m
d
áː
b
àː
ə
ʃ
íː
ɡ
t͡ʃ
ìː
z
óː
t͡s
éː
èː
u
x
á
ʒ
úː
ùː
òː
ɔ̀ː
ɛ̀ː
f
ɔ́
ə́
ɛ́
ɛ́ː
ɔ́ː
ə̀
í
d͡ʒ
ú



================================================
FILE: data/phones/phones/spa_ca_broad.phones
================================================
# * Following practice on Wiktionary we assume <ll> /ʝ/.
# * Following practice on Wiktionary we assume <hue> /w̝e/.
# * In this dialect, <c, z> before <i, e> /θ/.
a
e
i
o
u
j
w
w̝
l
ʝ
ɾ
r
β
f
θ
s
ʃ
x
p
b
t
d
k
ɡ
t͡ʃ
m
n
ɲ
ŋ
# t͡ɬ is found in Meso-American placenames (etc.) but is not properly handled
# by the pronunciation templates.


================================================
FILE: data/phones/phones/spa_la_broad.phones
================================================
# * Following practice on Wiktionary we assume <ll> /ʝ/.
# * Following practice on Wiktionary we assume <hue> /w̝e/.
a
e
i
o
u
j
w
w̝
l
ʝ
ɾ
r
β
f
s
ʃ
x
p
b
t
d
k
ɡ
t͡ʃ
m
n
ɲ
ŋ
# t͡ɬ is found in Meso-American placenames (etc.) but is not properly handled
# by the pronunciation templates.


================================================
FILE: data/phones/phones/tur_narrow.phones
================================================
# Based on
# https://en.wikipedia.org/wiki/Turkish_phonology
# https://en.wikipedia.org/wiki/Help:IPA/Turkish
# CONSONANTS
p
b
m
β
f
v
t
d
n
r  # Allophone of /ɾ/? Unclear.
ɾ̝̊
ɾ
s
z
l
ɫ
t͡ʃ
d͡ʒ
ʃ
ʒ
c
ɟ
ɳ
j
k
ɡ
ŋ
ɰ
h
# VOWELS
i
iː
y
ɪ
ʏ
e
e̞  # Phonetically accurate, but occurs rarely in the data. Probably should be changed to [e] for consistency.
ø
ɛ
œ
æ
a
aː
ɯ
u
uː
ʊ
o
o̞
ɔ  # This vowel wasn't used in the sources, but it occurs frequently enough to be worth including.
ɑ
ɑː



================================================
FILE: data/phones/phones/vie_hanoi_narrow.phones
================================================
# Based on
#
# https://en.wikipedia.org/wiki/Vietnamese_phonology
# https://en.wikipedia.org/wiki/Help:IPA/Vietnamese
# https://en.wikipedia.org/wiki/Vietnamese_alphabet
#####
ʔ
ə
˧˧
˧˦
w
aː
˧˨  # Should be ˧ˀ˨ (see below).
n
i
˨˩
j  # Only occurs in coda position.
ŋ
a
m
˧˩
k
ɨ
t
t͡ɕ
z
ɗ
t̚
h
o
ɓ
s
tʰ
l
u
ŋ͡m
˦˥  # Should be ˦ˀ˥ (see below).
k̚
əː
ɔ
v
ï
e
ɛ
f
ɲ
p̚
ŋ̟
k͡p̚
ʊ
x
k̟̚
ɣ
ɹ  # Only occurs in loanwords.
p  # Only occurs in onset position in loanwords.
# DIPHTHONGS
# Diphthongs in the Vietnamese data are transcribed without tie bars, so the
# following do not appear as such in the data.
i͡ə
ɨ͡ə
u͡e
# GLOTTALIZED TONES
˧ˀ˨  # Often transcribed /˧˨ʔ/ on Wiktionary.
˦ˀ˥



================================================
FILE: data/phones/phones/vie_hue_narrow.phones
================================================
# Based on
#
# https://en.wikipedia.org/wiki/Help:IPA/Vietnamese
#####
ʔ
w
˧˧
ə
ŋ
j
aː
˨˩
˦˧˥
˦˩
˧˨
ɨ
m
k̚
k
i
a
˨˩˦
t
n
ɗ
h
o
ɓ
tʰ
l
ɛ
ŋ͡m
ɲ
ɪ
ʊ
t͡ɕ  # The source gives /c/ for this.
ʂ
e
əː
ɔ
v
ʈ
u
f
p̚
k͡p̚
t̚
s
kʰ
ʐ
ɣ
ɹ
x
p  # Only occurs in onset position in loanwords.
z
# DIPHTHONGS
# Diphthongs in the Vietnamese data are transcribed without tie bars, so the
# following do not appear as such in the data.
i͡ə
ɨ͡ə
u͡e
# A NOTE ON TONES
# There is almost no information on Wikipedia about tones in central 
# Vietnamese dialects, so the tones here are just the tones used on Wiktionary.


================================================
FILE: data/phones/phones/vie_saigon_narrow.phones
================================================
˧˧
w
˦˥
ə
ŋ
j
aː
˨˩˨
ʔ
˨˩
˨˩˦
ɨ
m
k̚
a
n
i
t
k
ɗ
o
ɓ
tʰ
l
ŋ͡m
ʊ
h
ɪ
əː
c
ʂ
ɔ
v
ʈ
ɛ
f
k͡p̚
ɲ
p̚
e
u
t̚
s
kʰ
ɹ
ɣ
x
z
# FIXME: these are being mischunked.
⁽ʷ
⁾
p  # Only occurs in onset position in loanwords.
# Diphthongs in the Vietnamese data are transcribed without tie bars, so the
# following do not appear as such in the data.
i͡ə
ɨ͡ə
u͡e
# Other stuff.


================================================
FILE: data/phones/postprocess
================================================
#!/bin/bash
# Runs summary generation script.

set -eou pipefail

./lib/generate_summary.py


================================================
FILE: data/phones/summary.tsv
================================================
ady_narrow.phones	ady	Adyghe	Adyghe	Narrow	67
afr_broad.phones	afr	Afrikaans	Afrikaans	Broad	61
aze_narrow.phones	aze	Azerbaijani	Azerbaijani	Narrow	56
ben_dhaka_broad.phones	ben	Bengali	Bengali (Dhaka)	Broad	98
ben_rarh_broad.phones	ben	Bengali	Bengali (Rarh, Standard Bengali)	Broad	99
bul_broad.phones	bul	Bulgarian	Bulgarian	Broad	52
cym_nw_broad.phones	cym	Welsh	Welsh (North Wales)	Broad	63
cym_sw_broad.phones	cym	Welsh	Welsh (South Wales)	Broad	55
deu_broad.phones	deu	German	German	Broad	83
ell_broad.phones	ell	Modern Greek (1453-)	Greek	Broad	33
eng_uk_broad.phones	eng	English	English (UK, Received Pronunciation)	Broad	64
eng_us_broad.phones	eng	English	English (US, General American)	Broad	63
fra_broad.phones	fra	French	French	Broad	40
hbs_broad.phones	hbs	Serbo-Croatian	Serbo-Croatian	Broad	65
hin_broad.phones	hin	Hindi	Hindi	Broad	64
hun_narrow.phones	hun	Hungarian	Hungarian	Narrow	86
hye_e_narrow.phones	hye	Armenian	Armenian (Eastern Armenian)	Narrow	74
hye_w_narrow.phones	hye	Armenian	Armenian (Western Armenian)	Narrow	75
isl_broad.phones	isl	Icelandic	Icelandic	Broad	71
ita_broad.phones	ita	Italian	Italian	Broad	32
jpn_narrow.phones	jpn	Japanese	Japanese	Narrow	64
kat_broad.phones	kat	Georgian	Georgian	Broad	36
khm_broad.phones	khm	Khmer	Khmer	Broad	73
kor_narrow.phones	kor	Korean	Korean	Narrow	61
lat_clas_broad.phones	lat	Latin	Latin (Classical)	Broad	36
lav_narrow.phones	lav	Latvian	Latvian	Narrow	89
mlt_broad.phones	mlt	Maltese	Maltese	Broad	61
mya_broad.phones	mya	Burmese	Burmese	Broad	70
nld_broad.phones	nld	Dutch	Dutch	Broad	50
nob_broad.phones	nob	Norwegian Bokmål	Norwegian Bokmål	Broad	72
por_bz_broad.phones	por	Portuguese	Portuguese (Brazil)	Broad	45
por_po_broad.phones	por	Portuguese	Portuguese (Portugal)	Broad	44
ron_narrow.phones	ron	Romanian	Romanian	Narrow	51
slv_broad.phones	slv	Slovenian	Slovene	Broad	48
spa_ca_broad.phones	spa	Spanish	Spanish (Castilian)	Broad	29
spa_la_broad.phones	spa	Spanish	Spanish (Latin America)	Broad	28
tur_narrow.phones	tur	Turkish	Turkish	Narrow	51
vie_hanoi_narrow.phones	vie	Vietnamese	Vietnamese (Hà Nội)	Narrow	54
vie_hue_narrow.phones	vie	Vietnamese	Vietnamese (Huế)	Narrow	54
vie_saigon_narrow.phones	vie	Vietnamese	Vietnamese (Saigon)	Narrow	54


================================================
FILE: data/scrape/README.md
================================================
* Languages: 307
  * Broad transcription files: 307
  * Narrow transcription files: 175
* Dialects: 17
  * Broad transcription files: 22
  * Narrow transcription files: 22
* Scripts: 42
* Pronunciations: 3,904,091
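
The TSV file names linked in the table appear to encode the ISO 639-3 code, a four-letter script code, an optional dialect tag, the transcription level, and an optional `_filtered` suffix (for example, `ben_beng_dhaka_broad_filtered.tsv`). The following is a minimal sketch of how such a name could be split into its parts. It assumes this pattern, inferred from the rows listed here, holds for every file; `TsvName` and `parse_tsv_name` are hypothetical helpers and not part of WikiPron.

```python
from typing import NamedTuple, Optional


class TsvName(NamedTuple):
    """Components inferred from a scrape TSV file name (hypothetical type)."""

    iso639_3: str
    script: str
    dialect: Optional[str]
    transcription: str
    filtered: bool


def parse_tsv_name(filename: str) -> TsvName:
    """Splits a name such as 'tsv/ben_beng_dhaka_broad_filtered.tsv'.

    Assumes the layout <iso>_<script>[_<dialect>]_<broad|narrow>[_filtered].tsv,
    which is inferred from the table rows rather than documented behavior.
    """
    # Drop any leading directory such as "tsv/".
    stem = filename.rsplit("/", 1)[-1]
    if not stem.endswith(".tsv"):
        raise ValueError(f"Not a TSV file name: {filename}")
    parts = stem[: -len(".tsv")].split("_")
    # A trailing "filtered" component marks the filtered variant.
    filtered = parts[-1] == "filtered"
    if filtered:
        parts = parts[:-1]
    # What remains is iso, script, optional dialect, and transcription level.
    if len(parts) not in (3, 4) or parts[-1] not in ("broad", "narrow"):
        raise ValueError(f"Unexpected file name layout: {filename}")
    dialect = parts[2] if len(parts) == 4 else None
    return TsvName(parts[0], parts[1], dialect, parts[-1], filtered)
```

For instance, `parse_tsv_name("tsv/cym_latn_nw_broad.tsv")` would yield the dialect tag `nw`, which the table spells out as "North Wales"; the mapping from tags to full dialect names is not covered by this sketch.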


| Link | ISO 639-3 Code | ISO 639 Language Name | Wiktionary Language Name | Script | Dialect | Filtered | Narrow/Broad | # of entries |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | ----: |
| [TSV](tsv/aar_latn_broad.tsv) | aar | Afar | Afar | Latin |  | False | Broad | 1,584 |
| [TSV](tsv/aar_latn_narrow.tsv) | aar | Afar | Afar | Latin |  | False | Narrow | 1,548 |
| [TSV](tsv/abk_cyrl_broad.tsv) | abk | Abkhazian | Abkhaz | Cyrillic |  | False | Broad | 198 |
| [TSV](tsv/abk_cyrl_narrow.tsv) | abk | Abkhazian | Abkhaz | Cyrillic |  | False | Narrow | 841 |
| [TSV](tsv/acw_arab_broad.tsv) | acw | Hijazi Arabic | Hijazi Arabic | Arabic |  | False | Broad | 2,252 |
| [TSV](tsv/acw_arab_narrow.tsv) | acw | Hijazi Arabic | Hijazi Arabic | Arabic |  | False | Narrow | 711 |
| [TSV](tsv/ady_cyrl_narrow.tsv) | ady | Adyghe | Adyghe | Cyrillic |  | False | Narrow | 5,121 |
| [TSV](tsv/ady_cyrl_narrow_filtered.tsv) | ady | Adyghe | Adyghe | Cyrillic |  | True | Narrow | 4,893 |
| [TSV](tsv/afb_arab_broad.tsv) | afb | Gulf Arabic | Gulf Arabic | Arabic |  | False | Broad | 719 |
| [TSV](tsv/afr_latn_broad.tsv) | afr | Afrikaans | Afrikaans | Latin |  | False | Broad | 2,022 |
| [TSV](tsv/afr_latn_broad_filtered.tsv) | afr | Afrikaans | Afrikaans | Latin |  | True | Broad | 1,982 |
| [TSV](tsv/afr_latn_narrow.tsv) | afr | Afrikaans | Afrikaans | Latin |  | False | Narrow | 134 |
| [TSV](tsv/aii_syrc_narrow.tsv) | aii | Assyrian Neo-Aramaic | Assyrian Neo-Aramaic | Syriac |  | False | Narrow | 4,543 |
| [TSV](tsv/ajp_arab_broad.tsv) | ajp | South Levantine Arabic | South Levantine Arabic | Arabic |  | False | Broad | 3,124 |
| [TSV](tsv/ajp_arab_narrow.tsv) | ajp | South Levantine Arabic | South Levantine Arabic | Arabic |  | False | Narrow | 3,149 |
| [TSV](tsv/akk_latn_broad.tsv) | akk | Akkadian | Akkadian | Latin |  | False | Broad | 603 |
| [TSV](tsv/ale_latn_broad.tsv) | ale | Aleut | Aleut | Latin |  | False | Broad | 119 |
| [TSV](tsv/amh_ethi_broad.tsv) | amh | Amharic | Amharic | Ethiopic |  | False | Broad | 378 |
| [TSV](tsv/ang_latn_broad.tsv) | ang | Old English (ca. 450-1100) | Old English | Latin |  | False | Broad | 22,124 |
| [TSV](tsv/ang_latn_narrow.tsv) | ang | Old English (ca. 450-1100) | Old English | Latin |  | False | Narrow | 11,243 |
| [TSV](tsv/aot_latn_broad.tsv) | aot | Atong (India) | Atong (India) | Latin |  | False | Broad | 181 |
| [TSV](tsv/apw_latn_narrow.tsv) | apw | Western Apache | Western Apache | Latin |  | False | Narrow | 147 |
| [TSV](tsv/ara_arab_broad.tsv) | ara | Arabic | Arabic | Arabic |  | False | Broad | 13,339 |
| [TSV](tsv/ara_arab_narrow.tsv) | ara | Arabic | Arabic | Arabic |  | False | Narrow | 104 |
| [TSV](tsv/arc_hebr_broad.tsv) | arc | Official Aramaic (700-300 BCE) | Aramaic | Hebrew |  | False | Broad | 1,167 |
| [TSV](tsv/arg_latn_broad.tsv) | arg | Aragonese | Aragonese | Latin |  | False | Broad | 298 |
| [TSV](tsv/ary_arab_broad.tsv) | ary | Moroccan Arabic | Moroccan Arabic | Arabic |  | False | Broad | 2,043 |
| [TSV](tsv/arz_arab_broad.tsv) | arz | Egyptian Arabic | Egyptian Arabic | Arabic |  | False | Broad | 200 |
| [TSV](tsv/asm_beng_broad.tsv) | asm | Assamese | Assamese | Bengali |  | False | Broad | 2,925 |
| [TSV](tsv/ast_latn_broad.tsv) | ast | Asturian | Asturian | Latin |  | False | Broad | 1,018 |
| [TSV](tsv/ast_latn_narrow.tsv) | ast | Asturian | Asturian | Latin |  | False | Narrow | 986 |
| [TSV](tsv/ayl_arab_broad.tsv) | ayl | Libyan Arabic | Libyan Arabic | Arabic |  | False | Broad | 163 |
| [TSV](tsv/aze_latn_broad.tsv) | aze | Azerbaijani | Azerbaijani | Latin |  | False | Broad | 383 |
| [TSV](tsv/aze_latn_narrow.tsv) | aze | Azerbaijani | Azerbaijani | Latin |  | False | Narrow | 4,226 |
| [TSV](tsv/aze_latn_narrow_filtered.tsv) | aze | Azerbaijani | Azerbaijani | Latin |  | True | Narrow | 4,011 |
| [TSV](tsv/bak_cyrl_broad.tsv) | bak | Bashkir | Bashkir | Cyrillic |  | False | Broad | 179 |
| [TSV](tsv/bak_cyrl_narrow.tsv) | bak | Bashkir | Bashkir | Cyrillic |  | False | Narrow | 2,184 |
| [TSV](tsv/ban_bali_broad.tsv) | ban | Balinese | Balinese | Balinese |  | False | Broad | 410 |
| [TSV](tsv/bar_latn_broad.tsv) | bar | Bavarian | Bavarian | Latin |  | False | Broad | 1,542 |
| [TSV](tsv/bbl_geor_broad.tsv) | bbl | Bats | Bats | Georgian |  | False | Broad | 167 |
| [TSV](tsv/bbn_latn_broad.tsv) | bbn | Uneapa | Uneapa | Latin |  | False | Broad | 192 |
| [TSV](tsv/bcl_latn_broad.tsv) | bcl | Central Bikol | Bikol Central | Latin |  | False | Broad | 4,928 |
| [TSV](tsv/bcl_latn_narrow.tsv) | bcl | Central Bikol | Bikol Central | Latin |  | False | Narrow | 4,936 |
| [TSV](tsv/bdq_latn_broad.tsv) | bdq | Bahnar | Bahnar | Latin |  | False | Broad | 193 |
| [TSV](tsv/bel_cyrl_narrow.tsv) | bel | Belarusian | Belarusian | Cyrillic |  | False | Narrow | 5,516 |
| [TSV](tsv/ben_beng_broad.tsv) | ben | Bengali | Bengali | Bengali |  | False | Broad | 6,666 |
| [TSV](tsv/ben_beng_dhaka_broad.tsv) | ben | Bengali | Bengali | Bengali | Dhaka | False | Broad | 7,998 |
| [TSV](tsv/ben_beng_dhaka_broad_filtered.tsv) | ben | Bengali | Bengali | Bengali | Dhaka | True | Broad | 6,496 |
| [TSV](tsv/ben_beng_dhaka_narrow.tsv) | ben | Bengali | Bengali | Bengali | Dhaka | False | Narrow | 7,821 |
| [TSV](tsv/ben_beng_narrow.tsv) | ben | Bengali | Bengali | Bengali |  | False | Narrow | 5,933 |
| [TSV](tsv/ben_beng_rarh_broad.tsv) | ben | Bengali | Bengali | Bengali | Rarh, Standard Bengali | False | Broad | 4,980 |
| [TSV](tsv/ben_beng_rarh_broad_filtered.tsv) | ben | Bengali | Bengali | Bengali | Rarh, Standard Bengali | True | Broad | 4,123 |
| [TSV](tsv/ben_beng_rarh_narrow.tsv) | ben | Bengali | Bengali | Bengali | Rarh, Standard Bengali | False | Narrow | 6,474 |
| [TSV](tsv/bjb_latn_broad.tsv) | bjb | Banggarla | Barngarla | Latin |  | False | Broad | 136 |
| [TSV](tsv/blt_tavt_narrow.tsv) | blt | Tai Dam | Tai Dam | Tai Viet |  | False | Narrow | 239 |
| [TSV](tsv/bod_tibt_broad.tsv) | bod | Tibetan | Tibetan | Tibetan |  | False | Broad | 2,699 |
| [TSV](tsv/bre_latn_broad.tsv) | bre | Breton | Breton | Latin |  | False | Broad | 770 |
| [TSV](tsv/bua_cyrl_broad.tsv) | bua | Buriat | Buryat | Cyrillic |  | False | Broad | 125 |
| [TSV](tsv/bua_cyrl_narrow.tsv) | bua | Buriat | Buryat | Cyrillic |  | False | Narrow | 140 |
| [TSV](tsv/bul_cyrl_narrow.tsv) | bul | Bulgarian | Bulgarian | Cyrillic |  | False | Narrow | 42,309 |
| [TSV](tsv/cat_latn_broad.tsv) | cat | Catalan | Catalan | Latin |  | False | Broad | 176 |
| [TSV](tsv/cat_latn_narrow.tsv) | cat | Catalan | Catalan | Latin |  | False | Narrow | 92,225 |
| [TSV](tsv/cbn_thai_broad.tsv) | cbn | Nyahkur | Nyah Kur | Thai |  | False | Broad | 151 |
| [TSV](tsv/ceb_latn_broad.tsv) | ceb | Cebuano | Cebuano | Latin |  | False | Broad | 2,953 |
| [TSV](tsv/ceb_latn_narrow.tsv) | ceb | Cebuano | Cebuano | Latin |  | False | Narrow | 2,822 |
| [TSV](tsv/ces_latn_narrow.tsv) | ces | Czech | Czech | Latin |  | False | Narrow | 43,717 |
| [TSV](tsv/chb_latn_broad.tsv) | chb | Chibcha | Chibcha | Latin |  | False | Broad | 122 |
| [TSV](tsv/che_cyrl_broad.tsv) | che | Chechen | Chechen | Cyrillic |  | False | Broad | 172 |
| [TSV](tsv/cho_latn_broad.tsv) | cho | Choctaw | Choctaw | Latin |  | False | Broad | 124 |
| [TSV](tsv/chr_cher_broad.tsv) | chr | Cherokee | Cherokee | Cherokee |  | False | Broad | 103 |
| [TSV](tsv/cic_latn_broad.tsv) | cic | Chickasaw | Chickasaw | Latin |  | False | Broad | 286 |
| [TSV](tsv/ckb_arab_broad.tsv) | ckb | Central Kurdish | Central Kurdish | Arabic |  | False | Broad | 288 |
| [TSV](tsv/cnk_latn_broad.tsv) | cnk | Khumi Chin | Khumi Chin | Latin |  | False | Broad | 350 |
| [TSV](tsv/cop_copt_broad.tsv) | cop | Coptic | Coptic | Coptic |  | False | Broad | 820 |
| [TSV](tsv/cor_latn_broad.tsv) | cor | Cornish | Cornish | Latin |  | False | Broad | 174 |
| [TSV](tsv/cor_latn_narrow.tsv) | cor | Cornish | Cornish | Latin |  | False | Narrow | 706 |
| [TSV](tsv/cos_latn_broad.tsv) | cos | Corsican | Corsican | Latin |  | False | Broad | 476 |
| [TSV](tsv/crk_latn_broad.tsv) | crk | Plains Cree | Plains Cree | Latin |  | False | Broad | 108 |
| [TSV](tsv/crk_latn_narrow.tsv) | crk | Plains Cree | Plains Cree | Latin |  | False | Narrow | 144 |
| [TSV](tsv/crx_cans_broad.tsv) | crx | Carrier | Carrier | Canadian Aboriginal |  | False | Broad | 175 |
| [TSV](tsv/csb_latn_broad.tsv) | csb | Kashubian | Kashubian | Latin |  | False | Broad | 818 |
| [TSV](tsv/cym_latn_nw_broad.tsv) | cym | Welsh | Welsh | Latin | North Wales | False | Broad | 10,320 |
| [TSV](tsv/cym_latn_nw_broad_filtered.tsv) | cym | Welsh | Welsh | Latin | North Wales | True | Broad | 10,248 |
| [TSV](tsv/cym_latn_nw_narrow.tsv) | cym | Welsh | Welsh | Latin | North Wales | False | Narrow | 1,006 |
| [TSV](tsv/cym_latn_sw_broad.tsv) | cym | Welsh | Welsh | Latin | South Wales | False | Broad | 16,060 |
| [TSV](tsv/cym_latn_sw_broad_filtered.tsv) | cym | Welsh | Welsh | Latin | South Wales | True | Broad | 15,880 |
| [TSV](tsv/cym_latn_sw_narrow.tsv) | cym | Welsh | Welsh | Latin | South Wales | False | Narrow | 1,049 |
| [TSV](tsv/dan_latn_broad.tsv) | dan | Danish | Danish | Latin |  | False | Broad | 4,657 |
| [TSV](tsv/dan_latn_narrow.tsv) | dan | Danish | Danish | Latin |  | False | Narrow | 8,380 |
| [TSV](tsv/deu_latn_broad.tsv) | deu | German | German | Latin |  | False | Broad | 49,829 |
| [TSV](tsv/deu_latn_broad_filtered.tsv) | deu | German | German | Latin |  | True | Broad | 47,779 |
| [TSV](tsv/deu_latn_narrow.tsv) | deu | German | German | Latin |  | False | Narrow | 18,430 |
| [TSV](tsv/div_thaa_broad.tsv) | div | Dhivehi | Dhivehi | Thaana |  | False | Broad | 1,524 |
| [TSV](tsv/div_thaa_narrow.tsv) | div | Dhivehi | Dhivehi | Thaana |  | False | Narrow | 1,608 |
| [TSV](tsv/dlm_latn_broad.tsv) | dlm | Dalmatian | Dalmatian | Latin |  | False | Broad | 176 |
| [TSV](tsv/dng_cyrl_broad.tsv) | dng | Dungan | Dungan | Cyrillic |  | False | Broad | 255 |
| [TSV](tsv/dsb_latn_broad.tsv) | dsb | Lower Sorbian | Lower Sorbian | Latin |  | False | Broad | 2,258 |
| [TSV](tsv/dsb_latn_narrow.tsv) | dsb | Lower Sorbian | Lower Sorbian | Latin |  | False | Narrow | 1,428 |
| [TSV](tsv/dum_latn_broad.tsv) | dum | Middle Dutch (ca. 1050-1350) | Middle Dutch | Latin |  | False | Broad | 215 |
| [TSV](tsv/dzo_tibt_broad.tsv) | dzo | Dzongkha | Dzongkha | Tibetan |  | False | Broad | 212 |
| [TSV](tsv/egy_latn_broad.tsv) | egy | Egyptian (Ancient) | Egyptian | Latin |  | False | Broad | 4,046 |
| [TSV](tsv/ell_grek_broad.tsv) | ell | Modern Greek (1453-) | Greek | Greek |  | False | Broad | 15,241 |
| [TSV](tsv/ell_grek_broad_filtered.tsv) | ell | Modern Greek (1453-) | Greek | Greek |  | True | Broad | 14,825 |
| [TSV](tsv/ell_grek_narrow.tsv) | ell | Modern Greek (1453-) | Greek | Greek |  | False | Narrow | 342 |
| [TSV](tsv/eng_latn_uk_broad.tsv) | eng | English | English | Latin | UK, Received Pronunciation | False | Broad | 82,261 |
| [TSV](tsv/eng_latn_uk_broad_filtered.tsv) | eng | English | English | Latin | UK, Received Pronunciation | True | Broad | 81,576 |
| [TSV](tsv/eng_latn_uk_narrow.tsv) | eng | English | English | Latin | UK, Received Pronunciation | False | Narrow | 1,958 |
| [TSV](tsv/eng_latn_us_broad.tsv) | eng | English | English | Latin | US, General American | False | Broad | 81,323 |
| [TSV](tsv/eng_latn_us_broad_filtered.tsv) | eng | English | English | Latin | US, General American | True | Broad | 80,740 |
| [TSV](tsv/eng_latn_us_narrow.tsv) | eng | English | English | Latin | US, General American | False | Narrow | 2,903 |
| [TSV](tsv/enm_latn_broad.tsv) | enm | Middle English (1100-1500) | Middle English | Latin |  | False | Broad | 10,525 |
| [TSV](tsv/epo_latn_broad.tsv) | epo | Esperanto | Esperanto | Latin |  | False | Broad | 3,999 |
| [TSV](tsv/epo_latn_narrow.tsv) | epo | Esperanto | Esperanto | Latin |  | False | Narrow | 17,209 |
| [TSV](tsv/est_latn_broad.tsv) | est | Estonian | Estonian | Latin |  | False | Broad | 1,789 |
| [TSV](tsv/est_latn_narrow.tsv) | est | Estonian | Estonian | Latin |  | False | Narrow | 1,127 |
| [TSV](tsv/ett_ital_broad.tsv) | ett | Etruscan | Etruscan | Old Italic |  | False | Broad | 207 |
| [TSV](tsv/eus_latn_broad.tsv) | eus | Basque | Basque | Latin |  | False | Broad | 8,033 |
| [TSV](tsv/eus_latn_narrow.tsv) | eus | Basque | Basque | Latin |  | False | Narrow | 8,010 |
| [TSV](tsv/evn_cyrl_broad.tsv) | evn | Evenki | Evenki | Cyrillic |  | False | Broad | 126 |
| [TSV](tsv/ewe_latn_broad.tsv) | ewe | Ewe | Ewe | Latin |  | False | Broad | 136 |
| [TSV](tsv/fao_latn_broad.tsv) | fao | Faroese | Faroese | Latin |  | False | Broad | 1,947 |
| [TSV](tsv/fao_latn_narrow.tsv) | fao | Faroese | Faroese | Latin |  | False | Narrow | 1,175 |
| [TSV](tsv/fas_arab_broad.tsv) | fas | Persian | Persian | Arabic |  | False | Broad | 554 |
| [TSV](tsv/fas_arab_narrow.tsv) | fas | Persian | Persian | Arabic |  | False | Narrow | 34,033 |
| [TSV](tsv/fax_latn_broad.tsv) | fax | Fala | Fala | Latin |  | False | Broad | 538 |
| [TSV](tsv/fin_latn_broad.tsv) | fin | Finnish | Finnish | Latin |  | False | Broad | 158,880 |
| [TSV](tsv/fin_latn_narrow.tsv) | fin | Finnish | Finnish | Latin |  | False | Narrow | 158,871 |
| [TSV](tsv/fra_latn_broad.tsv) | fra | French | French | Latin |  | False | Broad | 80,943 |
| [TSV](tsv/fra_latn_broad_filtered.tsv) | fra | French | French | Latin |  | True | Broad | 80,690 |
| [TSV](tsv/fra_latn_narrow.tsv) | fra | French | French | Latin |  | False | Narrow | 254 |
| [TSV](tsv/fro_latn_broad.tsv) | fro | Old French (842-ca. 1400) | Old French | Latin |  | False | Broad | 929 |
| [TSV](tsv/frr_latn_broad.tsv) | frr | Northern Frisian | North Frisian | Latin |  | False | Broad | 167 |
| [TSV](tsv/fry_latn_broad.tsv) | fry | Western Frisian | West Frisian | Latin |  | False | Broad | 1,061 |
| [TSV](tsv/gla_latn_broad.tsv) | gla | Scottish Gaelic | Scottish Gaelic | Latin |  | False | Broad | 3,131 |
| [TSV](tsv/gla_latn_narrow.tsv) | gla | Scottish Gaelic | Scottish Gaelic | Latin |  | False | Narrow | 162 |
| [TSV](tsv/gle_latn_broad.tsv) | gle | Irish | Irish | Latin |  | False | Broad | 14,379 |
| [TSV](tsv/gle_latn_narrow.tsv) | gle | Irish | Irish | Latin |  | False | Narrow | 1,570 |
| [TSV](tsv/glg_latn_broad.tsv) | glg | Galician | Galician | Latin |  | False | Broad | 5,076 |
| [TSV](tsv/glg_latn_narrow.tsv) | glg | Galician | Galician | Latin |  | False | Narrow | 4,248 |
| [TSV](tsv/glv_latn_broad.tsv) | glv | Manx | Manx | Latin |  | False | Broad | 208 |
| [TSV](tsv/glv_latn_narrow.tsv) | glv | Manx | Manx | Latin |  | False | Narrow | 131 |
| [TSV](tsv/gml_latn_broad.tsv) | gml | Middle Low German | Middle Low German | Latin |  | False | Broad | 171 |
| [TSV](tsv/goh_latn_broad.tsv) | goh | Old High German (ca. 750-1050) | Old High German | Latin |  | False | Broad | 141 |
| [TSV](tsv/got_goth_broad.tsv) | got | Gothic | Gothic | Gothic |  | False | Broad | 1,785 |
| [TSV](tsv/got_goth_narrow.tsv) | got | Gothic | Gothic | Gothic |  | False | Narrow | 382 |
| [TSV](tsv/grc_grek_broad.tsv) | grc | Ancient Greek (to 1453) | Ancient Greek | Greek |  | False | Broad | 120,580 |
| [TSV](tsv/grn_latn_broad.tsv) | grn | Guarani | Guaraní | Latin |  | False | Broad | 213 |
| [TSV](tsv/gsw_latn_broad.tsv) | gsw | Swiss German | Alemannic German | Latin |  | False | Broad | 468 |
| [TSV](tsv/guj_gujr_broad.tsv) | guj | Gujarati | Gujarati | Gujarati |  | False | Broad | 2,058 |
| [TSV](tsv/gur_latn_broad.tsv) | gur | Farefare | Farefare | Latin |  | False | Broad | 104 |
| [TSV](tsv/guw_latn_broad.tsv) | guw | Gun | Gun | Latin |  | False | Broad | 682 |
| [TSV](tsv/hat_latn_broad.tsv) | hat | Haitian | Haitian Creole | Latin |  | False | Broad | 1,456 |
| [TSV](tsv/hau_latn_broad.tsv) | hau | Hausa | Hausa | Latin |  | False | Broad | 1,937 |
| [TSV](tsv/hau_latn_narrow.tsv) | hau | Hausa | Hausa | Latin |  | False | Narrow | 1,912 |
| [TSV](tsv/haw_latn_broad.tsv) | haw | Hawaiian | Hawaiian | Latin |  | False | Broad | 938 |
| [TSV](tsv/haw_latn_narrow.tsv) | haw | Hawaiian | Hawaiian | Latin |  | False | Narrow | 878 |
| [TSV](tsv/hbs_cyrl_broad.tsv) | hbs | Serbo-Croatian | Serbo-Croatian | Cyrillic |  | False | Broad | 23,019 |
| [TSV](tsv/hbs_cyrl_broad_filtered.tsv) | hbs | Serbo-Croatian | Serbo-Croatian | Cyrillic |  | True | Broad | 22,849 |
| [TSV](tsv/hbs_latn_broad.tsv) | hbs | Serbo-Croatian | Serbo-Croatian | Latin |  | False | Broad | 24,462 |
| [TSV](tsv/hbs_latn_broad_filtered.tsv) | hbs | Serbo-Croatian | Serbo-Croatian | Latin |  | True | Broad | 24,142 |
| [TSV](tsv/heb_hebr_broad.tsv) | heb | Hebrew | Hebrew | Hebrew |  | False | Broad | 1,957 |
| [TSV](tsv/heb_hebr_narrow.tsv) | heb | Hebrew | Hebrew | Hebrew |  | False | Narrow | 212 |
| [TSV](tsv/hil_latn_broad.tsv) | hil | Hiligaynon | Hiligaynon | Latin |  | False | Broad | 331 |
| [TSV](tsv/hil_latn_narrow.tsv) | hil | Hiligaynon | Hiligaynon | Latin |  | False | Narrow | 314 |
| [TSV](tsv/hin_deva_broad.tsv) | hin | Hindi | Hindi | Devanagari |  | False | Broad | 25,269 |
| [TSV](tsv/hin_deva_broad_filtered.tsv) | hin | Hindi | Hindi | Devanagari |  | True | Broad | 24,640 |
| [TSV](tsv/hin_deva_narrow.tsv) | hin | Hindi | Hindi | Devanagari |  | False | Narrow | 22,296 |
| [TSV](tsv/hrx_latn_broad.tsv) | hrx | Hunsrik | Hunsrik | Latin |  | False | Broad | 1,713 |
| [TSV](tsv/hsb_latn_broad.tsv) | hsb | Upper Sorbian | Upper Sorbian | Latin |  | False | Broad | 357 |
| [TSV](tsv/hsb_latn_narrow.tsv) | hsb | Upper Sorbian | Upper Sorbian | Latin |  | False | Narrow | 150 |
| [TSV](tsv/hts_latn_broad.tsv) | hts | Hadza | Hadza | Latin |  | False | Broad | 335 |
| [TSV](tsv/hun_latn_narrow.tsv) | hun | Hungarian | Hungarian | Latin |  | False | Narrow | 62,497 |
| [TSV](tsv/hun_latn_narrow_filtered.tsv) | hun | Hungarian | Hungarian | Latin |  | True | Narrow | 62,429 |
| [TSV](tsv/huu_latn_narrow.tsv) | huu | Murui Huitoto | Murui Huitoto | Latin |  | False | Narrow | 314 |
| [TSV](tsv/hye_armn_e_broad.tsv) | hye | Armenian | Armenian | Armenian | Eastern Armenian | False | Broad | 16,826 |
| [TSV](tsv/hye_armn_e_narrow.tsv) | hye | Armenian | Armenian | Armenian | Eastern Armenian | False | Narrow | 17,056 |
| [TSV](tsv/hye_armn_e_narrow_filtered.tsv) | hye | Armenian | Armenian | Armenian | Eastern Armenian | True | Narrow | 16,979 |
| [TSV](tsv/hye_armn_w_broad.tsv) | hye | Armenian | Armenian | Armenian | Western Armenian | False | Broad | 16,364 |
| [TSV](tsv/hye_armn_w_narrow.tsv) | hye | Armenian | Armenian | Armenian | Western Armenian | False | Narrow | 16,556 |
| [TSV](tsv/hye_armn_w_narrow_filtered.tsv) | hye | Armenian | Armenian | Armenian | Western Armenian | True | Narrow | 16,488 |
| [TSV](tsv/iba_latn_broad.tsv) | iba | Iban | Iban | Latin |  | False | Broad | 519 |
| [TSV](tsv/iba_latn_narrow.tsv) | iba | Iban | Iban | Latin |  | False | Narrow | 176 |
| [TSV](tsv/ido_latn_broad.tsv) | ido | Ido | Ido | Latin |  | False | Broad | 8,012 |
| [TSV](tsv/ilo_latn_broad.tsv) | ilo | Iloko | Ilocano | Latin |  | False | Broad | 805 |
| [TSV](tsv/ilo_latn_narrow.tsv) | ilo | Iloko | Ilocano | Latin |  | False | Narrow | 750 |
| [TSV](tsv/ina_latn_broad.tsv) | ina | Interlingua (International Auxiliary Language Association) | Interlingua | Latin |  | False | Broad | 321 |
| [TSV](tsv/ind_latn_broad.tsv) | ind | Indonesian | Indonesian | Latin |  | False | Broad | 4,952 |
| [TSV](tsv/ind_latn_narrow.tsv) | ind | Indonesian | Indonesian | Latin |  | False | Narrow | 6,125 |
| [TSV](tsv/inh_cyrl_broad.tsv) | inh | Ingush | Ingush | Cyrillic |  | False | Broad | 166 |
| [TSV](tsv/isl_latn_broad.tsv) | isl | Icelandic | Icelandic | Latin |  | False | Broad | 9,866 |
| [TSV](tsv/isl_latn_broad_filtered.tsv) | isl | Icelandic | Icelandic | Latin |  | True | Broad | 9,797 |
| [TSV](tsv/isl_latn_narrow.tsv) | isl | Icelandic | Icelandic | Latin |  | False | Narrow | 376 |
| [TSV](tsv/ita_latn_broad.tsv) | ita | Italian | Italian | Latin |  | False | Broad | 79,988 |
| [TSV](tsv/ita_latn_broad_filtered.tsv) | ita | Italian | Italian | Latin |  | True | Broad | 79,865 |
| [TSV](tsv/izh_latn_broad.tsv) | izh | Ingrian | Ingrian | Latin |  | False | Broad | 7,577 |
| [TSV](tsv/izh_latn_narrow.tsv) | izh | Ingrian | Ingrian | Latin |  | False | Narrow | 12,334 |
| [TSV](tsv/jam_latn_broad.tsv) | jam | Jamaican Creole English | Jamaican Creole | Latin |  | False | Broad | 207 |
| [TSV](tsv/jav_java_broad.tsv) | jav | Javanese | Javanese | Javanese |  | False | Broad | 664 |
| [TSV](tsv/jje_hang_broad.tsv) | jje | Jejueo | Jeju | Hangul |  | False | Broad | 739 |
| [TSV](tsv/jpn_hira_narrow.tsv) | jpn | Japanese | Japanese | Hiragana |  | False | Narrow | 26,604 |
| [TSV](tsv/jpn_hira_narrow_filtered.tsv) | jpn | Japanese | Japanese | Hiragana |  | True | Narrow | 26,460 |
| [TSV](tsv/jpn_kana_narrow.tsv) | jpn | Japanese | Japanese | Katakana |  | False | Narrow | 6,903 |
| [TSV](tsv/jpn_kana_narrow_filtered.tsv) | jpn | Japanese | Japanese | Katakana |  | True | Narrow | 6,289 |
| [TSV](tsv/kal_latn_broad.tsv) | kal | Kalaallisut | Greenlandic | Latin |  | False | Broad | 1,528 |
| [TSV](tsv/kal_latn_narrow.tsv) | kal | Kalaallisut | Greenlandic | Latin |  | False | Narrow | 1,324 |
| [TSV](tsv/kan_knda_broad.tsv) | kan | Kannada | Kannada | Kannada |  | False | Broad | 884 |
| [TSV](tsv/kas_arab_broad.tsv) | kas | Kashmiri | Kashmiri | Arabic |  | False | Broad | 421 |
| [TSV](tsv/kas_arab_narrow.tsv) | kas | Kashmiri | Kashmiri | Arabic |  | False | Narrow | 253 |
| [TSV](tsv/kas_deva_broad.tsv) | kas | Kashmiri | Kashmiri | Devanagari |  | False | Broad | 113 |
| [TSV](tsv/kat_geor_broad.tsv) | kat | Georgian | Georgian | Georgian |  | False | Broad | 17,212 |
| [TSV](tsv/kat_geor_broad_filtered.tsv) | kat | Georgian | Georgian | Georgian |  | True | Broad | 17,192 |
| [TSV](tsv/kat_geor_narrow.tsv) | kat | Georgian | Georgian | Georgian |  | False | Narrow | 13,940 |
| [TSV](tsv/kaw_latn_broad.tsv) | kaw | Kawi | Old Javanese | Latin |  | False | Broad | 593 |
| [TSV](tsv/kaz_cyrl_broad.tsv) | kaz | Kazakh | Kazakh | Cyrillic |  | False | Broad | 274 |
| [TSV](tsv/kaz_cyrl_narrow.tsv) | kaz | Kazakh | Kazakh | Cyrillic |  | False | Narrow | 1,396 |
| [TSV](tsv/kbd_cyrl_narrow.tsv) | kbd | Kabardian | Kabardian | Cyrillic |  | False | Narrow | 859 |
| [TSV](tsv/kgp_latn_broad.tsv) | kgp | Kaingang | Kaingang | Latin |  | False | Broad | 107 |
| [TSV](tsv/khb_talu_broad.tsv) | khb | Lü | Lü | New Tai Lue |  | False | Broad | 499 |
| [TSV](tsv/khm_khmr_broad.tsv) | khm | Khmer | Khmer | Khmer |  | False | Broad | 6,302 |
| [TSV](tsv/khm_khmr_broad_filtered.tsv) | khm | Khmer | Khmer | Khmer |  | True | Broad | 6,300 |
| [TSV](tsv/kik_latn_broad.tsv) | kik | Kikuyu | Kikuyu | Latin |  | False | Broad | 1,158 |
| [TSV](tsv/kir_cyrl_broad.tsv) | kir | Kirghiz | Kyrgyz | Cyrillic |  | False | Broad | 583 |
| [TSV](tsv/kir_cyrl_narrow.tsv) | kir | Kirghiz | Kyrgyz | Cyrillic |  | False | Narrow | 147 |
| [TSV](tsv/kix_latn_broad.tsv) | kix | Khiamniungan Naga | Khiamniungan Naga | Latin |  | False | Broad | 181 |
| [TSV](tsv/kld_latn_broad.tsv) | kld | Gamilaraay | Gamilaraay | Latin |  | False | Broad | 515 |
| [TSV](tsv/klj_latn_narrow.tsv) | klj | Khalaj | Khalaj | Latin |  | False | Narrow | 2,001 |
| [TSV](tsv/kmr_latn_broad.tsv) | kmr | Northern Kurdish | Northern Kurdish | Latin |  | False | Broad | 2,140 |
| [TSV](tsv/koi_cyrl_broad.tsv) | koi | Komi-Permyak | Komi-Permyak | Cyrillic |  | False | Broad | 182 |
| [TSV](tsv/koi_cyrl_narrow.tsv) | koi | Komi-Permyak | Komi-Permyak | Cyrillic |  | False | Narrow | 180 |
| [TSV](tsv/kok_deva_broad.tsv) | kok | Konkani (macrolanguage) | Konkani | Devanagari |  | False | Broad | 172 |
| [TSV](tsv/kok_deva_narrow.tsv) | kok | Konkani (macrolanguage) | Konkani | Devanagari |  | False | Narrow | 537 |
| [TSV](tsv/kor_hang_narrow.tsv) | kor | Korean | Korean | Hangul |  | False | Narrow | 25,800 |
| [TSV](tsv/kor_hang_narrow_filtered.tsv) | kor | Korean | Korean | Hangul |  | True | Narrow | 22,072 |
| [TSV](tsv/kpv_cyrl_broad.tsv) | kpv | Komi-Zyrian | Komi-Zyrian | Cyrillic |  | False | Broad | 834 |
| [TSV](tsv/kpv_cyrl_narrow.tsv) | kpv | Komi-Zyrian | Komi-Zyrian | Cyrillic |  | False | Narrow | 794 |
| [TSV](tsv/krl_latn_broad.tsv) | krl | Karelian | Karelian | Latin |  | False | Broad | 419 |
| [TSV](tsv/ksw_mymr_broad.tsv) | ksw | S'gaw Karen | S'gaw Karen | Myanmar |  | False | Broad | 177 |
| [TSV](tsv/ktz_latn_broad.tsv) | ktz | Juǀʼhoan | Juǀ'hoan | Latin |  | False | Broad | 132 |
| [TSV](tsv/kwk_latn_broad.tsv) | kwk | Kwakiutl | Kwak'wala | Latin |  | False | Broad | 107 |
| [TSV](tsv/kxd_latn_broad.tsv) | kxd | Brunei | Brunei Malay | Latin |  | False | Broad | 351 |
| [TSV](tsv/kyu_kali_broad.tsv) | kyu | Western Kayah | Western Kayah | Kayah Li |  | False | Broad | 128 |
| [TSV](tsv/lad_latn_broad.tsv) | lad | Ladino | Ladino | Latin |  | False | Broad | 120 |
| [TSV](tsv/lao_laoo_narrow.tsv) | lao | Lao | Lao | Lao |  | False | Narrow | 4,180 |
| [TSV](tsv/lat_latn_clas_narrow.tsv) | lat | Latin | Latin | Latin | Classical | False | Narrow | 41,312 |
| [TSV](tsv/lat_latn_eccl_narrow.tsv) | lat | Latin | Latin | Latin | Ecclesiastical | False | Narrow | 39,498 |
| [TSV](tsv/lav_latn_narrow.tsv) | lav | Latvian | Latvian | Latin |  | False | Narrow | 1,355 |
| [TSV](tsv/lav_latn_narrow_filtered.tsv) | lav | Latvian | Latvian | Latin |  | True | Narrow | 1,255 |
| [TSV](tsv/lif_limb_broad.tsv) | lif | Limbu | Limbu | Limbu |  | False | Broad | 108 |
| [TSV](tsv/lij_latn_broad.tsv) | lij | Ligurian | Ligurian | Latin |  | False | Broad | 816 |
| [TSV](tsv/lim_latn_broad.tsv) | lim | Limburgan | Limburgish | Latin |  | False | Broad | 949 |
| [TSV](tsv/lim_latn_narrow.tsv) | lim | Limburgan | Limburgish | Latin |  | False | Narrow | 230 |
| [TSV](tsv/lit_latn_broad.tsv) | lit | Lithuanian | Lithuanian | Latin |  | False | Broad | 370 |
| [TSV](tsv/lit_latn_narrow.tsv) | lit | Lithuanian | Lithuanian | Latin |  | False | Narrow | 12,831 |
| [TSV](tsv/liv_latn_broad.tsv) | liv | Liv | Livonian | Latin |  | False | Broad | 393 |
| [TSV](tsv/lmo_latn_broad.tsv) | lmo | Lombard | Lombard | Latin |  | False | Broad | 486 |
| [TSV](tsv/lmo_latn_narrow.tsv) | lmo | Lombard | Lombard | Latin |  | False | Narrow | 375 |
| [TSV](tsv/lmy_latn_narrow.tsv) | lmy | Lamboya | Laboya | Latin |  | False | Narrow | 129 |
| [TSV](tsv/lou_latn_broad.tsv) | lou | Louisiana Creole | Louisiana Creole | Latin |  | False | Broad | 240 |
| [TSV](tsv/lsi_latn_broad.tsv) | lsi | Lashi | Lashi | Latin |  | False | Broad | 324 |
| [TSV](tsv/ltg_latn_narrow.tsv) | ltg | Latgalian | Latgalian | Latin |  | False | Narrow | 444 |
| [TSV](tsv/ltz_latn_broad.tsv) | ltz | Luxembourgish | Luxembourgish | Latin |  | False | Broad | 4,090 |
| [TSV](tsv/ltz_latn_narrow.tsv) | ltz | Luxembourgish | Luxembourgish | Latin |  | False | Narrow | 2,654 |
| [TSV](tsv/lut_latn_broad.tsv) | lut | Lushootseed | Lushootseed | Latin |  | False | Broad | 121 |
| [TSV](tsv/lwl_thai_broad.tsv) | lwl | Eastern Lawa | Eastern Lawa | Thai |  | False | Broad | 253 |
| [TSV](tsv/lzz_geor_broad.tsv) | lzz | Laz | Laz | Georgian |  | False | Broad | 305 |
| [TSV](tsv/mah_latn_broad.tsv) | mah | Marshallese | Marshallese | Latin |  | False | Broad | 943 |
| [TSV](tsv/mah_latn_narrow.tsv) | mah | Marshallese | Marshallese | Latin |  | False | Narrow | 1,060 |
| [TSV](tsv/mai_deva_narrow.tsv) | mai | Maithili | Maithili | Devanagari |  | False | Narrow | 164 |
| [TSV](tsv/mak_latn_narrow.tsv) | mak | Makasar | Makasar | Latin |  | False | Narrow | 432 |
| [TSV](tsv/mal_mlym_broad.tsv) | mal | Malayalam | Malayalam | Malayalam |  | False | Broad | 7,100 |
| [TSV](tsv/mal_mlym_narrow.tsv) | mal | Malayalam | Malayalam | Malayalam |  | False | Narrow | 375 |
| [TSV](tsv/mar_deva_broad.tsv) | mar | Marathi | Marathi | Devanagari |  | False | Broad | 2,681 |
| [TSV](tsv/mar_deva_narrow.tsv) | mar | Marathi | Marathi | Devanagari |  | False | Narrow | 599 |
| [TSV](tsv/mdf_cyrl_broad.tsv) | mdf | Moksha | Moksha | Cyrillic |  | False | Broad | 131 |
| [TSV](tsv/mfe_latn_broad.tsv) | mfe | Morisyen | Mauritian Creole | Latin |  | False | Broad | 205 |
| [TSV](tsv/mfe_latn_narrow.tsv) | mfe | Morisyen | Mauritian Creole | Latin |  | False | Narrow | 105 |
| [TSV](tsv/mga_latn_broad.tsv) | mga | Middle Irish (900-1200) | Middle Irish | Latin |  | False | Broad | 317 |
| [TSV](tsv/mic_latn_broad.tsv) | mic | Mi'kmaq | Mi'kmaq | Latin |  | False | Broad | 203 |
| [TSV](tsv/mic_latn_narrow.tsv) | mic | Mi'kmaq | Mi'kmaq | Latin |  | False | Narrow | 201 |
| [TSV](tsv/mkd_cyrl_narrow.tsv) | mkd | Macedonian | Macedonian | Cyrillic |  | False | Narrow | 62,277 |
| [TSV](tsv/mlg_latn_broad.tsv) | mlg | Malagasy | Malagasy | Latin |  | False | Broad | 185 |
| [TSV](tsv/mlt_latn_broad.tsv) | mlt | Maltese | Maltese | Latin |  | False | Broad | 18,391 |
| [TSV](tsv/mlt_latn_broad_filtered.tsv) | mlt | Maltese | Maltese | Latin |  | True | Broad | 18,361 |
| [TSV](tsv/mnc_mong_narrow.tsv) | mnc | Manchu | Manchu | Mongolian |  | False | Narrow | 1,467 |
| [TSV](tsv/mnw_mymr_broad.tsv) | mnw | Mon | Mon | Myanmar |  | False | Broad | 1,079 |
| [TSV](tsv/mon_cyrl_broad.tsv) | mon | Mongolian | Mongolian | Cyrillic |  | False | Broad | 3,477 |
| [TSV](tsv/mon_cyrl_narrow.tsv) | mon | Mongolian | Mon

│           ├── ara_arab_broad.tsv
│           ├── ara_arab_narrow.tsv
│           ├── arc_hebr_broad.tsv
│           ├── arg_latn_broad.tsv
│           ├── ary_arab_broad.tsv
│           ├── arz_arab_broad.tsv
│           ├── asm_beng_broad.tsv
│           ├── ast_latn_broad.tsv
│           ├── ast_latn_narrow.tsv
│           ├── ayl_arab_broad.tsv
│           ├── aze_latn_broad.tsv
│           ├── aze_latn_narrow.tsv
│           ├── aze_latn_narrow_filtered.tsv
│           ├── bak_cyrl_broad.tsv
│           ├── bak_cyrl_narrow.tsv
│           ├── ban_bali_broad.tsv
│           ├── bar_latn_broad.tsv
│           ├── bbl_geor_broad.tsv
│           ├── bbn_latn_broad.tsv
│           ├── bcl_latn_broad.tsv
│           ├── bcl_latn_narrow.tsv
│           ├── bdq_latn_broad.tsv
│           ├── bel_cyrl_narrow.tsv
│           ├── ben_beng_broad.tsv
│           ├── ben_beng_dhaka_broad.tsv
│           ├── ben_beng_dhaka_broad_filtered.tsv
│           ├── ben_beng_dhaka_narrow.tsv
│           ├── ben_beng_narrow.tsv
│           ├── ben_beng_rarh_broad.tsv
│           ├── ben_beng_rarh_broad_filtered.tsv
│           ├── ben_beng_rarh_narrow.tsv
│           ├── bjb_latn_broad.tsv
│           ├── blt_tavt_narrow.tsv
│           ├── bod_tibt_broad.tsv
│           ├── bre_latn_broad.tsv
│           ├── bua_cyrl_broad.tsv
│           ├── bua_cyrl_narrow.tsv
│           ├── bul_cyrl_narrow.tsv
│           ├── cat_latn_broad.tsv
│           ├── cat_latn_narrow.tsv
│           ├── cbn_thai_broad.tsv
│           ├── ceb_latn_broad.tsv
│           ├── ceb_latn_narrow.tsv
│           ├── ces_latn_narrow.tsv
│           ├── chb_latn_broad.tsv
│           ├── che_cyrl_broad.tsv
│           ├── cho_latn_broad.tsv
│           ├── chr_cher_broad.tsv
│           ├── cic_latn_broad.tsv
│           ├── ckb_arab_broad.tsv
│           ├── cnk_latn_broad.tsv
│           ├── cop_copt_broad.tsv
│           ├── cor_latn_broad.tsv
│           ├── cor_latn_narrow.tsv
│           ├── cos_latn_broad.tsv
│           ├── crk_latn_broad.tsv
│           ├── crk_latn_narrow.tsv
│           ├── crx_cans_broad.tsv
│           ├── csb_latn_broad.tsv
│           ├── cym_latn_nw_broad.tsv
│           ├── cym_latn_nw_broad_filtered.tsv
│           ├── cym_latn_nw_narrow.tsv
│           ├── cym_latn_sw_broad.tsv
│           ├── cym_latn_sw_broad_filtered.tsv
│           ├── cym_latn_sw_narrow.tsv
│           ├── dan_latn_broad.tsv
│           ├── dan_latn_narrow.tsv
│           ├── deu_latn_broad.tsv
│           ├── deu_latn_broad_filtered.tsv
│           ├── deu_latn_narrow.tsv
│           ├── div_thaa_broad.tsv
│           ├── div_thaa_narrow.tsv
│           ├── dlm_latn_broad.tsv
│           ├── dng_cyrl_broad.tsv
│           ├── dsb_latn_broad.tsv
│           ├── dsb_latn_narrow.tsv
│           ├── dum_latn_broad.tsv
│           ├── dzo_tibt_broad.tsv
│           ├── egy_latn_broad.tsv
│           ├── ell_grek_broad.tsv
│           ├── ell_grek_broad_filtered.tsv
│           ├── ell_grek_narrow.tsv
│           ├── eng_latn_uk_broad.tsv
│           ├── eng_latn_uk_broad_filtered.tsv
│           ├── eng_latn_uk_narrow.tsv
│           ├── eng_latn_us_broad.tsv
│           ├── eng_latn_us_broad_filtered.tsv
│           ├── eng_latn_us_narrow.tsv
│           ├── enm_latn_broad.tsv
│           ├── epo_latn_broad.tsv
│           ├── epo_latn_narrow.tsv
│           ├── est_latn_broad.tsv
│           ├── est_latn_narrow.tsv
│           ├── ett_ital_broad.tsv
│           ├── eus_latn_broad.tsv
│           ├── eus_latn_narrow.tsv
│           ├── evn_cyrl_broad.tsv
│           ├── ewe_latn_broad.tsv
│           ├── fao_latn_broad.tsv
│           ├── fao_latn_narrow.tsv
│           ├── fas_arab_broad.tsv
│           ├── fas_arab_narrow.tsv
│           ├── fax_latn_broad.tsv
│           ├── fin_latn_broad.tsv
│           ├── fin_latn_narrow.tsv
│           ├── fra_latn_broad.tsv
│           ├── fra_latn_broad_filtered.tsv
│           ├── fra_latn_narrow.tsv
│           ├── fro_latn_broad.tsv
│           ├── frr_latn_broad.tsv
│           ├── fry_latn_broad.tsv
│           ├── gla_latn_broad.tsv
│           ├── gla_latn_narrow.tsv
│           ├── gle_latn_broad.tsv
│           ├── gle_latn_narrow.tsv
│           ├── glg_latn_broad.tsv
│           ├── glg_latn_narrow.tsv
│           ├── glv_latn_broad.tsv
│           ├── glv_latn_narrow.tsv
│           ├── gml_latn_broad.tsv
│           ├── goh_latn_broad.tsv
│           ├── got_goth_broad.tsv
│           ├── got_goth_narrow.tsv
│           ├── grc_grek_broad.tsv
│           ├── grn_latn_broad.tsv
│           ├── gsw_latn_broad.tsv
│           ├── guj_gujr_broad.tsv
│           ├── gur_latn_broad.tsv
│           ├── guw_latn_broad.tsv
│           ├── hat_latn_broad.tsv
│           ├── hau_latn_broad.tsv
│           ├── hau_latn_narrow.tsv
│           ├── haw_latn_broad.tsv
│           ├── haw_latn_narrow.tsv
│           ├── hbs_cyrl_broad.tsv
│           ├── hbs_cyrl_broad_filtered.tsv
│           ├── hbs_latn_broad.tsv
│           ├── hbs_latn_broad_filtered.tsv
│           ├── heb_hebr_broad.tsv
│           ├── heb_hebr_narrow.tsv
│           ├── hil_latn_broad.tsv
│           ├── hil_latn_narrow.tsv
│           ├── hin_deva_broad.tsv
│           ├── hin_deva_broad_filtered.tsv
│           ├── hin_deva_narrow.tsv
│           ├── hrx_latn_broad.tsv
│           ├── hsb_latn_broad.tsv
│           ├── hsb_latn_narrow.tsv
│           ├── hts_latn_broad.tsv
│           ├── hun_latn_narrow.tsv
│           ├── hun_latn_narrow_filtered.tsv
│           ├── huu_latn_narrow.tsv
│           ├── hye_armn_e_broad.tsv
│           ├── hye_armn_e_narrow.tsv
│           ├── hye_armn_e_narrow_filtered.tsv
│           ├── hye_armn_w_broad.tsv
│           ├── hye_armn_w_narrow.tsv
│           ├── hye_armn_w_narrow_filtered.tsv
│           ├── iba_latn_broad.tsv
│           ├── iba_latn_narrow.tsv
│           ├── ido_latn_broad.tsv
│           ├── ilo_latn_broad.tsv
│           ├── ilo_latn_narrow.tsv
│           ├── ina_latn_broad.tsv
│           ├── ind_latn_broad.tsv
│           ├── ind_latn_narrow.tsv
│           ├── inh_cyrl_broad.tsv
│           ├── isl_latn_broad.tsv
│           ├── isl_latn_broad_filtered.tsv
│           ├── isl_latn_narrow.tsv
│           ├── ita_latn_broad.tsv
│           ├── ita_latn_broad_filtered.tsv
│           ├── izh_latn_broad.tsv
│           ├── izh_latn_narrow.tsv
│           ├── jam_latn_broad.tsv
│           ├── jav_java_broad.tsv
│           ├── jje_hang_broad.tsv
│           ├── jpn_hira_narrow.tsv
│           ├── jpn_hira_narrow_filtered.tsv
│           ├── jpn_kana_narrow.tsv
│           ├── jpn_kana_narrow_filtered.tsv
│           ├── kal_latn_broad.tsv
│           ├── kal_latn_narrow.tsv
│           ├── kan_knda_broad.tsv
│           ├── kas_arab_broad.tsv
│           ├── kas_arab_narrow.tsv
│           ├── kas_deva_broad.tsv
│           ├── kat_geor_broad.tsv
│           ├── kat_geor_broad_filtered.tsv
│           ├── kat_geor_narrow.tsv
│           ├── kaw_latn_broad.tsv
│           ├── kaz_cyrl_broad.tsv
│           ├── kaz_cyrl_narrow.tsv
│           ├── kbd_cyrl_narrow.tsv
│           ├── kgp_latn_broad.tsv
│           ├── khb_talu_broad.tsv
│           ├── khm_khmr_broad.tsv
│           ├── khm_khmr_broad_filtered.tsv
│           ├── kik_latn_broad.tsv
│           ├── kir_cyrl_broad.tsv
│           ├── kir_cyrl_narrow.tsv
│           ├── kix_latn_broad.tsv
│           ├── kld_latn_broad.tsv
│           ├── klj_latn_narrow.tsv
│           ├── kmr_latn_broad.tsv
│           ├── koi_cyrl_broad.tsv
│           ├── koi_cyrl_narrow.tsv
│           ├── kok_deva_broad.tsv
│           ├── kok_deva_narrow.tsv
│           ├── kor_hang_narrow.tsv
│           ├── kor_hang_narrow_filtered.tsv
│           ├── kpv_cyrl_broad.tsv
│           ├── kpv_cyrl_narrow.tsv
│           ├── krl_latn_broad.tsv
│           ├── ksw_mymr_broad.tsv
│           ├── ktz_latn_broad.tsv
│           ├── kwk_latn_broad.tsv
│           ├── kxd_latn_broad.tsv
│           ├── kyu_kali_broad.tsv
│           ├── lad_latn_broad.tsv
│           ├── lao_laoo_narrow.tsv
│           ├── lat_latn_clas_narrow.tsv
│           ├── lat_latn_eccl_narrow.tsv
│           ├── lav_latn_narrow.tsv
│           ├── lav_latn_narrow_filtered.tsv
│           ├── lif_limb_broad.tsv
│           ├── lij_latn_broad.tsv
│           ├── lim_latn_broad.tsv
│           ├── lim_latn_narrow.tsv
│           ├── lit_latn_broad.tsv
│           ├── lit_latn_narrow.tsv
│           ├── liv_latn_broad.tsv
│           ├── lmo_latn_broad.tsv
│           ├── lmo_latn_narrow.tsv
│           ├── lmy_latn_narrow.tsv
│           ├── lou_latn_broad.tsv
│           ├── lsi_latn_broad.tsv
│           ├── ltg_latn_narrow.tsv
│           ├── ltz_latn_broad.tsv
│           ├── ltz_latn_narrow.tsv
│           ├── lut_latn_broad.tsv
│           ├── lwl_thai_broad.tsv
│           ├── lzz_geor_broad.tsv
│           ├── mah_latn_broad.tsv
│           ├── mah_latn_narrow.tsv
│           ├── mai_deva_narrow.tsv
│           ├── mak_latn_narrow.tsv
│           ├── mal_mlym_broad.tsv
│           ├── mal_mlym_narrow.tsv
│           ├── mar_deva_broad.tsv
│           ├── mar_deva_narrow.tsv
│           ├── mdf_cyrl_broad.tsv
│           ├── mfe_latn_broad.tsv
│           ├── mfe_latn_narrow.tsv
│           ├── mga_latn_broad.tsv
│           ├── mic_latn_broad.tsv
│           ├── mic_latn_narrow.tsv
│           ├── mkd_cyrl_narrow.tsv
│           ├── mlg_latn_broad.tsv
│           ├── mlt_latn_broad.tsv
│           ├── mlt_latn_broad_filtered.tsv
│           ├── mnc_mong_narrow.tsv
│           ├── mnw_mymr_broad.tsv
│           ├── mon_cyrl_broad.tsv
│           ├── mon_cyrl_narrow.tsv
│           ├── mqs_latn_broad.tsv
│           ├── msa_arab_ara_broad.tsv
│           ├── msa_arab_ara_narrow.tsv
│           ├── msa_arab_broad.tsv
│           ├── msa_arab_narrow.tsv
│           ├── msa_latn_broad.tsv
│           ├── msa_latn_narrow.tsv
│           ├── mtq_latn_broad.tsv
│           ├── mww_latn_broad.tsv
│           ├── mya_mymr_broad.tsv
│           ├── mya_mymr_broad_filtered.tsv
│           ├── nap_latn_broad.tsv
│           ├── nap_latn_narrow.tsv
│           ├── nav_latn_broad.tsv
│           ├── nci_latn_broad.tsv
│           ├── nci_latn_narrow.tsv
│           ├── nds_latn_broad.tsv
│           ├── nep_deva_narrow.tsv
│           ├── new_deva_narrow.tsv
│           ├── nhg_latn_narrow.tsv
│           ├── nhn_latn_broad.tsv
│           ├── nhx_latn_broad.tsv
│           ├── niv_cyrl_broad.tsv
│           ├── nld_latn_broad.tsv
│           ├── nld_latn_broad_filtered.tsv
│           ├── nld_latn_narrow.tsv
│           ├── nmy_latn_narrow.tsv
│           ├── nno_latn_broad.tsv
│           ├── nno_latn_narrow.tsv
│           ├── nob_latn_broad.tsv
│           ├── nob_latn_broad_filtered.tsv
│           ├── nob_latn_narrow.tsv
│           ├── non_latn_broad.tsv
│           ├── nor_latn_broad.tsv
│           ├── nrf_latn_broad.tsv
│           ├── nup_latn_broad.tsv
│           ├── nya_latn_broad.tsv
│           ├── oci_latn_broad.tsv
│           ├── oci_latn_narrow.tsv
│           ├── ofs_latn_broad.tsv
│           ├── okm_hang_broad.tsv
│           ├── okm_hang_narrow.tsv
│           ├── olo_latn_broad.tsv
│           ├── orv_cyrl_broad.tsv
│           ├── osp_latn_broad.tsv
│           ├── osx_latn_broad.tsv
│           ├── ota_arab_broad.tsv
│           ├── ota_arab_narrow.tsv
│           ├── pag_latn_broad.tsv
│           ├── pag_latn_narrow.tsv
│           ├── pam_latn_broad.tsv
│           ├── pam_latn_narrow.tsv
│           ├── pan_arab_broad.tsv
│           ├── pan_guru_broad.tsv
│           ├── pan_guru_narrow.tsv
│           ├── pbv_latn_broad.tsv
│           ├── pcc_latn_broad.tsv
│           ├── pdc_latn_broad.tsv
│           ├── phl_latn_broad.tsv
│           ├── pjt_latn_narrow.tsv
│           ├── pms_latn_broad.tsv
│           ├── pol_latn_broad.tsv
│           ├── por_latn_bz_broad.tsv
│           ├── por_latn_bz_broad_filtered.tsv
│           ├── por_latn_bz_narrow.tsv
│           ├── por_latn_po_broad.tsv
│           ├── por_latn_po_broad_filtered.tsv
│           ├── por_latn_po_narrow.tsv
│           ├── pox_latn_broad.tsv
│           ├── ppl_latn_broad.tsv
│           ├── pqm_latn_broad.tsv
│           ├── pqm_latn_narrow.tsv
│           ├── pus_arab_broad.tsv
│           ├── rgn_latn_broad.tsv
│           ├── rgn_latn_narrow.tsv
│           ├── rom_latn_broad.tsv
│           ├── ron_latn_broad.tsv
│           ├── ron_latn_narrow.tsv
│           ├── ron_latn_narrow_filtered.tsv
│           ├── rup_latn_narrow.tsv
│           ├── rus_cyrl_narrow.tsv
│           ├── sah_cyrl_broad.tsv
│           ├── san_deva_broad.tsv
│           ├── san_deva_narrow.tsv
│           ├── sce_latn_broad.tsv
│           ├── scn_latn_broad.tsv
│           ├── scn_latn_narrow.tsv
│           ├── sco_latn_broad.tsv
│           ├── sco_latn_narrow.tsv
│           ├── sdc_latn_broad.tsv
│           ├── sga_latn_broad.tsv
│           ├── sga_latn_narrow.tsv
│           ├── shn_mymr_broad.tsv
│           ├── sia_cyrl_broad.tsv
│           ├── sid_latn_broad.tsv
│           ├── sin_sinh_broad.tsv
│           ├── sin_sinh_narrow.tsv
│           ├── sjd_cyrl_broad.tsv
│           ├── skr_arab_broad.tsv
│           ├── slk_latn_broad.tsv
│           ├── slk_latn_narrow.tsv
│           ├── slr_latn_broad.tsv
│           ├── slr_latn_narrow.tsv
│           ├── slv_latn_broad.tsv
│           ├── slv_latn_broad_filtered.tsv
│           ├── slv_latn_narrow.tsv
│           ├── sme_latn_broad.tsv
│           ├── sms_latn_broad.tsv
│           ├── snd_arab_broad.tsv
│           ├── spa_latn_ca_broad.tsv
│           ├── spa_latn_ca_broad_filtered.tsv
│           ├── spa_latn_ca_narrow.tsv
│           ├── spa_latn_la_broad.tsv
│           ├── spa_latn_la_broad_filtered.tsv
│           ├── spa_latn_la_narrow.tsv
│           ├── sqi_latn_broad.tsv
│           ├── sqi_latn_narrow.tsv
│           ├── srd_latn_broad.tsv
│           ├── srd_latn_narrow.tsv
│           ├── srn_latn_broad.tsv
│           ├── srs_latn_broad.tsv
│           ├── stq_latn_broad.tsv
│           ├── swa_latn_broad.tsv
│           ├── swe_latn_broad.tsv
│           ├── swe_latn_narrow.tsv
│           ├── syc_syrc_narrow.tsv
│           ├── syl_sylo_broad.tsv
│           ├── szl_latn_broad.tsv
│           ├── tam_taml_broad.tsv
│           ├── tam_taml_narrow.tsv
│           ├── tby_latn_narrow.tsv
│           ├── tel_telu_broad.tsv
│           ├── tel_telu_narrow.tsv
│           ├── tft_latn_broad.tsv
│           ├── tft_latn_narrow.tsv
│           ├── tgk_cyrl_broad.tsv
│           ├── tgk_cyrl_narrow.tsv
│           ├── tgl_latn_broad.tsv
│           ├── tgl_latn_narrow.tsv
│           ├── tha_thai_broad.tsv
│           ├── tkl_latn_narrow.tsv
│           ├── ton_latn_broad.tsv
│           ├── tpw_latn_broad.tsv
│           ├── tru_syrc_broad.tsv
│           ├── tuk_latn_broad.tsv
│           ├── tur_latn_broad.tsv
│           ├── tur_latn_narrow.tsv
│           ├── tur_latn_narrow_filtered.tsv
│           ├── twf_latn_broad.tsv
│           ├── tyv_cyrl_broad.tsv
│           ├── tzm_tfng_broad.tsv
│           ├── tzm_tfng_narrow.tsv
│           ├── uby_cyrl_narrow.tsv
│           ├── uig_arab_ara_broad.tsv
│           ├── uig_arab_broad.tsv
│           ├── ukr_cyrl_narrow.tsv
│           ├── urd_arab_broad.tsv
│           ├── urd_arab_narrow.tsv
│           ├── urk_thai_broad.tsv
│           ├── urk_thai_narrow.tsv
│           ├── uzb_latn_broad.tsv
│           ├── uzb_latn_narrow.tsv
│           ├── vie_latn_hanoi_narrow.tsv
│           ├── vie_latn_hanoi_narrow_filtered.tsv
│           ├── vie_latn_hue_narrow.tsv
│           ├── vie_latn_hue_narrow_filtered.tsv
│           ├── vie_latn_saigon_narrow.tsv
│           ├── vie_latn_saigon_narrow_filtered.tsv
│           ├── vol_latn_broad.tsv
│           ├── vol_latn_narrow.tsv
│           ├── vot_latn_broad.tsv
│           ├── vot_latn_narrow.tsv
│           ├── wau_latn_broad.tsv
│           ├── wbk_latn_broad.tsv
│           ├── wiy_latn_broad.tsv
│           ├── wlm_latn_broad.tsv
│           ├── wln_latn_broad.tsv
│           ├── xal_cyrl_broad.tsv
│           ├── xho_latn_narrow.tsv
│           ├── xsl_latn_narrow.tsv
│           ├── ybi_deva_broad.tsv
│           ├── ycl_latn_narrow.tsv
│           ├── yid_hebr_broad.tsv
│           ├── yid_hebr_narrow.tsv
│           ├── yor_latn_broad.tsv
│           ├── yrk_cyrl_narrow.tsv
│           ├── yue_hani_broad.tsv
│           ├── yue_latn_broad.tsv
│           ├── yux_cyrl_narrow.tsv
│           ├── zha_latn_broad.tsv
│           ├── zho_hani_broad.tsv
│           ├── zho_latn_broad.tsv
│           ├── zom_latn_broad.tsv
│           ├── zul_latn_broad.tsv
│           └── zza_latn_narrow.tsv
├── pyproject.toml
├── requirements.txt
├── src/
│   └── wikipron/
│       ├── __init__.py
│       ├── cli.py
│       ├── config.py
│       ├── extract/
│       │   ├── __init__.py
│       │   ├── blt.py
│       │   ├── cmn.py
│       │   ├── core.py
│       │   ├── default.py
│       │   ├── eng.py
│       │   ├── jpn.py
│       │   ├── khb.py
│       │   ├── khm.py
│       │   ├── lat.py
│       │   ├── shn.py
│       │   ├── tha.py
│       │   ├── vie.py
│       │   └── yue.py
│       ├── html_utils.py
│       ├── languagecodes.py
│       ├── py.typed
│       ├── scrape.py
│       └── typing.py
└── tests/
    ├── __init__.py
    ├── test_data/
    │   ├── __init__.py
    │   ├── test_scrape.py
    │   ├── test_split.py
    │   └── test_summary.py
    └── test_wikipron/
        ├── __init__.py
        ├── test_cli.py
        ├── test_config.py
        ├── test_extract.py
        ├── test_languagecodes.py
        ├── test_scrape.py
        └── test_version.py

SYMBOL INDEX (44 symbols across 15 files)

FILE: data/covering_grammar/lib/covering_grammar.py
  function main (line 11) | def main(args: argparse.Namespace) -> None:

FILE: data/covering_grammar/lib/error_analysis.py
  function get_current_timestamp (line 30) | def get_current_timestamp():
  function log (line 34) | def log() -> str:
  function match_pronunciation_rule (line 40) | def match_pronunciation_rule(ortho, pron, cg_fst):
  function main (line 47) | def main(args: argparse.Namespace) -> None:

FILE: data/covering_grammar/lib/make_test_file.py
  function main (line 16) | def main(args: argparse.Namespace) -> None:

FILE: data/frequencies/grab_wortschatz_data.py
  function download (line 19) | def download(data_to_grab: dict[str, Any]) -> dict[str, Any]:
  function unpack (line 43) | def unpack() -> None:
  function main (line 56) | def main(args: argparse.Namespace) -> None:

FILE: data/frequencies/merge.py
  function write_frequency_tsv (line 14) | def write_frequency_tsv(
  function main (line 51) | def main(args: argparse.Namespace) -> None:

FILE: data/morphology/grab_unimorph_data.py
  function download (line 17) | def download(data_to_grab: dict[str, list[str]]) -> dict[str, list[str]]:
  function main (line 40) | def main(args: argparse.Namespace) -> None:

FILE: data/phones/lib/generate_summary.py
  function _handle_wiki_name (line 21) | def _handle_wiki_name(language: dict[str, Any], file_path: str) -> str:
  function main (line 37) | def main() -> None:

FILE: data/phones/lib/list_phones.py
  function _count_phones (line 32) | def _count_phones(filepath: str) -> dict[str, set[str]]:
  function _pick_examples_for_display (line 54) | def _pick_examples_for_display(examples: set[str]) -> list[str]:
  function _check_ipa_phonemes (line 67) | def _check_ipa_phonemes(phone_to_examples: dict[str, set[str]], filepath...
  function main (line 104) | def main(args: argparse.Namespace):

FILE: data/phones/lib/normalize.py
  function main (line 15) | def main(args: argparse.Namespace) -> None:

FILE: data/scrape/lib/codes.py
  function _get_language_categories (line 52) | def _get_language_categories() -> list[str]:
  function _get_language_sizes (line 82) | def _get_language_sizes(categories: list[str]) -> dict[str, int]:
  function _scrape_wiktionary_language_code (line 114) | def _scrape_wiktionary_language_code(lang_title: str) -> str:
  function _check_language_code_against_wiki (line 136) | def _check_language_code_against_wiki(
  function main (line 155) | def main() -> None:

FILE: data/scrape/lib/common_characters.py
  function _extend_regex (line 43) | def _extend_regex(
  function _is_common (line 61) | def _is_common(word: str) -> str | None:
  function _inherited_check (line 72) | def _inherited_check(word: str) -> str | None:
  function main (line 83) | def main(args: argparse.Namespace) -> None:

FILE: data/scrape/lib/generate_summary.py
  function _handle_modifiers (line 20) | def _handle_modifiers(language: dict[str, Any], file_path: str):
  function main (line 36) | def main() -> None:

FILE: data/scrape/lib/languages_update.py
  function _detect_best_script_name (line 25) | def _detect_best_script_name(
  function _get_alias (line 53) | def _get_alias(
  function _remove_mismatch_ids (line 68) | def _remove_mismatch_ids(
  function main (line 84) | def main():

FILE: data/scrape/lib/scrape.py
  function _phones_reader (line 32) | def _phones_reader(path: str) -> Iterator[str]:
  function _filter (line 40) | def _filter(word: str, pron: str, phones: frozenset[str]) -> bool:
  function scrape_multi (line 52) | def scrape_multi(
  function _call_scrape_multi (line 84) | def _call_scrape_multi(
  function build_scraping_config (line 115) | def build_scraping_config(
  function main (line 156) | def main(args: argparse.Namespace) -> None:

FILE: data/scrape/lib/split.py
  function _generalized_check (line 21) | def _generalized_check(script: str, word: str, extension: str) -> bool:
  function _iterate_through_file (line 29) | def _iterate_through_file(
  function main (line 47) | def main(args: argparse.Namespace) -> None:

Condensed preview — 612 files, each showing path, character count, and a content snippet. Download the .json file for the full structured content (135,559K chars).
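
As a rough illustration of how the preview records described above could be consumed, here is a minimal sketch. It assumes only the three fields shown in the listing (`path`, `chars`, `preview`). The `SAMPLE` data copies a few `path` and `chars` values from the entries below, with shortened previews, and `chars_by_top_level` is a hypothetical helper that is not part of the repository.

```python
import json
from collections import defaultdict

# A few records in the same shape as the preview entries.
# The previews are shortened here; only `path` and `chars` are used.
SAMPLE = json.loads("""
[
  {"path": ".gitignore", "chars": 245, "preview": "__pycache__/"},
  {"path": "data/README.md", "chars": 1425, "preview": "Data directories"},
  {"path": "data/frequencies/merge.py", "chars": 4110, "preview": "#!/usr/bin/env python"},
  {"path": "data/phones/lib/normalize.py", "chars": 1077, "preview": "#!/usr/bin/env python"}
]
""")


def chars_by_top_level(entries):
    """Sums the `chars` field per top-level path component (hypothetical helper)."""
    totals = defaultdict(int)
    for entry in entries:
        # Files at the repository root are keyed by their own name.
        top = entry["path"].split("/", 1)[0]
        totals[top] += entry["chars"]
    return dict(totals)


totals = chars_by_top_level(SAMPLE)
print(totals)
```

The same function should work on the full downloaded file once it is loaded with `json.load`, provided every record carries the `path` and `chars` keys as in the entries shown here.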

[
  {
    "path": ".circleci/config.yml",
    "chars": 2760,
    "preview": "version: 2.1\n\norbs:\n  win: circleci/windows@5.0\n\n\njobs:\n  pre-build:\n    description: A check that needs to be done on o"
  },
  {
    "path": ".github/pull_request_template.md",
    "chars": 85,
    "preview": "- [ ] Updated `Unreleased` in `CHANGELOG.md` to reflect the changes in code or data.\n"
  },
  {
    "path": ".gitignore",
    "chars": 245,
    "preview": "__pycache__/\n.ipynb_checkpoints\n.mypy_cache/\n*.py[cdo]\n*.egg-info/\n*.log\nenv/\n.idea/\n.DS_Store\n.vscode/\n\n# Temporary dat"
  },
  {
    "path": "CHANGELOG.md",
    "chars": 21391,
    "preview": "Changelog\n=========\n\nAll notable changes to this project will be documented in this file.\n\nThe format is based on [Keep "
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 3253,
    "preview": "# Contributing\n\nThank you for your interest in contributing to the `wikipron` codebase!\n\nThis page assumes that you have"
  },
  {
    "path": "LICENSE.txt",
    "chars": 10787,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "README.md",
    "chars": 6717,
    "preview": "# WikiPron\n\n[![PyPI\nversion](https://badge.fury.io/py/wikipron.svg)](https://pypi.org/project/wikipron)\n[![Supported Pyt"
  },
  {
    "path": "data/README.md",
    "chars": 1425,
    "preview": "Data directories\n================\n\n- The [scrape](./scrape) directory contains our \"Big Scrape\" scripts, the\n  [TSVs](./"
  },
  {
    "path": "data/covering_grammar/README.md",
    "chars": 11,
    "preview": "(TEMPORARY)"
  },
  {
    "path": "data/covering_grammar/lib/README.md",
    "chars": 1234,
    "preview": "# Error analysis tool for grapheme-to-phoneme (g2p) conversion\n\nThis tool performs a fine-grained error analysis of a G2"
  },
  {
    "path": "data/covering_grammar/lib/covering_grammar.py",
    "chars": 1335,
    "preview": "#!/usr/bin/env python\n\"\"\"Creates covering grammar FST from TSV of correspondences.\"\"\"\n\nimport argparse\n\nimport pynini\n\nT"
  },
  {
    "path": "data/covering_grammar/lib/error_analysis.py",
    "chars": 4939,
    "preview": "#!/usr/bin/env python\n\"\"\"Error analysis tool for G2P.\n\nTwo input files are required:\n\n1.  Covering grammar: a two-column"
  },
  {
    "path": "data/covering_grammar/lib/make_test_file.py",
    "chars": 1691,
    "preview": "#!/usr/bin/env python\n\"\"\"Makes test file.\n\nUsing gold data and the model output, this script creates a three-column TSV\n"
  },
  {
    "path": "data/covering_grammar/tsv/ady_cyrl_narrow.tsv",
    "chars": 916,
    "preview": "а\taː\nа\ta\nа\tʔa\nа\tə\nб\tb\nб\tp\nв\tv\nв\tf\nг\tɣ\nг\tɡʷ\nг\tɡ\nгу\tɡʷ\nгъ\tʁ\nгъ\tʁʷ\nгъ\tɡʷ\nгъу\tʁʷ\nгь\tɡʲ\nд\td\nдж\td͡ʒ\nдж\td͡ʒʷ\nдз\td͡z\nдз\td͡ʒ\nдз\td"
  },
  {
    "path": "data/covering_grammar/tsv/bul_cyrl_narrow.tsv",
    "chars": 434,
    "preview": "а\tɤ\nа\ta\nа\tɐ\nа\tə\nа\t\nб\tb\nб\tbʲ\nб\tp\nв\tf\nв\tv\nв\tvʲ\nг\tɡ\nг\tɡʲ\nг\tk\nг\tɟ\nд\td̪\nд\td\nд\tdʲ\nд\tt̪\nд\tt\nд\t\nе\tɛ\nе\t\nж\tʒ\nж\tʃ\nж\td͡ʒ\nз\tz\nз\tzʲ\nз\t"
  },
  {
    "path": "data/covering_grammar/tsv/fre_latn_broad.tsv",
    "chars": 1873,
    "preview": "# CONSONANT LETTERS\nb\tb\nb\tp  # Before a voiceless consonant.\nb\t\nc\ts  # Before <e, i, y>. Before <æ, œ> in some Greek and"
  },
  {
    "path": "data/covering_grammar/tsv/geo_geor_broad.tsv",
    "chars": 148,
    "preview": "ა\tɑ\nბ\tb\nგ\tɡ\nდ\td\nე\tɛ\nვ\tv\nზ\tz\nთ\ttʰ\nი\ti\nკ\tkʼ\nლ\tl\nმ\tm\nნ\tn\nო\tɔ\nპ\tpʼ\nჟ\tʒ\nრ\tr\nს\ts\nტ\ttʼ\nუ\tu\nფ\tpʰ\nქ\tkʰ\nღ\tɣ\nყ\tqʼ\nშ\tʃ\nჩ\ttʃ\nც\tts\nძ\td"
  },
  {
    "path": "data/covering_grammar/tsv/gre_grek_broad.tsv",
    "chars": 532,
    "preview": "ά\ta\r\nλά\tal\r\nβ\tv\r\nλ\tʎ\r\nλ\tl\r\nλλ\tl\r\nα\to\r\nα\ta\r\nο\ta\r\nο\to\r\nσ\ts\r\nσ\tz\r\nσσ\ts\r\nς\ts\r\nπ\tp\r\nμπ\tb\r\nτ\tt\r\nτ\td\r\nττ\tt\r\nφ\tf\r\nγ\tɣ\r\nγ\tʝ\r\nγγ\tŋ"
  },
  {
    "path": "data/covering_grammar/tsv/ice_latn_broad.tsv",
    "chars": 657,
    "preview": "a\taː\na\ta\na\tau\ná\tauː\ná\tau\næ\tai\næ\taiː\nau\tøy\nau\tøyː\nb\tp\nd\ttː\nd\tt\nð\tð\ndd\ttː\ne\tɛ\ne\tɛː\ne\te\né\tjɛː\né\tjɛ\né\tɛ\né\tɛː\nei\tei\nei\teiː\ney"
  },
  {
    "path": "data/covering_grammar/tsv/ita_latn_broad.tsv",
    "chars": 443,
    "preview": "a\ta\nà\ta\nb\tb\nc\tk\nc\tt͡ʃ\nd\td\ne\te\ne\tɛ\nè\tɛ\né\te\nf\tf\ng\tɡ\ng\td͡ʒ\nh\t\ni\ni\ti\ni\tj\ni\ti̯\nì\ti\ní\ti\nl\tl\nm\tm\nn\tn\no\to\no\tɔ\nò\tɔ\nó\to\np\tp\nq\tk\nr\t"
  },
  {
    "path": "data/covering_grammar/tsv/jpn_hira_narrow.tsv",
    "chars": 4843,
    "preview": "あ\ta̠\nか\tka̠\nきゃ\tkʲa̠\nさ\tsa̠\nしゃ\tɕa̠\nた\tta̠\nちゃ\tt͡ɕa̠\nな\tna̠\nにゃ\tɲ̟a̠\nは\tha̠\nひゃ\tça̠\nま\tma̠\nみゃ\tmʲa̠\nや\tja̠\nら\tɾa̠\nりゃ\tɾʲa̠\nわ\tɰᵝa̠\nが\tɡa̠"
  },
  {
    "path": "data/frequencies/README.md",
    "chars": 1278,
    "preview": "# Frequencies scripts\n\nThe scripts in this directory are responsible for downloading word frequency\ncounts from the [Lei"
  },
  {
    "path": "data/frequencies/grab_wortschatz_data.py",
    "chars": 2611,
    "preview": "#!/usr/bin/env python\n\"\"\"Downloads and decompresses Wortschatz frequency data.\"\"\"\n\nimport argparse\nimport json\nimport lo"
  },
  {
    "path": "data/frequencies/merge.py",
    "chars": 4110,
    "preview": "#!/usr/bin/env python\n\"\"\"Merges downloaded frequency data with pronunciation data.\"\"\"\n\nimport argparse\nimport csv\nimport"
  },
  {
    "path": "data/frequencies/shared_tasks/README.md",
    "chars": 209,
    "preview": "This directory contains files such that,\nif supplied to the scripts under `data/frequencies/` to override the default\n`."
  },
  {
    "path": "data/frequencies/shared_tasks/SIGMORPHON_2021.json",
    "chars": 3791,
    "preview": "{\n    \"hye-am_web_2017_1M\": {\n        \"file_prefixes\": [\n            \"arm_e\"\n        ],\n        \"data_url\": \"http://pcai"
  },
  {
    "path": "data/frequencies/shared_tasks/SIGMORPHON_2022.json",
    "chars": 2853,
    "preview": "{\n    \"swe_news_2020_1M\": {\n        \"file_prefixes\": [\n            \"swe_latn\"\n        ],\n        \"data_url\": \"http://pca"
  },
  {
    "path": "data/frequencies/wortschatz_languages.json",
    "chars": 13693,
    "preview": "{\n    \"sqi_wikipedia_2016_300K\": {\n        \"file_prefixes\": [\n            \"alb\"\n        ],\n        \"data_url\": \"http://p"
  },
  {
    "path": "data/morphology/README.md",
    "chars": 682,
    "preview": "# Morphology scripts\n\nThe scripts in this directory are responsible for downloading morphological\nparadigms from [UniMor"
  },
  {
    "path": "data/morphology/grab_unimorph_data.py",
    "chars": 2019,
    "preview": "#!/usr/bin/env python\n\"\"\"Downloads UniMorph morphological paradigms data.\"\"\"\n\nimport argparse\nimport json\nimport logging"
  },
  {
    "path": "data/morphology/shared_tasks/README.md",
    "chars": 218,
    "preview": "This directory contains files such that,\nif supplied to the scripts under `data/morphology/` to override the default\n`.."
  },
  {
    "path": "data/morphology/shared_tasks/SIGMORPHON_2021.json",
    "chars": 1517,
    "preview": "{\n    \"ady\": [\n        \"https://raw.githubusercontent.com/unimorph/ady/master/ady\"\n    ],\n    \"arm_e\": [\n        \"https:"
  },
  {
    "path": "data/morphology/shared_tasks/SIGMORPHON_2022.json",
    "chars": 1235,
    "preview": "{\n    \"asm\": [\n        \"https://raw.githubusercontent.com/unimorph/asm/master/asm\"\n    ],\n    \"bel\": [\n        \"https://"
  },
  {
    "path": "data/morphology/unimorph_languages.json",
    "chars": 6499,
    "preview": "{\n    \"ady\": [\n        \"https://raw.githubusercontent.com/unimorph/ady/master/ady\"\n    ],\n    \"ang\": [\n        \"https://"
  },
  {
    "path": "data/phones/HOWTO.md",
    "chars": 3648,
    "preview": "Phones\n======\n\nA `.phones` file is a list of permitted phones; any pronunciation which is not\ntotally composed of the pe"
  },
  {
    "path": "data/phones/README.md",
    "chars": 3660,
    "preview": "See the [HOWTO](HOWTO.md) for the steps to generate phone lists.\n| Link | ISO 639-3 Code | ISO 639 Language Name | Wikti"
  },
  {
    "path": "data/phones/lib/generate_summary.py",
    "chars": 3433,
    "preview": "#!/usr/bin/env python\n\nimport csv\nimport json\nimport logging\nimport operator\nimport os\n\nfrom typing import Any\n\nLIB_DIRE"
  },
  {
    "path": "data/phones/lib/list_phones.py",
    "chars": 4291,
    "preview": "#!/usr/bin/env python\n\"\"\"This script prints a tally of the phones/phonemes of a WikiPron TSV file.\n\nFor each phone/phone"
  },
  {
    "path": "data/phones/lib/normalize.py",
    "chars": 1077,
    "preview": "#!/usr/bin/env python\n\"\"\"In-place Unicode normalization.\n\nTakes a file and applies the specified Unicode normalization \""
  },
  {
    "path": "data/phones/phones/ady_narrow.phones",
    "chars": 626,
    "preview": "# Based on\n#\n# https://en.wikipedia.org/wiki/Adyghe_language\n# https://en.wikipedia.org/wiki/Adyghe_phonology\n# https://"
  },
  {
    "path": "data/phones/phones/afr_broad.phones",
    "chars": 992,
    "preview": "# Based on:\n# https://taalportaal.org/taalportaal/topic/pid/topic-14566637229843775\n# https://en.wikipedia.org/wiki/Afri"
  },
  {
    "path": "data/phones/phones/aze_narrow.phones",
    "chars": 1292,
    "preview": "# Based on\n# https://en.wikipedia.org/wiki/Azerbaijani_language\n# https://en.wikipedia.org/wiki/Help%3AIPA%2FAzerbaijani"
  },
  {
    "path": "data/phones/phones/ben_dhaka_broad.phones",
    "chars": 883,
    "preview": "# References:\n# \n#      Thompson, H.-R. 2020. Bengali: A Comprehensive Grammar. Routledge.\n#      Khan, S. u. D. 2010. B"
  },
  {
    "path": "data/phones/phones/ben_rarh_broad.phones",
    "chars": 887,
    "preview": "# References:\n# \n#      Thompson, H.-R. 2020. Bengali: A Comprehensive Grammar. Routledge.\n#      Khan, S. u. D. 2010. B"
  },
  {
    "path": "data/phones/phones/bul_broad.phones",
    "chars": 1233,
    "preview": "# Based on\n# https://en.wikipedia.org/wiki/Bulgarian_phonology\n# http://www.personal.rdg.ac.uk/~llsroach/phon2/b_phon/b_"
  },
  {
    "path": "data/phones/phones/cym_nw_broad.phones",
    "chars": 1093,
    "preview": "# Based on\n# https://en.wikipedia.org/wiki/Welsh_phonology\n# https://en.wikipedia.org/wiki/Help:IPA/Welsh\n# https://en.w"
  },
  {
    "path": "data/phones/phones/cym_sw_broad.phones",
    "chars": 822,
    "preview": "# Based on\n# https://en.wikipedia.org/wiki/Welsh_phonology\n# https://en.wikipedia.org/wiki/Help:IPA/Welsh\n# https://en.w"
  },
  {
    "path": "data/phones/phones/deu_broad.phones",
    "chars": 1664,
    "preview": "# Based on:\n# https://en.wiktionary.org/wiki/Appendix:German_pronunciation\n# https://en.wikipedia.org/wiki/Standard_Germ"
  },
  {
    "path": "data/phones/phones/ell_broad.phones",
    "chars": 677,
    "preview": "# Based on:\n# https://en.wiktionary.org/wiki/Appendix:Greek_pronunciation\n#\n# Most of the \"Occasionally\" forms are found"
  },
  {
    "path": "data/phones/phones/eng_uk_broad.phones",
    "chars": 816,
    "preview": "# TODO: This is not yet based closely on the guidelines:\r\n# https://en.wiktionary.org/wiki/Appendix:English_pronunciatio"
  },
  {
    "path": "data/phones/phones/eng_us_broad.phones",
    "chars": 750,
    "preview": "# TODO: This is not yet based closely on the guidelines:\r\n# https://en.wiktionary.org/wiki/Appendix:English_pronunciatio"
  },
  {
    "path": "data/phones/phones/fra_broad.phones",
    "chars": 947,
    "preview": "# CONSONANTS\np\nb\nm\nf\nv\nw\nt\nd\nn\nr  # Allophone of /ʁ/.\ns\nz\nl\nʃ\nʒ\nɲ\nj\nɥ\nk\nɡ\nŋ  # Only occurs in loanwords.\nʁ\n# ORAL VOWELS"
  },
  {
    "path": "data/phones/phones/hbs_broad.phones",
    "chars": 401,
    "preview": "# Based on:\n#\n# https://en.wikipedia.org/wiki/Serbo-Croatian_phonology\n# https://en.wikipedia.org/wiki/Help:IPA/Serbo-Cr"
  },
  {
    "path": "data/phones/phones/hin_broad.phones",
    "chars": 662,
    "preview": "# Based on\n#\n# https://en.wikipedia.org/wiki/Hindustani_phonology\n#####\nə\nɑː\nɾ\nn\niː\nt̪\nk\nɪ\ns\nm\nl\nd̪\np\nʊ\nʋ\nb\nj\nɡ\nd͡ʒ\neː\nʃ"
  },
  {
    "path": "data/phones/phones/hun_narrow.phones",
    "chars": 763,
    "preview": "# Based on https://en.wiktionary.org/wiki/Appendix:Hungarian_pronunciation\n# Vowels.\n# I have no information about the p"
  },
  {
    "path": "data/phones/phones/hye_e_narrow.phones",
    "chars": 1216,
    "preview": "# Based on https://en.wikipedia.org/wiki/Armenian_language#Phonology\n# And based on the pronunciation script from Wiktio"
  },
  {
    "path": "data/phones/phones/hye_w_narrow.phones",
    "chars": 2468,
    "preview": "# Based on https://en.wikipedia.org/wiki/Armenian_language#Phonology\n# And based on the pronunciation script from Wiktio"
  },
  {
    "path": "data/phones/phones/isl_broad.phones",
    "chars": 1540,
    "preview": "# Based on:\n#\n# https://en.wikipedia.org/wiki/Icelandic_phonology\n# https://en.wikipedia.org/wiki/Help:IPA/Icelandic\n# h"
  },
  {
    "path": "data/phones/phones/ita_broad.phones",
    "chars": 482,
    "preview": "# Based on:\n#\n# https://en.wikipedia.org/wiki/Italian_phonology\n# https://en.wiktionary.org/wiki/Appendix:Italian_pronun"
  },
  {
    "path": "data/phones/phones/jpn_narrow.phones",
    "chars": 503,
    "preview": "# Based on\n# https://en.wikipedia.org/wiki/Japanese_phonology\n### \na̠\ni\nk\nɯ̟ᵝ\no̞\no̞ː\ns\nɕ\ne̞\nt\nɾ\nɨᵝ\nm\nn\nɡ\nkʲ\nɴ\nt͡s\nẽ̞\nb\nd"
  },
  {
    "path": "data/phones/phones/kat_broad.phones",
    "chars": 532,
    "preview": "# Based on:\n#\n#     Robins, R. H., and Natalie Waterson. 1952. Notes on the phonetics of the\n#     Georgian word. Bullet"
  },
  {
    "path": "data/phones/phones/khm_broad.phones",
    "chars": 1211,
    "preview": "# Based on\n# https://en.wikipedia.org/wiki/Khmer_language\n# https://en.wikipedia.org/wiki/Khmer_script\n### CONSONANTS ##"
  },
  {
    "path": "data/phones/phones/kor_narrow.phones",
    "chars": 728,
    "preview": "e̞\nn\nd͡ʑ  # Allophone of /j/.\ni\no̞\nkʰ    \npʰ\nɕ͈  # Allophone of /s͈/.\nb\na̠\nŋ\nd\nɭ\nk͈\nt͡ɕ͈  # The diacritic location seems"
  },
  {
    "path": "data/phones/phones/lat_clas_broad.phones",
    "chars": 347,
    "preview": "# The upstream Latin data from Wiktionary no longer has broad transcription,\n# though this file is being kept just in ca"
  },
  {
    "path": "data/phones/phones/lav_narrow.phones",
    "chars": 1818,
    "preview": "# Based on\n#\n# https://en.wikipedia.org/wiki/Latvian_phonology\n# https://en.wikipedia.org/wiki/Help%3AIPA%2FLatvian\n# ht"
  },
  {
    "path": "data/phones/phones/mlt_broad.phones",
    "chars": 1025,
    "preview": "# Based on\n# https://en.wikipedia.org/wiki/Maltese_language\n# https://en.wikipedia.org/wiki/Help:IPA/Maltese\n#####\n# CON"
  },
  {
    "path": "data/phones/phones/mya_broad.phones",
    "chars": 1557,
    "preview": "# Based on\n# https://en.wikipedia.org/wiki/Burmese_phonology\n# https://en.wikipedia.org/wiki/Help%3AIPA%2FBurmese\n### CO"
  },
  {
    "path": "data/phones/phones/nld_broad.phones",
    "chars": 306,
    "preview": "ə\nr\nt\nn\ns\nl\nk\nɑ\nɛ\naː\nb\nd\nm\noː\neː\ni\np\nɪ\nx\nɔ\nv\nf\nɣ\ni̯\nz\nʋ\nɦ\nŋ\nu\nʏ\ny̯\nœ\nj\nu̯\ny\nøː\n# Non-native obstruents: https://en.wikip"
  },
  {
    "path": "data/phones/phones/nob_broad.phones",
    "chars": 1217,
    "preview": "# Based on\n#\n### CONSONANTS ###\np\npː\nb\nbː\nm\nmː\nf\nfː\nv  # Allophone of /ʋ/.\nʋ\nt\ntː\nd\ndː\nn\nnː\nr\nrː\nɾ  # Less common (but a"
  },
  {
    "path": "data/phones/phones/por_bz_broad.phones",
    "chars": 462,
    "preview": "# Based on:\n#\n# https://en.wikipedia.org/wiki/Portuguese_phonology\n#\n# Ordinary vowels and glides.\na\ni\ne\no\nu\nɐ\nɔ\nɛ\nø\nj\nw"
  },
  {
    "path": "data/phones/phones/por_po_broad.phones",
    "chars": 423,
    "preview": "# Based on:\n#\n# https://en.wikipedia.org/wiki/Portuguese_phonology\n#\n# Ordinary vowels and glides.\na\ni\ne\no\nu\nɐ\nɔ\nɛ\nø\nj\nw"
  },
  {
    "path": "data/phones/phones/ron_narrow.phones",
    "chars": 548,
    "preview": "# Based on https://en.wikipedia.org/wiki/Help:IPA/Romanian\n# Vowels.\na\ne\nə\ni\niː  # This is not mentioned but seems to be"
  },
  {
    "path": "data/phones/phones/slv_broad.phones",
    "chars": 192,
    "preview": "# Based on\n# https://en.wikipedia.org/wiki/Slovene_phonology\n#####\nt\na\ni\nr\nn\ns\nk\nl\nɔ\nj\nʋ\nɛ\np\nm\nd\náː\nb\nàː\nə\nʃ\níː\nɡ\nt͡ʃ\nìː"
  },
  {
    "path": "data/phones/phones/spa_ca_broad.phones",
    "chars": 337,
    "preview": "# * Following practice on Wiktionary we assume <ll> /ʝ/.\n# * Following practice on Wiktionary we assume <hue> /w̝e/.\n# *"
  },
  {
    "path": "data/phones/phones/spa_la_broad.phones",
    "chars": 288,
    "preview": "# * Following practice on Wiktionary we assume <ll> /ʝ/.\n# * Following practice on Wiktionary we assume <hue> /w̝e/.\na\ne"
  },
  {
    "path": "data/phones/phones/tur_narrow.phones",
    "chars": 482,
    "preview": "# Based on\n# https://en.wikipedia.org/wiki/Turkish_phonology\n# https://en.wikipedia.org/wiki/Help:IPA/Turkish\n# CONSONAN"
  },
  {
    "path": "data/phones/phones/vie_hanoi_narrow.phones",
    "chars": 685,
    "preview": "# Based on\n#\n# https://en.wikipedia.org/wiki/Vietnamese_phonology\n# https://en.wikipedia.org/wiki/Help:IPA/Vietnamese\n# "
  },
  {
    "path": "data/phones/phones/vie_hue_narrow.phones",
    "chars": 593,
    "preview": "# Based on\n#\n# https://en.wikipedia.org/wiki/Help:IPA/Vietnamese\n#####\nʔ\nw\n˧˧\nə\nŋ\nj\naː\n˨˩\n˦˧˥\n˦˩\n˧˨\nɨ\nm\nk̚\nk\ni\na\n˨˩˦\nt\nn"
  },
  {
    "path": "data/phones/phones/vie_saigon_narrow.phones",
    "chars": 357,
    "preview": "˧˧\nw\n˦˥\nə\nŋ\nj\naː\n˨˩˨\nʔ\n˨˩\n˨˩˦\nɨ\nm\nk̚\na\nn\ni\nt\nk\nɗ\no\nɓ\ntʰ\nl\nŋ͡m\nʊ\nh\nɪ\nəː\nc\nʂ\nɔ\nv\nʈ\nɛ\nf\nk͡p̚\nɲ\np̚\ne\nu\nt̚\ns\nkʰ\nɹ\nɣ\nx\nz\n# FIX"
  },
  {
    "path": "data/phones/postprocess",
    "chars": 92,
    "preview": "#!/bin/bash\n# Runs summary generation script.\n\nset -eou pipefail\n\n./lib/generate_summary.py\n"
  },
  {
    "path": "data/phones/summary.tsv",
    "chars": 2239,
    "preview": "ady_narrow.phones\tady\tAdyghe\tAdyghe\tNarrow\t67\nafr_broad.phones\tafr\tAfrikaans\tAfrikaans\tBroad\t61\naze_narrow.phones\taze\tAz"
  },
  {
    "path": "data/scrape/README.md",
    "chars": 49502,
    "preview": "* Languages: 307\n  * Broad transcription files: 307\n  * Narrow transcription files: 175\n* Dialects: 17\n  * Broad transcr"
  },
  {
    "path": "data/scrape/lib/README.md",
    "chars": 3661,
    "preview": "# \"Big scrape\" scripts\n\n[`scrape`](../scrape) calls WikiPron's scraping functions on all Wiktionary\nlanguages with over "
  },
  {
    "path": "data/scrape/lib/codes.py",
    "chars": 7932,
    "preview": "#!/usr/bin/env python\n\"\"\"Identifies all languages with over 100 IPA entries on Wiktionary.\n\nThis is a tool for scraping "
  },
  {
    "path": "data/scrape/lib/common_characters.py",
    "chars": 4866,
    "preview": "#!/usr/bin/env python\n\"\"\"Creates JSON file containing list of \"Common\" for each language.\n\nThis module takes TSV files f"
  },
  {
    "path": "data/scrape/lib/generate_summary.py",
    "chars": 5991,
    "preview": "#!/usr/bin/env python\n\nimport csv\nimport json\nimport logging\nimport operator\nimport os\nfrom typing import Any\n\nimport pa"
  },
  {
    "path": "data/scrape/lib/languages.json",
    "chars": 66638,
    "preview": "{\n    \"aar\": {\n        \"iso639_name\": \"Afar\",\n        \"wiktionary_name\": \"Afar\",\n        \"wiktionary_code\": \"aa\",\n      "
  },
  {
    "path": "data/scrape/lib/languages_update.py",
    "chars": 4274,
    "preview": "#!/usr/bin/env python\n\"\"\"Updates languages.json.\n\nThis module takes the data/tsv directory as input and returns an\nupdat"
  },
  {
    "path": "data/scrape/lib/scrape.py",
    "chars": 10082,
    "preview": "#!/usr/bin/env python\n\"\"\"Runs the big scrape.\"\"\"\n\nimport argparse\nimport contextlib\nimport datetime\nimport json\nimport l"
  },
  {
    "path": "data/scrape/lib/split.py",
    "chars": 3072,
    "preview": "#!/usr/bin/env python\n\"\"\"Splits TSV files into new TSV files.\n\nThis module identifies the orthographic script present in"
  },
  {
    "path": "data/scrape/lib/unmatched_languages.json",
    "chars": 905,
    "preview": "{\n    \"cel-bry-pro\": {\n        \"wiktionary_name\": \"Proto-Brythonic\"\n    },\n    \"gem-pro\": {\n        \"wiktionary_name\": \""
  },
  {
    "path": "data/scrape/postprocess",
    "chars": 473,
    "preview": "#!/bin/bash\n# Applies sorting and splitting on all files.\n\nset -eou pipefail\n\nmain() {\n    ./lib/languages_update.py\n   "
  },
  {
    "path": "data/scrape/scrape",
    "chars": 70,
    "preview": "#!/bin/bash\n# Runs the scrape.\n\nset -eou pipefail\n\n./lib/scrape.py $@\n"
  },
  {
    "path": "data/scrape/summary.tsv",
    "chars": 33907,
    "preview": "aar_latn_broad.tsv\taar\tAfar\tAfar\tLatin\t\tFalse\tBroad\t1584\naar_latn_narrow.tsv\taar\tAfar\tAfar\tLatin\t\tFalse\tNarrow\t1548\nabk_"
  },
  {
    "path": "data/scrape/tsv/aar_latn_broad.tsv",
    "chars": 29590,
    "preview": "Amcar\ta m ħ a r\nAmcarto\ta m ħ a r t o\nAmcartu\ta m ħ a r t u\nBaariis\tb aː r iː s\nBiritaanya\tb i r i t aː n j a\nCabsa\tħ a "
  },
  {
    "path": "data/scrape/tsv/aar_latn_narrow.tsv",
    "chars": 30079,
    "preview": "Amcar\tʔ ʌ m ħ ʌ ɾ\nAmcarto\tʔ ʌ m ħ ʌ ɾ t ɔ\nAmcartu\tʔ ʌ m ħ ʌ ɾ t ʊ\nBaariis\tb aː ɾ iː s\nBiritaanya\tb ɪ ɾ ɪ t aː n j ʌ\nCabs"
  },
  {
    "path": "data/scrape/tsv/abk_cyrl_broad.tsv",
    "chars": 1999,
    "preview": "Џ\td͜ʐ\nЏь\td͡ʒ\nГь\tc\nГә\tkʷ\nДә\tdʷ\nЖь\tʒ\nЖә\tʒʷ\nКь\tcʼ\nКә\tkʼʷ\nТә\ttʷʼ\nХ'\tχˤ\nХ'ә\tχˤʷ\nХә\tʍ\nЦь\tt ɕʰ\nЦә\tt͡ɕʷʰ\nШь\tʃ\nШә\tʃʷ\nааба\ta a p a"
  },
  {
    "path": "data/scrape/tsv/abk_cyrl_narrow.tsv",
    "chars": 19532,
    "preview": "Џаџа\tɖ ʐ a ɖ ʐ a\nЏибути\tɖ ʐ ə j b u tʼ i\nАбзагә\ta b z a ɡʷ\nАбуџа\ta b u ɖ ʐ a\nАдамыр\ta d a m ə r\nАдаԥазар\ta d a pʰ a z a "
  },
  {
    "path": "data/scrape/tsv/acw_arab_broad.tsv",
    "chars": 38055,
    "preview": "آذى\tʔ aː z a\nآسيا\tʔ aː s j a\nآيسكريم\tʔ aː j s k i r iː m\nأبجورات\tʔ a b a d͡ʒ oː r aː t\nأبجورة\tʔ a b a d͡ʒ oː r a\nأبغى\tʔ "
  },
  {
    "path": "data/scrape/tsv/acw_arab_narrow.tsv",
    "chars": 12079,
    "preview": "أبجورة\tʔ a b a d͡ʒ o̞ː r a\nأحبك\tʔ a ħ ʊ b b a k\nأحبك\tʔ a ħ ʊ b b ɪ k\nأذربيجان\tʔ a ð a r b e̞ː d͡ʒ aː n\nأرقوز\tʔ a r a ɡ o"
  },
  {
    "path": "data/scrape/tsv/ady_cyrl_narrow.tsv",
    "chars": 112490,
    "preview": "Алахь\taː lˤ aː ħ\nАмыщ\taː m ə ɕ\nАушыджэр\taː w ʃ ə d͡ʒ a r\nАфы\taː f ə\nАхын\taː x ə n\nБытырбыф\tb a t ə r b ə f\nДаут\td a w ə "
  },
  {
    "path": "data/scrape/tsv/ady_cyrl_narrow_filtered.tsv",
    "chars": 107264,
    "preview": "Амыщ\taː m ə ɕ\nАушыджэр\taː w ʃ ə d͡ʒ a r\nАфы\taː f ə\nАхын\taː x ə n\nБытырбыф\tb a t ə r b ə f\nДаут\td a w ə t\nДжантый\td͡ʒ aː "
  },
  {
    "path": "data/scrape/tsv/afb_arab_broad.tsv",
    "chars": 11612,
    "preview": "آش\tɑː ʃ\nأوتي\tuː t i\nأوردي\to rː ɪ d i\nإثم\tɪ θ m\nإثم\tɪ θ ɪ m\nإنترنت\tɪ n t ə r n ə t\nإنجليزي\tɪ ŋ ɡ ɪ l eː z i\nابلة\tə b lˤ ə"
  },
  {
    "path": "data/scrape/tsv/afr_latn_broad.tsv",
    "chars": 39614,
    "preview": "'is\tə s\n'n\tə\nA\tɑː\nAWB\taː v i ə b i ə\nAarde\tɑː r d ə\nAfrika\tɑ f ɾ i k ä\nAfrikaander\ta f r i k ɑː n d ə r\nAfrikaans\ta f r "
  },
  {
    "path": "data/scrape/tsv/afr_latn_broad_filtered.tsv",
    "chars": 38779,
    "preview": "'is\tə s\n'n\tə\nA\tɑː\nAWB\taː v i ə b i ə\nAarde\tɑː r d ə\nAfrikaander\ta f r i k ɑː n d ə r\nAfrikaans\ta f r i k ɑː n s\nAfrikaan"
  },
  {
    "path": "data/scrape/tsv/afr_latn_narrow.tsv",
    "chars": 2156,
    "preview": "Charlize\tʃ a r l i z\nChina\tʃ i n a\nTsjeggies\tt ʃ ɛ χ i s\naand\tɑː n t\naardwolf\tɑː r t v o ɫ f\nafslaandak\tɑ f s l ɑː n d ɑ"
  },
  {
    "path": "data/scrape/tsv/aii_syrc_narrow.tsv",
    "chars": 87491,
    "preview": "ܐܐܪ\tʔ ɑː ʔ a r\nܐܐܪܝܐ\tʔ ɑː ʔ a r ɑː j ɑː\nܐܒ\tʔ ɑ bː\nܐܒܐ\tʔ a v ɑː\nܐܒܐ\tʔ a w ɑː\nܐܒܓܕ\ta b d͡ʒ a d\nܐܒܓܕ\ta b ɡ a d\nܐܒܗܐ\tʔ a v ɑ"
  },
  {
    "path": "data/scrape/tsv/ajp_arab_broad.tsv",
    "chars": 51178,
    "preview": "آخر\tʔ aː x i r\nآخرة\tʔ aː x r a\nآخرة\tʔ aː x r e\nآذار\tʔ aː ð aː r\nآسف\tʔ aː s e f\nآسيا\tʔ aː s j a\nآلاف\tʔ aː l aː f\nآلة\tʔ aː"
  },
  {
    "path": "data/scrape/tsv/ajp_arab_narrow.tsv",
    "chars": 52284,
    "preview": "آخر\tʔ aː x ɪ r\nآخر\tʔ æː x ɪ r\nآخرة\tʔ æː x r a\nآخرة\tʔ æː x r e\nآذار\tʔ a zˤ ɑː rˤ\nآذار\tʔ a ðˤ ɑː rˤ\nآسف\tʔ æː s ɪ f\nآسيا\tʔ "
  },
  {
    "path": "data/scrape/tsv/akk_latn_broad.tsv",
    "chars": 11739,
    "preview": "Abum\ta b u m\nAdad\ta d a d\nAddarum\ta d d a r u m\nAkkade\ta k k a d e\nAmurrum\ta m u r r u m\nAusi'i\ta u s i ʔ i\nAyyārum\ta j "
  },
  {
    "path": "data/scrape/tsv/ale_latn_broad.tsv",
    "chars": 1986,
    "preview": "Hl\tɬ\nHm\tm̥\nHn\tn̥\nHng\tŋ̊\nKasakax̂\tk a s a k a χ\nNiiĝuĝis\tn iː ʁ u ʁ i s\nX̂\tχ\nYapuunax̂\tj a p uː n a χ\nYapuunix̂\tj a p uː "
  },
  {
    "path": "data/scrape/tsv/amh_ethi_broad.tsv",
    "chars": 5807,
    "preview": "ሀሎ\th a l o\nሀያ\th a j a\nሀይቅ\th a j kʼ\nሁለተኛ\th u l ə tː ə ɲː a\nሁለት\th u l ə tː\nሁሉ\th u lː u\nሄሊኮፕተር\th e l i k o p t ə ɾ\nህያ\th ɨ j"
  },
  {
    "path": "data/scrape/tsv/ang_latn_broad.tsv",
    "chars": 533326,
    "preview": "Abraham\tɑː b r ɑ x ɑː m\nAbrahames\tɑː b r ɑ x ɑː m e s\nAffrica\tɑ f f r i k ɑ\nAfricanisc\tɑː f r i k ɑː n i ʃ\nAfrisc\tɑː f r"
  },
  {
    "path": "data/scrape/tsv/ang_latn_narrow.tsv",
    "chars": 288254,
    "preview": "Abraham\tɑː b r ɑ h ɑː m\nAbrahames\tɑː b r ɑ h ɑː m e s\nAfricanisc\tɑː v r i k ɑː n i ʃ\nAlda\tɑ ɫ d ɑ\nAldwine\tɑ ɫ d w i n e\n"
  },
  {
    "path": "data/scrape/tsv/aot_latn_broad.tsv",
    "chars": 2956,
    "preview": "aat\taː t\nagos\ta ɡ o s\naguk\ta ɡ u k\nagynja\ta ɡ ə n d͡ʑ a\nagys\ta ɡ ə s\naina\ta j n a\nasalja\ta sʰ a l d͡ʑ a\nasingja\ta sʰ i ŋ"
  },
  {
    "path": "data/scrape/tsv/apw_latn_narrow.tsv",
    "chars": 3016,
    "preview": "Dziłtʼaadn\td z ɪ̀ ɬ t ɑ́ː d ǹ̩\nat’eedn\ta tʰ ʔ eː t n\nbesh\tp ɛ́ ʃ\nbichan\tp i t ʃʰ a n\nbichizh\tp i t ʃʰ ɪ ʒ\nbiditlid\tp ɪ t"
  },
  {
    "path": "data/scrape/tsv/ara_arab_broad.tsv",
    "chars": 255203,
    "preview": "ء\tʔ\nآ\tʔ aː\nآبد\tʔ aː b i d\nآبلة\tʔ aː b i l a\nآبيس\tʔ aː b iː s\nآتشيوت\taː t ʃ i j oː t\nآتشيوت\tʔ aː t ʃ i j uː t\nآتون\tʔ aː t"
  },
  {
    "path": "data/scrape/tsv/ara_arab_narrow.tsv",
    "chars": 1791,
    "preview": "أذربيجان\tʔ a ð a r b e̞ː d͡ʒ aː n\nأفلاطون\tʔ a f l aː tˤ o̞ː n\nأقنع\tʔ a q n a ʕ\nأقنع\tʔ a ɡ n a ʕ\nأكتوبر\tʔ ʊ k t o̞ː b a r"
  },
  {
    "path": "data/scrape/tsv/arc_hebr_broad.tsv",
    "chars": 21533,
    "preview": "אבא\tʔ ɛ b ɑ ʔ\nאבאשותא\tʔ a b e ʔ ɑ ʃ u t ɑ ʔ\nאבד\tʔ ə b a d\nאבדא\tʔ a b d ɑ ʔ\nאבדלתא\tʔ a b d ɑ l t ɑ ʔ\nאבדנא\tʔ a b d ɑ n ɑ "
  },
  {
    "path": "data/scrape/tsv/arg_latn_broad.tsv",
    "chars": 5912,
    "preview": "ababol\ta b a b o l\nabadexo\ta b a d e ʃ o\nabarca\ta b a ɾ k a\nabella\ta b e ʎ a\nabet\ta b e t\nabetullo\ta b e t u ʎ o\nabril\ta"
  },
  {
    "path": "data/scrape/tsv/ary_arab_broad.tsv",
    "chars": 33410,
    "preview": "آخر\tʔ aː x i r\nآخر\tʔ aː x u r\nآخرة\tʔ aː x r a\nآدى\tʔ aː d a\nآش\tʔ aː ʃ\nآنا\tʔ aː n a\nآه\tʔ aː h\nأبجور\tʔ a b a ʒ uː r\nأبدا\tʔ "
  },
  {
    "path": "data/scrape/tsv/arz_arab_broad.tsv",
    "chars": 3005,
    "preview": "آه\tʔ aː\nأربعة\tɑ ɾˤ b ɑ ʕ ɑ\nأربعتاشر\tɑ ɾˤ b ɑ ʕ t ɑː ʃ ɑ ɾˤ\nأربعين\tʔ a r b a ʕ iː n\nأستيكة\ta s t iː k a\nأنا\tä n ä\nأنهي\tʔ "
  },
  {
    "path": "data/scrape/tsv/asm_beng_broad.tsv",
    "chars": 46258,
    "preview": "ং\tŋ\nঅ\tɔ\nঅঁ\tɔ̃\nঅঁকৰা\tɔ̃ k ɔ ɹ a\nঅংক\tɔ ŋ k ɔ\nঅংগ\tɔ ŋ ɡ ɔ\nঅইন\tɔ ɪ n\nঅকল\tɔ k ɔ l\nঅকলশৰীয়া\tɔ k ɔ l x ɔ ɹ i a\nঅকলে\tɔ k ɔ l ɛ\n"
  },
  {
    "path": "data/scrape/tsv/ast_latn_broad.tsv",
    "chars": 21535,
    "preview": "Afganistán\ta f ɡ a n i s t a n\nAique\ta i k e\nAlbania\ta l b a n j a\nAlemaña\ta l e m a ɲ a\nAlexandru\ta l e ʃ a n d ɾ u\nAmb"
  },
  {
    "path": "data/scrape/tsv/ast_latn_narrow.tsv",
    "chars": 22034,
    "preview": "Afganistán\ta f ɣ̞ a n i s̪ t̪ ã ŋ\nAique\ta i̯ k e\nAlbania\ta l β̞ a n j a\nAlemaña\ta l e m a ɲ a\nAlexandru\ta l e ɕ ã n̪ d̪ "
  },
  {
    "path": "data/scrape/tsv/ayl_arab_broad.tsv",
    "chars": 2787,
    "preview": "آكو\taː k o\nالزب\tə zˤː ɑ b\nبالعاني\tb ə l ʕ aː n i\nبتات\tb t aː t\nبخشة\tb o χ ʃ a\nبزار\tb zˤ ɑː r\nبزع\tb zˤ ɑ ʕ\nبزع\tb ɑ zˤː ɑ "
  },
  {
    "path": "data/scrape/tsv/aze_latn_broad.tsv",
    "chars": 6593,
    "preview": "Amsterdam\tɑ m s t e r d ɑ m\nCəbrayıl\td ʒ æ b ɾ ɑ j ɯ l\nDaşkənd\td ɑ ʃ c æ n t\nGəncə\td ʒ æ n d͡z æ\nGəncə\tɟ æ n d ʒ æ\nHeydə"
  },
  {
    "path": "data/scrape/tsv/aze_latn_narrow.tsv",
    "chars": 87719,
    "preview": "ABŞ\tɑ b ʃ\nAlmaniya\tɑ ɫ m ɑ n i j ɑ\nAminə\tɑ m i n æ\nArkadi\tɑ r k ɑ d i\nAvroviziya\tɑ v r o v iː z i j ɑ\nAvroviziya\tɑ v r ɑ"
  },
  {
    "path": "data/scrape/tsv/aze_latn_narrow_filtered.tsv",
    "chars": 83222,
    "preview": "ABŞ\tɑ b ʃ\nAlmaniya\tɑ ɫ m ɑ n i j ɑ\nAminə\tɑ m i n æ\nArkadi\tɑ r k ɑ d i\nAvroviziya\tɑ v r o v iː z i j ɑ\nAvroviziya\tɑ v r ɑ"
  },
  {
    "path": "data/scrape/tsv/bak_cyrl_broad.tsv",
    "chars": 1511,
    "preview": "А\tä\nА\tɑ\nБ\tb\nБ\tβ\nВ\tv\nВ\tw\nГ\tɡ\nД\td\nЕ\tj e\nЕ\tj ɪ̞\nЕ\tɪ̞\nЕ\tʲe\nЖ\tʐ\nЖ\tʒ\nЗ\tz\nИ\te\nИ\ti\nИ\tɪ j\nЙ\tj\nК\tk\nЛ\tl\nЛ\tɫ\nМ\tm\nМарат\tm ɑ r ɑ t\nМор"
  },
  {
    "path": "data/scrape/tsv/bak_cyrl_narrow.tsv",
    "chars": 39993,
    "preview": "Ё\tj o\nЁ\tj ɵ\nЁ\tʲo\nЁ\tʲɵ\nАбдулла\tɑ b d u ɫː ɑ\nАвстралия\tä f s t r ä lʲ i j ä\nАзамат\tɑ z ɑ m ɑ t\nАзия\tä zʲ i j ä\nАзия\tä zʲ i"
  },
  {
    "path": "data/scrape/tsv/ban_bali_broad.tsv",
    "chars": 5796,
    "preview": "ᬅᬃᬘᬵ\ta r t͡ʃ ə\nᬅᬃᬚᬸᬦ\ta r d͡ʒ u n ə\nᬅᬃᬣ\ta r t ə\nᬅᬓ᭄ᬱᬭ\ta k s a r ə\nᬅᬗᬶᬦ᭄\ta ŋ i n\nᬅᬗ᭄ᬕᬭ\ta ŋ ɡ a r ə\nᬅᬗ᭄ᬰ\ta ŋ s ə\nᬅᬡ᭄ᬥ\ta n d"
  },
  {
    "path": "data/scrape/tsv/bar_latn_broad.tsv",
    "chars": 29052,
    "preview": "Adabei\taː d̥ a b̥ a ɛ̯\nAdabei\taː d̥ a b̥ æː\nAdachsl\tɑː d̥ ɑ ɡ̥ s l̩\nAdress\tɑ d̥ r e̞ s\nAdresserl\tɑ d̥ r e̞ s ɐ l\nAdresse"
  },
  {
    "path": "data/scrape/tsv/bbl_geor_broad.tsv",
    "chars": 2269,
    "preview": "ალე\taː l ĕ\nალხაძურ\ta l x a d͡z u r\nანგრიშ\ta n ɡ r i ʃ\nანწლ\ta n t͡sʼ l\nატტაჼ\ta tː ã\nატტიჼ\ta tːʼ ĩ\nატყ\ta tʼ qʼ\nაუ\ta u̯\nაუჼ"
  },
  {
    "path": "data/scrape/tsv/bbn_latn_broad.tsv",
    "chars": 3298,
    "preview": "Muduapa\tm uⁿ d u a p a\nUniapa\tu n i a p a\nala\ta l a\nani\ta n i\nba\tb a\nbali\tb a l i\nbalu\tb a l u\nbarema\tb a r e m a\nbarita"
  },
  {
    "path": "data/scrape/tsv/bcl_latn_broad.tsv",
    "chars": 101373,
    "preview": "Abril\tʔ a b ɾ i l\nAgosto\tʔ a ɡ o s t o\nAguilar\tʔ a ɡ i l a ɾ\nAlbay\tʔ a l b a j\nAlzaga\tʔ a l s a ɡ a\nAsya\tʔ a s j a\nAñonu"
  },
  {
    "path": "data/scrape/tsv/bcl_latn_narrow.tsv",
    "chars": 105963,
    "preview": "Abril\tʔ a b ɾ i l̪\nAgosto\tʔ a ɡ o s t o\nAguilar\tʔ a ɡ i l̪ a ɾ\nAlbay\tʔ a l̪ b a ɪ̯\nAlzaga\tʔ a l̪ s a ɡ a\nAsya\tʔ a ʃ a\nAñ"
  },
  {
    "path": "data/scrape/tsv/bdq_latn_broad.tsv",
    "chars": 2563,
    "preview": "Rađe\tr a ɗ ɛː\nahrei\tʔ a h r ɛː j\nakar\tʔ a k aː r\nakăn\tʔ a k a n\nao\tʔ aː w\napo\tʔ a p ɔː\nareh\tʔ a r ɛ h\narăng\tʔ a r a ŋ\nat"
  },
  {
    "path": "data/scrape/tsv/bel_cyrl_narrow.tsv",
    "chars": 126981,
    "preview": "Іарданія\tj i a r d a nʲ i j a\nІван\tj i v a n\nІванава\tj i v a n a v a\nІваноў\tj i v a n o u̯\nІвацэвічы\tj i v a t͡s ɛ vʲ i "
  },
  {
    "path": "data/scrape/tsv/ben_beng_broad.tsv",
    "chars": 128977,
    "preview": "ং\tŋ\nঅ\to\nঅ\tɔ\nঅঁকোল\tɔ k o l\nঅঁকোল\tɔ̃ k o l\nঅংকন\tɔ ŋ k o n\nঅংগুস্তানা\to ŋ ɡ u ʃ t a n a\nঅংশ\tɔ ŋ ʃ o\nঅংশক\tɔ ŋ ʃ ɔ k\nঅংশমান\tɔ"
  },
  {
    "path": "data/scrape/tsv/ben_beng_dhaka_broad.tsv",
    "chars": 154924,
    "preview": "ং\tŋ\nঅ\to\nঅ\tɔ\nঅঁকোল\tɔ k o l\nঅঁকোল\tɔ̃ k o l\nঅংকন\tɔ ŋ k o n\nঅংগুস্তানা\tɔ ŋ ɡ u s t̪ a n a\nঅংশ\tɔ ŋ ʃ o\nঅংশক\tɔ ŋ ʃ ɔ k\nঅংশমান\t"
  },
  {
    "path": "data/scrape/tsv/ben_beng_dhaka_broad_filtered.tsv",
    "chars": 125251,
    "preview": "ং\tŋ\nঅ\to\nঅ\tɔ\nঅঁকোল\tɔ k o l\nঅঁকোল\tɔ̃ k o l\nঅংকন\tɔ ŋ k o n\nঅংগুস্তানা\tɔ ŋ ɡ u s t̪ a n a\nঅংশ\tɔ ŋ ʃ o\nঅংশক\tɔ ŋ ʃ ɔ k\nঅংশমান\t"
  },
  {
    "path": "data/scrape/tsv/ben_beng_dhaka_narrow.tsv",
    "chars": 150866,
    "preview": "ং\tŋ\nঅঁকোল\tɔ k o l\nঅঁকোল\tɔ̃ k o l\nঅংগুস্তানা\tɔ ŋ ɡ u s t̪ a n aˑ\nঅংশ\tɔ ŋ ʃ oˑ\nঅংশক\tɔ ŋ ʃ ɔ k\nঅংশমান\tɔ ŋ ʃ ɔ m a n\nঅংশল\tɔ "
  },
  {
    "path": "data/scrape/tsv/ben_beng_narrow.tsv",
    "chars": 114166,
    "preview": "অঁকোল\tɔ k o l\nঅঁকোল\tɔ̃ k o l\nঅংশ\tɔ ŋ ʃ oˑ\nঅংশক\tɔ ŋ ʃ ɔ k\nঅংশমান\tɔ ŋ ʃ ɔ m a n\nঅংশল\tɔ ŋ ʃ ɔ l\nঅংশাবতার\tɔ ŋ ʃ a b o t̪ a ɹ"
  },
  {
    "path": "data/scrape/tsv/ben_beng_rarh_broad.tsv",
    "chars": 97111,
    "preview": "ং\tŋ\nঅ\to\nঅ\tɔ\nঅঁকোল\tɔ k o l\nঅঁকোল\tɔ̃ k o l\nঅংকন\tɔ ŋ k o n\nঅংগুস্তানা\tɔ ŋ ɡ u s t̪ a n a\nঅংশ\tɔ ŋ ʃ o\nঅংশক\tɔ ŋ ʃ ɔ k\nঅংশমান\t"
  },
  {
    "path": "data/scrape/tsv/ben_beng_rarh_broad_filtered.tsv",
    "chars": 80448,
    "preview": "ং\tŋ\nঅ\to\nঅ\tɔ\nঅঁকোল\tɔ k o l\nঅঁকোল\tɔ̃ k o l\nঅংকন\tɔ ŋ k o n\nঅংগুস্তানা\tɔ ŋ ɡ u s t̪ a n a\nঅংশ\tɔ ŋ ʃ o\nঅংশক\tɔ ŋ ʃ ɔ k\nঅংশমান\t"
  },
  {
    "path": "data/scrape/tsv/ben_beng_rarh_narrow.tsv",
    "chars": 124185,
    "preview": "ং\tŋ\nঅঁকোল\tɔ k o l\nঅঁকোল\tɔ̃ k o l\nঅংগুস্তানা\tɔ ŋ ɡ u s t̪ a n aˑ\nঅংশ\tɔ ŋ ʃ oˑ\nঅংশক\tɔ ŋ ʃ ɔ k\nঅংশমান\tɔ ŋ ʃ ɔ m a n\nঅংশল\tɔ "
  },
  {
    "path": "data/scrape/tsv/bjb_latn_broad.tsv",
    "chars": 2583,
    "preview": "Galinyala\tɡ a l i ɲ a l a\namiwarra\ta m i w a ɾ a\nawoo\ta w u\nbabi\tb a b i\nbabmandidhi\tb a b m a n d i d̟ i\nbalgiringgoodh"
  },
  {
    "path": "data/scrape/tsv/blt_tavt_narrow.tsv",
    "chars": 3329,
    "preview": "ꪀꪚꪾ\tk a p̚ ˦˥\nꪀꪫꪱꪉ\tkʷ aː ŋ ˨\nꪀꪱ\tk aː ˨\nꪀꪳ꪿\tk ɨː ˦˥\nꪀꪷꪵꪀ\tk ɔː ˨ k ɛː ˨\nꪀꪾꪚ\tk a p̚ ˦˥\nꪀ꪿ꪱ\tk aː ˦˥\nꪁꪫꪱꪣ\tkʷ aː m ˥\nꪁꪫꪱꪥ\tkʷ a"
  },
  {
    "path": "data/scrape/tsv/bod_tibt_broad.tsv",
    "chars": 53709,
    "preview": "ཀ\t* k a\nཀ\tk a ˥˥\nཀ་ཀ་ནི\t* k a k a n i\nཀ་ཀ་རང\t* k a k a r a ŋ\nཀ་ཀོ་ལ\t* k a k o l a\nཀ་ཁུ་བྷ་ཡ\t* k a kʰ u p h a j a\nཀ་གཉིས་"
  },
  {
    "path": "data/scrape/tsv/bre_latn_broad.tsv",
    "chars": 14583,
    "preview": "Breizh\tb r ɛ j s\nBreizhad\tb r ɛ j z a t\nBreizhiz\tb r ɛ j z i s\nBrezhon\tb ʁ e z ɔ̃ː n\nC'hwevrer\tx w e v r ɛ r\nDu\td yː\nEbr"
  },
  {
    "path": "data/scrape/tsv/bua_cyrl_broad.tsv",
    "chars": 2174,
    "preview": "Ага\ta ɢ a\nАзи\taː z i\nБайгал\tb ɛː ɢ a l\nБуряад\tb ʊ rʲ aː d\nБээжэн\tb eː ʒ e ŋ\nВикипеэди\tv i k i pʲ eː d i\nДоржо\td ɔ r ʒ ɔ\n"
  },
  {
    "path": "data/scrape/tsv/bua_cyrl_narrow.tsv",
    "chars": 2692,
    "preview": "Ага\tä ʁ ɐ\nАзи\täː zʲ ɪ\nБайгал\tb ɛ͡e̞ ʁ ɐ l\nБуряад\tb ʊ rʲ äː t\nБээжэн\tb eː ʒ ɤ ŋ\nВикипеэди\tvʲ ɪ kʰʲ ɪ pʰʲ e̞ː dʲ ɪ\nДоржо\td"
  },
  {
    "path": "data/scrape/tsv/bul_cyrl_narrow.tsv",
    "chars": 1160981,
    "preview": "А\ta\nАбаджиев\tɐ b ɐ d ʒ i ɛ f\nАвакум\tɐ v ɐ k u m\nАврамов\tɐ v r a m o f\nАвстралия\tɐ f s t r a l i j ɐ\nАвстрия\ta f s t r i "
  },
  {
    "path": "data/scrape/tsv/cat_latn_broad.tsv",
    "chars": 2297,
    "preview": "Bosch\tb ɔ s k\nCanigó\tk a n i ɡ u\nCànoes\tk a n u s\nElna\tj e l n ə\nSàhara\ts a h a ɾ a\nSàhara\ts a h ə ɾ ə\na\ta\na\tə\nab\ta b\nab"
  },
  {
    "path": "data/scrape/tsv/cat_latn_narrow.tsv",
    "chars": 2357534,
    "preview": "A\ta\nA\tə\nAMPA\ta m p a\nAMPA\ta m p ə\nAaron\ta a ɾ o n\nAaron\tə ə ɾ o n\nAaró\ta a ɾ o\nAaró\tə ə ɾ o\nAbelard\ta b e l a ɾ t\nAbelar"
  },
  {
    "path": "data/scrape/tsv/cbn_thai_broad.tsv",
    "chars": 2097,
    "preview": "กวน\tk u a n\nกะชอʔ\tk a cʰ ɔː ʔ\nกะมุร\tk a m u r\nกะเตาแด็ร\tk a t a w d æ r\nกะเทอะ่\tk a tʰ ɤ̤ ʔ\nกาʔ\tk aː ʔ\nกาว\tk aː w\nกุร\tk "
  },
  {
    "path": "data/scrape/tsv/ceb_latn_broad.tsv",
    "chars": 60849,
    "preview": "Abapo\tʔ a b a p o\nAbayon\tʔ a b a j o n\nAbello\tʔ a b e l j o\nAbesamis\tʔ a b e s a m i s\nAbril\tʔ a b ɾ i l\nAcebedo\tʔ a s e"
  },
  {
    "path": "data/scrape/tsv/ceb_latn_narrow.tsv",
    "chars": 63057,
    "preview": "Abapo\tʔ ʌ b a p ɔ\nAbayon\tʔ ʌ b a j ɔ n̪\nAbello\tʔ ʌ b i l̪ j ɔ\nAbesamis\tʔ ʌ b ɪ s̪ a m ɪ s̪\nAbril\tʔ ʌ b ɾ̪ i l̪\nAcebedo\tʔ"
  },
  {
    "path": "data/scrape/tsv/ces_latn_narrow.tsv",
    "chars": 1022381,
    "preview": "AIDS\ta j t s\nAIDS\tɛ j t s\nAbcházie\ta p x aː z ɪ j ɛ\nAbdiáš\ta b ɟ ɪ j aː ʃ\nAbdon\ta b d o n\nAbidžan\ta b ɪ d ʒ a n\nAbraham\t"
  },
  {
    "path": "data/scrape/tsv/chb_latn_broad.tsv",
    "chars": 2169,
    "preview": "a\ta\naba\ta β a\nabago\ta β a ɣ o\nabquy\ta b k ɨ\nabquy\ta β k ɨ\nagua\ta ɣ u a\naica\ta i k a\naisado\ta i s a d o\nalcalde\ta l k a l"
  },
  {
    "path": "data/scrape/tsv/che_cyrl_broad.tsv",
    "chars": 2439,
    "preview": "Гӏалгӏай\tʁ ə l ʁ ɑ j\nПхьармат\tp ħ a ɾ m a t\nРосси\tr ɐ s i\nатта\taː tː a\nаьрзу\tæ r z uː\nаьтту\tæː tː uː\nбархӏ\tb a r̥\nбод\tb "
  },
  {
    "path": "data/scrape/tsv/cho_latn_broad.tsv",
    "chars": 2269,
    "preview": "A̱\tãː\nChalakki\tt͡ʃ a l ə k k í ʔ\nChalvkki\tt͡ʃ a l ə k k í ʔ\nChihowa\tt ʃ i h oː w á h\nChiisas\tt ʃ iː s ə́ s\nChisvs\tt ʃ iː"
  },
  {
    "path": "data/scrape/tsv/chr_cher_broad.tsv",
    "chars": 1132,
    "preview": "Ꭰ\ta\nᎠᎦᏍᎬ\ta ɡ a s ɡ ə̃\nᎠᎵᏥᎵᏯ\ta l i z i l i a\nᎠᏂᎪᎳ\ta n i ɡ o l a\nᎠᏂᏙᎳ\ta n i d o l a\nᎠᏂᏛᏥ\ta n i d ə̃ t s i\nᎠᏏᎵᏆᏌᏂ\ta s i l i"
  },
  {
    "path": "data/scrape/tsv/cic_latn_broad.tsv",
    "chars": 6781,
    "preview": "Chahtaꞌ\tt͡ʃ a h t a ʔ\nChalakkiꞌ\tt͡ʃ a l a k k i ʔ\nChihoowa\tt͡ʃ i h oː w a\nChikashsha\tt͡ʃ i k a ʃ ʃ a\nChikashshanompaꞌ\tt͡"
  },
  {
    "path": "data/scrape/tsv/ckb_arab_broad.tsv",
    "chars": 4710,
    "preview": "ئاسن\tʔ aː s n\nئاشتی\tʔ aː ʃ t iː\nئافرەت\tʔ aː f ɾ a t\nئاوی\tʔ aː w iː\nئاپارتمان\tʔ aː p aː ɾ t m aː n\nئاڵا\tʔ aː ɫ aː\nئۆقیانو"
  },
  {
    "path": "data/scrape/tsv/cnk_latn_broad.tsv",
    "chars": 7938,
    "preview": "adawngteh\tʔ a ˧ d ɔ̃ ˩ t e ʔ ˧\nadawteh\tʔ a ˧ d ɔ ˩ t e ʔ ˧\nae\tʔ ɛ ˧\naedawngte\tʔ ɛ ˧ d ɔ̃ ˩ t e ˧\naedawte\tʔ ɛ ˧ d ɔ ˩ t e"
  },
  {
    "path": "data/scrape/tsv/cop_copt_broad.tsv",
    "chars": 13268,
    "preview": "ϣϩⲓϭ\tʃ h i c\nϣϫⲟⲙ\tʔ ə ʃ t͡ʃʼ o m\nϣⲁϣϥ\tʃ a ʃ f\nϣⲁϣⲙⲓ\tʃ a ʃ m iː\nϣⲁϫⲉ\tʃ a ʔ t͡ʃ ə\nϣⲁⲓⲣⲓ\tʃ a ɪ r ə\nϣⲁⲗⲓⲟⲩ\tʃ ɑ l j u\nϣⲁⲗⲓⲩ\tʃ"
  },
  {
    "path": "data/scrape/tsv/cor_latn_broad.tsv",
    "chars": 2606,
    "preview": "Adam\tæ d ə m\nBrython\tb r ə θ ɔ n\nGrifyuth\tɡ r i f j ʏ θ\nGrifyuth\tɡ ɾ ɪ f j ɪ θ\nKernewek\tk ɛ r n ɛ w ɛ k\nKernow\tk ɛ r n ɔ"
  },
  {
    "path": "data/scrape/tsv/cor_latn_narrow.tsv",
    "chars": 13199,
    "preview": "Almayn\ta l m a ɪ n\nAlmayn\tæ l m ɐ n\nBeybel\tb ə ɪ b ə l\nCryst\tk ɹ iː s t\nDu\td yː\nEbrel\tɛ b r ɛ l\nEjyp\tɛ d͡ʒ ɪ p\nEjyptek\tə"
  },
  {
    "path": "data/scrape/tsv/cos_latn_broad.tsv",
    "chars": 9416,
    "preview": "A\ta\nAfrica\ta f r i k a\nAiacciu\ta j a t t ʃ u\nArabia\ta r a b i a\nAsia\ta z j a\nBelgica\tb e l d͡ʒ i k a\nCina\tt͡ʃ i n a\nCors"
  },
  {
    "path": "data/scrape/tsv/crk_latn_broad.tsv",
    "chars": 2727,
    "preview": "acosis\ta t͡s o s i s\namiskwak\ta m i s k w a k\nanikwacâs\ta n i k w a t͡s aː s\nasinîwaciy\ta s i n iː w a t͡s i j\naskihkwak"
  },
  {
    "path": "data/scrape/tsv/crk_latn_narrow.tsv",
    "chars": 3296,
    "preview": "Kisêmanitôw\tk ɪ s eː m ʌ n ɪ t oː w\nacimosis\tʌ t͡s ɪ m ʊ s ɪ s\nakohp\tʌ k k ʊʰ p\namisk\tʌ m ɪ s k\napoy\tʌ p p ʊ j\nasikan\tʌ "
  },
  {
    "path": "data/scrape/tsv/crx_cans_broad.tsv",
    "chars": 1208,
    "preview": "ᐈ\te\nᐉ\ti\nᐶ\th e\nᐷ\th i\nᑋ\th\nᑓ\tt e\nᑔ\tt i\nᓑ\tŋ\nᗄ\tx u\nᗅ\tx o\nᗆ\tx ə\nᗇ\tx e\nᗈ\tx i\nᗉ\tx a\nᗊ\tɣ u\nᗋ\tɣ o\nᗌ\tɣ ə\nᗍ\tɣ e\nᗎ\tɣ i\nᗏ\tɣ a\nᗐ\tw u\nᗑ\t"
  },
  {
    "path": "data/scrape/tsv/csb_latn_broad.tsv",
    "chars": 15700,
    "preview": "Afrika\ta f rʲ i k a\nBerlëno\tb e r l ə n ɔ\nBôłt\tb ɞ w t\nDż\td ʒ\nJastra\tj a s t r a\nJastrë\tj a s t r ə\nKaszëbskô\tk a ʃ ɜ b "
  },
  {
    "path": "data/scrape/tsv/cym_latn_nw_broad.tsv",
    "chars": 216908,
    "preview": "'ch\tχ\n'm\tm\n'ma\tm a\n'mond\tm ɔ n d\n'ny\tn ɨ̞\n'r\tr\n'sti\ts d iː\n'sti\ts d ɪ\n'th\tθ\n'u\ti̯\n'y\tə\nA\taː\nAberaeron\ta b ɛ r e ɨ̯ r ɔ n"
  },
  {
    "path": "data/scrape/tsv/cym_latn_nw_broad_filtered.tsv",
    "chars": 215334,
    "preview": "'ch\tχ\n'm\tm\n'ma\tm a\n'mond\tm ɔ n d\n'ny\tn ɨ̞\n'r\tr\n'sti\ts d iː\n'sti\ts d ɪ\n'th\tθ\n'u\ti̯\n'y\tə\nA\taː\nAberaeron\ta b ɛ r e ɨ̯ r ɔ n"
  },
  {
    "path": "data/scrape/tsv/cym_latn_nw_narrow.tsv",
    "chars": 24763,
    "preview": "'sti\ts t iː\n'sti\ts t ɪ\nAberystwyth\ta b ɛ r ə s t w ɨ̞ θ\nAffganistan\ta f ɡ a n ɪ s t a n\nAifft\ta i f t\nAifft\tə r a i f t\n"
  },
  {
    "path": "data/scrape/tsv/cym_latn_sw_broad.tsv",
    "chars": 337991,
    "preview": "'ch\tχ\n'chi\tχ iː\n'i\ti̯\n'm\tm\n'ma\tm a\n'ma\tm ə\n'mond\tm ɔ n d\n'n\tn\n'na\tn a\n'na\tn ə\n'ny\tn i\n'ny\tn iː\n'r\tr\n'sti\ts d iː\n'th\tθ\n'u"
  },
  {
    "path": "data/scrape/tsv/cym_latn_sw_broad_filtered.tsv",
    "chars": 333847,
    "preview": "'ch\tχ\n'chi\tχ iː\n'i\ti̯\n'm\tm\n'ma\tm a\n'ma\tm ə\n'mond\tm ɔ n d\n'n\tn\n'na\tn a\n'na\tn ə\n'ny\tn i\n'ny\tn iː\n'r\tr\n'sti\ts d iː\n'th\tθ\n'u"
  },
  {
    "path": "data/scrape/tsv/cym_latn_sw_narrow.tsv",
    "chars": 25866,
    "preview": "'sti\ts t iː\nAberystwyth\ta b ɛ r ə s t w ɪ θ\nAberystwyth\ta b ɛ r ə s t ʊ i̯ θ\nAffganistan\ta f ɡ a n ɪ s t a n\nAifft\ta i f"
  },
  {
    "path": "data/scrape/tsv/dan_latn_broad.tsv",
    "chars": 92209,
    "preview": "A\tæː\nA'er\taːˀ ə r\nA'erne\tæː ɐ n ə\nA'ernes\tæː ɐ n ə s\nA'ers\tæː ɐ s\nA'et\tæː ə t\nA'ets\tæː ə t s\nA's\tæː s\nABC\ta b e s eː\nAal"
  },
  {
    "path": "data/scrape/tsv/dan_latn_narrow.tsv",
    "chars": 171383,
    "preview": "A\tæːˀ\nA'er\tæːˀ ɐ\nA'erne\tæː ɐ n ə\nA'ernes\tæː ɐ n ə s\nA'ers\tæː ɐ s\nA'et\tæː ə d̥\nA'ets\tæː ə d̥ s\nA's\tæː s\nABC\tæ b̥ e s eːˀ\n"
  },
  {
    "path": "data/scrape/tsv/deu_latn_broad.tsv",
    "chars": 1446775,
    "preview": "'n\tn\n'n\tə n\n'ne\tn ə\n'nem\tn ə m\n'nen\tn ə n\n'ner\tn ɐ\n'türlich\tt yː ɐ̯ l ɪ ç\nA\tʔ aː\nAAD\taː ʔ aː d eː\nABC\taː b e t s eː\nADHS"
  },
  {
    "path": "data/scrape/tsv/deu_latn_broad_filtered.tsv",
    "chars": 1378226,
    "preview": "'n\tn\n'n\tə n\n'ne\tn ə\n'nem\tn ə m\n'nen\tn ə n\n'ner\tn ɐ\n'türlich\tt yː ɐ̯ l ɪ ç\nA\tʔ aː\nAAD\taː ʔ aː d eː\nABC\taː b e t s eː\nADHS"
  },
  {
    "path": "data/scrape/tsv/deu_latn_narrow.tsv",
    "chars": 560723,
    "preview": "'türlich\ttʰ yː ɐ̯ l ɪ ç\n'türlich\ttʰ ʏ l ɪ ç\nAE\taː eː\nAach\taː χ\nAachen\täː χ n̩\nAak\taː k\nAal\tʔ aː l\nAalbeere\taː l b eː r ə"
  },
  {
    "path": "data/scrape/tsv/div_thaa_broad.tsv",
    "chars": 30265,
    "preview": "ހަ\th ə\nހަށަނަރަ\th a ʂ a n̪ a ɾ a\nހަނދި\th aⁿ d̪ i\nހަނދު\th aⁿ d̪ u\nހަނޑޫ\th əᶯ ɖ uː\nހަނު\th a n̪ u\nހަން\th a n̪\nހަންދަރު\th a "
  },
  {
    "path": "data/scrape/tsv/div_thaa_narrow.tsv",
    "chars": 32201,
    "preview": "ހ\th aː\nހަށަނަރަ\th ä ʂ ä n̪ ä ɾ ä\nހަށްޓި\th ə ʂ ʈ i\nހަނދި\th äⁿ d̪ i\nހަނދު\th äⁿ d̪ u\nހަނޑޫ\th əᶯ ɖ uː\nހަނު\th ä n̪ u\nހަން\th ä"
  },
  {
    "path": "data/scrape/tsv/dlm_latn_broad.tsv",
    "chars": 3264,
    "preview": "acait\ta k a i t\nalegr\ta l e ɡ r ə\naninč\ta n i n t͡ʃ\nardar\ta r d a r\narziant\ta r d͡z i̯ a n t\nascondro\ta s k o n d r ə\nbi"
  },
  {
    "path": "data/scrape/tsv/dng_cyrl_broad.tsv",
    "chars": 6297,
    "preview": "Быйҗин\tp e i ² ⁴ t͡ɕ i ŋ ² ⁴\nДунган\tt u ŋ ² ⁴ k æ̃ ⁵ ¹\nЖыбын\tʐ̩ ⁴ ⁴ p ə ŋ ⁵ ¹\nЙиндў\ti ŋ ⁴ ⁴ t u ⁴ ⁴\nМыйгуй\tm e i ⁵ ¹ k u"
  },
  {
    "path": "data/scrape/tsv/dsb_latn_broad.tsv",
    "chars": 38308,
    "preview": "A\ta\nAfrika\ta f rʲ i k a\nAfriku\ta f rʲ i k u\nAlbanarka\ta l b a n a r k a\nAlbanaŕ\ta l b a n a rʲ\nAlojs\ta l ɔ j s\nAwstriska"
  },
  {
    "path": "data/scrape/tsv/dsb_latn_narrow.tsv",
    "chars": 25402,
    "preview": "Bog\tb ɔ k\nBulgarska\tb u l ɡ a r s k a\nChóśebuski\tx ɛ ɕ ə b u s kʲ i\nChóśebuski\tx ɨ ɕ ə b u s kʲ i\nChóśebuski\tx ʊ ɕ ə b u"
  },
  {
    "path": "data/scrape/tsv/dum_latn_broad.tsv",
    "chars": 3173,
    "preview": "acht\ta x t\naen\taː n\nal\ta l\nalse\ta l z ə\nalso\ta l z oː\nan\ta n\nane\taː n ə\narm\ta r m\nbat\tb a t\nbedriven\tb ə d r iː v ə n\nbe"
  },
  {
    "path": "data/scrape/tsv/dzo_tibt_broad.tsv",
    "chars": 3338,
    "preview": "ཀ\tk ɑ\nཀི་ཏབ\tk i ˥ t ɑ p ˥\nཀླད་ཀོར\tl e ˥ k o ˥\nཁ\tkʰ ɑ ˥\nཁ་ཆེ\tkʰ ɑ ˥ t͡ɕʰ e ˥\nཁ་ཆེའི་ལྷ\tkʰ ɑ ˥ t͡ɕʰ e j ˥ ɬ ɑ ˥\nཁོ\tkʰ o ˥\n"
  },
  {
    "path": "data/scrape/tsv/egy_latn_broad.tsv",
    "chars": 59384,
    "preview": "bbr\tb ɛ b ɛ r\nbbt\tb ɛ b ɛ t\nbbwt\tb ɛ b uː t\nbdt\tb ɛ d ɛ t\nbdš\tb ɛ d ɛ ʃ\nbgs\tb ɛ ɡ ɛ s\nbhd\tb ɛ h ɛ d\nbhn\tb ɛ h ɛ n\nbht\tb "
  },
  {
    "path": "data/scrape/tsv/ell_grek_broad.tsv",
    "chars": 401519,
    "preview": "Άγγελος\ta ŋ ɟ e l o s\nΆγγελου\ta ŋ ɟ e l u\nΆγια\ta ʝ i a\nΆγιε\ta ʝ i e\nΆγιες\ta ʝ i e s\nΆγιο\ta ʝ i o\nΆγιοι\ta ʝ i i\nΆγιος\ta ʝ"
  },
  {
    "path": "data/scrape/tsv/ell_grek_broad_filtered.tsv",
    "chars": 389522,
    "preview": "Άγγελος\ta ŋ ɟ e l o s\nΆγγελου\ta ŋ ɟ e l u\nΆγια\ta ʝ i a\nΆγιε\ta ʝ i e\nΆγιες\ta ʝ i e s\nΆγιο\ta ʝ i o\nΆγιοι\ta ʝ i i\nΆγιος\ta ʝ"
  },
  {
    "path": "data/scrape/tsv/ell_grek_narrow.tsv",
    "chars": 7983,
    "preview": "Άαχεν\tɐˑ ɐ ç e n\nΊσις\ti s i s\nΓη\tʝ i\nΓιόντα\tʝ o d a\nΘάνατος\tθ a n a t o s\nΙώ\ti o\nΛεωνάρδος\tl ɛ ɔ n a ɾ ð ɔ s\nΜεξικού\tm e"
  },
  {
    "path": "data/scrape/tsv/eng_latn_uk_broad.tsv",
    "chars": 1917486,
    "preview": "'Murica\tm ɝ ə k ə\n'Murica\tm ɝ ɪ k ə\n'bout\tb a ʊ t\n'cause\tk ɒ z\n'cause\tk ɔ z\n'cause\tk ə z\n'd\td\n'dine\td iː n\n'dswounds\td z"
  },
  {
    "path": "data/scrape/tsv/eng_latn_uk_broad_filtered.tsv",
    "chars": 1902110,
    "preview": "'Murica\tm ɝ ə k ə\n'Murica\tm ɝ ɪ k ə\n'bout\tb a ʊ t\n'cause\tk ɒ z\n'cause\tk ɔ z\n'cause\tk ə z\n'd\td\n'dine\td iː n\n'dswounds\td z"
  }
]

// ... and 412 more files (download for full content)
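
The previews above suggest that each scraped TSV file holds one entry per line: a word, a tab, then a pronunciation written as space-separated segments. A word may appear on several lines with different pronunciations. Below is a minimal sketch of reading that layout, assuming exactly this two-column shape. The function name `parse_pron_tsv` is hypothetical and is not part of the repository, and the sample rows are copied from the `cor_latn_broad.tsv` preview.

```python
from collections import defaultdict


def parse_pron_tsv(text):
    """Map each word to a list of pronunciations (each a list of segments).

    Hypothetical helper: assumes "word<TAB>seg seg seg" per line, as seen
    in the previews. Lines without a tab are skipped.
    """
    lexicon = defaultdict(list)
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2 or not parts[0]:
            continue  # blank or incomplete line
        word, pron = parts[0], parts[1]
        # Segments are separated by single spaces in the previews.
        lexicon[word].append(pron.split())
    return dict(lexicon)


# Rows taken from the cor_latn_broad.tsv preview above.
sample = (
    "Adam\tæ d ə m\n"
    "Grifyuth\tɡ r i f j ʏ θ\n"
    "Grifyuth\tɡ ɾ ɪ f j ɪ θ\n"
)
lexicon = parse_pron_tsv(sample)
for word, prons in lexicon.items():
    print(word, len(prons))
```

Keeping a list of pronunciations per word, rather than a single value, preserves the variant pronunciations that appear as repeated rows. Note that the previews in this listing are truncated, so the last row of a preview may be incomplete.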
