Showing preview only (483K chars total). Download the full file or copy to clipboard to get everything.
Repository: neocl/jamdict
Branch: main
Commit: 85c66c190649
Files: 55
Total size: 453.7 KB
Directory structure:
gitextract_xlbja41h/
├── .gitignore
├── .gitmodules
├── LICENSE
├── MANIFEST.in
├── README.md
├── TODO.md
├── _config.yml
├── data/
│ └── README.md
├── docs/
│ ├── Makefile
│ ├── api.rst
│ ├── conf.py
│ ├── contributing.rst
│ ├── index.rst
│ ├── install.rst
│ ├── make.bat
│ ├── recipes.rst
│ ├── requirements.txt
│ ├── tutorials.rst
│ └── updates.rst
├── jamdict/
│ ├── __init__.py
│ ├── __main__.py
│ ├── __version__.py
│ ├── config.py
│ ├── data/
│ │ ├── config_template.json
│ │ ├── setup_jmdict.sql
│ │ ├── setup_jmnedict.sql
│ │ └── setup_kanjidic2.sql
│ ├── jmdict.py
│ ├── jmdict_sqlite.py
│ ├── jmnedict_sqlite.py
│ ├── kanjidic2.py
│ ├── kanjidic2_sqlite.py
│ ├── krad.py
│ ├── tools.py
│ └── util.py
├── jamdict_demo.py
├── jamdol-flask.py
├── jmd
├── logging.json
├── release.sh
├── requirements.txt
├── run
├── setup.py
├── test/
│ ├── __init__.py
│ ├── data/
│ │ ├── JMdict_mini.xml
│ │ ├── jamdict.json
│ │ ├── jmendict_mini.xml
│ │ └── kanjidic2_mini.xml
│ ├── logging.json
│ ├── test_jamdict.py
│ ├── test_jmdict_sqlite.py
│ ├── test_jmnedict.py
│ ├── test_kanjidic2_sqlite.py
│ └── test_krad.py
└── test.sh
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.idea/
test/data/test.db
*.py~
*.sh~
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# IPython Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# dotenv
.env
# virtualenv
venv/
ENV/
# Spyder project settings
.spyderproject
# Rope project settings
.ropeproject
================================================
FILE: .gitmodules
================================================
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2016 Le Tuan Anh <tuananh.ke@gmail.com>
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: MANIFEST.in
================================================
include README.rst
include CHANGES.md
include LICENSE
include requirements*.txt
recursive-include jamdict/data/ *.sql
recursive-include jamdict/data/ *.json
recursive-include jamdict/data/ *.gz
================================================
FILE: README.md
================================================
# Jamdict
[Jamdict](https://github.com/neocl/jamdict) is a Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings.
[](https://jamdict.readthedocs.io/)
**Documentation:** https://jamdict.readthedocs.io/
# Main features
* Support querying different Japanese language resources
- Japanese-English dictionary JMDict
- Kanji dictionary KanjiDic2
- Kanji-radical and radical-kanji maps KRADFILE/RADKFILE
- Japanese Proper Names Dictionary (JMnedict)
* Fast look up (dictionaries are stored in SQLite databases)
* Command-line lookup tool [(Example)](#command-line-tools)
[Contributors](#contributors) are welcome! 🙇. If you want to help, please see [Contributing](https://jamdict.readthedocs.io/en/latest/contributing.html) page.
# Try Jamdict out
Jamdict is used in [Jamdict-web](https://jamdict.herokuapp.com/) - a web-based free and open-source Japanese reading assistant software.
Please try out the demo instance online at:
https://jamdict.herokuapp.com/
There also is a demo [Jamdict virtual machine](https://replit.com/@tuananhle/jamdict-demo) online for trying out Jamdict Python code on Repl.it:
https://replit.com/@tuananhle/jamdict-demo
# Installation
Jamdict & Jamdict database are both available on [PyPI](https://pypi.org/project/jamdict/) and can be installed using pip
```bash
pip install --upgrade jamdict jamdict-data
```
# Sample jamdict Python code
```python
from jamdict import Jamdict
jam = Jamdict()
# use wildcard matching to find anything starts with 食べ and ends with る
result = jam.lookup('食べ%る')
# print all word entries
for entry in result.entries:
print(entry)
# [id#1358280] たべる (食べる) : 1. to eat ((Ichidan verb|transitive verb)) 2. to live on (e.g. a salary)/to live off/to subsist on
# [id#1358300] たべすぎる (食べ過ぎる) : to overeat ((Ichidan verb|transitive verb))
# [id#1852290] たべつける (食べ付ける) : to be used to eating ((Ichidan verb|transitive verb))
# [id#2145280] たべはじめる (食べ始める) : to start eating ((Ichidan verb))
# [id#2449430] たべかける (食べ掛ける) : to start eating ((Ichidan verb))
# [id#2671010] たべなれる (食べ慣れる) : to be used to eating/to become used to eating/to be accustomed to eating/to acquire a taste for ((Ichidan verb))
# [id#2765050] たべられる (食べられる) : 1. to be able to eat ((Ichidan verb|intransitive verb)) 2. to be edible/to be good to eat ((pre-noun adjectival (rentaishi)))
# [id#2795790] たべくらべる (食べ比べる) : to taste and compare several dishes (or foods) of the same type ((Ichidan verb|transitive verb))
# [id#2807470] たべあわせる (食べ合わせる) : to eat together (various foods) ((Ichidan verb))
# print all related characters
for c in result.chars:
print(repr(c))
# 食:9:eat,food
# 喰:12:eat,drink,receive (a blow),(kokuji)
# 過:12:overdo,exceed,go beyond,error
# 付:5:adhere,attach,refer to,append
# 始:8:commence,begin
# 掛:11:hang,suspend,depend,arrive at,tax,pour
# 慣:14:accustomed,get used to,become experienced
# 比:4:compare,race,ratio,Philippines
# 合:6:fit,suit,join,0.1
```
## Command line tools
To make sure that jamdict is configured properly, try to look up a word using command line
```bash
python3 -m jamdict lookup 言語学
========================================
Found entries
========================================
Entry: 1264430 | Kj: 言語学 | Kn: げんごがく
--------------------
1. linguistics ((noun (common) (futsuumeishi)))
========================================
Found characters
========================================
Char: 言 | Strokes: 7
--------------------
Readings: yan2, eon, 언, Ngôn, Ngân, ゲン, ゴン, い.う, こと
Meanings: say, word
Char: 語 | Strokes: 14
--------------------
Readings: yu3, yu4, eo, 어, Ngữ, Ngứ, ゴ, かた.る, かた.らう
Meanings: word, speech, language
Char: 学 | Strokes: 8
--------------------
Readings: xue2, hag, 학, Học, ガク, まな.ぶ
Meanings: study, learning, science
No name was found.
```
## Using KRAD/RADK mapping
Jamdict has built-in support for KRAD/RADK (i.e. kanji-radical and radical-kanji mapping).
The terminology of radicals/components used by Jamdict can be different from else where.
- A radical in Jamdict is a principal component, each character has only one radical.
- A character may be decomposed into several writing components.
By default jamdict provides two maps:
- jam.krad is a Python dict that maps characters to list of components.
- jam.radk is a Python dict that maps each available components to a list of characters.
```python
# Find all writing components (often called "radicals") of the character 雲
print(jam.krad['雲'])
# ['一', '雨', '二', '厶']
# Find all characters with the component 鼎
chars = jam.radk['鼎']
print(chars)
# {'鼏', '鼒', '鼐', '鼎', '鼑'}
# look up the characters info
result = jam.lookup(''.join(chars))
for c in result.chars:
print(c, c.meanings())
# 鼏 ['cover of tripod cauldron']
# 鼒 ['large tripod cauldron with small']
# 鼐 ['incense tripod']
# 鼎 ['three legged kettle']
# 鼑 []
```
## Finding name entities
```bash
# Find all names with 鈴木 inside
result = jam.lookup('%鈴木%')
for name in result.names:
print(name)
# [id#5025685] キューティーすずき (キューティー鈴木) : Kyu-ti- Suzuki (1969.10-) (full name of a particular person)
# [id#5064867] パパイヤすずき (パパイヤ鈴木) : Papaiya Suzuki (full name of a particular person)
# [id#5089076] ラジカルすずき (ラジカル鈴木) : Rajikaru Suzuki (full name of a particular person)
# [id#5259356] きつねざきすずきひなた (狐崎鈴木日向) : Kitsunezakisuzukihinata (place name)
# [id#5379158] こすずき (小鈴木) : Kosuzuki (family or surname)
# [id#5398812] かみすずき (上鈴木) : Kamisuzuki (family or surname)
# [id#5465787] かわすずき (川鈴木) : Kawasuzuki (family or surname)
# [id#5499409] おおすずき (大鈴木) : Oosuzuki (family or surname)
# [id#5711308] すすき (鈴木) : Susuki (family or surname)
# ...
```
## Exact matching
Use exact matching for faster search.
Find the word 花火 by idseq (1194580)
```python
>>> result = jam.lookup('id#1194580')
>>> print(result.names[0])
[id#1194580] はなび (花火) : fireworks ((noun (common) (futsuumeishi)))
```
Find an exact name 花火 by idseq (5170462)
```python
>>> result = jam.lookup('id#5170462')
>>> print(result.names[0])
[id#5170462] はなび (花火) : Hanabi (female given name or forename)
```
See `jamdict_demo.py` and `jamdict/tools.py` for more information.
# Useful links
* JMdict: [http://edrdg.org/jmdict/edict_doc.html](http://edrdg.org/jmdict/edict_doc.html)
* kanjidic2: [https://www.edrdg.org/wiki/index.php/KANJIDIC_Project](https://www.edrdg.org/wiki/index.php/KANJIDIC_Project)
* JMnedict: [https://www.edrdg.org/enamdict/enamdict_doc.html](https://www.edrdg.org/enamdict/enamdict_doc.html)
* KRADFILE: [http://www.edrdg.org/krad/kradinf.html](http://www.edrdg.org/krad/kradinf.html)
# Contributors
- [Le Tuan Anh](https://github.com/letuananh) (Maintainer)
- [alt-romes](https://github.com/alt-romes)
- [Matteo Fumagalli](https://github.com/matteofumagalli1275)
- [Reem Alghamdi](https://github.com/reem-codes)
- [Techno-coder](https://github.com/Techno-coder)
================================================
FILE: TODO.md
================================================
================================================
FILE: _config.yml
================================================
theme: jekyll-theme-minimal
================================================
FILE: data/README.md
================================================
Copy dictionary files (JMdict_e.xml, kanjidic2.xml, kradfile, etc.) here
================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
serve:
cd _build/dirhtml && python -m http.server 7001
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
================================================
FILE: docs/api.rst
================================================
.. _api_index:
jamdict APIs
============
An overview of jamdict modules.
.. warning::
👉 ⚠️ THIS SECTION IS STILL UNDER CONSTRUCTION ⚠️ Help is much needed.
.. module:: jamdict
.. autoclass:: jamdict.util.Jamdict
:members:
:member-order: groupwise
:exclude-members: get_ne, has_jmne, import_data, jmnedict
.. autoclass:: jamdict.util.LookupResult
:members:
:member-order: groupwise
.. autoclass:: jamdict.util.IterLookupResult
:members:
:member-order: groupwise
.. module:: jamdict.jmdict
.. autoclass:: JMDEntry
:members:
.. module:: jamdict.kanjidic2
.. autoclass:: Character
:members:
.. automodule:: jamdict.krad
================================================
FILE: docs/conf.py
================================================
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../'))
# -- Project information -----------------------------------------------------
project = 'jamdict'
copyright = '2021, Le Tuan Anh'
author = 'Le Tuan Anh'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.viewcode', 'sphinx.ext.doctest']
# -- Highlight code block -----------------
pygments_style = 'sphinx'
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'bizstyle'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
================================================
FILE: docs/contributing.rst
================================================
.. _contributing:
Contributing
============
There are many ways to contribute to the Jamdict project.
The one that Jamdict development team are focusing on at the moment are:
- Fixing :ref:`existing bugs <contrib_bugfix>`
- Improving query functions
- Improving :ref:`documentation <contrib_docs>`
- Keeping jamdict database up to date
If you have some suggestions or bug reports, please share on `jamdict issues tracker <https://github.com/neocl/jamdict/issues>`_.
.. _contrib_bugfix:
Fixing bugs
-----------
If you found a bug please report at https://github.com/neocl/jamdict/issues
When it is possible, please also share how to reproduce the bugs and a snapshot of jamdict info to help with the bug finding process.
.. code:: bash
python3 -m jamdict info
Pull requests are welcome.
.. _contrib_docs:
Updating Documentation
----------------------
1. Fork `jamdict <https://github.com/neocl/jamdict>`_ repository to your own Github account.
#. Clone `jamdict` repository to your local machine.
.. code:: bash
git clone https://github.com/<your-account-name>/jamdict
#. Create a virtual environment (optional, but highly recommended)
.. code:: bash
# if you use virtualenvwrapper
mkvirtualenv jamdev
workon jamdev
# if you use Python venv
python3 -m venv .env
. .env/bin/activate
python3 -m pip install --upgrade pip wheel Sphinx
#. Build the docs
.. code:: bash
cd jamdict/docs
# compile the docs
make dirhtml
# serve the docs using Python3 built-in development server
# Note: this requires Python >= 3.7 to support --directory
python3 -m http.server 7000 --directory _build/dirhtml
# if you use earlier Python 3, you may use
cd _build/dirhtml
python3 -m http.server 7000
#. Now the docs should be ready to view at http://localhost:7000 . You can visit that URL on your browser to view the docs.
#. More information:
- Sphinx tutorial: https://sphinx-tutorial.readthedocs.io/start/
- Using `virtualenv`: https://virtualenvwrapper.readthedocs.io/en/latest/install.html
- Using `venv`: https://docs.python.org/3/library/venv.html
.. _contrib_dev:
Development
-----------
Development contributions are welcome.
Setting up development environment for Jamdict should be similar to :ref:`contrib_docs`.
Please contact the development team if you need more information: https://github.com/neocl/jamdict/issues
================================================
FILE: docs/index.rst
================================================
Jamdict's documentation!
========================
`Jamdict <https://github.com/neocl/jamdict>`_ is a Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings.
Welcome
-------
Are you new to this documentation? Here are some useful pages:
- Want to try out Jamdict package? Try `Jamdict online demo <https://replit.com/@tuananhle/jamdict-demo>`_
- Want some useful code samples? See :ref:`recipes`.
- Want to look deeper into the package? See :ref:`api_index`.
- If you want to help developing Jamdict, please visit :ref:`contributing` page.
Main features
-------------
- Support querying different Japanese language resources
- Japanese-English dictionary JMDict
- Kanji dictionary KanjiDic2
- Kanji-radical and radical-kanji maps KRADFILE/RADKFILE
- Japanese Proper Names Dictionary (JMnedict)
- Fast look up (dictionaries are stored in SQLite databases)
- Command-line lookup tool :ref:`(Example) <commandline>`
..
Hide this for now
- jamdol (jamdol-flask) - a Python/Flask server that provides Jamdict lookup via REST API (experimental state)
:ref:`Contributors <contributors>` are welcome! 🙇.
If you want to help developing Jamdict, please visit :ref:`contributing` page.
Installation
------------
Jamdict and `jamdict-data <https://pypi.org/project/jamdict/>`_ are both `available on PyPI <https://pypi.org/project/jamdict/>`_ and
can be installed using pip.
For more information please see :ref:`installpage` page.
.. code:: bash
pip install jamdict jamdict-data
Also, there is an online demo Jamdict virtual machine to try out on Repl.it
https://replit.com/@tuananhle/jamdict-demo
Sample jamdict Python code
--------------------------
Looking up words
>>> from jamdict import Jamdict
>>> jam = Jamdict()
>>> result = jam.lookup('はな')
>>> for word in result.entries:
... print(word)
...
[id#1194500] はな (花) : 1. flower/blossom/bloom/petal ((noun (common) (futsuumeishi))) 2. cherry blossom 3. beauty 4. blooming (esp. of cherry blossoms) 5. ikebana 6. Japanese playing cards 7. (the) best
[id#1486720] はな (鼻) : nose ((noun (common) (futsuumeishi)))
[id#1581610] はし (端) : 1. end (e.g. of street)/tip/point/edge/margin ((noun (common) (futsuumeishi))) 2. beginning/start/first 3. odds and ends/scrap/odd bit/least
[id#1634180] はな (洟) : snivel/nasal mucus/snot ((noun (common) (futsuumeishi)))
Looking up kanji characters
>>> for c in result.chars:
... print(repr(c))
...
花:7:flower
華:10:splendor,flower,petal,shine,luster,ostentatious,showy,gay,gorgeous
鼻:14:nose,snout
端:14:edge,origin,end,point,border,verge,cape
洟:9:tear,nasal discharge
Looking up named entities
>>> result = jam.lookup('ディズニー%')
>>> for name in result.names:
... print(name)
...
[id#5053163] ディズニー : Disney (family or surname/company name)
[id#5741091] ディズニーランド : Disneyland (place name)
See :ref:`recipes` for more code samples.
.. _commandline:
Command line tools
------------------
Jamdict can be used from the command line.
.. code:: bash
python3 -m jamdict lookup 言語学
========================================
Found entries
========================================
Entry: 1264430 | Kj: 言語学 | Kn: げんごがく
--------------------
1. linguistics ((noun (common) (futsuumeishi)))
========================================
Found characters
========================================
Char: 言 | Strokes: 7
--------------------
Readings: yan2, eon, 언, Ngôn, Ngân, ゲン, ゴン, い.う, こと
Meanings: say, word
Char: 語 | Strokes: 14
--------------------
Readings: yu3, yu4, eo, 어, Ngữ, Ngứ, ゴ, かた.る, かた.らう
Meanings: word, speech, language
Char: 学 | Strokes: 8
--------------------
Readings: xue2, hag, 학, Học, ガク, まな.ぶ
Meanings: study, learning, science
No name was found.
To show help you may use
.. code:: bash
python3 -m jamdict --help
Documentation
-------------
.. toctree::
:maxdepth: 2
install
tutorials
recipes
api
contributing
updates
Other info
==========
Release Notes
-------------
Release notes is available :ref:`here <updates>`.
.. _contributors:
Contributors
------------
- `Le Tuan Anh <https://github.com/letuananh>`__ (Maintainer)
- `alt-romes <https://github.com/alt-romes>`__
- `Matteo Fumagalli <https://github.com/matteofumagalli1275>`__
- `Reem Alghamdi <https://github.com/reem-codes>`__
- `Techno-coder <https://github.com/Techno-coder>`__
Useful links
------------
- jamdict on PyPI: https://pypi.org/project/jamdict/
- jamdict source code: https://github.com/neocl/jamdict/
- Documentation: https://jamdict.readthedocs.io/
- Dictionaries
- JMdict: http://edrdg.org/jmdict/edict_doc.html
- kanjidic2: https://www.edrdg.org/wiki/index.php/KANJIDIC_Project
- JMnedict: https://www.edrdg.org/enamdict/enamdict_doc.html
- KRADFILE: http://www.edrdg.org/krad/kradinf.html
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
================================================
FILE: docs/install.rst
================================================
.. _installpage:
Installation
=============
jamdict and jamdict dictionary data are both available on PyPI and can be installed using `pip`.
.. code-block:: bash
pip install --user jamdict jamdict-data
# pip script sometimes doesn't work properly
# so you may want to try this instead
python3 -m pip install jamdict jamdict-data
.. note::
When you use :code:`pip install` in a virtual environment, especially the ones created via :code:`python3 -m venv`,
wheel support can be missing. :code:`jamdict-data` relies on wheel/pip to extract xz-compressed database and this may cause a problem.
If you encounter any error, please make sure that wheel is available
.. code-block:: bash
# list all available packages in pip
pip list
# ensure wheel support in pip
pip install -U wheel
You may need to uninstall :code:`jamdict-data` before reinstalling it.
.. code-block:: bash
pip uninstall jamdict-data
Download database file manually
-------------------------------
This should not be useful anymore from version 0.1a8 with the release of the `jamdict_data <https://pypi.org/project/jamdict_data/>`_ package on PyPI.
If for some reason you want to download and install jamdict database by yourself, here are the steps:
1. Download the offical, pre-compiled jamdict database
(``jamdict-0.1a7.tar.xz``) from Google Drive
https://drive.google.com/drive/u/1/folders/1z4zF9ImZlNeTZZplflvvnpZfJp3WVLPk
2. Extract and copy ``jamdict.db`` to jamdict data folder (defaulted to
``~/.jamdict/data/jamdict.db``)
3. To know where to copy data files you can use `python3 -m jamdict info` command via a terminal:
.. code:: bash
python3 -m jamdict info
# Jamdict 0.1a8
# Python library for manipulating Jim Breen's JMdict, KanjiDic2, KRADFILE and JMnedict
#
# Basic configuration
# ------------------------------------------------------------
# JAMDICT_HOME : ~/local/jamdict
# jamdict_data availability: False
# Config file location : /home/tuananh/.jamdict/config.json
#
# Custom Data files
# ------------------------------------------------------------
# Jamdict DB location: ~/local/jamdict/data/jamdict.db - [OK]
# JMDict XML file : ~/local/jamdict/data/JMdict_e.gz - [OK]
# KanjiDic2 XML file : ~/local/jamdict/data/kanjidic2.xml.gz - [OK]
# JMnedict XML file : ~/local/jamdict/data/JMnedict.xml.gz - [OK]
#
# Others
# ------------------------------------------------------------
# lxml availability: False
Build database file from source
-------------------------------
Normal users who just want to look up the dictionaries do not have to do this.
If you are a developer and want to build jamdict database from source,
copy the dictionary source files to jamdict data folder.
The original XML files can be downloaded either from the official website
https://www.edrdg.org/ or from `this jamdict Google Drive folder <https://drive.google.com/drive/folders/1ZMM6Xb46XcwwQGWBZnY3gj637exWPWuU>`_.
To find out where to copy the files or whether they are recognised by jamdict,
you may use the command `python3 -m jamdict info` as in the section above.
You should make sure that all files under the section `Custom data files` are all marked [OK].
After that you should be able to build the database with the command:
.. code:: bash
python3 -m jamdict import
Note on XML parser: jamdict will use `lxml` instead of Python 3 default `xml` when it is available.
================================================
FILE: docs/make.bat
================================================
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd
================================================
FILE: docs/recipes.rst
================================================
.. _recipes:
Common Recipes
==============
.. contents::
:local:
:depth: 2
.. warning::
👉 ⚠️ THIS SECTION IS STILL UNDER CONSTRUCTION ⚠️
All code here assumed that you have created a Jamdict object named :samp:`jam`, like this
>>> from jamdict import Jamdict
>>> jam = Jamdict()
High-performance tuning
-----------------------
When you need to do a lot of queries on the database, it is possible to load the whole database
into memory to boost up querying performance (This will takes about 400 MB of RAM) by using the :class:`memory_mode <jamdict.util.Jamdict>`
keyword argument, like this:
>>> from jamdict import Jamdict
>>> jam = Jamdict(memory_mode=True)
The first query will be extremely slow (it may take about a minute for the whole database to be loaded into memory)
but subsequent queries will be much faster.
Iteration search
----------------
Sometimes people want to look through a set of search results only once and determine which items to keep
and then discard the rest. In these cases :func:`lookup_iter <jamdict.util.Jamdict.lookup_iter>` should be used.
This function returns an :class:`IterLookupResult <jamdict.util.IterLookupResult>` object immediately after called.
Users may loop through ``result.entries``, ``result.chars``, and ``result.names`` exact one loop for each
set to find the items that they want. Users will have to store the desired word entries, characters, and names
by themselves since they are discarded after yield.
>>> res = jam.lookup_iter("花見")
>>> for word in res.entries:
... print(word) # do somethign with the word
>>> for c in res.chars:
... print(c)
>>> for name in res.names:
... print(name)
Part-of-speeches and named-entity types
---------------------------------------
Use :func:`Jamdict.all_pos <jamdict.util.Jamdict.all_pos>` to list all available part-of-speeches
and :func:`Jamdict.all_ne_type <jamdict.util.Jamdict.all_pos>` named-entity types:
>>> for pos in jam.all_pos():
... print(pos) # pos is a string
>>> for ne_type in jam.all_ne_type():
... print(ne_type) # ne_type is a string
To filter words by part-of-speech use the keyword argument ``pos``
in :func:`loookup() <jamdict.util.Jamdict.lookup>` or :func:`lookup_iter() <jamdict.util.Jamdict.lookup_iter>`
functions.
For example to look for all "かえる" that are nouns use:
>>> result = jam.lookup("かえる", pos=["noun (common) (futsuumeishi)"])
To search for all named-entities that are "surname" use:
>>> result = jam.lookup("surname")
Kanjis and radical/components (KRAD/RADK mappings)
--------------------------------------------------
Jamdict has built-in support for KRAD/RADK (i.e. kanji-radical and
radical-kanji mapping). The terminology of radicals/components used by
Jamdict can be different from else where.
- A radical in Jamdict is a principal component, each character has
only one radical.
- A character may be decomposed into several writing components.
By default jamdict provides two maps:
- jam.krad is a Python dict that maps characters to list of components.
- jam.radk is a Python dict that maps each available components to a
list of characters.
.. code:: python
# Find all writing components (often called "radicals") of the character 雲
print(jam.krad['雲'])
# ['一', '雨', '二', '厶']
# Find all characters with the component 鼎
chars = jam.radk['鼎']
print(chars)
# {'鼏', '鼒', '鼐', '鼎', '鼑'}
# look up the characters info
result = jam.lookup(''.join(chars))
for c in result.chars:
print(c, c.meanings())
# 鼏 ['cover of tripod cauldron']
# 鼒 ['large tripod cauldron with small']
# 鼐 ['incense tripod']
# 鼎 ['three legged kettle']
# 鼑 []
Finding name entities
---------------------
.. code:: bash
# Find all names that contain the string 鈴木
result = jam.lookup('%鈴木%')
for name in result.names:
print(name)
# [id#5025685] キューティーすずき (キューティー鈴木) : Kyu-ti- Suzuki (1969.10-) (full name of a particular person)
# [id#5064867] パパイヤすずき (パパイヤ鈴木) : Papaiya Suzuki (full name of a particular person)
# [id#5089076] ラジカルすずき (ラジカル鈴木) : Rajikaru Suzuki (full name of a particular person)
# [id#5259356] きつねざきすずきひなた (狐崎鈴木日向) : Kitsunezakisuzukihinata (place name)
# [id#5379158] こすずき (小鈴木) : Kosuzuki (family or surname)
# [id#5398812] かみすずき (上鈴木) : Kamisuzuki (family or surname)
# [id#5465787] かわすずき (川鈴木) : Kawasuzuki (family or surname)
# [id#5499409] おおすずき (大鈴木) : Oosuzuki (family or surname)
# [id#5711308] すすき (鈴木) : Susuki (family or surname)
# ...
Exact matching
--------------
Use exact matching for faster search
.. code:: python
# Find an entry (word, name entity) by idseq
result = jam.lookup('id#5711308')
print(result.names[0])
# [id#5711308] すすき (鈴木) : Susuki (family or surname)
result = jam.lookup('id#1467640')
print(result.entries[0])
# ねこ (猫) : 1. cat 2. shamisen 3. geisha 4. wheelbarrow 5. clay bed-warmer 6. bottom/submissive partner of a homosexual relationship
# use exact matching to increase searching speed (thanks to @reem-codes)
result = jam.lookup('猫')
for entry in result.entries:
print(entry)
# [id#1467640] ねこ (猫) : 1. cat ((noun (common) (futsuumeishi))) 2. shamisen 3. geisha 4. wheelbarrow 5. clay bed-warmer 6. bottom/submissive partner of a homosexual relationship
# [id#2698030] ねこま (猫) : cat ((noun (common) (futsuumeishi)))
Low-level data queries
----------------------
It’s possible to access to the dictionary data by querying database directly using lower level APIs.
However these are prone to future changes so please keep that in mind.
When you create a Jamdict object, you have direct access to the
underlying databases, via these properties
.. code:: python
from jamdict import Jamdict
jam = Jamdict()
>>> jam.jmdict # jamdict.JMDictSQLite object for accessing word dictionary
>>> jam.kd2 # jamdict.KanjiDic2SQLite object, for accessing kanji dictionary
>>> jam.jmnedict # jamdict.JMNEDictSQLite object, for accessing named-entities dictionary
You can perform database queries on each of these databases by obtaining
a database cursor with ``ctx()`` function (i.e. database query context).
For example the following code list down all existing part-of-speeches
in the database.
.. code:: python
# returns a list of sqlite3.Row object
pos_rows = jam.jmdict.ctx().select("SELECT DISTINCT text FROM pos")
# access columns in each query row by name
all_pos = [x['text'] for x in pos_rows]
# sort all POS
all_pos.sort()
for pos in all_pos:
print(pos)
For more information, please see `Jamdict database schema </_static/jamdict_db_schema.png>`_.
Say we want to get all irregular suru verbs, we can start with finding
all Sense IDs with pos = ``suru verb - irregular``, and then find all the
Entry idseq connected to those Senses.
Words (and also named entities) can be retrieved directly using their ``idseq``.
Each word may have many Senses (meaning) and each Sense may have different pos.
::
# Entry (idseq) --(has many)--> Sense --(has many)--> pos
.. note::
Tips: Since we hit the database so many times (to find the IDs, to retrieve
each word, etc.), we also should consider to reuse the database
connection using database context to have better performance
(``with jam.jmdict.ctx() as ctx:`` and ``ctx=ctx`` in the code below).
Here is the sample code:
.. code:: python
# find all idseq of lexical entry (i.e. words) that have at least 1 sense with pos = suru verb - irregular
with jam.jmdict.ctx() as ctx:
# query all word's idseqs
rows = ctx.select(
query="SELECT DISTINCT idseq FROM Sense WHERE ID IN (SELECT sid FROM pos WHERE text = ?) LIMIT 10000",
params=("suru verb - irregular",))
for row in rows:
# reuse database connection with ctx=ctx for better performance
word = jam.jmdict.get_entry(idseq=row['idseq'], ctx=ctx)
print(word)
================================================
FILE: docs/requirements.txt
================================================
jamdict
Sphinx
================================================
FILE: docs/tutorials.rst
================================================
Tutorials
=========
Getting started
---------------
Just install ``jamdict`` and ``jamdict_data`` packages via pip and you are ready to go.
.. code:: python
from jamdict import Jamdict
jam = Jamdict()
The most useful function is :func:`jamdict.util.Jamdict.lookup`.
For example:
.. code:: python
# use wildcard matching to find any word, or Kanji character, or name
# that starts with 食べ and ends with る
result = jam.lookup('食べ%る')
To access the result object you may use:
.. code:: python
# print all word entries
for entry in result.entries:
print(entry)
# [id#1358280] たべる (食べる) : 1. to eat ((Ichidan verb|transitive verb)) 2. to live on (e.g. a salary)/to live off/to subsist on
# [id#1358300] たべすぎる (食べ過ぎる) : to overeat ((Ichidan verb|transitive verb))
# [id#1852290] たべつける (食べ付ける) : to be used to eating ((Ichidan verb|transitive verb))
# [id#2145280] たべはじめる (食べ始める) : to start eating ((Ichidan verb))
# [id#2449430] たべかける (食べ掛ける) : to start eating ((Ichidan verb))
# [id#2671010] たべなれる (食べ慣れる) : to be used to eating/to become used to eating/to be accustomed to eating/to acquire a taste for ((Ichidan verb))
# [id#2765050] たべられる (食べられる) : 1. to be able to eat ((Ichidan verb|intransitive verb)) 2. to be edible/to be good to eat ((pre-noun adjectival (rentaishi)))
# [id#2795790] たべくらべる (食べ比べる) : to taste and compare several dishes (or foods) of the same type ((Ichidan verb|transitive verb))
# [id#2807470] たべあわせる (食べ合わせる) : to eat together (various foods) ((Ichidan verb))
# print all related characters
for c in result.chars:
print(repr(c))
# 食:9:eat,food
# 喰:12:eat,drink,receive (a blow),(kokuji)
# 過:12:overdo,exceed,go beyond,error
# 付:5:adhere,attach,refer to,append
# 始:8:commence,begin
# 掛:11:hang,suspend,depend,arrive at,tax,pour
# 慣:14:accustomed,get used to,become experienced
# 比:4:compare,race,ratio,Philippines
# 合:6:fit,suit,join,0.1
================================================
FILE: docs/updates.rst
================================================
.. _updates:
Jamdict Changelog
=================
jamdict 0.1a11
--------------
- 2021-05-25
- Added ``lookup_iter()`` for iteration search
- Added ``pos`` filter for filtering words by part-of-speeches
- Added ``all_pos()`` and ``all_ne_type()`` to Jamdict to list part-of-speeches and named-entity types
- Better version checking in ``__version__.py``
- Improved documentation
- 2021-05-29
- (.post1) Sorted kanji readings to have on & kun readings listed first
- (.post1) Add ``on_readings``, ``kun_readings``, and ``other_readings`` filter to ``kanjidic2.RMGroup``
jamdict 0.1a10
--------------
- 2021-05-19
- Added ``memory_mode`` keyword to load database into memory before querying to boost up performance
- Improved import performance by using puchikarui's ``buckmode``
- Tested with both puchikarui 0.1.* and 0.2.*
jamdict 0.1a9
-------------
- 2021-04-19
- Fix data audit query
- Enhanced ``Jamdict()`` constructor. ``Jamdict('/path/to/jamdict.db')``
works properly.
- Code quality review
- Automated documentation build via
`readthedocs.org <https://jamdict.readthedocs.io/en/latest/>`__
jamdict 0.1a8
-------------
- 2021-04-15
- Make ``lxml`` optional
- Data package can be installed via PyPI with ``jamdict_data`` package
- Make configuration file optional as data files can be installed via PyPI.
jamdict 0.1a7
-------------
- 2020-05-31
- Added Japanese Proper Names Dictionary (JMnedict) support
- Included built-in KRADFILE/RADKFile support
- Improved command line tools (json, compact mode, etc.)
Older versions
--------------
- 2017-08-18
- Support KanjiDic2 (XML/SQLite formats)
- 2016-11-09
- Release first version to Github
================================================
FILE: jamdict/__init__.py
================================================
# -*- coding: utf-8 -*-
'''
Python library for manipulating Jim Breen's JMdict and KanjiDic2
Latest version can be found at https://github.com/neocl/jamdict
This package uses the [EDICT][1] and [KANJIDIC][2] dictionary files.
These files are the property of the [Electronic Dictionary Research and Development Group][3], and are used in conformance with the Group's [licence][4].
[1]: http://www.csse.monash.edu.au/~jwb/edict.html
[2]: http://www.csse.monash.edu.au/~jwb/kanjidic.html
[3]: http://www.edrdg.org/
[4]: http://www.edrdg.org/edrdg/licence.html
References:
JMDict website:
http://www.csse.monash.edu.au/~jwb/edict.html
@author: Le Tuan Anh <tuananh.ke@gmail.com>
@license: MIT
'''
# Copyright (c) 2016, Le Tuan Anh <tuananh.ke@gmail.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
########################################################################
from . import __version__ as version_info
from .__version__ import __author__, __email__, __copyright__, __maintainer__
from .__version__ import __credits__, __license__, __description__, __url__
from .__version__ import __version__, __version_long__, __status__
from .jmdict_sqlite import JMDictSQLite
from .kanjidic2_sqlite import KanjiDic2SQLite
from .util import Jamdict, JMDictXML, KanjiDic2XML
from .krad import KRad
__all__ = ['Jamdict', 'JMDictSQLite', 'JMDictXML', 'KanjiDic2SQLite', 'KanjiDic2XML', 'KRad',
"__version__", "__author__", "__description__", "__copyright__", "version_info"]
================================================
FILE: jamdict/__main__.py
================================================
from . import tools
tools.main()
================================================
FILE: jamdict/__version__.py
================================================
# -*- coding: utf-8 -*-
# jamdict's package version information
__author__ = "Le Tuan Anh"
__email__ = "tuananh.ke@gmail.com"
__copyright__ = "Copyright (c) 2016, Le Tuan Anh"
__credits__ = []
__license__ = "MIT License"
__description__ = "Python library for using Japanese dictionaries and resources (Jim Breen's JMdict, KanjiDic2, KRADFILE, JMnedict)"
__url__ = "https://github.com/neocl/jamdict"
__maintainer__ = "Le Tuan Anh"
# ------------------------------------------------------------------------------
# Version configuration (enforcing PEP 440)
# ------------------------------------------------------------------------------
__status__ = "3 - Alpha"
__version_tuple__ = (0, 1, 0, 11, 2)
__version_status__ = '' # a specific value ('rc', 'dev', etc.) or leave blank to be auto-filled
# ------------------------------------------------------------------------------
__status_map__ = {'3 - Alpha': 'a', '4 - Beta': 'b', '5 - Production/Stable': '', '6 - Mature': ''}
if not __version_status__:
__version_status__ = __status_map__[__status__]
if len(__version_tuple__) == 3:
__version_build__ = ''
elif len(__version_tuple__) == 4:
__version_build__ = f"{__version_tuple__[3]}"
elif len(__version_tuple__) == 5:
__version_build__ = f"{__version_tuple__[3]}.post{__version_tuple__[4]}"
else:
raise ValueError("Invalid version information")
if __version_tuple__[2] == 0:
__version_main__ = f"{'.'.join(str(n) for n in __version_tuple__[:2])}"
else:
__version_main__ = f"{'.'.join(str(n) for n in __version_tuple__[:3])}"
__version__ = f"{__version_main__}{__version_status__}{__version_build__}"
__version_long__ = f"{__version_main__} - {__status__.split('-')[1].strip()} {__version_build__}"
================================================
FILE: jamdict/config.py
================================================
# -*- coding: utf-8 -*-
"""
Jamdict configuration management
"""
# This code is a part of jamdict library: https://github.com/neocl/jamdict
# :copyright: (c) 2016 Le Tuan Anh <tuananh.ke@gmail.com>
# :license: MIT, see LICENSE for more details.
import os
from pathlib import Path
import logging
from chirptext import AppConfig
from chirptext.chio import read_file, write_file
# ----------------------------------------------------------------------
# Configuration
# ----------------------------------------------------------------------
MY_DIR = os.path.dirname(__file__)
CONFIG_TEMPLATE = os.path.join(MY_DIR, 'data', 'config_template.json')
__jamdict_home = os.environ.get('JAMDICT_HOME', MY_DIR)
__app_config = AppConfig('jamdict', mode=AppConfig.JSON, working_dir=__jamdict_home)
def _get_config_manager():
''' Internal function for retrieving application config manager object
Don't use this directly, use read_config() method instead
'''
return __app_config
def _ensure_config(config_path='~/.jamdict/config.json', mkdir=True):
_path = Path(os.path.expanduser(config_path))
# auto create config dir
if mkdir:
_path.parent.mkdir(exist_ok=True)
if not _path.exists():
default_config = read_file(CONFIG_TEMPLATE)
logging.getLogger(__name__).warning(f"Jamdict configuration file could not be found. A new configuration file will be generated at {_path}")
logging.getLogger(__name__).debug(f"Default config: {default_config}")
write_file(_path, default_config)
def read_config(config_file=None, force_refresh=False, ensure_config=False):
''' Read jamdict configuration (jamdict home folder, database name, etc.) from config file.
When no configuration is available, jamdict will default JAMDICT_HOME to ``~/.jamdict``
This function should be called right after import statements (i.e. before jam = Jamdict())
The "standard" locations for configuration file include but not limited to:
~/.jamdict/config.json
~/.config/jamdict/config.json
./data/jamdict.json
./jamdict.json
./data/.jamdict.json
./.jamdict.json
:param config_file: Path to configuration file. When config_file is None, jamdict will try to guess the location of the file.
:param force_refresh: Force to re-read configuration from file
:param ensure_config: Create configuration file automatically if it does not exist
'''
if ensure_config and not config_file and not __app_config.locate_config():
# [2021-04-15] data can be installed via PyPI
# configuration file can be optional now
# load config from default template
_ensure_config()
if force_refresh or not __app_config.config:
if config_file and os.path.isfile(config_file):
__app_config.load(config_file)
else:
__app_config.load(CONFIG_TEMPLATE)
# read config
config = __app_config.config
return config
def home_dir():
''' Find JAMDICT_HOME folder.
if there is an environment variable that points to an existing directory
(e.g. export JAMDICT_HOME=/home/user/jamdict)
that folder will be used instead of the configured in jamdict JSON config file
'''
_config = read_config()
# [2020-06-01] Allow JAMDICT_HOME to be overridden by environment variables
if 'JAMDICT_HOME' in os.environ:
_env_jamdict_home = os.path.abspath(os.path.expanduser(os.environ['JAMDICT_HOME']))
if os.path.isdir(_env_jamdict_home):
logging.getLogger(__name__).debug("JAMDICT_HOME: {}".format(_env_jamdict_home))
return _env_jamdict_home
return _config.get('JAMDICT_HOME', __jamdict_home)
def data_dir():
_config = read_config()
_data_dir = _config.get('JAMDICT_DATA', '{JAMDICT_HOME}/data').format(JAMDICT_HOME=home_dir())
return _data_dir
def get_file(file_key):
''' Get configured path by key '''
_config = read_config()
_data_dir = data_dir()
_home = home_dir()
_value = _config.get(file_key)
return _value.format(JAMDICT_DATA=_data_dir, JAMDICT_HOME=_home) if _value else ''
================================================
FILE: jamdict/data/config_template.json
================================================
{
"JAMDICT_HOME": "~/.jamdict",
"JAMDICT_DATA": "{JAMDICT_HOME}/data",
"JAMDICT_DB": "{JAMDICT_DATA}/jamdict.db",
"JMDICT_XML": "{JAMDICT_DATA}/JMdict_e.gz",
"JMNEDICT_XML": "{JAMDICT_DATA}/JMnedict.xml.gz",
"KD2_XML": "{JAMDICT_DATA}/kanjidic2.xml.gz",
"KRADFILE": "{JAMDICT_DATA}/kradfile-u.gz"
}
================================================
FILE: jamdict/data/setup_jmdict.sql
================================================
/* Add meta info */
CREATE TABLE IF NOT EXISTS meta (
key TEXT PRIMARY KEY NOT NULL,
value TEXT NOT NULL
);
-------------------------------------------------------------------------------------
-- JMDict
-------------------------------------------------------------------------------------
CREATE TABLE Entry (
idseq INTEGER NOT NULL UNIQUE
);
-- Entry's links (EntryInfo)
CREATE TABLE Link (
ID INTEGER PRIMARY KEY
,idseq INTEGER
,tag TEXT
,desc TEXT
,uri TEXT
,FOREIGN KEY (idseq) REFERENCES Entry(idseq)
);
-- Entry's bibinfo (EntryInfo)
CREATE TABLE Bib (
ID INTEGER PRIMARY KEY
,idseq INTEGER
,tag TEXT
,text TEXT
,FOREIGN KEY (idseq) REFERENCES Entry(idseq)
);
-- Entry's etym (EntryInfo)
CREATE TABLE Etym (
idseq INTEGER
,text TEXT
,FOREIGN KEY (idseq) REFERENCES Entry(idseq)
);
-- Entry's audit (EntryInfo)
CREATE TABLE Audit (
idseq INTEGER
,upd_date TEXT
,upd_detl TEXT
,FOREIGN KEY (idseq) REFERENCES Entry(idseq)
);
-------------------------------------------------------------------------------------
-- Kanji reading(s) of an entry
-------------------------------------------------------------------------------------
CREATE TABLE Kanji (
ID INTEGER PRIMARY KEY
,idseq INTEGER
,text TEXT
,FOREIGN KEY (idseq) REFERENCES Entry(idseq)
);
-- Kanji's info
CREATE TABLE KJI (
kid INTEGER
,text TEXT
,FOREIGN KEY (kid) REFERENCES Kanji(id)
);
-- Kanji priority
CREATE TABLE KJP (
kid INTEGER
,text TEXT
,FOREIGN KEY (kid) REFERENCES Kanji(id)
);
-------------------------------------------------------------------------------------
-- Kana reading(s) of an entry
-------------------------------------------------------------------------------------
CREATE TABLE Kana (
ID INTEGER PRIMARY KEY
,idseq INTEGER
,text TEXT
,nokanji BOOLEAN
,FOREIGN KEY (idseq) REFERENCES Entry(idseq)
);
-- re_restr
CREATE TABLE KNR (
kid INTEGER
,text TEXT
,FOREIGN KEY (kid) REFERENCES Kana(id)
);
-- Kana's info
CREATE TABLE KNI (
kid INTEGER
,text TEXT
,FOREIGN KEY (kid) REFERENCES Kana(id)
);
-- Kana priority
CREATE TABLE KNP (
kid INTEGER
,text TEXT
,FOREIGN KEY (kid) REFERENCES Kana(id)
);
-------------------------------------------------------------------------------------
-- Senses of an entry
-------------------------------------------------------------------------------------
CREATE TABLE Sense (
ID INTEGER PRIMARY KEY
,idseq INTEGER
,FOREIGN KEY (idseq) REFERENCES Entry(idseq)
);
CREATE TABLE stagk (
sid INTEGER
,text TEXT
,FOREIGN KEY (sid) REFERENCES Sense(id)
);
CREATE TABLE stagr (
sid INTEGER
,text TEXT
,FOREIGN KEY (sid) REFERENCES Sense(id)
);
CREATE TABLE pos (
sid INTEGER
,text TEXT
,FOREIGN KEY (sid) REFERENCES Sense(id)
);
CREATE TABLE xref (
sid INTEGER
,text TEXT
,FOREIGN KEY (sid) REFERENCES Sense(id)
);
CREATE TABLE antonym (
sid INTEGER
,text TEXT
,FOREIGN KEY (sid) REFERENCES Sense(id)
);
CREATE TABLE field (
sid INTEGER
,text TEXT
,FOREIGN KEY (sid) REFERENCES Sense(id)
);
CREATE TABLE misc (
sid INTEGER
,text TEXT
,FOREIGN KEY (sid) REFERENCES Sense(id)
);
CREATE TABLE SenseInfo (
sid INTEGER
,text TEXT
,FOREIGN KEY (sid) REFERENCES Sense(id)
);
CREATE TABLE SenseSource (
sid INTEGER
,text TEXT
,lang TEXT
,lstype TEXT
,wasei TEXT
,FOREIGN KEY (sid) REFERENCES Sense(id)
);
CREATE TABLE dialect (
sid INTEGER
,text TEXT
,FOREIGN KEY (sid) REFERENCES Sense(id)
);
CREATE TABLE SenseGloss (
sid INTEGER
,lang TEXT
,gend TEXT
,text TEXT
,FOREIGN KEY (sid) REFERENCES Sense(id)
);
-------------------------------------------------------------------------------------
-- INDICES - JMDict
-------------------------------------------------------------------------------------
CREATE INDEX Link_idseq ON Link(idseq);
CREATE INDEX Link_tag ON Link(tag);
CREATE INDEX Bib_idseq ON Link(idseq);
CREATE INDEX Etym_idseq ON Etym(idseq);
CREATE INDEX Audit_idseq ON Audit(idseq);
CREATE INDEX Kanji_idseq ON Kanji(idseq);
CREATE INDEX Kanji_text ON Kanji(text);
CREATE INDEX KJI_kid ON KJI(kid);
CREATE INDEX KJP_kid ON KJP(kid);
CREATE INDEX Kana_idseq ON Kana(idseq);
CREATE INDEX Kana_text ON Kana(text);
CREATE INDEX KNR_kid ON KNR(kid);
CREATE INDEX KNR_text ON KNR(text);
CREATE INDEX KNI_kid ON KNI(kid);
CREATE INDEX KNI_text ON KNI(text);
CREATE INDEX KNP_kid ON KNP(kid);
CREATE INDEX KNP_text ON KNP(text);
CREATE INDEX Sense_idseq ON Sense(idseq);
CREATE INDEX stagk_sid ON stagk(sid);
CREATE INDEX stagk_text ON stagk(text);
CREATE INDEX stagr_sid ON stagr(sid);
CREATE INDEX stagr_text ON stagr(text);
CREATE INDEX pos_sid ON pos(sid);
CREATE INDEX pos_text ON pos(text);
CREATE INDEX xref_sid ON xref(sid);
CREATE INDEX xref_text ON xref(text);
CREATE INDEX antonym_sid ON antonym(sid);
CREATE INDEX antonym_text ON antonym(text);
CREATE INDEX field_sid ON field(sid);
CREATE INDEX field_text ON field(text);
CREATE INDEX misc_sid ON misc(sid);
CREATE INDEX misc_text ON misc(text);
CREATE INDEX SenseInfo_sid ON SenseInfo(sid);
CREATE INDEX SenseInfo_text ON SenseInfo(text);
CREATE INDEX SenseSource_sid ON SenseSource(sid);
CREATE INDEX SenseSource_text ON SenseSource(text);
CREATE INDEX dialect_sid ON dialect(sid);
CREATE INDEX dialect_text ON dialect(text);
CREATE INDEX SenseGloss_sid ON SenseGloss(sid);
CREATE INDEX SenseGloss_lang ON SenseGloss(lang);
CREATE INDEX SenseGloss_gend ON SenseGloss(gend);
CREATE INDEX SenseGloss_text ON SenseGloss(text);
================================================
FILE: jamdict/data/setup_jmnedict.sql
================================================
/* Add meta info */
CREATE TABLE IF NOT EXISTS meta (
key TEXT PRIMARY KEY NOT NULL,
value TEXT NOT NULL
);
-------------------------------------------------------------------------------------
-- JMDict
-------------------------------------------------------------------------------------
CREATE TABLE NEEntry (
idseq INTEGER NOT NULL UNIQUE
);
-------------------------------------------------------------------------------------
-- Kanji reading(s) of an entry
-------------------------------------------------------------------------------------
CREATE TABLE NEKanji (
ID INTEGER PRIMARY KEY
,idseq INTEGER
,text TEXT
,FOREIGN KEY (idseq) REFERENCES Entry(idseq)
);
-------------------------------------------------------------------------------------
-- Kana reading(s) of an entry
-------------------------------------------------------------------------------------
CREATE TABLE NEKana (
ID INTEGER PRIMARY KEY
,idseq INTEGER
,text TEXT
,nokanji BOOLEAN
,FOREIGN KEY (idseq) REFERENCES Entry(idseq)
);
-------------------------------------------------------------------------------------
-- Senses of an entry
-------------------------------------------------------------------------------------
CREATE TABLE NETranslation (
ID INTEGER PRIMARY KEY
,idseq INTEGER
,FOREIGN KEY (idseq) REFERENCES Entry(idseq)
);
CREATE TABLE NETransType (
tid INTEGER
,text TEXT
,FOREIGN KEY (tid) REFERENCES NETranslation(id)
);
CREATE TABLE NETransXRef (
tid INTEGER
,text TEXT
,FOREIGN KEY (tid) REFERENCES NETranslation(id)
);
CREATE TABLE NETransGloss (
tid INTEGER
,lang TEXT
,gend TEXT
,text TEXT
,FOREIGN KEY (tid) REFERENCES NETranslation(id)
);
-------------------------------------------------------------------------------------
-- INDICES - JMneDict
-------------------------------------------------------------------------------------
CREATE INDEX NEKanji_idseq ON NEKanji(idseq);
CREATE INDEX NEKanji_text ON NEKanji(text);
CREATE INDEX NEKana_idseq ON NEKana(idseq);
CREATE INDEX NEKana_text ON NEKana(text);
CREATE INDEX NETranslation_idseq ON NETranslation(idseq);
CREATE INDEX NETransType_tid ON NETransType(tid);
CREATE INDEX NETransType_text ON NETransType(text);
CREATE INDEX NETransXRef_tid ON NETransXRef(tid);
CREATE INDEX NETransXRef_text ON NETransXRef(text);
CREATE INDEX NETransGloss_tid ON NETransGloss(tid);
CREATE INDEX NETransGloss_lang ON NETransGloss(lang);
CREATE INDEX NETransGloss_text ON NETransGloss(text);
================================================
FILE: jamdict/data/setup_kanjidic2.sql
================================================
/* Add meta info */
CREATE TABLE IF NOT EXISTS meta (
key TEXT UNIQUE,
value TEXT NOT NULL
);
-------------------------------------------------------------------------------------
-- KanjiDic2 tables
-------------------------------------------------------------------------------------
CREATE TABLE character (
ID INTEGER PRIMARY KEY AUTOINCREMENT,
literal TEXT NOT NULL,
stroke_count INTEGER,
grade TEXT,
freq TEXT,
jlpt TEXT
);
CREATE TABLE codepoint (
cid INTEGER
,cp_type TEXT
,value TEXT
,FOREIGN KEY (cid) REFERENCES character(ID)
);
CREATE TABLE radical (
cid INTEGER
,rad_type TEXT
,value TEXT
,FOREIGN KEY (cid) REFERENCES character(ID)
);
CREATE TABLE stroke_miscount (
cid INTEGER
,value INTEGER
,FOREIGN KEY (cid) REFERENCES character(ID)
);
CREATE TABLE variant (
cid INTEGER
,var_type TEXT
,value TEXT
,FOREIGN KEY (cid) REFERENCES character(ID)
);
CREATE TABLE rad_name (
cid INTEGER
,value TEXT
,FOREIGN KEY (cid) REFERENCES character(ID)
);
CREATE TABLE dic_ref (
cid INTEGER
,dr_type TEXT
,value TEXT
n ,m_vol TEXT
,m_page TEXT
,FOREIGN KEY (cid) REFERENCES character(ID)
);
CREATE TABLE query_code (
cid INTEGER
,qc_type TEXT
,value TEXT
,skip_misclass TEXT
,FOREIGN KEY (cid) REFERENCES character(ID)
);
CREATE TABLE nanori (
cid INTEGER
,value TEXT
,FOREIGN KEY (cid) REFERENCES character(ID)
);
CREATE TABLE rm_group (
ID INTEGER PRIMARY KEY AUTOINCREMENT
,cid INTEGER
,FOREIGN KEY (cid) REFERENCES character(ID)
);
CREATE TABLE reading (
gid INTEGER
,r_type TEXT
,value TEXT
,on_type TEXT
,r_status TEXT
,FOREIGN KEY (gid) REFERENCES rm_group(id)
);
CREATE TABLE meaning (
gid INTEGER
,value TEXT
,m_lang TEXT
,FOREIGN KEY (gid) REFERENCES rm_group(id)
);
-------------------------------------------------------------------------------------
-- INDICES - KanjiDic2
-------------------------------------------------------------------------------------
CREATE INDEX character_literal ON character(literal);
CREATE INDEX character_stroke_count ON character(stroke_count);
CREATE INDEX character_grade ON character(grade);
CREATE INDEX character_jlpt ON character(jlpt);
CREATE INDEX codepoint_value ON codepoint(value);
CREATE INDEX radical_value ON radical(value);
CREATE INDEX variant_value ON variant(value);
CREATE INDEX rad_name_value ON rad_name(value);
CREATE INDEX dic_ref_value ON dic_ref(value);
CREATE INDEX query_code_value ON query_code(value);
CREATE INDEX nanori_value ON nanori(value);
CREATE INDEX rm_group_cid ON rm_group(cid);
CREATE INDEX reading_r_type ON reading(r_type);
CREATE INDEX reading_value ON reading(value);
CREATE INDEX meaning_value ON meaning(value);
CREATE INDEX meaning_m_lang ON meaning(m_lang);
================================================
FILE: jamdict/jmdict.py
================================================
# -*- coding: utf-8 -*-
"""
JMdict Models
"""
# This code is a part of jamdict library: https://github.com/neocl/jamdict
# :copyright: (c) 2016 Le Tuan Anh <tuananh.ke@gmail.com>
# :license: MIT, see LICENSE for more details.
import os
import logging
import warnings
from typing import List
try:
from lxml import etree
_LXML_AVAILABLE = True
except Exception as e:
# logging.getLogger(__name__).debug("lxml is not available, fall back to xml.etree.ElementTree")
from xml.etree import ElementTree as etree
_LXML_AVAILABLE = False
from chirptext import chio
logger = logging.getLogger(__name__)
########################################################################
class JMDEntry(object):
''' Represents a dictionary Word entry.
Entries consist of kanji elements, reading elements,
general information and sense elements. Each entry must have at
least one reading element and one sense element. Others are optional.
XML DTD <!ELEMENT entry (ent_seq, k_ele*, r_ele+, info?, sense+)>'''
def __init__(self, idseq=''):
# A unique numeric sequence number for each entry
self.idseq = idseq # ent_seq
self.kanji_forms: List[KanjiForm] = [] # k_ele*
self.kana_forms: List[KanaForm] = [] # r_ele+ => KanaForm[]
self.info: EntryInfo = None # info? => EntryInfo
self.senses: List[Sense] = [] # sense+
def __len__(self):
return len(self.senses)
def __getitem__(self, idx):
return self.senses[idx]
def set_info(self, info):
if self.info:
logging.warning("WARNING: multiple info tag")
self.info = info
def text(self, compact=True, separator=' ', no_id=False):
tmp = []
if not compact and not no_id:
tmp.append('[id#%s]' % self.idseq)
if self.kana_forms:
tmp.append(self.kana_forms[0].text)
if self.kanji_forms:
tmp.append("({})".format(self.kanji_forms[0].text))
if self.senses:
tmp.append(':')
if len(self.senses) == 1:
tmp.append(self.senses[0].text(compact=compact))
else:
for sense, idx in zip(self.senses, range(len(self.senses))):
tmp.append('{i}. {s}'.format(i=idx + 1, s=sense.text(compact=compact)))
return separator.join(tmp)
def __repr__(self):
return self.text(compact=True)
def __str__(self):
return self.text(compact=False)
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
ed = {'idseq': self.idseq,
'kanji': [x.to_dict() for x in self.kanji_forms],
'kana': [x.to_dict() for x in self.kana_forms],
'senses': [x.to_dict() for x in self.senses]}
if self.info:
ed['info'] = self.info.to_dict()
return ed
class KanjiForm(object):
''' The kanji element, or in its absence, the reading element, is
the defining component of each entry.
The overwhelming majority of entries will have a single kanji
element associated with a word in Japanese. Where there are
multiple kanji elements within an entry, they will be orthographical
variants of the same word, either using variations in okurigana, or
alternative and equivalent kanji. Common "mis-spellings" may be
included, provided they are associated with appropriate information
fields. Synonyms are not included; they may be indicated in the
cross-reference field associated with the sense element.
DTD <!ELEMENT k_ele (keb, ke_inf*, ke_pri*)>
text --- a kanji written form of an entry, string
info --- coded information field, a list of strings
pri --- relative priority of the entry, a list of strings
'''
def __init__(self, text=''):
'''This element will contain a word or short phrase in Japanese
which is written using at least one non-kana character (usually kanji,
but can be other characters). The valid characters are
kanji, kana, related characters such as chouon and kurikaeshi, and
in exceptional cases, letters from other alphabets.
'''
self.text = text # <!ELEMENT keb (#PCDATA)>
'''This is a coded information field related specifically to the
orthography of the keb, and will typically indicate some unusual
aspect, such as okurigana irregularity.'''
self.info = [] # <!ELEMENT ke_inf (#PCDATA)>*
'''This and the equivalent re_pri field are provided to record
information about the relative priority of the entry, and consist
of codes indicating the word appears in various references which
can be taken as an indication of the frequency with which the word
is used. This field is intended for use either by applications which
want to concentrate on entries of a particular priority, or to
generate subset files.
The current values in this field are:
- news1/2: appears in the "wordfreq" file compiled by Alexandre Girardi
from the Mainichi Shimbun. (See the Monash ftp archive for a copy.)
Words in the first 12,000 in that file are marked "news1" and words
in the second 12,000 are marked "news2".
- ichi1/2: appears in the "Ichimango goi bunruishuu", Senmon Kyouiku
Publishing, Tokyo, 1998. (The entries marked "ichi2" were
demoted from ichi1 because they were observed to have low
frequencies in the WWW and newspapers.)
- spec1 and spec2: a small number of words use this marker when they
are detected as being common, but are not included in other lists.
- gai1/2: common loanwords, based on the wordfreq file.
- nfxx: this is an indicator of frequency-of-use ranking in the
wordfreq file. "xx" is the number of the set of 500 words in which
the entry can be found, with "01" assigned to the first 500, "02"
to the second, and so on. (The entries with news1, ichi1, spec1 and
gai1 values are marked with a "(P)" in the EDICT and EDICT2
files.)
The reason both the kanji and reading elements are tagged is because
on occasions a priority is only associated with a particular
kanji/reading pair.'''
self.pri = [] # <!ELEMENT ke_pri (#PCDATA)>*
def set_text(self, text):
if self.text:
logging.warning("WARNING: duplicated text for k_ele")
self.text = text
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
kjd = {'text': self.text}
if self.info:
kjd['info'] = self.info
if self.pri:
kjd['pri'] = self.pri
return kjd
def __repr__(self):
return str(self)
def __str__(self):
return self.text
class KanaForm(object):
'''<!ELEMENT r_ele (reb, re_nokanji?, re_restr*, re_inf*, re_pri*)>
The reading element typically contains the valid readings
of the word(s) in the kanji element using modern kanadzukai.
Where there are multiple reading elements, they will typically be
alternative readings of the kanji element. In the absence of a
kanji element, i.e. in the case of a word or phrase written
entirely in kana, these elements will define the entry.
text --- a kana written form of an entry, string
nokanji --- True means this entry cannot be regarded as a true reading of the kanji, boolean
restr --- use to restrict the reading to a subset of the available kanji forms, list of string
info --- coded information field, a list of strings
pri --- relative priority of the entry, a list of strings
'''
def __init__(self, text='', nokanji=False):
'''this element content is restricted to kana and related
characters such as chouon and kurikaeshi. Kana usage will be
consistent between the keb and reb elements; e.g. if the keb
contains katakana, so too will the reb.'''
self.text = text # <!ELEMENT reb (#PCDATA)>
'''This element, which will usually have a null value, indicates
that the reb, while associated with the keb, cannot be regarded
as a true reading of the kanji. It is typically used for words
such as foreign place names, gairaigo which can be in kanji or
katakana, etc.'''
self.nokanji = nokanji # <!ELEMENT re_nokanji (#PCDATA)>?
'''This element is used to indicate when the reading only applies
to a subset of the keb elements in the entry. In its absence, all
readings apply to all kanji elements. The contents of this element
must exactly match those of one of the keb elements.'''
self.restr = [] # <!ELEMENT re_restr (#PCDATA)>*
'''General coded information pertaining to the specific reading.
Typically it will be used to indicate some unusual aspect of
the reading.'''
self.info = [] # <!ELEMENT re_inf (#PCDATA)>*
'''See the comment on ke_pri above.'''
self.pri = [] # <!ELEMENT re_pri (#PCDATA)>*
def set_text(self, text):
if self.text:
logging.warning("WARNING: duplicated text for k_ele")
self.text = text
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
knd = {'text': self.text,
'nokanji': self.nokanji}
if self.restr:
knd['restr'] = self.restr
if self.info:
knd['info'] = self.info
if self.pri:
knd['pri'] = self.pri
return knd
def __repr__(self):
return str(self)
def __str__(self):
return self.text
class EntryInfo(object):
"""General coded information relating to the entry as a whole.
DTD: <!ELEMENT info (links*, bibl*, etym*, audit*)>
"""
def __init__(self):
self.links: List[Link] = [] # link*
self.bibinfo: List[BibInfo] = [] # bibl*
'''This field is used to hold information about the etymology
of the kanji or kana parts of the entry. For gairaigo,
etymological information may also be in the <lsource> element.'''
self.etym = [] # <!ELEMENT etym (#PCDATA)>*
self.audit: List[Audit] = [] # audit*
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'links': [x.to_dict() for x in self.links],
'bibinfo': [x.to_dict() for x in self.bibinfo],
'etym': self.etym,
'audit': [x.to_dict() for x in self.audit]}
class Link(object):
'''This element holds details of linking information to
entries in other electronic repositories. The link_tag will be
coded to indicate the type of link (text, image, sound), the
link_desc will provided a textual label for the link, and the
link_uri contains the actual URI.
<!ELEMENT links (link_tag, link_desc, link_uri)>'''
def __init__(self, tag, desc, uri):
self.tag: str = tag # <!ELEMENT link_tag (#PCDATA)>
self.desc: str = desc # <!ELEMENT link_desc (#PCDATA)>
self.uri: str = uri # <!ELEMENT link_uri (#PCDATA)>
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'tag': self.tag,
'desc': self.desc,
'uri': self.uri}
class BibInfo(object):
'''Bibliographic information about the entry. The bib_tag will a
coded reference to an entry in an external bibliographic database.
The bib_txt field may be used for brief (local) descriptions.
<!ELEMENT bibl (bib_tag?, bib_txt?)>
<!ELEMENT bib_tag (#PCDATA)>
<!ELEMENT bib_txt (#PCDATA)>
'''
def __init__(self, tag='', text=''):
self.tag: str = tag
self.text: str = text
def set_tag(self, tag):
if self.tag:
logging.warning("WARNING: duplicate tag in bibinfo")
self.tag = tag
def set_text(self, text):
if self.text:
logging.warning("WARNING: duplicate text in bibinfo")
self.text = text
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'tag': self.tag, 'text': self.text}
class Audit(object):
'''The audit element will contain the date and other information
about updates to the entry. Can be used to record the source of
the material.
<!ELEMENT audit (upd_date, upd_detl)>'''
def __init__(self, upd_date, upd_detl):
self.upd_date = upd_date # <!ELEMENT upd_date (#PCDATA)>
self.upd_detl = upd_detl # <!ELEMENT upd_detl (#PCDATA)>
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'upd_date': self.upd_date, 'upd_detl': self.upd_detl}
class Sense(object):
'''The sense element will record the translational equivalent
of the Japanese word, plus other related information. Where there
are several distinctly different meanings of the word, multiple
sense elements will be employed.
<!ELEMENT sense (stagk*, stagr*, pos*, xref*, ant*, field*, misc*, s_inf*, lsource*, dial*, gloss*, example*)>
'''
def __init__(self):
'''These elements, if present, indicate that the sense is restricted
to the lexeme represented by the keb and/or reb.'''
self.stagk = [] # <!ELEMENT stagk (#PCDATA)>
self.stagr = [] # <!ELEMENT stagr (#PCDATA)>
'''Part-of-speech information about the entry/sense. Should use
appropriate entity codes. In general where there are multiple senses
in an entry, the part-of-speech of an earlier sense will apply to
later senses unless there is a new part-of-speech indicated.'''
self.pos = [] # <!ELEMENT pos (#PCDATA)>
'''This element is used to indicate a cross-reference to another
entry with a similar or related meaning or sense. The content of
this element is typically a keb or reb element in another entry. In some
cases a keb will be followed by a reb and/or a sense number to provide
a precise target for the cross-reference. Where this happens, a JIS
"centre-dot" (0x2126) is placed between the components of the
cross-reference.
<!ELEMENT xref (#PCDATA)*>'''
self.xref = [] # xref
'''This element is used to indicate another entry which is an
antonym of the current entry/sense. The content of this element
must exactly match that of a keb or reb element in another entry.'''
self.antonym = [] # <!ELEMENT ant (#PCDATA)*>
'''Information about the field of application of the entry/sense.
When absent, general application is implied. Entity coding for
specific fields of application.'''
self.field = [] # <!ELEMENT field (#PCDATA)>
'''This element is used for other relevant information about
the entry/sense. As with part-of-speech, information will usually
apply to several senses.'''
self.misc = [] # <!ELEMENT misc (#PCDATA)>
'''The sense-information elements provided for additional
information to be recorded about a sense. Typical usage would
be to indicate such things as level of currency of a sense, the
regional variations, etc.'''
self.info = [] # <!ELEMENT s_inf (#PCDATA)>
self.lsource: List[LSource] = [] # <!ELEMENT lsource (#PCDATA)>
'''For words specifically associated with regional dialects in
Japanese, the entity code for that dialect, e.g. ksb for Kansaiben.'''
self.dialect = [] # <!ELEMENT dial (#PCDATA)>
self.gloss: List[SenseGloss] = [] # <!ELEMENT gloss (#PCDATA | pri)*>
'''The example elements provide for pairs of short Japanese and
target-language phrases or sentences which exemplify the usage of the
Japanese head-word and the target-language gloss. Words in example
fields would typically not be indexed by a dictionary application.'''
# It seems that this field is not used anymore!
self.examples = [] # <!ELEMENT example (#PCDATA)>
def __repr__(self):
return str(self)
def __str__(self):
return self.text(compact=False)
def text(self, compact=True):
tmp = [str(x) for x in self.gloss]
if not compact and self.pos:
return '{gloss} ({pos})'.format(gloss='/'.join(tmp), pos=('(%s)' % '|'.join(self.pos)))
else:
return '/'.join(tmp)
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
sd = {}
if self.stagk:
sd['stagk'] = self.stagk
if self.stagr:
sd['stagr'] = self.stagr
if self.pos:
sd['pos'] = self.pos
if self.xref:
sd['xref'] = self.xref
if self.antonym:
sd['antonym'] = self.antonym
if self.field:
sd['field'] = self.field
if self.misc:
sd['misc'] = self.misc
if self.info:
sd['SenseInfo'] = self.info
if self.lsource:
sd['SenseSource'] = [x.to_dict() for x in self.lsource]
if self.dialect:
sd['dialect'] = self.dialect
if self.gloss:
sd['SenseGloss'] = [x.to_dict() for x in self.gloss]
return sd
class Translation(Sense):
''' The trans element will record the translational equivalent
of the Japanese name, plus other related information. (JMendict)
<!ELEMENT trans (name_type*, xref*, trans_det*)>'''
def __init__(self):
super().__init__()
self.name_type = [] # mapped to name_type*
self.xref = [] # mapped to xref
self.gloss = [] # mapped to trans_det
def name_type_human(self):
return [JMENDICT_TYPE_MAP[x] if x in JMENDICT_TYPE_MAP else x for x in self.name_type]
def text(self, compact=True):
tmp = [str(x) for x in self.gloss]
types = "/".join(self.name_type) if compact else "/".join(self.name_type_human())
return '{gloss} ({types})'.format(gloss='/'.join(tmp), types=types)
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
sd = super().to_dict()
sd['name_type'] = self.name_type
return sd
class SenseGloss(object):
'''Within each sense will be one or more "glosses", i.e.
target-language words or phrases which are equivalents to the
Japanese word. This element would normally be present, however it
may be omitted in entries which are purely for a cross-reference.
DTD: <!ELEMENT gloss (#PCDATA | pri)*>
<!ATTLIST gloss xml:lang CDATA "eng">
The xml:lang attribute defines the target language of the
gloss. It will be coded using the three-letter language code from
the ISO 639 standard. When absent, the value "eng" (i.e. English)
is the default value.
<!ATTLIST gloss g_gend CDATA #IMPLIED>
The g_gend attribute defines the gender of the gloss (typically
a noun in the target language. When absent, the gender is either
not relevant or has yet to be provided.
<!ELEMENT pri (#PCDATA)>
These elements highlight particular target-language words which
are strongly associated with the Japanese word. The purpose is to
establish a set of target-language words which can effectively be
used as head-words in a reverse target-language/Japanese relationship.'''
def __init__(self, lang, gend, text):
self.lang = lang
self.gend = gend
self.text = text
def __repr__(self):
return str(self)
def __str__(self):
tmp = [self.text]
if self.lang and self.lang != 'eng':
# lang = eng is trivial
tmp.append('(lang:%s)' % self.lang)
if self.gend:
tmp.append('(gend:%s)' % self.gend)
return ' '.join(tmp)
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
gd = {}
if self.lang:
gd['lang'] = self.lang
if self.gend:
gd['gend'] = self.gend
if self.text:
gd['text'] = self.text
return gd
class LSource:
'''This element records the information about the source
language(s) of a loan-word/gairaigo. If the source language is other
than English, the language is indicated by the xml:lang attribute.
The element value (if any) is the source word or phrase.
<!ATTLIST lsource xml:lang CDATA "eng">
The xml:lang attribute defines the language(s) from which
a loanword is drawn. It will be coded using the three-letter language
code from the ISO 639-2 standard. When absent, the value "eng" (i.e.
English) is the default value. The bibliographic (B) codes are used.
<!ATTLIST lsource ls_type CDATA #IMPLIED>
The ls_type attribute indicates whether the lsource element
fully or partially describes the source word or phrase of the
loanword. If absent, it will have the implied value of "full".
Otherwise it will contain "part".
<!ATTLIST lsource ls_wasei CDATA #IMPLIED>
The ls_wasei attribute indicates that the Japanese word
has been constructed from words in the source language, and
not from an actual phrase in that language. Most commonly used to
indicate "waseieigo".'''
def __init__(self, lang, lstype, wasei, text):
self.lang = lang
self.lstype = lstype
self.wasei = wasei
self.text = text
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'lang': self.lang,
'lstype': self.lstype,
'wasei': self.wasei,
'text': self.text}
JMENDICT_TYPES = (("surname", "family or surname"),
("place", "place name"),
("unclass", "unclassified name"),
("company", "company name"),
("product", "product name"),
("work", "work of art, literature, music, etc. name"),
("masc", "male given name or forename"),
("fem", "female given name or forename"),
("person", "full name of a particular person"),
("given", "given name or forename, gender not specified"),
("station", "railway station"),
("organization", "organization name"),
("ok", "old or irregular kana form"))
JMENDICT_TYPE_MAP = dict(JMENDICT_TYPES)
JMENDICT_TYPE_MAP_DECODE = {v: k for k, v in JMENDICT_TYPES}
class Meta(object):
def __init__(self, key='', value=''):
self.key = key
self.value = value
def __repr__(self):
return "{{{}: {}}}".format(self.key, self.value)
def __str__(self):
return repr(self)
class JMDictXMLParser(object):
'''JMDict XML parser
'''
def __init__(self):
pass
def parse_file(self, jmdict_file_path):
''' Parse JMDict_e.xml file and return a list of JMDEntry objects
'''
actual_path = os.path.abspath(os.path.expanduser(jmdict_file_path))
logger.debug('Loading data from file: {}'.format(actual_path))
with chio.open(actual_path, mode='rb') as jmfile:
tree = etree.iterparse(jmfile)
entries = []
for event, element in tree:
if event == 'end' and element.tag == 'entry':
entries.append(self.parse_entry_tag(element))
# and then we can clear the element to save memory
element.clear()
return entries
def parse_entry_tag(self, etag):
'''Parse a lxml XML Node and generate a JMDEntry entry'''
entry = JMDEntry()
# parse ent_seq
for child in etag:
if child.tag == 'ent_seq':
self.parse_ent_seq(child, entry)
elif child.tag == 'k_ele':
self.parse_k_ele(child, entry)
elif child.tag == 'r_ele':
self.parse_r_ele(child, entry)
elif child.tag == 'info':
self.parse_info(child, entry)
elif child.tag == 'sense':
self.parse_sense(child, entry)
elif child.tag == 'trans':
# JMendict support
self.parse_ne_translation(child, entry)
else:
raise Exception("Invalid tag: %s" % child.tag)
return entry
def parse_ent_seq(self, seq_tag, entry):
idseq = seq_tag.text
if entry.idseq:
raise Exception("WARNING: duplicated ent_seq tag")
entry.idseq = idseq
def get_single(self, tag_name, a_tag):
children = a_tag.findall(tag_name)
if len(children) == 0:
return None
elif len(children) > 1:
raise Exception("There are %s %s tags in %s" % (len(children), tag_name, a_tag.tag))
else:
return children[0]
def parse_k_ele(self, k_ele, entry):
kr = KanjiForm()
for child in k_ele:
if child.tag == 'keb':
kr.set_text(child.text)
elif child.tag == 'ke_inf':
kr.info.append(child.text)
elif child.tag == 'ke_pri':
kr.pri.append(child.text)
else:
raise Exception("WARNING: invalid tag %s in k_ele" % child.tag)
# parse kebs
entry.kanji_forms.append(kr)
return kr
def parse_r_ele(self, r_ele, entry):
kr = KanaForm()
for child in r_ele:
if child.tag == 'reb':
kr.set_text(child.text)
elif child.tag == 're_nokanji':
kr.nokanji = True
elif child.tag == 're_restr':
kr.restr.append(child.text)
elif child.tag == 're_inf':
kr.info.append(child.text)
elif child.tag == 're_pri':
kr.pri.append(child.text)
else:
raise Exception("WARNING: invalid tag %s in r_ele" % child.tag)
# parse kebs
entry.kana_forms.append(kr)
return kr
def parse_info(self, info_tag, entry):
einfo = EntryInfo()
for child in info_tag:
if child.tag == 'links':
self.parse_link(child, einfo)
elif child.tag == 'bibl':
self.parse_bibinfo(child, einfo)
elif child.tag == 'etym':
einfo.etym.append(child.text)
elif child.tag == 'audit':
self.parse_audit(child, einfo)
else:
raise Exception("WARNING: invalid tag in info tag (child.tag = %s)" % child.tag)
entry.set_info(einfo)
return einfo
def parse_link(self, link_tag, entry_info):
tag = self.get_single('link_tag', link_tag).text
desc = self.get_single('link_desc', link_tag).text
uri = self.get_single('link_uri', link_tag).text
link = Link(tag, desc, uri)
entry_info.links.append(link)
return link
def parse_bibinfo(self, bib_tag, entry_info):
bib = BibInfo()
for child in bib_tag:
if child.tag == 'bib_tag':
bib.set_tag(child.text)
elif child.tag == 'bib_txt':
bib.set_text(child.text)
else:
raise Exception("WARNING: invalid tag in bibinfo (child.tag = %s)" % child.tag)
entry_info.bibinfo.append(bib)
return bib
def parse_ne_translation(self, trans_tag, entry):
translation = Translation()
for child in trans_tag:
if child.tag == 'name_type':
_name_type = JMENDICT_TYPE_MAP_DECODE[child.text] if child.text in JMENDICT_TYPE_MAP_DECODE else child.text
translation.name_type.append(_name_type)
elif child.tag == 'trans_det':
# add sensegloss
lang = self.get_attrib(trans_tag, 'xml:lang', default_value='eng')
gloss = SenseGloss(lang=lang, gend='', text=child.text)
translation.gloss.append(gloss)
elif child.tag == 'xref':
translation.xref.append(child.text)
else:
raise Exception("Invalid tag: {} in JMendict/trans tag".format(child.tag))
entry.senses.append(translation)
return translation
def parse_sense(self, sense_tag, entry):
sense = Sense()
for child in sense_tag:
if child.tag == 'stagk':
sense.stagk.append(child.text)
elif child.tag == 'stagr':
sense.stagr.append(child.text)
elif child.tag == 'pos':
sense.pos.append(child.text)
elif child.tag == 'xref':
sense.xref.append(child.text)
elif child.tag == 'ant':
sense.antonym.append(child.text)
elif child.tag == 'field':
sense.field.append(child.text)
elif child.tag == 'misc':
sense.misc.append(child.text)
elif child.tag == 's_inf':
sense.info.append(child.text)
elif child.tag == 'dial':
sense.dialect.append(child.text)
elif child.tag == 'example':
sense.examples.append(child.text)
elif child.tag == 'lsource':
self.parse_lsource(child, sense)
elif child.tag == 'gloss':
self.parse_sensegloss(child, sense)
else:
raise Exception("WARNING: invalid tag in sense tag (child.tag = %s) content = %s" % (child.tag, etree.tostring(child)))
entry.senses.append(sense)
return sense
def get_attrib(self, a_tag, attr_name, default_value=''):
if attr_name == 'xml:lang':
attr_name = '''{http://www.w3.org/XML/1998/namespace}lang'''
if attr_name in a_tag.attrib:
return a_tag.attrib[attr_name]
else:
return default_value
def parse_sensegloss(self, gloss_tag, sense):
lang = self.get_attrib(gloss_tag, 'xml:lang')
gend = self.get_attrib(gloss_tag, 'g_gend')
text = gloss_tag.text # TODO: pri tag? raw text?
gloss = SenseGloss(lang, gend, text)
sense.gloss.append(gloss)
return gloss
def parse_lsource(self, lsource_tag, sense):
lang = self.get_attrib(lsource_tag, 'xml:lang')
lstype = self.get_attrib(lsource_tag, 'ls_type')
wasei = self.get_attrib(lsource_tag, 'ls_wasei')
lsource = LSource(lang, lstype, wasei, lsource_tag.text)
sense.lsource.append(lsource)
return lsource
================================================
FILE: jamdict/jmdict_sqlite.py
================================================
# -*- coding: utf-8 -*-
"""
JMDict in SQLite format
"""
# This code is a part of jamdict library: https://github.com/neocl/jamdict
# :copyright: (c) 2016 Le Tuan Anh <tuananh.ke@gmail.com>
# :license: MIT, see LICENSE for more details.
import os
import logging
from puchikarui import Schema
from . import __version__ as JAMDICT_VERSION, __url__ as JAMDICT_URL
from .jmdict import Meta, JMDEntry, EntryInfo, Link, BibInfo, Audit, KanjiForm, KanaForm, Sense, SenseGloss, LSource
# -------------------------------------------------------------------------------
# Configuration
# -------------------------------------------------------------------------------
MY_FOLDER = os.path.dirname(os.path.abspath(__file__))
SCRIPT_FOLDER = os.path.join(MY_FOLDER, 'data')
JMDICT_SETUP_FILE = os.path.join(SCRIPT_FOLDER, 'setup_jmdict.sql')
JMDICT_VERSION = '1.08'
JMDICT_URL = 'http://www.csse.monash.edu.au/~jwb/edict.html'
SETUP_SCRIPT = '''INSERT INTO meta VALUES ('jmdict.version', '{jv}');
INSERT INTO meta VALUES ('jmdict.url', '{ju}');
INSERT INTO meta VALUES ('generator', 'jamdict');
INSERT INTO meta VALUES ('generator_version', '{gv}');
INSERT INTO meta VALUES ('generator_url', '{gu}');'''.format(
jv=JMDICT_VERSION,
ju=JMDICT_URL,
gv=JAMDICT_VERSION,
gu=JAMDICT_URL
)
def getLogger():
return logging.getLogger(__name__)
# -------------------------------------------------------------------------------
# Models
# -------------------------------------------------------------------------------
class JMDictSchema(Schema):
KEY_JMD_VER = "jmdict.version"
KEY_JMD_URL = "jmdict.url"
def __init__(self, db_path, *args, **kwargs):
super().__init__(db_path, *args, **kwargs)
self.add_script(SETUP_SCRIPT)
self.add_file(JMDICT_SETUP_FILE)
# Meta
self.add_table('meta', ['key', 'value'], proto=Meta).set_id('key')
self.add_table('Entry', ['idseq'])
self.add_table('Link', ['ID', 'idseq', 'tag', 'desc', 'uri'])
self.add_table('Bib', ['ID', 'idseq', 'tag', 'text'])
self.add_table('Etym', ['idseq', 'text'])
self.add_table('Audit', ['idseq', 'upd_date', 'upd_detl'])
# Kanji
self.add_table('Kanji', ['ID', 'idseq', 'text'])
self.add_table('KJI', ['kid', 'text'])
self.add_table('KJP', ['kid', 'text'])
# Kana
self.add_table('Kana', ['ID', 'idseq', 'text', 'nokanji'])
self.add_table('KNI', ['kid', 'text'])
self.add_table('KNP', ['kid', 'text'])
self.add_table('KNR', ['kid', 'text'])
# Senses
self.add_table('Sense', ['ID', 'idseq'])
self.add_table('stagk', ['sid', 'text'])
self.add_table('stagr', ['sid', 'text'])
self.add_table('pos', ['sid', 'text'])
self.add_table('xref', ['sid', 'text'])
self.add_table('antonym', ['sid', 'text'])
self.add_table('field', ['sid', 'text'])
self.add_table('misc', ['sid', 'text'])
self.add_table('SenseInfo', ['sid', 'text'])
self.add_table('SenseSource', ['sid', 'text', 'lang', 'lstype', 'wasei'])
self.add_table('dialect', ['sid', 'text'])
self.add_table('SenseGloss', ['sid', 'lang', 'gend', 'text'])
class JMDictSQLite(JMDictSchema):
def __init__(self, db_path, *args, **kwargs):
super().__init__(db_path, *args, **kwargs)
def update_jmd_meta(self, version, url, ctx=None):
# create a default context if none was provided
if ctx is None:
with self.open(ctx) as ctx:
return self.update_jmd_meta(version, url, ctx=ctx)
# else (a context is provided)
# version
jv = ctx.meta.by_id(self.KEY_JMD_VER)
if not jv:
ctx.meta.insert(self.KEY_JMD_VER, version)
else:
jv.value = version
ctx.meta.save(jv)
# url
ju = ctx.meta.by_id(self.KEY_JMD_URL)
if not ju:
ctx.meta.insert(self.KEY_JMD_URL, version)
else:
ju.value = url
ctx.meta.save(ju)
def all_pos(self, ctx=None):
if ctx is None:
return self.all_pos(ctx=self.ctx())
else:
return [x['text'] for x in ctx.execute("SELECT DISTINCT text FROM pos")]
def _build_search_query(self, query, pos=None):
where = []
params = []
if query.startswith('id#'):
query_int = int(query[3:])
if query_int >= 0:
getLogger().debug("Searching by ID: {}".format(query_int))
where.append("idseq = ?")
params.append(query_int)
elif query and query != "%":
_is_wildcard_search = '_' in query or '@' in query or '%' in query
if _is_wildcard_search:
where.append("(idseq IN (SELECT idseq FROM Kanji WHERE text like ?) OR idseq IN (SELECT idseq FROM Kana WHERE text like ?) OR idseq IN (SELECT idseq FROM sense JOIN sensegloss ON sense.ID == sensegloss.sid WHERE text like ?))")
else:
where.append("(idseq IN (SELECT idseq FROM Kanji WHERE text == ?) OR idseq IN (SELECT idseq FROM Kana WHERE text == ?) OR idseq IN (SELECT idseq FROM sense JOIN sensegloss ON sense.ID == sensegloss.sid WHERE text == ?))")
params += (query, query, query)
if pos:
if isinstance(pos, str):
getLogger().warning("POS filter should be a collection, not a string")
pos = [pos]
# allow to search by POS
slots = len(pos)
if where:
where.append("AND")
where.append(f"idseq IN (SELECT idseq FROM Sense WHERE ID IN (SELECT sid FROM pos WHERE text IN ({','.join('?' * slots)})))")
params += pos
# else (a context is provided)
logging.getLogger(__name__).debug(f"Search query: {where} -- Params: {params}")
return where, params
def search(self, query, ctx=None, pos=None, **kwargs):
# ensure context
if ctx is None:
with self.ctx() as ctx:
return self.search(query, ctx=ctx, pos=pos)
where, params = self._build_search_query(query, pos=pos)
where.insert(0, 'SELECT idseq FROM Entry WHERE ')
entries = []
for (idseq,) in ctx.conn.cursor().execute(' '.join(where), params):
entries.append(self.get_entry(idseq, ctx=ctx))
return entries
def search_iter(self, query, ctx=None, pos=None, **kwargs):
# ensure context
if ctx is None:
with self.ctx() as ctx:
return self.search(query, ctx=ctx, pos=pos, iter_mode=iter_mode)
where, params = self._build_search_query(query, pos=pos)
where.insert(0, 'SELECT idseq FROM Entry WHERE ')
for (idseq,) in ctx.conn.cursor().execute(' '.join(where), params):
yield self.get_entry(idseq, ctx=ctx)
def get_entry(self, idseq, ctx=None):
# ensure context
if ctx is None:
with self.ctx() as new_context:
return self.get_entry(idseq, new_context)
# else (a context is provided)
# select entry & info
entry = JMDEntry(idseq)
# links, bibs, etym, audit ...
dblinks = ctx.Link.select('idseq=?', (idseq,))
dbbibs = ctx.Bib.select('idseq=?', (idseq,))
dbetym = ctx.Etym.select('idseq=?', (idseq,))
dbaudit = ctx.Audit.select('idseq=?', (idseq,))
if dblinks or dbbibs or dbetym or dbaudit:
entry.info = EntryInfo()
if dblinks:
for l in dblinks:
entry.info.links.append(Link(l.tag, l.desc, l.uri))
if dbbibs:
for b in dbbibs:
entry.info.bibinfo.append(BibInfo(b.tag, b.text))
if dbetym:
for e in dbetym:
entry.info.etym.append(e)
if dbaudit:
for a in dbaudit:
entry.info.audit.append(Audit(a.upd_date, a.upd_detl))
# select kanji
kanjis = ctx.Kanji.select('idseq=?', (idseq,))
for dbkj in kanjis:
kj = KanjiForm(dbkj.text)
kjis = ctx.KJI.select('kid=?', (dbkj.ID,))
for i in kjis:
kj.info.append(i.text)
kjps = ctx.KJP.select('kid=?', (dbkj.ID,))
for p in kjps:
kj.pri.append(p.text)
entry.kanji_forms.append(kj)
# select kana
kanas = ctx.Kana.select('idseq=?', (idseq,))
for dbkn in kanas:
kn = KanaForm(dbkn.text, dbkn.nokanji)
knis = ctx.KNI.select('kid=?', (dbkn.ID,))
for i in knis:
kn.info.append(i.text)
knps = ctx.KNP.select('kid=?', (dbkn.ID,))
for p in knps:
kn.pri.append(p.text)
knrs = ctx.KNR.select('kid=?', (dbkn.ID,))
for r in knrs:
kn.restr.append(r.text)
entry.kana_forms.append(kn)
# select senses
senses = ctx.Sense.select('idseq=?', (idseq,))
for dbs in senses:
s = Sense()
# stagk
ks = ctx.stagk.select('sid=?', (dbs.ID,))
for k in ks:
s.stagk.append(k.text)
# stagr
rs = ctx.stagr.select('sid=?', (dbs.ID,))
for r in rs:
s.stagr.append(r.text)
# pos
ps = ctx.pos.select('sid=?', (dbs.ID,))
for p in ps:
s.pos.append(p.text)
# xref
xs = ctx.xref.select('sid=?', (dbs.ID,))
for x in xs:
s.xref.append(x.text)
# antonym
ans = ctx.antonym.select('sid=?', (dbs.ID,))
for a in ans:
s.antonym.append(a.text)
# field
fs = ctx.field.select('sid=?', (dbs.ID,))
for f in fs:
s.field.append(f.text)
# misc
ms = ctx.misc.select('sid=?', (dbs.ID,))
for m in ms:
s.misc.append(m.text)
# SenseInfo
sis = ctx.SenseInfo.select('sid=?', (dbs.ID,))
for si in sis:
s.info.append(si.text)
# SenseSource
lss = ctx.SenseSource.select('sid=?', (dbs.ID,))
for ls in lss:
s.lsource.append(LSource(ls.lang, ls.lstype, ls.wasei, ls.text))
# dialect
ds = ctx.dialect.select('sid=?', (dbs.ID,))
for d in ds:
s.dialect.append(d.text)
# SenseGloss
gs = ctx.SenseGloss.select('sid=?', (dbs.ID,))
for g in gs:
s.gloss.append(SenseGloss(g.lang, g.gend, g.text))
entry.senses.append(s)
return entry
def insert_entries(self, entries, ctx=None):
# ensure context
if ctx is None:
with self.ctx() as new_context:
return self.insert_entries(entries, ctx=new_context)
# else
getLogger().debug("JMdict bulk insert {} entries".format(len(entries)))
for entry in entries:
self.insert_entry(entry, ctx)
def insert_entry(self, entry, ctx=None):
# ensure context
if ctx is None:
with self.ctx() as ctx:
return self.insert_entry(entry, ctx=ctx)
# else (a context is provided)
self.Entry.insert(entry.idseq, ctx=ctx)
# insert info
if entry.info:
# links
for lnk in entry.info.links:
ctx.Link.insert(entry.idseq, lnk.tag, lnk.desc, lnk.uri)
# bibs
for bib in entry.info.bibinfo:
ctx.Bib.insert(entry.idseq, bib.tag, bib.text)
# etym
for e in entry.info.etym:
ctx.Etym.insert(entry.idseq, e)
# audit
for a in entry.info.audit:
ctx.Audit.insert(entry.idseq, a.upd_date, a.upd_detl)
# insert kanji
for kj in entry.kanji_forms:
kjid = ctx.Kanji.insert(entry.idseq, kj.text)
# KJI
for kji in kj.info:
ctx.KJI.insert(kjid, kji)
# KJP
for kjp in kj.pri:
ctx.KJP.insert(kjid, kjp)
# insert kana
for kn in entry.kana_forms:
knid = ctx.Kana.insert(entry.idseq, kn.text, kn.nokanji)
# KNI
for kni in kn.info:
ctx.KNI.insert(knid, kni)
# KNP
for knp in kn.pri:
ctx.KNP.insert(knid, knp)
# KNR
for knr in kn.restr:
ctx.KNR.insert(knid, knr)
# insert senses
for s in entry.senses:
sid = ctx.Sense.insert(entry.idseq)
# stagk
for sk in s.stagk:
ctx.stagk.insert(sid, sk)
# stagr
for sr in s.stagr:
ctx.stagr.insert(sid, sr)
# pos
for pos in s.pos:
ctx.pos.insert(sid, pos)
# xref
for xr in s.xref:
ctx.xref.insert(sid, xr)
# antonym
for a in s.antonym:
ctx.antonym.insert(sid, a)
# field
for f in s.field:
ctx.field.insert(sid, f)
# misc
for m in s.misc:
ctx.misc.insert(sid, m)
# SenseInfo
for i in s.info:
ctx.SenseInfo.insert(sid, i)
# SenseSource
for l in s.lsource:
ctx.SenseSource.insert(sid, l.text, l.lang, l.lstype, l.wasei)
# dialect
for d in s.dialect:
ctx.dialect.insert(sid, d)
# SenseGloss
for g in s.gloss:
ctx.SenseGloss.insert(sid, g.lang, g.gend, g.text)
================================================
FILE: jamdict/jmnedict_sqlite.py
================================================
# -*- coding: utf-8 -*-
"""
Japanese Multilingual Named Entity Dictionary (JMnedict) in SQLite format
References:
ENAMDICT/JMnedict - Japanese Proper Names Dictionary Files
https://www.edrdg.org/enamdict/enamdict_doc.html
"""
# This code is a part of jamdict library: https://github.com/neocl/jamdict
# :copyright: (c) 2020 Le Tuan Anh <tuananh.ke@gmail.com>
# :license: MIT, see LICENSE for more details.
import os
import logging
from typing import Sequence
from puchikarui import Schema
from . import __version__ as JAMDICT_VERSION, __url__ as JAMDICT_URL
from .jmdict import Meta, JMDEntry, KanjiForm, KanaForm, Translation, SenseGloss
# -------------------------------------------------------------------------------
# Configuration
# -------------------------------------------------------------------------------
MY_FOLDER = os.path.dirname(os.path.abspath(__file__))
SCRIPT_FOLDER = os.path.join(MY_FOLDER, 'data')
JMNEDICT_SETUP_FILE = os.path.join(SCRIPT_FOLDER, 'setup_jmnedict.sql')
JMNEDICT_VERSION = '1.08'
JMNEDICT_URL = 'https://www.edrdg.org/enamdict/enamdict_doc.html'
JMNEDICT_DATE = '2020-05-29'
JMNEDICT_SETUP_SCRIPT = '''INSERT INTO meta VALUES ('jmnedict.version', '{jv}');
INSERT INTO meta VALUES ('jmnedict.url', '{ju}');
INSERT INTO meta VALUES ('jmnedict.date', '{jud}');
INSERT INTO meta SELECT 'generator', 'jamdict' WHERE NOT EXISTS (SELECT 1 FROM meta WHERE key = 'generator');
INSERT INTO meta SELECT 'generator_version', '{gv}' WHERE NOT EXISTS (SELECT 1 FROM meta WHERE key = 'generator_version');
INSERT INTO meta SELECT 'generator_url', '{gu}' WHERE NOT EXISTS (SELECT 1 FROM meta WHERE key = 'generator_url');'''.format(
jv=JMNEDICT_VERSION,
ju=JMNEDICT_URL,
jud=JMNEDICT_DATE,
gv=JAMDICT_VERSION,
gu=JAMDICT_URL
)
def getLogger():
return logging.getLogger(__name__)
# -------------------------------------------------------------------------------
# Models
# -------------------------------------------------------------------------------
class JMNEDictSchema(Schema):
def __init__(self, db_path, *args, **kwargs):
super().__init__(db_path, *args, **kwargs)
self.add_script(JMNEDICT_SETUP_SCRIPT)
self.add_file(JMNEDICT_SETUP_FILE)
# Meta
self.add_table('meta', ['key', 'value'], proto=Meta).set_id('key')
self.add_table('NEEntry', ['idseq'])
# Kanji
self.add_table('NEKanji', ['ID', 'idseq', 'text'])
# Kana
self.add_table('NEKana', ['ID', 'idseq', 'text', 'nokanji'])
# Translation (~Sense of JMdict)
self.add_table('NETranslation', ['ID', 'idseq'])
self.add_table('NETransType', ['tid', 'text'])
self.add_table('NETransXRef', ['tid', 'text'])
self.add_table('NETransGloss', ['tid', 'lang', 'gend', 'text'])
class JMNEDictSQLite(JMNEDictSchema):
def __init__(self, db_path, *args, **kwargs):
super().__init__(db_path, *args, **kwargs)
def all_ne_type(self, ctx=None):
if ctx is None:
return self.all_ne_type(ctx=self.ctx())
else:
return [x['text'] for x in ctx.execute("SELECT DISTINCT text FROM NETransType")]
def _build_ne_search_query(self, query):
_is_wildcard_search = '_' in query or '@' in query or '%' in query
if _is_wildcard_search:
where = "idseq IN (SELECT idseq FROM NEKanji WHERE text like ?) OR idseq IN (SELECT idseq FROM NEKana WHERE text like ?) OR idseq IN (SELECT idseq FROM NETranslation JOIN NETransGloss ON NETranslation.ID == NETransGloss.tid WHERE NETransGloss.text like ?) OR idseq IN (SELECT idseq FROM NETranslation JOIN NETransType ON NETranslation.ID == NETransType.tid WHERE NETransType.text like ?)"
else:
where = "idseq IN (SELECT idseq FROM NEKanji WHERE text == ?) OR idseq IN (SELECT idseq FROM NEKana WHERE text == ?) OR idseq IN (SELECT idseq FROM NETranslation JOIN NETransGloss ON NETranslation.ID == NETransGloss.tid WHERE NETransGloss.text == ?) or idseq in (SELECT idseq FROM NETranslation JOIN NETransType ON NETranslation.ID == NETransType.tid WHERE NETransType.text == ?)"
params = [query, query, query, query]
try:
if query.startswith('id#'):
query_int = int(query[3:])
if query_int >= 0:
where = "idseq = ?"
params = [query_int]
except Exception:
pass
getLogger().debug(f"where={where} | params={params}")
return where, params
def search_ne(self, query, ctx=None, **kwargs) -> Sequence[JMDEntry]:
if ctx is None:
with self.ctx() as ctx:
return self.search_ne(query, ctx=ctx)
where, params = self._build_ne_search_query(query)
where = 'SELECT idseq FROM NEEntry WHERE ' + where
entries = []
for (idseq,) in ctx.conn.cursor().execute(where, params):
entries.append(self.get_ne(idseq, ctx=ctx))
return entries
def search_ne_iter(self, query, ctx=None, **kwargs):
if ctx is None:
with self.ctx() as ctx:
return self.search_ne(query, ctx=ctx)
where, params = self._build_ne_search_query(query)
where = 'SELECT idseq FROM NEEntry WHERE ' + where
for (idseq,) in ctx.conn.cursor().execute(where, params):
yield self.get_ne(idseq, ctx=ctx)
def get_ne(self, idseq, ctx=None) -> JMDEntry:
# ensure context
if ctx is None:
with self.ctx() as new_context:
return self.get_entry(idseq, new_context)
# else (a context is provided)
# select entry & info
entry = JMDEntry(idseq)
# select kanji
kanjis = ctx.NEKanji.select('idseq=?', (idseq,))
for dbkj in kanjis:
kj = KanjiForm(dbkj.text)
entry.kanji_forms.append(kj)
# select kana
kanas = ctx.NEKana.select('idseq=?', (idseq,))
for dbkn in kanas:
kn = KanaForm(dbkn.text, dbkn.nokanji)
entry.kana_forms.append(kn)
# select senses
senses = ctx.NETranslation.select('idseq=?', (idseq,))
for dbs in senses:
s = Translation()
# name_type
nts = ctx.NETransType.select('tid=?', (dbs.ID,))
for nt in nts:
s.name_type.append(nt.text)
# xref
xs = ctx.NETransXRef.select('tid=?', (dbs.ID,))
for x in xs:
s.xref.append(x.text)
# SenseGloss
gs = ctx.NETransGloss.select('tid=?', (dbs.ID,))
for g in gs:
s.gloss.append(SenseGloss(g.lang, g.gend, g.text))
entry.senses.append(s)
return entry
def insert_name_entities(self, entries, ctx=None):
# ensure context
if ctx is None:
with self.ctx() as new_context:
return self.insert_name_entities(entries, ctx=new_context)
# else
for entry in entries:
self.insert_name_entity(entry, ctx)
def insert_name_entity(self, entry, ctx=None):
# ensure context
if ctx is None:
with self.ctx() as ctx:
return self.insert_name_entity(entry, ctx=ctx)
# else (a context is provided)
self.NEEntry.insert(entry.idseq, ctx=ctx)
# insert kanji
for kj in entry.kanji_forms:
ctx.NEKanji.insert(entry.idseq, kj.text)
# insert kana
for kn in entry.kana_forms:
ctx.NEKana.insert(entry.idseq, kn.text, kn.nokanji)
# insert translations
for s in entry.senses:
tid = ctx.NETranslation.insert(entry.idseq)
# insert name_type
for nt in s.name_type:
ctx.NETransType.insert(tid, nt)
# xref
for xr in s.xref:
ctx.NETransXRef.insert(tid, xr)
# Gloss
for g in s.gloss:
ctx.NETransGloss.insert(tid, g.lang, g.gend, g.text)
================================================
FILE: jamdict/kanjidic2.py
================================================
# -*- coding: utf-8 -*-
"""
Kanjidic2 models
"""
# This code is a part of jamdict library: https://github.com/neocl/jamdict
# :copyright: (c) 2016 Le Tuan Anh <tuananh.ke@gmail.com>
# :license: MIT, see LICENSE for more details.
import os
import logging
import warnings
from typing import List
try:
from lxml import etree
_LXML_AVAILABLE = True
except Exception as e:
# logging.getLogger(__name__).debug("lxml is not available, fall back to xml.etree.ElementTree")
from xml.etree import ElementTree as etree
_LXML_AVAILABLE = False
from chirptext import chio
from chirptext.sino import Radical as KangxiRadical
from .krad import KRad
# ------------------------------------------------------------------------------
# Configuration
# ------------------------------------------------------------------------------
krad = KRad()
def getLogger():
return logging.getLogger(__name__)
# ------------------------------------------------------------------------------
# Models
# ------------------------------------------------------------------------------
class KanjiDic2(object):
def __init__(self, file_version, database_version, date_of_creation):
"""
<!ELEMENT header (file_version,database_version,date_of_creation)>
<!--
The single header element will contain identification information
about the version of the file
-->
<!ELEMENT file_version (#PCDATA)>
<!--
This field denotes the version of kanjidic2 structure, as more
than one version may exist.
-->
<!ELEMENT database_version (#PCDATA)>
<!--
The version of the file, in the format YYYY-NN, where NN will be
a number starting with 01 for the first version released in a
calendar year, then increasing for each version in that year.
-->
<!ELEMENT date_of_creation (#PCDATA)>
<!--
The date the file was created in international format (YYYY-MM-DD).
-->"""
self.file_version = file_version
self.database_version = database_version
self.date_of_creation = date_of_creation
self.characters = []
def __len__(self):
return len(self.characters)
def __getitem__(self, idx):
return self.characters[idx]
class Character(object):
""" Represent a kanji character.
<!ELEMENT character (literal,codepoint, radical, misc, dic_number?, query_code?, reading_meaning?)*>"""
def __init__(self):
"""
"""
self.ID = None
self.literal = '' # <!ELEMENT literal (#PCDATA)> The character itself in UTF8 coding.
self.codepoints: List[CodePoint] = [] # <!ELEMENT codepoint (cp_value+)>
self.radicals: List[Radical] = [] # <!ELEMENT radical (rad_value+)>
self.__canon_radical = None
self.stroke_count = None # first stroke_count in misc
self.grade = None # <misc>/<grade>
self.stroke_miscounts = [] # <misc>/stroke_count[1:]
self.variants: List[Variant] = [] # <misc>/<variant>
self.freq = None # <misc>/<freq>
self.rad_names = [] # <misc>/<rad_name> a list of strings
self.jlpt = None # <misc>/<jlpt>
self.dic_refs: List[DicRef] = [] # DicRef[]
self.query_codes: List[QueryCode] = [] # QueryCode[]
self.rm_groups: List[RMGroup] = [] # reading_meaning groups
self.nanoris = [] # a list of strings
@property
def text(self):
return self.literal
def __repr__(self):
meanings = self.meanings(english_only=True)
return "{l}:{sc}:{meanings}".format(l=self.literal, sc=self.stroke_count, meanings=','.join(meanings))
def __str__(self):
return self.literal
def meanings(self, english_only=False):
''' Accumulate all meanings as a list of string. Each string is a meaning (i.e. sense) '''
meanings = []
for rm in self.rm_groups:
for m in rm.meanings:
if english_only and m.m_lang != '':
continue
meanings.append(m.value)
return meanings
@property
def components(self):
''' Kanji writing components that compose this character '''
if self.literal in krad.krad:
return krad.krad[self.literal]
else:
return []
@property
def radical(self):
if self.__canon_radical is None:
for rad in self.radicals:
if rad.rad_type == 'classical':
self.__canon_radical = KangxiRadical.kangxi()[rad.value]
return self.__canon_radical
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'literal': self.literal,
'codepoints': [cp.to_dict() for cp in self.codepoints],
'radicals': [r.to_dict() for r in self.radicals],
'stroke_count': self.stroke_count,
'grade': self.grade if self.grade else '',
'stroke_miscounts': self.stroke_miscounts,
'variants': [v.to_dict() for v in self.variants],
'freq': self.freq if self.freq else 0,
'rad_names': self.rad_names,
'jlpt': self.jlpt if self.jlpt else '',
'dic_refs': [r.to_dict() for r in self.dic_refs],
'q_codes': [q.to_dict() for q in self.query_codes],
'rm': [rm.to_dict() for rm in self.rm_groups],
'nanoris': list(self.nanoris)}
class CodePoint(object):
def __init__(self, cp_type='', value=''):
"""<!ELEMENT cp_value (#PCDATA)>
<!--
The cp_value contains the codepoint of the character in a particular
standard. The standard will be identified in the cp_type attribute.
-->
"""
self.cid = None
self.cp_type = cp_type
self.value = value
def __repr__(self):
if self.r_type:
return "({t}) {v}".format(t=self.cp_type, v=self.value)
else:
return self.value
def __str__(self):
return self.value
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'type': self.cp_type, 'value': self.value}
class Radical(object):
def __init__(self, rad_type='', value=''):
"""<!ELEMENT radical (rad_value+)>
<!ELEMENT rad_value (#PCDATA)>
<!--
The radical number, in the range 1 to 214. The particular
classification type is stated in the rad_type attribute.
-->"""
self.cid = None
self.rad_type = rad_type
self.value = value
def __repr__(self):
if self.rad_type:
return "({t}) {v}".format(t=self.rad_type, v=self.value)
else:
return self.value
def __str__(self):
return self.value
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'type': self.rad_type, 'value': self.value}
class Variant(object):
def __init__(self, var_type='', value=''):
"""<!ELEMENT variant (#PCDATA)>
<!--
Either a cross-reference code to another kanji, usually regarded as a
variant, or an alternative indexing code for the current kanji.
The type of variant is given in the var_type attribute.
-->
<!ATTLIST variant var_type CDATA #REQUIRED>
<!--
The var_type attribute indicates the type of variant code. The current
values are:
jis208 - in JIS X 0208 - kuten coding
jis212 - in JIS X 0212 - kuten coding
jis213 - in JIS X 0213 - kuten coding
(most of the above relate to "shinjitai/kyuujitai"
alternative character glyphs)
deroo - De Roo number - numeric
njecd - Halpern NJECD index number - numeric
s_h - The Kanji Dictionary (Spahn & Hadamitzky) - descriptor
nelson_c - "Classic" Nelson - numeric
oneill - Japanese Names (O'Neill) - numeric
ucs - Unicode codepoint- hex
--> """
self.cid = None
self.var_type = var_type
self.value = value
def __repr__(self):
if self.var_type:
return "({t}) {v}".format(t=self.var_type, v=self.value)
else:
return self.value
def __str__(self):
return self.value
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'type': self.var_type, 'value': self.value}
class DicRef(object):
def __init__(self, dr_type='', value='', m_vol='', m_page=''):
"""<!ELEMENT dic_ref (#PCDATA)>
<!--
Each dic_ref contains an index number. The particular dictionary,
etc. is defined by the dr_type attribute.
-->
<!ATTLIST dic_ref dr_type CDATA #REQUIRED>
<!--
The dr_type defines the dictionary or reference book, etc. to which
dic_ref element applies. The initial allocation is:
nelson_c - "Modern Reader's Japanese-English Character Dictionary",
edited by Andrew Nelson (now published as the "Classic"
Nelson).
nelson_n - "The New Nelson Japanese-English Character Dictionary",
edited by John Haig.
halpern_njecd - "New Japanese-English Character Dictionary",
edited by Jack Halpern.
halpern_kkld - "Kanji Learners Dictionary" (Kodansha) edited by
Jack Halpern.
heisig - "Remembering The Kanji" by James Heisig.
gakken - "A New Dictionary of Kanji Usage" (Gakken)
oneill_names - "Japanese Names", by P.G. O'Neill.
oneill_kk - "Essential Kanji" by P.G. O'Neill.
moro - "Daikanwajiten" compiled by Morohashi. For some kanji two
additional attributes are used: m_vol: the volume of the
dictionary in which the kanji is found, and m_page: the page
number in the volume.
henshall - "A Guide To Remembering Japanese Characters" by
Kenneth G. Henshall.
sh_kk - "Kanji and Kana" by Spahn and Hadamitzky.
sakade - "A Guide To Reading and Writing Japanese" edited by
Florence Sakade.
jf_cards - Japanese Kanji Flashcards, by Max Hodges and
Tomoko Okazaki. (Series 1)
henshall3 - "A Guide To Reading and Writing Japanese" 3rd
edition, edited by Henshall, Seeley and De Groot.
tutt_cards - Tuttle Kanji Cards, compiled by Alexander Kask.
crowley - "The Kanji Way to Japanese Language Power" by
Dale Crowley.
kanji_in_context - "Kanji in Context" by Nishiguchi and Kono.
busy_people - "Japanese For Busy People" vols I-III, published
by the AJLT. The codes are the volume.chapter.
kodansha_compact - the "Kodansha Compact Kanji Guide".
maniette - codes from Yves Maniette's "Les Kanjis dans la tete" French adaptation of Heisig.
-->"""
self.cid = None
self.dr_type = dr_type
self.value = value
self.m_vol = m_vol
self.m_page = m_page
def __repr__(self):
if self.dr_type:
return "({t}) {v}".format(t=self.dr_type, v=self.value)
else:
return self.value
def __str__(self):
return self.value
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'type': self.dr_type,
'value': self.value,
"m_vol": self.m_vol,
"m_page": self.m_page}
class QueryCode(object):
def __init__(self, qc_type='', value='', skip_misclass=""):
"""<!ELEMENT query_code (q_code+)>
<!--
These codes contain information relating to the glyph, and can be used
for finding a required kanji. The type of code is defined by the
qc_type attribute.
-->
<!ELEMENT q_code (#PCDATA)>
<!--
The q_code contains the actual query-code value, according to the
qc_type attribute.
-->
<!ATTLIST q_code qc_type CDATA #REQUIRED>
<!--
The qc_type attribute defines the type of query code. The current values
are:
skip - Halpern's SKIP (System of Kanji Indexing by Patterns)
code. The format is n-nn-nn. See the KANJIDIC documentation
for a description of the code and restrictions on the
commercial use of this data. [P] There are also
a number of misclassification codes, indicated by the
"skip_misclass" attribute.
sh_desc - the descriptor codes for The Kanji Dictionary (Tuttle
1996) by Spahn and Hadamitzky. They are in the form nxnn.n,
e.g. 3k11.2, where the kanji has 3 strokes in the
identifying radical, it is radical "k" in the SH
classification system, there are 11 other strokes, and it is
the 2nd kanji in the 3k11 sequence. (I am very grateful to
Mark Spahn for providing the list of these descriptor codes
for the kanji in this file.) [I]
four_corner - the "Four Corner" code for the kanji. This is a code
invented by Wang Chen in 1928. See the KANJIDIC documentation
for an overview of the Four Corner System. [Q]
deroo - the codes developed by the late Father Joseph De Roo, and
published in his book "2001 Kanji" (Bonjinsha). Fr De Roo
gave his permission for these codes to be included. [DR]
misclass - a possible misclassification of the kanji according
to one of the code types. (See the "Z" codes in the KANJIDIC
documentation for more details.)
-->
<!ATTLIST q_code skip_misclass CDATA #IMPLIED>
<!--
The values of this attribute indicate the type if
misclassification:
- posn - a mistake in the division of the kanji
- stroke_count - a mistake in the number of strokes
- stroke_and_posn - mistakes in both division and strokes
- stroke_diff - ambiguous stroke counts depending on glyph
S --> """
self.cid = None
self.qc_type = qc_type
self.value = value
self.skip_misclass = skip_misclass
def __repr__(self):
if self.qc_type:
return "({t}) {v}".format(t=self.qc_type, v=self.value)
else:
return self.value
def __str__(self):
return self.value
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'type': self.qc_type, 'value': self.value, "skip_misclass": self.skip_misclass}
class RMGroup(object):
def __init__(self, readings=None, meanings=None):
"""<!ELEMENT reading_meaning (rmgroup*, nanori*)>
<!--
The readings for the kanji in several languages, and the meanings, also
in several languages. The readings and meanings are grouped to enable
the handling of the situation where the meaning is differentiated by
reading. [T1]
-->
<!ELEMENT rmgroup (reading*, meaning*)>
"""
self.ID = None
self.cid = None
self.readings: List[Reading] = readings if readings else []
self.meanings: List[Meaning] = meanings if meanings else []
def __repr__(self):
return "R: {} | M: {}".format(
", ".join([r.value for r in self.readings]),
", ".join(m.value for m in self.meanings))
def __str__(self):
return repr(self)
@property
def on_readings(self):
return [r for r in self.readings if r.r_type == 'ja_on']
@property
def kun_readings(self):
return [r for r in self.readings if r.r_type == 'ja_kun']
@property
def other_readings(self):
return [r for r in self.readings if r.r_type not in('ja_kun', 'ja_on')]
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
sorted_readings = sorted(self.readings,
key=lambda x: x.r_type.startswith('ja_'),
reverse=True)
return {'readings': [r.to_dict() for r in sorted_readings],
'meanings': [m.to_dict() for m in self.meanings]}
class Reading(object):
def __init__(self, r_type='', value='', on_type="", r_status=""):
"""<!ELEMENT reading (#PCDATA)>
<!--
The reading element contains the reading or pronunciation
of the kanji.
-->
<!ATTLIST reading r_type CDATA #REQUIRED>
<!--
The r_type attribute defines the type of reading in the reading
element. The current values are:
pinyin - the modern PinYin romanization of the Chinese reading
of the kanji. The tones are represented by a concluding
digit. [Y]
korean_r - the romanized form of the Korean reading(s) of the
kanji. The readings are in the (Republic of Korea) Ministry
of Education style of romanization. [W]
korean_h - the Korean reading(s) of the kanji in hangul.
ja_on - the "on" Japanese reading of the kanji, in katakana.
Another attribute r_status, if present, will indicate with
a value of "jy" whether the reading is approved for a
"Jouyou kanji".
A further attribute on_type, if present, will indicate with
a value of kan, go, tou or kan'you the type of on-reading.
ja_kun - the "kun" Japanese reading of the kanji, usually in
hiragana.
Where relevant the okurigana is also included separated by a
".". Readings associated with prefixes and suffixes are
marked with a "-". A second attribute r_status, if present,
will indicate with a value of "jy" whether the reading is
approved for a "Jouyou kanji".
-->
<!ATTLIST reading on_type CDATA #IMPLIED>
<!--
See under ja_on above.
-->
<!ATTLIST reading r_status CDATA #IMPLIED>
<!--
See under ja_on and ja_kun above.
-->"""
self.gid = None
self.r_type = r_type
self.value = value
self.on_type = on_type
self.r_status = r_status
def __repr__(self):
if self.r_type:
return "({t}) {v}".format(t=self.r_type, v=self.value)
else:
return self.value
def __str__(self):
return self.value
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'type': self.r_type,
'value': self.value,
'on_type': self.on_type,
'r_status': self.r_status}
class Meaning(object):
def __init__(self, value='', m_lang=''):
"""<!ELEMENT meaning (#PCDATA)>
<!--
The meaning associated with the kanji.
-->
<!ATTLIST meaning m_lang CDATA #IMPLIED>
<!--
The m_lang attribute defines the target language of the meaning. It
will be coded using the two-letter language code from the ISO 639-1
standard. When absent, the value "en" (i.e. English) is implied. [{}]
-->"""
self.gid = None
self.m_lang = m_lang
self.value = value
def __repr__(self):
if self.m_lang:
return "({l}) {v}".format(l=self.m_lang, v=self.value)
else:
return self.value
def __str__(self):
return self.value
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self):
return {'m_lang': self.m_lang, 'value': self.value}
class Kanjidic2XMLParser(object):
""" JMDict XML parser
"""
def __init__(self):
pass
def get_attrib(self, a_tag, attr_name, default_value=''):
if attr_name == 'xml:lang':
attr_name = '''{http://www.w3.org/XML/1998/namespace}lang'''
if attr_name in a_tag.attrib:
return a_tag.attrib[attr_name]
else:
return default_value
def parse_file(self, kd2_file_path):
''' Parse all characters from Kanjidic2 XML file
'''
actual_path = os.path.abspath(os.path.expanduser(kd2_file_path))
getLogger().debug('Loading data from file: {}'.format(actual_path))
with chio.open(actual_path, mode='rb') as kd2file:
tree = etree.iterparse(kd2file)
kd2 = None
for event, element in tree:
if event == 'end':
if element.tag == 'header':
kd2 = self.parse_header(element)
element.clear()
elif element.tag == 'character':
kd2.characters.append(self.parse_char(element))
element.clear()
return kd2
def parse_header(self, e):
fv = None
dbv = None
doc = None
for child in e:
if child.tag == 'file_version':
fv = child.text
elif child.tag == 'database_version':
dbv = child.text
elif child.tag == 'date_of_creation':
doc = child.text
return KanjiDic2(fv, dbv, doc)
def parse_char(self, e):
char = Character()
for child in e:
if child.tag == 'literal':
char.literal = child.text
elif child.tag == 'codepoint':
self.parse_codepoint(child, char)
elif child.tag == 'radical':
self.parse_radical(child, char)
elif child.tag == 'misc':
self.parse_misc(child, char)
elif child.tag == 'dic_number':
self.parse_dic_refs(child, char)
elif child.tag == 'query_code':
self.parse_query_code(child, char)
elif child.tag == 'reading_meaning':
self.parse_reading_meaning(child, char)
else:
getLogger().warning("Unknown tag in child: {}".format(child.tag))
return char
def parse_codepoint(self, e, char):
for child in e:
if child.tag == 'cp_value':
cp = CodePoint(self.get_attrib(child, 'cp_type'), child.text)
char.codepoints.append(cp)
else:
getLogger().warning("Unknown tag: {}".format(child.tag))
def parse_radical(self, e, char):
for child in e:
if child.tag == 'rad_value':
rad = Radical(self.get_attrib(child, "rad_type"), child.text)
char.radicals.append(rad)
else:
getLogger().warning("Unknown tag: {}".format(child.tag))
def parse_misc(self, e, char):
for child in e:
# grade?, stroke_count+, variant*, freq?, rad_name*,jlpt?
if child.tag == 'grade':
char.grade = child.text
elif child.tag == 'stroke_count':
if char.stroke_count is None:
char.stroke_count = int(child.text)
else:
char.stroke_miscounts.append(int(child.text))
elif child.tag == 'variant':
v = Variant(self.get_attrib(child, "var_type"), child.text)
char.variants.append(v)
elif child.tag == 'freq':
char.freq = child.text
elif child.tag == 'rad_name':
char.rad_names.append(child.text)
elif child.tag == 'jlpt':
char.jlpt = child.text
else:
getLogger().warning("Unknown tag: {}".format(child.tag))
def parse_dic_refs(self, e, char):
for child in e:
if child.tag == 'dic_ref':
dr_type = self.get_attrib(child, "dr_type")
m_vol = self.get_attrib(child, "m_vol")
m_page = self.get_attrib(child, "m_page")
dr = DicRef(dr_type, child.text, m_vol, m_page)
char.dic_refs.append(dr)
else:
getLogger().warning("Unknown tag: {}".format(child.tag))
def parse_query_code(self, e, char):
for child in e:
if child.tag == "q_code":
qc_type = self.get_attrib(child, "qc_type")
skip_misclass = self.get_attrib(child, "skip_misclass")
char.query_codes.append(QueryCode(qc_type, child.text, skip_misclass))
else:
getLogger().warning("Unknown tag: {}".format(child.tag))
def parse_reading_meaning(self, e, char):
for child in e:
if child.tag == "nanori":
char.nanoris.append(child.text)
elif child.tag == "rmgroup":
rmgroup = RMGroup()
char.rm_groups.append(rmgroup)
for grandchild in child:
if grandchild.tag == 'reading':
r_type = self.get_attrib(grandchild, "r_type")
on_type = self.get_attrib(grandchild, "on_type")
r_status = self.get_attrib(grandchild, "r_status")
r = Reading(r_type, grandchild.text, on_type, r_status)
rmgroup.readings.append(r)
elif grandchild.tag == 'meaning':
m = Meaning(grandchild.text, self.get_attrib(grandchild, "m_lang"))
rmgroup.meanings.append(m)
else:
getLogger().warning("Unknown tag: {}".format(grandchild.tag))
else:
getLogger().warning("Unknown tag: {}".format(child.tag))
================================================
FILE: jamdict/kanjidic2_sqlite.py
================================================
# -*- coding: utf-8 -*-
"""
KanjiDic2 in SQLite format
References:
KANJIDIC2 project
https://www.edrdg.org/wiki/index.php/KANJIDIC_Project
"""
# This code is a part of jamdict library: https://github.com/neocl/jamdict
# :copyright: (c) 2017 Le Tuan Anh <tuananh.ke@gmail.com>
# :license: MIT, see LICENSE for more details.
import os
import logging
from puchikarui import Schema
from . import __version__ as JAMDICT_VERSION, __url__ as JAMDICT_URL
from .jmdict import Meta
from .kanjidic2 import Character, CodePoint, Radical, Variant, DicRef, QueryCode, RMGroup, Reading, Meaning
# ------------------------------------------------------------------------------
# Configuration
# ------------------------------------------------------------------------------
MY_FOLDER = os.path.dirname(os.path.abspath(__file__))
SCRIPT_FOLDER = os.path.join(MY_FOLDER, 'data')
KANJIDIC2_VERSION = '1.6'
KANJIDIC2_URL = 'https://www.edrdg.org/wiki/index.php/KANJIDIC_Project'
KANJIDIC2_DATE = 'April 2008'
KANJIDIC2_SETUP_FILE = os.path.join(SCRIPT_FOLDER, 'setup_kanjidic2.sql')
KANJIDIC2_SETUP_SCRIPT = '''
INSERT INTO meta VALUES ('kanjidic2.version', '{kdv}');
INSERT INTO meta VALUES ('kanjidic2.url', '{kdu}');
INSERT INTO meta VALUES ('kanjidic2.date', '{kdd}');
INSERT INTO meta SELECT 'generator', 'jamdict'
WHERE NOT EXISTS (SELECT 1 FROM meta WHERE key='generator');
INSERT INTO meta SELECT 'generator_version', '{gv}'
WHERE NOT EXISTS (SELECT 1 FROM meta WHERE key='generator_version');
INSERT INTO meta SELECT 'generator_url', '{gu}'
WHERE NOT EXISTS (SELECT 1 FROM meta WHERE key='generator_url');'''.format(
kdv=KANJIDIC2_VERSION,
kdu=KANJIDIC2_URL,
kdd=KANJIDIC2_DATE,
gv=JAMDICT_VERSION,
gu=JAMDICT_URL
)
def getLogger():
return logging.getLogger(__name__)
# ------------------------------------------------------------------------------
# Models
# ------------------------------------------------------------------------------
class KanjiDic2Schema(Schema):
KEY_FILE_VER = 'kanjidic2.file_version'
KEY_DB_VER = 'kanjidic2.database_version'
KEY_CREATED_DATE = 'kanjidic2.date_of_creation'
def __init__(self, db_path, *args, **kwargs):
super().__init__(db_path, *args, **kwargs)
self.add_file(KANJIDIC2_SETUP_FILE)
self.add_script(KANJIDIC2_SETUP_SCRIPT)
# Meta
self.add_table('meta', ['key', 'value'], proto=Meta).set_id('key')
self.add_table('character', ['ID', 'literal', 'stroke_count', 'grade', 'freq', 'jlpt'], proto=Character, alias="char").set_id('ID')
self.add_table('codepoint', ['cid', 'cp_type', 'value'], proto=CodePoint)
self.add_table('radical', ['cid', 'rad_type', 'value'], proto=Radical)
self.add_table('stroke_miscount', ['cid', 'value'], alias="smc")
self.add_table('variant', ['cid', 'var_type', 'value'], proto=Variant)
self.add_table('rad_name', ['cid', 'value'])
self.add_table('dic_ref', ['cid', 'dr_type', 'value', 'm_vol', 'm_page'], proto=DicRef)
self.add_table('query_code', ['cid', 'qc_type', 'value', 'skip_misclass'], proto=QueryCode)
self.add_table('nanori', ['cid', 'value'])
self.add_table('rm_group', ['ID', 'cid'], proto=RMGroup, alias='rmg').set_id('ID')
self.add_table('reading', ['gid', 'r_type', 'value', 'on_type', 'r_status'], proto=Reading)
self.add_table('meaning', ['gid', 'value', 'm_lang'], proto=Meaning)
class KanjiDic2SQLite(KanjiDic2Schema):
def __init__(self, db_path, *args, **kwargs):
super().__init__(db_path, *args, **kwargs)
def update_kd2_meta(self, file_version, database_version, date_of_creation, ctx=None):
# ensure context
if ctx is None:
with self.ctx() as new_context:
return self.update_kd2_meta(file_version, database_version, date_of_creation, new_context)
# else
# file_version
fv = ctx.meta.by_id(self.KEY_FILE_VER)
if not fv:
ctx.meta.insert(self.KEY_FILE_VER, file_version)
else:
fv.value = file_version
ctx.meta.save(fv)
# database_version
dv = ctx.meta.by_id(self.KEY_DB_VER)
if not dv:
ctx.meta.insert(self.KEY_DB_VER, database_version)
else:
dv.value = database_version
ctx.meta.save(dv)
# date_of_creation
doc = ctx.meta.by_id(self.KEY_CREATED_DATE)
if not doc:
ctx.meta.insert(self.KEY_CREATED_DATE, date_of_creation)
else:
doc.value = date_of_creation
ctx.meta.save(doc)
def insert_chars(self, chars, ctx=None):
# ensure context
if ctx is None:
with self.ctx() as ctx:
return self.insert_chars(chars, ctx=ctx)
# else
for c in chars:
self.insert_char(c, ctx=ctx)
def insert_char(self, c, ctx=None):
# ensure context
if ctx is None:
with self.ctx() as ctx:
return self.insert_char(c, ctx=ctx)
# else
c.ID = ctx.character.save(c)
# save codepoints
for cp in c.codepoints:
cp.cid = c.ID
ctx.codepoint.save(cp)
# radicals
for r in c.radicals:
r.cid = c.ID
ctx.radical.save(r)
# stroke_miscount
for smc in c.stroke_miscounts:
ctx.smc.insert(c.ID, smc)
# variants
for v in c.variants:
v.cid = c.ID
ctx.variant.save(v)
# radnames
for rn in c.rad_names:
ctx.rad_name.insert(c.ID, rn)
# dic_refs
for dr in c.dic_refs:
dr.cid = c.ID
ctx.dic_ref.save(dr)
# query_codes
for qc in c.query_codes:
qc.cid = c.ID
ctx.query_code.save(qc)
# nanoris
for n in c.nanoris:
ctx.nanori.insert(c.ID, n)
# reading groups
for rmg in c.rm_groups:
rmg.cid = c.ID
rmg.ID = ctx.rmg.save(rmg)
# save readings inside
for r in rmg.readings:
r.gid = rmg.ID
ctx.reading.save(r)
# save meanings inside
for m in rmg.meanings:
m.gid = rmg.ID
ctx.meaning.save(m)
def search_chars_iter(self, chars, ctx=None):
if ctx is None:
with self.ctx() as ctx:
return self.search_chars_iter(chars, ctx=ctx)
for c in chars:
res = self.get_char(c, ctx=ctx)
if res is not None:
yield res
def get_char(self, literal, ctx=None):
if ctx is None:
with self.ctx() as ctx:
return self.get_char(literal, ctx=ctx)
# context was ensured
c = ctx.char.select_single('literal=?', (literal,))
if not c:
getLogger().debug("character {} could not be found".format(literal))
return None
else:
return self.char_by_id(c.ID, ctx)
def char_by_id(self, cid, ctx=None):
if ctx is None:
with self.ctx() as ctx:
return self.select_char(cid, ctx=ctx)
# context was ensured
c = ctx.char.by_id(cid)
c.codepoints = ctx.codepoint.select('cid=?', (cid,))
c.radicals = ctx.radical.select('cid=?', (cid,))
for smc in ctx.smc.select('cid=?', (cid,)):
c.stroke_miscounts.append(smc.value)
c.variants = ctx.variant.select('cid=?', (cid,))
for r in ctx.rad_name.select('cid=?', (cid,)):
c.rad_names.append(r.value)
c.dic_refs = ctx.dic_ref.select('cid=?', (cid,))
c.query_codes = ctx.query_code.select('cid=?', (cid,))
for n in ctx.nanori.select('cid=?', (cid,)):
c.nanoris.append(n.value)
c.rm_groups = ctx.rmg.select('cid=?', (cid,))
for rmg in c.rm_groups:
rmg.readings = ctx.reading.select('gid=?', (rmg.ID,))
rmg.meanings = ctx.meaning.select('gid=?', (rmg.ID,))
return c
================================================
FILE: jamdict/krad.py
================================================
# -*- coding: utf-8 -*-
"""
jamdict.krad is a module for retrieving kanji components (i.e. radicals)
"""
# This code is a part of jamdict library: https://github.com/neocl/jamdict
# :copyright: (c) 2016 Le Tuan Anh <tuananh.ke@gmail.com>
# :license: MIT, see LICENSE for more details.
import os
import logging
import threading
from collections import defaultdict as dd
from typing import Mapping
from chirptext import chio
# ------------------------------------------------------------------------------
# Configuration
# ------------------------------------------------------------------------------
MY_FOLDER = os.path.dirname(os.path.abspath(__file__))
DATA_FOLDER = os.path.join(MY_FOLDER, 'data')
KRADFILE = os.path.join(DATA_FOLDER, 'kradfile-u.gz')
RADKFILE = os.path.join(DATA_FOLDER, 'radkfile.gz')
logger = logging.getLogger(__name__)
########################################################################
class KRad:
''' This class contains mapping from radicals to kanjis (radk) and kanjis to radicals (krad)
'''
def __init__(self, **kwargs):
""" Kanji-Radical mapping """
self.__krad_map: Mapping = None
self.__radk_map: Mapping = None
self.__rads = {}
self.lock = threading.Lock()
def _build_krad_map(self):
with self.lock:
lines = chio.read_file(KRADFILE, mode='rt').splitlines()
# build the krad map
self.__krad_map = {}
self.__radk_map = dd(set)
for line in lines:
if line.startswith("#"):
continue
else:
parts = line.split(':', maxsplit=1)
if len(parts) == 2:
rads = [r.strip() for r in parts[1].split()]
char_literal = parts[0].strip()
self.__krad_map[char_literal] = rads
for rad in rads:
self.__radk_map[rad].add(char_literal)
@property
def radk(self) -> Mapping:
if self.__radk_map is None:
self._build_krad_map()
return self.__radk_map
@property
def krad(self) -> Mapping:
if self.__krad_map is None:
self._build_krad_map()
return self.__krad_map
================================================
FILE: jamdict/tools.py
================================================
# -*- coding: utf-8 -*-
"""
Jamdict console app
"""
# This code is a part of jamdict library: https://github.com/neocl/jamdict
# :copyright: (c) 2017 Le Tuan Anh <tuananh.ke@gmail.com>
# :license: MIT, see LICENSE for more details.
import os
import json
import logging
from chirptext import __version__ as chirptext_version
from puchikarui import __version__ as puchikarui_version
from chirptext import confirm, TextReport, Timer
from chirptext.cli import CLIApp, setup_logging
import jamdict
# -------------------------------------------------------------------------------
# Configuration
# -------------------------------------------------------------------------------
if os.path.isfile('logging.json'):
setup_logging('logging.json', 'logs')
else:
setup_logging(os.path.join(jamdict.config.home_dir(), 'logging.json'), 'logs')
def getLogger():
return logging.getLogger(__name__)
# -------------------------------------------------------------------------------
# Functions
# -------------------------------------------------------------------------------
def get_jam(cli, args):
if not args.jdb:
args.jdb = None
if args.config:
jamdict.config.read_config(args.config)
if args.kd2 or args.jmne:
cli.logger.warning("Jamdict database location: {}".format(args.jdb))
cli.logger.warning("Kanjidic2 database location: {}".format(args.kd2))
jmd = jamdict.Jamdict(db_file=args.jdb, kd2_file=args.kd2,
jmd_xml_file=args.jmdxml,
kd2_xml_file=args.kd2xml,
jmnedict_file=args.jmne,
jmnedict_xml_file=args.jmnexml)
else:
cli.logger.debug("Using the same database for both JMDict and Kanjidic2")
jmd = jamdict.Jamdict(db_file=args.jdb,
kd2_file=args.jdb,
jmnedict_file=args.jdb,
jmd_xml_file=args.jmdxml,
kd2_xml_file=args.kd2xml,
jmnedict_xml_file=args.jmnexml)
if jmd.kd2 is None:
cli.logger.warning("Kanjidic2 database could not be found")
return jmd
def import_data(cli, args):
'''Generate Jamdict SQLite database from XML data files'''
rp = TextReport()
t = Timer(report=rp)
show_info(cli, args)
jam = get_jam(cli, args)
if not jam.db_file:
print("Database path is not available")
elif os.path.isfile(jam.db_file):
if not confirm("Database file exists. Do you want to overwite (This action cannot be undone! yes/no?) "):
cli.logger.warning("Program aborted.")
exit()
else:
os.unlink(jam.db_file)
# perform input
print(f"Importing data to: {jam.db_file}")
t.start("Creating Jamdict SQLite database. This process may take very long time ...")
jam.import_data()
t.stop()
def dump_result(results, report=None):
if report is None:
report = TextReport()
if results.entries:
report.print("=" * 40)
report.print("Found entries")
report.print("=" * 40)
for e in results.entries:
kj = ', '.join([k.text for k in e.kanji_forms])
kn = ', '.join([k.text for k in e.kana_forms])
report.print("Entry: {} | Kj: {} | Kn: {}".format(e.idseq, kj, kn))
report.print("-" * 20)
for idx, s in enumerate(e.senses):
report.print("{idx}. {s}".format(idx=idx + 1, s=s))
report.print('')
else:
report.print("No dictionary entry was found.")
if results.chars:
report.print("=" * 40)
report.print("Found characters")
report.print("=" * 40)
for c in results.chars:
report.print("Char: {} | Strokes: {}".format(c, c.stroke_count))
report.print("-" * 20)
for rmg in c.rm_groups:
report.print("Readings:", ", ".join([r.value for r in rmg.readings]))
report.print("Meanings:", ", ".join([m.value for m in rmg.meanings if not m.m_lang or m.m_lang == 'en']))
report.print('')
report.print('')
else:
report.print("No character was found.")
if results.names:
report.print("=" * 40)
report.print("Found name entities")
report.print("=" * 40)
for e in results.names:
kj = ', '.join([k.text for k in e.kanji_forms])
kn = ', '.join([k.text for k in e.kana_forms])
report.print("Names: {} | Kj: {} | Kn: {}".format(e.idseq, kj, kn))
report.print("-" * 20)
for idx, s in enumerate(e.senses):
report.print("{idx}. {s}".format(idx=idx + 1, s=s))
report.print('')
else:
report.print("No name was found.")
def lookup(cli, args):
'''Lookup words by kanji/kana'''
jam = get_jam(cli, args)
if jam.ready:
results = jam.lookup(args.query, strict_lookup=args.strict)
report = TextReport(args.output)
if args.format == 'json':
report.print(json.dumps(results.to_dict(),
ensure_ascii=args.ensure_ascii,
indent=args.indent if args.indent else None))
else:
if args.compact:
report.print(results.text(separator='\n------\n', entry_sep='\n'))
else:
dump_result(results, report=report)
else:
getLogger().warning(f"Jamdict database is not available.\nThere are 3 ways to install data: \n 1) install jamdict_data via PyPI using `pip install jamdict_data` \n 2) download prebuilt dictionary database file from: {jamdict.__url__}, \n 3) or build your own database file from XML source files.")
def file_status(file_path):
if file_path:
real_path = os.path.abspath(os.path.expanduser(file_path))
if os.path.isfile(real_path):
return '[OK]'
return '[NOT FOUND]'
def hello_jamdict(cli, args):
''' Say hello and test if Jamdict is working '''
jam = get_jam(cli, args)
if jam.ready:
results = jam.lookup("一期一会")
dump_result(results, report=TextReport())
else:
getLogger().warning("Hello there, unfortunately jamdict data is not available. Please try to install using `pip install jamdict-data`")
def show_info(cli, args):
''' Show jamdict configuration (data folder, configuration file location, etc.) '''
output = TextReport(args.output) if 'output' in args else TextReport()
if args.config:
jamdict.config.read_config(args.config)
output.print("Jamdict " + jamdict.version_info.__version__)
output.print(jamdict.version_info.__description__)
jam = get_jam(cli, args)
output.header("Basic configuration")
jamdict_home = jamdict.config.home_dir()
if not os.path.isdir(os.path.expanduser(jamdict_home)):
jamdict_home += " [Missing]"
else:
jamdict_home += " [OK]"
output.print(f"JAMDICT_HOME: {jamdict_home}")
if jamdict.util._JAMDICT_DATA_AVAILABLE:
import jamdict_data
data_pkg = f"version {jamdict_data.__version__} [OK]"
else:
data_pkg = "Not installed"
output.print(f"jamdict-data: {data_pkg}")
if args.config:
_config_path = args.config + " [Custom]"
if not os.path.isfile(args.config):
_config_path += " [Missing]"
else:
_config_path = jamdict.config._get_config_manager().locate_config()
if not _config_path:
_config_path = "Not available.\n Run `python3 -m jamdict config` to create configuration file if needed."
output.print(f"Config file : {_config_path}")
output.header("Data files")
output.print(f"Jamdict DB location: {jam.db_file} - {file_status(jam.db_file)}")
output.print(f"JMDict XML file : {jam.jmd_xml_file} - {file_status(jam.jmd_xml_file)}")
output.print(f"KanjiDic2 XML file : {jam.kd2_xml_file} - {file_status(jam.kd2_xml_file)}")
output.print(f"JMnedict XML file : {jam.jmnedict_xml_file} - {file_status(jam.jmnedict_xml_file)}")
if jam.ready:
output.header("Jamdict database metadata")
try:
for meta in jam.jmdict.meta.select():
output.print(f"{meta.key}: {meta.value}")
except Exception as e:
print(e)
output.print("Error happened while retrieving database meta data")
output.header("Others")
output.print(f"puchikarui: version {puchikarui_version}")
output.print(f"chirptext : version {chirptext_version}")
output.print(f"lxml : {jamdict.jmdict._LXML_AVAILABLE}")
def show_version(cli, args):
''' Show Jamdict version '''
if args.verbose:
print("Jamdict {v} - {d}".format(d=jamdict.version_info.__description__, v=jamdict.version_info.__version__))
else:
print("Jamdict {}".format(jamdict.version_info.__version__))
def config_jamdict(cli, args):
''' Create Jamdict configuration file '''
if args.config:
jamdict.config._ensure_config(args.config)
else:
jamdict.config._ensure_config()
show_info(cli, args)
# -------------------------------------------------------------------------------
# Main
# -------------------------------------------------------------------------------
def add_data_config(parser):
parser.add_argument('-c', '--config', help='Path to Jamdict config file (i.e. ~/.jamdict/config.json)', default=None)
parser.add_argument('-J', '--jdb', help='Path to JMDict SQLite file', default=None)
parser.add_argument('-j', '--jmdxml', help='Path to JMdict XML file', default=None)
parser.add_argument('-k', '--kd2xml', help='Path to KanjiDic2 XML file', default=None)
parser.add_argument('-e', '--jmnexml', help='Path to JMnedict XML file', default=None)
parser.add_argument('-K', '--kd2', help='Path to KanjiDic2 SQLite file', default=None)
parser.add_argument('-E', '--jmne', help='Path to JMnedict SQLite file', default=None)
def main():
'''Main entry of jamtk
'''
app = CLIApp(desc='Jamdict command-line toolkit', logger=__name__, show_version=show_version)
add_data_config(app.parser)
# import task
import_task = app.add_task('import', func=import_data)
add_data_config(import_task)
# show info
info_task = app.add_task('info', func=show_info)
info_task.add_argument('-o', '--output', help='Write information to a text file')
add_data_config(info_task)
# show version
version_task = app.add_task('version', func=show_version)
add_data_config(version_task)
# create config file
config_task = app.add_task('config', func=config_jamdict)
add_data_config(config_task)
# hello
hello_task = app.add_task('hello', func=hello_jamdict)
add_data_config(hello_task)
# look up task
lookup_task = app.add_task('lookup', func=lookup)
lookup_task.add_argument('query', help='kanji/kana')
lookup_task.add_argument('-f', '--format', help='json or text')
lookup_task.add_argument('--compact', action='store_true')
lookup_task.add_argument('-s', '--strict', action='store_true')
lookup_task.add_argument('--ensure_ascii', help='Force JSON dumps to ASCII only', action='store_true')
lookup_task.add_argument('--indent', help='JSON default indent', default=2, type=int)
lookup_task.add_argument('-o', '--output', help='Path to a file to output lookup result, leave blank to write to console standard output')
lookup_task.set_defaults(func=lookup)
add_data_config(lookup_task)
# run app
app.run()
if __name__ == "__main__":
main()
================================================
FILE: jamdict/util.py
================================================
# -*- coding: utf-8 -*-
"""
Jamdict public APIs
"""
# This code is a part of jamdict library: https://github.com/neocl/jamdict
# :copyright: (c) 2016 Le Tuan Anh <tuananh.ke@gmail.com>
# :license: MIT, see LICENSE for more details.
import os
import logging
import threading
import warnings
from pathlib import Path
from collections import defaultdict as dd
from collections import OrderedDict
from typing import List, Sequence
from chirptext.deko import HIRAGANA, KATAKANA
_MEMORY_MODE = False
try:
from puchikarui import MemorySource
_MEMORY_MODE = True
except ImportError:
pass
from puchikarui import ExecutionContext
from . import config
from .jmdict import JMDictXMLParser, JMDEntry
from .krad import KRad
from .jmdict_sqlite import JMDictSQLite
from .kanjidic2 import Kanjidic2XMLParser, Character
from .kanjidic2_sqlite import KanjiDic2SQLite
from .jmnedict_sqlite import JMNEDictSQLite
try:
import jamdict_data
_JAMDICT_DATA_AVAILABLE = True
except Exception:
_JAMDICT_DATA_AVAILABLE = False
########################################################################
def getLogger():
return logging.getLogger(__name__)
########################################################################
class LookupResult(object):
""" Contain lookup results (words, Kanji characters, or named entities) from Jamdict.
A typical jamdict lookup is like this:
>>> jam = Jamdict()
>>> result = jam.lookup('食べ%る')
The command above returns a :any:`LookupResult` object which contains found words (:any:`entries`),
kanji characters (:any:`chars`), and named entities (:any:`names`).
"""
def __init__(self, entries, chars, names=None):
self.__entries: Sequence[JMDEntry] = entries if entries else []
self.__chars: Sequence[Character] = chars if chars else []
self.__names: Sequence[JMDEntry] = names if names else []
@property
def entries(self) -> Sequence[JMDEntry]:
""" A list of words entries
:returns: a list of :class:`JMDEntry <jamdict.jmdict.JMDEntry>` object
:rtype: List[JMDEntry]
"""
return self.__entries
@entries.setter
def entries(self, values: Sequence[JMDEntry]):
self.__entries = values
@property
def chars(self) -> Sequence[Character]:
""" A list of found kanji characters
:returns: a list of :class:`Character <jamdict.kanjidic2.Character>` object
:rtype: Sequence[Character]
"""
return self.__chars
@chars.setter
def chars(self, values: Sequence[Character]):
self.__chars = values
@property
def names(self) -> Sequence[JMDEntry]:
""" A list of found named entities
:returns: a list of :class:`JMDEntry <jamdict.jmdict.JMDEntry>` object
:rtype: Sequence[JMDEntry]
"""
return self.__names
@names.setter
def names(self, values: Sequence[JMDEntry]):
self.__names = values
def text(self, compact=True, entry_sep='。', separator=' | ', no_id=False, with_chars=True) -> str:
""" Generate a text string that contains all found words, characters, and named entities.
:param compact: Make the output string more compact (fewer info, fewer whitespaces, etc.)
:param no_id: Do not include jamdict's internal object IDs (for direct query via API)
:param entry_sep: The text to separate entries
:param with_chars: Include characters information
:returns: A formatted string ready for display
"""
output = []
if self.entries:
entry_txts = []
for idx, e in enumerate(self.entries, start=1):
entry_txt = e.text(compact=compact, separator=' ', no_id=no_id)
entry_txts.append("#{}: {}".format(idx, entry_txt))
output.append("[Entries]")
output.append(entry_sep)
output.append(entry_sep.join(entry_txts))
elif not compact:
output.append("No entries")
if self.chars and with_chars:
if compact:
chars_txt = ', '.join(str(c) for c in self.chars)
else:
chars_txt = ', '.join(repr(c) for c in self.chars)
if output:
output.append(separator) # TODO: section separator?
output.append("[Chars]")
output.append(entry_sep)
output.append(chars_txt)
if self.names:
name_txts = []
for idx, n in enumerate(self.names, start=1):
name_txt = n.text(compact=compact, separator=' ', no_id=no_id)
name_txts.append("#{}: {}".format(idx, name_txt))
if output:
output.append(separator)
output.append("[Names]")
output.append(entry_sep)
output.append(entry_sep.join(name_txts))
return "".join(output) if output else "Found nothing"
def __repr__(self):
return self.text(compact=True)
def __str__(self):
return self.text(compact=False)
def to_json(self):
warnings.warn("to_json() is deprecated and will be removed in the next major release. Use to_dict() instead.",
DeprecationWarning, stacklevel=2)
return self.to_dict()
def to_dict(self) -> dict:
return {'entries': [e.to_dict() for e in self.entries],
'chars': [c.to_dict() for c in self.chars],
'names': [n.to_dict() for n in self.names]}
class IterLookupResult(object):
""" Contain lookup results (words, Kanji characters, or named entities) from Jamdict.
A typical jamdict lookup is like this:
>>> res = jam.lookup_iter("花見")
``res`` is an :class:`IterLookupResult` object which contains iterators
to scan through found words (``entries``), kanji characters (``chars``),
and named entities (:any:`names`) one by one.
>>> for word in res.entries:
... print(word) # do somethign with the word
>>> for c in res.chars:
... print(c)
>>> for name in res.names:
... print(name)
"""
def __init__(self, entries, chars=None, names=None):
self.__entries = entries if entries is not None else []
self.__chars = chars if chars is not None else []
self.__names = names if names is not None else []
@property
def entries(self):
""" Iterator for looping one by one through all found entries, can only be used once """
return self.__entries
@property
def chars(self):
""" Iterator for looping one by one through all found kanji characters, can only be used once """
return self.__chars
@property
def names(self):
""" Iterator for looping one by one through all found named entities, can only be used once """
return self.__names
class JamdictSQLite(KanjiDic2SQLite, JMNEDictSQLite, JMDictSQLite):
def __init__(self, db_file, *args, **kwargs):
super().__init__(db_file, *args, **kwargs)
class Jamdict(object):
""" Main entry point to access all available dictionaries in jamdict.
>>> from jamdict import Jamdict
>>> jam = Jamdict()
>>> result = jam.lookup('食べ%る')
# print all word entries
>>> for entry in result.entries:
>>> print(entry)
# print all related characters
>>> for c in result.chars:
>>> print(repr(c))
To filter results by ``pos``, for example look for all "かえる" that are nouns, use:
>>> result = jam.lookup("かえる", pos=["noun (common) (futsuumeishi)"])
To search for named-entities by type, use the type string as query.
For example to search for all "surname" use:
>>> result = jam.lookup("surname")
To find out which part-of-speeches or named-entities types are available in the
dictionary, use :func:`Jamdict.all_pos <jamdict.util.Jamdict.all_pos>`
and :func:`Jamdict.all_ne_type <jamdict.util.Jamdict.all_pos>`.
Jamdict >= 0.1a10 support ``memory_mode`` keyword argument for reading
the whole database into memory before querying to boost up search speed.
The database may take about a minute to load. Here is the sample code:
>>> jam = Jamdict(memory_mode=True)
When there is no suitable database available, Jamdict will try to use database
from `jamdict-data <https://pypi.org/project/jamdict-data/>`_ package by default.
If there is a custom database available in configuration file,
Jamdict will prioritise to use it over the ``jamdict-data`` package.
"""
def __init__(self, db_file=None, kd2_file=None,
jmd_xml_file=None, kd2_xml_file=None,
auto_config=True, auto_expand=True, reuse_ctx=True,
jmnedict_file=None, jmnedict_xml_file=None,
memory_mode=False, **kwargs):
# data sources
self.reuse_ctx = reuse_ctx
self._db_sqlite = None
self._kd2_sqlite = None
self._jmne_sqlite = None
self._jmd_xml = None
self._kd2_xml = None
self._jmne_xml = None
self.__krad_map = None
self.__jm_ctx = None # for reusing database context
self.__memory_mode = memory_mode
# file paths configuration
self.auto_expand = auto_expand
self.jmd_xml_file = jmd_xml_file if jmd_xml_file else config.get_file('JMDICT_XML') if auto_config else None
self.kd2_xml_file = kd2_xml_file if kd2_xml_file else config.get_file('KD2_XML') if auto_config else None
self.jmnedict_xml_file = jmnedict_xml_file if jmnedict_xml_file else config.get_file('JMNEDICT_XML') if auto_config else None
if auto_expand:
if self.jmd_xml_file:
self.jmd_xml_file = os.path.expanduser(self.jmd_xml_file)
if self.kd2_xml_file:
self.kd2_xml_file = os.path.expanduser(self.kd2_xml_file)
if self.jmnedict_xml_file:
self.jmnedict_xml_file = os.path.expanduser(self.jmnedict_xml_file)
self.db_file = db_file if db_file else config.get_file('JAMDICT_DB') if auto_config else None
if not self.db_file or (self.db_file != ':memory:' and not os.path.isfile(self.db_file)):
if _JAMDICT_DATA_AVAILABLE:
self.db_file = jamdict_data.JAMDICT_DB_PATH
elif self.jmd_xml_file and os.path.isfile(self.jmd_xml_file):
getLogger().warning("JAMDICT_DB could NOT be found. Searching will be extremely slow. Please run `python3 -m jamdict import` first")
self.kd2_file = kd2_file if kd2_file else self.db_file if auto_config else None
if not self.kd2_file or (self.kd2_file != ':memory:' and not os.path.isfile(self.kd2_file)):
if _JAMDICT_DATA_AVAILABLE:
self.kd2_file = None # jamdict_data.JAMDICT_DB_PATH
elif self.kd2_xml_file and os.path.isfile(self.kd2_xml_file):
getLogger().warning("Kanjidic2 database could NOT be found. Searching will be extremely slow. Please run `python3 -m jamdict import` first")
self.jmnedict_file = jmnedict_file if jmnedict_file else self.db_file if auto_config else None
if not self.jmnedict_file or (self.jmnedict_file != ':memory:' and not os.path.isfile(self.jmnedict_file)):
if _JAMDICT_DATA_AVAILABLE:
self.jmnedict_file = None # jamdict_data.JAMDICT_DB_PATH
elif self.jmnedict_xml_file and os.path.isfile(self.jmnedict_xml_file):
getLogger().warning("JMNE database could NOT be found. Searching will be extremely slow. Please run `python3 -m jamdict import` first")
@property
def ready(self) -> bool:
""" Check if Jamdict database is available """
return os.path.isfile(self.db_file) and self.jmdict is not None
def __del__(self):
if self.__jm_ctx is not None:
try:
# try to close default SQLite context if needed
self.__jm_ctx.close()
except Exception:
pass
def __make_db_ctx(self) -> ExecutionContext:
""" Try to reuse context if allowed """
try:
if not self.reuse_ctx:
return self.jmdict.ctx()
elif self.__jm_ctx is None and self.db_file and (self.db_file == ":memory:" or os.path.isfile(self.db_file)):
self.__jm_ctx = self.jmdict.ctx()
except Exception:
getLogger().warning("JMdict data could not be accessed.")
return self.__jm_ctx
@property
def db_file(self):
return self.__db_file
@db_file.setter
def db_file(self, value):
if self.auto_expand and value and value != ':memory:':
self.__db_file = os.path.abspath(os.path.expanduser(value))
else:
self.__db_file = value
@property
def kd2_file(self):
return self.__kd2_file
@kd2_file.setter
def kd2_file(self, value):
if self.auto_expand and value and value != ':memory:':
self.__kd2_file = os.path.abspath(os.path.expanduser(value))
else:
self.__kd2_file = value
@property
def jmnedict_file(self):
return self.__jmnedict_file
@jmnedict_file.setter
def jmnedict_file(self, value):
if self.auto_expand and value and value != ':memory:':
self.__jmnedict_file = os.path.abspath(os.path.expanduser(value))
else:
self.__jmnedict_file = value
@property
def memory_mode(self):
""" if memory_mode = True, Jamdict DB will be loaded into RAM before querying for better performance """
return self.__memory_mode
@property
def jmdict(self):
if not self._db_sqlite and self.db_file:
with threading.Lock():
# Use 1 DB for all
if self.memory_mode and _MEMORY_MODE:
data_source = MemorySource(self.db_file)
else:
if self.memory_mode and not _MEMORY_MODE:
logging.getLogger(__name__).error("Memory mode could not be enabled because puchikarui version is too old. Fallback to normal file DB mode")
data_source = self.db_file
self._db_sqlite = JamdictSQLite(data_source, auto_expand_path=self.auto_expand)
return self._db_sqlite
@property
def kd2(self):
if self._kd2_sqlite is None:
if self.kd2_file is not None and os.path.isfile(self.kd2_file):
with threading.Lock():
if self.memory_mode and _MEMORY_MODE:
data_source = MemorySource(self.kd2_file)
else:
if self.memory_mode and not _MEMORY_MODE:
logging.getLogger(__name__).error("Memory mode could not be enabled because puchikarui version is too old. Fallback to normal file DB mode")
data_source = self.kd2_file
self._kd2_sqlite = KanjiDic2SQLite(data_source, auto_expand_path=self.auto_expand)
elif not self.kd2_file or self.kd2_file == self.db_file:
self._kd2_sqlite = self.jmdict
return self._kd2_sqlite
@property
def jmnedict(self):
""" JM NE SQLite database access object """
if self._jmne_sqlite is None:
if self.jmnedict_file is not None:
with threading.Lock():
if self.memory_mode and _MEMORY_MODE:
data_source = MemorySource(self.jmnedict_file)
else:
if self.memory_mode and not _MEMORY_MODE:
logging.getLogger(__name__).error("Memory mode could not be enabled because puchikarui version is too old. Fallback to normal file DB mode")
data_source = self.jmnedict_file
self._jmne_sqlite = JMNEDictSQLite(data_source, auto_expand_path=self.auto_expand)
elif not self.jmnedict_file or self.jmnedict_file == self.db_file:
self._jmne_sqlite = self.jmdict
return self._jmne_sqlite
@property
def jmdict_xml(self):
if not self._jmd_xml and self.jmd_xml_file:
with threading.Lock():
getLogger().info("Loading JMDict from XML file at {}".format(self.jmd_xml_file))
self._jmd_xml = JMDictXML.from_file(self.jmd_xml_file)
getLogger().info("Loaded JMdict entries: {}".format(len(self._jmd_xml)))
return self._jmd_xml
@property
def krad(self):
""" Break a kanji down to writing components
>>> jam = Jamdict()
>>> print(jam.krad['雲'])
['一', '雨', '二', '厶']
"""
if not self.__krad_map:
with threading.Lock():
self.__krad_map = KRad()
return self.__krad_map.krad
@property
def radk(self):
""" Find all kanji with a writing component
>>> jam = Jamdict()
>>> print(jam.radk['鼎'])
{'鼏', '鼒', '鼐', '鼎', '鼑'}
"""
if not self.__krad_map:
with threading.Lock():
self.__krad_map = KRad()
return self.__krad_map.radk
@property
def kd2_xml(self):
if not self._kd2_xml and self.kd2_xml_file:
with threading.Lock():
getLogger().info("Loading KanjiDic2 from XML file at {}".format(self.kd2_xml_file))
self._kd2_xml = KanjiDic2XML.from_file(self.kd2_xml_file)
getLogger().info("Loaded KanjiDic2 entries: {}".format(len(self._kd2_xml)))
return self._kd2_xml
@property
def jmne_xml(self):
if not self._jmne_xml and self.jmnedict_xml_file:
with threading.Lock():
getLogger().info("Loading JMnedict from XML file at {}".format(self.jmnedict_xml_file))
self._jmne_xml = JMNEDictXML.from_file(self.jmnedict_xml_file)
getLogger().info("Loaded JMnedict entries: {}".format(len(self._jmne_xml)))
return self._jmne_xml
def has_kd2(self) -> bool:
return self.db_file is not None or self.kd2_file is not None or self.kd2_xml_file is not None
def has_jmne(self, ctx=None) -> bool:
""" Check if current database has jmne support """
if ctx is None:
ctx = self.__make_db_ctx()
m = ctx.meta.select_single('key=?', ('jmnedict.version',)) if ctx is not None else None
return m is not None and len(m.value) > 0
def is_available(self) -> bool:
# this function is for developer only
# don't expose it to the public
# ready should be used instead
return (self.db_file is not None or self.jmd_xml_file is not None or
self.kd2_file is not None or self.kd2_xml_file is not None or
self.jmnedict_file is not None or self.jmnedict_xml_file is not None)
def import_data(self):
""" Import JMDict and KanjiDic2 data from XML to SQLite """
if self.db_file and not os.path.exists(self.db_file):
Path(self.db_file).touch()
ctx = self.__make_db_ctx()
ctx.buckmode()
ctx.auto_commit = False
if self.jmdict and self.jmdict_xml:
getLogger().info("Importing JMDict data")
self.jmdict.insert_entries(self.jmdict_xml, ctx=ctx)
# import KanjiDic2
if self.kd2_xml is not None and os.path.isfile(self.kd2_xml_file):
getLogger().info("Importing KanjiDic2 data")
if self.jmdict is not None and self.kd2_file == self.db_file:
self.jmdict.insert_chars(self.kd2_xml, ctx=ctx)
elif self.kd2 is not None:
getLogger().warning(f"Building Kanjidic2 DB using a different DB context {self.kd2_file} vs {self.db_file}")
with self.kd2.ctx() as kd_ctx:
self.kd2.insert_chars(self.kd2_xml, ctx=kd_ctx)
else:
getLogger().warning(f"Kanjidic2 DB path could not be found")
else:
print(f"kd2_xml: {self.kd2_xml}")
print(f"kd2_xml_file: {self.kd2_xml_file}")
getLogger().warning("KanjiDic2 XML data is not available - skipped!")
# import JMNEdict
if self.jmne_xml is not None and os.path.isfile(self.jmnedict_xml_file):
getLogger().info("Importing JMNEdict data")
if self.jmdict is not None and self.jmnedict_file == self.db_file:
self.jmnedict.insert_name_entities(self.jmne_xml, ctx=ctx)
elif self.jmnedict is not None:
getLogger().warning(f"Building Kanjidic2 DB using a different DB context {self.jmne_file} vs {self.db_file}")
with self.jmnedict.ctx() as ne_ctx:
self.jmnedict.insert_name_entities(self.jmne_xml, ctx=ne_ctx)
else:
getLogger().warning(f"JMNE DB path could not be found")
else:
getLogger().warning("JMNEdict XML data is not available - skipped!")
_buckmode_off = getattr(ctx, "buckmode_off", None)
if _buckmode_off is not None:
_buckmode_off()
ctx.commit()
def get_ne(self, idseq, ctx=None) -> JMDEntry:
""" Get name entity by idseq in JMNEdict """
if self.jmnedict is not None:
if ctx is None:
ctx = self.__make_db_ctx()
return self.jmnedict.get_ne(idseq, ctx=ctx)
elif self.jmnedict_xml_file:
return self.jmne_xml.lookup(idseq)
else:
raise LookupError("There is no JMnedict data source available")
def get_char(self, literal, ctx=None) -> Character:
if self.kd2 is not None:
if ctx is None:
ctx = self.__make_db_ctx()
return self.kd2.get_char(literal, ctx=ctx)
elif self.kd2_xml:
return self.kd2_xml.lookup(literal)
else:
raise LookupError("There is no KanjiDic2 data source available")
def get_entry(self, idseq) -> JMDEntry:
if self.jmdict:
return self.jmdict.get_entry(idseq)
elif self.jmdict_xml:
return self.jmdict_xml.lookup(idseq)[0]
else:
raise LookupError("There is no backend data available")
def all_pos(self, ctx=None) -> List[str]:
""" Find all available part-of-speeches
:returns: A list of part-of-speeches (a list of strings)
"""
if ctx is None:
ctx = self.__make_db_ctx()
return self.jmdict.all_pos(ctx=ctx)
def all_ne_type(self, ctx=None) -> List[str]:
""" Find all available named-entity types
:returns: A list of named-entity types (a list of strings)
"""
if ctx is None:
ctx = self.__make_db_ctx()
return self.jmnedict.all_ne_type(ctx=ctx)
def lookup(self, query, strict_lookup=False, lookup_chars=True, ctx=None,
lookup_ne=True, pos=None, **kwargs) -> LookupResult:
""" Search words, characters, and characters.
Keyword arguments:
:param query: Text to query, may contains wildcard characters. Use `?` for 1 exact character and `%` to match any number of characters.
:param strict_lookup: only look up the Kanji characters in query (i.e. discard characters from variants)
:type strict_lookup: bool
:param: lookup_chars: set lookup_chars to False to disable character lookup
:type lookup_chars: bool
:param pos: Filter words by part-of-speeches
:type pos: list of strings
:param ctx: database access context, can be reused for better performance. Normally users do not have to touch this and database connections will be reused by default.
:param lookup_ne: set lookup_ne to False to disable name-entities lookup
:type lookup_ne: bool
:returns: Return a LookupResult object.
:rtype: :class:`jamdict.util.LookupResult`
>>> # match any word that starts with "食べ" and ends with "る" (anything from between is fine)
>>> jam = Jamdict()
>>> results = jam.lookup('食べ%る')
"""
if not self.is_available():
raise LookupError("There is no backend data available")
elif (not query or query == "%") and not pos:
raise ValueError("Query and POS filter cannot be both empty")
if ctx is None:
ctx = self.__make_db_ctx()
entries = []
chars = []
names = []
if self.jmdict is not None:
entries = self.jmdict.search(query, pos=pos, ctx=ctx)
elif self.jmdict_xml:
entries = self.jmdict_xml.lookup(query)
if lookup_chars and self.has_kd2():
# lookup each character in query and kanji readings of each found entries
chars_to_search = OrderedDict({c: c for c in query})
if not strict_lookup and entries:
# auto add characters from entries
for e in entries:
for k in e.kanji_forms:
for c in k.text:
if c not in HIRAGANA and c not in KATAKANA:
chars_to_search[c] = c
for c in chars_to_search:
result = self.get_char(c, ctx=ctx)
if result is not None:
chars.append(result)
# lookup name-entities
if lookup_ne and self.has_jmne(ctx=ctx):
names = self.jmnedict.search_ne(query, ctx=ctx)
# finish
return LookupResult(entries, chars, names)
def lookup_iter(self, query, strict_lookup=False,
lookup_chars=True, lookup_ne=True,
ctx=None, pos=None, **kwargs) -> LookupResult:
""" Search for words, characters, and characters iteratively.
An :class:`IterLookupResult` object will be returned instead of the normal ``LookupResult``.
``res.entries``, ``res.chars``, ``res.names`` are iterators instead of lists and each of them
can only be looped through once. Users have to store the results manually.
>>> res = jam.lookup_iter("花見")
>>> for word in res.entries:
... print(word) # do somethign with the word
>>> for c in res.chars:
... print(c)
>>> for name in res.names:
... print(name)
Keyword arguments:
:param query: Text to query, may contains wildcard characters. Use `?` for 1 exact character and `%` to match any number of characters.
:param strict_lookup: only look up the Kanji characters in query (i.e. discard characters from variants)
:type strict_lookup: bool
:param: lookup_chars: set lookup_chars to False to disable character lookup
:type lookup_chars: bool
:param pos: Filter words by part-of-speeches
:type pos: list of strings
:param ctx: database access context, can be reused for better performance. Normally users do not have to touch this and database connections will be reused by default.
:param lookup_ne: set lookup_ne to False to disable name-entities lookup
:type lookup_ne: bool
:returns: Return an IterLookupResult object.
:rtype: :class:`jamdict.util.IterLookupResult`
"""
if not self.is_available():
raise LookupError("There is no backend data available")
elif (not query or query == "%") and not pos:
raise ValueError("Query and POS filter cannot be both empty")
if ctx is None:
ctx = self.__make_db_ctx()
# Lookup entries, chars, and names
entries = None
chars = None
names = None
if self.jmdict is not None:
entries = self.jmdict.search_iter(query, pos=pos, ctx=ctx)
if lookup_chars and self.has_kd2():
chars_to_search = OrderedDict({c: c for c in query if c not in HIRAGANA and c not in KATAKANA})
chars = self.kd2.search_chars_iter(chars_to_search, ctx=ctx)
# lookup name-entities
if lookup_ne and self.has_jmne(ctx=ctx):
names = self.jmnedict.search_ne_iter(query, ctx=ctx)
# finish
return IterLookupResult(entries, chars, names)
class JMDictXML(object):
""" JMDict API for looking up information in XML
"""
def __init__(self, entries):
self.entries = entries
self._seqmap = {} # entryID - entryObj map
self._textmap = dd(set)
# compile map
for entry in self.entries:
self._seqmap[entry.idseq] = entry
for kn in entry.kana_forms:
self._textmap[kn.text].add(entry)
for kj in entry.kanji_forms:
self._textmap[kj.text].add(entry)
def __len__(self):
return len(self.entries)
def __getitem__(self, idx):
return self.entries[idx]
def lookup(self, a_query) -> Sequence[JMDEntry]:
if a_query in self._textmap:
return tuple(self._textmap[a_query])
elif a_query.startswith('id#'):
entry_id = a_query[3:]
if entry_id in self._seqmap:
return (self._seqmap[entry_id],)
# found nothing
return ()
@staticmethod
def from_file(filename):
parser = JMDictXMLParser()
return JMDictXML(parser.parse_file(os.path.abspath(os.path.expanduser(filename))))
class JMNEDictXML(JMDictXML):
pass
class KanjiDic2XML(object):
def __init__(self, kd2):
"""
"""
self.kd2 = kd2
self.char_map = {}
for char in self.kd2:
if char.literal in self.char_map:
getLogger().warning("Duplicate character entry: {}".format(char.literal))
self.char_map[char.literal] = char
def __len__(self):
return len(self.kd2)
def __getitem__(self, idx):
return self.kd2[idx]
def lookup(self, char):
if char in self.char_map:
return self.char_map[char]
else:
return None
@staticmethod
def from_file(filename):
parser = Kanjidic2XMLParser()
return KanjiDic2XML(parser.parse_file(filename))
================================================
FILE: jamdict_demo.py
================================================
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Jamdict demo application
Latest version can be found at https://github.com/neocl/jamdict
This package uses the [EDICT][1] and [KANJIDIC][2] dictionary files.
These files are the property of the [Electronic Dictionary Research and Development Group][3], and are used in conformance with the Group's [licence][4].
[1]: http://www.csse.monash.edu.au/~jwb/edict.html
[2]: http://www.csse.monash.edu.au/~jwb/kanjidic.html
[3]: http://www.edrdg.org/
[4]: http://www.edrdg.org/edrdg/licence.html
References:
JMDict website:
http://www.csse.monash.edu.au/~jwb/edict.html
"""
# Copyright (c) 2016, Le Tuan Anh <tuananh.ke@gmail.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
########################################################################
import json
from jamdict import Jamdict
########################################################################
# Create an instance of Jamdict
jam = Jamdict()
print("Jamdict DB file: {}".format(jam.db_file))
if not jam.ready:
print("""Jamdict DB is not available. Database can be installed via PyPI:
pip install jamdict-data
Or downloaded from: https://jamdict.readthedocs.io/en/latest/install.html
To create a config file, run:
python3 -m jamdict config
Program aborted.""")
exit()
# Lookup by kana
result = jam.lookup('おかえし')
for entry in result.entries:
print(entry)
# Lookup by kanji
print("-----------------")
result = jam.lookup('御土産')
for entry in result.entries:
print(entry)
# Lookup a name
# a name entity is also a jamdict.jmdict.JMDEntry object
# excep that the senses is a list of Translation objects instead of Sense objects
print("-----------------")
if jam.has_jmne():
result = jam.lookup('鈴木')
for name in result.names:
print(name)
# Use wildcard matching
# Find all names ends with -jida
print("-----------------")
result = jam.lookup('%じだ')
for name in result.names:
print(name)
# ------------------------------------------------------------------------------
# lookup entry by idseq
print("---------------------")
otenki = jam.lookup('id#1002470').entries[0]
# extract all kana forms
kana_forms = ' '.join([x.text for x in otenki.kana_forms])
# extract all kanji forms
kanji_forms = ' '.join([x.text for x in otenki.kanji_forms])
print("Entry #{id}: Kanji: {kj} - Kana: {kn}".format(id=otenki.idseq, kj=kanji_forms, kn=kana_forms))
# extract all sense glosses
for idx, sense in enumerate(otenki):
print("{i}. {s}".format(i=idx, s=sense))
# Look up radical & writing components of kanji characters
# 1. Lookup kanji's components
print("---------------------")
result = jam.lookup('筋斗雲')
for c in result.chars:
meanings = ', '.join(c.meanings())
# has components
print(f"{c}: {meanings}")
print(f" Radical: {c.radical}")
print(f" Components: {c.components}")
# 2. Lookup kanjis by component
print("---------------------")
chars = jam.radk['鼎'] # this returns a list of strings (each string is the literal of a character)
result = jam.lookup(''.join(chars))
for c in result.chars:
meanings = ', '.join(c.meanings())
# has components
print(f"{c}: {meanings}")
print(f" Radical: {c.radical}")
print(f" Components: {c.components}")
# using JSON
print("---------------------")
result = jam.lookup('こうしえん')
print(result.text(separator='\n'))
print("---------------------")
otenki_dict = result.to_json() # get a dict structure to produce a JSON str
gitextract_xlbja41h/ ├── .gitignore ├── .gitmodules ├── LICENSE ├── MANIFEST.in ├── README.md ├── TODO.md ├── _config.yml ├── data/ │ └── README.md ├── docs/ │ ├── Makefile │ ├── api.rst │ ├── conf.py │ ├── contributing.rst │ ├── index.rst │ ├── install.rst │ ├── make.bat │ ├── recipes.rst │ ├── requirements.txt │ ├── tutorials.rst │ └── updates.rst ├── jamdict/ │ ├── __init__.py │ ├── __main__.py │ ├── __version__.py │ ├── config.py │ ├── data/ │ │ ├── config_template.json │ │ ├── setup_jmdict.sql │ │ ├── setup_jmnedict.sql │ │ └── setup_kanjidic2.sql │ ├── jmdict.py │ ├── jmdict_sqlite.py │ ├── jmnedict_sqlite.py │ ├── kanjidic2.py │ ├── kanjidic2_sqlite.py │ ├── krad.py │ ├── tools.py │ └── util.py ├── jamdict_demo.py ├── jamdol-flask.py ├── jmd ├── logging.json ├── release.sh ├── requirements.txt ├── run ├── setup.py ├── test/ │ ├── __init__.py │ ├── data/ │ │ ├── JMdict_mini.xml │ │ ├── jamdict.json │ │ ├── jmendict_mini.xml │ │ └── kanjidic2_mini.xml │ ├── logging.json │ ├── test_jamdict.py │ ├── test_jmdict_sqlite.py │ ├── test_jmnedict.py │ ├── test_kanjidic2_sqlite.py │ └── test_krad.py └── test.sh
SYMBOL INDEX (469 symbols across 19 files)
FILE: jamdict/config.py
function _get_config_manager (line 28) | def _get_config_manager():
function _ensure_config (line 35) | def _ensure_config(config_path='~/.jamdict/config.json', mkdir=True):
function read_config (line 47) | def read_config(config_file=None, force_refresh=False, ensure_config=Fal...
function home_dir (line 81) | def home_dir():
function data_dir (line 97) | def data_dir():
function get_file (line 103) | def get_file(file_key):
FILE: jamdict/data/setup_jmdict.sql
type meta (line 2) | CREATE TABLE IF NOT EXISTS meta (
type Entry (line 11) | CREATE TABLE Entry (
type Link (line 16) | CREATE TABLE Link (
type Bib (line 26) | CREATE TABLE Bib (
type Etym (line 35) | CREATE TABLE Etym (
type Audit (line 42) | CREATE TABLE Audit (
type Kanji (line 52) | CREATE TABLE Kanji (
type KJI (line 60) | CREATE TABLE KJI (
type KJP (line 67) | CREATE TABLE KJP (
type Kana (line 76) | CREATE TABLE Kana (
type KNR (line 85) | CREATE TABLE KNR (
type KNI (line 92) | CREATE TABLE KNI (
type KNP (line 99) | CREATE TABLE KNP (
type Sense (line 108) | CREATE TABLE Sense (
type stagk (line 114) | CREATE TABLE stagk (
type stagr (line 120) | CREATE TABLE stagr (
type pos (line 126) | CREATE TABLE pos (
type xref (line 132) | CREATE TABLE xref (
type antonym (line 138) | CREATE TABLE antonym (
type field (line 144) | CREATE TABLE field (
type misc (line 150) | CREATE TABLE misc (
type SenseInfo (line 156) | CREATE TABLE SenseInfo (
type SenseSource (line 162) | CREATE TABLE SenseSource (
type dialect (line 171) | CREATE TABLE dialect (
type SenseGloss (line 177) | CREATE TABLE SenseGloss (
type Link_idseq (line 189) | CREATE INDEX Link_idseq ON Link(idseq)
type Link_tag (line 190) | CREATE INDEX Link_tag ON Link(tag)
type Bib_idseq (line 191) | CREATE INDEX Bib_idseq ON Link(idseq)
type Etym_idseq (line 192) | CREATE INDEX Etym_idseq ON Etym(idseq)
type Audit_idseq (line 193) | CREATE INDEX Audit_idseq ON Audit(idseq)
type Kanji_idseq (line 195) | CREATE INDEX Kanji_idseq ON Kanji(idseq)
type Kanji_text (line 196) | CREATE INDEX Kanji_text ON Kanji(text)
type KJI_kid (line 197) | CREATE INDEX KJI_kid ON KJI(kid)
type KJP_kid (line 198) | CREATE INDEX KJP_kid ON KJP(kid)
type Kana_idseq (line 200) | CREATE INDEX Kana_idseq ON Kana(idseq)
type Kana_text (line 201) | CREATE INDEX Kana_text ON Kana(text)
type KNR_kid (line 202) | CREATE INDEX KNR_kid ON KNR(kid)
type KNR_text (line 203) | CREATE INDEX KNR_text ON KNR(text)
type KNI_kid (line 204) | CREATE INDEX KNI_kid ON KNI(kid)
type KNI_text (line 205) | CREATE INDEX KNI_text ON KNI(text)
type KNP_kid (line 206) | CREATE INDEX KNP_kid ON KNP(kid)
type KNP_text (line 207) | CREATE INDEX KNP_text ON KNP(text)
type Sense_idseq (line 209) | CREATE INDEX Sense_idseq ON Sense(idseq)
type stagk_sid (line 210) | CREATE INDEX stagk_sid ON stagk(sid)
type stagk_text (line 211) | CREATE INDEX stagk_text ON stagk(text)
type stagr_sid (line 212) | CREATE INDEX stagr_sid ON stagr(sid)
type stagr_text (line 213) | CREATE INDEX stagr_text ON stagr(text)
type pos_sid (line 214) | CREATE INDEX pos_sid ON pos(sid)
type pos_text (line 215) | CREATE INDEX pos_text ON pos(text)
type xref_sid (line 216) | CREATE INDEX xref_sid ON xref(sid)
type xref_text (line 217) | CREATE INDEX xref_text ON xref(text)
type antonym_sid (line 218) | CREATE INDEX antonym_sid ON antonym(sid)
type antonym_text (line 219) | CREATE INDEX antonym_text ON antonym(text)
type field_sid (line 220) | CREATE INDEX field_sid ON field(sid)
type field_text (line 221) | CREATE INDEX field_text ON field(text)
type misc_sid (line 222) | CREATE INDEX misc_sid ON misc(sid)
type misc_text (line 223) | CREATE INDEX misc_text ON misc(text)
type SenseInfo_sid (line 224) | CREATE INDEX SenseInfo_sid ON SenseInfo(sid)
type SenseInfo_text (line 225) | CREATE INDEX SenseInfo_text ON SenseInfo(text)
type SenseSource_sid (line 226) | CREATE INDEX SenseSource_sid ON SenseSource(sid)
type SenseSource_text (line 227) | CREATE INDEX SenseSource_text ON SenseSource(text)
type dialect_sid (line 228) | CREATE INDEX dialect_sid ON dialect(sid)
type dialect_text (line 229) | CREATE INDEX dialect_text ON dialect(text)
type SenseGloss_sid (line 230) | CREATE INDEX SenseGloss_sid ON SenseGloss(sid)
type SenseGloss_lang (line 231) | CREATE INDEX SenseGloss_lang ON SenseGloss(lang)
type SenseGloss_gend (line 232) | CREATE INDEX SenseGloss_gend ON SenseGloss(gend)
type SenseGloss_text (line 233) | CREATE INDEX SenseGloss_text ON SenseGloss(text)
FILE: jamdict/data/setup_jmnedict.sql
type meta (line 2) | CREATE TABLE IF NOT EXISTS meta (
type NEEntry (line 11) | CREATE TABLE NEEntry (
type NEKanji (line 18) | CREATE TABLE NEKanji (
type NEKana (line 28) | CREATE TABLE NEKana (
type NETranslation (line 39) | CREATE TABLE NETranslation (
type NETransType (line 45) | CREATE TABLE NETransType (
type NETransXRef (line 51) | CREATE TABLE NETransXRef (
type NETransGloss (line 57) | CREATE TABLE NETransGloss (
type NEKanji_idseq (line 69) | CREATE INDEX NEKanji_idseq ON NEKanji(idseq)
type NEKanji_text (line 70) | CREATE INDEX NEKanji_text ON NEKanji(text)
type NEKana_idseq (line 72) | CREATE INDEX NEKana_idseq ON NEKana(idseq)
type NEKana_text (line 73) | CREATE INDEX NEKana_text ON NEKana(text)
type NETranslation_idseq (line 75) | CREATE INDEX NETranslation_idseq ON NETranslation(idseq)
type NETransType_tid (line 76) | CREATE INDEX NETransType_tid ON NETransType(tid)
type NETransType_text (line 77) | CREATE INDEX NETransType_text ON NETransType(text)
type NETransXRef_tid (line 78) | CREATE INDEX NETransXRef_tid ON NETransXRef(tid)
type NETransXRef_text (line 79) | CREATE INDEX NETransXRef_text ON NETransXRef(text)
type NETransGloss_tid (line 80) | CREATE INDEX NETransGloss_tid ON NETransGloss(tid)
type NETransGloss_lang (line 81) | CREATE INDEX NETransGloss_lang ON NETransGloss(lang)
type NETransGloss_text (line 82) | CREATE INDEX NETransGloss_text ON NETransGloss(text)
FILE: jamdict/data/setup_kanjidic2.sql
type meta (line 2) | CREATE TABLE IF NOT EXISTS meta (
type character (line 11) | CREATE TABLE character (
type codepoint (line 20) | CREATE TABLE codepoint (
type radical (line 27) | CREATE TABLE radical (
type stroke_miscount (line 34) | CREATE TABLE stroke_miscount (
type variant (line 40) | CREATE TABLE variant (
type rad_name (line 47) | CREATE TABLE rad_name (
type dic_ref (line 53) | CREATE TABLE dic_ref (
type query_code (line 62) | CREATE TABLE query_code (
type nanori (line 70) | CREATE TABLE nanori (
type rm_group (line 76) | CREATE TABLE rm_group (
type reading (line 82) | CREATE TABLE reading (
type meaning (line 91) | CREATE TABLE meaning (
type character_literal (line 102) | CREATE INDEX character_literal ON character(literal)
type character_stroke_count (line 103) | CREATE INDEX character_stroke_count ON character(stroke_count)
type character_grade (line 104) | CREATE INDEX character_grade ON character(grade)
type character_jlpt (line 105) | CREATE INDEX character_jlpt ON character(jlpt)
type codepoint_value (line 107) | CREATE INDEX codepoint_value ON codepoint(value)
type radical_value (line 108) | CREATE INDEX radical_value ON radical(value)
type variant_value (line 109) | CREATE INDEX variant_value ON variant(value)
type rad_name_value (line 110) | CREATE INDEX rad_name_value ON rad_name(value)
type dic_ref_value (line 111) | CREATE INDEX dic_ref_value ON dic_ref(value)
type query_code_value (line 112) | CREATE INDEX query_code_value ON query_code(value)
type nanori_value (line 113) | CREATE INDEX nanori_value ON nanori(value)
type rm_group_cid (line 114) | CREATE INDEX rm_group_cid ON rm_group(cid)
type reading_r_type (line 115) | CREATE INDEX reading_r_type ON reading(r_type)
type reading_value (line 116) | CREATE INDEX reading_value ON reading(value)
type meaning_value (line 117) | CREATE INDEX meaning_value ON meaning(value)
type meaning_m_lang (line 118) | CREATE INDEX meaning_m_lang ON meaning(m_lang)
FILE: jamdict/jmdict.py
class JMDEntry (line 31) | class JMDEntry(object):
method __init__ (line 40) | def __init__(self, idseq=''):
method __len__ (line 48) | def __len__(self):
method __getitem__ (line 51) | def __getitem__(self, idx):
method set_info (line 54) | def set_info(self, info):
method text (line 59) | def text(self, compact=True, separator=' ', no_id=False):
method __repr__ (line 76) | def __repr__(self):
method __str__ (line 79) | def __str__(self):
method to_json (line 82) | def to_json(self):
method to_dict (line 87) | def to_dict(self):
class KanjiForm (line 97) | class KanjiForm(object):
method __init__ (line 114) | def __init__(self, text=''):
method set_text (line 158) | def set_text(self, text):
method to_json (line 163) | def to_json(self):
method to_dict (line 168) | def to_dict(self):
method __repr__ (line 176) | def __repr__(self):
method __str__ (line 179) | def __str__(self):
class KanaForm (line 183) | class KanaForm(object):
method __init__ (line 198) | def __init__(self, text='', nokanji=False):
method set_text (line 226) | def set_text(self, text):
method to_json (line 231) | def to_json(self):
method to_dict (line 236) | def to_dict(self):
method __repr__ (line 247) | def __repr__(self):
method __str__ (line 250) | def __str__(self):
class EntryInfo (line 254) | class EntryInfo(object):
method __init__ (line 258) | def __init__(self):
method to_json (line 267) | def to_json(self):
method to_dict (line 272) | def to_dict(self):
class Link (line 279) | class Link(object):
method __init__ (line 286) | def __init__(self, tag, desc, uri):
method to_json (line 291) | def to_json(self):
method to_dict (line 296) | def to_dict(self):
class BibInfo (line 302) | class BibInfo(object):
method __init__ (line 310) | def __init__(self, tag='', text=''):
method set_tag (line 314) | def set_tag(self, tag):
method set_text (line 319) | def set_text(self, text):
method to_json (line 324) | def to_json(self):
method to_dict (line 329) | def to_dict(self):
class Audit (line 333) | class Audit(object):
method __init__ (line 338) | def __init__(self, upd_date, upd_detl):
method to_json (line 342) | def to_json(self):
method to_dict (line 347) | def to_dict(self):
class Sense (line 351) | class Sense(object):
method __init__ (line 358) | def __init__(self):
method __repr__ (line 416) | def __repr__(self):
method __str__ (line 419) | def __str__(self):
method text (line 422) | def text(self, compact=True):
method to_json (line 429) | def to_json(self):
method to_dict (line 434) | def to_dict(self):
class Translation (line 461) | class Translation(Sense):
method __init__ (line 465) | def __init__(self):
method name_type_human (line 471) | def name_type_human(self):
method text (line 474) | def text(self, compact=True):
method to_json (line 479) | def to_json(self):
method to_dict (line 484) | def to_dict(self):
class SenseGloss (line 490) | class SenseGloss(object):
method __init__ (line 511) | def __init__(self, lang, gend, text):
method __repr__ (line 516) | def __repr__(self):
method __str__ (line 519) | def __str__(self):
method to_json (line 528) | def to_json(self):
method to_dict (line 533) | def to_dict(self):
class LSource (line 544) | class LSource:
method __init__ (line 565) | def __init__(self, lang, lstype, wasei, text):
method to_json (line 571) | def to_json(self):
method to_dict (line 576) | def to_dict(self):
class Meta (line 600) | class Meta(object):
method __init__ (line 602) | def __init__(self, key='', value=''):
method __repr__ (line 606) | def __repr__(self):
method __str__ (line 609) | def __str__(self):
class JMDictXMLParser (line 613) | class JMDictXMLParser(object):
method __init__ (line 617) | def __init__(self):
method parse_file (line 620) | def parse_file(self, jmdict_file_path):
method parse_entry_tag (line 636) | def parse_entry_tag(self, etag):
method parse_ent_seq (line 658) | def parse_ent_seq(self, seq_tag, entry):
method get_single (line 664) | def get_single(self, tag_name, a_tag):
method parse_k_ele (line 673) | def parse_k_ele(self, k_ele, entry):
method parse_r_ele (line 688) | def parse_r_ele(self, r_ele, entry):
method parse_info (line 707) | def parse_info(self, info_tag, entry):
method parse_link (line 723) | def parse_link(self, link_tag, entry_info):
method parse_bibinfo (line 731) | def parse_bibinfo(self, bib_tag, entry_info):
method parse_ne_translation (line 743) | def parse_ne_translation(self, trans_tag, entry):
method parse_sense (line 761) | def parse_sense(self, sense_tag, entry):
method get_attrib (line 793) | def get_attrib(self, a_tag, attr_name, default_value=''):
method parse_sensegloss (line 801) | def parse_sensegloss(self, gloss_tag, sense):
method parse_lsource (line 809) | def parse_lsource(self, lsource_tag, sense):
FILE: jamdict/jmdict_sqlite.py
function getLogger (line 39) | def getLogger():
class JMDictSchema (line 47) | class JMDictSchema(Schema):
method __init__ (line 52) | def __init__(self, db_path, *args, **kwargs):
class JMDictSQLite (line 87) | class JMDictSQLite(JMDictSchema):
method __init__ (line 89) | def __init__(self, db_path, *args, **kwargs):
method update_jmd_meta (line 92) | def update_jmd_meta(self, version, url, ctx=None):
method all_pos (line 113) | def all_pos(self, ctx=None):
method _build_search_query (line 119) | def _build_search_query(self, query, pos=None):
method search (line 149) | def search(self, query, ctx=None, pos=None, **kwargs):
method search_iter (line 161) | def search_iter(self, query, ctx=None, pos=None, **kwargs):
method get_entry (line 171) | def get_entry(self, idseq, ctx=None):
method insert_entries (line 274) | def insert_entries(self, entries, ctx=None):
method insert_entry (line 284) | def insert_entry(self, entry, ctx=None):
FILE: jamdict/jmnedict_sqlite.py
function getLogger (line 47) | def getLogger():
class JMNEDictSchema (line 55) | class JMNEDictSchema(Schema):
method __init__ (line 57) | def __init__(self, db_path, *args, **kwargs):
class JMNEDictSQLite (line 75) | class JMNEDictSQLite(JMNEDictSchema):
method __init__ (line 77) | def __init__(self, db_path, *args, **kwargs):
method all_ne_type (line 80) | def all_ne_type(self, ctx=None):
method _build_ne_search_query (line 86) | def _build_ne_search_query(self, query):
method search_ne (line 104) | def search_ne(self, query, ctx=None, **kwargs) -> Sequence[JMDEntry]:
method search_ne_iter (line 115) | def search_ne_iter(self, query, ctx=None, **kwargs):
method get_ne (line 124) | def get_ne(self, idseq, ctx=None) -> JMDEntry:
method insert_name_entities (line 161) | def insert_name_entities(self, entries, ctx=None):
method insert_name_entity (line 170) | def insert_name_entity(self, entry, ctx=None):
FILE: jamdict/kanjidic2.py
function getLogger (line 36) | def getLogger():
class KanjiDic2 (line 44) | class KanjiDic2(object):
method __init__ (line 46) | def __init__(self, file_version, database_version, date_of_creation):
method __len__ (line 73) | def __len__(self):
method __getitem__ (line 76) | def __getitem__(self, idx):
class Character (line 80) | class Character(object):
method __init__ (line 85) | def __init__(self):
method text (line 107) | def text(self):
method __repr__ (line 110) | def __repr__(self):
method __str__ (line 114) | def __str__(self):
method meanings (line 117) | def meanings(self, english_only=False):
method components (line 128) | def components(self):
method radical (line 136) | def radical(self):
method to_json (line 143) | def to_json(self):
method to_dict (line 148) | def to_dict(self):
class CodePoint (line 165) | class CodePoint(object):
method __init__ (line 167) | def __init__(self, cp_type='', value=''):
method __repr__ (line 178) | def __repr__(self):
method __str__ (line 184) | def __str__(self):
method to_json (line 187) | def to_json(self):
method to_dict (line 192) | def to_dict(self):
class Radical (line 196) | class Radical(object):
method __init__ (line 198) | def __init__(self, rad_type='', value=''):
method __repr__ (line 209) | def __repr__(self):
method __str__ (line 215) | def __str__(self):
method to_json (line 218) | def to_json(self):
method to_dict (line 223) | def to_dict(self):
class Variant (line 227) | class Variant(object):
method __init__ (line 229) | def __init__(self, var_type='', value=''):
method __repr__ (line 256) | def __repr__(self):
method __str__ (line 262) | def __str__(self):
method to_json (line 265) | def to_json(self):
method to_dict (line 270) | def to_dict(self):
class DicRef (line 274) | class DicRef(object):
method __init__ (line 276) | def __init__(self, dr_type='', value='', m_vol='', m_page=''):
method __repr__ (line 327) | def __repr__(self):
method __str__ (line 333) | def __str__(self):
method to_json (line 336) | def to_json(self):
method to_dict (line 341) | def to_dict(self):
class QueryCode (line 348) | class QueryCode(object):
method __init__ (line 350) | def __init__(self, qc_type='', value='', skip_misclass=""):
method __repr__ (line 405) | def __repr__(self):
method __str__ (line 411) | def __str__(self):
method to_json (line 414) | def to_json(self):
method to_dict (line 419) | def to_dict(self):
class RMGroup (line 423) | class RMGroup(object):
method __init__ (line 425) | def __init__(self, readings=None, meanings=None):
method __repr__ (line 440) | def __repr__(self):
method __str__ (line 445) | def __str__(self):
method on_readings (line 449) | def on_readings(self):
method kun_readings (line 453) | def kun_readings(self):
method other_readings (line 457) | def other_readings(self):
method to_json (line 460) | def to_json(self):
method to_dict (line 465) | def to_dict(self):
class Reading (line 473) | class Reading(object):
method __init__ (line 475) | def __init__(self, r_type='', value='', on_type="", r_status=""):
method __repr__ (line 520) | def __repr__(self):
method __str__ (line 526) | def __str__(self):
method to_json (line 529) | def to_json(self):
method to_dict (line 534) | def to_dict(self):
class Meaning (line 541) | class Meaning(object):
method __init__ (line 543) | def __init__(self, value='', m_lang=''):
method __repr__ (line 558) | def __repr__(self):
method __str__ (line 564) | def __str__(self):
method to_json (line 567) | def to_json(self):
method to_dict (line 572) | def to_dict(self):
class Kanjidic2XMLParser (line 576) | class Kanjidic2XMLParser(object):
method __init__ (line 580) | def __init__(self):
method get_attrib (line 583) | def get_attrib(self, a_tag, attr_name, default_value=''):
method parse_file (line 591) | def parse_file(self, kd2_file_path):
method parse_header (line 610) | def parse_header(self, e):
method parse_char (line 623) | def parse_char(self, e):
method parse_codepoint (line 644) | def parse_codepoint(self, e, char):
method parse_radical (line 652) | def parse_radical(self, e, char):
method parse_misc (line 660) | def parse_misc(self, e, char):
method parse_dic_refs (line 682) | def parse_dic_refs(self, e, char):
method parse_query_code (line 693) | def parse_query_code(self, e, char):
method parse_reading_meaning (line 702) | def parse_reading_meaning(self, e, char):
FILE: jamdict/kanjidic2_sqlite.py
function getLogger (line 50) | def getLogger():
class KanjiDic2Schema (line 58) | class KanjiDic2Schema(Schema):
method __init__ (line 64) | def __init__(self, db_path, *args, **kwargs):
class KanjiDic2SQLite (line 84) | class KanjiDic2SQLite(KanjiDic2Schema):
method __init__ (line 86) | def __init__(self, db_path, *args, **kwargs):
method update_kd2_meta (line 89) | def update_kd2_meta(self, file_version, database_version, date_of_crea...
method insert_chars (line 117) | def insert_chars(self, chars, ctx=None):
method insert_char (line 126) | def insert_char(self, c, ctx=None):
method search_chars_iter (line 175) | def search_chars_iter(self, chars, ctx=None):
method get_char (line 184) | def get_char(self, literal, ctx=None):
method char_by_id (line 196) | def char_by_id(self, cid, ctx=None):
FILE: jamdict/krad.py
class KRad (line 32) | class KRad:
method __init__ (line 36) | def __init__(self, **kwargs):
method _build_krad_map (line 43) | def _build_krad_map(self):
method radk (line 62) | def radk(self) -> Mapping:
method krad (line 68) | def krad(self) -> Mapping:
FILE: jamdict/tools.py
function getLogger (line 32) | def getLogger():
function get_jam (line 40) | def get_jam(cli, args):
function import_data (line 66) | def import_data(cli, args):
function dump_result (line 87) | def dump_result(results, report=None):
function lookup (line 134) | def lookup(cli, args):
function file_status (line 153) | def file_status(file_path):
function hello_jamdict (line 161) | def hello_jamdict(cli, args):
function show_info (line 171) | def show_info(cli, args):
function show_version (line 223) | def show_version(cli, args):
function config_jamdict (line 231) | def config_jamdict(cli, args):
function add_data_config (line 244) | def add_data_config(parser):
function main (line 254) | def main():
FILE: jamdict/util.py
function getLogger (line 46) | def getLogger():
class LookupResult (line 52) | class LookupResult(object):
method __init__ (line 65) | def __init__(self, entries, chars, names=None):
method entries (line 71) | def entries(self) -> Sequence[JMDEntry]:
method entries (line 80) | def entries(self, values: Sequence[JMDEntry]):
method chars (line 84) | def chars(self) -> Sequence[Character]:
method chars (line 93) | def chars(self, values: Sequence[Character]):
method names (line 97) | def names(self) -> Sequence[JMDEntry]:
method names (line 106) | def names(self, values: Sequence[JMDEntry]):
method text (line 109) | def text(self, compact=True, entry_sep='。', separator=' | ', no_id=Fal...
method __repr__ (line 151) | def __repr__(self):
method __str__ (line 154) | def __str__(self):
method to_json (line 157) | def to_json(self):
method to_dict (line 162) | def to_dict(self) -> dict:
class IterLookupResult (line 168) | class IterLookupResult(object):
method __init__ (line 188) | def __init__(self, entries, chars=None, names=None):
method entries (line 194) | def entries(self):
method chars (line 199) | def chars(self):
method names (line 204) | def names(self):
class JamdictSQLite (line 209) | class JamdictSQLite(KanjiDic2SQLite, JMNEDictSQLite, JMDictSQLite):
method __init__ (line 211) | def __init__(self, db_file, *args, **kwargs):
class Jamdict (line 215) | class Jamdict(object):
method __init__ (line 254) | def __init__(self, db_file=None, kd2_file=None,
method ready (line 305) | def ready(self) -> bool:
method __del__ (line 309) | def __del__(self):
method __make_db_ctx (line 317) | def __make_db_ctx(self) -> ExecutionContext:
method db_file (line 329) | def db_file(self):
method db_file (line 333) | def db_file(self, value):
method kd2_file (line 340) | def kd2_file(self):
method kd2_file (line 344) | def kd2_file(self, value):
method jmnedict_file (line 351) | def jmnedict_file(self):
method jmnedict_file (line 355) | def jmnedict_file(self, value):
method memory_mode (line 362) | def memory_mode(self):
method jmdict (line 367) | def jmdict(self):
method kd2 (line 381) | def kd2(self):
method jmnedict (line 397) | def jmnedict(self):
method jmdict_xml (line 414) | def jmdict_xml(self):
method krad (line 423) | def krad(self):
method radk (line 436) | def radk(self):
method kd2_xml (line 449) | def kd2_xml(self):
method jmne_xml (line 458) | def jmne_xml(self):
method has_kd2 (line 466) | def has_kd2(self) -> bool:
method has_jmne (line 469) | def has_jmne(self, ctx=None) -> bool:
method is_available (line 476) | def is_available(self) -> bool:
method import_data (line 484) | def import_data(self):
method get_ne (line 527) | def get_ne(self, idseq, ctx=None) -> JMDEntry:
method get_char (line 538) | def get_char(self, literal, ctx=None) -> Character:
method get_entry (line 548) | def get_entry(self, idseq) -> JMDEntry:
method all_pos (line 556) | def all_pos(self, ctx=None) -> List[str]:
method all_ne_type (line 565) | def all_ne_type(self, ctx=None) -> List[str]:
method lookup (line 574) | def lookup(self, query, strict_lookup=False, lookup_chars=True, ctx=None,
method lookup_iter (line 630) | def lookup_iter(self, query, strict_lookup=False,
class JMDictXML (line 684) | class JMDictXML(object):
method __init__ (line 687) | def __init__(self, entries):
method __len__ (line 699) | def __len__(self):
method __getitem__ (line 702) | def __getitem__(self, idx):
method lookup (line 705) | def lookup(self, a_query) -> Sequence[JMDEntry]:
method from_file (line 716) | def from_file(filename):
class JMNEDictXML (line 721) | class JMNEDictXML(JMDictXML):
class KanjiDic2XML (line 725) | class KanjiDic2XML(object):
method __init__ (line 727) | def __init__(self, kd2):
method __len__ (line 737) | def __len__(self):
method __getitem__ (line 740) | def __getitem__(self, idx):
method lookup (line 743) | def lookup(self, char):
method from_file (line 750) | def from_file(filename):
FILE: jamdol-flask.py
function getLogger (line 51) | def getLogger():
function jsonp (line 59) | def jsonp(func):
function get_entry (line 79) | def get_entry(idseq):
function search (line 87) | def search(query, strict=None):
function index (line 94) | def index():
function version (line 100) | def version():
FILE: setup.py
function read (line 17) | def read(*filenames, **kwargs):
FILE: test/test_jamdict.py
function getLogger (line 32) | def getLogger():
function all_kana (line 36) | def all_kana(result, forms=None):
function all_kanji (line 44) | def all_kanji(result, forms=None):
class TestConfig (line 52) | class TestConfig(unittest.TestCase):
method test_config (line 54) | def test_config(self):
method setUpClass (line 154) | def setUpClass(cls):
method tearDownClass (line 158) | def tearDownClass(cls):
method clean_config_file (line 162) | def clean_config_file(cls):
method test_config_file (line 168) | def test_config_file(self):
method test_ensure_config (line 176) | def test_ensure_config(self):
method test_home_dir (line 182) | def test_home_dir(self):
class TestModels (line 61) | class TestModels(unittest.TestCase):
method test_basic_models (line 63) | def test_basic_models(self):
method test_lookup_result (line 75) | def test_lookup_result(self):
class TestJamdictXML (line 86) | class TestJamdictXML(unittest.TestCase):
method setUpClass (line 89) | def setUpClass(cls):
method test_jmdict_xml (line 93) | def test_jmdict_xml(self):
method test_jmdict_fields (line 100) | def test_jmdict_fields(self):
method test_jmdict_json (line 108) | def test_jmdict_json(self):
method test_kanjidic2_xml (line 117) | def test_kanjidic2_xml(self):
method test_kanjidic2_json (line 129) | def test_kanjidic2_json(self):
method test_jamdict_xml (line 136) | def test_jamdict_xml(self):
class TestConfig (line 148) | class TestConfig(unittest.TestCase):
method test_config (line 54) | def test_config(self):
method setUpClass (line 154) | def setUpClass(cls):
method tearDownClass (line 158) | def tearDownClass(cls):
method clean_config_file (line 162) | def clean_config_file(cls):
method test_config_file (line 168) | def test_config_file(self):
method test_ensure_config (line 176) | def test_ensure_config(self):
method test_home_dir (line 182) | def test_home_dir(self):
class TestAPIWarning (line 202) | class TestAPIWarning(unittest.TestCase):
method test_warn_to_json_deprecated (line 204) | def test_warn_to_json_deprecated(self):
class TestJamdictSQLite (line 218) | class TestJamdictSQLite(unittest.TestCase):
method tearDownClass (line 221) | def tearDownClass(cls):
method test_search_by_pos (line 225) | def test_search_by_pos(self):
method test_search_by_ne_type (line 272) | def test_search_by_ne_type(self):
method test_find_all_verbs (line 293) | def test_find_all_verbs(self):
method test_jamdict_data (line 315) | def test_jamdict_data(self):
method test_jamdict_sqlite_all (line 338) | def test_jamdict_sqlite_all(self):
method test_lookup_iter (line 376) | def test_lookup_iter(self):
FILE: test/test_jmdict_sqlite.py
function getLogger (line 40) | def getLogger():
class TestJamdictSQLite (line 48) | class TestJamdictSQLite(unittest.TestCase):
method __init__ (line 50) | def __init__(self, *args, **kwargs):
method setUpClass (line 57) | def setUpClass(cls):
method test_xml2sqlite (line 62) | def test_xml2sqlite(self):
method test_import_to_ram (line 77) | def test_import_to_ram(self):
method test_import_function (line 84) | def test_import_function(self):
method test_search (line 89) | def test_search(self):
method test_iter_search (line 107) | def test_iter_search(self):
FILE: test/test_jmnedict.py
function getLogger (line 32) | def getLogger():
class TestJMendictModels (line 40) | class TestJMendictModels(unittest.TestCase):
method extract_fields (line 46) | def extract_fields(self):
method test_ne_type_map (line 56) | def test_ne_type_map(self):
method test_jmne_support (line 64) | def test_jmne_support(self):
method test_xml2ramdb (line 73) | def test_xml2ramdb(self):
method test_query_netype (line 137) | def test_query_netype(self):
FILE: test/test_kanjidic2_sqlite.py
function getLogger (line 31) | def getLogger():
class TestJamdictSQLite (line 39) | class TestJamdictSQLite(unittest.TestCase):
method setUpClass (line 46) | def setUpClass(cls):
method test_xml2sqlite (line 51) | def test_xml2sqlite(self):
method test_reading_order (line 82) | def test_reading_order(self):
FILE: test/test_krad.py
function getLogger (line 34) | def getLogger():
class TestConfig (line 38) | class TestConfig(unittest.TestCase):
method test_config (line 40) | def test_config(self):
class TestModels (line 47) | class TestModels(unittest.TestCase):
method test_read_krad (line 49) | def test_read_krad(self):
Condensed preview — 55 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (500K chars).
[
{
"path": ".gitignore",
"chars": 1082,
"preview": ".idea/\ntest/data/test.db\n*.py~\n*.sh~\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C exte"
},
{
"path": ".gitmodules",
"chars": 0,
"preview": ""
},
{
"path": "LICENSE",
"chars": 1112,
"preview": "MIT License\r\n\r\nCopyright (c) 2016 Le Tuan Anh <tuananh.ke@gmail.com>\r\n\r\nPermission is hereby granted, free of charge, to"
},
{
"path": "MANIFEST.in",
"chars": 194,
"preview": "include README.rst\ninclude CHANGES.md\ninclude LICENSE\ninclude requirements*.txt\nrecursive-include jamdict/data/ *.sql\nre"
},
{
"path": "README.md",
"chars": 7159,
"preview": "# Jamdict\r\n\r\n[Jamdict](https://github.com/neocl/jamdict) is a Python 3 library for manipulating Jim Breen's JMdict, Kanj"
},
{
"path": "TODO.md",
"chars": 0,
"preview": ""
},
{
"path": "_config.yml",
"chars": 27,
"preview": "theme: jekyll-theme-minimal"
},
{
"path": "data/README.md",
"chars": 73,
"preview": "Copy dictionary files (JMdict_e.xml, kanjidic2.xml, kradfile, etc.) here\n"
},
{
"path": "docs/Makefile",
"chars": 691,
"preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the "
},
{
"path": "docs/api.rst",
"chars": 701,
"preview": ".. _api_index:\n\njamdict APIs\n============\n\nAn overview of jamdict modules.\n\n.. warning::\n 👉 ⚠️ THIS SECTION IS STILL "
},
{
"path": "docs/conf.py",
"chars": 1984,
"preview": "# Configuration file for the Sphinx documentation builder.\n#\n# This file only contains a selection of the most common op"
},
{
"path": "docs/contributing.rst",
"chars": 2471,
"preview": ".. _contributing:\n\nContributing\n============\n\nThere are many ways to contribute to the Jamdict project.\nThe one that Jam"
},
{
"path": "docs/index.rst",
"chars": 5068,
"preview": "\nJamdict's documentation!\n========================\n\n`Jamdict <https://github.com/neocl/jamdict>`_ is a Python 3 library "
},
{
"path": "docs/install.rst",
"chars": 3496,
"preview": ".. _installpage:\n\nInstallation\n=============\n\njamdict and jamdict dictionary data are both available on PyPI and can be "
},
{
"path": "docs/make.bat",
"chars": 795,
"preview": "@ECHO OFF\r\n\r\npushd %~dp0\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sp"
},
{
"path": "docs/recipes.rst",
"chars": 8023,
"preview": ".. _recipes:\n\nCommon Recipes\n==============\n\n.. contents::\n :local: \n :depth: 2\n\n.. warning::\n 👉 ⚠️ THIS SECTIO"
},
{
"path": "docs/requirements.txt",
"chars": 15,
"preview": "jamdict\nSphinx\n"
},
{
"path": "docs/tutorials.rst",
"chars": 1970,
"preview": "Tutorials\n=========\n\nGetting started\n---------------\n\nJust install ``jamdict`` and ``jamdict_data`` packages via pip and"
},
{
"path": "docs/updates.rst",
"chars": 1743,
"preview": ".. _updates:\n\nJamdict Changelog\n=================\n\njamdict 0.1a11\n--------------\n\n- 2021-05-25\n\n - Added ``lookup_iter"
},
{
"path": "jamdict/__init__.py",
"chars": 2510,
"preview": "# -*- coding: utf-8 -*-\n\n'''\nPython library for manipulating Jim Breen's JMdict and KanjiDic2\nLatest version can be foun"
},
{
"path": "jamdict/__main__.py",
"chars": 33,
"preview": "from . import tools\ntools.main()\n"
},
{
"path": "jamdict/__version__.py",
"chars": 1729,
"preview": "# -*- coding: utf-8 -*-\n\n# jamdict's package version information\n__author__ = \"Le Tuan Anh\"\n__email__ = \"tuananh.ke@gmai"
},
{
"path": "jamdict/config.py",
"chars": 4132,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nJamdict configuration management\n\"\"\"\n\n# This code is a part of jamdict library: https://git"
},
{
"path": "jamdict/data/config_template.json",
"chars": 327,
"preview": "{\n \"JAMDICT_HOME\": \"~/.jamdict\",\n \"JAMDICT_DATA\": \"{JAMDICT_HOME}/data\",\n \"JAMDICT_DB\": \"{JAMDICT_DATA}/jamdict"
},
{
"path": "jamdict/data/setup_jmdict.sql",
"chars": 5939,
"preview": "/* Add meta info */\nCREATE TABLE IF NOT EXISTS meta (\n key TEXT PRIMARY KEY NOT NULL,\n value TEXT NOT NULL\n)"
},
{
"path": "jamdict/data/setup_jmnedict.sql",
"chars": 2631,
"preview": "/* Add meta info */\nCREATE TABLE IF NOT EXISTS meta (\n key TEXT PRIMARY KEY NOT NULL,\n value TEXT NOT NULL\n)"
},
{
"path": "jamdict/data/setup_kanjidic2.sql",
"chars": 3040,
"preview": "/* Add meta info */\nCREATE TABLE IF NOT EXISTS meta (\n key TEXT UNIQUE,\n value TEXT NOT NULL\n);\n\n-----------"
},
{
"path": "jamdict/jmdict.py",
"chars": 32766,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nJMdict Models\n\"\"\"\n\n# This code is a part of jamdict library: https://github.com/neocl/jamdi"
},
{
"path": "jamdict/jmdict_sqlite.py",
"chars": 13948,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nJMDict in SQLite format\n\"\"\"\n\n# This code is a part of jamdict library: https://github.com/n"
},
{
"path": "jamdict/jmnedict_sqlite.py",
"chars": 8061,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nJapanese Multilingual Named Entity Dictionary (JMnedict) in SQLite format\n\nReferences:\n "
},
{
"path": "jamdict/kanjidic2.py",
"chars": 26784,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nKanjidic2 models\n\"\"\"\n\n# This code is a part of jamdict library: https://github.com/neocl/ja"
},
{
"path": "jamdict/kanjidic2_sqlite.py",
"chars": 8122,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nKanjiDic2 in SQLite format\n\nReferences:\n KANJIDIC2 project\n https://www.edrdg.org"
},
{
"path": "jamdict/krad.py",
"chars": 2368,
"preview": "# -*- coding: utf-8 -*-\r\n\r\n\"\"\"\r\njamdict.krad is a module for retrieving kanji components (i.e. radicals)\r\n\"\"\"\r\n\r\n# This "
},
{
"path": "jamdict/tools.py",
"chars": 11725,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nJamdict console app\n\"\"\"\n\n# This code is a part of jamdict library: https://github.com/neocl"
},
{
"path": "jamdict/util.py",
"chars": 30297,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nJamdict public APIs\n\"\"\"\n\n# This code is a part of jamdict library: https://github.com/neocl"
},
{
"path": "jamdict_demo.py",
"chars": 4596,
"preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\nJamdict demo application\n\nLatest version can be found at https://git"
},
{
"path": "jamdol-flask.py",
"chars": 3668,
"preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\njamdol - JAMDict OnLine (REST server)\nLatest version can be found at"
},
{
"path": "jmd",
"chars": 106,
"preview": "#!/bin/bash\n\nexport JAMDICT_HOME=~/local/jamdict\ncd ${JAMDICT_HOME}\npython3 -m jamdict.tools lookup \"$@\"\n\n"
},
{
"path": "logging.json",
"chars": 1587,
"preview": "{\n \"version\": 1,\n \"disable_existing_loggers\": false,\n \"formatters\": {\n \"simple\": {\n \"format\":"
},
{
"path": "release.sh",
"chars": 94,
"preview": "#!/bin/bash\n\n# pandoc --from=markdown --to=rst README.md -o README.rst\npython3 setup.py sdist\n"
},
{
"path": "requirements.txt",
"chars": 51,
"preview": "chirptext >= 0.1, <= 0.2\npuchikarui >= 0.1, < 0.3\n\n"
},
{
"path": "run",
"chars": 68,
"preview": "#!/bin/bash\n\nexport FLASK_APP=jamdol-flask.py\nflask run --port 5002\n"
},
{
"path": "setup.py",
"chars": 2938,
"preview": "#!/usr/bin/env python3\r\n# -*- coding: utf-8 -*-\r\n\r\n'''\r\nSetup script for jamdict\r\n\r\nLatest version can be found at https"
},
{
"path": "test/__init__.py",
"chars": 470,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\" Jamdict Test Scripts\n\"\"\"\n\n# This code is a part of jamdict library: https://github.com/neoc"
},
{
"path": "test/data/JMdict_mini.xml",
"chars": 88192,
"preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE JMdict [\n<!ELEMENT JMdict (entry*)>\n<!-- "
},
{
"path": "test/data/jamdict.json",
"chars": 328,
"preview": "{\n \"JAMDICT_HOME\": \".\",\n \"JAMDICT_DATA\": \"{JAMDICT_HOME}/data\",\n \"JAMDICT_DB\": \"{JAMDICT_HOME}/test/data/jamdic"
},
{
"path": "test/data/jmendict_mini.xml",
"chars": 9490,
"preview": "<?xml version=\"1.0\"?>\n<!DOCTYPE JMnedict [\n<!--\n\tThis is the DTD of the Japanese-Multilingual Named Entity\n\tDictionary f"
},
{
"path": "test/data/kanjidic2_mini.xml",
"chars": 127460,
"preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE kanjidic2 [\n\t<!-- Version 1.6 - April 2008\n\tThis is the DTD of the XML-"
},
{
"path": "test/logging.json",
"chars": 1498,
"preview": "{\n \"version\": 1,\n \"disable_existing_loggers\": false,\n \"formatters\": {\n \"simple\": {\n \"format\":"
},
{
"path": "test/test_jamdict.py",
"chars": 15801,
"preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\nScript for testing jamdict library\n\"\"\"\n\n# This code is a part of jam"
},
{
"path": "test/test_jmdict_sqlite.py",
"chars": 4134,
"preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\nTest script for JMDict SQLite\n\"\"\"\n\n# This code is a part of jamdict "
},
{
"path": "test/test_jmnedict.py",
"chars": 6468,
"preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\nTest script for JMendict support\n\"\"\"\n\n# This code is a part of jamdi"
},
{
"path": "test/test_kanjidic2_sqlite.py",
"chars": 4723,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nTest script for Jamcha SQLite\n\"\"\"\n\n# This code is a part of jamdict library: https://github"
},
{
"path": "test/test_krad.py",
"chars": 1848,
"preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"\nScript for testing KRad module library\n\nReferences:\n RADKFILE/KR"
},
{
"path": "test.sh",
"chars": 42,
"preview": "#!/bin/bash\n\npython3 -m unittest discover\n"
}
]
About this extraction
This page contains the full source code of the neocl/jamdict GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 55 files (453.7 KB), approximately 143.3k tokens, and a symbol index with 469 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.