Repository: vietnlp/etnlp
Branch: master
Commit: 88862f63d4a8
Files: 59
Total size: 13.1 MB
Directory structure:
gitextract_cg5yuxp9/
├── .gitignore
├── README.md
└── src/
├── codes/
│ ├── 00.run_etnlp_preprocessing.sh
│ ├── 01.run_etnlp_evaluator.sh
│ ├── 02.run_etnlp_extractor.sh
│ ├── 03.run_etnlp_visualizer_inter.sh
│ ├── 04.run_etnlp_visualizer_sbs.sh
│ ├── api/
│ │ ├── __init__.py
│ │ ├── embedding_evaluator.py
│ │ ├── embedding_extractor.py
│ │ ├── embedding_preprocessing.py
│ │ └── embedding_visualizer.py
│ ├── embeddings/
│ │ ├── __init__.py
│ │ ├── embedding_configs.py
│ │ ├── embedding_models.py
│ │ └── embedding_utils.py
│ ├── etnlp_api.py
│ ├── requirements.txt
│ ├── setup.py
│ ├── utils/
│ │ ├── __init__.py
│ │ ├── emb_utils.py
│ │ ├── embedding_io.py
│ │ ├── eval_utils.py
│ │ ├── file_utils.py
│ │ ├── string_utils.py
│ │ ├── vectors.py
│ │ └── word.py
│ └── visualizer/
│ ├── README.md
│ ├── __init__.py
│ ├── outof_w2vec.dict
│ ├── static/
│ │ └── style.css
│ ├── templates/
│ │ ├── app.html
│ │ └── search.html
│ └── visualizer_sbs.py
├── data/
│ ├── embedding_analogies/
│ │ ├── english/
│ │ │ └── english-word-analogy.txt
│ │ ├── portuguese/
│ │ │ ├── LX-4WAnalogies-ETNLP.txt
│ │ │ ├── LX-4WAnalogies.txt
│ │ │ ├── POST_TAG_vocabulary.txt
│ │ │ ├── evaluator_results.txt
│ │ │ └── vocab.txt
│ │ └── vi/
│ │ ├── Multi_evaluator_results.txt
│ │ ├── analogy_list_vi_ner.txt
│ │ └── elmo_results_out_dict.txt
│ ├── embedding_dicts/
│ │ ├── C2V.vec
│ │ ├── ELMO_23.vec
│ │ ├── FastText_23.vec
│ │ ├── MULTI_23.vec
│ │ ├── W2V_C2V_23.vec
│ │ ├── baomoi_c2v_dims_300.vec
│ │ └── vn_elmo_medium_c2v.vec
│ ├── glove2vec_dicts/
│ │ ├── glove1.vec
│ │ ├── glove1_w2v.vec
│ │ ├── glove2.vec
│ │ └── glove2_w2v.vec
│ └── vocab.txt
└── examples/
├── test1_etnlp_preprocessing.py
├── test2_etnlp_extractor.py
├── test3_etnlp_evaluator.py
└── test4_etnlp_visualizer.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
dist/
develop-eggs/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don’t work, or not
# install all needed dependencies.
#Pipfile.lock
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
================================================
FILE: README.md
================================================
ETNLP: A Toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings
=====
# Table of contents
1. [Introduction](#introduction)
2. [More about ETNLP](#moreaboutETNLP)
3. [Installation and How to Use](#installation_and_howtouse)
4. [Download Resources](#Download_Resources)
# I. Overview
## A glimpse of ETNLP:
- Github: https://github.com/vietnlp/etnlp
- Video: https://vimeo.com/317599106
- Paper: https://arxiv.org/abs/1903.04433
# II. How do I cite ETNLP?
Please CITE paper the Arxiv paper whenever ETNLP (or the pre-trained embeddings) is used to produce published results or incorporated into other software:
```
@inproceedings{vu:2019n,
title={ETNLP: A Visual-Aided Systematic Approach to Select Pre-Trained Embeddings for a Downstream Task},
author={Vu, Xuan-Son and Vu, Thanh and Tran, Son N and Jiang, Lili},
booktitle={Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP)},
year={2019}
}
```
# III. More about ETNLP :
## 1. Embedding Evaluator:
To compare quality of embedding models on the word analogy task.
- Input: a pre-trained embedding vector file (word2vec format), and word analogy file.
- Output: (1) evaluate quality of the embedding model based on the MAP/P@10 score, (2) Paired t-tests to show significant level between different word embeddings.
### 1.1. Note: The word analogy list is created by:
- Adopt from the English list by selecting suitable categories and translating to the target language (i.e., Vietnamese).
- Removing inappropriate categories (i.e., category 6, 10, 11, 14) in the target language (i.e., Vietnamese).
- Adding custom category that is suitable for the target language (e.g., cities and their zones in Vietnam for Vietnamese).
Since most of this process is automatically done, it can be applied in other languages as well.
### 1.2. Selected categories for Vietnamese:
> 1. capital-common-countries
> 2. capital-world
> 3. currency: E.g., Algeria | dinar | Angola | kwanza
> 4. city-in-zone (Vietnam's cities and its zone)
> 5. family (boy|girl | brother | sister)
> 6. gram1-adjective-to-adverb (NOT USED)
> 7. gram2-opposite (e.g., acceptable | unacceptable | aware | unaware)
> 8. gram3-comparative (e.g., bad | worse | big | bigger)
> 9. gram4-superlative (e.g., bad | worst | big | biggest)
> 10. gram5-present-participle (NOT USED)
> 11. gram6-nationality-adjective-nguoi-tieng (e.g., Albania | Albanian | Argentina | Argentinean)
> 12. gram7-past-tense (NOT USED)
> 13. gram8-plural-cac-nhung (e.g., banana | bananas | bird | birds) (NOT USED)
> 14. gram9-plural-verbs (NOT USED)
### 1.3 Evaluation results (in details)
* Analogy: Word Analogy Task
* NER (w): NER task with hyper-parameters selected from the best F1 on validation set.
* NER (w.o): NER task without selecting hyper-parameters from the validation set.
| Model | NER.w | NER.w.o | Analogy |
|------------------------------ |------------- | ------------------ |------------------ |
| BiLC3 + w2v | 89.01 | 89.41 | 0.4796 |
| BiLC3 + Bert_Base | 88.26 | 89.91 | 0.4609 |
| BiLC3 + w2v_c2v | 89.46 | 89.46 | 0.4796 |
| BiLC3 + fastText | 89.65 | 89.84 | 0.4970 |
| BiLC3 + Elmo | 89.67 | 90.84 | **0.4999** |
| BiLC3 + MULTI_WC_F_E_B | **91.09** | **91.75** | 0.4906|
## 2. Embedding Extractor: To extract embedding vectors for other tasks.
- Input: (1) list of input embeddings, (2) a vocabulary file.
- Output: embedding vectors of the given vocab file in `.txt`, i.e., each line conains the embedding for a word. The file then be compressed in .gz format. This format is widely used in existing NLP Toolkits (e.g., Reimers et al. [1]).
### Extra options:
- `-input-c2v`: character embedding file
- `solveoov:1`: to solve OOV words of the 1st embedding. Similarly for more than one embedding: e.g., `solveoov:1:2`.
[1] Nils Reimers and Iryna Gurevych, Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging, 2017, http://arxiv.org/abs/1707.09861, arXiv.
## 3. Visualizer: to explore the embedding space and compare between different embeddings.
### Screenshot of viewing multiple-embeddings side-by-side (Vietnamese):

### Screenshot of viewing each embedding interactively (Vietnamese):

### Screenshot of viewing each embedding side-by-side (English):

# IV. Installation and How to use ETNLP
## 1. Installation:
From source codes (Python 3.6.x):
> 1. cd src/codes/
> 2. pip install -r requirements.txt
> 3. python setup.py install
From pip (python 3.6.x)
> 1. sudo apt-get install python3-dev
> 2. pip install cython
> 3. pip install git+git://github.com/vietnlp/etnlp.git
OR:
> 1. pip install etnlp
## 2. Examples
> 1. cd src/examples
> 2. python test1_etnlp_preprocessing.py
> 3. python test2_etnlp_extractor.py
> 4. python test3_etnlp_evaluator.py
> 5. python test4_etnlp_visualizer.py
### Example of using Fasttext-Sent2Vec:
- 01. Install: https://github.com/epfml/sent2vec
```
01. git clone https://github.com/epfml/sent2vec
02. cd sent2vec; pip install .
```
- 02. Extract embeddings for sentences (no requirement for tokenization before extracting embedding of sentences).
```
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('opendata_wiki_lowercase_words.bin')
emb = model.embed_sentence("tôi là sinh viên đh công nghệ, đại học quôc gia hà nội")
embs = model.embed_sentences(["tôi là sinh viên", "tôi là nhà thơ", "tôi là bác sĩ"])
```
### 3. Visualization
Side-by-side visualization:
> 1. sh src/codes/04.run_etnlp_visualizer_sbs.sh
Interactive visualization:
> 1. sh src/codes/04.run_etnlp_visualizer_inter.sh
# V. Available Lexical Resources
## 1. Word Analogy List for Vietnamese
| Word Analogy List | Download Link (NER Task)| Download Link (General)|
|------------------------------|---------------|---------------|
| Vietnamese (This work) | [Link1](https://drive.google.com/file/d/1eA5yvla4BhAIfWsmZherT1GEW6gzDC-1/view?usp=sharing)| [Link1](https://drive.google.com/file/d/1YJ9d5rVKMMKF1xWWZi26_sNpgULTvxwg/view?usp=sharing)|
| English (Mirkolov et al. [2]) | [Link2]| [Link2](https://drive.google.com/file/d/10rWxGu8-nbQmYC8wrIussSZjY0lDh6RP/view?usp=sharing)|
| Portuguese (Hartmann et al. [3]) | [Link3]| [Link3](https://github.com/nathanshartmann/portuguese_word_embeddings/blob/master/analogies/testset/LX-4WAnalogies.txt)|
## 2. Multiple pre-trained embedding models for Vietnamese
- Training data: Wiki in Vietnamese:
| # of sentences | # of tokenized words|
|------------------------------|---------------|
| 6,685,621 | 114,997,587 |
- Download Pre-trained Embeddings:
(Note: The MULTI_WC_F_E_B is the concatenation of four embeddings: W2V_C2V, fastText, ELMO, and Bert_Base.)
| Embedding Model | Download Link (NER Task) | Download Link (AIVIVN SentiTask) | Download Link (General) |
|------------------------------|---------------|---------------|---------------|
| w2v | [Link1](https://drive.google.com/file/d/1LHaZ8LXxteHzod42naqJZYCwwq5mI9aL/view?usp=sharing) (dim=300)| [Link1] | [Link1] |
| w2v_c2v | [Link2](https://drive.google.com/file/d/1-M9Tb9l8mNmP3RKxZiZNK1Vpbng2yw4l/view?usp=sharing) (dim=300)| [Link2] | [Link2] |
| fastText | [Link3](https://drive.google.com/file/d/1dHCPhKFjtDjbrUeeymheDnlhjtaljPGE/view?usp=sharing) (dim=300)| [Link3] | [Link3] |
| fastText-[Sent2Vec](https://github.com/epfml/sent2vec) | [Link3]| [Link3] | [Link3](https://drive.google.com/file/d/1BzL1mpdfqCCJioCdAlTVshbrz0lGfP2D/view?usp=sharing) (dim=300, 6GB, trained on 20GB of [news data](https://github.com/binhvq/news-corpus) and Wiki-data of ETNLP. |
| Elmo | [Link4](https://drive.google.com/file/d/1zDaSD8NsZNXGyd9iVOxTcb7CP61Ixo-r/view?usp=sharing) (dim=1024)| [Link4](https://drive.google.com/file/d/1jVJtF0f6SbtUd-t3bnywP6mFnz0QXPIx/view?usp=sharing) (dim=1024)| [Link4](https://drive.google.com/file/d/1XPsTzg1Gex-Hh2nl9344YlZc1orOVBDp/view?usp=sharing) (dim=1024, 731MB and 1.9GB after extraction.)|
| Bert_base | [Link5](https://drive.google.com/file/d/16fRkmIHiB16OlM8WdFmoApGtLMf6YJJ8/view?usp=sharing) (dim=768)| [Link5] | [Link5] |
| MULTI_WC_F_E_B | [Link6](https://drive.google.com/file/d/1gq7b8hs31VzoeO3n3C__ftlDnE_iBZW2/view?usp=sharing) (dim=2392)| [Link6] | [Link6] |
# VI. Versioning
For transparency and insight into our release cycle, and for striving to maintain backward compatibility, ETNLP will be maintained under the Semantic Versioning guidelines as much as possible.
Releases will be numbered with the following format:
`..`
And constructed with the following guidelines:
* Breaking backward compatibility bumps the major (and resets the minor and patch)
* New additions without breaking backward compatibility bumps the minor (and resets the patch)
* Bug fixes and misc changes bumps the patch
For more information on SemVer, please visit http://semver.org/.
================================================
FILE: src/codes/00.run_etnlp_preprocessing.sh
================================================
#!/bin/sh
export PYTHONPATH="$PYTHONPATH:$PWD"
INPUT_FILES="../data/glove2vec_dicts/glove1.vec;../data/glove2vec_dicts/glove2.vec"
OUTPUT_FILES="../data/glove2vec_dicts/glove1_w2v.vec;../data/glove2vec_dicts/glove2_w2v.vec"
# do_normalize: use this flag to normalize in case of multiple embeddings.
python ./etnlp_api.py -input $INPUT_FILES -output $OUTPUT_FILES -args "glove2w2v"
================================================
FILE: src/codes/01.run_etnlp_evaluator.sh
================================================
#!/bin/sh
export PYTHONPATH="$PYTHONPATH:$PWD"
INPUT_FILES="../data/embedding_dicts/ELMO_23.vec;../data/embedding_dicts/FastText_23.vec;../data/embedding_dicts/W2V_C2V_23.vec;../data/embedding_dicts/MULTI_23.vec"
ANALOGY_FILE="../data/embedding_analogies/vi/solveable_analogies_vi.txt"
OUT_FILE="../data/embedding_analogies/vi/Multi_evaluator_results.txt"
python ./etnlp_api.py -input $INPUT_FILES -output $OUT_FILE -analoglist $ANALOGY_FILE -args eval
================================================
FILE: src/codes/02.run_etnlp_extractor.sh
================================================
#!/bin/sh
export PYTHONPATH="$PYTHONPATH:$PWD"
INPUT_FILES="../data/embedding_dicts/ELMO_23.vec;../data/embedding_dicts/FastText_23.vec;../data/embedding_dicts/W2V_C2V_23.vec;../data/embedding_dicts/MULTI_23.vec"
C2V="../data/embedding_dicts/C2V.vec"
OUTPUT="../data/embedding_dicts/MULTI_W_F_B_E.vec"
VOCAB_FILE="../data/vocab.txt"
python ./etnlp_api.py -input $INPUT_FILES -vocab $VOCAB_FILE -input_c2v $C2V -args "extract" -output $OUTPUT
================================================
FILE: src/codes/03.run_etnlp_visualizer_inter.sh
================================================
#!/bin/sh
export PYTHONPATH="$PYTHONPATH:$PWD"
INPUT_FILES="../data/embedding_dicts/ELMO_23.vec;../data/embedding_dicts/FastText_23.vec;../data/embedding_dicts/W2V_C2V_23.vec;../data/embedding_dicts/MULTI_23.vec"
python3 ./etnlp_api.py -input $INPUT_FILES -args visualizer -port 8889
================================================
FILE: src/codes/04.run_etnlp_visualizer_sbs.sh
================================================
#!/bin/sh
export PYTHONPATH="$PYTHONPATH:$PWD"
INPUT_FILES="../data/embedding_dicts/ELMO_23.vec;../data/embedding_dicts/FastText_23.vec;../data/embedding_dicts/W2V_C2V_23.vec;../data/embedding_dicts/MULTI_23.vec"
# python ./visualizer/visualizer_sbs.py -input $INPUT_FILES -args visualizer
python3 ./visualizer/visualizer_sbs.py $INPUT_FILES
================================================
FILE: src/codes/api/__init__.py
================================================
================================================
FILE: src/codes/api/embedding_evaluator.py
================================================
import logging
import gensim
import argparse
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors, Word2VecKeyedVectors
from gensim import utils, matutils
from six import string_types
from numpy import dot, float32 as REAL, array, ndarray, argmax
from utils import embedding_io, emb_utils
from embeddings.embedding_configs import EmbeddingConfigs
logger = logging.getLogger(__name__)
class new_Word2VecKeyedVectors(Word2VecKeyedVectors):
def __init__(self, vector_size):
super(Word2VecKeyedVectors, self).__init__(vector_size=vector_size)
def most_similar(self, positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None):
"""
Find the top-N most similar words. Positive words contribute positively towards the
similarity, negative words negatively.
This method computes cosine similarity between a simple mean of the projection
weight vectors of the given words and the vectors for each word in the model.
The method corresponds to the `word-analogy` and `distance` scripts in the original
word2vec implementation.
If topn is False, most_similar returns the vector of similarity scores.
`restrict_vocab` is an optional integer which limits the range of vectors which
are searched for most-similar values. For example, restrict_vocab=10000 would
only check the first 10000 word vectors in the vocabulary order. (This may be
meaningful if you've sorted the vocabulary by descending frequency.)
Example::
>>> trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
"""
if positive is None:
positive = []
if negative is None:
negative = []
self.init_sims()
if isinstance(positive, string_types) and not negative:
# allow calls like most_similar('dog'), as a shorthand for most_similar(['dog'])
positive = [positive]
# add weights for each word, if not already present; default to 1.0 for positive and -1.0 for negative words
positive = [
(word, 1.0) if isinstance(word, string_types + (ndarray,)) else word
for word in positive
]
negative = [
(word, -1.0) if isinstance(word, string_types + (ndarray,)) else word
for word in negative
]
# compute the weighted average of all words
all_words, mean = set(), []
for word, weight in positive + negative:
if isinstance(word, ndarray):
mean.append(weight * word)
else:
mean.append(weight * self.word_vec(word, use_norm=True))
if word in self.vocab:
all_words.add(self.vocab[word].index)
if not mean:
raise ValueError("cannot compute similarity with no input")
mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
if indexer is not None:
return indexer.most_similar(mean, topn)
limited = self.syn0norm if restrict_vocab is None else self.syn0norm[:restrict_vocab]
dists = dot(limited, mean)
if not topn:
return dists
best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)
# ignore (don't return) words from the input
result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
return result[:topn]
def new_accuracy(self, questions, restrict_vocab=30000, most_similar=most_similar, case_insensitive=True):
"""
Compute accuracy of the model. `questions` is a filename where lines are
4-tuples of words, split into sections by ": SECTION NAME" lines.
See questions-words.txt in
https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip
for an example.
The accuracy is reported (=printed to log and returned as a list) for each
section separately, plus there's one aggregate summary at the end.
Use `restrict_vocab` to ignore all questions containing a word not in the first `restrict_vocab`
words (default 30,000). This may be meaningful if you've sorted the vocabulary by descending frequency.
In case `case_insensitive` is True, the first `restrict_vocab` words are taken first, and then
case normalization is performed.
Use `case_insensitive` to convert all words in questions and vocab to their uppercase form before
evaluating the accuracy (default True). Useful in case of case-mismatch between training tokens
and question words. In case of multiple case variants of a single word, the vector for the first
occurrence (also the most frequent if vocabulary is sorted) is taken.
This method corresponds to the `compute-accuracy` script of the original C word2vec.
"""
print("INFO: Using new accuracy")
ok_vocab = [(w, self.vocab[w]) for w in self.index2word[:restrict_vocab]]
ok_vocab = {w.upper(): v for w, v in reversed(ok_vocab)} if case_insensitive else dict(ok_vocab)
oov_counter, idx_cnt, is_vn_counter = 0, 0, 0
sections, section = [], None
for line_no, line in enumerate(utils.smart_open(questions)):
# TODO: use level3 BLAS (=evaluate multiple questions at once), for speed
line = utils.to_unicode(line)
if line.startswith(': '):
# a new section starts => store the old section
if section:
sections.append(section)
self.log_accuracy(section)
section = {'section': line.lstrip(': ').strip(), 'correct': [], 'incorrect': []}
else:
# Count number of analogy to check
idx_cnt += 1
if not section:
raise ValueError("missing section header before line #%i in %s" % (line_no, questions))
try:
if case_insensitive:
a, b, c, expected = [word.upper() for word in line.split(" | ")]
else:
a, b, c, expected = [word for word in line.split(" | ")]
# print("Line : ", line)
# print("a, b, c, expected: %s, %s, %s, %s"%(a, b, c, expected))
# input(">>> Wait ...")
except ValueError:
logger.info("SVX: ERROR skipping invalid line #%i in %s", line_no, questions)
print("Line : ", line)
print("a, b, c, expected: %s, %s, %s, %s" % (a, b, c, expected))
input(">>> Wait ...")
continue
# In case of Vietnamese, word analogy can be a phrase
if " " in a or " " in b or " " in c or " " in expected:
is_vn_counter += 1
pass
else:
if a not in ok_vocab or b not in ok_vocab or c not in ok_vocab or expected not in ok_vocab:
logger.debug("SVX: skipping line #%i with OOV words: %s", line_no, line.strip())
oov_counter += 1
continue
original_vocab = self.vocab
self.vocab = ok_vocab
ignore = {a, b, c} # input words to be ignored
predicted = None
# find the most likely prediction, ignoring OOV words and input words
sims = most_similar(self, positive=[b, c], negative=[a], topn=False, restrict_vocab=restrict_vocab)
self.vocab = original_vocab
for index in matutils.argsort(sims, reverse=True):
predicted = self.index2word[index].upper() if case_insensitive else self.index2word[index]
if predicted in ok_vocab and predicted not in ignore:
if predicted != expected:
logger.debug("%s: expected %s, predicted %s", line.strip(), expected, predicted)
break
if predicted == expected:
section['correct'].append((a, b, c, expected))
else:
section['incorrect'].append((a, b, c, expected))
if section:
# store the last section, too
sections.append(section)
self.log_accuracy(section)
total = {
'OOV/Total/VNCompound_Words': [oov_counter, (idx_cnt), is_vn_counter],
'section': 'total',
'correct': sum((s['correct'] for s in sections), []),
'incorrect': sum((s['incorrect'] for s in sections), []),
}
self.log_accuracy(total)
sections.append(total)
return sections
def convert_conll_format_to_normal(connl_file, out_file):
"""
read file conll format
return format : One sentence per line
sentences_arr: [EU rejects German call .., ...]
tags_arr: [B-ORG O B-MIST O ..., ...]
"""
f = open(connl_file)
sentences = []
sentence = ""
for line in f:
# print("line: ", line)
if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == "\n":
sentences.append(sentence.rstrip())
sentence = ""
continue
else:
splits = line.split('\t')
sentence += splits[1].rstrip() + " "
# To handle the last sentence.
if len(sentence) > 0:
sentences.append(sentence)
del sentence
# Write to output
if out_file is None:
out_file = connl_file + ".std.txt"
writer = open(out_file, "w")
for sen in sentences:
writer.write(sen + "\n")
writer.flush()
writer.close()
return sentences
def verify_word_analogies(file):
"""
Verify the word analogy file.
:param file:
:return:
"""
f_reader = open(file, "r")
valid_cnt, invalid_cnt = 0, 0
for line in f_reader:
# print("line: ", line)
if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == "\n":
continue
else:
splits = line.split('\t')
if len(splits) != 4:
invalid_cnt += 1
else:
valid_cnt += 1
print("Valid analogy: %s, invalid analogy: %s" % (valid_cnt, invalid_cnt))
def check_oov_of_word_analogies(w2v_format_emb_file, analogy_file, is_vn=True, case_sensitive=True):
emb_model = gensim.models.KeyedVectors.load_word2vec_format(w2v_format_emb_file,
binary=False,
unicode_errors='ignore')
f_reader = open(analogy_file, "r")
vocab_arr = []
for line in f_reader:
if not case_sensitive:
line = line.lower()
if line.startswith(': '):
continue
else:
for word in line.split(" | "):
# In Vietnamese, we have compound and single word.
# if is_vn:
# if " " in word:
# print("I should not going here")
# single_words = word.split(" ")
# for single_word in single_words:
# vocab_arr.append(single_word)
# For other languages.
# else:
vocab_arr.append(word)
print("Before unique set: len = ", len(vocab_arr))
unique_vocab_arr = set(vocab_arr)
print("After unique set: len = ", len(unique_vocab_arr))
valid_word_cnt = 0
for word in unique_vocab_arr:
if word in emb_model:
valid_word_cnt += 1
print("With Is_VN = %s, case_sensitive = %s, Valid word = %s/%s" % (is_vn,
case_sensitive,
valid_word_cnt,
len(unique_vocab_arr)))
def evaluator_api(input_files, analoglist, output, embed_config=None):
"""
:param input_files:
:param analoglist:
:param output:
:param embed_config:
:return:
"""
if embed_config is None:
embed_config = EmbeddingConfigs() # Initialize default config for embedding.
local_embedding_names, local_word_embeddings = embedding_io.load_word_embeddings(input_files, embed_config)
# emb_utils.print_analogy('man', 'him', 'woman', emb_words)
local_output_str = emb_utils.eval_word_analogy_4_all_embeddings(analoglist,
local_embedding_names,
local_word_embeddings,
output_file=output)
print("OUTPUT: ", local_output_str)
if __name__ == "__main__":
"""
Evaluates a given word embedding model.
To use:
evaluate.py path_to_model [-restrict]
optional restrict argument performs an evaluation using the original
Mikolov restriction of vocabulary
"""
desc = "Evaluates a word embedding model"
parser = argparse.ArgumentParser(description=desc)
parser.add_argument("-input",
required=True,
default="../data/embedding_dicts/ELMO_23.vec",
help="Input multiple word embeddings, each model separated by a `;`.")
parser.add_argument("-analoglist",
nargs="?",
# default="../data/embedding_analogies/vi/analogy_vn_seg.txt.std.txt",
default="../data/embedding_analogies/vi/solveable_analogies_vi.txt",
help="Input analogy file to run the word analogy evaluation.")
parser.add_argument("-r",
nargs="?",
default=False,
help="Vocabulary restriction")
parser.add_argument("-checkoov",
nargs="?",
default=False,
help="Check OOV percentage")
parser.add_argument("-lang",
nargs="?",
default="VI",
help="Specify language, by default, it's Vietnamese.")
parser.add_argument("-lowercase",
nargs="?",
default=True,
help="Lowercase all word analogies? (depends on how the emb was trained).")
parser.add_argument("-output",
nargs="?",
default="../data/embedding_analogies/vi/results_out.txt",
help="Output file of word analogy task")
parser.add_argument("-remove_redundancy",
nargs="?",
default=True,
help="Remove redundancy in predicted words")
print("Params: ", parser)
args = parser.parse_args()
embedding_config = EmbeddingConfigs()
paths_of_models = args.input
testset = args.analoglist
is_vietnamese = args.lang
output_file = args.output
# use restriction?
restriction = None
if args.r:
restriction = 30000
# set logging definitions
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
level=logging.INFO)
if args.checkoov:
print("Checking OOV ...")
check_oov_of_word_analogies(paths_of_models, testset, is_vn=is_vietnamese)
if not args.checkoov:
print("Evaluating embeddings on the word analogy task ...")
if is_vietnamese:
print(" ... for ETNLP's evaluation approach.")
embedding_names, word_embeddings = embedding_io.load_word_embeddings(paths_of_models, embedding_config)
# emb_utils.print_analogy('man', 'him', 'woman', emb_words)
output_str = emb_utils.eval_word_analogy_4_all_embeddings(testset, embedding_names, word_embeddings,
output_file=args.output_file)
print("#"*20)
print(output_str)
print("#" * 20)
else:
print(" ... for Mirkolov et al.'s evaluation approach.")
word_analogy_obj = new_Word2VecKeyedVectors(1024)
# load and evaluate
model = word_analogy_obj.load_word2vec_format(
paths_of_models,
binary=False,
unicode_errors='ignore')
model.accuracy = word_analogy_obj.new_accuracy
acc = model.accuracy(testset, restrict_vocab=restriction, case_insensitive=False)
print("Acc = ", acc)
print("DONE")
================================================
FILE: src/codes/api/embedding_extractor.py
================================================
from embeddings import embedding_utils
from pathlib import Path
import numpy as np
import os
import logging
import gzip
from embeddings.embedding_configs import EmbeddingConfigs
def get_multi_embedding_models(config: EmbeddingConfigs):
"""
:param config:
:return:
"""
model_paths_list = config.model_paths_list
model_names_list = config.model_names_list
model_dims_list = config.model_dims_list
char_model_path = config.char_model_path
char_model_dims = config.char_model_dims
if char_model_path:
char_model = embedding_utils.reload_char2vec_model(char_model_path, char_model_dims)
else:
char_model = None
embedding_models = embedding_utils.reload_embedding_models(model_paths_list,
model_names_list,
model_dims_list,
char_model)
# doc_vector = embedding_models.get_vector_of_document(tokenized_text)
return embedding_models
def get_emb_dim(emb_file):
idx = 0
dim = 0
with open(emb_file, "r") as reader:
if idx == 0:
line = reader.readline().rstrip()
dim = int(line.split(" ")[1])
return dim
def extract_embedding_for_vocab_file(paths_of_emb_models, vocab_words_file, c2v_emb_file, output_file, output_format):
"""
:param paths_of_emb_models:
:param vocab_words_file:
:param c2v_emb_file:
:param output_file:
:param output_format:
:return:
"""
config = EmbeddingConfigs()
config.output_format = output_format
config.model_paths_list = paths_of_emb_models.split(";")
embedding_file_names = []
embedding_dims = []
if c2v_emb_file:
config.char_model_path = c2v_emb_file
config.char_model_dims = get_emb_dim(c2v_emb_file)
print("02. Extracting word embeddings ...")
if paths_of_emb_models and paths_of_emb_models.__contains__(";"):
files = paths_of_emb_models.split(";")
for emb_file in files:
embedding_name = os.path.basename(os.path.normpath(emb_file))
embedding_file_names.append(embedding_name)
embedding_dim = get_emb_dim(emb_file)
embedding_dims.append(embedding_dim)
elif paths_of_emb_models: # In case there is only one embedding
embedding_name = os.path.basename(os.path.normpath(paths_of_emb_models))
embedding_file_names.append(embedding_name)
embedding_dim = get_emb_dim(paths_of_emb_models)
embedding_dims.append(embedding_dim)
else:
raise Exception("List of embeddings cannot be None.")
# Data type:
embedding_names = ["word2vec"]*len(embedding_dims) # embedding type, only support w2v and c2v type now
config.model_names_list = embedding_names
config.model_dims_list = embedding_dims
# Do extracting embeddings
extract_embedding_vectors(vocab_words_file, output_file, config)
print("Done")
def extract_embedding_vectors(vocab_words_file, output_file, config: EmbeddingConfigs):
"""
:param vocab_words_file:
:param output_file:
:param config:
:return:
"""
# Load vocab
with Path(vocab_words_file).open() as f:
word_to_idx = {line.strip(): idx for idx, line in enumerate(f)}
size_vocab = len(word_to_idx)
# Output writer
fwriter = open(output_file, "w")
# Array of zeros
dim_size = sum(config.model_dims_list)
found = 0
print('Reading embedding file (may take a while)')
embedding_models = get_multi_embedding_models(config)
embeddings = np.zeros((size_vocab, dim_size))
line_idx = 0
for word in word_to_idx.keys():
word_idx = word_to_idx[word]
word = word.rstrip()
try:
if line_idx % 100000 == 0:
print('- At line {}'.format(line_idx))
w2v_vector = embedding_models.get_word_vector_of_multi_embeddings(word)
if w2v_vector is not None and len(w2v_vector) > 0:
embeddings[word_idx] = w2v_vector
line = "%s %s" % (word, " ".join(str(scalar) for scalar in w2v_vector))
fwriter.write(line + "\n")
fwriter.flush()
found += 1
logging.debug("Embedding: ", w2v_vector)
except Exception as e:
logging.debug("Unexpected error: word = %s, error = %s" % (word, e))
pass
line_idx += 1
print('- done. Found {} vectors for {} words'.format(found, size_vocab))
fwriter.close()
# Open file again to add meta data:
src = open(output_file, "r")
meta_line = "%s %s\n"%(found, dim_size)
oline = src.readlines()
# Here, we prepend the string we want to on first line
oline.insert(0, meta_line)
src.close()
# We again open the file in WRITE mode
src = open(output_file, "w")
src.writelines(oline)
src.close()
# Done with writing.
if config.output_format.__contains__(".gz"):
content = open(output_file, "rb").read()
gzip_out_file = output_file + '.gz'
with gzip.open(gzip_out_file, 'wb') as f:
f.write(content)
print("Saved embedding to %s" % (gzip_out_file))
if config.output_format.__contains__(".npz"):
npz_out_file = output_file + '.npz'
np.savez_compressed(npz_out_file, embeddings=embeddings)
print("Saved embedding to %s"%(npz_out_file))
return
================================================
FILE: src/codes/api/embedding_preprocessing.py
================================================
# Convert to a standard word2vec format
import gensim
from utils import embedding_io
import sys
from threading import Thread
from embeddings.embedding_configs import EmbeddingConfigs
def convert_to_w2v(vocab_file, embedding_file, out_file):
"""
Export from a word2vec file by filtering out vocabs based on the input vocab file.
:param vocab_file:
:param embedding_file:
:param out_file:
:return: word2vec file
"""
std_vocab = []
with open(vocab_file) as f:
for word in f:
std_vocab.append(word)
print ("Loaded NER vocab_size = %s" % (len(std_vocab)))
is_binary = False
if embedding_file.endswith(".bin"):
is_binary = True
print("Loading w2v model ...")
emb_model = gensim.models.KeyedVectors.load_word2vec_format(embedding_file,
binary=is_binary,
unicode_errors='ignore')
print("LOADED model: vocab_size = %s" % (len(emb_model.wv.vocab)))
f_writer = open(out_file, "w")
for word in std_vocab:
word = word.rstrip()
line = None
if word in emb_model:
vector = " ".join(str(item) for item in emb_model[word])
# word = word.lower()
line = "%s %s" % (word, vector)
else:
word = word.lower()
if word in emb_model:
vector = " ".join(str(item) for item in emb_model[word])
line = "%s %s" % (word, vector)
# print("LINE: ", line)
if line:
f_writer.write(line + "\n")
f_writer.close()
def test():
vocab_file = "../data/vnner_BiLSTM_CRF/vocab.words.txt"
embedding_file = "../data/embedding_dicts/elmo_embeddings_large.txt"
out_file = "../data/embedding_dicts/elmo_1024dims_wiki_normalcase2lowercase_NER.vec"
convert_to_w2v(vocab_file, embedding_file, out_file)
print("Out file: ", out_file)
print("DONE")
def load_and_save_2_word2vec_model(input_model_path, output_model_path, embedding_config):
"""
Process one embedding model
:param input_model_path:
:param output_model_path:
:return:
"""
model_in = embedding_io.load_word_embedding(input_model_path, embedding_config)
embedding_io.save_model_to_file(model_in, output_model_path)
print("Write model back to ", output_model_path)
def load_and_save_2_word2vec_models(input_embedding_files_str, output_embedding_files_str, embedding_config):
"""
Multi-threaded processing to export to word2vec format
:param input_embedding_files_str:
:param output_embedding_files_str:
:return:
"""
if input_embedding_files_str.__contains__(";"):
input_model_files = input_embedding_files_str.split(";")
else:
input_model_files = [input_embedding_files_str]
if output_embedding_files_str.__contains__(";"):
output_model_files = output_embedding_files_str.split(";")
else:
output_model_files = [output_embedding_files_str]
# Double check input files and output files.
assert (len(output_model_files) == len(input_model_files)), \
"Number of input files and output files must be equal. Exiting ..."
# create a list of threads
threads = []
for model_in, model_out in zip(input_model_files, output_model_files):
# We start one thread per file.
process = Thread(target=load_and_save_2_word2vec_model, args=[model_in, model_out, embedding_config])
process.start()
threads.append(process)
# load_and_save_2_word2vec_model(model_in, model_out)
# This to ensure each thread has finished processing the input file.
for process in threads:
process.join()
if __name__ == "__main__":
if len(sys.argv) != 2:
print("Missing input arguments. Input format: ./*.py . Exiting ...")
exit(0)
embedding_config = EmbeddingConfigs()
# We don't need to be word2vec format for pre-processing here but it still shows warning
# if input files aren't in w2v format.
embedding_config.is_word2vec_format = True
embedding_config.do_normalize_emb = False # If you don't want to normalize the embedding vectors.
if sys.argv[1].__contains__(";"):
in_model_files = sys.argv[1].split(";")
else:
in_model_files = [sys.argv[1]]
out_model_files = [input_model_path + ".extracted.vec" for input_model_path in in_model_files]
load_and_save_2_word2vec_models(in_model_files, out_model_files)
================================================
FILE: src/codes/api/embedding_visualizer.py
================================================
# 1. Read embedding file
# 2. Convert to tensorboard
# 3. Visualize
# encoding: utf-8
import sys, os
import gensim
import tensorflow as tf
import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector
import logging
from tensorboard import default
from tensorboard import program
class TensorBoardTool:
def __init__(self, dir_path):
self.dir_path = dir_path
def run(self, emb_name, port):
# Remove http messages
# log = logging.getLogger('sonvx').setLevel(logging.INFO)
logging.basicConfig(level=logging.INFO)
logging.propagate = False
# Start tensorboard server
tb = program.TensorBoard(default.get_plugins(), default.get_assets_zip_provider())
tb.configure(argv=[None, '--logdir', self.dir_path, '--port', str(port)])
url = tb.launch()
sys.stdout.write('TensorBoard of %s at %s \n' % (emb_name, url))
def convert_multiple_emb_models_2_tf(emb_name_arr, w2v_model_arr, output_path, port):
"""
:param emb_name_arr:
:param w2v_model_arr:
:param output_path:
:param port:
:return:
"""
idx = 0
# define the model without training
sess = tf.InteractiveSession()
config = projector.ProjectorConfig()
for w2v_model in w2v_model_arr:
emb_name = emb_name_arr[idx]
meta_file = "%s.tsv" % emb_name
placeholder = np.zeros((len(w2v_model.wv.index2word), w2v_model.vector_size))
with open(os.path.join(output_path, meta_file), 'wb') as file_metadata:
for i, word in enumerate(w2v_model.wv.index2word):
placeholder[i] = w2v_model[word]
# temporary solution for https://github.com/tensorflow/tensorflow/issues/9094
if word == '':
print("Empty Line, should replaced by any thing else, or will cause a bug of tensorboard")
file_metadata.write(u"{0}".format('').encode('utf-8') + b'\n')
else:
file_metadata.write(u"{0}".format(word).encode('utf-8') + b'\n')
word_embedding_var = tf.Variable(placeholder, trainable=False, name=emb_name)
tf.global_variables_initializer().run()
sess.run(word_embedding_var)
# adding into projector
embed = config.embeddings.add()
embed.tensor_name = emb_name
embed.metadata_path = meta_file
idx += 1
saver = tf.train.Saver()
writer = tf.summary.FileWriter(output_path, sess.graph)
# Specify the width and height of a single thumbnail.
projector.visualize_embeddings(writer, config)
all_emb_name = "_".join(emb_name for emb_name in emb_name_arr)
saver.save(sess, os.path.join(output_path, '%s.ckpt' % all_emb_name))
# tf.flags.FLAGS.logdir = output_path
# print('Running `tensorboard --logdir={0}` to run visualize result on tensorboard'.format(output_path))
# tb.run_main()q
tb_tool = TensorBoardTool(output_path)
tb_tool.run(all_emb_name, port)
return
def convert_one_emb_model_2_tf(emb_name, model, output_path, port):
"""
:param model: Word2Vec model
:param output_path:
:return:
"""
# emb_name = "word_embedding"
meta_file = "%s.tsv"%emb_name
placeholder = np.zeros((len(model.wv.index2word), model.vector_size))
with open(os.path.join(output_path, meta_file), 'wb') as file_metadata:
for i, word in enumerate(model.wv.index2word):
placeholder[i] = model[word]
# temporary solution for https://github.com/tensorflow/tensorflow/issues/9094
if word == '':
print("Empty Line, should replaced by any thing else, or will cause a bug of tensorboard")
file_metadata.write(u"{0}".format('').encode('utf-8') + b'\n')
else:
file_metadata.write(u"{0}".format(word).encode('utf-8') + b'\n')
# define the model without training
sess = tf.InteractiveSession()
word_embedding_var = tf.Variable(placeholder, trainable=False, name=emb_name)
sess.run(word_embedding_var)
# tf.global_variables_initializer().run()
saver = tf.train.Saver()
writer = tf.summary.FileWriter(output_path, sess.graph)
# adding into projector
config = projector.ProjectorConfig()
embed = config.embeddings.add()
embed.tensor_name = emb_name
embed.metadata_path = meta_file
# Specify the width and height of a single thumbnail.
projector.visualize_embeddings(writer, config)
saver.save(sess, os.path.join(output_path, '%s.ckpt'%emb_name))
# tf.flags.FLAGS.logdir = output_path
# print('Running `tensorboard --logdir={0}` to run visualize result on tensorboard'.format(output_path))
# tb.run_main()q
tb_tool = TensorBoardTool(output_path)
tb_tool.run(emb_name, port)
return
def visualize_multiple_embeddings_individually(paths_of_emb_models):
output_root_dir = "../data/embedding_tf_data/"
starting_port = 6006
embedding_names = []
print("Loaded all word embeddings, going to visualize ...")
if paths_of_emb_models and paths_of_emb_models.__contains__(";"):
files = paths_of_emb_models.split(";")
for emb_file in files:
embedding_name = os.path.basename(os.path.normpath(emb_file))
tf_data_folder = output_root_dir + embedding_name
if not os.path.exists(tf_data_folder):
os.makedirs(tf_data_folder)
is_binary = False
if emb_file.endswith(".bin"):
is_binary = True
emb_model = gensim.models.KeyedVectors.load_word2vec_format(emb_file, binary=is_binary)
convert_one_emb_model_2_tf(embedding_name, emb_model, tf_data_folder, starting_port)
embedding_names.append(embedding_name)
starting_port += 1
while True:
print("Type exit to quite the visualizer: ")
user_input = input()
if user_input == "exit":
break
return
def visualize_multiple_embeddings_all_in_one(paths_of_emb_models, port):
output_root_dir = "../data/embedding_tf_data/"
starting_port = port
embedding_names = []
print("Loaded all word embeddings, going to visualize ...")
embedding_name_arr = []
w2v_embedding_model_arr = []
if paths_of_emb_models and paths_of_emb_models.__contains__(";"):
files = paths_of_emb_models.split(";")
for emb_file in files:
embedding_name = os.path.basename(os.path.normpath(emb_file))
embedding_name_arr.append(embedding_name)
is_binary = False
if emb_file.endswith(".bin"):
is_binary = True
emb_model = gensim.models.KeyedVectors.load_word2vec_format(emb_file, binary=is_binary)
w2v_embedding_model_arr.append(emb_model)
embedding_names.append(embedding_name)
# print("View side-by-side word similarity of multiple embeddings at: http://Sons-MBP.lan:8089")
all_emb_name = "_".join(emb_name for emb_name in embedding_name_arr)
tf_data_folder = output_root_dir + all_emb_name
if not os.path.exists(tf_data_folder):
os.makedirs(tf_data_folder)
convert_multiple_emb_models_2_tf(embedding_name_arr, w2v_embedding_model_arr, tf_data_folder, starting_port)
while True:
print("Type exit to quite the visualizer: ")
user_input = input()
if user_input == "exit":
break
return
def visualize_multiple_embeddings(paths_of_emb_models, port):
"""
API to other part to call, don't modify this function.
:param paths_of_emb_models:
:param port:
:return:
"""
visualize_multiple_embeddings_all_in_one(paths_of_emb_models, port)
if __name__ == "__main__":
"""
Just run `python w2v_visualizer.py word2vec.model visualize_result`
"""
try:
model_path = sys.argv[1]
output_path = sys.argv[2]
except Exception as e:
print("Please provide model path and output path %s " % e)
# model = Word2Vec.load(model_path)
model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)
convert_one_emb_model_2_tf(model, output_path)
================================================
FILE: src/codes/embeddings/__init__.py
================================================
================================================
FILE: src/codes/embeddings/embedding_configs.py
================================================
class EmbeddingConfigs(object):
"""
Configuration information
"""
is_word2vec_format = True
do_normalize_emb = True
model_paths_list = []
model_names_list = []
model_dims_list = []
char_model_path = None
char_model_dims = -1
output_format = ".txt;.npz;.gz"
================================================
FILE: src/codes/embeddings/embedding_models.py
================================================
from gensim.models import KeyedVectors as Word2Vec
import numpy as np
from embeddings import embedding_utils
from utils import file_utils
import os, re
import logging
DEBUG = False
class Model_Constants(object):
word2vec = "word2vec"
char2vec = "char2vec"
private_word2vec = "private_word2vec"
elmo = "elmo"
class Embedding_Model(object):
def __init__(self, name, vector_dim):
self.name = name
self.model = None
self.char_model = None
self.vocabs_list = None
self.vector_dim = vector_dim
# TODO: update this changeable param later
# unk, random, mean, replace_by_character_embedding
self.unknown_word = "replace_by_character_embedding"
# self.MAX_DIM = 400 # No longer use MAX_DIM, now it depends on input dims
def load_model(self, model_path):
if self.name == Model_Constants.word2vec or self.name == Model_Constants.elmo:
if model_path.endswith(".bin"):
self.model = Word2Vec.load_word2vec_format(model_path, binary=True)
else:
self.model = Word2Vec.load_word2vec_format(model_path, binary=False)
elif self.name == Model_Constants.char2vec:
self.model = dict()
print("Loading model_path = ", model_path)
file = open(model_path, "r")
for line in file:
elements = line.split()
if len(elements) > 100: # because embedding dim is higher than 100.
# char_model[elements[0]] = np.array(map(float, elements[1:])).tolist()
self.model[elements[0]] = np.array([float(i) for i in elements[1:]]).tolist()
return self.model
elif self.name == Model_Constants.private_word2vec:
self.model, _, self.vocabs_list = embedding_utils.reload_embeddings(model_path)
else:
raise Exception("Unknown embedding models!")
def is_punct(self, word):
arr_list = [
'!',
'"',
'%',
'&',
"'",
"''",
'(',
'(.',
')',
'*',
'+',
',',
'-',
'---',
'.',
'..',
'...',
'....',
'/',
]
if word in arr_list:
return True
else:
return False
def is_number(self, word):
regex = r"^[0-9]+"
matches = re.finditer(regex, word, re.MULTILINE)
matchNum = 0
for matchNum, match in enumerate(matches):
matchNum = matchNum + 1
if matchNum > 0:
return True
else:
return False
def set_char_model(self, char_model):
self.char_model = char_model
def load_vocabs_list(self, vocab_file_path):
"""
Load vocabs list for private w2v model. Has to be pickle file.
:param vocab_file_path:
:return:
"""
if vocab_file_path:
self.vocabs_list = file_utils.load_obj(vocab_file_path)
def get_char_vector(self, char_model, word):
"""
char_model here is an instance of embedding_model
:param char_model: an instance of embedding_model
:param word:
:return:
"""
if char_model is None:
# Sonvx on March 20, 2019: we now allow the char_model is None,
# cannot call this get_char_vector in such case.
raise Exception("Char_model is None! Cannot use character-embedding.")
out_char_2_vec = []
char_vecs = []
chars = list(word)
vecs = []
for c in chars:
if c in char_model.model:
emb_vector = char_model.model[c]
vecs.append(emb_vector)
if DEBUG:
input(">>>>>>")
print("Char_emb_vector=", emb_vector)
# char_vecs.extend(list(vecs))
if len(vecs) > 0:
out_char_2_vec = np.mean(vecs, axis=0)
if DEBUG:
print(">>> Output of char2vec: %s"%(out_char_2_vec))
input(">>>> outc2v ...")
return out_char_2_vec
def is_unknown_word(self, word):
"""Check whether or not a word is unknown"""
is_unknown_word = False
if self.vocabs_list is not None:
if word not in self.vocabs_list:
is_unknown_word = True
else:
if word not in self.model:
is_unknown_word = True
return is_unknown_word
def get_word_vector(self, word):
"""
Handle unknown word: In case of our private word2vec, we have a vocabs_list to check. With regular models,
we can check inside the model. Note that by default, we use char-model to handle unknown words.
:param word:
:param char_model:
:return:
"""
rtn_vector = []
# try first time with normal case
is_unknown_word = self.is_unknown_word(word)
# try 2nd times with lowercase.
if is_unknown_word:
word = word.lower()
is_unknown_word = self.is_unknown_word(word)
# unknown word
if is_unknown_word and self.char_model:
# Sonvx on March 20, 2019: solve unknown only when char_model is SET.
rtn_vector = self.get_vector_of_unknown(word)
else:
# normal case
if self.name == Model_Constants.word2vec:
rtn_vector = self.model[word]
# For now we have self.vector_dim, max_dim, and len(rtn_vector)
# Update: move to use self.vector_dim only
if len(rtn_vector) > self.vector_dim:
print("Warning: auto trim to %s/%s dimensions"%(self.vector_dim, len(rtn_vector)))
rtn_vector = self.model[word][:self.vector_dim]
elif self.name == Model_Constants.elmo:
rtn_vector = self.model[word]
if self.vector_dim == len(rtn_vector)/2:
vector1 = rtn_vector[:self.vector_dim]
vector2 = rtn_vector[self.vector_dim:]
print("Notice: auto average to b[i] = (a[i] + a[i + %s])/2 /%s dimensions" % (self.vector_dim,
len(rtn_vector)))
rtn_vector = np.mean([vector1, vector2], 0)
elif len(rtn_vector) > self.vector_dim:
print("Warning: auto trim to %s/%s dimensions" % (self.vector_dim, len(rtn_vector)))
rtn_vector = self.model[word][:self.vector_dim]
elif self.name == Model_Constants.char2vec:
rtn_vector = self.get_char_vector(self, word)
elif self.name == Model_Constants.private_word2vec:
# Handle unknown word - Not need for now since we handle unknown words first
if word not in self.vocabs_list:
word = "UNK"
word_idx = self.vocabs_list.index(word)
emb_vector = self.model[word_idx]
rtn_vector = emb_vector
# final check before returning vector
if DEBUG:
print(">>> DEBUG: len(rtn_vector) = %s" % (len(rtn_vector)))
input(">>> before returning vector ...")
if len(rtn_vector) < 1:
return np.zeros(self.vector_dim)
else:
if len(rtn_vector) == self.vector_dim:
return rtn_vector
# TODO: find a better way to represent unknown word by character to have same-size with word-vector-size
# For now, I add 0 to the [current-len, expected-len]
else:
logging.debug("Model name = %s, Current word = %s, Current size = %s, expected size = %s"
%(self.name, word, len(rtn_vector), self.vector_dim))
return np.append(rtn_vector, np.zeros(self.vector_dim - len(rtn_vector)))
def get_vector_of_unknown(self, word):
"""
If word is UNK, use char_vector model instead.
:param word:
:return:
"""
# Here we handle features based on the w2v model where
# numbers and punctuations are encoded as ,
if self.name == Model_Constants.word2vec:
if self.is_number(word):
rtn_vector = self.model[""]
elif self.is_punct(word):
rtn_vector = self.model[""]
else:
rtn_vector = self.get_char_vector(self.char_model, word)
if rtn_vector is not None:
if len(rtn_vector) > self.vector_dim:
print("Warning: auto trim to %s/%s dimensions"%(self.vector_dim, len(rtn_vector)))
return rtn_vector[:self.vector_dim]
else:
return rtn_vector
# otherwise, using c2v to build-up the embedding vector
else:
return self.get_char_vector(self.char_model, word)
class Embedding_Models(object):
"""
Using all available embedding models to generate vectors
"""
def __init__(self, list_models):
self.list_models = list_models # list of embedding_model_objs: ['word2vec', 'char2vec', 'private_word2vec']
def add_model(self, emb_model, char_model):
"""
Add new model into the collection of embedding models. Note that, every model has to add char_model to handle
unknown word.
:param emb_model:
:param char_model:
:return:
"""
if char_model is None:
print("Warning: char_model is None -> cannot solve OOV word. Keep going ...")
# Sonvx on March 20, 2019: change to allow None char_model
# raise Exception("char_model cannot be None.")
if isinstance(emb_model, Embedding_Model):
emb_model.set_char_model(char_model)
self.list_models.append(emb_model)
else:
raise Exception("Not an instance of embedding_model class.")
def get_vector_of_document(self, document):
"""
Get all embedding vectors for one document
:param document:
:return:
"""
doc_vector = []
# debug_dict = {}
# print ("len_doc = ", len(document))
for word in document:
all_vectors_of_word = []
# get all embedding vectors of a word
for emb_model in self.list_models:
emb_vector = emb_model.get_word_vector(word)
# print("len_emb_vector = ", len(emb_vector))
all_vectors_of_word.extend(emb_vector)
# if word in debug_dict.keys():
# debug_dict[word].append(len(emb_vector))
# else:
# debug_dict[word] = [len(emb_vector)]
# stack a combined vector of all words
doc_vector.append(all_vectors_of_word)
# print("list of words and emb size = ", debug_dict)
# get the mean of them to represent a document
doc_vector = np.mean(doc_vector, axis=0)
return doc_vector
def get_word_vector_of_multi_embeddings(self, word):
"""
Get all embedding vectors for one document
:param word:
:return:
"""
word_vector = []
for emb_model in self.list_models:
emb_vector = emb_model.get_word_vector(word)
word_vector.extend(emb_vector)
return word_vector
================================================
FILE: src/codes/embeddings/embedding_utils.py
================================================
import os
from utils import file_utils
from embeddings.embedding_models import Embedding_Model, Embedding_Models
def reload_char2vec_model(model_path, model_dim):
char_model = Embedding_Model("char2vec", model_dim)
char_model.load_model(model_path)
return char_model
def reload_embedding_models(model_paths_list, model_names_list, model_dims_list, char_model):
"""
Reload collection of embedding models to serve feature extraction task.
:param model_paths_list:
:param model_names_list:
:param model_dims_list:
:param char_model:
:return:
"""
# model path list and name list must be equal.
print("model_paths_list = ", model_paths_list)
print("model_formats_list = ", model_names_list)
assert (len(model_names_list) == len(model_paths_list)), "Not equal length"
assert (len(model_names_list) == len(model_dims_list)), "Not equal length"
all_emb_models = Embedding_Models([])
for model_idx in range(len(model_paths_list)):
# get model path based on index
model_path = model_paths_list[model_idx]
model_name = model_names_list[model_idx]
model_dim = model_dims_list[model_idx]
if model_path is not None:
emb_model = Embedding_Model(model_name, model_dim)
emb_model.load_model(model_path)
# add to final list of emb_models
all_emb_models.add_model(emb_model, char_model)
return all_emb_models
def save_embedding_models_tofolder(dir_path, final_embeddings, reverse_dictionary, vocabulary_size):
"""
Save all trained word-embedding model of the custom word2vec.
:param final_embeddings:
:param reverse_dictionary:
:param vocabulary_size:
:return:
"""
if not os.path.exists(dir_path):
os.makedirs(dir_path)
def save_to_word2vec_model(vocabs_list):
# print("Saving word2vec format ...")
filewriter = open(os.path.join(dir_path, "word2vec.txt"), "w", encoding="utf-8")
filewriter.write("%s %s\n" % (len(vocabs_list), len(final_embeddings[0])))
for word in vocabs_list:
word_idx = vocabs_list.index(word)
emb_vector = final_embeddings[word_idx]
line = ' '.join(["%s" % (x) for x in emb_vector])
filewriter.write(word + " " + line + "\n")
filewriter.close()
# print("Done!")
file_utils.save_obj(final_embeddings, os.path.join(dir_path, "final_embeddings"))
# We don't need to save reversed_dictionary
# file_utils.save_obj(reverse_dictionary, os.path.join(FLAGS.trained_models, "reversed_dictionary"))
vocab_list = [reverse_dictionary[i] for i in range(vocabulary_size)]
save_to_word2vec_model(vocab_list)
file_utils.save_obj(vocab_list, os.path.join(dir_path, "words_dictionary"))
def save_embedding_models(FLAGS, final_embeddings, reverse_dictionary, vocabulary_size):
"""
Keep for old implementation.
:param FLAGS:
:param final_embeddings:
:param reverse_dictionary:
:param vocabulary_size:
:return:
"""
save_embedding_models_tofolder(FLAGS.trained_models, final_embeddings,
reverse_dictionary, vocabulary_size)
def reload_embeddings(trained_models_dir):
"""
Reload trained word-embedding model of the custom word2vec.
:param trained_models_dir:
:return:
"""
final_embeddings = file_utils.load_obj(os.path.join(trained_models_dir, "final_embeddings"))
# reverse_dictionary = file_utils.load_obj(os.path.join(trained_models_dir, "reversed_dictionary"))
reverse_dictionary = None
labels = file_utils.load_obj(os.path.join(trained_models_dir, "words_dictionary"))
return final_embeddings, reverse_dictionary, labels
def create_single_utf8_file(input_dir, output_file):
import glob
# path = './wiki_data/*.txt'
# out = './wiki_all.vi.utf8.txt'
files = glob.glob(input_dir)
for file in files:
with open(output_file, "a") as myfile:
with open(file, "r") as fp:
for line in fp:
line = line.strip().lower()
line = line.decode('utf-8', 'ignore').encode("utf-8")
myfile.write(line)
print("done")
================================================
FILE: src/codes/etnlp_api.py
================================================
import argparse
from api import embedding_preprocessing, embedding_evaluator, embedding_extractor, embedding_visualizer
from visualizer import visualizer_sbs
import logging
import os
from embeddings.embedding_configs import EmbeddingConfigs
__version__ = "0.1.3"
embedding_config = EmbeddingConfigs()
if __name__ == "__main__":
"""
ETNLP: a toolkit for evaluate, extract, and visualize multiple word embeddings
"""
_desc = "Evaluates a word embedding model"
_parser = argparse.ArgumentParser(description=_desc)
_parser.add_argument("-input",
required=True,
default="../data/embedding_dicts/elmo_embeddings.txt",
#
help="model")
_parser.add_argument("-analoglist",
nargs="?",
# default="../data/embedding_analogies/vi/analogy_vn_seg.txt.std.txt",
default="./data/embedding_analogy/solveable_analogies_vi.txt",
help="testset")
_parser.add_argument("-args",
nargs="?",
default="eval",
help="Run evaluation")
_parser.add_argument("-lang",
nargs="?",
default="VI",
help="Specify language, by default, it's Vietnamese.")
_parser.add_argument("-vocab",
nargs="?",
default="../data/vocab.txt",
help="Vocab to be extracted")
_parser.add_argument("-port",
nargs="?",
default=8889,
help="Port for visualization")
_parser.add_argument("-input_c2v",
nargs="?",
default=None,
help="C2V embedding")
_parser.add_argument("-output",
nargs="?",
default="../data/embedding_analogies/vi/results_out.txt",
help="Output file of word analogy task")
_parser.add_argument("-output_format",
nargs="?",
default=".txt",
help="Format of output file of the extracted embedding.")
_args = _parser.parse_args()
# Set logging level
logging.basicConfig(level=logging.INFO)
logging.disable(logging.INFO)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '5'
input_embedding_files_str = _args.input
analoglist = _args.analoglist
is_vietnamese = _args.lang
output_files_str = _args.output
options_str = _args.args
vocab_file = _args.vocab
output_format = _args.output_format
port = _args.port
# By default, we process all embeddings as word2vec format.
embedding_preprocessing.is_word2vec_format = True
if options_str == 'eval':
print("Starting evaluator ...")
embedding_evaluator.evaluator_api(input_files=input_embedding_files_str, analoglist=analoglist,
output=output_files_str)
print("Done evaluator !")
elif options_str == 'visualizer':
print("Starting visualizer ...")
embedding_visualizer.visualize_multiple_embeddings(input_embedding_files_str, port)
print("Done visualizer !")
elif options_str.startswith("extract"):
print("Starting extractor ...")
embedding_extractor.extract_embedding_for_vocab_file(input_embedding_files_str, vocab_file,
_args.input_c2v, output_files_str, output_format)
print("Done extractor !")
elif options_str.startswith("glove2w2v"):
print("Starting pre-processing: convert to word2vec format ...")
embedding_config.is_word2vec_format = False
if options_str.__contains__("do_normalize"):
embedding_config.do_normalize_emb = True
else:
embedding_config.do_normalize_emb = False
embedding_preprocessing.load_and_save_2_word2vec_models(input_embedding_files_str,
output_files_str,
embedding_config)
else:
print("Invalid options")
print("Done!")
================================================
FILE: src/codes/requirements.txt
================================================
gensim==3.4.0
scipy==1.1.0
six==1.12.0
setuptools==40.6.2
tensorflow==1.12.0
Flask==1.0.2
tensorboard==1.12.0
numpy==1.15.4
scikit_learn==0.20.3
typing==3.6.6
================================================
FILE: src/codes/setup.py
================================================
from setuptools import setup, find_packages
from etnlp_api import __version__
with open("../../README.md", "r") as fh:
long_description = fh.read()
setup(
name='ETNLP',
version=__version__,
# packages=['api', 'utils', 'embeddings', 'visualizer'],
packages=find_packages(),
py_modules=['etnlp_api'],
long_description=long_description,
long_description_content_type="text/markdown",
url='https://github.com/vietnlp/etnlp',
license='MIT',
author='vietnlp',
author_email='sonvx.coltech@gmail.com',
description='ETNLP: Embedding Toolkit for NLP Tasks'
)
# from setuptools import setup, find_packages
# import sys
#
# with open('requirements.txt') as f:
# reqs = f.read()
# setup(
# name='ETNLP',
# version='0.1.0',
# description='ETNLP: Embedding Toolkit for NLP Tasks',
# python_requires='>=3.5',
# packages=find_packages(exclude=('data')),
# install_requires=reqs.strip().split('\n'),
# )
================================================
FILE: src/codes/utils/__init__.py
================================================
================================================
FILE: src/codes/utils/emb_utils.py
================================================
from sklearn.metrics.pairwise import cosine_similarity
from typing import Any, Iterable, List, Optional, Set, Tuple
from utils.vectors import Vector
from utils import vectors
from utils.word import Word
from utils import eval_utils
from gensim import utils as genutils
import logging
import numpy as np
from scipy import stats
# Timing info for most_similar (100k words):
# Original version: 7.3s
# Normalized vectors: 3.4s
logger = logging.getLogger(__name__)
def most_similar(base_vector: Vector, words: List[Word]) -> List[Tuple[float, Word]]:
"""Finds n words with smallest cosine similarity to a given word"""
words_with_distance = [(vectors.cosine_similarity_normalized(base_vector, w.vector), w) for w in words]
# We want cosine similarity to be as large as possible (close to 1)
sorted_by_distance = sorted(words_with_distance, key=lambda t: t[0], reverse=True)
# Sonvx: remove duplications (not understand why yet, probably because the w2v?)
# sorted_by_distance = list(set(sorted_by_distance))
return sorted_by_distance
def print_most_similar(words: List[Word], text: str) -> None:
base_word = find_word(text, words)
if not base_word:
print("Unknown word: %s"%(text))
return
print("Words related to %s:" % (base_word.text))
sorted_by_distance = [
word.text for (dist, word) in
most_similar(base_word.vector, words)
if word.text.lower() != base_word.text.lower()
]
print(', '.join(sorted_by_distance[:10]))
def read_word() -> str:
return input("Type a word: ")
def find_word(text: str, words: List[Word]) -> Optional[Word]:
try:
return next(w for w in words if text == w.text)
except StopIteration:
return None
def closest_analogies_OLD(
left2: str, left1: str, right2: str, words: List[Word]
) -> List[Tuple[float, Word]]:
word_left1 = find_word(left1, words)
word_left2 = find_word(left2, words)
word_right2 = find_word(right2, words)
if (not word_left1) or (not word_left2) or (not word_right2):
return []
vector = vectors.add(
vectors.sub(word_left1.vector, word_left2.vector),
word_right2.vector)
closest = most_similar(vector, words)[:10]
def is_redundant(word: str) -> bool:
"""
Sometimes the two left vectors are so close the answer is e.g.
"shirt-clothing is like phone-phones". Skip 'phones' and get the next
suggestion, which might be more interesting.
"""
word_lower = word.lower()
return (
left1.lower() in word_lower or
left2.lower() in word_lower or
right2.lower() in word_lower)
closest_filtered = [(dist, w) for (dist, w) in closest if not is_redundant(w.text)]
return closest_filtered
def closest_analogies_vectors(
word_left2: Word, word_left1: Word, word_right2: Word, words: List[Word]) \
-> List[Tuple[float, Word]]:
"""
Sonvx:
:param word_left2:
:param word_left1:
:param word_right2:
:param words:
:param remove_redundancy: remove suggestions if they contain the given words.
:return:
"""
# print(">>>> Remove redundancy = ", remove_redundancy)
# input(">>>>")
vector = vectors.add(
vectors.sub(word_left1.vector, word_left2.vector),
word_right2.vector)
closest = most_similar(vector, words)[:10]
def is_redundant(word: str) -> bool:
"""
Sometimes the two left vectors are so close the answer is e.g.
"shirt-clothing is like phone-phones". Skip 'phones' and get the next
suggestion, which might be more interesting.
"""
word_lower = word.lower()
return (
word_left1.text.lower() in word_lower or
word_left2.text.lower() in word_lower or
word_right2.text.lower() in word_lower)
# It doesn't work this way for Vietnamese, so we try both of this to test for now
if False:
closest_filtered = [(dist, w) for (dist, w) in closest if not is_redundant(w.text)]
else:
closest_filtered = closest
return closest_filtered
def get_avg_vector(word, embedding_words):
if " " in word:
single_words = word.split(" ")
list_vector = []
for single_word in single_words:
word_vec = find_word(single_word, embedding_words)
if word_vec:
list_vector.append(word_vec.vector)
else:
# Try again with lowercase
single_word = single_word.lower()
word_vec = find_word(single_word, embedding_words)
if word_vec:
list_vector.append(word_vec.vector)
# print("list_vector: ", list_vector)
# input(">>>>>>>>")
returned_Word = Word(word, vectors.mean_list(list_vector), 1)
else:
returned_Word = find_word(word, embedding_words)
# print("Avg returned vector = ", returned_vector)
# input(">>>>")
return returned_Word
def run_paired_ttests(all_map_arr, embedding_names):
"""
Run Paired t-tests on MAP results
:param all_map_arr:
:param embedding_names:
:return:
"""
str_out = ""
num_embs = len(all_map_arr)
# Verify to make sure they have the same length
if all_map_arr and embedding_names:
for i in range(0, num_embs - 1):
for j in range(i + 1, num_embs):
if len(all_map_arr[i]) != len(all_map_arr[j]):
raise Exception("Two embedding (%s, %s) have different MAP list, sizes: %s vs. %s"
% (embedding_names[i], embedding_names[j], len(all_map_arr[i]), len(all_map_arr[j])))
else:
logging.error("Inputs are NULL")
result_str_ttest_arr = []
for i in range(0, num_embs - 1):
for j in range(i + 1, num_embs):
stat_test_ret = stats.ttest_rel(all_map_arr[i], all_map_arr[j])
# if stat_test_ret.pvalue >= 0.05:
result = "%s vs. %s: %s" % (embedding_names[i], embedding_names[j], stat_test_ret)
str_out += result + "\n"
return str_out
def eval_word_analogy_4_all_embeddings(word_analogies_file, embedding_names: List[str],
word_embeddings: List[List[Word]], output_file):
"""
Run word analogy for all embeddings
:param word_analogies_file:
:param embedding_names:
:param word_embeddings:
:param output_file:
:return:
"""
fwriter = open(output_file, "w")
idx = 0
all_map_arr = []
console_output_str = ""
category = ": | Word Analogy Task results\n"
fwriter.write(category)
console_output_str += category
for word_embedding in word_embeddings:
embedding_name = embedding_names[idx]
map_at_10, map_arr, result_str = eval_word_analogies(word_analogies_file, word_embedding, embedding_name)
all_map_arr.append(map_arr)
meta_info = "\nEmbedding: %s"%(embedding_names[idx])
fwriter.write(meta_info + "\n")
fwriter.write(result_str)
fwriter.write("MAP_arr = %s"%(map_arr))
fwriter.write("MAP@10 = %s" % (map_at_10))
fwriter.flush()
console_output_str += meta_info + "\n" + "MAP@10 = %s" % (map_at_10) + "\n"
idx += 1
# Getting significant Paired t-tests
category = "\n: | Paired t-tests results\n"
fwriter.write(category)
console_output_str += category
ttests_result = run_paired_ttests(all_map_arr, embedding_names)
console_output_str += ttests_result
fwriter.write(ttests_result)
fwriter.flush()
fwriter.close()
return console_output_str
def eval_word_analogies(word_analogies_file, words: List[Word], embedding_name):
"""
Sonvx: Evaluate word analogy for one embedding.
:param word_analogies_file:
:param words:
:return:
"""
# input("GO checking >>>>")
oov_counter, idx_cnt, is_vn_counter, phrase_cnt = 0, -1, 0, 0
sections, section = [], None
# map_arr = []
out_str = ""
map_ret_dict = {}
for line_no, line in enumerate(genutils.smart_open(word_analogies_file)):
# TODO: use level3 BLAS (=evaluate multiple questions at once), for speed
line = genutils.to_unicode(line)
line = line.rstrip()
if line.startswith(': |'):
# a new section starts => store the old section
if section:
sections.append(section)
section = {'section': line.lstrip(': ').strip(), 'correct': [], 'incorrect': []}
else:
# Count number of analogy to check
idx_cnt += 1
# Set default map value
map_ret_dict[idx_cnt] = 0.0
if not section:
raise ValueError("missing section header before line #%i in %s" % (line_no, word_analogies_file))
try:
# a - b + c = expected
# Input: Baghdad | Irac | Bangkok | Thai_Lan
# Baghdad - Irac = Bangkok - Thai_Lan
# -> Baghdad - Irac + Thai_Lan = Bangkok
# =>
a, b, expected, c = [word for word in line.split(" | ")]
except ValueError:
logger.debug("SVX: ERROR skipping invalid line #%i in %s", line_no, word_analogies_file)
print("Line : ", line)
print("a, b, c, expected: %s, %s, %s, %s" % (a, b, c, expected))
# input(">>> Wait ...")
continue
# In case of Vietnamese, word analogy can be a phrase
if " " in expected:
print("INFO: we don't support to find word analogies for phrase for NOW.")
phrase_cnt += 1
continue
elif " " in a or " " in b or " " in c:
is_vn_counter += 1
word_left1 = get_avg_vector(a, words)
word_left2 = get_avg_vector(b, words)
word_right2 = get_avg_vector(c, words)
else:
word_left1 = find_word(a, words)
word_left2 = find_word(b, words)
word_right2 = find_word(c, words)
if (not word_left1) or (not word_left2) or (not word_right2):
logger.debug("SVX: skipping line #%i with OOV words: %s", line_no, line.strip())
oov_counter += 1
continue
# Write solable analogy to a file
# fsolveable_writer.write(line + "\n")
logger.debug("word_left1 = %s", word_left1.text)
logger.debug("word_left2 = %s", word_left2.text)
logger.debug("word_right2 = %s", word_right2.text)
# Start finding close word:
# Note: we can only find 1 expected word in Vietnamese for NOW
top10_candidate = closest_analogies_vectors(word_left2, word_left1,
word_right2, words)
list_candidate_arr = []
for tuple in top10_candidate:
list_candidate_arr.append(tuple[1].text)
logger.debug("Expected Word: %s, candidate = %s" % (expected, list_candidate_arr))
# input(">>>>>")
# Calculate MAP@10 score
this_map_result = eval_utils.mapk(expected, list_candidate_arr, word_level=True)
if this_map_result >= 0:
this_map_result = round(this_map_result, 6)
# map_arr[idx_cnt] = this_map_result
else:
this_map_result = 0.0
# map_arr.append(0.0)
# map_arr[idx_cnt] = this_map_result
map_ret_dict[idx_cnt] = this_map_result
if expected in list_candidate_arr:
section['correct'].append((a, b, c, expected))
out_line = "%s - %s + %s = ?; Expect: %s, candidate: %s" % \
(word_left1, word_left2, word_right2, expected, list_candidate_arr)
out_str += out_line + "\n"
# else:
# section['incorrect'].append((a, b, c, expected))
# fsolveable_writer.close()
if section:
# store the last section, too
sections.append(section)
map_arr = list(map_ret_dict.values())
logger.debug("map_arr = ", map_arr)
logger.debug("MAP_RET_DICT = ", map_ret_dict)
# input("Check result dict: >>>>>")
total = {
"Emb_Name: " + embedding_name + '/OOV/Total/VN_Solveable_Cases/VN_Phrase_Target':
[oov_counter, (idx_cnt + 1), is_vn_counter, phrase_cnt],
'MAP@10': np.mean(map_arr)
# ,
# 'section': 'total'
# ,
# 'correct': sum((s['correct'] for s in sections), []),
# 'incorrect': sum((s['incorrect'] for s in sections), []),
}
# print (out_str)
# print(total)
# logger.info(total)
sections.append(total)
sections_str = "\n%s\n" % sections
return np.mean(map_arr), map_arr, sections_str
def print_analogy(left2: str, left1: str, right2: str, words: List[Word]) -> None:
analogies = closest_analogies_OLD(left2, left1, right2, words)
if (len(analogies) == 0):
# print(f"{left2}-{left1} is like {right2}-?")
print("%s-%s is like %s-?"%(left2, left1, right2))
# man-king is like woman-king
# input: man is to king is like woman is to ___?(queen).
else:
(dist, w) = analogies[0]
# alternatives = ', '.join([f"{w.text} ({dist})" for (dist, w) in analogies])
# print(f"{left2}-{left1} is like {right2}-{w.text}")
print("%s-%s is like %s-%s"%(left2, left1, right2, w.text))
================================================
FILE: src/codes/utils/embedding_io.py
================================================
from typing import Iterable, List, Set
from itertools import groupby
import numpy as np
import re
import utils.vectors as v
from utils.word import Word
import logging
import os
from embeddings.embedding_configs import EmbeddingConfigs
def save_model_to_file(embedding_model: List[Word], model_file_out: str):
"""
Save loaded model back to file (to remove duplicated items).
:param embedding_model:
:param model_file_out:
:return:
"""
fwriter = open(model_file_out, "w")
meta_data = "%s %s\n"%(len(embedding_model), len(embedding_model[0].vector))
fwriter.write(meta_data)
fwriter.flush()
for w_Word in embedding_model:
line = w_Word.text + " " + " ".join(str(scalar) for scalar in w_Word.vector.tolist())
fwriter.write(line + "\n")
fwriter.flush()
fwriter.close()
def load_word_embeddings(file_paths: str, emb_config: EmbeddingConfigs) -> List[List[Word]]:
"""
Sonvx: load multiple embeddings: e.g., ;
:param file_paths:
:param emb_config:
:return:
"""
embedding_models = []
embedding_names = []
if file_paths and file_paths.__contains__(";"):
files = file_paths.split(";")
for emb_file in files:
word_embedding = load_word_embedding(emb_file.replace("\"", ""), emb_config)
embedding_name = os.path.basename(os.path.normpath(emb_file))
embedding_models.append(word_embedding)
embedding_names.append(embedding_name)
else:
return [load_word_embedding(file_paths), emb_config]
return embedding_names, embedding_models
def load_word_embedding(file_path: str, emb_config: EmbeddingConfigs) -> List[Word]:
"""
Load and cleanup the data.
:param file_path:
:param emb_config:
:return:
"""
# print(f"Loading {file_path}...")
print("Loading %s ..."%(file_path))
words = load_words_raw(file_path, emb_config)
# print(f"Loaded {len(words)} words.")
print("Loaded %s words." %(len(words)))
# Test
word1 = words[1]
print("Vec Len(word1) = ", len(word1.vector))
# num_dimensions = most_common_dimension(words)
# words = [w for w in words if len(w.vector) == dims]
# print(f"Using {num_dimensions}-dimensional vectors, {len(words)} remain.")
# words = remove_stop_words(words)
# print(f"Removed stop words, {len(words)} remain.")
# ords = remove_duplicates(words)
# print(f"Removed duplicates, {len(words)} remain.")
logging.debug("Embedding words: ", words[:10])
print("Emb_vocab_size = ", len(words))
# input("Done loading embedding: >>>>")
return words
def load_words_raw(file_path: str, emb_config: EmbeddingConfigs) -> List[Word]:
"""
Load the file as-is, without doing any validation or cleanup.
:param file_path:
:param emb_config:
:return:
"""
def parse_line(line: str, frequency: int) -> Word:
# print("Line=", line)
tokens = line.split(" ")
word = tokens[0]
if emb_config.do_normalize_emb:
vector = v.normalize(np.array([float(x) for x in tokens[1:]]))
else:
vector = np.array([float(x) for x in tokens[1:]])
return Word(word, vector, frequency)
# Sonvx: NOT loading the same word twice.
unique_dict = {}
words = []
# Words are sorted from the most common to the least common ones
frequency = 1
duplicated_entry = 0
idx_counter, vocab_size, emb_dim = 0, 0, 0
with open(file_path) as f:
for line in f:
line = line.rstrip()
# print("Processing line: ", line)
if idx_counter == 0 and emb_config.is_word2vec_format:
try:
meta_info = line.split(" ")
vocab_size = int(meta_info[0])
emb_dim = int(meta_info[1])
idx_counter += 1
continue
except Exception as e:
print("meta_info = "%(meta_info))
logging.error("Input embedding has format issue: Error = %s" % (e))
# if len(line) < 20: # Ignore the first line of w2v format.
# continue
w = parse_line(line, frequency)
# Svx: only load if the word is not existed in the list.
if w.text not in unique_dict:
unique_dict[w.text] = frequency
words.append(w)
frequency += 1
else:
duplicated_entry += 1
# print("Loading the same word again")
# # Svx: check if the embedding dim is the same with the metadata, random check only
if idx_counter == 10:
if len(w.vector) != emb_dim:
message = "Metadata and the real vector size do not match: meta:real = %s:%s" \
% (emb_dim, len(w.vector))
logging.error(message)
raise ValueError(message)
idx_counter += 1
if duplicated_entry > 0:
logging.debug("Loading the same word again: %s"%(duplicated_entry))
# Final check:
if (frequency - 1) != vocab_size:
msg = "Loaded %s/%s unique vocab." % ((frequency - 1), vocab_size)
logging.info(msg)
return words
def iter_len(iter: Iterable[complex]) -> int:
return sum(1 for _ in iter)
def most_common_dimension(words: List[Word]) -> int:
"""
There is a line in the input file which is missing a word
(search -0.0739, -0.135, 0.0584).
"""
lengths = sorted([len(word.vector) for word in words])
dimensions = [(k, iter_len(v)) for k, v in groupby(lengths)]
print("Dimensions:")
for (dim, num_vectors) in dimensions:
# print(f"{num_vectors} {dim}-dimensional vectors")
print("%s %s-dimensional vectors"%(num_vectors, dim))
most_common = sorted(dimensions, key=lambda t: t[1], reverse=True)[0]
return most_common[0]
# We want to ignore these characters,
# so that e.g. "U.S.", "U.S", "US_" and "US" are the same word.
ignore_char_regex = re.compile("[\W_]")
# Has to start and end with an alphanumeric character
is_valid_word = re.compile("^[^\W_].*[^\W_]$")
def remove_duplicates(words: List[Word]) -> List[Word]:
seen_words: Set[str] = set()
unique_words: List[Word] = []
for w in words:
canonical = ignore_char_regex.sub("", w.text)
if not canonical in seen_words:
seen_words.add(canonical)
# Keep the original ordering
unique_words.append(w)
return unique_words
def remove_stop_words(words: List[Word]) -> List[Word]:
return [w for w in words if (
len(w.text) > 1 and is_valid_word.match(w.text))]
# Run "smoke tests" on import
assert [w.text for w in remove_stop_words([
Word('a', [], 1),
Word('ab', [], 1),
Word('-ab', [], 1),
Word('ab_', [], 1),
Word('a.', [], 1),
Word('.a', [], 1),
Word('ab', [], 1),
])] == ['ab', 'ab']
assert [w.text for w in remove_duplicates([
Word('a.b', [], 1),
Word('-a-b', [], 1),
Word('ab_+', [], 1),
Word('.abc...', [], 1),
])] == ['a.b', '.abc...']
================================================
FILE: src/codes/utils/eval_utils.py
================================================
"""
MAP@K word level and character level are explained in detail in this paper:
dpUGC: Learn Differentially Private Representationfor User Generated Contents
Xuan-Son Vu, Son N. Tran, Lili Jiang
In: Proceedings of the 20th International Conference on Computational Linguistics and
Intelligent Text Processing, April, 2019, (to appear)
Please cite the above paper if you use codes in this file.
"""
def apk(actual, predicted, k=10):
"""
Computes the average precision at k.
This function computes the average prescision at k between two lists of
items.
Parameters
----------
actual : list
A list of elements that are to be predicted (order doesn't matter)
predicted : list
A list of predicted elements (order does matter)
k : int, optional
The maximum number of predicted elements
Returns
-------
score : double
The average precision at k over the input lists
"""
if len(predicted) > k:
predicted = predicted[:k]
score = 0.0
num_hits = 0.0
for i, p in enumerate(predicted):
if p in actual and p not in predicted[:i]:
num_hits += 1.0
score += num_hits / (i + 1.0)
if not actual:
return 0.0
return score / min(len(actual), k)
def mapk(actual, predicted, k=10, word_level=True):
"""
Computes the mean average precision at k.
This function computes the mean average prescision at k between two lists
of lists of items.
Parameters
----------
actual : list
A list of lists of elements that are to be predicted
(order doesn't matter in the lists)
predicted : list
A list of lists of predicted elements
(order matters in the lists)
k : int, optional
The maximum number of predicted elements
Returns
-------
score : double
The mean average precision at k over the input lists
"""
# print("Sending arr = ", arr)
if word_level:
return calc_map(actual, predicted, topK=k)
else:
# arr = [apk(a, p, k) for a, p in zip(actual, predicted)]
# return np.mean(arr)
return calc_map_character_level(actual, predicted, topK=k)
def calc_map(actual, predicted, topK=10):
"""
:param actual:
:param predicted:
:param topK:
:return:
"""
# print("Input: actual %s, predicted %s"%(actual, predicted))
if len(predicted) > topK:
predicted = predicted[:topK]
idx = 1
hit = 0
map_arr = []
for answer in predicted:
if answer in actual[:topK]:
hit += 1
val = (hit * 1.0) / (idx * 1.0)
# print("hit = %s, idx = %s"%(hit, idx))
map_arr.append(val)
# print("hit: %s, map_arr = %s"%(answer, map_arr))
idx += 1
# print("map_arr = %s done", map_arr)
if len(map_arr) > 0:
return np.mean(map_arr)
else:
return 0.0
def calc_map_character_level(actual, predicted, topK=10):
"""
:param actual:
:param predicted:
:param topK:
:return:
"""
# print("Input: actual %s, predicted %s" % (actual, predicted))
if len(predicted) > topK:
predicted = predicted[:topK]
if len(actual) > topK:
actual = actual[:topK]
rank = 1
hit = 0
actual_seq = ''.join([word for word in actual])
predicted_seq = ''.join([word for word in predicted])
map_arr = []
for char in predicted_seq:
if char in actual_seq[:rank]:
hit += 1
val = (hit * 1.0) / (rank * 1.0)
# print("hit = %s, idx = %s" % (hit, rank))
map_arr.append(val)
# print("hit: %s, map_arr = %s" % (char, map_arr))
rank += 1
# print("map_arr = %s done", map_arr)
return np.mean(map_arr)
import unittest
import numpy as np
def test_apk(self):
self.assertAlmostEqual(apk(range(1, 6), [6, 4, 7, 1, 2], 2), 0.25)
self.assertAlmostEqual(apk(range(1, 6), [1, 1, 1, 1, 1], 5), 0.2)
predicted = range(1, 21)
predicted.extend(range(200, 600))
self.assertAlmostEqual(apk(range(1, 100), predicted, 20), 1.0)
def test_mapk(self):
self.assertAlmostEqual(mapk([range(1, 5)], [range(1, 5)], 3), 1.0)
self.assertAlmostEqual(mapk([[1, 3, 4], [1, 2, 4], [1, 3]],
[range(1, 6), range(1, 6), range(1, 6)], 3), 0.685185185185185)
self.assertAlmostEqual(mapk([range(1, 6), range(1, 6)],
[[6, 4, 7, 1, 2], [1, 1, 1, 1, 1]], 5), 0.26)
self.assertAlmostEqual(mapk([[1, 3], [1, 2, 3], [1, 2, 3]],
[range(1, 6), [1, 1, 1], [1, 2, 1]], 3), 11.0 / 18)
if __name__ == '__main__':
a1 = ["1", '2', '3', '4']
b1 = ['1', '5', '2', '8']
print(mapk(a1, b1, 4))
a1 = ["15"]
b1 = ["1", "2", "3", "4", "5","6","7","8","9","10"]
print("MapK:", mapk(a1, b1, 4))
# unittest.main()
================================================
FILE: src/codes/utils/file_utils.py
================================================
import pickle
def save_obj(obj, file_path):
with open(file_path + '.pkl', 'wb') as f:
pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
def load_obj(file_path):
with open(file_path + '.pkl', 'rb') as f:
return pickle.load(f)
def get_unique_vocab(analogy_file_path, write_out_file):
"""
:param analogy_file_path:
:param write_out_file:
:return:
"""
vocab_dict = {}
with open(analogy_file_path, "r") as freader:
for line in freader:
if line.__contains__(" | "):
word_parts = line.split(" | ")
for word in word_parts:
word = word.rstrip()
vocab_dict[word] = 0
fwriter = open(write_out_file, "w")
for word in vocab_dict.keys():
fwriter.write(word + "\n")
fwriter.close()
print("Write dictionary file to %s"%(write_out_file))
return vocab_dict
if __name__ == '__main__':
get_unique_vocab("../data/embedding_analogies/portuguese/LX-4WAnalogies-ETNLP.txt",
"../data/embedding_analogies/portuguese/vocab.txt")
================================================
FILE: src/codes/utils/string_utils.py
================================================
import six
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
================================================
FILE: src/codes/utils/vectors.py
================================================
from typing import List, Any, Optional
import math
import numpy as np
# Adopt from https://github.com/mkonicek/nlp/vecters.py
# Vector = np.ndarray[float]
Vector = 'np.ndarray[float]'
vector_type = 'np.ndarray[float]'
# Vector = np.ndarray(dtype=float)
def l2_len(v: vector_type) -> float:
return math.sqrt(np.dot(v, v))
def dot(v1: vector_type, v2: vector_type) -> float:
assert v1.shape == v2.shape
return np.dot(v1, v2)
def mean(v1: vector_type, v2: vector_type) -> Vector:
"""
Added by Sonvx: get mean of 2 vectors.
:param v1:
:param v2:
:return:
"""
assert v1.shape == v2.shape
return np.mean([v1, v2], axis=0)
def mean_list(v1: List[Vector]) -> Vector:
"""
Added by Sonvx: get mean of 2 vectors.
:param v1:
:return:
"""
if len(v1) > 0:
return np.mean(v1, axis=0)
else:
return None
def add(v1: vector_type, v2: vector_type) -> Vector:
assert v1.shape == v2.shape
return np.add(v1, v2)
def sub(v1: vector_type, v2: vector_type) -> Vector:
assert v1.shape == v2.shape
return np.subtract(v1, v2)
def normalize(v: vector_type) -> Vector:
return v / l2_len(v)
def cosine_similarity_normalized(v1: vector_type, v2: vector_type) -> float:
"""
Returns the cosine of the angle between the two vectors.
Each of the vectors must have length (L2-norm) equal to 1.
Results range from -1 (very different) to 1 (very similar).
"""
return dot(v1, v2)
================================================
FILE: src/codes/utils/word.py
================================================
from typing import List
from utils.vectors import Vector
# Adopt from https://github.com/mkonicek/nlp/Word.py
class Word:
"""A single word (one line of the input vector embedding file)"""
def __init__(self, text: str, vector: Vector, frequency: int) -> None:
self.text = text
self.vector = vector
self.frequency = frequency
def __repr__(self) -> str:
vector_preview = ', '.join(map(str, self.vector[:2]))
# return f"{self.text} [{vector_preview}, ...]"
return "%s [%s, ...]"%(self.text, vector_preview)
================================================
FILE: src/codes/visualizer/README.md
================================================
# Requirements:
- ```pip install gensim flask```
- Download any pre-trained embeddings and put it into ../03.run_etnlp_visualizer_inter.sh
# How to run
> 1. sh ../03.run_etnlp_visualizer_inter.sh
> 2. Visit http://localhost:8089
# Screenshot

================================================
FILE: src/codes/visualizer/__init__.py
================================================
================================================
FILE: src/codes/visualizer/outof_w2vec.dict
================================================
'news'
'news'
'news'
'news'
'news'
'news'
'news'
'back'
'back'
'back'
'back'
'news'
'news'
'back'
'back'
'back'
'back'
'news'
'news'
'back'
'back'
'back'
'back'
'news'
'news'
'lovely'
'lovely'
'lovely'
'lovely'
'love'
'love'
================================================
FILE: src/codes/visualizer/static/style.css
================================================
.container-4{
overflow: hidden;
width: 300px;
vertical-align: middle;
white-space: nowrap;
}
.container-4 input#search{
width: 300px;
height: 50px;
background: #2b303b;
border: none;
font-size: 10pt;
float: left;
color: #fff;
padding-left: 15px;
-webkit-border-radius: 5px;
-moz-border-radius: 5px;
border-radius: 5px;
}
.container-4 input#search::-webkit-input-placeholder {
color: #65737e;
}
.container-4 input#search:-moz-placeholder { /* Firefox 18- */
color: #65737e;
}
.container-4 input#search::-moz-placeholder { /* Firefox 19+ */
color: #65737e;
}
.container-4 input#search:-ms-input-placeholder {
color: #65737e;
}
.container-4 button.icon{
-webkit-border-top-right-radius: 5px;
-webkit-border-bottom-right-radius: 5px;
-moz-border-radius-topright: 5px;
-moz-border-radius-bottomright: 5px;
border-top-right-radius: 5px;
border-bottom-right-radius: 5px;
border: none;
background: #232833;
height: 50px;
width: 50px;
color: #4f5b66;
opacity: 0;
font-size: 10pt;
-webkit-transition: all .55s ease;
-moz-transition: all .55s ease;
-ms-transition: all .55s ease;
-o-transition: all .55s ease;
transition: all .55s ease;
}
.container-4:hover button.icon, .container-4:active button.icon, .container-4:focus button.icon{
outline: none;
opacity: 1;
margin-left: -50px;
}
.container-4:hover button.icon:hover{
background: white;
}
div#answers {
background-color: #f2f2f2;
padding-top: 2px;
padding-bottom: 2px;
padding-left: 100px;
}
================================================
FILE: src/codes/visualizer/templates/app.html
================================================
Title