Full Code of eyaler/word2vec-slim for AI

Repository: eyaler/word2vec-slim
Branch: master
Commit: 35249de20187
Files: 3
Total size: 5.3 KB

Directory structure:
gitextract_ktilvx5m/

├── .gitattributes
├── README.md
└── source/
    └── word2vec-slim.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitattributes
================================================
GoogleNews-vectors-negative300-SLIM.bin.gz filter=lfs diff=lfs merge=lfs -text


================================================
FILE: README.md
================================================
# GoogleNews-vectors-negative300-SLIM

tl;dr: Filter down Google News word2vec model from 3 million words to 300k, by crossing it with English dictionaries.

In several projects I've been using the [word2vec](https://code.google.com/archive/p/word2vec/) pre-trained Google News model [(GoogleNews-vectors-negative300)](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM)
with the [gensim](https://radimrehurek.com/gensim/) Python library.
The model was trained over a 3 billion word corpus, and contains 3 million words (of which ~930k are NOT phrases, i.e. do not contain underscores).
The compressed file size is 1.6 GB, and it takes over 3 minutes to load in gensim on my laptop.

As many words are less useful for my use cases (e.g. Chinese names), I made a slimmer version which saves on disk space, loading time and memory.

I found several large English word lists at [github.com/dwyl/english-words](https://github.com/dwyl/english-words).
I combined all the words found in the files: words.txt, words2.txt and words3.txt, and converted to lowercase.
This gave a total of 466,920 unique words.
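The merge step above can be sketched as follows; the one-word-per-line layout of the source files is an assumption, not verified here:

```python
# A minimal sketch of the merge: union the word lists, lowercased.
# Assumes each input is a text blob with one word per line.
def combine_word_lists(texts):
    words = set()
    for text in texts:
        words |= {line.strip().lower() for line in text.splitlines() if line.strip()}
    return words
```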

We could use the above to filter word2vec, however we would lose some contemporary words of the zeitgeist, absent from the outdated dictionaries, e.g. "feminazi", "douchebag", "bukkake", "hashtag", "meme", "transgender", "metrosexual", "polyamory", as well as "google" and "facebook".
 
Such words can be found in the [Urban Dictionary](http://www.urbandictionary.com/), and fortunately the complete word list from March 2016 can be found at [github.com/mattbierner/urban-dictionary-entry-collector](https://github.com/mattbierner/urban-dictionary-entry-collector).
It contains ~2 million entries, of which ~1.5 million are unique after lowercasing (and of those, ~800k are NOT phrases, i.e. do not contain spaces).
This is quite noisy, and we can filter it down on `max(thumbs up vote)>=50`, leaving 86,724 spaceless words (of which 55k are not contained in our previous word list, comparing by lowercase).
Combining with the previous word list, we get a total of 521,924 unique words.
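The vote filter can be sketched as below; the `(word, max_up_votes)` pair format is a hypothetical stand-in for the actual layout of the Urban Dictionary dump:

```python
# Keep spaceless entries whose best ("max thumbs up") vote reaches the
# threshold, deduplicating by lowercase. The (word, votes) pair shape
# is an assumption, not the real dump format.
def filter_urban(entries, min_votes=50):
    best = {}
    for word, votes in entries:
        if ' ' in word:  # phrases contain spaces; drop them
            continue
        w = word.lower()
        best[w] = max(best.get(w, 0), votes)
    return {w for w, v in best.items() if v >= min_votes}
```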

I then filtered the Google News words, retaining only those whose lowercase version appears in my combined word list.
This leaves 288,751 words (28,534 due to Urban Dictionary).

I was still missing some inflections that appear in the word2vec model but whose base form alone is in the word list, e.g. 'antisemites' or 'feminazis'.
This can be handled by also accepting the base form of long words (`min_base_len = 8`), obtained by truncating a short suffix (`max_suffix_len = 2`).
The above parameter choice added 10,816 words.
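The suffix-grace rule, which mirrors the check in source/word2vec-slim.py below, can be stated as a small predicate:

```python
# A word passes if it is in the dicts directly, or if truncating up to
# max_suffix_len trailing characters leaves a known base that is at
# least min_base_len characters long.
def in_dicts(word, words, min_base_len=8, max_suffix_len=2):
    l = word.strip().lower()
    if l in words:
        return True
    for s in range(1, max_suffix_len + 1):
        if len(l) - s < min_base_len:
            break
        if l[:-s] in words:
            return True
    return False
```

With `{'feminazi', 'antisemite'}` as the word list, `'feminazis'` and `'antisemites'` both pass, while a short inflection such as `'cats'` does not, since its truncated base falls under `min_base_len`.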
 
The final slim model has 299,567 words, saved in a 270 MB compressed word2vec format file, and loads in 20 seconds on my laptop. 

Notes:

1. If you are filtering words by their vocab dictionary index, note that these indices have been updated according to the smaller container size.
2. You can find the filtered down Urban Dictionary word list (sorted by decreasing max up vote) [here](https://github.com/eyaler/word2vec-slim/blob/master/source/dicts/urban50.txt.gz).
3. You will need to install [git lfs](https://git-lfs.github.com/) to clone this repository. If you cannot download the model file because the LFS data quota has been exceeded, please let me know!


================================================
FILE: source/word2vec-slim.py
================================================
# Filter the GoogleNews word2vec model down to the words found in the
# dictionaries under ./dicts, also allowing inflected words whose base
# form (after truncating a short suffix) appears in the dicts.
# Note: this script uses the legacy gensim (pre-1.0) API, where
# Word2Vec exposes load_word2vec_format(), .vocab, .index2word and .syn0.

from gensim.models import word2vec
import time
import numpy as np
import gzip
import os

model_folder = 'd:/data'
model_filename = 'GoogleNews-vectors-negative300.bin.gz'
slim_filename = 'GoogleNews-vectors-negative300-SLIM.bin.gz'

max_suffix_len = 2
min_base_len = 8

words = set()
for dict_filename in os.listdir('dicts'):
    with gzip.open('dicts/' + dict_filename, 'rt', encoding='utf8') as f:
        lines = f.readlines()
        temp = {line.strip().lower() for line in lines}
        print('%s: %d -> %d' % (dict_filename, len(lines), len(temp)))
    words |= temp
print('combined: %d' % len(words))

start = time.time()
model = word2vec.Word2Vec.load_word2vec_format(model_folder + '/' + model_filename, binary=True)
print('Finished loading original model %.2f min' % ((time.time()-start)/60))
print('word2vec: %d' % len(model.vocab))
print('non-phrases: %d' % len([w for w in model.vocab.keys() if '_' not in w]))

indices_to_delete = []
j = 0  # next index in the compacted vocabulary
suffix_grace_words = 0
for i, w in enumerate(model.index2word):
    l = w.strip().lower()
    found = l in words
    if not found:
        # suffix grace: accept an inflected form if truncating up to
        # max_suffix_len characters leaves a known base of at least
        # min_base_len characters
        for s in range(1, 1 + max_suffix_len):
            if len(l) - s < min_base_len:
                break
            elif l[:-s] in words:
                suffix_grace_words += 1
                found = True
                break

    if found:
        model.vocab[w].index = j  # reindex into the smaller container
        j += 1
    else:
        del model.vocab[w]
        indices_to_delete.append(i)

# drop the corresponding embedding rows in one pass
model.syn0 = np.delete(model.syn0, indices_to_delete, axis=0)
print('slim: %d' % len(model.vocab))
print('suffix grace words: %d' % suffix_grace_words)

model.save_word2vec_format(model_folder + '/' + slim_filename, binary=True)
del model

start = time.time()
model = word2vec.Word2Vec.load_word2vec_format(model_folder + '/' + slim_filename, binary=True)
print('Finished loading slim model %.1f sec' % ((time.time()-start)))