Showing preview only (306K chars total). Download the full file or copy to clipboard to get everything.
Repository: chewxy/lingo
Branch: master
Commit: 491e816b48d4
Files: 128
Total size: 278.9 KB
Directory structure:
gitextract_whqjv2y6/
├── .gitignore
├── .travis.yml
├── CONTRIBUTING.md
├── CONTRIBUTORS.md
├── LICENSE
├── POSTag.go
├── POSTag_stanford.go
├── POSTag_stanford_string.go
├── POSTag_universal.go
├── POSTag_universal_string.go
├── README.md
├── annotation.go
├── annotationSet.go
├── annotationSet_bench_test.go
├── browncluster.go
├── cmd/
│ ├── demo/
│ │ ├── io.go
│ │ ├── main.go
│ │ └── nlp.go
│ ├── dep/
│ │ ├── fixer.go
│ │ ├── io.go
│ │ ├── main.go
│ │ ├── pipeline.go
│ │ └── train.go
│ ├── lexer/
│ │ └── main.go
│ └── pos/
│ ├── crossvalidation.go
│ ├── fixer.go
│ └── main.go
├── const.go
├── corpus/
│ ├── consopt.go
│ ├── corpus.go
│ ├── corpus_test.go
│ ├── functions.go
│ ├── functions_test.go
│ ├── inflection.go
│ ├── inflection_test.go
│ ├── io.go
│ ├── io_test.go
│ ├── lda.go
│ ├── test_test.go
│ └── utils.go
├── dep/
│ ├── README.md
│ ├── arcStandard.go
│ ├── arcStandard_test.go
│ ├── configuration.go
│ ├── configuration_test.go
│ ├── debug.go
│ ├── dependencyParser.go
│ ├── documentation/
│ │ ├── iamhuman.dot
│ │ └── thecatsatonthemat.dot
│ ├── errors.go
│ ├── evaluation.go
│ ├── example.go
│ ├── example_test.go
│ ├── featureExtraction.go
│ ├── features.go
│ ├── features_string.go
│ ├── fix.go
│ ├── init.go
│ ├── models.go
│ ├── models_test.go
│ ├── move.go
│ ├── move_string.go
│ ├── nn2.go
│ ├── nn2_io.go
│ ├── nn2_io_test.go
│ ├── nn2_test.go
│ ├── nnconfig.go
│ ├── release.go
│ ├── span.go
│ ├── test_test.go
│ ├── train.go
│ ├── train_test.go
│ ├── transition.go
│ └── util.go
├── dependency.go
├── dependencyTree.go
├── dependencyType.go
├── dependencyType_stanford.go
├── dependencyType_stanford_string.go
├── dependencyType_universal.go
├── dependencyType_universal_string.go
├── errors.go
├── go.mod
├── go.sum
├── interfaces.go
├── io.go
├── io_test.go
├── lexeme.go
├── lexemetype_string.go
├── lexer/
│ ├── lexer.go
│ ├── lexer_test.go
│ └── stateFn.go
├── lingo.go
├── pos/
│ ├── allinone_test.go
│ ├── context.go
│ ├── context_test.go
│ ├── contexttype_string.go
│ ├── debug.go
│ ├── errors.go
│ ├── features.go
│ ├── features_test.go
│ ├── featuretype_string.go
│ ├── models.go
│ ├── models_test.go
│ ├── perceptron.go
│ ├── perceptron_io.go
│ ├── perceptron_io_test.go
│ ├── postagger.go
│ ├── release.go
│ ├── sentence.go
│ ├── test_test.go
│ ├── util.go
│ └── util_test.go
├── sentence.go
├── sets.go
├── shape.go
├── stopwords.go
├── treebank/
│ ├── const_postag_stanford.go
│ ├── const_postag_universal.go
│ ├── const_rel_stanford.go
│ ├── const_rel_universal.go
│ ├── sentenceTag.go
│ ├── sentenceTag_test.go
│ ├── treebank.go
│ ├── treebank_test.go
│ └── util.go
├── utils.go
└── wordFlags.go
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# Compiled Object files, Static and Dynamic libs (Shared Objects)
*.o
*.a
*.so
# Folders
_obj
_test
# Architecture specific extensions/prefixes
*.[568vq]
[568vq].out
*.cgo1.go
*.cgo2.c
_cgo_defun.c
_cgo_gotypes.go
_cgo_export.*
_testmain.go
*.exe
*.test
*.prof
================================================
FILE: .travis.yml
================================================
language: go
branches:
only:
- master
go:
- 1.11.x
- 1.12.x
- 1.13.x
- tip
env:
- GO111MODULE=on
matrix:
allow_failures:
- go: tip
================================================
FILE: CONTRIBUTING.md
================================================
# Contributing #
Contributors are welcome! We want to make contributing as easy as possible, and the process is very Github-centric. [Github Issues](https://github.com/chewxy/lingo/issues) are used to manage any contributions and changes. If you don't have a github account, please feel free to email me (my user name [at] gmail.com), and I'll gladly open an issue on your behalf.
# Process #
Say you have a change you want to make, this is the process:
1. Open an issue.
2. I'll have a brief discussion with you. If you don't feel comfortable with a public discussion, I'm okay to email.
3. Fork this project on Github, and clone it to your local machine.
4. Make your changes
5. Make sure you have tests. If you foresee breaking any API, it is vital that it be discussed beforehand.
6. Make sure your tests pass.
7. `gofmt` your code
8. Send a Pull Request.
Say you instead saw one of the [many issues](https://github.com/chewxy/lingo/issues) and want to solve one of them. This is the process:
1. Comment on the issue saying you'll pick it up. (Alternatively, email me)
2. Fork the project on Github, clone to your local drive.
3. Fork this project on Github, and clone it to your local machine.
4. Make your changes
5. Make sure you have tests. If you foresee breaking any API, it is vital that it be discussed beforehand.
6. Make sure your tests pass.
7. `gofmt` your code
8. Send a Pull Request.
## Pull Requests ##
I'll review every pull request. I may request some changes, or delve into further discussions. After that, once I'm satisfied everything passes, I'll merge the pull request. Then I'll add your name into the CONTRIBUTORS list.
# Debugging #
This package comes with a debug tag option. Most subpackages will have a `debug.go` which contain a `logf` function for logging any traces you wish to trace.
================================================
FILE: CONTRIBUTORS.md
================================================
# Contributors #
* Xuanyi Chew (@chewxy) - initial package
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2017 Chewxy
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: POSTag.go
================================================
package lingo
import (
"fmt"
"strings"
)
// POSTag represents a Part of Speech Tag.
type POSTag byte
var posTagLookup map[string]POSTag
func init() {
posTagLookup = make(map[string]POSTag)
for t := X; t < MAXTAG; t++ {
s := t.String()
posTagLookup[s] = POSTag(t)
posTagLookup[strings.ToLower(s)] = POSTag(t)
}
}
func (p POSTag) MarshalText() ([]byte, error) {
return []byte(fmt.Sprintf("%v", p)), nil // add quotes back
}
func (p *POSTag) UnmarshalText(text []byte) error {
str := strings.Trim(string(text), `"`) // for JSON use, if any
tag, _ := posTagLookup[str]
*p = tag
return nil
}
// POSTag related functions
func InPOSTags(x POSTag, set []POSTag) bool {
for _, v := range set {
if v == x {
return true
}
}
return false
}
func IsAdjective(x POSTag) bool { return InPOSTags(x, Adjectives) }
func IsNoun(x POSTag) bool { return InPOSTags(x, Nouns) }
func IsProperNoun(x POSTag) bool { return InPOSTags(x, ProperNouns) }
func IsVerb(x POSTag) bool { return InPOSTags(x, Verbs) }
func IsAdverb(x POSTag) bool { return InPOSTags(x, Adverbs) }
func IsInterrogative(x POSTag) bool { return InPOSTags(x, Interrogatives) }
func IsDeterminer(x POSTag) bool { return InPOSTags(x, Determiners) }
func IsNumber(x POSTag) bool { return InPOSTags(x, Numbers) }
func IsSymbol(x POSTag) bool { return InPOSTags(x, Symbols) }
================================================
FILE: POSTag_stanford.go
================================================
// +build stanfordtags
package lingo
//go:generate stringer -type=POSTag -output=POSTag_stanford_string.go
const BUILD_TAGSET = "stanfordtags"
const (
X POSTag = iota // aka NULLTAG
UNKNOWN_TAG // Unknown
ROOT_TAG // For Root
CC // Coordinating conjunction
CD // Cardinal number
DT // Determiner
EX // Existential there
FW // Foreign word
IN // Preposition or subordinating conjunction
JJ // Adjective
JJR // Adjective, comparative
JJS // Adjective, superlative
LS // List item marker
MD // Modal
NN // Noun, singular or mass
NNS // Noun, plural
NNP // Proper noun, singular
NNPS // Proper noun, plural
PDT // Predeterminer
POS // Possessive ending
PRP // Personal pronoun
PPRP // Possessive pronoun (PRP$)
RB // Adverb
RBR // Adverb, comparative
RBS // Adverb, superlative
RP // Particle
SYM // Symbol
TO // to
UH // Interjection
VB // Verb, base form
VBD // Verb, past tense
VBG // Verb, gerund or present participle
VBN // Verb, past participle
VBP // Verb, non-3rd person singular present
VBZ // Verb, 3rd person singular present
WDT // Wh-determiner
WP // Wh-pronoun
PWP // Possessive wh-pronoun (WP$)
WRB // Wh-adverb
// Punctuation related stuff: http://stackoverflow.com/a/21546294
COMMA // Obvious isn't it?
FULLSTOP // fullstop
OPENQUOTE // Penn Treebank uses ``
CLOSEQUOTE // Penn Treebank uses ''
COLON
DOLLAR
HASHSIGN
LEFTBRACE
RIGHTBRACE
// Extensions for web shit: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/etb-supplementary-guidelines-2009-addendum.pdf
// http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf
HYPH // Hyphen in split compounds
AFX // affix
ADD // url or email addy
NFP // superfluous (non final) puncutation
GW // Goes WIth
XX // deidentified data (aka giberish)
MAXTAG
)
// POSTagShortcut is a shortcut function to help the POSTagger shortcircuit some decisions about what the tag is
func POSTagShortcut(l Lexeme) (POSTag, bool) {
switch l.LexemeType {
case Number:
return CD, true
case Punctuation:
switch l.Value {
case ",":
return COMMA, true
case ".":
return FULLSTOP, true
case "``":
return OPENQUOTE, true
case "''":
return CLOSEQUOTE, true
case ":":
return COLON, true
case "#":
return HASHSIGN, true
case "(":
return LEFTBRACE, true
case ")":
return RIGHTBRACE, true
default:
return X, false
}
case Symbol:
return SYM, true
case URI:
return ADD, true
case Date:
return CD, true
case Time:
return CD, true
case EOF:
return X, true
}
return X, false
}
// sets
var Adjectives = []POSTag{JJ, JJR, JJS}
var Nouns = []POSTag{NN, NNP, NNS, NNPS}
var ProperNouns = []POSTag{NNP, NNPS}
var Verbs = []POSTag{VB, VBD, VBG, VBN, VBP, VBZ}
var Adverbs = []POSTag{RB, RBR, RBS}
var Determiners = []POSTag{DT, PDT}
var Interrogatives = []POSTag{WDT, WP, PWP, WRB}
var Numbers = []POSTag{CD}
var Symbols = []POSTag{SYM, FULLSTOP, COMMA, OPENQUOTE, COLON, DOLLAR, HASHSIGN, LEFTBRACE, RIGHTBRACE, HYPH, NFP}
// IsIN returns true if the POSTag is a subordinating conjunction.
// The reason why this exists is because in the stanford tag, IN is the POSTag
// while in the universal dependencies, it's the SCONJ POSTag
func IsIN(x POSTag) bool { return x == IN }
================================================
FILE: POSTag_stanford_string.go
================================================
// +build stanfordtags
// Code generated by "stringer -type=POSTag -output=POSTag_stanford_string.go"; DO NOT EDIT
package lingo
import "fmt"
const _POSTag_name = "XUNKNOWN_TAGROOT_TAGCCCDDTEXFWINJJJJRJJSLSMDNNNNSNNPNNPSPDTPOSPRPPPRPRBRBRRBSRPSYMTOUHVBVBDVBGVBNVBPVBZWDTWPPWPWRBCOMMAFULLSTOPOPENQUOTECLOSEQUOTECOLONDOLLARHASHSIGNLEFTBRACERIGHTBRACEHYPHAFXADDNFPGWXXMAXTAG"
var _POSTag_index = [...]uint8{0, 1, 12, 20, 22, 24, 26, 28, 30, 32, 34, 37, 40, 42, 44, 46, 49, 52, 56, 59, 62, 65, 69, 71, 74, 77, 79, 82, 84, 86, 88, 91, 94, 97, 100, 103, 106, 108, 111, 114, 119, 127, 136, 146, 151, 157, 165, 174, 184, 188, 191, 194, 197, 199, 201, 207}
func (i POSTag) String() string {
if i >= POSTag(len(_POSTag_index)-1) {
return fmt.Sprintf("POSTag(%d)", i)
}
return _POSTag_name[_POSTag_index[i]:_POSTag_index[i+1]]
}
================================================
FILE: POSTag_universal.go
================================================
// +build !stanfordtags
package lingo
//go:generate stringer -type=POSTag -output=POSTag_universal_string.go
const BUILD_TAGSET = "universaltags"
const (
X POSTag = iota // aka NULLTAG
UNKNOWN_TAG
ROOT_TAG
ADJ
ADP
ADV
AUX
CONJ
DET
INTJ
NOUN
NUM
PART
PRON
PROPN
PUNCT
SCONJ
SYM
VERB
MAXTAG // MAXTAG is provided here as index support
)
// POSTagShortcut is a shortcut function to help the POSTagger shortcircuit some decisions about what the tag is
func POSTagShortcut(l Lexeme) (POSTag, bool) {
switch l.LexemeType {
case Number:
return NUM, true
case Punctuation:
return PUNCT, true
case Symbol:
return SYM, true
case URI:
return X, true
case Date:
return NUM, true
case Time:
return NUM, true
case EOF:
return X, true
}
return X, false
}
var Adjectives = []POSTag{ADJ}
var Nouns = []POSTag{NOUN, PROPN}
var ProperNouns = []POSTag{PROPN}
var Verbs = []POSTag{VERB}
var Adverbs = []POSTag{ADV}
var Determiners = []POSTag{DET}
var Interrogatives = []POSTag{PRON, DET, ADV}
var Numbers = []POSTag{NUM}
var Symbols = []POSTag{SYM, PUNCT}
// IsIN returns true if the POSTag is a subordinating conjunction.
// The reason why this exists is because in the stanford tag, IN is the POSTag
// while in the universal dependencies, it's the SCONJ POSTag
func IsIN(x POSTag) bool { return x == SCONJ }
================================================
FILE: POSTag_universal_string.go
================================================
// +build !stanfordtags
// Code generated by "stringer -type=POSTag -output=POSTag_universal_string.go"; DO NOT EDIT
package lingo
import "fmt"
const _POSTag_name = "XUNKNOWN_TAGROOT_TAGADJADPADVAUXCONJDETINTJNOUNNUMPARTPRONPROPNPUNCTSCONJSYMVERBMAXTAG"
var _POSTag_index = [...]uint8{0, 1, 12, 20, 23, 26, 29, 32, 36, 39, 43, 47, 50, 54, 58, 63, 68, 73, 76, 80, 86}
func (i POSTag) String() string {
if i >= POSTag(len(_POSTag_index)-1) {
return fmt.Sprintf("POSTag(%d)", i)
}
return _POSTag_name[_POSTag_index[i]:_POSTag_index[i+1]]
}
================================================
FILE: README.md
================================================
# lingo #
<img src="https://raw.githubusercontent.com/chewxy/lingo/master/media/gopher_small.png" align="right" />
[](https://travis-ci.org/chewxy/lingo)
package `lingo` provides the data structures and algorithms required for natural language processing.
Specifically, it provides a POS Tagger (`lingo/pos`), a Dependency Parser (`lingo/dep`), and a basic tokenizer (`lingo/lexer`) for English. It also provides data structures for holding corpuses (`lingo/corpus`), and treebanks (`lingo/treebank`).
The aim of this package is to provide a production quality pipeline for natural language processing.
# Install #
The package is go-gettable: `go get -u github.com/chewxy/lingo`
This package and its subpackages depend on very few external packages. Here they are:
| Package | Used For | Vitality | Notes | Licence |
|---------|----------|----------|-------|---------|
| [gorgonia](https://github.com/chewxy/gorgonia) | Machine learning | Vital. It won't be hard to rewrite them, but why? | Same author | [Gorgonia Licence](https://github.com/chewxy/gorgonia/blob/master/LICENSE) (Apache 2.0-like) |
| [gographviz](https://github.com/awalterschulze/gographviz) | Visualization of annotations, and other graph-related visualizations | Vital for visualizations, which are a nice-to-have feature | API last changed 12th April 2017 | [gographviz licence](https://github.com/awalterschulze/gographviz/blob/master/LICENSE) (Apache 2.0) |
| [errors](https://github.com/pkg/errors) | Errors | The package won't die without it, but it's a very nice to have | Stable API for the past year | [errors licence](https://github.com/pkg/errors/blob/master/LICENSE) (MIT/BSD like) |
| [set](https://github.com/xtgo/set) | Set operations | Can be easily replaced | Stable API for the past year | [set licence](https://github.com/xtgo/set/blob/master/LICENSE) (MIT/BSD-like) |
# Usage #
See the individual packages for usage. There is also a bunch of executables in the `cmd` directory. They're meant to be examples as to how a natural language processing pipeline can be set up.
A natural language pipeline with this package is heavily channels driven. Here's is an example for dependency parsing:
```go
func main() {
inputString: `The cat sat on the mat`
lx := lexer.New("dummy", strings.NewReader(inputString)) // lexer - required to break a sentence up into words.
pt := pos.New(pos.WithModel(posModel)) // POS Tagger - required to tag the words with a part of speech tag.
dp := dep.New(depModel) // Creates a new parser
// set up a pipeline
pt.Input = lx.Output
dp.Input = pt.Output
// run all
go lx.Run()
go pt.Run()
go dp.Run()
// wait to receive:
for {
select {
case d := <- dp.Output:
// do something
case err:= <-dp.Error:
// handle error
}
}
}
```
# How It Works #
For specific tasks (POS tagging, parsing, named entity recognition etc), refer to the README of each subpackage. This package on its own mainly provides the data structures that the subpackages will use.
Perhaps the most important data structure is the `*Annotation` structure. It basically holds a word and the associated metadata for the word.
For dependency parses, the graph takes three forms: `*Dependency`, `*DependencyTree` and `*Annotation`. All three forms are convertable from one to another. TODO: explain rationale behind each data type.
## Quirks ##
### Very Oddly Specific POS Tags and Dependency Rel Types ###
A particular quirk you may have noticed is that the `POSTag` and `DependencyType` are hard coded in as constants. This package does in fact provide two variations of each: one from Stanford/Penn Treebank and one from [UniversalDependencies](http://universaldependencies.org/).
The main reason for hardcoding these are mainly for performance reasons - knowing ahead how much to allocate reduces a lot of additional work the program has to do. It also reduces the chances of mutating a global variable.
Of course this comes as a tradeoff - programs are limited to these two options. Thankfully there are only a limited number of POS Tag and Dependency Relation types. Two of the most popular ones (Stanford/PTB and Universal Dependencies) have been implemented.
The following build tags are supported:
* stanfordtags
* universaltags
* stanfordrel
* universalrel
To use a specific tagset or relset, build your program thusly: `go build -tags='stanfordtags'`.
The default tag and dependency rel types are the universal dependencies version.
### Lexer ###
You should also note that the tokenizer, `lingo/lexer` is not your usual run-of-the-mill NLP tokenizer. It's a tokenizer that tokenizes by space, with some specific rules for English. It was inspired by Rob Pike's talk on lexers. I thought it'd be cool to write something like that for NLP.
The test cases in package `lingo/lexer` showcases how it handles unicode, and other pathalogical english.
# Contributing #
see CONTRIBUTING.md for more info
# Licence #
This package is licenced under the MIT licence.
================================================
FILE: annotation.go
================================================
package lingo
import (
"errors"
"fmt"
"strings"
)
// Annotation is the word and it's metadata.
// This includes the position, its dependency head (if available), its lemma, POSTag, etc
//
// A collection of Annoations - AnnotatedSentence is also a representation of a dependency parse
//
// Every field is exported for easy gobbing. be very careful with setting stuff
type Annotation struct {
Lexeme
POSTag
// NER
// fields to do with an annotation being in a collection
DependencyType
ID int
Head *Annotation
children AnnotationSet //will not be serialized
// info about the annotation itself
Lemma string
Lowered string
Stem string
// auxiliary data for processing
Cluster
Shape
WordFlag
}
func NewAnnotation() *Annotation {
return &Annotation{
Lexeme: nullLexeme,
Lemma: "",
Shape: Shape(""),
}
}
// AnnotationFromLexTag is only ever used in tests. Fixer is optional
func AnnotationFromLexTag(l Lexeme, t POSTag, f AnnotationFixer) *Annotation {
a := &Annotation{
Lexeme: l,
POSTag: t,
DependencyType: NoDepType,
Lemma: "",
Lowered: strings.ToLower(l.Value),
}
// it's ok to panic - it will cause the tests to fail
if err := a.Process(f); err != nil {
panic(err)
}
return a
}
func (a *Annotation) Clone() *Annotation {
b := *a
b.ID = -1
b.Head = nil
b.children = nil
b.DependencyType = NoDepType
return &b
}
func (a *Annotation) SetHead(headAnn *Annotation) {
a.Head = headAnn
if headAnn != rootAnnotation && headAnn != startAnnotation && headAnn != nullAnnotation {
headAnn.children = append(headAnn.children, a)
}
}
func (a *Annotation) HeadID() int {
if a.Head != nil {
return a.Head.ID
}
return -1
}
func (a *Annotation) IsNumber() bool {
return IsNumber(a.POSTag) && (a.LexemeType != Date && a.LexemeType != Time && a.LexemeType != URI)
}
func (a *Annotation) String() string {
return a.Value
}
func (a *Annotation) GoString() string {
s := fmt.Sprintf("%q/%s", a.Lexeme.Value, a.POSTag)
if a.Head != nil {
return fmt.Sprintf("(%v) <-%v- (%q/%s) ", s, a.DependencyType, a.Head.Value, a.Head.POSTag)
}
return s
}
func (a *Annotation) Process(f AnnotationFixer) error {
if a.Lexeme != nullLexeme {
a.Lowered = strings.ToLower(a.Value)
a.Shape = a.Lexeme.Shape()
a.WordFlag = a.Lexeme.Flags()
var err error
if f != nil {
var stem string
if stem, err = f.Stem(a.Lowered); err != nil {
if _, ok := err.(componentUnavailable); !ok {
return err
}
}
a.Stem = stem
var clust map[string]Cluster
if clust, err = f.Clusters(); err == nil {
a.Cluster = clust[a.Value]
}
}
return nil
}
return errors.New("No Lexeme!")
}
var rootAnnotation = &Annotation{
Lexeme: rootLexeme,
POSTag: ROOT_TAG,
DependencyType: Root,
ID: 0,
Head: nil,
Lemma: "",
Lowered: "",
Cluster: 0,
Shape: "",
WordFlag: NoFlag,
}
var startAnnotation = &Annotation{
Lexeme: startLexeme,
POSTag: ROOT_TAG,
DependencyType: NoDepType,
ID: -1,
Head: nil,
Lemma: "",
Lowered: "",
Cluster: 0,
Shape: "",
WordFlag: NoFlag,
}
var nullAnnotation = &Annotation{
Lexeme: nullLexeme,
POSTag: X,
DependencyType: NoDepType,
ID: -1,
Head: nil,
Lemma: "",
Lowered: "",
Cluster: 0,
Shape: "",
WordFlag: NoFlag,
}
func RootAnnotation() *Annotation { return rootAnnotation }
func StartAnnotation() *Annotation { return startAnnotation }
func NullAnnotation() *Annotation { return nullAnnotation }
func StringToAnnotation(s string, f AnnotationFixer) *Annotation {
l := MakeLexeme(s, Word)
a := NewAnnotation()
a.Lexeme = l
if err := a.Process(f); err != nil {
panic(err.Error())
}
return a
}
type AnnotationFixer interface {
Lemmatizer
Stemmer
Clusters() (map[string]Cluster, error)
}
================================================
FILE: annotationSet.go
================================================
package lingo
import (
"sort"
"unsafe"
"github.com/xtgo/set"
)
type AnnotationSet []*Annotation
func (as AnnotationSet) Len() int { return len(as) }
func (as AnnotationSet) Swap(i, j int) { as[i], as[j] = as[j], as[i] }
func (as AnnotationSet) Less(i, j int) bool {
return uintptr(unsafe.Pointer(as[i])) < uintptr(unsafe.Pointer(as[j]))
}
func (as AnnotationSet) Set() AnnotationSet {
sort.Sort(as)
n := set.Uniq(as)
return as[:n]
}
func (as AnnotationSet) Contains(a *Annotation) bool {
if as.Index(a) == len(as) {
return false
}
return true
}
func (as AnnotationSet) Index(a *Annotation) int {
for i, an := range as {
if an == a {
return i
}
}
return len(as)
}
func (as AnnotationSet) Add(a *Annotation) AnnotationSet {
if as.Contains(a) {
return as
}
as = append(as, a)
return as
}
================================================
FILE: annotationSet_bench_test.go
================================================
package lingo
import (
"sort"
"testing"
)
func (as AnnotationSet) index2(a *Annotation) int {
sort.Sort(as)
f := func(i int) bool { return as[i] == a }
return sort.Search(len(as), f)
}
var benchIndexRes int
func benchASIndex(size int, b *testing.B) {
var as AnnotationSet
for i := 0; i < size; i++ {
as = append(as, new(Annotation))
}
doesntcontain := new(Annotation)
contains := as[0]
for n := 0; n < b.N; n++ {
benchIndexRes = as.Index(doesntcontain)
benchIndexRes = as.Index(contains)
}
}
func benchASIndex2(size int, b *testing.B) {
var as AnnotationSet
for i := 0; i < size; i++ {
as = append(as, new(Annotation))
}
doesntcontain := new(Annotation)
contains := as[0]
for n := 0; n < b.N; n++ {
benchIndexRes = as.index2(doesntcontain)
benchIndexRes = as.index2(contains)
}
}
func BenchmarkAnnotationSetIndex_1(b *testing.B) { benchASIndex(1, b) }
func BenchmarkAnnotationSetIndex_2(b *testing.B) { benchASIndex(2, b) }
func BenchmarkAnnotationSetIndex_8(b *testing.B) { benchASIndex(8, b) }
func BenchmarkAnnotationSetIndex_16(b *testing.B) { benchASIndex(16, b) }
func BenchmarkAnnotationSetIndex_32(b *testing.B) { benchASIndex(32, b) }
func BenchmarkAnnotationSetIndex_64(b *testing.B) { benchASIndex(64, b) }
func BenchmarkAnnotationSetIndex_128(b *testing.B) { benchASIndex(128, b) }
func BenchmarkAnnotationSetIndex_256(b *testing.B) { benchASIndex(256, b) }
func BenchmarkAnnotationSetIndex_512(b *testing.B) { benchASIndex(512, b) }
func BenchmarkAnnotationSetIndex_1024(b *testing.B) { benchASIndex(1024, b) }
func BenchmarkAnnotationSetIndex2_1(b *testing.B) { benchASIndex2(1, b) }
func BenchmarkAnnotationSetIndex2_2(b *testing.B) { benchASIndex2(2, b) }
func BenchmarkAnnotationSetIndex2_8(b *testing.B) { benchASIndex2(8, b) }
func BenchmarkAnnotationSetIndex2_16(b *testing.B) { benchASIndex2(16, b) }
func BenchmarkAnnotationSetIndex2_32(b *testing.B) { benchASIndex2(32, b) }
func BenchmarkAnnotationSetIndex2_64(b *testing.B) { benchASIndex2(64, b) }
func BenchmarkAnnotationSetIndex2_128(b *testing.B) { benchASIndex2(128, b) }
func BenchmarkAnnotationSetIndex2_256(b *testing.B) { benchASIndex2(256, b) }
func BenchmarkAnnotationSetIndex2_512(b *testing.B) { benchASIndex2(512, b) }
func BenchmarkAnnotationSetIndex2_1024(b *testing.B) { benchASIndex2(1024, b) }
================================================
FILE: browncluster.go
================================================
package lingo
import (
"bufio"
"io"
"strconv"
"strings"
)
// this file provides IO support and type safety for brown clusters.
// The creation of brownclusters is not done here.
// Right now lingo does not generate clusters - use PercyLiang's excellent tool for that
// Cluster represents a brown cluster
type Cluster int
// ReadCluster reads PercyLiang's cluster file format and returns a map of strings to Cluster
func ReadCluster(r io.Reader) map[string]Cluster {
scanner := bufio.NewScanner(r)
clusters := make(map[string]Cluster)
for scanner.Scan() {
line := scanner.Text()
splits := strings.Split(line, "\t")
var word string
var cluster, freq int
word = splits[1]
var i64 int64
var err error
if i64, err = strconv.ParseInt(splits[0], 2, 64); err != nil {
panic(err)
}
cluster = int(i64)
if freq, err = strconv.Atoi(splits[2]); err != nil {
panic(err)
}
// if clusterer has only seen a word a few times, then the cluster is not reliable
if freq >= 3 {
clusters[word] = Cluster(cluster)
} else {
clusters[word] = Cluster(0)
}
}
// expand clusters with recasing
for word, clust := range clusters {
lowered := strings.ToLower(word)
if _, ok := clusters[lowered]; !ok {
clusters[lowered] = clust
}
titled := strings.ToTitle(word)
if _, ok := clusters[titled]; !ok {
clusters[titled] = clust
}
uppered := strings.ToUpper(word)
if _, ok := clusters[uppered]; !ok {
clusters[uppered] = clust
}
}
return clusters
}
================================================
FILE: cmd/demo/io.go
================================================
package main
import (
"log"
"os"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/dep"
"github.com/chewxy/lingo/pos"
)
const (
posModelFile = `model/pos_stanfordtags_universalrel.final.model`
depModelFile = `model/dep_stanfordtags_universalrel.final.model`
brownCluster = `clusters.txt`
)
func io() {
var err error
log.Println("loading POS Tagger model")
if posModel, err = pos.Load(posModelFile); err != nil {
log.Fatal(err)
}
log.Println("loading Dependency Parser model")
if depModel, err = dep.Load(depModelFile); err != nil {
log.Fatal(err)
}
var f *os.File
if f, err = os.Open(brownCluster); err != nil {
log.Fatal(err)
}
clusters = lingo.ReadCluster(f)
}
================================================
FILE: cmd/demo/main.go
================================================
package main
import (
"io/ioutil"
"os"
"os/exec"
"github.com/abiosoft/ishell"
"github.com/chewxy/lingo"
"github.com/pkg/browser"
)
func main() {
io()
shell := ishell.New()
var d *lingo.Dependency
// var sent lingo.AnnotatedSentence
var err error
shell.AddCmd(&ishell.Cmd{
Name: "dep",
Help: "perform dependency parsing",
Func: func(c *ishell.Context) {
c.ShowPrompt(false)
defer c.ShowPrompt(true)
c.Print("Query: ")
query := c.ReadLine()
if d, err = pipeline(query); err != nil {
c.Printf("Error: %v", err)
}
c.Printf("%v\n", d)
},
})
shell.AddCmd(&ishell.Cmd{
Name: "show",
Help: "show dependency parse on browser",
Func: func(c *ishell.Context) {
var tmp *os.File
if tmp, err = ioutil.TempFile("", "dep"); err != nil {
c.Printf("Cannot open file %v\n", err)
return
}
defer os.Remove(tmp.Name())
c.Printf("%v\n", tmp.Name())
dot := d.Tree().Dot()
tmp.Write([]byte(dot))
if err := tmp.Close(); err != nil {
c.Printf("Error closing file %v", err)
}
cmd := exec.Command("dot", "-Tpng", "-O", tmp.Name())
if err = cmd.Run(); err != nil {
c.Printf("Cannot execute dot: %v\n", err)
}
browser.OpenFile(tmp.Name() + ".png")
},
})
shell.Start()
}
================================================
FILE: cmd/demo/nlp.go
================================================
package main
import (
"fmt"
"strings"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/dep"
"github.com/chewxy/lingo/lexer"
"github.com/chewxy/lingo/pos"
"github.com/kljensen/snowball"
"github.com/pkg/errors"
)
var posModel *pos.Model
var depModel *dep.Model
var clusters map[string]lingo.Cluster
type stemmer struct{}
func (stemmer) Stem(a string) (string, error) {
return snowball.Stem(a, "english", true)
}
type fixer struct {
stemmer
}
func (f fixer) Clusters() (map[string]lingo.Cluster, error) { return clusters, nil }
func (f fixer) Lemmatize(a string, pt lingo.POSTag) ([]string, error) {
return nil, nocomp("lemmatizer")
}
type nocomp string
func (e nocomp) Error() string { return fmt.Sprintf("no %v", string(e)) }
func (e nocomp) Component() string { return string(e) }
func pipeline(s string) (d *lingo.Dependency, err error) {
if posModel == nil || depModel == nil {
return nil, errors.Errorf("Unable to create a pipeline")
}
lx := lexer.New(s, strings.NewReader(s))
pt := pos.New(pos.WithModel(posModel), pos.WithStemmer(stemmer{}))
dp := dep.New(depModel)
// pipeline
pt.Input = lx.Output
dp.Input = pt.Output
go lx.Run()
go pt.Run()
go dp.Run()
var ok bool
for {
select {
case d, ok = <-dp.Output:
if !ok {
continue
}
return
case err = <-dp.Error:
return
}
}
}
================================================
FILE: cmd/dep/fixer.go
================================================
package main
import (
"fmt"
"github.com/chewxy/lingo"
"github.com/kljensen/snowball"
)
type stemmer struct{}
func (stemmer) Stem(a string) (string, error) {
return snowball.Stem(a, "english", true)
}
type fixer struct {
stemmer
}
func (f fixer) Clusters() (map[string]lingo.Cluster, error) { return clusters, nil }
func (f fixer) Lemmatize(a string, pt lingo.POSTag) ([]string, error) {
return nil, nocomp("lemmatizer")
}
type nocomp string
func (e nocomp) Error() string { return fmt.Sprintf("no %v", string(e)) }
func (e nocomp) Component() string { return string(e) }
================================================
FILE: cmd/dep/io.go
================================================
package main
import (
"log"
"github.com/chewxy/lingo/dep"
"github.com/chewxy/lingo/pos"
"github.com/chewxy/lingo/treebank"
)
func validateFlags() {
if *load == "" && *trainFile == "" {
log.Fatal("Must either load a model or pass in a training file")
}
if *epoch < 0 {
log.Fatal("epochs must only be positive numbers")
}
if *load != "" {
toLoad = true
}
if *trainFile != "" {
toTrain = true
}
if *testFile != "" {
*cv = true
}
// warnings
if *load == "" && *save == "" {
log.Println("WARNING: Models that have been trained will NOT be saved")
}
}
func loadTreebanks() {
if *trainFile != "" {
trainTB = treebank.LoadUniversal(*trainFile)
}
if *testFile != "" {
testTB = treebank.LoadUniversal(*testFile)
}
}
func loadPOSModel() {
var err error
if *loadPOS == "" {
log.Fatal("Cannot proceed without having a POS model")
}
if POSModel, err = pos.Load(*loadPOS); err != nil {
log.Fatal(err)
}
}
func loadDepModel() {
var err error
if DepModel, err = dep.Load(*load); err != nil {
log.Fatal(err)
}
}
func saveModel() {
if *save != "" && DepModel != nil {
DepModel.Save(*save)
}
}
================================================
FILE: cmd/dep/main.go
================================================
package main
import (
"flag"
"log"
"os"
"os/signal"
"runtime/pprof"
"syscall"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/dep"
"github.com/chewxy/lingo/pos"
)
var save = flag.String("save", "", "save as...")
var load = flag.String("load", "", "load a model")
var loadPOS = flag.String("PTmodel", "", "load a POS Tagger model")
var clusterFiles = flag.String("cluster", "", "Brown Cluster files. If nothing is passed in, then the brown cluster won't be used")
var trainFile = flag.String("train", "", "Training on... (Only CONLLU formatted training files are accepted)")
var testFile = flag.String("test", "", "Test on... (Only CONLLU formatted training files are accepted). If this is not provided, the model will be trained without crossvalidation")
var cv = flag.Bool("cv", false, "Cross validate training model? Defaults to false.")
var epoch = flag.Int("epoch", 10, "Training epochs. Defaults to 10")
var format = flag.String("f", "", "Format to output. Default is none. Accepts: {json, dot}")
var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")
var memprofile = flag.String("memprofile", "", "write memory profile to this file")
var clusters map[string]lingo.Cluster
var POSModel *pos.Model
var DepModel *dep.Model
var toLoad, toTrain bool
func init() {
if lingo.BUILD_TAGSET != "stanfordtags" && lingo.BUILD_TAGSET != "universaltags" {
log.Fatalf("Tagset %q unsupported", lingo.BUILD_TAGSET)
}
if lingo.BUILD_RELSET != "stanfordrel" && lingo.BUILD_RELSET != "universalrel" {
log.Fatalf("Relset %q unsupported", lingo.BUILD_RELSET)
}
}
func cleanup(sigChan chan os.Signal, cpuprofiling, memprofiling bool) {
select {
case <-sigChan:
log.Println("EMERGENCY EXIT")
if cpuprofiling {
pprof.StopCPUProfile()
}
if memprofiling {
f, err := os.Create(*memprofile)
if err != nil {
log.Fatal(err)
}
pprof.WriteHeapProfile(f)
f.Close()
}
saveModel()
os.Exit(1)
}
}
func main() {
flag.Parse()
validateFlags()
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
var cpuprofiling, memprofiling bool
if *cpuprofile != "" {
f, err := os.Create(*cpuprofile)
if err != nil {
log.Fatal(err)
}
cpuprofiling = true
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
}
if *memprofile != "" {
memprofiling = true
}
go cleanup(sigChan, cpuprofiling, memprofiling)
loadPOSModel()
if toLoad {
loadDepModel()
}
if toTrain {
loadTreebanks()
train()
}
saveModel()
}
================================================
FILE: cmd/dep/pipeline.go
================================================
package main
import (
"encoding/json"
"fmt"
"strings"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/dep"
"github.com/chewxy/lingo/lexer"
"github.com/chewxy/lingo/pos"
)
func receive(deps chan *lingo.Dependency, errs, errChan chan error) {
defer close(errChan)
for {
select {
case dep, ok := <-deps:
if !ok {
continue
}
switch *format {
case "json":
bs, _ := json.MarshalIndent(dep, "", "\t")
fmt.Printf("%s\n", string(bs))
case "dot":
fmt.Printf("%v\n", dep.Tree().Dot())
}
case err := <-errs:
errChan <- err
}
}
}
func pipeline(s string) error {
lx := lexer.New(s, strings.NewReader(s))
pt := pos.New(pos.WithModel(POSModel))
dp := dep.New(DepModel)
pt.Input = lx.Output
dp.Input = pt.Output
errChan := make(chan error)
go lx.Run()
go pt.Run()
go receive(dp.Output, dp.Error, errChan)
dp.Run()
return <-errChan
}
================================================
FILE: cmd/dep/train.go
================================================
package main
import (
"log"
"github.com/chewxy/lingo/dep"
"github.com/chewxy/lingo/treebank"
"gorgonia.org/tensor"
)
var trainTB []treebank.SentenceTag
var testTB []treebank.SentenceTag
func train() {
conf := dep.DefaultNNConfig
conf.Dtype = tensor.Float32
var trainer *dep.Trainer
if testTB != nil {
log.Printf("TRAINING WITH CROSSVALIDATION")
trainer = dep.NewTrainer(dep.WithGeneratedCorpus(trainTB...), dep.WithTrainingSet(trainTB), dep.WithCrossValidationSet(testTB), dep.WithConfig(conf))
trainer.SaveBest = "TMP.model"
if err := trainer.Init(); err != nil {
log.Fatalf("Unable to initialize trainer: \n%+v", err)
}
prog := trainer.Perf()
cost := trainer.Cost()
go func() {
for {
select {
case p := <-prog:
log.Printf("%v\n", p)
case c := <-cost:
log.Printf("Cost %v\n", c)
}
}
}()
} else {
trainer = dep.NewTrainer(dep.WithGeneratedCorpus(trainTB...), dep.WithTrainingSet(trainTB), dep.WithConfig(conf))
if err := trainer.Init(); err != nil {
log.Fatalf("Unable to initialize trainer: \n%+v", err)
}
prog := trainer.Cost()
go func() {
for cost := range prog {
log.Printf("Cost %v\n", cost)
}
}()
}
if err := trainer.Train(*epoch); err != nil {
log.Fatal(err)
}
DepModel = trainer.Model
}
================================================
FILE: cmd/lexer/main.go
================================================
package main
import (
"flag"
"fmt"
"strings"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/lexer"
)
var input = flag.String("input", "", "input string to lex")
var output = make(chan lingo.Lexeme)
func receieve() {
for l := range output {
fmt.Printf("%v\n", l)
}
}
func main() {
flag.Parse()
s := *input
go receieve()
l := lexer.New(s, strings.NewReader(s))
l.Output = output
l.Run()
}
================================================
FILE: cmd/pos/crossvalidation.go
================================================
package main
import (
"bytes"
"fmt"
"log"
"os"
"strings"
"sync"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/lexer"
"github.com/chewxy/lingo/pos"
"github.com/chewxy/lingo/treebank"
)
type testResult struct {
tagged lingo.AnnotatedSentence
actual lingo.AnnotatedSentence
}
func (tr testResult) compare() (int, bool) {
tagged := tr.tagged
actual := tr.actual
var sameLength bool = true
if len(tagged) != len(actual) {
sameLength = false
}
var counter int
for i, v := range actual {
if i >= len(tagged) {
break
}
if v.POSTag == tagged[i].POSTag {
counter++
}
}
return counter, sameLength
}
func crossValidate(resultChan chan testResult) {
diffLengthCount := 0
totalLength := 0
correctCount := 0
sentences := 0
var wrongResults []testResult
for res := range resultChan {
sentences++
length := len(res.actual)
cc, sl := res.compare()
if !sl {
diffLengthCount++
}
correctCount += cc
totalLength += length
if cc != length && *inspect != "" {
wrongResults = append(wrongResults, res)
}
}
if *inspect != "" {
f, err := os.OpenFile(*inspect, os.O_WRONLY|os.O_CREATE, 0666)
if err != nil {
log.Fatal(err)
}
// can write directly to f
var buf bytes.Buffer
for _, res := range wrongResults {
fmt.Fprintf(&buf, "Sentence: \nW:%v\nG:%v\nTags:\nW: %v\nG: %v\n\n", res.actual.StringSlice(), res.tagged.StringSlice(), res.actual.Tags(), res.tagged.Tags())
}
f.WriteString(buf.String())
f.Close()
}
fmt.Printf("CrossValidation: %d/%d = %f. Differing Lengths : %d/%d = %f\n", correctCount, totalLength, float64(correctCount)/float64(totalLength), diffLengthCount, sentences, float64(diffLengthCount)/float64(sentences))
}
func collect(ch chan lingo.AnnotatedSentence, correct lingo.AnnotatedSentence, outCh chan testResult, wg *sync.WaitGroup) {
defer wg.Done()
for sentence := range ch {
outCh <- testResult{sentence, correct}
}
}
func testModel(sentences []treebank.SentenceTag) {
resultChan := make(chan testResult)
go func() {
defer close(resultChan)
var wg sync.WaitGroup
for _, sentence := range sentences {
wg.Add(1)
input := sentence.String()
correct := sentence.AnnotatedSentence(fixer{stemmer{}})
ch := make(chan lingo.AnnotatedSentence)
go collect(ch, correct, resultChan, &wg)
go cvpipeline(input, ch)
}
wg.Wait()
}()
crossValidate(resultChan)
}
func cvpipeline(s string, output chan lingo.AnnotatedSentence) {
l := lexer.New(s, strings.NewReader(s))
pt := pos.New(pos.WithModel(model))
pt.Input = l.Output
pt.Output = output
go l.Run()
pt.Run()
}
================================================
FILE: cmd/pos/fixer.go
================================================
// +build !chewxy
package main
import (
"fmt"
"github.com/chewxy/lingo"
"github.com/kljensen/snowball"
)
type stemmer struct{}
func (stemmer) Stem(a string) (string, error) {
return snowball.Stem(a, "english", true)
}
type fixer struct {
stemmer
}
func (f fixer) Clusters() (map[string]lingo.Cluster, error) { return clusters, nil }
func (f fixer) Lemmatize(a string, pt lingo.POSTag) ([]string, error) {
return nil, nocomp("lemmatizer")
}
type nocomp string
func (e nocomp) Error() string { return fmt.Sprintf("no %v", string(e)) }
func (e nocomp) Component() string { return string(e) }
================================================
FILE: cmd/pos/main.go
================================================
package main
import (
"flag"
"fmt"
"log"
"os"
"os/signal"
"runtime/pprof"
"strings"
"sync"
"syscall"
"time"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/lexer"
"github.com/chewxy/lingo/pos"
"github.com/chewxy/lingo/treebank"
)
var save = flag.String("save", "", "save as...")
var load = flag.String("load", "", "load a model")
var clusterFiles = flag.String("cluster", "", "Brown Cluster files. If nothing is passed in, then the brown cluster won't be used")
var trainFile = flag.String("train", "", "Training on... files that end with '.conllu' will be treated as CONLLU formatted files. Files ending with '.zip' will be treted as EWT files")
var testFile = flag.String("test", "", "Test on... Files to cross validate the model on. If this is provided, automatic crossvalidation will be done")
var cv = flag.Bool("cv", false, "Cross validate training model? Defaults to false.")
var epoch = flag.Int("epoch", 1500, "Training epochs. Defaults to 1500")
var inspect = flag.String("inpect", "", "Inspect all the wrong outputs to figure out what went wrong in the POSTagging. This is useful for debugging")
var input = flag.String("input", "", "Input sentence to tag")
var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")
var memprofile = flag.String("memprofile", "", "write memory profile to this file")
var clusters map[string]lingo.Cluster
var model *pos.Model
func receive(sentences chan lingo.AnnotatedSentence, wg *sync.WaitGroup) {
defer wg.Done()
for sent := range sentences {
for _, a := range sent {
fmt.Printf("%#v: %s| %s | %s | %d\n", a, a.POSTag, a.Lemma, a.WordFlag, a.Cluster)
}
}
}
func pipeline(s string) {
l := lexer.New(s, strings.NewReader(s))
pt := pos.New(pos.WithModel(model))
pt.Input = l.Output
var wg sync.WaitGroup
go l.Run()
go receive(pt.Output, &wg)
wg.Add(1)
pt.Run()
wg.Wait()
}
func validateFlags() {
if *load == "" && *trainFile == "" {
log.Fatal("Must either load a model or pass in a training file")
}
if *epoch < 0 {
log.Fatal("epochs must be positive numbers only!")
}
if *testFile != "" {
*cv = true
}
// warnings
if *load == "" && *save == "" {
log.Println("WARNING: Models that are trained will NOT be saved")
}
}
func loadOrTrain() {
var trained *pos.Tagger
if *clusterFiles != "" {
f, err := os.Open(*clusterFiles)
if err != nil {
log.Fatal(err)
}
clusters = lingo.ReadCluster(f)
trained = pos.New(pos.WithCluster(clusters), pos.WithStemmer(stemmer{}))
} else {
trained = pos.New()
}
if *load != "" {
start := time.Now()
var err error
if model, err = pos.Load(*load); err != nil {
log.Fatal(err)
}
log.Printf("Loading model from %q took %v", *load, time.Since(start))
return
}
var sentences []treebank.SentenceTag
switch {
case strings.HasSuffix(*trainFile, ".zip"):
sentences = treebank.LoadEWT(*trainFile)
// TODO split sentences for crossvalidation
case strings.HasSuffix(*trainFile, ".conllu"):
sentences = treebank.LoadUniversal(*trainFile)
default:
f, err := os.Open(*trainFile)
if err != nil {
log.Fatal(err)
}
sentences = treebank.ReadConllu(f)
}
log.Printf("Start training for %d epochs...", *epoch)
start := time.Now()
trained.Train(sentences, *epoch)
log.Printf("End Training. Training took %v minutes", time.Since(start).Minutes())
if *save != "" {
trained.Save(*save)
log.Printf("Model saved as: %v", *save)
}
}
func cleanup(sigChan chan os.Signal, profiling bool) {
select {
case <-sigChan:
log.Println("EMERGENCY EXIT")
if profiling {
pprof.StopCPUProfile()
}
os.Exit(1)
}
}
func main() {
flag.Parse()
if lingo.BUILD_TAGSET != "stanfordtags" && lingo.BUILD_TAGSET != "universaltags" {
log.Fatalf("Tagset: %v is unsupported", lingo.BUILD_TAGSET)
}
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
var profiling bool
if *cpuprofile != "" {
f, err := os.Create(*cpuprofile)
if err != nil {
log.Fatal(err)
}
profiling = true
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
}
go cleanup(sigChan, profiling)
validateFlags()
loadOrTrain()
if *memprofile != "" {
f, err := os.Create(*memprofile)
if err != nil {
log.Fatal(err)
}
pprof.WriteHeapProfile(f)
f.Close()
}
if *input != "" {
pipeline(*input)
}
if *cv {
log.Printf("Cross Validating now")
testSentences := treebank.LoadUniversal(*testFile)
testModel(testSentences)
}
}
================================================
FILE: const.go
================================================
package lingo
// constants that are not pertaining to build tags
var empty struct{}
// NumberWords was generated with this python code
/*
numberWords = {}
simple = '''zero one two three four five six seven eight nine ten eleven twelve
thirteen fourteen fifteen sixteen seventeen eighteen nineteen
twenty'''.split()
for i, word in zip(xrange(0, 20+1), simple):
numberWords[word] = i
tense = '''thirty forty fifty sixty seventy eighty ninety hundred'''.split()
for i, word in zip(xrange(30, 100+1, 10), tense):
numberWords[word] = i
larges = '''thousand million billion trillion quadrillion quintillion sextillion septillion'''.split()
for i, word in zip(xrange(3, 24+1, 3), larges):
numberWords[word] = 10**i
*/
var NumberWords = map[string]int{
"zero": 0,
"one": 1,
"two": 2,
"three": 3,
"four": 4,
"five": 5,
"six": 6,
"seven": 7,
"eight": 8,
"nine": 9,
"ten": 10,
"eleven": 11,
"twelve": 12,
"thirteen": 13,
"fourteen": 14,
"fifteen": 15,
"sixteen": 16,
"nineteen": 19,
"seventeen": 17,
"eighteen": 18,
"twenty": 20,
"thirty": 30,
"forty": 40,
"fifty": 50,
"sixty": 60,
"seventy": 70,
"eighty": 80,
"ninety": 90,
"hundred": 100,
"thousand": 1000,
"million": 1000000,
"billion": 1000000000,
"trillion": 1000000000000,
"quadrillion": 1000000000000000,
// "quintillion": 1000000000000000000,
// "sextillion": 1000000000000000000000,
// "septillion": 1000000000000000000000000,
}
================================================
FILE: corpus/consopt.go
================================================
package corpus
import (
"log"
"sort"
"sync/atomic"
"unicode/utf8"
"github.com/pkg/errors"
"github.com/xtgo/set"
)
// ConsOpt is a construction option for manual creation of a Corpus
type ConsOpt func(c *Corpus) error
// WithWords creates a corpus from a word list. It may have repeated words
func WithWords(a []string) ConsOpt {
f := func(c *Corpus) error {
s := set.Strings(a)
c.words = s
c.frequencies = make([]int, len(s))
ids := make(map[string]int)
maxID := len(s)
var totalFreq, maxWL int
// NOTE: here we're iterating over the set of words
for i, w := range s {
runeCount := utf8.RuneCountInString(w)
if runeCount > c.maxWordLength {
maxWL = runeCount
}
ids[w] = i
}
// NOTE: here we're iterating over the original word list.
for _, w := range a {
c.frequencies[ids[w]]++
totalFreq++
}
c.ids = ids
atomic.AddInt64(&c.maxid, int64(maxID))
c.totalFreq = totalFreq
c.maxWordLength = maxWL
return nil
}
return f
}
// WithOrderedWords creates a Corpus with the given word order
func WithOrderedWords(a []string) ConsOpt {
f := func(c *Corpus) error {
s := a
c.words = s
c.frequencies = make([]int, len(s))
for i := range c.frequencies {
c.frequencies[i] = 1
}
ids := make(map[string]int)
maxID := len(s)
totalFreq := len(s)
var maxWL int
for i, w := range a {
runeCount := utf8.RuneCountInString(w)
if runeCount > c.maxWordLength {
maxWL = runeCount
}
ids[w] = i
}
c.ids = ids
atomic.AddInt64(&c.maxid, int64(maxID))
c.totalFreq = totalFreq
c.maxWordLength = maxWL
return nil
}
return f
}
// WithSize preallocates all the things in Corpus
func WithSize(size int) ConsOpt {
return func(c *Corpus) error {
c.words = make([]string, 0, size)
c.frequencies = make([]int, 0, size)
return nil
}
}
// FromDict is a construction option to take a map[string]int where the int represents the word ID.
// This is useful for constructing corpuses from foreign sources where the ID mappings are important
func FromDict(d map[string]int) ConsOpt {
return func(c *Corpus) error {
var a sortutil
for k, v := range d {
a.words = append(a.words, k)
a.ids = append(a.ids, v)
}
sort.Sort(&a)
c.ids = make(map[string]int)
for i, w := range a.words {
if i != a.ids[i] {
return errors.Errorf("Unmarshaling error. Expected %dth ID to be %d. Got %d instead. Perhaps something went wrong during sorting? SLYTHERIN IT IS!", i, i, a.ids[i])
}
c.words = append(c.words, w)
c.frequencies = append(c.frequencies, 1)
c.ids[w] = i
c.totalFreq++
runeCount := utf8.RuneCountInString(w)
if runeCount > c.maxWordLength {
log.Printf("FD MaxWordLength %d - %q", runeCount, w)
c.maxWordLength = runeCount
}
}
c.maxid = int64(len(a.words))
return nil
}
}
// FromDictWithFreq is like FromDict, but also has a frequency.
func FromDictWithFreq(d map[string]struct{ ID, Freq int }) ConsOpt {
return func(c *Corpus) error {
var a sortutil
for k, v := range d {
a.words = append(a.words, k)
a.ids = append(a.ids, v.ID)
a.freqs = append(a.freqs, v.Freq)
}
sort.Sort(&a)
c.ids = make(map[string]int)
for i, w := range a.words {
if i != a.ids[i] {
return errors.Errorf("Unmarshaling error. Expected %dth ID to be %d. Got %d instead. Perhaps something went wrong during sorting? SLYTHERIN IT IS!", i, i, a.ids[i])
}
c.words = append(c.words, w)
c.frequencies = append(c.frequencies, a.freqs[i])
c.ids[w] = i
c.totalFreq += a.freqs[i]
runeCount := utf8.RuneCountInString(w)
if runeCount > c.maxWordLength {
c.maxWordLength = runeCount
}
}
c.maxid = int64(len(a.words))
return nil
}
}
================================================
FILE: corpus/corpus.go
================================================
package corpus
import (
"sync/atomic"
"unicode/utf8"
"github.com/pkg/errors"
)
// Corpus is a data structure holding the relevant metadata and information for a corpus of text.
// It serves as vocabulary with ID for lookup. This is very useful as neural networks rely on the IDs rather than the text themselves
type Corpus struct {
words []string
frequencies []int
ids map[string]int
// atomic read and write plz
maxid int64
totalFreq int
maxWordLength int
}
// New creates a new *Corpus
func New() *Corpus {
c := &Corpus{
words: make([]string, 0),
frequencies: make([]int, 0),
ids: make(map[string]int),
}
// add some default words
c.Add("") // aka NULL - when there are no words
c.Add("-UNKNOWN-")
c.Add("-ROOT-")
c.maxWordLength = 0 // specials don't have lengths
return c
}
// Construct creates a Corpus given the construction options. This allows for more flexibility
func Construct(opts ...ConsOpt) (*Corpus, error) {
c := new(Corpus)
// checks
if c.words == nil {
c.words = make([]string, 0)
}
if c.frequencies == nil {
c.frequencies = make([]int, 0)
}
if c.ids == nil {
c.ids = make(map[string]int)
}
for _, opt := range opts {
if err := opt(c); err != nil {
return nil, err
}
}
return c, nil
}
// ID returns the ID of a word and whether or not it was found in the corpus
func (c *Corpus) Id(word string) (int, bool) {
id, ok := c.ids[word]
return id, ok
}
// Word returns the word given the ID, and whether or not it was found in the corpus
func (c *Corpus) Word(id int) (string, bool) {
size := atomic.LoadInt64(&c.maxid)
maxid := int(size)
if id >= maxid {
return "", false
}
return c.words[id], true
}
// Add adds a word to the corpus and returns its ID. If a word was previously in the corpus, it merely updates the frequency count and returns the ID
func (c *Corpus) Add(word string) int {
if id, ok := c.ids[word]; ok {
c.frequencies[id]++
c.totalFreq++
return id
}
id := atomic.AddInt64(&c.maxid, 1)
c.ids[word] = int(id - 1)
c.words = append(c.words, word)
c.frequencies = append(c.frequencies, 1)
c.totalFreq++
runeCount := utf8.RuneCountInString(word)
if runeCount > c.maxWordLength {
c.maxWordLength = runeCount
}
return int(id - 1)
}
// Size returns the size of the corpus.
func (c *Corpus) Size() int {
size := atomic.LoadInt64(&c.maxid)
return int(size)
}
// WordFreq returns the frequency of the word. If the word wasn't in the corpus, it returns 0.
func (c *Corpus) WordFreq(word string) int {
id, ok := c.ids[word]
if !ok {
return 0
}
return c.frequencies[id]
}
// IDFreq returns the frequency of a word given an ID. If the word isn't in the corpus it returns 0.
func (c *Corpus) IDFreq(id int) int {
size := atomic.LoadInt64(&c.maxid)
maxid := int(size)
if id >= maxid {
return 0
}
return c.frequencies[id]
}
// TotalFreq returns the total number of words ever seen by the corpus. This number includes the count of repeat words.
func (c *Corpus) TotalFreq() int {
return c.totalFreq
}
// MaxWordLength returns the length of the longest known word in the corpus.
func (c *Corpus) MaxWordLength() int {
return c.maxWordLength
}
// WordProb returns the probability of a word appearing in the corpus.
func (c *Corpus) WordProb(word string) (float64, bool) {
id, ok := c.Id(word)
if !ok {
return 0, false
}
count := c.frequencies[id]
return float64(count) / float64(c.totalFreq), true
}
// Merge combines two corpuses. The receiver is the one that is mutated.
func (c *Corpus) Merge(other *Corpus) {
for i, word := range other.words {
freq := other.frequencies[i]
if id, ok := c.ids[word]; ok {
c.frequencies[id] += freq
c.totalFreq += freq
} else {
id := c.Add(word)
c.frequencies[id] += freq - 1
c.totalFreq += freq - 1
}
}
}
// Replace replaces the content of a word. The old reference remains.
//
// e.g: c.Replace("foo", "bar")
// c.Id("foo") will still return a ID. The ID will be the same as c.Id("bar")
func (c *Corpus) Replace(a, with string) error {
old, ok := c.ids[a]
if !ok {
return errors.Errorf("Cannot replace %q with %q. %q is not found", a, with, a)
}
if _, ok := c.ids[with]; ok {
return errors.Errorf("Cannot replace %q with %q. %q exists in the corpus", a, with, with)
}
c.words[old] = with
return nil
}
// ReplaceWord replaces the word associated with the given ID. The old reference remains.
func (c *Corpus) ReplaceWord(id int, with string) error {
if id >= len(c.words) {
return errors.Errorf("Cannot replace word with ID %d. Out of bounds.", id)
}
if _, ok := c.ids[with]; ok {
return errors.Errorf("Cannot replace word with ID %d with %q. %q exists in the corpus", id, with, with)
}
c.words[id] = with
return nil
}
================================================
FILE: corpus/corpus_test.go
================================================
package corpus
import (
"testing"
"github.com/stretchr/testify/assert"
)
func TestCorpus(t *testing.T) {
assert := assert.New(t)
dict := New()
assert.Equal(0, dict.WordFreq("hello")) // frequency of a word not in dict ould have to be 0
assert.Equal(0, dict.IDFreq(3)) // ditto
id := dict.Add("hello")
assert.Equal(3, id)
assert.Equal([]string{"", "-UNKNOWN-", "-ROOT-", "hello"}, dict.words)
assert.Equal(map[string]int{"": 0, "-UNKNOWN-": 1, "-ROOT-": 2, "hello": 3}, dict.ids)
assert.Equal(4, dict.Size())
id2, ok := dict.Id("hello")
if !ok {
t.Errorf("The ID of null should be 0")
}
assert.Equal(id, id2)
word, ok := dict.Word(3)
if !ok {
t.Errorf("Expected word of ID 3 to be found")
}
assert.Equal("hello", word)
dict.Add(word)
assert.Equal(2, dict.WordFreq(word))
assert.Equal(2, dict.IDFreq(3))
assert.Equal(5, dict.TotalFreq())
assert.Equal(5, dict.MaxWordLength())
prob, ok := dict.WordProb(word)
if !ok {
t.Errorf("Expected a probability")
}
assert.Equal(0.4, prob)
// t.Logf("%q: %v", word, dict.WordProb(word))
}
func TestCorpus_Merge(t *testing.T) {
assert := assert.New(t)
dict := New()
id := dict.Add("hello")
dict.frequencies[id] += 4 // freq for "hello" is 5
dict.totalFreq += 4
other := New()
id = other.Add("hello")
other.frequencies[id] += 2 // freq for "hello" is 3
other.totalFreq += 2
id = other.Add("world")
other.frequencies[id] += 1
other.totalFreq += 1
dict.Merge(other)
assert.Equal(8, dict.WordFreq("hello"))
assert.Equal(2, dict.WordFreq("world"))
}
================================================
FILE: corpus/functions.go
================================================
package corpus
import (
"math"
"strings"
"unicode/utf8"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/treebank"
"github.com/pkg/errors"
)
// GenerateCorpus creates a Corpus given a set of SentenceTag from a training set.
func GenerateCorpus(sentenceTags []treebank.SentenceTag) *Corpus {
words := make([]string, 3)
frequencies := make([]int, 3)
words[0] = "" // aka NULL, for when no word can be found
frequencies[0] = 0 // no word is never found
words[1] = "-UNKNOWN-"
frequencies[1] = 0
words[2] = "-ROOT-"
frequencies[2] = 1
knownWords := make(map[string]int)
knownWords[""] = 0
knownWords["-UNKNOWN-"] = 1
knownWords["-ROOT-"] = 2
maxWordLength := 0
for _, sentenceTag := range sentenceTags {
for _, lex := range sentenceTag.Sentence {
id, ok := knownWords[lex.Value]
if !ok {
knownWords[lex.Value] = len(words)
words = append(words, lex.Value)
frequencies = append(frequencies, 1)
runeCount := utf8.RuneCountInString(lex.Value)
if runeCount > maxWordLength {
maxWordLength = runeCount
}
} else {
frequencies[id]++
}
}
}
var totals int
for _, f := range frequencies {
totals += f
}
return &Corpus{words, frequencies, knownWords, int64(len(words)), totals, maxWordLength}
}
// ViterbiSplit is a Viterbi algorithm for splitting words given a corpus
func ViterbiSplit(input string, c *Corpus) []string {
s := strings.ToLower(input)
probabilities := []float64{1.0}
lasts := []int{0}
runes := []int{}
for i := range s {
runes = append(runes, i)
}
runes = append(runes, len(s)+1)
for i := range s {
probs := make([]float64, 0)
ls := make([]int, 0)
// m := maxInt(0, i-c.maxWordLength)
for j, r := range runes {
if r > i {
break
}
p, ok := c.WordProb(s[r : i+1])
if !ok {
// http://stackoverflow.com/questions/195010/how-can-i-split-multiple-joined-words#comment48879458_481773
p = (math.Log(float64(1)/float64(c.totalFreq)) - float64(c.maxWordLength) - float64(1)) * float64(i-r) // note it should be i-r not j-i as per the SO post
}
prob := probabilities[j] * p
probs = append(probs, prob)
ls = append(ls, r)
}
maxProb := -math.SmallestNonzeroFloat64
maxK := -1 << 63
for j, p := range probs {
if p > maxProb {
maxProb = p
maxK = ls[j]
}
}
probabilities = append(probabilities, maxProb)
lasts = append(lasts, maxK)
}
words := make([]string, 0)
i := utf8.RuneCountInString(s)
for i > 0 {
start := lasts[i]
words = append(words, s[start:i])
i = start
}
// reverse it
for i, j := 0, len(words)-1; i < j; i, j = i+1, j-1 {
words[i], words[j] = words[j], words[i]
}
return words
}
// CosineSimilarity measures the cosine similarity of two strings.
func CosineSimilarity(a, b []string) float64 {
countsA := make([]float64, 0)
countsB := make([]float64, 0)
uniques := make(map[string]int)
// index the strings first
for _, st := range a {
s := strings.ToLower(st)
id, ok := uniques[s]
if !ok {
uniques[s] = len(countsA)
countsA = append(countsA, 1)
countsB = append(countsB, 0) // create for countsB, but don't add
} else {
countsA[id]++
}
}
for _, st := range b {
s := strings.ToLower(st)
id, ok := uniques[s]
if !ok {
uniques[s] = len(countsA)
countsA = append(countsA, 0)
countsB = append(countsB, 1)
} else {
countsB[id]++
}
}
magA, err := mag(countsA)
if err != nil {
panic(err)
}
magB, err := mag(countsB)
if err != nil {
panic(err)
}
dotProd, err := dot(countsA, countsB)
if err != nil {
panic(err)
}
return dotProd / (magA * magB)
}
// DamerauLevenshtein calculates the Damerau-Levensthtein distance between two strings. See more at https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
func DamerauLevenshtein(s1 string, s2 string) (distance int) {
// index by code point, not byte
r1 := []rune(s1)
r2 := []rune(s2)
// the maximum possible distance
inf := len(r1) + len(r2)
// if one string is blank, we needs insertions
// for all characters in the other one
if len(r1) == 0 {
return len(r2)
}
if len(r2) == 0 {
return len(r1)
}
// construct the edit-tracking matrix
matrix := make([][]int, len(r1))
for i := range matrix {
matrix[i] = make([]int, len(r2))
}
// seen characters
seenRunes := make(map[rune]int)
if r1[0] != r2[0] {
matrix[0][0] = 1
}
seenRunes[r1[0]] = 0
for i := 1; i < len(r1); i++ {
deleteDist := matrix[i-1][0] + 1
insertDist := (i+1)*1 + 1
var matchDist int
if r1[i] == r2[0] {
matchDist = i
} else {
matchDist = i + 1
}
matrix[i][0] = minInt(minInt(deleteDist, insertDist), matchDist)
}
for j := 1; j < len(r2); j++ {
deleteDist := (j + 1) * 2
insertDist := matrix[0][j-1] + 1
var matchDist int
if r1[0] == r2[j] {
matchDist = j
} else {
matchDist = j + 1
}
matrix[0][j] = minInt(minInt(deleteDist, insertDist), matchDist)
}
for i := 1; i < len(r1); i++ {
var maxSrcMatchIndex int
if r1[i] == r2[0] {
maxSrcMatchIndex = 0
} else {
maxSrcMatchIndex = -1
}
for j := 1; j < len(r2); j++ {
swapIndex, ok := seenRunes[r2[j]]
jSwap := maxSrcMatchIndex
deleteDist := matrix[i-1][j] + 1
insertDist := matrix[i][j-1] + 1
matchDist := matrix[i-1][j-1]
if r1[i] != r2[j] {
matchDist += 1
} else {
maxSrcMatchIndex = j
}
// for transpositions
var swapDist int
if ok && jSwap != -1 {
iSwap := swapIndex
var preSwapCost int
if iSwap == 0 && jSwap == 0 {
preSwapCost = 0
} else {
preSwapCost = matrix[maxInt(0, iSwap-1)][maxInt(0, jSwap-1)]
}
swapDist = i + j + preSwapCost - iSwap - jSwap - 1
} else {
swapDist = inf
}
matrix[i][j] = minInt(minInt(minInt(deleteDist, insertDist), matchDist), swapDist)
}
seenRunes[r1[i]] = i
}
return matrix[len(r1)-1][len(r2)-1]
}
// LongestCommonPrefix takes a slice of strings, and finds the longest common prefix
func LongestCommonPrefix(strs ...string) string {
switch len(strs) {
case 0:
return "" // idiots
case 1:
return strs[0]
}
min := strs[0]
max := strs[0]
for _, s := range strs[1:] {
switch {
case s < min:
min = s
case s > max:
max = s
}
}
for i := 0; i < len(min) && i < len(max); i++ {
if min[i] != max[i] {
return min[:i]
}
}
// In the case where lengths are not equal but all bytes
// are equal, min is the answer ("foo" < "foobar").
return min
}
/* The following two functions help in parsing a string into numbers. It's recommended you write abstractions over the functions*/
// StrsToInts converts a string slice into an int slice, with the help of NumberWords.
// The function assumes all helper words like "and" have been stripped.
// "One hundred and five" -> []string{"one", "hundred", "five"}
// This is a very primitive method, and doesn't take into account other words like "a hundred" or "a couple of hundred"
func StrsToInts(strs []string) (retVal []int, err error) {
for _, s := range strs {
intVal, ok := lingo.NumberWords[s]
if !ok {
return nil, errors.Errorf("Unable to parse the words %q as numbers", s)
}
if len(retVal) > 0 && intVal == 100 && retVal[len(retVal)-1] < 100 {
retVal[len(retVal)-1] *= 100
} else if len(retVal) > 0 && retVal[len(retVal)-1] < 1000 && intVal < 1000 {
retVal[len(retVal)-1] += intVal
} else {
retVal = append(retVal, intVal)
}
}
return
}
// CombineInts takes a int slice, and tries to make it one integer.
// It works by taking advantage of english - anything more than 1000 has a repeated pattern
// e.g.
// one hundred and fifty thousand two hundred and two
// there are 2 repeated patterns (one hundred and fifty) and (two hundred and two)
//
// This allows us to repeatedly combine by addition or multiplication until there is one left
func CombineInts(ints []int) int {
var total int
for len(ints) > 0 {
if len(ints) == 1 || ints[0] >= 1000 {
last := ints[len(ints)-1]
total += last
ints = ints[0 : len(ints)-1] //pop it
} else {
if ints[1] < 1000 {
// something went wrong
panic("HELP!")
}
total += ints[0] * ints[1]
ints = ints[2:]
}
}
return total
}
================================================
FILE: corpus/functions_test.go
================================================
package corpus
import (
"strings"
"testing"
"github.com/stretchr/testify/assert"
)
func Test_GenerateCorpus(t *testing.T) {
sentenceTags := mediumSentence()
dict := GenerateCorpus(sentenceTags)
// testing time
assert := assert.New(t)
expectedWords := []string{"", "-UNKNOWN-", "-ROOT-", "President", "Bush", "on", "Tuesday", "nominated", "two", "individuals", "to", "replace", "retiring", "jurists", "federal", "courts", "in", "the", "Washington", "area", "."}
expectedIDs := make(map[string]int)
for i, w := range expectedWords {
expectedIDs[w] = i
}
assert.Equal(expectedWords, dict.words, "Corpus known words should be the same as the manually annotated expected values")
assert.Equal(expectedIDs, dict.ids, "IDs should be the same as expected IDs")
assert.Equal(int64(len(expectedWords)), dict.maxid)
}
func TestViterbiSplit(t *testing.T) {
assert := assert.New(t)
dict := GenerateCorpus(mediumSentence())
s2 := "twoindividuals"
words := ViterbiSplit(s2, dict)
assert.Equal([]string{"two", "individuals"}, words)
s2 = "FederalCourts"
words = ViterbiSplit(s2, dict)
assert.Equal([]string{"federal", "courts"}, words)
s3 := "toreplaceon"
words = ViterbiSplit(s3, dict)
assert.Equal([]string{"to", "replace", "on"}, words)
}
func TestCosineSimilarity(t *testing.T) {
a := strings.Split("This is a test of cosine similarity", " ")
b := strings.Split("This is not a test of cosine similarity", " ")
s1 := CosineSimilarity(a, a)
s2 := CosineSimilarity(a, b)
if !floatEquals64(s1, 1) {
t.Error("Expected similarity to be 1 when compared with itself")
}
if s2 > s1 {
t.Error("Something went wrong with the cosine similarity algorithm")
}
c := strings.Split("Parramatta Road", " ")
d := strings.Split("Parramatta Rd", " ")
s1 = CosineSimilarity(c, c)
s2 = CosineSimilarity(c, d)
if !floatEquals64(s1, 1) {
t.Error("Expected similarity to be 1 when compared with itself")
}
if s2 > s1 {
t.Error("Something went wrong with the cosine similarity algorithm")
}
}
func TestDL(t *testing.T) {
a := "This is a test of Damerau Levenshtein"
b := "This is not a test of Damerau Levenshtein"
s1 := DamerauLevenshtein(a, a)
s2 := DamerauLevenshtein(a, b)
if s1 != 0 {
t.Errorf("Expected the distance to be 0 when compared against itself. Got %d", s1)
}
if s2 < s1 {
t.Error("Expected DL similarity to be greater when compared against itself")
}
c := "Parramatta Road"
d := "Paramatta Rd"
s1 = DamerauLevenshtein(c, c)
s2 = DamerauLevenshtein(c, d)
if s1 != 0 {
t.Errorf("Expected the distance to be 0 when compared against itself. Got %d", s1)
}
if s2 < s1 {
t.Error("Expected DL similarity to be greater when compared against itself")
}
}
func TestLCP(t *testing.T) {
assert := assert.New(t)
lcp := LongestCommonPrefix("Hello World", "Hell yeah!")
assert.Equal("Hell", lcp)
lcp = LongestCommonPrefix("Hello World", "Hell yeah!", "hey there")
assert.Equal("", lcp)
lcp = LongestCommonPrefix()
assert.Equal("", lcp)
lcp = LongestCommonPrefix("OneWord")
assert.Equal("OneWord", lcp)
lcp = LongestCommonPrefix("foo", "foobar")
assert.Equal("foo", lcp)
}
var parseNumTests = []struct {
s string
v int
}{
{"twenty nine", 29},
{"one hundred five", 105},
{"five hundred twenty thousand twenty one", 520021},
}
func TestParseNumber(t *testing.T) {
for _, pnts := range parseNumTests {
s := strings.Split(pnts.s, " ")
ints, err := StrsToInts(s)
if err != nil {
t.Error(err)
continue
}
v := CombineInts(ints)
if v != pnts.v {
t.Errorf("Expected %q to be parsed to %d. Got %d instead", pnts.s, pnts.v, v)
}
}
}
================================================
FILE: corpus/inflection.go
================================================
package corpus
import (
"regexp"
"github.com/chewxy/lingo"
)
type conversionPattern struct {
pattern *regexp.Regexp
replacement string
}
func newConversionPattern(from, to string) conversionPattern {
rFrom := regexp.MustCompile(from)
return conversionPattern{rFrom, to}
}
// plural -> singular
var plural = []conversionPattern{
newConversionPattern("(quiz)$", "${1}zes"),
newConversionPattern("^(ox)$", "${1}en"),
newConversionPattern("([m|l])ouse$", "${1}ice"),
newConversionPattern("(matr|vert|ind)ix|ex$", "${1}ices"),
newConversionPattern("(x|ch|ss|sh)$", "${1}es"),
newConversionPattern("([^aeiouy]|qu)ies$", "${1}y"),
newConversionPattern("([^aeiouy]|qu)y$", "${1}ies"),
newConversionPattern("(hive)$", "${1}s"),
newConversionPattern("(?:([^f])fe|([lr])f)$", "${1}${2}ves"),
newConversionPattern("sis$", "ses"),
newConversionPattern("([ti])um$", "${1}a"),
newConversionPattern("(buffal|tomat|potat)o$", "${1}oes"),
newConversionPattern("(bu)s$", "${1}ses"),
newConversionPattern("(alias|status|sex)$", "${1}es"),
newConversionPattern("(octop|vir)us$", "${1}i"),
newConversionPattern("(ax|test)is$", "${1}es"),
newConversionPattern("s$", "s"),
newConversionPattern("$", "s"),
}
// singular -> plural
var singular = []conversionPattern{
newConversionPattern("(quiz)zes$", "${1}"),
newConversionPattern("(matr)ices$", "${1}ix"),
newConversionPattern("(vert|ind)ices$", "${1}ex"),
newConversionPattern("^(ox)en", "${1}"),
newConversionPattern("(alias|status)es$", "${1}"),
newConversionPattern("(octop|vir)i$", "${1}us"),
newConversionPattern("(cris|ax|test)es$", "${1}is"),
newConversionPattern("(shoe)s$", "${1}"),
newConversionPattern("(o)es$", "${1}"),
newConversionPattern("(bus)es$", "${1}"),
newConversionPattern("([m|l])ice$", "${1}ouse"),
newConversionPattern("(x|ch|ss|sh)es$", "${1}"),
newConversionPattern("(m)ovies$", "${1}ovie"),
newConversionPattern("(s)eries$", "${1}eries"),
newConversionPattern("([^aeiouy]|qu)ies$", "${1}y"),
newConversionPattern("([lr])ves$", "${1}f"),
newConversionPattern("(tive)s$", "${1}"),
newConversionPattern("(hive)s$", "${1}"),
newConversionPattern("([^f])ves$", "${1}fe"),
newConversionPattern("(^analy)ses$", "${1}sis"),
newConversionPattern("((a)naly|(b)a|(d)iagno|(p)arenthe|(p)rogno|(s)ynop|(t)he)ses$", "${1}${2}sis"),
newConversionPattern("([ti])a$", "${1}um"),
newConversionPattern("(n)ews$", "${1}ews"),
newConversionPattern("s$", ""),
}
// weird pluralizations that don't match the rules above
var irregular = []conversionPattern{
newConversionPattern("person", "people"),
newConversionPattern("man", "men"),
newConversionPattern("child", "children"),
newConversionPattern("sex", "sexes"),
newConversionPattern("move", "moves"),
newConversionPattern("sleeve", "sleeves"),
newConversionPattern("datum", "data"),
newConversionPattern("box", "boxes"),
newConversionPattern("knife", "knives"),
}
var unconvertable = []string{
"equipment",
"information",
"rice",
"money",
"species",
"series",
"fish",
"sheep",
}
// Pluralize pluralizes words based on rules known
func Pluralize(word string) string {
if lingo.InStringSlice(word, unconvertable) {
return word
}
for _, cp := range irregular {
if cp.pattern.MatchString(word) {
return cp.replacement
}
}
for _, cp := range plural {
if cp.pattern.MatchString(word) {
// log.Printf("\t%q Matches %q", word, cp.pattern.String())
return cp.pattern.ReplaceAllString(word, cp.replacement)
}
}
return word
}
// Singularize singularizes words based on rules known
func Singularize(word string) string {
if lingo.InStringSlice(word, unconvertable) {
return word
}
for _, cp := range singular {
if cp.pattern.MatchString(word) {
return cp.pattern.ReplaceAllString(word, cp.replacement)
}
}
return word
}
================================================
FILE: corpus/inflection_test.go
================================================
package corpus
import "testing"
var pluralizeTest = []struct {
word, correct string
}{
{"friend", "friends"},
{"tomato", "tomatoes"},
{"knife", "knives"},
{"dwarf", "dwarves"},
{"box", "boxes"},
{"ox", "oxen"},
{"man", "men"},
{"equipment", "equipment"},
}
var singularizeTest = []struct {
word, correct string
}{
{"condolences", "condolence"},
{"fish", "fish"},
{"shoes", "shoe"},
{"viri", "virus"},
{"elves", "elf"},
}
func TestPluralize(t *testing.T) {
for _, pts := range pluralizeTest {
got := Pluralize(pts.word)
if got != pts.correct {
t.Errorf("Pluralizing %q failed. Want %q. Got %q instead", pts.word, pts.correct, got)
}
}
}
func TestSingularize(t *testing.T) {
for _, pts := range singularizeTest {
got := Singularize(pts.word)
if got != pts.correct {
t.Errorf("Singularizing %q failed. Want %q. Got %q instead", pts.word, pts.correct, got)
}
}
}
================================================
FILE: corpus/io.go
================================================
package corpus
import (
"bufio"
"bytes"
"encoding/gob"
"io"
"strconv"
"strings"
)
// sortutil is a utility struct meant to sort words based on IDs
type sortutil struct {
words []string
ids []int
freqs []int
}
func (s *sortutil) Len() int { return len(s.words) }
func (s *sortutil) Less(i, j int) bool { return s.ids[i] < s.ids[j] }
func (s *sortutil) Swap(i, j int) {
s.words[i], s.words[j] = s.words[j], s.words[i]
s.ids[i], s.ids[j] = s.ids[j], s.ids[i]
if len(s.freqs) > 0 {
s.freqs[i], s.freqs[j] = s.freqs[j], s.freqs[i]
}
}
// ToDictWithFreq returns a simple marshalable type. Conceptually it's a JSON object with the words as the keys. The values are a pair - ID and Freq.
func ToDictWithFreq(c *Corpus) map[string]struct{ ID, Freq int } {
retVal := make(map[string]struct{ ID, Freq int })
for i, w := range c.words {
retVal[w] = struct{ ID, Freq int }{i, c.frequencies[i]}
}
return retVal
}
// ToDict returns a marshalable dict. It returns a copy of the ID mapping.
func ToDict(c *Corpus) map[string]int {
retVal := make(map[string]int)
for k, v := range c.ids {
retVal[k] = v
}
return retVal
}
// GobEncode implements GobEncoder for *Corpus
func (c *Corpus) GobEncode() ([]byte, error) {
var buf bytes.Buffer
encoder := gob.NewEncoder(&buf)
if err := encoder.Encode(c.words); err != nil {
return nil, err
}
if err := encoder.Encode(c.ids); err != nil {
return nil, err
}
if err := encoder.Encode(c.frequencies); err != nil {
return nil, err
}
if err := encoder.Encode(c.maxid); err != nil {
return nil, err
}
if err := encoder.Encode(c.totalFreq); err != nil {
return nil, err
}
if err := encoder.Encode(c.maxWordLength); err != nil {
return nil, err
}
return buf.Bytes(), nil
}
// GobDecode implements GobDecoder for *Corpus
func (c *Corpus) GobDecode(buf []byte) error {
b := bytes.NewBuffer(buf)
decoder := gob.NewDecoder(b)
if err := decoder.Decode(&c.words); err != nil {
return err
}
if err := decoder.Decode(&c.ids); err != nil {
return err
}
if err := decoder.Decode(&c.frequencies); err != nil {
return err
}
if err := decoder.Decode(&c.maxid); err != nil {
return err
}
if err := decoder.Decode(&c.totalFreq); err != nil {
return err
}
if err := decoder.Decode(&c.maxWordLength); err != nil {
return err
}
return nil
}
// LoadOneGram loads a 1_gram.txt file, which is a tab separated file which lists the frequency counts of words. Example:
// the 23135851162
// of 13151942776
// and 12997637966
// to 12136980858
// a 9081174698
// in 8469404971
// for 5933321709
func (c *Corpus) LoadOneGram(r io.Reader) error {
scanner := bufio.NewScanner(r)
for scanner.Scan() {
line := scanner.Text()
splits := strings.Split(line, "\t")
if len(splits) == 0 {
break
}
word := splits[0] // TODO: normalize
count, err := strconv.Atoi(splits[1])
if err != nil {
return err
}
id := c.Add(word)
c.frequencies[id] = count
c.totalFreq--
c.totalFreq += count
wc := len([]rune(word))
if wc > c.maxWordLength {
c.maxWordLength = wc
}
}
return nil
}
================================================
FILE: corpus/io_test.go
================================================
package corpus
import (
"bytes"
"encoding/gob"
"strings"
"testing"
"github.com/stretchr/testify/assert"
)
func TestCorpusGob(t *testing.T) {
buf := new(bytes.Buffer)
c := New()
c.Add("Hello")
c.Add("World")
helloID, _ := c.Id("Hello")
worldID, _ := c.Id("World")
encoder := gob.NewEncoder(buf)
decoder := gob.NewDecoder(buf)
if err := encoder.Encode(c); err != nil {
t.Fatal(err)
}
c2 := New()
if err := decoder.Decode(c2); err != nil {
t.Fatal(err)
}
if hid, ok := c2.Id("Hello"); !ok || (ok && hid != helloID) {
t.Errorf("\"Hello\" not found after decoding.")
}
if wid, ok := c2.Id("World"); !ok || (ok && wid != worldID) {
t.Errorf("\"World\" not found after decoding.")
}
}
func TestCorpusToDict(t *testing.T) {
assert := assert.New(t)
c, _ := Construct(WithWords([]string{"World", "Hello", "World"}))
d := ToDict(c)
c2, err := Construct(FromDict(d))
if err != nil {
t.Fatal(err)
}
assert.Equal(c.words, c2.words, "Expected words to be the same")
assert.Equal(c.ids, c2.ids, "Expected IDs to be the same")
assert.NotEqual(c.frequencies, c2.frequencies, "Expected frequencies to not be the same")
assert.Equal(c.maxid, c2.maxid, "Expected maxID to be the same")
assert.NotEqual(c.totalFreq, c2.totalFreq, "Expected totalFreq to be different.")
assert.Equal(c.maxWordLength, c2.maxWordLength, "Expected maxWordLength to be the same")
}
func TestCorpusToDictWithFreq(t *testing.T) {
assert := assert.New(t)
c, _ := Construct(WithWords([]string{"World", "Hello", "World"}))
d := ToDictWithFreq(c)
c2, err := Construct(FromDictWithFreq(d))
if err != nil {
t.Fatal(err)
}
assert.Equal(c, c2)
}
func TestLoadOneGram(t *testing.T) {
assert := assert.New(t)
r := strings.NewReader(sample1Gram)
c := New()
err := c.LoadOneGram(r)
assert.Nil(err)
assert.Equal(10, c.Size())
id, ok := c.Id("for")
if !ok {
t.Errorf("Expected \"for\" to be in corpus after loading one gram file")
}
assert.Equal(int(c.maxid-1), id)
}
================================================
FILE: corpus/lda.go
================================================
package corpus
import (
"gorgonia.org/tensor"
)
// LDAModel ... TODO
//https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
type LDAModel struct {
// params
Alpha tensor.Tensor // is a Row
Eta tensor.Tensor // is a Col
//Kappa gorgonia.Scalar // Decay
//Tau0 gorgonia.Scalar // offset
// parameters needed for working
Topics int
ChunkSize int
Terms int
UpdateEvery int
EvalEvery int
// consts
Iterations int
GammaThreshold float64
MinimumProb float64
// track current progress
Updates int
// type
Dtype tensor.Dtype
}
func (l *LDAModel) init() {
eta := tensor.New(tensor.Of(l.Dtype), tensor.WithShape(l.Topics))
alpha := tensor.New(tensor.Of(l.Dtype), tensor.WithShape(l.Topics))
switch l.Dtype {
case tensor.Float64:
v := 1.0 / float64(l.Topics)
eta.Memset(v)
alpha.Memset(v)
case tensor.Float32:
v := float32(1) / float32(l.Topics)
eta.Memset(v)
alpha.Memset(v)
}
l.Alpha = alpha
l.Eta = eta
}
================================================
FILE: corpus/test_test.go
================================================
package corpus
import (
"strings"
"github.com/chewxy/lingo/treebank"
)
const sample1Gram = `the 23135851162
of 13151942776
and 12997637966
to 12136980858
a 9081174698
in 8469404971
for 5933321709`
func mediumSentence() []treebank.SentenceTag {
conllu := `1 President President PROPN NNP Number=Sing 2 compound _ _
2 Bush Bush PROPN NNP Number=Sing 5 nsubj _ _
3 on on ADP IN _ 4 case _ _
4 Tuesday Tuesday PROPN NNP Number=Sing 5 nmod _ _
5 nominated nominate VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 0 root _ _
6 two two NUM CD NumType=Card 7 nummod _ _
7 individuals individual NOUN NNS Number=Plur 5 dobj _ _
8 to to PART TO _ 9 mark _ _
9 replace replace VERB VB VerbForm=Inf 5 advcl _ _
10 retiring retire VERB VBG VerbForm=Ger 11 amod _ _
11 jurists jurist NOUN NNS Number=Plur 9 dobj _ _
12 on on ADP IN _ 14 case _ _
13 federal federal ADJ JJ Degree=Pos 14 amod _ _
14 courts court NOUN NNS Number=Plur 11 nmod _ _
15 in in ADP IN _ 18 case _ _
16 the the DET DT Definite=Def|PronType=Art 18 det _ _
17 Washington Washington PROPN NNP Number=Sing 18 compound _ _
18 area area NOUN NN Number=Sing 14 nmod _ _
19 . . PUNCT . _ 5 punct _ _
`
readr := strings.NewReader(conllu)
return treebank.ReadConllu(readr)
}
const EPSILON64 float64 = 1e-10
func floatEquals64(a, b float64) bool {
if (a-b) < EPSILON64 && (b-a) < EPSILON64 {
return true
}
return false
}
================================================
FILE: corpus/utils.go
================================================
package corpus
import (
"errors"
"math"
)
func minInt(a, b int) int {
if a < b {
return a
}
return b
}
func maxInt(a, b int) int {
if a > b {
return a
}
return b
}
func dot(a, b []float64) (float64, error) {
if len(a) != len(b) {
return 0, errors.New("Differing lengths!")
}
var retVal float64
for i, v := range a {
retVal += v * b[i]
}
return retVal, nil
}
func mag(a []float64) (float64, error) {
dotProd, err := dot(a, a)
if err != nil {
return dotProd, err
}
return math.Sqrt(dotProd), nil
}
================================================
FILE: dep/README.md
================================================
# Dependency Parser #
Package `dependencyparser` is a package that provides data structures and algorithms for a dependency parser as described by [Chen and Manning 2014](http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf) [PDF]. It achieves similar accuracy scores as the the cited paper.
# Installing #
`go get -u github.com/chewxy/lingo/dep`
# How It Works #
## Transition Based Parsing ##
The core of the parser is a transition based parser, as popularized by [Nivre 2003](https://stp.lingfil.uu.se/~nivre/docs/iwpt03.pdf) [PDF]. It's essentially a [shift-reduce parser](https://en.wikipedia.org/wiki/Shift-reduce_parser) with more states. Dan Jurafsky has a very [complete overview of transition-based parsing](https://web.stanford.edu/~jurafsky/slp3/14.pdf) [PDF], which should be consulted should more questions arise.
### Transitions ###
At the core of a transition based parser are two data structures: a stack and a queue. The queue, or buffer holds a list of words waiting to be parsed. Parsing is then simply a matter of manipulating the state of the stack and queue. Specifically there are three possible actions in an arc-standard parser:
* `Shift`: Shift simply shifts one word from the buffer on to the top of the stack
* `Left`: Left means the top of the stack is the head of the word underneath it. After the transition is applied (the link between the nodes attached), the word underneath the stack is removed.
* `Right`: Right means that the top of the stack is the child of the word underneath it. After the transition is applied, the top of the stack is popped.
A word on the terms "head", and "child". Consider the sentence "I am human":

We say "human" is the head of the words "I" and "am". Therefore, "I" and "am" are considered to be children of "human".
### Example ###
Let's look at a simple example to concrefy the ideas: "The cat sat on the mat". Here are the states
| Step | Stack | Buffer | Transition |
|------|-------------------------------|-------------------------------------------|------------|
|0 | [ROOT] | ["The", "cat", "sat", "on", "the", "mat"] | Shift |
|1 | [ROOT, "The"] | ["cat", "sat", "on", "the", "mat"] | Shift |
|2 | [ROOT, "The", "cat"] | ["sat", "on", "the", "mat"] | Left |
|3 | [ROOT, "cat"] | ["sat", "on", "the", "mat"] | Shift |
|4 | [ROOT, "cat", "sat"] | ["on", "the", "mat"] | Left |
|5 | [ROOT, "sat"] | ["on", "the", "mat"] | Shift |
|6 | [ROOT, "sat", "on"] | ["the", "mat"] | Shift |
|7 | [ROOT, "sat", "on", "the"] | ["mat"] | Shift |
|8 | [ROOT, "sat", "on", "the", "mat"] | [] | Left |
|9 | [ROOT, "sat", "on", "mat"] | [] | Left |
|10| [ROOT, "sat", "mat"] | [] | Right |
|11| [ROOT, "sat"] | [] | Left |
The above transitions produces this parse tree:

The real question then is of course - how does the system know which is the correct transition to emit, given the state?
The answer is machine learning.
## Machine Learning ##
What exactly are we learning? Or more carefully put, what are the inputs and outputs of the machine learning algorithm? The table in the example above provides a template for the inputs and output. The output is easy - the transition is what we want to learn.
As for the input, it's a little bit more complex. The input consists of the stack and the buffer. It'd be impractical and slow to include everything in the stack and buffer (dynamic neural networks are somewhat slower than static ones). So Chen and Manning came up with an ingenious idea -
* Use the top 3 words of the stack
* Use the top 3 words of the buffer
* Use the first and second leftmost/rightmost children of the first two words of the stack
Instead of directly using the words, POS Tag and dependency relations as features, the rather ingenious idea was that it would use vectors drawn from an embedding matrix to represent these features instead. So instead of building sparse features, concatenating the vectors form a fixed sized input vector. This makes training the network much more expedient.
You'll find this in [features.go](https://github.com/chewxy/lingo/blob/master/dependencyParser/features.go)
Given each state above, it'd be fairly trivial to extract an input vector based on the 18 "features" listed and feed forwards to a neural network. The result is a fast parser.
### Neural Network ###
The machine learning algorithm behind this parser is a simple 3-layered network. An input layer is constructed from the embedding matrices, and is forwarded to the first layer, which is activated by a cube activation function. This then passes forwards to a dropout layer before the last layer, which is a softmax layer.
[image of NN]
## Hairy Bits ##
The hairy bits of this is the oracle. Specifically, the question: given a training sentence, how do we generate correct examples such as the table above?
TODO: finish writing this section
# How To Use #
This package provides three main data structures for use:
* `Parser`
* `Model`
* `Trainer`
`Trainer` takes a `[]treebank.SentenceTag` and produces a `Model`. `Parser` requires a `Model` to run, and is basically a exported wrapper over `configuration` that handles a pipeline.
## Basic NLP Pipeline ##
```go
func main() {
inputString: `The cat sat on the mat`
lx := lexer.New("dummy", strings.NewReader(inputString)) // lexer - required to break a sentence up into words.
pt := pos.New(pos.WithModel(posModel)) // POS Tagger - required to tag the words with a part of speech tag.
dp := dep.New(depModel) // Creates a new parser
// set up a pipeline
pt.Input = lx.Output
dp.Input = pt.Output
// run all
go lx.Run()
go pt.Run()
go dp.Run()
// wait to receive:
for {
select {
case d := <- dp.Output:
// do something
case err:= <-dp.Error:
// handle error
}
}
}
```
## Training A Model ##
To train a model you'd use the `Trainer`. The trainer accepts a `[]treebank.SentenceTag`. As long as you can parse your training file into those (package `treebank` accepts CONLLU formatted files as well as the PennTreebank formatted files), you'd be fine.
An example trainer is in the cmd directory of `lingo`
# FAQ #
**Why not an LSTM or RNN to encode the state of the stack and buffer?**
The answer is simplicity and speed. I have attempted variants of the parser with different neural networks - they don't work as fast as this. I am aware of Parsey-McParseface and the slightly improved accuracy compared to this model, but the speed has been not as great as I expect. This package emphasises parsing speed over accuracy - for most well written English sentences, this package performs well.
**Why are there no models?**
I'm afraid you're gonna have to train your own models. Training takes days on the Universal Dependency dataset and I haven't had the time to train on those. All my models are specific to the use of the company, and hence cannot be released.
**What caveats are there?**
Chen and Manning described using pre-computed activations for the top 10000 or so words. I did not implement that, but it would be trivial to revisit and implement it. Feel free to send a pull request.
**How can this be sped up?**
Use multiple, smaller trainers, each training on a separate batch. You can hence train them concurrently (pass the costs in a channel and collect at the end). At the end, sum the gradients before applying adagrad. The trade off is that a LOT more memory will be used. It's also the reason why it wasn't included as the default. It's quite trivial to write though. Send a pull request if you have managed to reduce memory usage.
# Contributing #
see package lingo's CONTRIBUTING.md for more information. There is currently a list of issues in Github issues. Those are good places to start.
# Licence #
This package is MIT licenced.
================================================
FILE: dep/arcStandard.go
================================================
package dep
import "github.com/chewxy/lingo"
// var SingleRoot bool = true // make this part of a build process
// canApply checks if a particular transition can be applied
func (c *configuration) canApply(t transition) bool {
var h head
if t.Move == Left || t.Move == Right {
if t.Move == Left {
h = c.stackValue(0)
} else {
h = c.stackValue(1)
}
if h < 0 {
return false
}
if h == 0 && t.DependencyType != lingo.Root {
return false
}
}
stackSize := c.stackSize()
bufferSize := c.bufferSize()
if t.Move == Left {
return stackSize > 2
}
if t.Move == Right {
return stackSize > 2 || (stackSize == 2 && bufferSize == 0)
// if not single root build
// return stackSize >= 2
}
return bufferSize > 0 // strange other thing...
}
// apply applies the transition
func (c *configuration) apply(t transition) {
logf("Applying %v", t)
w1 := int(c.stackValue(1))
w2 := int(c.stackValue(0))
if t.Move == Left {
c.AddArc(w2, w1, t.DependencyType)
c.removeSecondTopStack()
} else if t.Move == Right {
c.AddArc(w1, w2, t.DependencyType)
c.removeTopStack()
} else {
c.shift()
}
}
// oracle gets the gold transition given the state
func (c *configuration) oracle(goldParse *lingo.Dependency) (t transition) {
w1 := int(c.stackValue(1))
w2 := int(c.stackValue(0))
if w1 > 0 && goldParse.Head(w1) == w2 {
t.Move = Left
t.DependencyType = goldParse.Label(w1)
return
} else if w1 >= 0 && goldParse.Head(w2) == w1 && !c.hasOtherChildren(w2, goldParse) {
t.Move = Right
t.DependencyType = goldParse.Label(w2)
return
}
return // default transition is Shift
}
================================================
FILE: dep/arcStandard_test.go
================================================
package dep
import (
"testing"
"github.com/chewxy/lingo"
"github.com/stretchr/testify/assert"
)
func TestCanApply(t *testing.T) {
dep := simpleSentence()[0].Dependency(dummyFix{})
buffer := make([]head, 0)
for i := 1; i < dep.WordCount(); i++ {
buffer = append(buffer, head(i))
}
stack := []head{0}
c := &configuration{
Dependency: dep,
stack: stack,
buffer: buffer,
}
assert := assert.New(t)
logf("Start config: \n%v", c)
rootLeft := c.canApply(transition{Left, lingo.Root})
rootRight := c.canApply(transition{Right, lingo.Root})
NSubjLeft := c.canApply(transition{Left, lingo.NSubj})
NSubjRight := c.canApply(transition{Right, lingo.NSubj})
ShiftDep := c.canApply(transition{Shift, lingo.NoDepType})
assert.Equal(false, rootLeft, "rootLeft should be false")
assert.Equal(false, rootRight, "rootRight should be false")
assert.Equal(false, NSubjLeft, "NSubjLeft should be false")
assert.Equal(false, NSubjRight, "NSubjRight should be false")
assert.Equal(true, ShiftDep, "ShiftDep should be true")
logf("rootRight: %v, rootLeft: %v", rootLeft, rootRight)
logf("NSubjRight: %v, NSubjLeft: %v", NSubjRight, NSubjLeft)
logf("ShiftDep: %v", ShiftDep)
c.shift()
c.shift()
logf("%v", c)
rootLeft = c.canApply(transition{Left, lingo.Root})
rootRight = c.canApply(transition{Right, lingo.Root})
NSubjLeft = c.canApply(transition{Left, lingo.NSubj})
NSubjRight = c.canApply(transition{Right, lingo.NSubj})
ShiftDep = c.canApply(transition{Shift, lingo.NoDepType})
assert.Equal(true, rootLeft, "rootLeft should be true")
assert.Equal(true, rootRight, "rootRight should be true")
assert.Equal(true, NSubjLeft, "NSubjLeft should be true")
assert.Equal(true, NSubjRight, "NSubjRight should be true")
assert.Equal(true, ShiftDep, "ShiftDep should be true")
logf("rootRight: %v, rootLeft: %v", rootLeft, rootRight)
logf("NSubjRight: %v, NSubjLeft: %v", NSubjRight, NSubjLeft)
logf("ShiftDep: %v", ShiftDep)
}
func TestOracle(t *testing.T) {
st := simpleSentence()[0]
s := st.AnnotatedSentence(nil)
c := newConfiguration(s, true)
d := s.Dependency()
for count := 0; !c.isTerminal() && count < 100; count++ {
oracle := c.oracle(d)
if !c.canApply(oracle) && (oracle != transition{Right, lingo.Root}) {
t.Errorf("Cannot apply %v", oracle)
break
}
c.apply(oracle)
}
assert.Equal(t, d.Heads(), c.Heads())
}
================================================
FILE: dep/configuration.go
================================================
package dep
import (
"fmt"
"github.com/chewxy/lingo"
)
// describes the current state of the parser
type head int
const (
DOES_NOT_EXIST head = iota - 1
)
// configuration is the meat of the shift-reduce parsing. It holds the state for the shift reduction
type configuration struct {
*lingo.Dependency
stack []head
buffer []head
bp int // buffer pointer - starts at 0, increments
}
func newConfiguration(sentence lingo.AnnotatedSentence, fromGold bool) *configuration {
if fromGold {
sentence = sentence.Clone()
}
dep := lingo.NewDependency(lingo.FromAnnotatedSentence(sentence), lingo.AllocTree())
dep.SetID()
sentence = sentence[1:] // because the POSTagger automatically adds a ROOTTAG at the end of it
var buffer []head
for i := 1; i <= len(sentence); i++ {
buffer = append(buffer, head(i))
}
var stack []head
stack = append(stack, head(0)) // add root
return &configuration{
Dependency: dep,
stack: stack,
buffer: buffer,
}
}
func (c *configuration) String() string {
return fmt.Sprintf("Stack: %v Buffer(%d): %v", c.stack, c.bp, c.buffer[c.bp:])
}
func (c *configuration) GoString() string {
return fmt.Sprintf("Stack: %v Buffer(%d): %v\nHeads: %v\nRels: %v\n", c.stack, c.bp, c.buffer[c.bp:], c.Heads(), c.Labels())
}
func (c *configuration) bufferSize() int {
return len(c.buffer) - c.bp
}
func (c *configuration) stackSize() int {
return len(c.stack)
}
func (c *configuration) head(i int) head {
heads := c.Heads() // TODO: maybe some sanity checks?
return head(heads[i])
}
// gets the sentence index of the ith word on the stack. If there isn't anything on the stack, it returns DOES_NOT_EXIST
func (c *configuration) stackValue(i int) head {
size := c.stackSize()
if i >= size || i < 0 {
return DOES_NOT_EXIST
}
return c.stack[size-1-i]
}
func (c *configuration) bufferValue(i int) head {
size := c.bufferSize()
if i >= size {
return DOES_NOT_EXIST
}
return c.buffer[i+c.bp]
}
/* stack machinations */
// pop pops the stack. It isn't really used any more. removeStack(), removeTopStack() and removeSecondTopStack() has superseded its function
func (c *configuration) pop() head {
retVal := c.stack[len(c.stack)-1]
c.stack = c.stack[0 : len(c.stack)-1]
return retVal
}
// removes a value from the stack.
func (c *configuration) removeStack(i int) {
c.stack = c.stack[:i+copy(c.stack[i:], c.stack[i+1:])]
}
// removeSecondTopStack removes the 2nd-to-last element
func (c *configuration) removeSecondTopStack() bool {
stackSize := c.stackSize()
if stackSize < 2 {
return false
}
i := stackSize - 2
c.removeStack(i)
return true
}
func (c *configuration) removeTopStack() bool {
stackSize := c.stackSize()
if stackSize < 1 {
return false
}
i := stackSize - 1
c.removeStack(i)
return true
}
/* Dependency related stuff */
func (c *configuration) label(i head) lingo.DependencyType {
if i < 0 {
return lingo.NoDepType
}
if i == 0 {
return lingo.NoDepType
}
return c.Label(int(i))
// i--
// labels := c.Labels()
// return labels[i]
}
func (c *configuration) annotation(i head) *lingo.Annotation {
if i < 0 {
return lingo.NullAnnotation()
}
if i == 0 {
return lingo.RootAnnotation()
}
// i--
return c.Annotation(int(i))
// return c.Sentence()[i]
}
// gets the jth left child of the ith word of a sentence
func (c *configuration) lc(k, cnt head) head {
if k < 0 || int(k) > c.N() {
return DOES_NOT_EXIST
}
cc := 0
for i := 1; i < int(k); i++ {
if c.Head(i) == int(k) {
cc++
if int(cnt) == cc {
return head(i)
}
}
}
return DOES_NOT_EXIST
}
func (c *configuration) rc(k, cnt head) head {
if k < 0 || int(k) > c.N() {
return DOES_NOT_EXIST
}
cc := 0
for i := c.N(); i > int(k); i-- {
if c.Head(i) == int(k) {
cc++
if cc == int(cnt) {
return head(i)
}
}
}
return DOES_NOT_EXIST
}
func (c *configuration) hasOtherChildren(i int, goldParse *lingo.Dependency) bool {
for j := 1; j <= goldParse.N(); j++ {
if goldParse.Head(j) == i && c.Head(j) != i {
return true
}
}
return false
}
func (c *configuration) isTerminal() bool {
return c.stackSize() == 1 && c.bufferSize() == 0
}
// Actual Transitioning stuff
func (c *configuration) shift() bool {
i := c.bufferValue(0)
if i == DOES_NOT_EXIST {
return false
}
c.bp++ // move the buffer pointer up
c.stack = append(c.stack, i) // push to it.... gotta work the pop
return true
}
================================================
FILE: dep/configuration_test.go
================================================
package dep
import (
"testing"
"github.com/chewxy/lingo"
"github.com/stretchr/testify/assert"
)
func TestStackAppendRemove(t *testing.T) {
sentence := mediumSentence()[0]
as := sentence.AnnotatedSentence(dummyFix{})
c := newConfiguration(as, true)
t.Logf("C: %v", c)
t.Logf("C: %#v", c)
assert := assert.New(t)
c.stack = append(c.stack, 200)
assert.Equal([]head{0, 200}, c.stack, "stack is not equal after appending")
c.removeTopStack()
assert.Equal([]head{0}, c.stack, "stack is not equal after removeTopStack")
c.stack = append(c.stack, 200)
c.removeSecondTopStack()
assert.Equal([]head{200}, c.stack, "stack is not equal after removeSecondTopStack()")
correctHeads := []int{-1} // the -1 is the root
correctHeads = append(correctHeads, sentence.Heads...)
correctLabels := []lingo.DependencyType{lingo.Root}
correctLabels = append(correctLabels, sentence.Labels...)
dep := sentence.Dependency(dummyFix{})
assert.Equal(correctHeads, dep.Heads(), "Heads are not equal")
assert.Equal(correctLabels, dep.Labels(), "Labels are not equal %v \n %v", correctLabels, dep.Labels())
}
func TestConfiguration_StackValue(t *testing.T) {
c := new(configuration)
c.stack = []head{0, 1, 2, 5, 6}
zero := c.stackValue(0)
one := c.stackValue(1)
four := c.stackValue(4)
five := c.stackValue(5)
negone := c.stackValue(-1)
assert := assert.New(t)
assert.Equal(head(6), zero, "Zeroth value not the same")
assert.Equal(head(5), one, "First value not the same")
assert.Equal(head(0), four, "Fourth value not the same")
assert.Equal(DOES_NOT_EXIST, five, "Fifth value not the same")
assert.Equal(DOES_NOT_EXIST, negone, "NegOne value not the same")
}
================================================
FILE: dep/debug.go
================================================
// +build debug
package dep
import (
"bytes"
"fmt"
"log"
"runtime"
"strings"
"sync/atomic"
"github.com/chewxy/lingo"
)
const BUILD_DEBUG = "PARSER: DEBUG BUILD"
const BUILD_DIAG = "Diagnostic Build"
const DEBUG = true
var READMEMSTATS = true
var TABCOUNT uint32 = 0
func tabcount() int {
return int(atomic.LoadUint32(&TABCOUNT))
}
func enterLoggingContext() {
atomic.AddUint32(&TABCOUNT, 1)
tc := tabcount()
log.SetPrefix(strings.Repeat("\t", tc))
}
func leaveLoggingContext() {
tc := tabcount()
tc--
if tc < 0 {
atomic.StoreUint32(&TABCOUNT, 0)
tc = 0
} else {
atomic.StoreUint32(&TABCOUNT, uint32(tc))
}
log.SetPrefix(strings.Repeat("\t", tc))
}
func logf(format string, others ...interface{}) {
if !DEBUG {
return
}
log.Printf(format, others...)
}
func logTrainingProgress(iteration, correct, total, length, possibles int) {
if !DEBUG {
return
}
log.Printf("Iteration %d. Correct/Total: %d/%d = %.2f", iteration, correct, total, float64(correct)/float64(total))
log.Printf("DictSize: %d/%d, load factor of: %.2f", length, possibles, float64(length)/float64(possibles))
}
func logMemStats() {
if !DEBUG || !READMEMSTATS {
return
}
var mem runtime.MemStats
runtime.ReadMemStats(&mem)
log.Printf("Allocated : %.2f MB", (float64(mem.Alloc)/1024)/float64(1024))
log.Printf("Total Allocated : %.2f MB", (float64(mem.TotalAlloc)/1024)/float64(1024))
log.Printf("Heap Allocted : %.2f MB", (float64(mem.HeapAlloc)/1024)/float64(1024))
log.Printf("Sys Total Allocated: %.2f MB", (float64(mem.HeapSys)/1024)/float64(1024))
log.Println("----------")
}
func recoverFrom(format string, attrs ...interface{}) {
if r := recover(); r != nil {
log.Printf(format, attrs...)
panic(r)
}
}
/* Nice output of shit */
func (d *Parser) SprintFeatures(features []int) string {
// tabcount := int(atomic.LoadUint32(&TABCOUNT))
var buf bytes.Buffer
for i := 0; i < 18; i++ {
number := features[i]
id := number - wordFeatsStartAt
word, _ := d.corpus.Word(id)
if word == "" {
word = "-NULL-"
}
buf.WriteString(fmt.Sprintf("%d, %q, %d \n", feature(i), word, number))
}
for i := 0; i < 18; i++ {
number := features[i+18]
buf.WriteString(fmt.Sprintf("%d, %v, %d\n", feature(i+18), lingo.POSTag(number), number))
}
for i := 0; i < 12; i++ {
number := features[i+36]
id := number - labelFeatsStartAt
buf.WriteString(fmt.Sprintf("%d, %v, %d\n", feature(i+36), lingo.DependencyType(id), number))
}
return buf.String()
}
func SprintScores(scores []float64, ts []transition) string {
var buf bytes.Buffer
for i, v := range scores {
if i >= len(ts) {
buf.WriteString(fmt.Sprintf("UNKNOWN TRANSITION, %v\n", v))
continue
}
buf.WriteString(fmt.Sprintf("%v, %v\n", ts[i], v))
}
return buf.String()
}
func SprintFloatSlice(a []float64) string {
var buf bytes.Buffer
buf.WriteString("[")
for i, v := range a {
if i < len(a)-1 {
buf.WriteString(fmt.Sprintf("%v, ", v))
} else {
buf.WriteString(fmt.Sprintf("%v", v))
}
}
buf.WriteString("]")
return buf.String()
}
================================================
FILE: dep/dependencyParser.go
================================================
package dep
import (
"fmt"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/corpus"
"github.com/pkg/errors"
)
var KnownWords *corpus.Corpus // package provided global
// Parser is the object that performs the dependency parsing
// It contains a neural network, which is the core of it.
//
// The same object can be used to train the NN
type Parser struct {
Input chan lingo.AnnotatedSentence
Output chan *lingo.Dependency
Error chan error
*Model
}
// New creates a new Parser
func New(m *Model) *Parser {
d := &Parser{
Output: make(chan *lingo.Dependency),
Error: make(chan error),
Model: m,
}
return d
}
// Run is used when using the NN to parse a sentence. For training, see Train()
func (d *Parser) Run() {
defer close(d.Output)
for sentence := range d.Input {
dep, err := d.predict(sentence)
if err != nil {
d.Error <- err
return
}
d.Output <- dep
}
return
}
func (d *Parser) predict(sentence lingo.AnnotatedSentence) (*lingo.Dependency, error) {
// defer func() {
// if r := recover(); r != nil {
// log.Printf("Parsing for %q", sentence.ValueString())
// panic(r)
// }
// }()
c := newConfiguration(sentence, false)
var err error
var argmax int
var count int
for !c.isTerminal() && count < 100 {
logf("%v", c)
if count == 99 {
logf("TARPIT")
}
features := getFeatures(c, d.corpus)
// features2 := getFeatureArray(c, d.dict)
if argmax, err = d.nn.pred(features); err != nil {
return nil, err
}
// log.Printf("Argmax: %v, len(d.ts): %v, len(transitions) %v", argmax, len(d.ts), len(transitions))
t := transitions[argmax] // no this is NOT a mistake
if !c.canApply(t) {
t = transition{Shift, lingo.NoDepType} // reset
// manual argmaxing
switch scores := d.nn.scores.Value().Data().(type) {
case []float32:
var maxScore float32
for i, kt := range d.ts {
if scores[i] > maxScore && c.canApply(kt) {
maxScore = scores[i]
t = kt
}
}
case []float64:
var maxScore float64
for i, kt := range d.ts {
if scores[i] > maxScore && c.canApply(kt) {
maxScore = scores[i]
t = kt
}
}
default:
return nil, errors.Errorf("Unhandled score type %T", d.nn.scores.Value())
}
}
c.apply(t)
count++
}
fix(c.Dependency)
return c.Dependency, err
}
func (d *Parser) String() string {
var nns, ds string
if d.corpus != nil {
ds = fmt.Sprintf("\nDict Size: %d words\nMAXTAG: %d\nMAXDEPTYPE: %d\n", d.corpus.Size(), lingo.MAXTAG, lingo.MAXDEPTYPE)
} else {
ds = "\n"
}
if d.nn != nil && d.nn.initialized() {
nns = fmt.Sprintf("\nNeural Network:\n=================\n%v\n", d.nn)
}
if !d.nn.initialized() {
panic(fmt.Sprintf("%v", d.nn))
}
base := "\n\nDependency Parser Info:\n=======================\n"
return base + ds + nns
}
================================================
FILE: dep/documentation/iamhuman.dot
================================================
digraph G {
Node_0xc425b88740->Node_0xc425b88780[ label=Root ];
Node_0xc425b88780->Node_0xc425b88800[ label=Cop ];
Node_0xc425b88780->Node_0xc425b887c0[ label=NSubj ];
Node_0xc425b88740 [ label="0: "-ROOT-/ROOT_TAG"" ];
Node_0xc425b88780 [ label="3: "human/JJ"" ];
Node_0xc425b887c0 [ label="1: "I/PRP"" ];
Node_0xc425b88800 [ label="2: "am/VBP"" ];
}
================================================
FILE: dep/documentation/thecatsatonthemat.dot
================================================
digraph G {
Node_0xc4349eeec0->Node_0xc4349eef80[ label=Root ];
Node_0xc4349eef80->Node_0xc4349eefc0[ label=NMod ];
Node_0xc4349eefc0->Node_0xc4349ef040[ label=Det ];
Node_0xc4349eef80->Node_0xc4349eef00[ label=NSubj ];
Node_0xc4349eef00->Node_0xc4349eef40[ label=Det ];
Node_0xc4349eefc0->Node_0xc4349ef000[ label=Case ];
Node_0xc4349eeec0 [ label="0: "-ROOT-/ROOT_TAG"" ];
Node_0xc4349eef00 [ label="2: "cat/NN"" ];
Node_0xc4349eef40 [ label="1: "the/DT"" ];
Node_0xc4349eef80 [ label="3: "sat/VBD"" ];
Node_0xc4349eefc0 [ label="6: "mat/NN"" ];
Node_0xc4349ef000 [ label="4: "on/IN"" ];
Node_0xc4349ef040 [ label="5: "the/DT"" ];
}
================================================
FILE: dep/errors.go
================================================
package dep
import (
"fmt"
"github.com/chewxy/lingo"
)
type componentUnavailable string
func (c componentUnavailable) Error() string { return fmt.Sprintf("%v unavailable", c) }
func (c componentUnavailable) Component() string { return string(c) }
// TarpitError is an error when the arc-standard is stuck.
// It implements GoStringer, which when called will output the state as a string.
// It also implements lingo.Sentencer, so the offending sentence can easily be retrieved
type TarpitError struct{ *configuration }
func (err TarpitError) Error() string { return "Tarpit Error" }
// NonProjective error is the error that is emitted when the dependency tree is not projective (that is to say the children cross lines)
type NonProjectiveError struct{ *lingo.Dependency }
func (err NonProjectiveError) Error() string { return "Non-projective tree" }
================================================
FILE: dep/evaluation.go
================================================
package dep
import (
"fmt"
"io/ioutil"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/treebank"
)
// Performance is a tuple that holds performance information from a training session
type Performance struct {
Iter int // which training iteration is this?
UAS float64 // Unlabelled Attachment Score
LAS float64 // Labeled Attachment Score
UEM float64 // Unlabelled Exact Match
Root float64 // Correct Roots Ratio
}
func (p Performance) String() string {
s := `EPO: %d
UAS: %.5f
LAS: %.5f
UEM: %.5f
ROO: %.5f`
return fmt.Sprintf(s, p.Iter, p.UAS, p.LAS, p.UEM, p.Root)
}
// performance evaluation related code goes here
// Evaluate compares predicted trees with the gold standard trees and returns a Performance. It panics if the number of predicted trees and the number of gold trees aren't the same
func Evaluate(predictedTrees, goldTrees []*lingo.Dependency) Performance {
if len(predictedTrees) != len(goldTrees) {
panic(fmt.Sprintf("%d predicted trees; %d gold trees. Unable to compare", len(predictedTrees), len(goldTrees)))
}
var correctLabels, correctHeads, correctTrees, correctRoot, sumArcs float64
var check int
for i, tr := range predictedTrees {
gTr := goldTrees[i]
if len(tr.AnnotatedSentence) != len(gTr.AnnotatedSentence) {
sumArcs += float64(gTr.N())
// log.Printf("WARNING: %q and %q do not have the same length", tr, gTr)
continue
}
var nCorrectHead int
for j, a := range tr.AnnotatedSentence[1:] {
b := gTr.AnnotatedSentence[j+1]
if a.HeadID() == b.HeadID() {
correctHeads++
nCorrectHead++
}
if a.DependencyType == b.DependencyType {
correctLabels++
}
sumArcs++
}
if nCorrectHead == gTr.N() {
correctTrees++
}
if tr.Root() == gTr.Root() {
correctRoot++
}
// check 5 per iteration
if check < 5 {
logf("predictedHeads: \n%v\n%v\n", tr.Heads(), gTr.Heads())
logf("Ns: %v | %v || Correct: %v", tr.N(), gTr.N(), nCorrectHead)
check++
}
}
uas := correctHeads / sumArcs
las := correctLabels / sumArcs
uem := correctTrees / float64(len(predictedTrees))
roo := correctRoot / float64(len(predictedTrees))
return Performance{UAS: uas, LAS: las, UEM: uem, Root: roo}
}
func (t *Trainer) crossValidate(st []treebank.SentenceTag) Performance {
preds := t.predMany(st)
golds := make([]*lingo.Dependency, len(st))
for i, s := range st {
golds[i] = s.Dependency(t)
}
return Evaluate(preds, golds)
}
func (t *Trainer) predMany(sentenceTags []treebank.SentenceTag) []*lingo.Dependency {
retVal := make([]*lingo.Dependency, len(sentenceTags))
for i, st := range sentenceTags {
dep, err := t.pred(st.AnnotatedSentence(t))
if err != nil {
ioutil.WriteFile("fullGraph.dot", []byte(t.nn.g.ToDot()), 0644)
panic(fmt.Sprintf("%+v", err))
}
retVal[i] = dep
}
return retVal
}
func (t *Trainer) pred(as lingo.AnnotatedSentence) (*lingo.Dependency, error) {
d := new(Parser)
d.Model = t.Model
return d.predict(as)
}
================================================
FILE: dep/example.go
================================================
package dep
import (
"math/rand"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/corpus"
"github.com/chewxy/lingo/treebank"
)
// example is a training example.
type example struct {
transition
features []int // features are used in the embeddings
labels []int // labels are used in scoring the transitions
}
func makeExamples(sentenceTags []treebank.SentenceTag, conf NNConfig, dict *corpus.Corpus, ts []transition, f lingo.AnnotationFixer) []example {
var examples []example
var tarpit, nonprojective, good int
for i, sentenceTag := range sentenceTags {
exs, err := makeOneExample(i, sentenceTag, dict, ts, f)
if err != nil {
switch err.(type) {
case TarpitError:
tarpit++
case NonProjectiveError:
nonprojective++
}
} else {
examples = append(examples, exs...)
good++
}
}
logf("Number of SentenceTags Generated Into Examples: %d/%d | Number of Examples: %d | Number of nonprojective examples: %d | Number of tarpit examples: %d", good, len(sentenceTags), len(examples), nonprojective, tarpit)
return examples
}
// makeOneExample is an example of a poorly named function. It makes an example from a SentenceTag
func makeOneExample(i int, sentenceTag treebank.SentenceTag, dict *corpus.Corpus, ts []transition, f lingo.AnnotationFixer) ([]example, error) {
var examples []example
s := sentenceTag.AnnotatedSentence(f)
dep := s.Dependency()
if dep.IsProjective() {
c := newConfiguration(s, true)
count := 0
for !c.isTerminal() && count < 1000 {
if count == 999 {
return examples, TarpitError{c}
}
oracle := c.oracle(dep)
features := getFeatures(c, dict)
labels := make([]int, MAXTRANSITION)
for i, t := range ts {
if t == oracle {
labels[i] = 1
} else if c.canApply(t) {
labels[i] = 0
} else {
labels[i] = -1
}
}
ex := example{transition{oracle.Move, oracle.DependencyType}, features, labels}
examples = append(examples, ex)
c.apply(oracle)
count++
}
} else {
return nil, NonProjectiveError{dep}
}
return examples, nil
}
func shuffleExamples(a []example) {
for i := range a {
j := rand.Intn(i + 1)
a[i], a[j] = a[j], a[i]
}
}
================================================
FILE: dep/example_test.go
================================================
package dep
import (
"testing"
"github.com/chewxy/lingo/corpus"
)
func TestMakeExamples(t *testing.T) {
st := simpleSentence()
dict := corpus.GenerateCorpus(st)
exs := makeExamples(st, DefaultNNConfig, dict, transitions, dummyFix{})
if len(exs) != 20 {
t.Error("Expected 20 examples to be generated from simple sentence")
}
}
================================================
FILE: dep/featureExtraction.go
================================================
package dep
import (
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/corpus"
)
// getFeatures extracts the IDs to pass into the neural network. These IDs are used in the network to construct the input layers
func getFeatures(c *configuration, dict *corpus.Corpus) []int {
// logf("CONFIG: %v", c)
wordFeats := make([]int, 0)
posFeats := make([]lingo.POSTag, 0)
labelFeats := make([]lingo.DependencyType, 0)
unknownID, _ := dict.Id("-UNKNOWN-")
for j := 2; j >= 0; j-- {
index := c.stackValue(j)
mor := c.annotation(index)
if wordID, ok := dict.Id(mor.Value); ok {
wordFeats = append(wordFeats, wordID)
} else {
wordFeats = append(wordFeats, unknownID)
}
posFeats = append(posFeats, mor.POSTag)
}
// logf("wordFeats: %v", wordFeats)
for j := 0; j <= 2; j++ {
index := c.bufferValue(j)
mor := c.annotation(index)
// logf("Want: %v Index: %d. Morpheme: %v", j, index, mor)
if wordID, ok := dict.Id(mor.Value); ok {
wordFeats = append(wordFeats, wordID)
} else {
wordFeats = append(wordFeats, unknownID)
}
posFeats = append(posFeats, mor.POSTag)
}
// logf("wordFeats: %v", wordFeats)
for j := 0; j <= 1; j++ {
k := c.stackValue(j)
index := c.lc(k, 1)
mor := c.annotation(index)
if wordID, ok := dict.Id(mor.Value); ok {
wordFeats = append(wordFeats, wordID)
} else {
wordFeats = append(wordFeats, unknownID)
}
posFeats = append(posFeats, mor.POSTag)
labelFeats = append(labelFeats, c.label(index))
index = c.rc(k, 1)
mor = c.annotation(index)
if wordID, ok := dict.Id(mor.Value); ok {
wordFeats = append(wordFeats, wordID)
} else {
wordFeats = append(wordFeats, unknownID)
}
posFeats = append(posFeats, mor.POSTag)
labelFeats = append(labelFeats, c.label(index))
index = c.lc(k, 2)
mor = c.annotation(index)
if wordID, ok := dict.Id(mor.Value); ok {
wordFeats = append(wordFeats, wordID)
} else {
wordFeats = append(wordFeats, unknownID)
}
posFeats = append(posFeats, mor.POSTag)
labelFeats = append(labelFeats, c.label(index))
index = c.rc(k, 2)
mor = c.annotation(index)
if wordID, ok := dict.Id(mor.Value); ok {
wordFeats = append(wordFeats, wordID)
} else {
wordFeats = append(wordFeats, unknownID)
}
posFeats = append(posFeats, mor.POSTag)
labelFeats = append(labelFeats, c.label(index))
leftChild := c.lc(k, 1)
index = c.lc(leftChild, 1)
mor = c.annotation(index)
if wordID, ok := dict.Id(mor.Value); ok {
wordFeats = append(wordFeats, wordID)
} else {
wordFeats = append(wordFeats, unknownID)
}
posFeats = append(posFeats, mor.POSTag)
labelFeats = append(labelFeats, c.label(index))
rightChild := c.rc(k, 1)
index = c.rc(rightChild, 1)
mor = c.annotation(index)
if wordID, ok := dict.Id(mor.Value); ok {
wordFeats = append(wordFeats, wordID)
} else {
wordFeats = append(wordFeats, unknownID)
}
posFeats = append(posFeats, mor.POSTag)
labelFeats = append(labelFeats, c.label(index))
}
// the embedding matrix is arranged thus:
/*
POSTag0 0, 1, ... 50
POSTag1
...
MAXTAG-1
DepType0
DepType1
...
MAXDEPTYPE-1
WordID0
...
WordIDN
*/
features := make([]int, MAXFEATURE)
for i, w := range wordFeats {
features[i] = w + wordFeatsStartAt
}
for i, t := range posFeats {
features[i+POS_OFFSET] = int(t)
}
for i, l := range labelFeats {
features[i+DEP_OFFSET] = int(l) + labelFeatsStartAt
}
return features
}
const (
POS_OFFSET int = 18
DEP_OFFSET = 36
STACK_OFFSET = 6
STACK_NUMBER = 6
)
================================================
FILE: dep/features.go
================================================
package dep
import "github.com/chewxy/lingo"
// the features are used as columns in the matrix
// go:generate stringer type=feature -output=feature_string.go
type feature int
const (
// first 18 are word related features
// second 18 are POS related features
// last 12 are label related features
s0w feature = iota
s1w
s2w
b0w
b1w
b2w
s0l1w
s0r1w
s0l2w
s0r2w
s0llw
s0rrw
s1l1w
s1r1w
s1l2w
s1r2w
s1llw
s1rrw
// POS related words
s0t
s1t
s2t
b0t
b1t
b2t
s0l1t
s0r1t
s0l2t
s0r2t
s0llt
s0rrt
s1l1t
s1r1t
s1l2t
s1r2t
s1llt
s1rrt
// label related
s0l1d
s0r1d
s0l2d
s0r2d
s0lld
s0rrd
s1l1d
s1r1d
s1l2d
s1r2d
s1lld
s1rrd
MAXFEATURE
)
const (
wordFeatsStartAt int = int(lingo.MAXTAG) + int(lingo.MAXDEPTYPE)
labelFeatsStartAt = int(lingo.MAXTAG)
posFeatsStartAt = 0
)
================================================
FILE: dep/features_string.go
================================================
// generated by stringer -type=feature -output=features_string.go; DO NOT EDIT
package dep
import "fmt"
const _feature_name = "s0ws1ws2wb0wb1wb2ws0l1ws0r1ws0l2ws0r2ws0llws0rrws1l1ws1r1ws1l2ws1r2ws1llws1rrws0ts1ts2tb0tb1tb2ts0l1ts0r1ts0l2ts0r2ts0llts0rrts1l1ts1r1ts1l2ts1r2ts1llts1rrts0l1ds0r1ds0l2ds0r2ds0llds0rrds1l1ds1r1ds1l2ds1r2ds1llds1rrdMAXFEATURE"
var _feature_index = [...]uint8{0, 3, 6, 9, 12, 15, 18, 23, 28, 33, 38, 43, 48, 53, 58, 63, 68, 73, 78, 81, 84, 87, 90, 93, 96, 101, 106, 111, 116, 121, 126, 131, 136, 141, 146, 151, 156, 161, 166, 171, 176, 181, 186, 191, 196, 201, 206, 211, 216, 226}
func (i feature) String() string {
if i < 0 || i >= feature(len(_feature_index)-1) {
return fmt.Sprintf("feature(%d)", i)
}
return _feature_name[_feature_index[i]:_feature_index[i+1]]
}
================================================
FILE: dep/fix.go
================================================
package dep
import (
"log"
"github.com/chewxy/lingo"
)
// applies common fixes
func fix(d *lingo.Dependency) {
// NNP fix:
// If a sentence is [a, b, c, D, E, f, g]
// where D, E are NNPs, they should be compound words
// The head should be the one with higher headID
spans := properNounSpans(d)
for _, s := range spans {
// we don't care about single word proper nouns
if s.end-s.start <= 1 {
continue
}
phrase := d.AnnotatedSentence[s.start:s.end]
// pick up all compound roots
// find annotations that do not have compound as deptype
var compoundRoots lingo.AnnotationSet
var problematic lingo.AnnotationSet
for _, a := range phrase {
if lingo.IsCompound(a.DependencyType) {
compoundRoots = compoundRoots.Add(a.Head)
}
if !lingo.IsCompound(a.DependencyType) && a.ID != s.end-1 {
problematic = problematic.Add(a)
}
}
// if no root
if len(compoundRoots) == 0 {
// actual root is the word with the largest ID
var compoundRoot *lingo.Annotation
var rootRoot *lingo.Annotation
for last := -1; s.end+last >= s.start; last-- {
predictedRoot := s.end + last
compoundRoot = d.AnnotatedSentence[predictedRoot]
// incorrects :
// dep==Dep
// dep==Root && others has dep != root
if compoundRoot.DependencyType == lingo.Dep {
problematic = problematic.Add(compoundRoot)
continue
}
if compoundRoot.DependencyType != lingo.Dep && compoundRoot.DependencyType != lingo.Root {
break
}
if compoundRoot.DependencyType == lingo.Root {
rootRoot = compoundRoot
problematic = problematic.Add(compoundRoot)
}
}
if rootRoot != nil && rootRoot != compoundRoot {
// we have two potential roots. Choose the best
log.Println("Problem when fixing: more than one possible compound root found")
}
for _, a := range problematic {
if a == compoundRoot {
continue
}
tmpHead := a.Head
tmpRel := a.DependencyType
a.SetHead(compoundRoot)
a.DependencyType = lingo.Compound
for _, childID := range d.AnnotatedSentence.Children(a.ID) {
childA := d.AnnotatedSentence[childID]
childA.SetHead(tmpHead)
childA.DependencyType = tmpRel
}
}
}
// if more than one root...
logf("More than zero compound roots not handled yet")
}
// Number fix
}
func properNounSpans(d *lingo.Dependency) (retVal []span) {
start := -1
end := -1
for i, a := range d.AnnotatedSentence {
if lingo.IsProperNoun(a.POSTag) {
if start == -1 {
start = i
end = i + 1
} else {
end = i + 1
}
} else {
if end == -1 {
end = i
}
if start > -1 {
s := makeSpan(start, end)
retVal = append(retVal, s)
}
start = -1
end = -1
}
}
if start > -1 {
s := makeSpan(start, len(d.AnnotatedSentence))
retVal = append(retVal, s)
}
return
}
================================================
FILE: dep/init.go
================================================
package dep
import "github.com/chewxy/lingo/corpus"
func init() {
c := corpus.New()
c.Add("") // add null words
KnownWords = c
}
================================================
FILE: dep/models.go
================================================
package dep
import (
"bufio"
"bytes"
"encoding/gob"
"fmt"
"io"
"os"
"github.com/chewxy/lingo/corpus"
"github.com/pkg/errors"
"gorgonia.org/tensor"
)
// Model holds the neural network that a DependencyParser uses. To train, use a Trainer
type Model struct {
nn *neuralnetwork2
corpus *corpus.Corpus
ts []transition
}
func (m *Model) Corpus() *corpus.Corpus { return m.corpus }
func (m *Model) WordEmbeddings() *tensor.Dense {
val := m.nn.e_w.Value().(*tensor.Dense)
emb := val.Clone().(*tensor.Dense)
return emb
}
func (m *Model) POSTagEmbeddings() *tensor.Dense {
val := m.nn.e_t.Value().(*tensor.Dense)
emb := val.Clone().(*tensor.Dense)
return emb
}
func (m *Model) String() string {
var buf bytes.Buffer
buf.WriteString(m.nn.String())
buf.WriteString("Transitions: [")
for _, t := range m.ts {
fmt.Fprintf(&buf, "%v, ", t)
}
buf.WriteString("]")
return buf.String()
}
func (m *Model) Save(filename string) error {
if m.nn == nil {
return errors.Errorf("Cannot save a model with no nn")
}
f, err := os.Create(filename)
if err != nil {
return err
}
return m.SaveWriter(f)
}
func (m *Model) SaveWriter(f io.WriteCloser) error {
defer f.Close()
w := bufio.NewWriter(f)
defer w.Flush()
encoder := gob.NewEncoder(w)
if err := encoder.Encode(m.corpus); err != nil {
return err
}
if err := encoder.Encode(m.nn); err != nil {
return err
}
// if err := encoder.Encode(m.ts); err != nil {
// return err
// }
return nil
}
func Load(filename string) (*Model, error) {
f, err := os.Open(filename)
if err != nil {
return nil, err
}
return LoadReader(f)
}
func LoadReader(rd io.ReadCloser) (*Model, error) {
defer rd.Close()
r := bufio.NewReader(rd)
decoder := gob.NewDecoder(r)
m := new(Model)
if err := decoder.Decode(&m.corpus); err != nil {
return nil, err
}
m.nn = new(neuralnetwork2)
m.nn.dict = m.corpus
if err := decoder.Decode(&m.nn); err != nil {
return nil, err
}
if err := decoder.Decode(&m.ts); err != nil {
m.ts = transitions
}
m.nn.transitions = m.ts
return m, nil
}
================================================
FILE: dep/models_test.go
================================================
package dep
import (
"os"
"testing"
"github.com/stretchr/testify/assert"
G "gorgonia.org/gorgonia"
)
func TestModel_SaveLoad(t *testing.T) {
assert := assert.New(t)
testFileName := "TestSave.dat"
m := new(Model)
// dumb shit
if err := m.Save(testFileName); err == nil {
t.Error("Expected an error")
}
conf := DefaultNNConfig
conf.Dtype = G.Float32
m = new(Model)
m.ts = transitions
m.corpus = KnownWords
m.nn = new(neuralnetwork2)
m.nn.NNConfig = conf
m.nn.dict = m.corpus
if err := m.nn.init(); err != nil {
t.Error(err)
}
if err := m.Save(testFileName); err != nil {
t.Fatal(err)
}
var m2 *Model
var err error
if m2, err = Load(testFileName); err != nil {
t.Error(err)
}
assert.Equal(m.corpus, m2.corpus, "Both Dependency Parsers need to have the same dict")
if !G.ValueEq(m.nn.w2.Value(), m2.nn.w2.Value()) {
t.Errorf("Expected w2 to be equal")
}
if !G.ValueEq(m.nn.e_w.Value(), m2.nn.e_w.Value()) {
t.Errorf("Expected e_w to be equal")
}
// cleanup
if err := os.Remove(testFileName); err != nil {
t.Error(err)
}
}
================================================
FILE: dep/move.go
================================================
package dep
// Move is an action that the dependency parser can take - whether to Shift, Attach-Left, or AttachRight
type Move byte
//go:generate stringer -type=Move
const (
Shift Move = iota
Left
Right
MAXMOVE
)
// ALLMOVES is the set of all possible moves
var ALLMOVES = [...]Move{Left, Right, Shift}
================================================
FILE: dep/move_string.go
================================================
// generated by stringer -type=Move; DO NOT EDIT
package dep
import "fmt"
const _Move_name = "ShiftLeftRightMAXMOVE"
var _Move_index = [...]uint8{0, 5, 9, 14, 21}
func (i Move) String() string {
if i >= Move(len(_Move_index)-1) {
return fmt.Sprintf("Move(%d)", i)
}
return _Move_name[_Move_index[i]:_Move_index[i+1]]
}
================================================
FILE: dep/nn2.go
================================================
package dep
import (
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/corpus"
"github.com/pkg/errors"
G "gorgonia.org/gorgonia"
"gorgonia.org/tensor"
)
// may is a simple monad for handling errors
type may struct {
error
n *G.Node
}
func (m *may) doUnary(fn func(*G.Node) (*G.Node, error)) {
if m.error != nil {
return
}
m.n, m.error = fn(m.n)
}
func (m *may) doBinary(fn func(a, b *G.Node) (*G.Node, error), other *G.Node) {
if m.error != nil {
return
}
m.n, m.error = fn(m.n, other)
}
func (m *may) doSwapBinary(fn func(a, b *G.Node) (*G.Node, error), other *G.Node) {
if m.error != nil {
return
}
m.n, m.error = fn(other, m.n)
}
type neuralnetwork2 struct {
NNConfig
g *G.ExprGraph
sub *G.ExprGraph
// model
// embedding matrices for word, POSTags and labels respectively
e_w *G.Node // Shape: (EmbeddingSize, DictSize)
e_t *G.Node // Shape: (EmbeddingSize, lingo.MAXTAG)
e_l *G.Node // Shape: (EmbeddingSize, lingo.MAXDEP)
// w1
w1_w *G.Node // Shape: (HiddenSize, DictSize)
w1_t *G.Node // Shape: (HiddenSize, lingo.MAXTAG)
w1_l *G.Node // Shape: (HiddenSize, lingo.MAXDEP)
b *G.Node // Shape: (HiddenSize)
// w2
w2 *G.Node // Shape: (MAXTRANSITION, HiddenSize)
// selects
x_wSelW G.Nodes // 18 - word features
x_tSelT G.Nodes // 18 - POSTag features
x_lSelL G.Nodes // 12 - Dependency feature
// inputs (feature vectors built up from the selects)
x_w *G.Node
x_t *G.Node
x_l *G.Node
// outputs
scores *G.Node // argmax this to get the greedy decoded transition
logProb *G.Node
cost *G.Node
costVal G.Value
vm G.VM
model G.Nodes
solver G.Solver
dict *corpus.Corpus
transitions []transition
costChan chan G.Value
// wordfeats *G.Node
// tagfeats *G.Node
// depfeats *G.Node
// sumfeats *G.Node
// act *G.Node
}
func (nn *neuralnetwork2) initialized() bool {
return nn.g != nil && nn.sub != nil &&
nn.e_w != nil && nn.e_t != nil && nn.e_l != nil &&
nn.w1_w != nil && nn.w1_t != nil && nn.w1_l != nil && nn.b != nil &&
nn.w2 != nil && len(nn.x_wSelW) > 0 && len(nn.x_tSelT) > 0 && len(nn.x_lSelL) > 0 &&
nn.x_w != nil && nn.x_t != nil && nn.x_l != nil &&
nn.scores != nil &&
nn.dict != nil && nn.vm != nil && nn.solver != nil
}
func (nn *neuralnetwork2) init() error {
if nn.dict == nil {
return errors.Errorf("No Corpus Provided to the Neural Network. Will be unable to decode")
}
g := G.NewGraph()
nn.g = g
word := nn.dict.Size()
tags := int(lingo.MAXTAG)
deps := int(lingo.MAXDEPTYPE)
// trns := len(nn.transitions)
wordFeats := POS_OFFSET - 0
tagFeats := DEP_OFFSET - POS_OFFSET
depFeats := int(MAXFEATURE) - DEP_OFFSET
// In any case a very very very small dict was passed in
// we set the minimum to wordFeatss
if word < wordFeats {
word = wordFeats
}
logf(`Word: %d
tags: %d
deps: %d
wordFeats: %d
tagFeats: %d
depFeats: %d
`, word, tags, deps, wordFeats, tagFeats, depFeats)
// define inputs
nn.x_w = G.NewVector(g, nn.Dtype, G.WithShape(wordFeats*nn.EmbeddingSize), G.WithName("word input"), G.WithInit(G.Zeroes()))
nn.x_t = G.NewVector(g, nn.Dtype, G.WithShape(tagFeats*nn.EmbeddingSize), G.WithName("POSTag input"), G.WithInit(G.Zeroes()))
nn.x_l = G.NewVector(g, nn.Dtype, G.WithShape(depFeats*nn.EmbeddingSize), G.WithName("word input"), G.WithInit(G.Zeroes()))
nn.x_wSelW = make(G.Nodes, wordFeats)
nn.x_tSelT = make(G.Nodes, tagFeats)
nn.x_lSelL = make(G.Nodes, depFeats)
// define models
nn.e_w = G.NewMatrix(g, nn.Dtype, G.WithShape(word, nn.EmbeddingSize), G.WithName("e_w"), G.WithInit(G.GlorotU(1)))
nn.e_t = G.NewMatrix(g, nn.Dtype, G.WithShape(tags, nn.EmbeddingSize), G.WithName("e_t"), G.WithInit(G.GlorotU(1)))
nn.e_l = G.NewMatrix(g, nn.Dtype, G.WithShape(deps, nn.EmbeddingSize), G.WithName("e_l"), G.WithInit(G.GlorotU(1)))
nn.w1_w = G.NewMatrix(g, nn.Dtype, G.WithShape(nn.HiddenSize, nn.EmbeddingSize*wordFeats), G.WithName("w1_w"), G.WithInit(G.GlorotU(1)))
nn.w1_t = G.NewMatrix(g, nn.Dtype, G.WithShape(nn.HiddenSize, nn.EmbeddingSize*tagFeats), G.WithName("w1_t"), G.WithInit(G.GlorotU(1)))
nn.w1_l = G.NewMatrix(g, nn.Dtype, G.WithShape(nn.HiddenSize, nn.EmbeddingSize*depFeats), G.WithName("w1_l"), G.WithInit(G.GlorotU(1)))
nn.b = G.NewVector(g, nn.Dtype, G.WithShape(nn.HiddenSize), G.WithName("b"), G.WithInit(G.Zeroes()))
nn.w2 = G.NewMatrix(g, nn.Dtype, G.WithShape(MAXTRANSITION, nn.HiddenSize), G.WithName("w2"), G.WithInit(G.GlorotU(1)))
nn.model = G.Nodes{nn.e_w, nn.e_t, nn.e_l, nn.w1_w, nn.w1_t, nn.w1_l, nn.b, nn.w2}
// define selects
// words first
logf("nn.e_w: %+1.1s", nn.e_w.Value())
var err error
for i := 0; i < wordFeats; i++ {
if nn.x_wSelW[i], err = G.Slice(nn.e_w, G.S(i)); err != nil { // dummy slices... they'll be replaced at runtime
return err
}
}
// tag features
for i := 0; i < tagFeats; i++ {
if nn.x_tSelT[i], err = G.Slice(nn.e_t, G.S(i)); err != nil { // dummy slices... they'll be replaced at runtime
return err
}
}
// dependency features
for i := 0; i < depFeats; i++ {
if nn.x_lSelL[i], err = G.Slice(nn.e_l, G.S(i)); err != nil {
return err
}
}
// forwards
if err = nn.fwd(); err != nil {
return err
}
// backprop
if _, err = G.Grad(nn.cost, nn.model...); err != nil {
return err
}
nn.sub = g.SubgraphRoots(nn.scores)
// prog, locmap, err := G.Compile(nn.g)
// if err != nil {
// return err
// }
// log.Printf("Prog: %v", prog)
// ioutil.WriteFile("graph.dot", []byte(g.ToDot()), 0644)
// logger := log.New(os.Stderr, "", 0)
// nn.vm = G.NewTapeMachine(prog, locmap, G.BindDualValues(nn.model...), G.UseCudaFor(), G.WithLogger(logger), G.WithWatchlist())
// nn.vm = G.NewTapeMachine(prog, locmap, G.BindDualValues(nn.model...), G.UseCudaFor())
nn.vm = G.NewTapeMachine(nn.g, G.BindDualValues(nn.model...), G.UseCudaFor())
G.BindDualValues(nn.scores)(nn.vm) // makes sure that scores is a *dualValue
nn.solver = G.NewAdaGradSolver(G.WithLearnRate(nn.AdaAlpha), G.WithEps(nn.AdaEps), G.WithL2Reg(nn.Reg), G.WithBatchSize(float64(nn.BatchSize)))
// nn.solver = G.NewVanillaSolver(G.WithLearnRate(nn.AdaAlpha), G.WithL2Reg(nn.Reg))
return nil
}
func (nn *neuralnetwork2) fwd() error {
var err error
// build up x vectors
if nn.x_w, err = G.Concat(0, nn.x_wSelW...); err != nil {
return err
}
if nn.x_t, err = G.Concat(0, nn.x_tSelT...); err != nil {
return err
}
if nn.x_l, err = G.Concat(0, nn.x_lSelL...); err != nil {
return err
}
logf("w1_w %v, x_w %v", nn.w1_w.Shape(), nn.x_w.Shape())
m_w := &may{nil, nn.w1_w}
m_w.doBinary(G.Mul, nn.x_w)
if m_w.error != nil {
return m_w.error
}
logf("w1_t %v, x_t %v", nn.w1_t.Shape(), nn.x_t.Shape())
m_t := &may{nil, nn.w1_t}
m_t.doBinary(G.Mul, nn.x_t)
if m_t.error != nil {
return m_t.error
}
logf("w1_l %v, x_l %v", nn.w1_l.Shape(), nn.x_l.Shape())
m_l := &may{nil, nn.w1_l}
m_l.doBinary(G.Mul, nn.x_l)
if m_l.error != nil {
return m_l.error
}
// add and activate layer 1
logf("w : %v", m_w.n.Shape())
m_w1 := &may{nil, m_w.n}
m_w1.doBinary(G.Add, m_t.n)
m_w1.doBinary(G.Add, m_l.n)
m_w1.doBinary(G.Add, nn.b)
m_w1.doUnary(G.Cube)
if m_w1.error != nil {
return m_w1.error
}
if nn.Dropout > 0 {
logf("Doing dropout")
m_w1.n, m_w1.error = G.Dropout(m_w1.n, nn.Dropout)
if m_w1.error != nil {
return m_w1.error
}
}
// go to softmax layer
logf("w2: %v, w1act: %v", nn.w2.Shape(), m_w1.n.Shape())
m_sm := &may{nil, nn.w2}
m_sm.doBinary(G.Mul, m_w1.n)
nn.scores = m_sm.n
m_sm.doUnary(G.SoftMax)
if m_sm.error != nil {
return m_sm
}
nn.logProb = m_sm.n
// G.WithName("Logprob")(nn.logProb)
// log.Printf("LOGPROB %v %p %v", nn.logProb, nn.logProb, nn.logProb)
if nn.cost, err = G.Slice(nn.logProb, G.S(0)); err != nil { // slice is a dummy tensor.Slice. It'll be replaced at runtime
return err
}
G.Read(nn.cost, &nn.costVal)
return nil
}
func (nn *neuralnetwork2) costProgress() <-chan G.Value {
if nn.costChan == nil {
nn.costChan = make(chan G.Value)
}
return nn.costChan
}
// train does one epoch of training. The examples are batched.
func (nn *neuralnetwork2) train(examples []example) error {
size := len(examples)
batches := size / nn.BatchSize
var start, end int
if nn.BatchSize > size {
batches = 1
end = size
G.WithBatchSize(float64(size))(nn.solver) // set it such that the solver doesn't get confused
} else {
end = nn.BatchSize
}
for batch := 0; batch < batches; batch++ {
for _, ex := range examples[start:end] {
nn.feats2vec(ex.features)
tid := lookupTransition(ex.transition, nn.transitions)
if err := G.UnsafeLet(nn.cost, G.S(tid)); err != nil {
return err
}
if err := nn.vm.RunAll(); err != nil {
return err
}
nn.vm.Reset()
}
if err := nn.solver.Step(G.NodesToValueGrads(nn.model)); err != nil {
err = errors.Wrapf(err, "Stepping on the model failed %v", batch)
return err
}
if nn.costChan != nil {
nn.costChan <- nn.costVal
}
start = end
if start >= size {
break
}
end += nn.BatchSize
if end >= size {
end = size
}
}
return nil
}
// pred predicts the index of the transitions
func (nn *neuralnetwork2) pred(ind []int) (int, error) {
nn.feats2vec(ind)
// f, _ := os.OpenFile("LOOOOOG", os.O_APPEND|os.O_CREATE|os.O_RDWR, 0644)
// logger := log.New(f, "", 0)
// logger := log.New(os.Stderr, "", 0)
// m := G.NewLispMachine(nn.sub, G.ExecuteFwdOnly(), G.WithLogger(logger), G.WithWatchlist(), G.LogBothDir(), G.WithValueFmt("%+3.3v"))
m := G.NewLispMachine(nn.sub, G.ExecuteFwdOnly())
if err := m.RunAll(); err != nil {
return 0, err
}
// logger.Println("========================\n")
val := nn.scores.Value().(tensor.Tensor)
t, err := tensor.Argmax(val, tensor.AllAxes)
if err != nil {
return 0, err
}
return t.ScalarValue().(int), nil
}
// utility function
func (nn *neuralnetwork2) feats2vec(indicators []int) error {
// fix word features
for i, ind := range indicators[:POS_OFFSET] {
if err := G.UnsafeLet(nn.x_wSelW[i], G.S(ind-wordFeatsStartAt)); err != nil {
return err
}
}
// fix tag features
for i, ind := range indicators[POS_OFFSET:DEP_OFFSET] {
if err := G.UnsafeLet(nn.x_tSelT[i], G.S(ind)); err != nil {
return err
}
}
for i, ind := range indicators[DEP_OFFSET:] {
if err := G.UnsafeLet(nn.x_lSelL[i], G.S(ind-labelFeatsStartAt)); err != nil {
return err
}
}
return nil
}
================================================
FILE: dep/nn2_io.go
================================================
package dep
import (
"bytes"
"encoding/gob"
"fmt"
"github.com/pkg/errors"
G "gorgonia.org/gorgonia"
T "gorgonia.org/tensor"
)
var empty struct{}
func (nn *neuralnetwork2) String() string {
s := `Config
------
%v
Info
------
Embeddings_Word : %v
Embeddings_POStag : %v
Embeddings_Dependency : %v
Selects_Words : %d
Selects_POSTag : %d
Selects_Dependency : %d
Weights1_Word : %v
Weights1_POSTag : %v
Weights1_Dependency : %v
Biases : %v
Weights2 : %v
`
return fmt.Sprintf(s, nn.NNConfig,
nn.e_w.Shape(), nn.e_t.Shape(), nn.e_l.Shape(),
len(nn.x_wSelW), len(nn.x_tSelT), len(nn.x_lSelL),
nn.w1_w.Shape(), nn.w1_t.Shape(), nn.w1_l.Shape(),
nn.b.Shape(), nn.w2.Shape())
}
func (nn *neuralnetwork2) GobEncode() ([]byte, error) {
if !nn.initialized() {
return nil, errors.Errorf("Neural network not initialized. Cannot gob")
}
var buf bytes.Buffer
encoder := gob.NewEncoder(&buf)
if err := encoder.Encode(nn.NNConfig); err != nil {
return nil, err
}
if err := encoder.Encode(nn.e_w.Value()); err != nil {
return nil, err
}
if err := encoder.Encode(nn.e_t.Value()); err != nil {
return nil, err
}
if err := encoder.Encode(nn.e_l.Value()); err != nil {
return nil, err
}
if err := encoder.Encode(nn.w1_w.Value()); err != nil {
return nil, err
}
if err := encoder.Encode(nn.w1_t.Value()); err != nil {
return nil, err
}
if err := encoder.Encode(nn.w1_l.Value()); err != nil {
return nil, err
}
if err := encoder.Encode(nn.b.Value()); err != nil {
return nil, err
}
if err := encoder.Encode(nn.w2.Value()); err != nil {
return nil, err
}
return buf.Bytes(), nil
}
func (nn *neuralnetwork2) GobDecode(buf []byte) error {
// prechecks
if nn.dict == nil {
return errors.Errorf("Neural Network has no corpus attached to it (Corpuses are serialized separately).")
}
b := bytes.NewBuffer(buf)
decoder := gob.NewDecoder(b)
if err := decoder.Decode(&nn.NNConfig); err != nil {
return err
}
if err := nn.init(); err != nil {
return err
}
e_w := T.New(T.Of(nn.Dtype), T.WithShape(nn.e_w.Shape()...))
if err := decoder.Decode(e_w); err != nil {
return err
}
G.Let(nn.e_w, e_w)
e_t := T.New(T.Of(nn.Dtype), T.WithShape(nn.e_t.Shape()...))
if err := decoder.Decode(e_t); err != nil {
return err
}
G.Let(nn.e_t, e_t)
e_l := T.New(T.Of(nn.Dtype), T.WithShape(nn.e_l.Shape()...))
if err := decoder.Decode(e_l); err != nil {
return err
}
G.Let(nn.e_l, e_l)
w1_w := T.New(T.Of(nn.Dtype), T.WithShape(nn.w1_w.Shape()...))
if err := decoder.Decode(w1_w); err != nil {
return err
}
G.Let(nn.w1_w, w1_w)
w1_t := T.New(T.Of(nn.Dtype), T.WithShape(nn.w1_t.Shape()...))
if err := decoder.Decode(w1_t); err != nil {
return err
}
G.Let(nn.w1_t, w1_t)
w1_l := T.New(T.Of(nn.Dtype), T.WithShape(nn.w1_l.Shape()...))
if err := decoder.Decode(w1_l); err != nil {
return err
}
G.Let(nn.w1_l, w1_l)
bias := T.New(T.Of(nn.Dtype), T.WithShape(nn.b.Shape()...))
if err := decoder.Decode(bias); err != nil {
return err
}
G.Let(nn.b, bias)
w2 := T.New(T.Of(nn.Dtype), T.WithShape(nn.w2.Shape()...))
if err := decoder.Decode(w2); err != nil {
return err
}
G.Let(nn.w2, w2)
return nil
}
================================================
FILE: dep/nn2_io_test.go
================================================
package dep
import (
"bytes"
"encoding/gob"
"fmt"
"testing"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/corpus"
G "gorgonia.org/gorgonia"
)
func TestNNIO(t *testing.T) {
sts := allSentences()
nn := new(neuralnetwork2)
nn.NNConfig = DefaultNNConfig
nn.dict = corpus.GenerateCorpus(sts)
nn.transitions = transitions
if err := nn.init(); err != nil {
t.Fatalf("%+v", err)
}
s := `Config
------
Batch Size : 10000
Dropout Rate : 0.500000
AdaGrad Eps (ε) : 0.000001
AdaGrad Learn Rate (η) : 0.010000
Regularization Parameter : 0.000002
Hidden Layer Size : 200
Embedding Size : 50
Number Precomputed : 30000
Evaluate Per 100 Iterations
Clear Gradients Per 0 Iterations
Dtype: float64
Info
------
Embeddings_Word : (74, 50)
Embeddings_POStag : (%d, 50)
Embeddings_Dependency : (%d, 50)
Selects_Words : 18
Selects_POSTag : 18
Selects_Dependency : 12
Weights1_Word : (200, 900)
Weights1_POSTag : (200, 900)
Weights1_Dependency : (200, 600)
Biases : (200)
Weights2 : (%d, 200)
`
correctDesc := fmt.Sprintf(s, lingo.MAXTAG, lingo.MAXDEPTYPE, MAXTRANSITION)
if nn.String() != correctDesc {
t.Errorf("Oops. Got %q. Want %q", nn.String(), correctDesc)
}
// nn.Dtype = tensor.Float32
var buf bytes.Buffer
encoder := gob.NewEncoder(&buf)
if err := encoder.Encode(nn); err != nil {
t.Fatalf("%+v", err)
}
decoder := gob.NewDecoder(&buf)
nn2 := new(neuralnetwork2)
nn2.dict = corpus.GenerateCorpus(sts)
nn2.transitions = transitions
if err := decoder.Decode(nn2); err != nil {
t.Fatal(err)
}
if nn.String() != correctDesc {
t.Fatalf("Oops. Got %q. Want %q", nn.String(), correctDesc)
}
if !G.ValueEq(nn.e_w.Value(), nn2.e_w.Value()) {
t.Errorf("Expected e_w to be the same. Expected %1.1s. Got %1.1s", nn.e_w.Value(), nn2.e_w.Value())
}
if !G.ValueEq(nn.e_t.Value(), nn2.e_t.Value()) {
t.Errorf("Expected e_t to be the same. Expected %1.1s. Got %1.1s", nn.e_t.Value(), nn2.e_t.Value())
}
if !G.ValueEq(nn.e_l.Value(), nn2.e_l.Value()) {
t.Errorf("Expected e_l to be the same. Expected %1.1s. Got %1.1s", nn.e_l.Value(), nn2.e_l.Value())
}
if !G.ValueEq(nn.w1_w.Value(), nn2.w1_w.Value()) {
t.Errorf("Expected w1_w to be the same. Expected %1.1s. Got %1.1s", nn.w1_w.Value(), nn2.w1_w.Value())
}
if !G.ValueEq(nn.w1_t.Value(), nn2.w1_t.Value()) {
t.Errorf("Expected w1_t to be the same. Expected %1.1s. Got %1.1s", nn.w1_t.Value(), nn2.w1_t.Value())
}
if !G.ValueEq(nn.w1_l.Value(), nn2.w1_l.Value()) {
t.Errorf("Expected w1_l to be the same. Expected %1.1s. Got %1.1s", nn.w1_l.Value(), nn2.w1_l.Value())
}
if !G.ValueEq(nn.b.Value(), nn2.b.Value()) {
t.Errorf("Expected b to be the same. Expected %1.1s. Got %1.1s", nn.b.Value(), nn2.b.Value())
}
if !G.ValueEq(nn.w2.Value(), nn2.w2.Value()) {
t.Errorf("Expected w2 to be the same. Expected %1.1s. Got %1.1s", nn.w2.Value(), nn2.w2.Value())
}
t.Logf("Visual Inspection: \n%+1.8s\n%+1.8s", nn.e_w.Value(), nn2.e_w.Value())
// special case
buf.Reset()
encoder = gob.NewEncoder(&buf)
if err := encoder.Encode(nn); err != nil {
t.Fatalf("%+v", err)
}
decoder = gob.NewDecoder(&buf)
nn3 := new(neuralnetwork2)
if err := decoder.Decode(nn3); err == nil {
t.Error("Expected a nocorpus error")
}
}
================================================
FILE: dep/nn2_test.go
================================================
package dep
import (
"math/rand"
"testing"
"time"
"github.com/chewxy/lingo/corpus"
"gorgonia.org/gorgonia"
)
func TestNN2(t *testing.T) {
rand.Seed(1337)
// we test 50 iterations unless the short flag is passed in
epochs := 50
if testing.Short() {
epochs = 10
}
sts := allSentences()
nn := new(neuralnetwork2)
nn.NNConfig = DefaultNNConfig
nn.Dtype = gorgonia.Float32
nn.dict = corpus.GenerateCorpus(sts)
nn.transitions = transitions
if err := nn.init(); err != nil {
t.Fatalf("%+v", err)
}
var costs []float64
ch := nn.costProgress()
sigChan := make(chan struct{})
go func(ch <-chan gorgonia.Value, sig chan struct{}) {
for cost := range ch {
switch c := cost.Data().(type) {
case float32:
costs = append(costs, float64(c))
case float64:
costs = append(costs, c)
}
t.Logf("Cost %v", cost)
}
sig <- struct{}{}
}(ch, sigChan)
exs := makeExamples(sts, nn.NNConfig, nn.dict, transitions, dummyFix{})
start := time.Now()
for i := 0; i < epochs; i++ {
if err := nn.train(exs); err != nil {
t.Errorf("%+v", err)
}
shuffleExamples(exs)
}
// simulate what *DependencyParser would do
close(nn.costChan)
nn.costChan = nil
t.Logf("Training %d iterations took Taken: %v", epochs, time.Since(start))
<-sigChan
if len(costs) == 0 {
t.Error("Expected some costs")
}
if costs[0] <= costs[len(costs)-1] {
t.Error("Expected costs to have reduced during training")
}
// PREDICTION TIME!
ss2 := simpleSentence()
exs = makeExamples(ss2, nn.NNConfig, nn.dict, transitions, dummyFix{})
start = time.Now()
for i, ex := range exs {
ind, err := nn.pred(ex.features)
if err != nil {
t.Errorf("Example %d failed: %v", i, err)
continue
}
t.Logf("Example %d. Want: %v. Got %v. Same: %t", i, ex.transition, transitions[ind], ex.transition == transitions[ind])
}
t.Logf("Pred Time Taken: %v", time.Since(start))
}
================================================
FILE: dep/nnconfig.go
================================================
package dep
import (
"bytes"
"encoding/gob"
"fmt"
"github.com/pkg/errors"
"gorgonia.org/tensor"
)
// NNConfig configures the neural network
type NNConfig struct {
BatchSize int // 10000
Dropout float64 // 0.5
AdaEps float64 // 1e-6
AdaAlpha float64 //0.02
Reg float64 // 1e-8
HiddenSize int // 200
EmbeddingSize int // 50
NumPrecomputed int //100000
EvalPerIteration int // 100
ClearGradientsPerIteration int // 0
Dtype tensor.Dtype
}
func (c NNConfig) String() string {
s := `Batch Size : %d
Dropout Rate : %f
AdaGrad Eps (ε) : %f
AdaGrad Learn Rate (η) : %f
Regularization Parameter : %f
Hidden Layer Size : %d
Embedding Size : %d
Number Precomputed : %d
Evaluate Per %d Iterations
Clear Gradients Per %d Iterations
Dtype: %v
`
return fmt.Sprintf(s, c.BatchSize, c.Dropout, c.AdaEps, c.AdaAlpha, c.Reg, c.HiddenSize, c.EmbeddingSize, c.NumPrecomputed, c.EvalPerIteration, c.ClearGradientsPerIteration, c.Dtype)
}
// DefaultNNConfig is the default config that is passed in, for initialization purposses.
var DefaultNNConfig NNConfig
func (c NNConfig) GobEncode() ([]byte, error) {
var buf bytes.Buffer
encoder := gob.NewEncoder(&buf)
encoder.Encode(c.BatchSize)
encoder.Encode(c.Dropout)
encoder.Encode(c.AdaEps)
encoder.Encode(c.AdaAlpha)
encoder.Encode(c.Reg)
encoder.Encode(c.HiddenSize)
encoder.Encode(c.EmbeddingSize)
encoder.Encode(c.NumPrecomputed)
encoder.Encode(c.EvalPerIteration)
encoder.Encode(c.ClearGradientsPerIteration)
switch c.Dtype {
case tensor.Float64:
encoder.Encode(byte(0))
case tensor.Float32:
encoder.Encode(byte(1))
default:
return nil, errors.Errorf("Unsupported Dtype to be GobEncoded")
}
return buf.Bytes(), nil
}
func (c *NNConfig) GobDecode(p []byte) error {
b := bytes.NewBuffer(p)
decoder := gob.NewDecoder(b)
decoder.Decode(&c.BatchSize)
decoder.Decode(&c.Dropout)
decoder.Decode(&c.AdaEps)
decoder.Decode(&c.AdaAlpha)
decoder.Decode(&c.Reg)
decoder.Decode(&c.HiddenSize)
decoder.Decode(&c.EmbeddingSize)
decoder.Decode(&c.NumPrecomputed)
decoder.Decode(&c.EvalPerIteration)
decoder.Decode(&c.ClearGradientsPerIteration)
var bite byte
decoder.Decode(&bite)
switch bite {
case 0:
c.Dtype = tensor.Float64
case 1:
c.Dtype = tensor.Float32
default:
return errors.Errorf("Unsupported Dtype to be GobDecoded: %v", bite)
}
return nil
}
func init() {
DefaultNNConfig = NNConfig{
BatchSize: 10000,
Dropout: 0.5,
AdaEps: 1e-6,
AdaAlpha: 0.01,
Reg: 1.5e-6,
HiddenSize: 200,
EmbeddingSize: 50,
NumPrecomputed: 30000,
EvalPerIteration: 100,
ClearGradientsPerIteration: 0,
Dtype: tensor.Float64,
// Dtype: gorgonia.Float32,
}
}
================================================
FILE: dep/release.go
================================================
// +build !debug
package dep
const BUILD_DEBUG = "PARSER: RELEASE BUILD"
const BUILD_DIAG = "Non-Diagnostic Build"
const DEBUG = false
var READMEMSTATS = false
var TABCOUNT uint32 = 0
func enterLoggingContext() {}
func leaveLoggingContext() {}
func logTrainingProgress(iteration, correct, total, length, possibles int) {}
func logMemStats() {}
func logf(format string, others ...interface{}) {}
func recoverFrom(format string, attrs ...interface{}) {}
func (d *Parser) SprintFeatures(feature []int) string { return "" }
func SprintScores(scores []float64, ts []transition) string { return "" }
================================================
FILE: dep/span.go
================================================
package dep
type span struct {
start, end int
}
func makeSpan(start, end int) span {
if end <= start {
panic("Impossible span created")
}
return span{start, end}
}
func (s span) combine(other span) span {
start := minInt(s.start, other.start)
end := maxInt(s.end, other.end)
return span{start, end}
}
================================================
FILE: dep/test_test.go
================================================
package dep
import (
"bufio"
"crypto/md5"
"encoding/gob"
"fmt"
"io"
"log"
"os"
"strings"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/treebank"
"github.com/kljensen/snowball"
)
type dummyLem struct{}
func (dummyLem) Lemmatize(s string, pt lingo.POSTag) ([]string, error) {
return nil, componentUnavailable("lemmatizer")
}
type dummyStemmer struct{}
func (dummyStemmer) Stem(s string) (string, error) {
return snowball.Stem(s, "english", true)
}
type dummyFix struct {
dummyStemmer
dummyLem
}
func (dummyFix) Clusters() (map[string]lingo.Cluster, error) {
return nil, componentUnavailable("clusters")
}
const nnps = `1 Guerrillas guerrilla NOUN NNS Number=Plur 2 nsubj _ _
2 threatened threaten VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 0 root _ _
3 to to PART TO _ 4 mark _ _
4 assassinate assassinate VERB VB VerbForm=Inf 2 xcomp _ _
5 Prime Prime PROPN NNP Number=Sing 6 compound _ _
6 Minister Minister PROPN NNP Number=Sing 8 compound _ _
7 Iyad Iyad PROPN NNP Number=Sing 8 compound _ _
8 Allawi Allawi PROPN NNP Number=Sing 4 dobj _ _
9 and and CONJ CC _ 8 cc _ _
10 Minister Minister PROPN NNP Number=Sing 14 compound _ _
11 of of ADP IN _ 12 case _ _
12 Defense Defense PROPN NNP Number=Sing 10 nmod _ _
13 Hazem Hazem PROPN NNP Number=Sing 14 compound _ _
14 Shaalan Shaalan PROPN NNP Number=Sing 8 conj _ _
15 in in ADP IN _ 16 case _ _
16 retaliation retaliation NOUN NN Number=Sing 4 nmod _ _
17 for for ADP IN _ 19 case _ _
18 the the DET DT Definite=Def|PronType=Art 19 det _ _
19 attack attack NOUN NN Number=Sing 16 nmod _ _
20 . . PUNCT . _ 2 punct _ _
`
const simple = `1 Yet yet CONJ CC _ 5 cc _ _
2 we we PRON PRP Case=Nom|Number=Plur|Person=1|PronType=Prs 5 nsubj _ _
3 did do AUX VBD Mood=Ind|Tense=Past|VerbForm=Fin 5 aux _ _
4 n't not PART RB _ 5 neg _ _
5 charge charge VERB VB VerbForm=Inf 0 root _ _
6 them they PRON PRP Case=Acc|Number=Plur|Person=3|PronType=Prs 5 dobj _ _
7 for for ADP IN _ 9 case _ _
8 the the DET DT Definite=Def|PronType=Art 9 det _ _
9 evacuation evacuation NOUN NN Number=Sing 5 nmod _ _
10 . . PUNCT . _ 5 punct _ _
`
const med = `1 President President PROPN NNP Number=Sing 2 compound _ _
2 Bush Bush PROPN NNP Number=Sing 5 nsubj _ _
3 on on ADP IN _ 4 case _ _
4 Tuesday Tuesday PROPN NNP Number=Sing 5 nmod _ _
5 nominated nominate VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 0 root _ _
6 two two NUM CD NumType=Card 7 nummod _ _
7 individuals individual NOUN NNS Number=Plur 5 dobj _ _
8 to to PART TO _ 9 mark _ _
9 replace replace VERB VB VerbForm=Inf 5 advcl _ _
10 retiring retire VERB VBG VerbForm=Ger 11 amod _ _
11 jurists jurist NOUN NNS Number=Plur 9 dobj _ _
12 on on ADP IN _ 14 case _ _
13 federal federal ADJ JJ Degree=Pos 14 amod _ _
14 courts court NOUN NNS Number=Plur 11 nmod _ _
15 in in ADP IN _ 18 case _ _
16 the the DET DT Definite=Def|PronType=Art 18 det _ _
17 Washington Washington PROPN NNP Number=Sing 18 compound _ _
18 area area NOUN NN Number=Sing 14 nmod _ _
19 . . PUNCT . _ 5 punct _ _
`
const long = `1 Now now ADV RB _ 5 advmod _ _
2 , , PUNCT , _ 5 punct _ _
3 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 5 nsubj _ _
4 would would AUX MD VerbForm=Fin 5 aux _ _
5 argue argue VERB VB VerbForm=Inf 0 root _ _
6 that that SCONJ IN _ 11 mark _ _
7 one one PRON PRP _ 11 nsubj _ _
8 could could AUX MD VerbForm=Fin 11 aux _ _
9 have have AUX VB VerbForm=Inf 11 aux _ _
10 reasonably reasonably ADV RB _ 11 advmod _ _
11 predicted predict VERB VBN Tense=Past|VerbForm=Part 5 ccomp _ _
12 that that SCONJ IN _ 19 mark _ _
13 some some DET DT _ 14 det _ _
14 form form NOUN NN Number=Sing 19 nsubj _ _
15 of of ADP IN _ 17 case _ _
16 military military ADJ JJ Degree=Pos 17 amod _ _
17 violence violence NOUN NN Number=Sing 14 nmod _ _
18 was be VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 19 cop _ _
19 likely likely ADJ JJ Degree=Pos 11 ccomp _ _
20 to to PART TO _ 21 mark _ _
21 occur occur VERB VB VerbForm=Inf 19 xcomp _ _
22 in in ADP IN _ 23 case _ _
23 Lebanon Lebanon PROPN NNP Number=Sing 21 nmod _ _
24 -LRB- -lrb- PUNCT -LRB- _ 25 punct _ _
25 considering consider VERB VBG VerbForm=Ger 19 advcl _ _
26 that that SCONJ IN _ 31 mark _ _
27 the the DET DT Definite=Def|PronType=Art 28 det _ _
28 country country NOUN NN Number=Sing 31 nsubj _ _
29 has have AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 31 aux _ _
30 been be AUX VBN Tense=Past|VerbForm=Part 31 aux _ _
31 experiencing experience VERB VBG Tense=Pres|VerbForm=Part 25 ccomp _ _
32 some some DET DT _ 33 det _ _
33 form form NOUN NN Number=Sing 31 dobj _ _
34 of of ADP IN _ 35 case _ _
35 conflict conflict NOUN NN Number=Sing 33 nmod _ _
36 for for ADP IN _ 41 case _ _
37 approximately approximately ADV RB _ 41 advmod _ _
38 the the DET DT Definite=Def|PronType=Art 41 det _ _
39 last last ADJ JJ Degree=Pos 41 amod _ _
40 32 32 NUM CD NumType=Card 41 nummod _ _
41 years year NOUN NNS Number=Plur 31 nmod _ _
42 -RRB- -rrb- PUNCT -RRB- _ 25 punct _ _
43 . . PUNCT . _ 5 punct _ _
`
const cvconllu = `1 Google Google PROPN NNP Number=Sing 6 nsubj _ _
2 is be VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 6 cop _ _
3 a a DET DT Definite=Ind|PronType=Art 6 det _ _
4 nice nice ADJ JJ Degree=Pos 6 amod _ _
5 search search NOUN NN Number=Sing 6 compound _ _
6 engine engine NOUN NN Number=Sing 0 root _ _
7 . . PUNCT . _ 6 punct _ _
1 Does do AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 aux _ _
2 anybody anybody NOUN NN Number=Sing 3 nsubj _ _
3 use use VERB VB VerbForm=Inf 0 root _ _
4 it it PRON PRP Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs 3 dobj _ _
5 for for ADP IN _ 6 case _ _
6 anything anything NOUN NN Number=Sing 3 nmod _ _
7 else else ADJ JJ Degree=Pos 6 amod _ _
8 ? ? PUNCT . _ 3 punct _ _
`
func lotsaNNP() *lingo.Dependency {
readr := strings.NewReader(nnps)
sentenceTags := treebank.ReadConllu(readr)
return sentenceTags[0].Dependency(dummyFix{})
}
// simpleSentence has 10 words
func simpleSentence() []treebank.SentenceTag {
readr := strings.NewReader(simple)
return treebank.ReadConllu(readr)
}
func mediumSentence() []treebank.SentenceTag {
readr := strings.NewReader(med)
return treebank.ReadConllu(readr)
}
// longSentence has 44 words
func longSentence() []treebank.SentenceTag {
readr := strings.NewReader(long)
return treebank.ReadConllu(readr)
}
func allSentences() []treebank.SentenceTag {
sentenceTags := treebank.ReadConllu(strings.NewReader(nnps))
sentenceTags = append(sentenceTags, treebank.ReadConllu(strings.NewReader(simple))...)
sentenceTags = append(sentenceTags, treebank.ReadConllu(strings.NewReader(med))...)
sentenceTags = append(sentenceTags, treebank.ReadConllu(strings.NewReader(long))...)
return sentenceTags
}
func cvSentences() []treebank.SentenceTag {
return treebank.ReadConllu(strings.NewReader(cvconllu))
}
func hash(s string) string {
h := md5.New()
io.WriteString(h, s)
return fmt.Sprintf("%x", h.Sum(nil))
}
func cache(input string, s lingo.AnnotatedSentence) {
hashfilename := "cached/" + hash(input) + ".cached"
f, err := os.Create(hashfilename)
if err != nil {
log.Fatal(err)
}
defer f.Close()
w := bufio.NewWriter(f)
defer w.Flush()
encoder := gob.NewEncoder(w)
if err := encoder.Encode(s); err != nil {
log.Fatal(err)
}
}
func useCached(filename string) *lingo.Dependency {
f, err := os.Open(filename)
if err != nil {
log.Fatal(err)
}
defer f.Close()
r := bufio.NewReader(f)
decoder := gob.NewDecoder(r)
var sentence lingo.AnnotatedSentence
if err := decoder.Decode(&sentence); err != nil {
log.Fatal(err)
}
// fixes ID and what nots
sentence.Fix()
dep := sentence.Dependency()
return dep
}
================================================
FILE: dep/train.go
================================================
package dep
import (
"fmt"
"os"
"sync"
"github.com/chewxy/lingo"
"github.com/chewxy/lingo/corpus"
"github.com/chewxy/lingo/treebank"
"github.com/pkg/errors"
)
// TrainerConsOpt is a construction option for trainer
type TrainerConsOpt func(t *Trainer)
// WithTrainingModel loads a trainer with a model
func WithTrainingModel(m *Model) TrainerConsOpt {
f := func(t *Trainer) {
t.Model = m
}
return f
}
// WithTrainingSet creates a trainer with a training set
func WithTrainingSet(st []treebank.SentenceTag) TrainerConsOpt {
f := func(t *Trainer) {
t.trainingSet = st
}
return f
}
// WithCrossValidationSet creates a trainer with a cross validation set
func WithCrossValidationSet(st []treebank.SentenceTag) TrainerConsOpt {
f := func(t *Trainer) {
t.crossValSet = st
}
return f
}
// WithConfig sets up a *Trainer with a NNConfig
func WithConfig(conf NNConfig) TrainerConsOpt {
f := func(t *Trainer) {
t.nn.NNConfig = conf
t.nn.dict = t.corpus
t.nn.transitions = t.ts
t.EvalPerIter = conf.EvalPerIteration
}
return f
}
// WithLemmatizer sets the lemmatizer option on the Trainer
func WithLemmatizer(l lingo.Lemmatizer) TrainerConsOpt {
f := func(t *Trainer) {
// cannot pass in itself!
if T, ok := l.(*Trainer); ok && T == t {
panic("Recursive definition of lemmatizer (trying to set the t.lemmatizer = T) !")
}
t.l = l
}
return f
}
// WithStemmer sets up the stemmer option on the DependencyParser
func WithStemmer(s lingo.Stemmer) TrainerConsOpt {
f := func(t *Trainer) {
// cannot pass in itself
if T, ok := s.(*Trainer); ok && T == t {
panic("Recursive setting of stemmer! (Trying to set t.stemmer = T)")
}
t.s = s
}
return f
}
// WithCluster sets the brown cluster options for the DependencyParser
func WithCluster(c map[string]lingo.Cluster) TrainerConsOpt {
f := func(t *Trainer) {
t.c = c
}
return f
}
// WithCorpus creates a Trainer with a corpus
func WithCorpus(c *corpus.Corpus) TrainerConsOpt {
f := func(t *Trainer) {
t.corpus = c
t.nn.dict = c
}
return f
}
// WithGeneratedCorpus creates a Trainer's corpus from a list of SentenceTags. The corpus will be generated from the SentenceTags
func WithGeneratedCorpus(sts ...treebank.SentenceTag) TrainerConsOpt {
f := func(t *Trainer) {
dict := corpus.GenerateCorpus(sts)
if t.corpus == nil {
t.corpus = dict
} else {
t.corpus.Merge(dict)
}
t.nn.dict = t.corpus
}
return f
}
// Trainer trains a model
type Trainer struct {
trainingSet []treebank.SentenceTag
crossValSet []treebank.SentenceTag
once sync.Once
*Model
// Training configuration
EvalPerIter int // for cross validation - evaluate results every n epochs
PassDirect bool // Pass on the costs directly to the cost channel? If false, an average will be used
SaveBest string // SaveBest is the filename that will be saved. If it's empty then the best-while-training will not be saved
// fixer
l lingo.Lemmatizer
s lingo.Stemmer
c map[string]lingo.Cluster
err chan error
cost chan float64
perf chan Performance
}
// NewTrainer creates a new Trainer.
func NewTrainer(opts ...TrainerConsOpt) *Trainer {
t := new(Trainer)
// set up the default model
t.Model = new(Model)
t.corpus = KnownWords
t.ts = transitions
// set up the neural network
t.nn = new(neuralnetwork2)
t.nn.NNConfig = DefaultNNConfig
t.nn.transitions = transitions
t.nn.dict = KnownWords
for _, opt := range opts {
opt(t)
}
return t
}
// Lemmatize implemnets lingo.Lemmatizer
func (t *Trainer) Lemmatize(a string, pt lingo.POSTag) ([]string, error) {
if t.l == nil {
return nil, componentUnavailable("Lemmatizer")
}
return t.l.Lemmatize(a, pt)
}
// Stem implements lingo.Stemmer
func (t *Trainer) Stem(a string) (string, error) {
if t.s == nil {
return "", componentUnavailable("Stemmer")
}
return t.s.Stem(a)
}
// Clusters implements lingo.Fixer
func (t *Trainer) Clusters() (map[string]lingo.Cluster, error) {
if t.c == nil {
return nil, componentUnavailable("Clusters")
}
return t.c, nil
}
/* Getters */
// Cost returns a channel of costs for monitoring the training. If the PassDirect field in the trainer is set to true
// then the costs are directly returned. Otherwise the costs are averaged over the epoch.
func (t *Trainer) Cost() <-chan float64 {
if t.cost == nil {
t.cost = make(chan float64)
}
return t.cost
}
// Perf returns a channel of Performance for monitoring the training.
func (t *Trainer) Perf() <-chan Performance {
if t.perf == nil {
t.perf = make(chan Performance)
}
return t.perf
}
/* Methods */
// Init initializes the DependencyParser with a corpus and a neural network config
func (t *Trainer) Init() (err error) {
f := func() {
err = t.nn.init()
}
t.once.Do(f)
return
}
// Train trains a model.
//
// If a cross validation set is provided, it will automatically train with the cross validation set
func (t *Trainer) Train(epochs int) error {
if err := t.pretrainCheck(); err != nil {
return err
}
if len(t.crossValSet) > 0 {
return t.crossValidateTrain(epochs)
}
return t.train(epochs)
}
// TrainWithoutCrossValidation trains a model without cross validation.
func (t *Trainer) TrainWithoutCrossValidation(epochs int) error {
return t.train(epochs)
}
// train simply trains the model without having a cross validation.
func (t *Trainer) train(epochs int) error {
var epochChan chan struct{}
if t.cost != nil {
defer func() {
close(t.cost)
t.cost = nil
}()
epochChan = t.handleCosts()
if epochChan != nil {
defer close(epochChan)
}
}
examples := makeExamples(t.trainingSet, t.nn.NNConfig, t.nn.dict, t.ts, t)
for e := 0; e < epochs; e++ {
if err := t.nn.train(examples); err != nil {
return err
}
if epochChan != nil {
epochChan <- struct{}{}
}
shuffleExamples(examples)
}
return nil
}
// crossValidateTrain trains the model but also does cross validation to ensure overfitting don't happen.
func (t *Trainer) crossValidateTrain(epochs int) error {
if t.perf != nil {
defer func() {
close(t.perf)
t.perf = nil
}()
}
var epochChan chan struct{}
if t.cost != nil {
defer func() {
close(t.cost)
t.cost = nil
}()
epochChan = t.handleCosts()
if epochChan != nil {
defer close(epochChan)
}
}
examples := makeExamples(t.trainingSet, t.nn.NNConfig, t.nn.dict, t.ts, t)
var best Performance
for e := 0; e < epochs; e++ {
if err := t.nn.train(examples); err != nil {
return err
}
if t.EvalPerIter > 0 && e%t.EvalPerIter == 0 || e == epochs-1 {
perf := t.crossValidate(t.crossValSet)
// if there is a channel to report back the performance, send it down
if t.perf != nil {
perf.Iter = e
t.perf <- perf
}
if perf.UAS > best.UAS {
best = perf
if t.SaveBest != "" {
f, err := os.Create(t.SaveBest)
if err != nil {
err = errors.Wrapf(err, "Unable to open SaveBest file %q", t.SaveBest)
return err
}
t.Model.SaveWriter(f)
}
}
}
if epochChan != nil {
epochChan <- struct{}{}
}
shuffleExamples(examples)
}
return nil
}
// pretrainCheck checks if everything is sane
func (t *Trainer) pretrainCheck() error {
// check
if t.nn == nil || !t.nn.initialized() {
return errors.Errorf("DependencyParser not init()'d. Perhaps you forgot to call .Init() somewhere?")
}
if len(t.trainingSet) == 0 {
return errors.Errorf("Cannot train with no training data set")
}
return nil
}
// handleCosts handles the costs from the neural network in two ways:
// 1. pass: directly passes on the costs (which may come from multiple batches in an epoch)
// 2. mean: calculates the mean of the costs and passes it on into d.cost
//
// If d.cost is nil, it simply returns. This method should be called after a check that d.cost is not nil
func (t *Trainer) handleCosts() (epochChan chan struct{}) {
nncost := t.nn.costProgress()
if t.PassDirect {
go func() {
for cost := range nncost {
switch c := cost.Data().(type) {
case float32:
t.cost <- float64(c)
case float64:
t.cost <- c
default:
// this should NEVER happen
panic(fmt.Sprintf("Unhandled cost type %T", c))
}
}
}()
} else {
epochChan = make(chan struct{})
// it collects the costs until the epoch chan signals that an epoch is done. Then the cost is averaged and sent down the d.cost channel
go func(epochChan chan struct{}) {
var collected []float64
for {
select {
case cost := <-nncost:
switch c := cost.Data().(type) {
case float32:
collected = append(collected, float64(c))
case float64:
collected = append(collected, c)
default:
// this should NEVER happen
panic(fmt.Sprintf("Unhandled cost type %T", c))
}
case <-epochChan:
var avg float64
for _, cost := range collected {
avg += cost
}
if len(collected) > 0 {
avg /= float64(len(collected))
}
t.cost <- avg
collected = collected[:0]
}
}
}(epochChan)
}
return
}
================================================
FILE: dep/train_test.go
================================================
package dep
import (
"testing"
"github.com/chewxy/lingo/corpus"
G "gorgonia.org/gorgonia"
)
func TestTrainerInitializations(t *testing.T) {
var d *Trainer
c := corpus.New()
d = NewTrainer(WithCorpus(c))
if d.corpus != c {
t.Errorf("Expected Corpus to be set to %p. Got %p instead", c, d.corpus)
}
d = NewTrainer(WithConfig(DefaultNNConfig))
if d.corpus != KnownWords {
t.Error("Expected corpus to be set to the default KnownWords corpus")
}
if d.nn == nil {
t.Fatal("Expected a neural network")
}
if d.nn.dict != KnownWords {
t.Error("Expected neuralnetwork's dict to be set")
}
// d2 = d.Clone()
// if d2.nn != d.nn {
// t.Error("Expected a neural network!")
// }
// // init empty
// d = New()
// if err := d.Init(); err != nil {
// t.Errorf("%+v", err)
// }
// // init with a corpus
// d = New(WithCorpus(c))
// if err := d.Init(); err != nil {
// t.Errorf("%+v", err)
// }
}
func TestTrainer_train(t *testing.T) {
sts := allSentences()
epochs := 10
var err error
trainer := NewTrainer(WithGeneratedCorpus(sts...), WithTrainingSet(sts))
if err = trainer.Train(epochs); err == nil {
t.Error("Expected an error when training an uninitialized Trainer")
}
// with init
t.Logf("Pass On Costs Directly")
conf := DefaultNNConfig
conf.BatchSize = 90
trainer = NewTrainer(WithGeneratedCorpus(sts...), WithConfig(conf), WithTrainingSet(sts))
if err := trainer.Init(); err != nil {
t.Errorf("%+v", err)
}
trainer.PassDirect = true
var costs []float64
cost := trainer.Cost()
go func() {
for c := range cost {
costs = append(costs, c)
t.Logf("Cost %v", c)
}
}()
if err = trainer.Train(epochs); err != nil {
t.Errorf("Err: %v", err)
}
if len(costs) == 0 {
t.Errorf("Zero costs...")
goto avgcosts
}
t.Logf("Costs %d", len(costs))
if len(costs) < (epochs*2)-5 { // we'll allow some tolerance
t.Errorf("Expected some costs")
}
if costs[0] < costs[len(costs)-1] {
t.Errorf("Costs should be reducing")
}
avgcosts:
// with init, avg costs
t.Logf("Average Costs")
costs = costs[:0] // reset
conf = DefaultNNConfig
conf.Dtype = G.Float32
trainer = NewTrainer(WithGeneratedCorpus(sts...), WithConfig(conf), WithTrainingSet(sts))
if err := trainer.Init(); err != nil {
t.Errorf("%+v", err)
}
trainer.PassDirect = false
cost = trainer.Cost()
go func() {
for c := range cost {
costs = append(costs, c)
t.Logf("Cost %v", c)
}
}()
if err = trainer.Train(epochs); err != nil {
t.Errorf("%v", err)
}
if len(costs) == 0 {
t.Fatal("Zero costs")
}
t.Logf("Costs %d", len(costs))
if len(costs) == 0 {
t.Errorf("Expected some costs")
}
if costs[0] < costs[len(costs)-1] {
t.Errorf("Costs should be reducing")
}
}
func TestTestTrainer_crossValidateTrain(t *testing.T) {
sts := allSentences()
cv := cvSentences()
epochs := 10
var trainer *Trainer
var err error
// uninit
t.Logf("Uninitiated")
trainer = NewTrainer(WithGeneratedCorpus(sts...))
if err = trainer.Train(epochs); err == nil {
t.Errorf("Expected an error when training with an uninitialized Trainer")
}
// with init
t.Logf("Pass On Costs Directly")
conf := DefaultNNConfig
conf.BatchSize = 90
trainer = NewTrainer(WithGeneratedCorpus(sts...), WithConfig(conf), WithTrainingSet(sts), WithCrossValidationSet(cv))
trainer.PassDirect = true
if err := trainer.Init(); err != nil {
t.Errorf("%+v", err)
}
var costs []float64
cost := trainer.Cost()
perf := trainer.Perf()
go func() {
for p := range perf {
t.Logf("Perf \n%v", p)
}
}()
go func() {
for c := range cost {
costs = append(costs, c)
t.Logf("Cost %v", c)
}
}()
if err = trainer.Train(epochs); err != nil {
t.Error(err)
}
if len(costs) == 0 {
t.Errorf("Zero costs")
goto avgCosts
}
t.Logf("Costs %d", len(costs))
if len(costs) < (epochs*2)-5 { // we'll allow some tolerance
t.Errorf("Expected some costs")
}
if costs[0] < costs[len(costs)-1] {
t.Errorf("Costs should be reducing")
}
avgCosts:
// with init, avg costs, and using float32
t.Logf("Average Costs")
costs = costs[:0] // reset
conf = DefaultNNConfig
conf.Dtype = G.Float32
trainer = NewTrainer(WithGeneratedCorpus(sts...), WithConfig(conf), WithTrainingSet(sts), WithCrossValidationSet(cv))
if err := trainer.Init(); err != nil {
t.Errorf("%+v", err)
}
trainer.PassDirect = false
cost = trainer.Cost()
perf = trainer.Perf()
go func() {
for p := range perf {
t.Logf("Perf \n%v", p)
}
}()
go func() {
for c := range cost {
costs = append(costs, c)
t.Logf("Cost %v", c)
}
}()
trainer.Train(epochs)
if len(costs) == 0 {
t.Fatal("Zero costs")
}
t.Logf("Costs %d", len(costs))
if len(costs) == 0 {
t.Errorf("Expected some costs")
}
if costs[0] < costs[len(costs)-1] {
t.Errorf("Costs should be reducing")
}
}
================================================
FILE: dep/transition.go
================================================
package dep
import (
"fmt"
"github.com/chewxy/lingo"
)
// transition is a tuple of Move and label
type transition struct {
Move
lingo.DependencyType
}
var transitions []transition
var MAXTRANSITION int
func buildTransitions(labels []lingo.DependencyType) []transition {
ts := make([]transition, 0)
// for _, l := range labels {
// if l == lingo.NoDepType {
// continue
// }
// t := transition{Left, l}
// ts = append(ts, t)
// }
// for _, l := range labels {
// if l == lingo.NoDepType {
// continue
// }
// t := transition{Right, l}
// ts = append(ts, t)
// }
// ts = append(ts, transition{Shift, lingo.NoDepType})
for _, m := range ALLMOVES {
for _, l := range labels {
if (m == Shift && l != lingo.NoDepType) || (m != Shift && l == lingo.NoDepType) {
continue
}
t := transition{m, l}
ts = append(ts, t)
}
}
return ts
}
func (t transition) String() string {
return fmt.Sprintf("(%s, %s)", t.Move, t.DependencyType)
}
func lookupTransition(t transition, table []transition) int {
for i, v := range table {
if v == t {
return i
}
}
panic(fmt.Sprintf("Transition %v not found", t))
}
// this builds the default transitions
func init() {
lbls := make([]lingo.DependencyType, lingo.MAXDEPTYPE)
for i := 0; i < int(lingo.MAXDEPTYPE); i++ {
lbls[i] = lingo.DependencyType(i)
}
transitions = buildTransitions(lbls)
MAXTRANSITION = len(transitions)
}
================================================
FILE: dep/util.go
================================================
package dep
func minInt(a, b int) int {
if a < b {
return a
}
return b
}
func maxInt(a, b int) int {
if a > b {
return a
}
return b
}
================================================
FILE: dependency.go
================================================
package lingo
import (
"bytes"
"fmt"
)
// Dependency represents the dependency parse of a sentence. While AnnotatedSentence does
// already do a job of representing the dependency parse of a sentence, *Dependency actually contains
// meta information about the dependency parse (specifically, lefts, rights) that makes parsing a dependency a lot faster
//
// The fields are mostly left unexported for a good reason - a dependency parse SHOULD be static after it's been built
type Dependency struct {
AnnotatedSentence
wordCount int
lefts [][]int
rights [][]int
counter int // for checking if a tree is projective
n int
}
type depConsOpt func(*Dependency)
// FromAnnotatedSentence creates a dependency from an AnnotatedSentence.
func FromAnnotatedSentence(s AnnotatedSentence) depConsOpt {
fn := func(d *Dependency) {
wc := len(s)
d.AnnotatedSentence = s
d.wordCount = wc
d.n = wc - 1
}
return fn
}
// AllocTree allocates the lefts and rights. Typical construction of the *Dependency doesn't allocate the trees as they're not necessary for a number of tasks.
func AllocTree() depConsOpt {
fn := func(d *Dependency) {
d.lefts = make([][]int, d.wordCount)
d.rights = make([][]int, d.wordCount)
for i := 0; i < d.wordCount; i++ {
d.lefts[i] = make([]int, 0)
d.rights[i] = make([]int, 0)
}
}
return fn
}
// NewDependency creates a new *Dependency. It takes optional construction options:
// FromAnnotatedSentence
// AllocTree
func NewDependency(opts ...depConsOpt) *Dependency {
d := new(Dependency)
for _, opt := range opts {
opt(d)
}
return d
}
func (d *Dependency) Sentence() AnnotatedSentence { return d.AnnotatedSentence }
func (d *Dependency) Lefts() [][]int { return d.lefts }
func (d *Dependency) Rights() [][]int { return d.rights }
func (d *Dependency) WordCount() int { return d.wordCount }
func (d *Dependency) N() int { return d.n }
// please only use these for testing
func (d *Dependency) SetLefts(l [][]int) { d.lefts = l }
func (d *Dependency) SetRights(r [][]int) { d.rights = r }
func (d *Dependency) Head(i int) int {
if i < 0 || i >= d.wordCount || d.AnnotatedSentence[i].Head == nil {
return -1
}
return d.AnnotatedSentence[i].HeadID()
}
func (d *Dependency) Label(i int) DependencyType {
if i < 0 || i >= d.wordCount {
return NoDepType
}
return d.AnnotatedSentence[i].DependencyType
}
func (d *Dependency) Annotation(i int) *Annotation {
if i < 0 || i >= d.wordCount {
return nullAnnotation
}
return d.AnnotatedSentence[i]
}
func (d *Dependency) AddArc(head, child int, label DependencyType) {
d.AddChild(head, child)
d.AddRel(child, label)
}
func (d *Dependency) AddChild(head, child int) {
headAnn := d.AnnotatedSentence[head]
d.AnnotatedSentence[child].SetHead(headAnn)
if child < head {
d.lefts[head] = append(d.lefts[head], child)
} else {
d.rights[head] = append(d.rights[head], child)
}
d.n++
}
func (d *Dependency) AddRel(child int, rel DependencyType) {
// d.labels[child] = rel
d.AnnotatedSentence[child].DependencyType = rel
}
func (d *Dependency) HasSingleRoot() bool {
roots := 0
for _, a := range d.AnnotatedSentence {
h := a.HeadID()
if h == 0 {
roots++
}
}
return roots == 1
}
func (d *Dependency) IsLegal() bool {
var heads []int
for _, a := range d.AnnotatedSentence {
h := a.HeadID()
if h < 0 || h > d.wordCount {
return false
}
heads = append(heads, -1)
}
for i := 1; i < d.wordCount; i++ {
for k := i; k > 0; {
if heads[k] >= 0 && heads[k] < 1 {
break
}
if heads[k] == i {
return false
}
heads[k] = i
k = d.AnnotatedSentence[k].HeadID()
}
}
return true
}
func (d *Dependency) IsProjective() bool {
d.counter = -1
return d.projectiveVisit(0)
}
func (d *Dependency) projectiveVisit(w int) bool {
for i := 1; i < w; i++ {
if d.AnnotatedSentence[i].HeadID() == w && d.projectiveVisit(i) == false {
return false
}
}
d.counter++
if w != d.counter {
return false
}
for i := w + 1; i < d.wordCount; i++ {
if d.AnnotatedSentence[i].HeadID() == w && d.projectiveVisit(i) == false {
return false
}
}
return true
}
func (d *Dependency) Root() int {
for i := 1; i <= d.n; i++ {
if d.Head(i) == 0 {
return i
}
}
return 0
}
func (d *Dependency) SprintRel() string {
var buf bytes.Buffer
for _, e := range d.Edges() {
fmt.Fprintf(&buf, "%v(%q-%d, %q-%d)\n", e.Rel, e.Gov.Value, e.Gov.ID, e.Dep.Value, e.Dep.ID)
}
return buf.String()
}
type DependencyEdge struct {
Gov *Annotation
Dep *Annotation
Rel DependencyType
}
// Sort interface
type edgeByID []DependencyEdge
func (b edgeByID) Len() int { return len(b) }
func (b edgeByID) Swap(i, j int) { b[i], b[j] = b[j], b[i] }
func (b edgeByID) Less(i, j int) bool { return b[i].Dep.ID < b[j].Dep.ID }
================================================
FILE: dependencyTree.go
================================================
package lingo
import (
"github.com/awalterschulze/gographviz"
"fmt"
"sync"
)
// A DependencyTree is an alternate form of representing a dependency parse.
// This form makes it easier to traverse the tree
type DependencyTree struct {
Parent *DependencyTree
ID int // the word number in a sentence
Type DependencyType // refers to the dependency type to the parent
Word *Annotation
Children []*DependencyTree
}
func NewDependencyTree(parent *DependencyTree, ID int, ann *Annotation) *DependencyTree {
return &DependencyTree{
Parent: parent,
ID: ID,
Word: ann,
Children: make([]*DependencyTree, 0),
}
}
func (d *DependencyTree) AddChild(child *DependencyTree) {
d.Children = append(d.Children, child)
}
func (d *DependencyTree) AddRel(rel DependencyType) {
d.Type = rel
}
func (d *DependencyTree) walk(c chan *DependencyTree, wg *sync.WaitGroup) {
defer wg.Done()
for _, child := range d.Children {
wg.Add(1)
go child.walk(c, wg)
}
c <- d // man someone should do somehting about my bad naming
}
func (d *DependencyTree) Dot() string {
// walk graph
c := make(chan *DependencyTree)
out := make(chan string)
go dotString(c, out)
var wg sync.WaitGroup
wg.Add(1)
go d.walk(c, &wg)
wg.Wait()
close(c)
return <-out
}
func dotString(c chan *DependencyTree, out chan string) {
g := gographviz.NewEscape()
g.SetName("G")
g.SetDir(true) // it's always going to be a directed graph
// g.AddNode("G", "Node_0x0", nil) // add the root
for t := range c {
id := fmt.Sprintf("Node_%p", t)
attrs := map[string]string{
"label": fmt.Sprintf("%d: \"%s/%s\"", t.ID, t.Word.Value, t.Word.POSTag),
}
g.AddNode("G", id, attrs)
if t.Parent == nil {
continue
}
parentID := fmt.Sprintf("Node_%p", t.Parent)
edgeAttrs := map[string]string{
"label": fmt.Sprintf("%v", t.Type),
}
g.AddEdge(parentID, id, true, edgeAttrs)
}
out <- g.String()
}
func (d *DependencyTree) Walk(fn func(interface{})) {
for _, child := range d.Children {
child.Walk(fn)
}
if fn != nil {
fn(d)
}
}
================================================
FILE: dependencyType.go
================================================
package lingo
import (
"fmt"
"strings"
)
// DependencyType represents the relation between two words
type DependencyType byte
var dependencyTypeLookup map[string]DependencyType
func init() {
dependencyTypeLookup = make(map[string]DependencyType)
for dt := NoDepType; dt < MAXDEPTYPE; dt++ {
s := dt.String()
dependencyTypeLookup[s] = DependencyType(dt)
dependencyTypeLookup[strings.ToLower(s)] = DependencyType(dt)
}
}
func (dt DependencyType) MarshalText() ([]byte, error) {
return []byte(fmt.Sprintf("%v", dt)), nil
}
func (dt *DependencyType) UnmarshalText(text []byte) error {
str := strings.Trim(string(text), `"`) // for JSON use, if any
deptype, _ := dependencyTypeLookup[str]
*dt = deptype
return nil
}
// list of dependency type functions
func InDepTypes(x DependencyType, set []DependencyType) bool {
for _, v := range set {
if v == x {
return true
}
}
return false
}
func IsModifier(x DependencyType) bool { return InDepTypes(x, Modifiers) }
func IsCompound(x DependencyType) bool { return InDepTypes(x, Compounds) }
func IsDeterminerRel(x DependencyType) bool { return InDepTypes(x, DeterminerRels) }
func IsMultiword(x DependencyType) bool { return InDepTypes(x, MultiWord) }
func IsQuantifier(x DependencyType) bool { return InDepTypes(x, QuantifingMods) }
================================================
FILE: dependencyType_stanford.go
================================================
// +build stanfordrel
package lingo
const BUILD_RELSET = "stanfordrel"
//go:generate stringer -type=DependencyType -output=dependencyType_stanford_string.go
// http://nlp.stanford.edu/software/dependencies_manual.pdf
const (
NoDepType DependencyType = iota
Dep
Root
Aux // Auxilliary
AuxPass // passive auxiliary
Cop // Copula
Arg // argument
Agent // agent
Comp // Complement
AComp // adjectival complement
CComp // clausal complement with internal subject
XComp // clausal complement with external subject
Obj // Object
DObj // Direct Object
IObj // Indirect Object
PObj // Object of preposition
Subj // subject
NSubj // Nominal Subject
NSubjPass // passive nominal subject
CSubj // clausal subject
CSubjPass // passive clausal subject
Coordination // coordination (cannot use CC, as it's a POSTag)
Conj // conjunction
Expl // Expletive
Mod // modifier
AMod // adjectival modifier
Appos // Appositional modifier
Advcl // adverbial clause modifier
Det // determiner
Predet // predeterminer
Preconj // Preconjunction
Vmod // reduced, nonfinite verbal modifier
MWE // multiword expression modifier
Mark // marker (word introducing an Advcl or CComp)
AdvMod // adverbial modifier
Neg // negation modifier
RCMod // relative clause modifier
QuantMod // quantifier modifier
NounMod // Noun Compound Modifier (cannot use NN because NN is defined as a POSTag)
NPAdvMod // Noun phrase adverbial modifier
TMod // temporal modifier
Num // Numeric Modifier
NumberElement // element of compound number (cannot use Number because Number is defined as a LexemeType)
Prep // prepositional modifier
Poss // possession modifier
Possessive // possessive modifier ('s)
PRT // phrasal verb partical
Parataxis // Parataxis (words that are next to each other)
GoesWith // GoesWith
Punct // punctuation
Ref // referant
SDep // Semantic Dependent
XSubj // controlling subject
// additional stuff not found in the original, but found in EWT
Case
Compound
NMod
Discourse
NumMod
RelCl
NFinCl
NMod_Poss
NMod_NPMod
Vocative
List
MWPrep // multiword prepositional modifier
Remnant
Acl
NPMod
MDVod
DetMod
// found in stanford nnparser SD models
PComp
MAXDEPTYPE
)
var Modifiers = []DependencyType{AMod}
var Compounds = []DependencyType{Compound}
var DeterminerRels = []DependencyType{Det, DetMod}
var MultiWord = []DependencyType{MWE, MWPrep, Compound, Parataxis}
var QuantifingMods = []DependencyType{QuantMod, NumMod}
================================================
FILE: dependencyType_stanford_string.go
================================================
// +build stanfordrel
// Code generated by "stringer -type=DependencyType -output=dependencyType_stanford_string.go"; DO NOT EDIT
package lingo
import "fmt"
const _DependencyType_name = "NoDepTypeDepRootNSubjNSubjPassDObjIObjCSubjCSubjPassCCompXCompNumModApposNModAClACl_RelClDetDet_PreDetAModNegCaseNMod_NPModNMod_TModNMod_PossAdvClAdvModCompoundCompound_PartMWEListParataxisDiscourseExplAuxAuxPassCopMarkPunctConjCoordinationCC_PreConjMAXDEPTYPE"
var _DependencyType_index = [...]uint16{0, 9, 12, 16, 21, 30, 34, 38, 43, 52, 57, 62, 68, 73, 77, 80, 89, 92, 102, 106, 109, 113, 123, 132, 141, 146, 152, 160, 173, 176, 180, 189, 198, 202, 205, 212, 215, 219, 224, 228, 240, 250, 260}
func (i DependencyType) String() string {
if i >= DependencyType(len(_DependencyType_index)-1) {
return fmt.Sprintf("DependencyType(%d)", i)
}
return _DependencyType_name[_DependencyType_index[i]:_DependencyType_index[i+1]]
}
================================================
FILE: dependencyType_universal.go
================================================
// +build !stanfordrel
package lingo
const BUILD_RELSET = "universalrel"
//go:generate stringer -type=DependencyType -output=dependencyType_universal_string.go
// http://universaldependencies.github.io/docs/en/dep/all.html
const (
NoDepType DependencyType = iota
Dep
Root
// Core dependents of clausal predicates
// nominal dependencies
NSubj
NSubjPass
DObj
IObj
// predicate dependencies
CSubj
CSubjPass
CComp
XComp
// Noun dependents
// nominal dependencies
NumMod
Appos
NMod
// predicate dependencies
ACl
ACl_RelCl // RCMod in stanford deps
Det
Det_PreDet
// modifier word
AMod
Neg
// Case Marking, preposition, possessive
Case
//Non-Core Dependents of Clausal Predicates
// Nominal dependencies
NMod_NPMod
NMod_TMod
NMod_Poss
// Predicate Dependencies
AdvCl
// Modifier Word
AdvMod
// Compounding and Unanalyzed
Compound
Compound_Part
Name // Unused in English
MWE
Foreign // Unused in English
GoesWith // Unused in English
// Loose Joining Relations
List
Dislocated // Unused in English
Parataxis
Remnant // Unused in English
Reparandum // Unused in English
// Special Clausal Dependents
// Nominal Dependent
Vocative // Unused in English
Discourse
Expl
// Auxilliary
Aux
AuxPass
Cop
// Other
Mark
Punct
// Coordination
Conj
Coordination // CC
CC_PreConj
MAXDEPTYPE
)
var Modifiers = []DependencyType{AMod}
var Compounds = []DependencyType{Compound, Compound_Part}
var DeterminerRels = []DependencyType{Det, Det_PreDet}
var MultiWord = []DependencyType{MWE, Compound, Compound_Part, Parataxis}
var QuantifingMods = []DependencyType{NumMod}
================================================
FILE: dependencyType_universal_string.go
================================================
// +build !stanfordrel
// Code generated by "stringer -type=DependencyType -output=dependencyType_universal_string.go"; DO NOT EDIT
package lingo
import "fmt"
const _DependencyType_name = "NoDepTypeDepRootNSubjNSubjPassDObjIObjCSubjCSubjPassCCompXCompNumModApposNModAClACl_RelClDetDet_PreDetAModNegCaseNMod_NPModNMod_TModNMod_PossAdvClAdvModCompoundCompound_PartNameMWEForeignGoesWithListDislocatedParataxisRemnantReparandumVocativeDiscourseExplAuxAuxPassCopMarkPunctConjCoordinationCC_PreConjMAXDEPTYPE"
var _DependencyType_index = [...]uint16{0, 9, 12, 16, 21, 30, 34, 38, 43, 52, 57, 62, 68, 73, 77, 80, 89, 92, 102, 106, 109, 113, 123, 132, 141, 146, 152, 160, 173, 177, 180, 187, 195, 199, 209, 218, 225, 235, 243, 252, 256, 259, 266, 269, 273, 278, 282, 294, 304, 314}
func (i DependencyType) String() string {
if i >= DependencyType(len(_DependencyType_index)-1) {
return fmt.Sprintf("DependencyType(%d)", i)
}
return _DependencyType_name[_DependencyType_index[i]:_DependencyType_index[i+1]]
}
================================================
FILE: errors.go
================================================
package lingo
type componentUnavailable interface {
error
Component() string
}
================================================
FILE: go.mod
================================================
module github.com/chewxy/lingo
require (
github.com/abiosoft/ishell v2.0.0+incompatible
github.com/abiosoft/readline v0.0.0-20180607040430-155bce2042db // indirect
github.com/awalterschulze/gographviz v0.0.0-20190221210632-1e9ccb565bca
github.com/chewxy/hm v1.0.0 // indirect
github.com/chewxy/math32 v1.0.0 // indirect
github.com/chzyer/logex v1.1.10 // indirect
github.com/chzyer/test v0.0.0-20180213035817-a1ea475d72b1 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/fatih/color v1.7.0 // indirect
github.com/flynn-archive/go-shlex v0.0.0-20150515145356-3f9db97f8568 // indirect
github.com/gogo/protobuf v1.2.1 // indirect
github.com/golang/protobuf v1.2.0 // indirect
github.com/google/flatbuffers v1.10.0 // indirect
github.com/kljensen/snowball v0.6.0
github.com/leesper/go_rng v0.0.0-20171009123644-5344a9259b21 // indirect
github.com/mattn/go-colorable v0.1.1 // indirect
github.com/mattn/go-isatty v0.0.6 // indirect
github.com/pkg/browser v0.0.0-20180916011732-0a3d74bf9ce4
github.com/pkg/errors v0.8.1
github.com/stretchr/testify v1.3.0
github.com/xtgo/set v1.0.0
golang.org/x/exp v0.0.0-20190221220918-438050ddec5e // indirect
golang.org/x/sync v0.0.0-20181221193216-37e7f081c4d4 // indirect
golang.org/x/sys v0.0.0-20190225065934-cc5685c2db12 // indirect
golang.org/x/text v0.3.0
gonum.org/v1/gonum v0.0.0-20190221132855-8ea67971a689 // indirect
gonum.org/v1/netlib v0.0.0-20190221094214-0632e2ebbd2d // indirect
gorgonia.org/cu v0.9.0-beta // indirect
gorgonia.org/dawson v1.1.0 // indirect
gorgonia.org/gorgonia v0.9.1
gorgonia.org/tensor v0.9.0-beta
gorgonia.org/vecf32 v0.7.0 // indirect
gorgonia.org/vecf64 v0.7.0 // indirect
)
go 1.13
================================================
FILE: go.sum
================================================
github.com/abiosoft/ishell v2.0.0+incompatible/go.mod h1:HQR9AqF2R3P4XXpMpI0NAzgHf/aS6+zVXRj14cVk9qg=
github.com/abiosoft/readline v0.0.0-20180607040430-155bce2042db/go.mod h1:rB3B4rKii8V21ydCbIzH5hZiCQE7f5E9SzUb/ZZx530=
github.com/awalterschulze/gographviz v0.0.0-20190221210632-1e9ccb565bca h1:xwIXr1FpA2XBoohlpvgb11No/zbsh5Clm/98PWPcHVA=
github.com/awalterschulze/gographviz v0.0.0-20190221210632-1e9ccb565bca/go.mod h1:GEV5wmg4YquNw7v1kkyoX9etIk8yVmXj+AkDHuuETHs=
github.com/chewxy/hm v1.0.0 h1:zy/TSv3LV2nD3dwUEQL2VhXeoXbb9QkpmdRAVUFiA6k=
github.com/chewxy/hm v1.0.0/go.mod h1:qg9YI4q6Fkj/whwHR1D+bOGeF7SniIP40VweVepLjg0=
github.com/chewxy/math32 v1.0.0 h1:RTt2SACA7BTzvbsAKVQJLZpV6zY2MZw4bW9L2HEKkHg=
github.com/chewxy/math32 v1.0.0/go.mod h1:Miac6hA1ohdDUTagnvJy/q+aNnEk16qWUdb8ZVhvCN0=
github.com/chzyer/logex v1.1.10 h1:Swpa1K6QvQznwJRcfTfQJmTE72DqScAa40E+fbHEXEE=
github.com/chzyer/logex v1.1.10/go.mod h1:+Ywpsq7O8HXn0nuIou7OrIPyXbp3wmkHB+jjWRnGsAI=
github.com/chzyer/test v0.0.0-20180213035817-a1ea475d72b1 h1:q763qf9huN11kDQavWsoZXJNW3xEE4JJyHa5Q25/sd8=
github.com/chzyer/test v0.0.0-20180213035817-a1ea475d72b1/go.mod h1:Q3SI9o4m/ZMnBNeIyt5eFwwo7qiLfzFZmjNmxjkiQlU=
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/fatih/color v1.7.0/go.mod h1:Zm6kSWBoL9eyXnKyktHP6abPY2pDugNf5KwzbycvMj4=
github.com/flynn-archive/go-shlex v0.0.0-20150515145356-3f9db97f8568/go.mod h1:rZfgFAXFS/z/lEd6LJmf9HVZ1LkgYiHx5pHhV5DR16M=
github.com/gogo/protobuf v1.2.1 h1:/s5zKNz0uPFCZ5hddgPdo2TK2TVrUNMn0OOX8/aZMTE=
github.com/gogo/protobuf v1.2.1/go.mod h1:hp+jE20tsWTFYpLwKvXlhS1hjn+gTNwPg2I6zVXpSg4=
github.com/golang/protobuf v1.2.0 h1:P3YflyNX/ehuJFLhxviNdFxQPkGK5cDcApsge1SqnvM=
github.com/golang/protobuf v1.2.0/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=
github.com/google/flatbuffers v1.10.0 h1:wHCM5N1xsJ3VwePcIpVqnmjAqRXlR44gv4hpGi+/LIw=
github.com/google/flatbuffers v1.10.0/go.mod h1:1AeVuKshWv4vARoZatz6mlQ0JxURH0Kv5+zNeJKJCa8=
github.com/kisielk/errcheck v1.1.0/go.mod h1:EZBBE59ingxPouuu3KfxchcWSUPOHkagtvWXihfKN4Q=
github.com/kisielk/gotool v1.0.0/go.mod h1:XhKaO+MFFWcvkIS/tQcRk01m1F5IRFswLeQ+oQHNcck=
github.com/kljensen/snowball v0.6.0/go.mod h1:27N7E8fVU5H68RlUmnWwZCfxgt4POBJfENGMvNRhldw=
github.com/leesper/go_rng v0.0.0-20171009123644-5344a9259b21 h1:O75p5GUdUfhJqNCMM1ntthjtJCOHVa1lzMSfh5Qsa0Y=
github.com/leesper/go_rng v0.0.0-20171009123644-5344a9259b21/go.mod h1:N0SVk0uhy+E1PZ3C9ctsPRlvOPAFPkCNlcPBDkt0N3U=
github.com/mattn/go-colorable v0.1.1/go.mod h1:FuOcm+DKB9mbwrcAfNl7/TZVBZ6rcnceauSikq3lYCQ=
github.com/mattn/go-isatty v0.0.5/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
github.com/mattn/go-isatty v0.0.6/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
github.com/pkg/browser v0.0.0-20180916011732-0a3d74bf9ce4/go.mod h1:4OwLy04Bl9Ef3GJJCoec+30X3LQs/0/m4HFRt/2LUSA=
github.com/pkg/errors v0.8.1 h1:iURUrRGxPUNPdy5/HRSm+Yj6okJ6UtLINN0Q9M4+h3I=
github.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/testify v1.3.0 h1:TivCn/peBQ7UY8ooIcPgZFpTNSz0Q2U6UrFlUfqbe0Q=
github.com/stretchr/tes
gitextract_whqjv2y6/ ├── .gitignore ├── .travis.yml ├── CONTRIBUTING.md ├── CONTRIBUTORS.md ├── LICENSE ├── POSTag.go ├── POSTag_stanford.go ├── POSTag_stanford_string.go ├── POSTag_universal.go ├── POSTag_universal_string.go ├── README.md ├── annotation.go ├── annotationSet.go ├── annotationSet_bench_test.go ├── browncluster.go ├── cmd/ │ ├── demo/ │ │ ├── io.go │ │ ├── main.go │ │ └── nlp.go │ ├── dep/ │ │ ├── fixer.go │ │ ├── io.go │ │ ├── main.go │ │ ├── pipeline.go │ │ └── train.go │ ├── lexer/ │ │ └── main.go │ └── pos/ │ ├── crossvalidation.go │ ├── fixer.go │ └── main.go ├── const.go ├── corpus/ │ ├── consopt.go │ ├── corpus.go │ ├── corpus_test.go │ ├── functions.go │ ├── functions_test.go │ ├── inflection.go │ ├── inflection_test.go │ ├── io.go │ ├── io_test.go │ ├── lda.go │ ├── test_test.go │ └── utils.go ├── dep/ │ ├── README.md │ ├── arcStandard.go │ ├── arcStandard_test.go │ ├── configuration.go │ ├── configuration_test.go │ ├── debug.go │ ├── dependencyParser.go │ ├── documentation/ │ │ ├── iamhuman.dot │ │ └── thecatsatonthemat.dot │ ├── errors.go │ ├── evaluation.go │ ├── example.go │ ├── example_test.go │ ├── featureExtraction.go │ ├── features.go │ ├── features_string.go │ ├── fix.go │ ├── init.go │ ├── models.go │ ├── models_test.go │ ├── move.go │ ├── move_string.go │ ├── nn2.go │ ├── nn2_io.go │ ├── nn2_io_test.go │ ├── nn2_test.go │ ├── nnconfig.go │ ├── release.go │ ├── span.go │ ├── test_test.go │ ├── train.go │ ├── train_test.go │ ├── transition.go │ └── util.go ├── dependency.go ├── dependencyTree.go ├── dependencyType.go ├── dependencyType_stanford.go ├── dependencyType_stanford_string.go ├── dependencyType_universal.go ├── dependencyType_universal_string.go ├── errors.go ├── go.mod ├── go.sum ├── interfaces.go ├── io.go ├── io_test.go ├── lexeme.go ├── lexemetype_string.go ├── lexer/ │ ├── lexer.go │ ├── lexer_test.go │ └── stateFn.go ├── lingo.go ├── pos/ │ ├── allinone_test.go │ ├── context.go │ ├── context_test.go │ ├── contexttype_string.go │ ├── debug.go │ ├── errors.go │ ├── features.go │ ├── features_test.go │ ├── featuretype_string.go │ ├── models.go │ ├── models_test.go │ ├── perceptron.go │ ├── perceptron_io.go │ ├── perceptron_io_test.go │ ├── postagger.go │ ├── release.go │ ├── sentence.go │ ├── test_test.go │ ├── util.go │ └── util_test.go ├── sentence.go ├── sets.go ├── shape.go ├── stopwords.go ├── treebank/ │ ├── const_postag_stanford.go │ ├── const_postag_universal.go │ ├── const_rel_stanford.go │ ├── const_rel_universal.go │ ├── sentenceTag.go │ ├── sentenceTag_test.go │ ├── treebank.go │ ├── treebank_test.go │ └── util.go ├── utils.go └── wordFlags.go
SYMBOL INDEX (992 symbols across 111 files)
FILE: POSTag.go
type POSTag (line 9) | type POSTag
method MarshalText (line 22) | func (p POSTag) MarshalText() ([]byte, error) {
method UnmarshalText (line 26) | func (p *POSTag) UnmarshalText(text []byte) error {
function init (line 13) | func init() {
function InPOSTags (line 34) | func InPOSTags(x POSTag, set []POSTag) bool {
function IsAdjective (line 43) | func IsAdjective(x POSTag) bool { return InPOSTags(x, Adjectives) }
function IsNoun (line 44) | func IsNoun(x POSTag) bool { return InPOSTags(x, Nouns) }
function IsProperNoun (line 45) | func IsProperNoun(x POSTag) bool { return InPOSTags(x, ProperNouns) }
function IsVerb (line 46) | func IsVerb(x POSTag) bool { return InPOSTags(x, Verbs) }
function IsAdverb (line 47) | func IsAdverb(x POSTag) bool { return InPOSTags(x, Adverbs) }
function IsInterrogative (line 48) | func IsInterrogative(x POSTag) bool { return InPOSTags(x, Interrogatives) }
function IsDeterminer (line 49) | func IsDeterminer(x POSTag) bool { return InPOSTags(x, Determiners) }
function IsNumber (line 50) | func IsNumber(x POSTag) bool { return InPOSTags(x, Numbers) }
function IsSymbol (line 51) | func IsSymbol(x POSTag) bool { return InPOSTags(x, Symbols) }
FILE: POSTag_stanford.go
constant BUILD_TAGSET (line 7) | BUILD_TAGSET = "stanfordtags"
constant X (line 10) | X POSTag = iota
constant UNKNOWN_TAG (line 11) | UNKNOWN_TAG
constant ROOT_TAG (line 12) | ROOT_TAG
constant CC (line 13) | CC
constant CD (line 14) | CD
constant DT (line 15) | DT
constant EX (line 16) | EX
constant FW (line 17) | FW
constant IN (line 18) | IN
constant JJ (line 19) | JJ
constant JJR (line 20) | JJR
constant JJS (line 21) | JJS
constant LS (line 22) | LS
constant MD (line 23) | MD
constant NN (line 24) | NN
constant NNS (line 25) | NNS
constant NNP (line 26) | NNP
constant NNPS (line 27) | NNPS
constant PDT (line 28) | PDT
constant POS (line 29) | POS
constant PRP (line 30) | PRP
constant PPRP (line 31) | PPRP
constant RB (line 32) | RB
constant RBR (line 33) | RBR
constant RBS (line 34) | RBS
constant RP (line 35) | RP
constant SYM (line 36) | SYM
constant TO (line 37) | TO
constant UH (line 38) | UH
constant VB (line 39) | VB
constant VBD (line 40) | VBD
constant VBG (line 41) | VBG
constant VBN (line 42) | VBN
constant VBP (line 43) | VBP
constant VBZ (line 44) | VBZ
constant WDT (line 45) | WDT
constant WP (line 46) | WP
constant PWP (line 47) | PWP
constant WRB (line 48) | WRB
constant COMMA (line 51) | COMMA
constant FULLSTOP (line 52) | FULLSTOP
constant OPENQUOTE (line 53) | OPENQUOTE
constant CLOSEQUOTE (line 54) | CLOSEQUOTE
constant COLON (line 55) | COLON
constant DOLLAR (line 56) | DOLLAR
constant HASHSIGN (line 57) | HASHSIGN
constant LEFTBRACE (line 58) | LEFTBRACE
constant RIGHTBRACE (line 59) | RIGHTBRACE
constant HYPH (line 63) | HYPH
constant AFX (line 64) | AFX
constant ADD (line 65) | ADD
constant NFP (line 66) | NFP
constant GW (line 67) | GW
constant XX (line 68) | XX
constant MAXTAG (line 70) | MAXTAG
function POSTagShortcut (line 74) | func POSTagShortcut(l Lexeme) (POSTag, bool) {
function IsIN (line 128) | func IsIN(x POSTag) bool { return x == IN }
FILE: POSTag_stanford_string.go
constant _POSTag_name (line 9) | _POSTag_name = "XUNKNOWN_TAGROOT_TAGCCCDDTEXFWINJJJJRJJSLSMDNNNNSNNPNNPS...
method String (line 13) | func (i POSTag) String() string {
FILE: POSTag_universal.go
constant BUILD_TAGSET (line 7) | BUILD_TAGSET = "universaltags"
constant X (line 10) | X POSTag = iota
constant UNKNOWN_TAG (line 11) | UNKNOWN_TAG
constant ROOT_TAG (line 12) | ROOT_TAG
constant ADJ (line 13) | ADJ
constant ADP (line 14) | ADP
constant ADV (line 15) | ADV
constant AUX (line 16) | AUX
constant CONJ (line 17) | CONJ
constant DET (line 18) | DET
constant INTJ (line 19) | INTJ
constant NOUN (line 20) | NOUN
constant NUM (line 21) | NUM
constant PART (line 22) | PART
constant PRON (line 23) | PRON
constant PROPN (line 24) | PROPN
constant PUNCT (line 25) | PUNCT
constant SCONJ (line 26) | SCONJ
constant SYM (line 27) | SYM
constant VERB (line 28) | VERB
constant MAXTAG (line 30) | MAXTAG
function POSTagShortcut (line 34) | func POSTagShortcut(l Lexeme) (POSTag, bool) {
function IsIN (line 67) | func IsIN(x POSTag) bool { return x == SCONJ }
FILE: POSTag_universal_string.go
constant _POSTag_name (line 9) | _POSTag_name = "XUNKNOWN_TAGROOT_TAGADJADPADVAUXCONJDETINTJNOUNNUMPARTPR...
method String (line 13) | func (i POSTag) String() string {
FILE: annotation.go
type Annotation (line 15) | type Annotation struct
method Clone (line 63) | func (a *Annotation) Clone() *Annotation {
method SetHead (line 73) | func (a *Annotation) SetHead(headAnn *Annotation) {
method HeadID (line 80) | func (a *Annotation) HeadID() int {
method IsNumber (line 87) | func (a *Annotation) IsNumber() bool {
method String (line 91) | func (a *Annotation) String() string {
method GoString (line 95) | func (a *Annotation) GoString() string {
method Process (line 104) | func (a *Annotation) Process(f AnnotationFixer) error {
function NewAnnotation (line 37) | func NewAnnotation() *Annotation {
function AnnotationFromLexTag (line 46) | func AnnotationFromLexTag(l Lexeme, t POSTag, f AnnotationFixer) *Annota...
function RootAnnotation (line 170) | func RootAnnotation() *Annotation { return rootAnnotation }
function StartAnnotation (line 171) | func StartAnnotation() *Annotation { return startAnnotation }
function NullAnnotation (line 172) | func NullAnnotation() *Annotation { return nullAnnotation }
function StringToAnnotation (line 174) | func StringToAnnotation(s string, f AnnotationFixer) *Annotation {
type AnnotationFixer (line 184) | type AnnotationFixer interface
FILE: annotationSet.go
type AnnotationSet (line 10) | type AnnotationSet
method Len (line 12) | func (as AnnotationSet) Len() int { return len(as) }
method Swap (line 13) | func (as AnnotationSet) Swap(i, j int) { as[i], as[j] = as[j], as[i] }
method Less (line 14) | func (as AnnotationSet) Less(i, j int) bool {
method Set (line 18) | func (as AnnotationSet) Set() AnnotationSet {
method Contains (line 24) | func (as AnnotationSet) Contains(a *Annotation) bool {
method Index (line 31) | func (as AnnotationSet) Index(a *Annotation) int {
method Add (line 40) | func (as AnnotationSet) Add(a *Annotation) AnnotationSet {
FILE: annotationSet_bench_test.go
method index2 (line 8) | func (as AnnotationSet) index2(a *Annotation) int {
function benchASIndex (line 16) | func benchASIndex(size int, b *testing.B) {
function benchASIndex2 (line 31) | func benchASIndex2(size int, b *testing.B) {
function BenchmarkAnnotationSetIndex_1 (line 46) | func BenchmarkAnnotationSetIndex_1(b *testing.B) { benchASIndex(1, b) }
function BenchmarkAnnotationSetIndex_2 (line 47) | func BenchmarkAnnotationSetIndex_2(b *testing.B) { benchASIndex(2, b) }
function BenchmarkAnnotationSetIndex_8 (line 48) | func BenchmarkAnnotationSetIndex_8(b *testing.B) { benchASIndex(8, b) }
function BenchmarkAnnotationSetIndex_16 (line 49) | func BenchmarkAnnotationSetIndex_16(b *testing.B) { benchASIndex(16, b) }
function BenchmarkAnnotationSetIndex_32 (line 50) | func BenchmarkAnnotationSetIndex_32(b *testing.B) { benchASIndex(32, b) }
function BenchmarkAnnotationSetIndex_64 (line 51) | func BenchmarkAnnotationSetIndex_64(b *testing.B) { benchASIndex(64, b) }
function BenchmarkAnnotationSetIndex_128 (line 52) | func BenchmarkAnnotationSetIndex_128(b *testing.B) { benchASIndex(128, ...
function BenchmarkAnnotationSetIndex_256 (line 53) | func BenchmarkAnnotationSetIndex_256(b *testing.B) { benchASIndex(256, ...
function BenchmarkAnnotationSetIndex_512 (line 54) | func BenchmarkAnnotationSetIndex_512(b *testing.B) { benchASIndex(512, ...
function BenchmarkAnnotationSetIndex_1024 (line 55) | func BenchmarkAnnotationSetIndex_1024(b *testing.B) { benchASIndex(1024,...
function BenchmarkAnnotationSetIndex2_1 (line 57) | func BenchmarkAnnotationSetIndex2_1(b *testing.B) { benchASIndex2(1, ...
function BenchmarkAnnotationSetIndex2_2 (line 58) | func BenchmarkAnnotationSetIndex2_2(b *testing.B) { benchASIndex2(2, ...
function BenchmarkAnnotationSetIndex2_8 (line 59) | func BenchmarkAnnotationSetIndex2_8(b *testing.B) { benchASIndex2(8, ...
function BenchmarkAnnotationSetIndex2_16 (line 60) | func BenchmarkAnnotationSetIndex2_16(b *testing.B) { benchASIndex2(16,...
function BenchmarkAnnotationSetIndex2_32 (line 61) | func BenchmarkAnnotationSetIndex2_32(b *testing.B) { benchASIndex2(32,...
function BenchmarkAnnotationSetIndex2_64 (line 62) | func BenchmarkAnnotationSetIndex2_64(b *testing.B) { benchASIndex2(64,...
function BenchmarkAnnotationSetIndex2_128 (line 63) | func BenchmarkAnnotationSetIndex2_128(b *testing.B) { benchASIndex2(128...
function BenchmarkAnnotationSetIndex2_256 (line 64) | func BenchmarkAnnotationSetIndex2_256(b *testing.B) { benchASIndex2(256...
function BenchmarkAnnotationSetIndex2_512 (line 65) | func BenchmarkAnnotationSetIndex2_512(b *testing.B) { benchASIndex2(512...
function BenchmarkAnnotationSetIndex2_1024 (line 66) | func BenchmarkAnnotationSetIndex2_1024(b *testing.B) { benchASIndex2(102...
FILE: browncluster.go
type Cluster (line 15) | type Cluster
function ReadCluster (line 18) | func ReadCluster(r io.Reader) map[string]Cluster {
FILE: cmd/demo/io.go
constant posModelFile (line 13) | posModelFile = `model/pos_stanfordtags_universalrel.final.model`
constant depModelFile (line 14) | depModelFile = `model/dep_stanfordtags_universalrel.final.model`
constant brownCluster (line 15) | brownCluster = `clusters.txt`
function io (line 18) | func io() {
FILE: cmd/demo/main.go
function main (line 13) | func main() {
FILE: cmd/demo/nlp.go
type stemmer (line 20) | type stemmer struct
method Stem (line 22) | func (stemmer) Stem(a string) (string, error) {
type fixer (line 26) | type fixer struct
method Clusters (line 30) | func (f fixer) Clusters() (map[string]lingo.Cluster, error) { return c...
method Lemmatize (line 31) | func (f fixer) Lemmatize(a string, pt lingo.POSTag) ([]string, error) {
type nocomp (line 35) | type nocomp
method Error (line 37) | func (e nocomp) Error() string { return fmt.Sprintf("no %v", strin...
method Component (line 38) | func (e nocomp) Component() string { return string(e) }
function pipeline (line 40) | func pipeline(s string) (d *lingo.Dependency, err error) {
FILE: cmd/dep/fixer.go
type stemmer (line 10) | type stemmer struct
method Stem (line 12) | func (stemmer) Stem(a string) (string, error) {
type fixer (line 16) | type fixer struct
method Clusters (line 20) | func (f fixer) Clusters() (map[string]lingo.Cluster, error) { return c...
method Lemmatize (line 21) | func (f fixer) Lemmatize(a string, pt lingo.POSTag) ([]string, error) {
type nocomp (line 25) | type nocomp
method Error (line 27) | func (e nocomp) Error() string { return fmt.Sprintf("no %v", strin...
method Component (line 28) | func (e nocomp) Component() string { return string(e) }
FILE: cmd/dep/io.go
function validateFlags (line 11) | func validateFlags() {
function loadTreebanks (line 38) | func loadTreebanks() {
function loadPOSModel (line 48) | func loadPOSModel() {
function loadDepModel (line 58) | func loadDepModel() {
function saveModel (line 66) | func saveModel() {
FILE: cmd/dep/main.go
function init (line 34) | func init() {
function cleanup (line 44) | func cleanup(sigChan chan os.Signal, cpuprofiling, memprofiling bool) {
function main (line 65) | func main() {
FILE: cmd/dep/pipeline.go
function receive (line 14) | func receive(deps chan *lingo.Dependency, errs, errChan chan error) {
function pipeline (line 36) | func pipeline(s string) error {
FILE: cmd/dep/train.go
function train (line 14) | func train() {
FILE: cmd/lexer/main.go
function receieve (line 15) | func receieve() {
function main (line 21) | func main() {
FILE: cmd/pos/crossvalidation.go
type testResult (line 17) | type testResult struct
method compare (line 22) | func (tr testResult) compare() (int, bool) {
function crossValidate (line 44) | func crossValidate(resultChan chan testResult) {
function collect (line 86) | func collect(ch chan lingo.AnnotatedSentence, correct lingo.AnnotatedSen...
function testModel (line 94) | func testModel(sentences []treebank.SentenceTag) {
function cvpipeline (line 115) | func cvpipeline(s string, output chan lingo.AnnotatedSentence) {
FILE: cmd/pos/fixer.go
type stemmer (line 12) | type stemmer struct
method Stem (line 14) | func (stemmer) Stem(a string) (string, error) {
type fixer (line 18) | type fixer struct
method Clusters (line 22) | func (f fixer) Clusters() (map[string]lingo.Cluster, error) { return c...
method Lemmatize (line 23) | func (f fixer) Lemmatize(a string, pt lingo.POSTag) ([]string, error) {
type nocomp (line 27) | type nocomp
method Error (line 29) | func (e nocomp) Error() string { return fmt.Sprintf("no %v", strin...
method Component (line 30) | func (e nocomp) Component() string { return string(e) }
FILE: cmd/pos/main.go
function receive (line 37) | func receive(sentences chan lingo.AnnotatedSentence, wg *sync.WaitGroup) {
function pipeline (line 46) | func pipeline(s string) {
function validateFlags (line 62) | func validateFlags() {
function loadOrTrain (line 82) | func loadOrTrain() {
function cleanup (line 135) | func cleanup(sigChan chan os.Signal, profiling bool) {
function main (line 146) | func main() {
FILE: corpus/consopt.go
type ConsOpt (line 14) | type ConsOpt
function WithWords (line 17) | func WithWords(a []string) ConsOpt {
function WithOrderedWords (line 53) | func WithOrderedWords(a []string) ConsOpt {
function WithSize (line 84) | func WithSize(size int) ConsOpt {
function FromDict (line 94) | func FromDict(d map[string]int) ConsOpt {
function FromDictWithFreq (line 125) | func FromDictWithFreq(d map[string]struct{ ID, Freq int }) ConsOpt {
FILE: corpus/corpus.go
type Corpus (line 12) | type Corpus struct
method Id (line 66) | func (c *Corpus) Id(word string) (int, bool) {
method Word (line 72) | func (c *Corpus) Word(id int) (string, bool) {
method Add (line 83) | func (c *Corpus) Add(word string) int {
method Size (line 105) | func (c *Corpus) Size() int {
method WordFreq (line 111) | func (c *Corpus) WordFreq(word string) int {
method IDFreq (line 121) | func (c *Corpus) IDFreq(id int) int {
method TotalFreq (line 132) | func (c *Corpus) TotalFreq() int {
method MaxWordLength (line 137) | func (c *Corpus) MaxWordLength() int {
method WordProb (line 142) | func (c *Corpus) WordProb(word string) (float64, bool) {
method Merge (line 154) | func (c *Corpus) Merge(other *Corpus) {
method Replace (line 172) | func (c *Corpus) Replace(a, with string) error {
method ReplaceWord (line 186) | func (c *Corpus) ReplaceWord(id int, with string) error {
function New (line 25) | func New() *Corpus {
function Construct (line 42) | func Construct(opts ...ConsOpt) (*Corpus, error) {
FILE: corpus/corpus_test.go
function TestCorpus (line 9) | func TestCorpus(t *testing.T) {
function TestCorpus_Merge (line 48) | func TestCorpus_Merge(t *testing.T) {
FILE: corpus/functions.go
function GenerateCorpus (line 14) | func GenerateCorpus(sentenceTags []treebank.SentenceTag) *Corpus {
function ViterbiSplit (line 61) | func ViterbiSplit(input string, c *Corpus) []string {
function CosineSimilarity (line 124) | func CosineSimilarity(a, b []string) float64 {
function DamerauLevenshtein (line 174) | func DamerauLevenshtein(s1 string, s2 string) (distance int) {
function LongestCommonPrefix (line 274) | func LongestCommonPrefix(strs ...string) string {
function StrsToInts (line 311) | func StrsToInts(strs []string) (retVal []int, err error) {
function CombineInts (line 336) | func CombineInts(ints []int) int {
FILE: corpus/functions_test.go
function Test_GenerateCorpus (line 10) | func Test_GenerateCorpus(t *testing.T) {
function TestViterbiSplit (line 28) | func TestViterbiSplit(t *testing.T) {
function TestCosineSimilarity (line 45) | func TestCosineSimilarity(t *testing.T) {
function TestDL (line 73) | func TestDL(t *testing.T) {
function TestLCP (line 101) | func TestLCP(t *testing.T) {
function TestParseNumber (line 128) | func TestParseNumber(t *testing.T) {
FILE: corpus/inflection.go
type conversionPattern (line 9) | type conversionPattern struct
function newConversionPattern (line 14) | func newConversionPattern(from, to string) conversionPattern {
function Pluralize (line 94) | func Pluralize(word string) string {
function Singularize (line 115) | func Singularize(word string) string {
FILE: corpus/inflection_test.go
function TestPluralize (line 28) | func TestPluralize(t *testing.T) {
function TestSingularize (line 37) | func TestSingularize(t *testing.T) {
FILE: corpus/io.go
type sortutil (line 13) | type sortutil struct
method Len (line 19) | func (s *sortutil) Len() int { return len(s.words) }
method Less (line 20) | func (s *sortutil) Less(i, j int) bool { return s.ids[i] < s.ids[j] }
method Swap (line 21) | func (s *sortutil) Swap(i, j int) {
function ToDictWithFreq (line 30) | func ToDictWithFreq(c *Corpus) map[string]struct{ ID, Freq int } {
function ToDict (line 39) | func ToDict(c *Corpus) map[string]int {
method GobEncode (line 48) | func (c *Corpus) GobEncode() ([]byte, error) {
method GobDecode (line 80) | func (c *Corpus) GobDecode(buf []byte) error {
method LoadOneGram (line 119) | func (c *Corpus) LoadOneGram(r io.Reader) error {
FILE: corpus/io_test.go
function TestCorpusGob (line 12) | func TestCorpusGob(t *testing.T) {
function TestCorpusToDict (line 43) | func TestCorpusToDict(t *testing.T) {
function TestCorpusToDictWithFreq (line 60) | func TestCorpusToDictWithFreq(t *testing.T) {
function TestLoadOneGram (line 73) | func TestLoadOneGram(t *testing.T) {
FILE: corpus/lda.go
type LDAModel (line 9) | type LDAModel struct
method init (line 36) | func (l *LDAModel) init() {
FILE: corpus/test_test.go
constant sample1Gram (line 9) | sample1Gram = `the 23135851162
function mediumSentence (line 17) | func mediumSentence() []treebank.SentenceTag {
constant EPSILON64 (line 44) | EPSILON64 float64 = 1e-10
function floatEquals64 (line 46) | func floatEquals64(a, b float64) bool {
FILE: corpus/utils.go
function minInt (line 8) | func minInt(a, b int) int {
function maxInt (line 15) | func maxInt(a, b int) int {
function dot (line 22) | func dot(a, b []float64) (float64, error) {
function mag (line 34) | func mag(a []float64) (float64, error) {
FILE: dep/arcStandard.go
method canApply (line 8) | func (c *configuration) canApply(t transition) bool {
method apply (line 45) | func (c *configuration) apply(t transition) {
method oracle (line 62) | func (c *configuration) oracle(goldParse *lingo.Dependency) (t transitio...
FILE: dep/arcStandard_test.go
function TestCanApply (line 10) | func TestCanApply(t *testing.T) {
function TestOracle (line 67) | func TestOracle(t *testing.T) {
FILE: dep/configuration.go
type head (line 11) | type head
constant DOES_NOT_EXIST (line 14) | DOES_NOT_EXIST head = iota - 1
type configuration (line 18) | type configuration struct
method String (line 50) | func (c *configuration) String() string {
method GoString (line 54) | func (c *configuration) GoString() string {
method bufferSize (line 58) | func (c *configuration) bufferSize() int {
method stackSize (line 62) | func (c *configuration) stackSize() int {
method head (line 66) | func (c *configuration) head(i int) head {
method stackValue (line 72) | func (c *configuration) stackValue(i int) head {
method bufferValue (line 80) | func (c *configuration) bufferValue(i int) head {
method pop (line 91) | func (c *configuration) pop() head {
method removeStack (line 98) | func (c *configuration) removeStack(i int) {
method removeSecondTopStack (line 103) | func (c *configuration) removeSecondTopStack() bool {
method removeTopStack (line 113) | func (c *configuration) removeTopStack() bool {
method label (line 125) | func (c *configuration) label(i head) lingo.DependencyType {
method annotation (line 141) | func (c *configuration) annotation(i head) *lingo.Annotation {
method lc (line 157) | func (c *configuration) lc(k, cnt head) head {
method rc (line 174) | func (c *configuration) rc(k, cnt head) head {
method hasOtherChildren (line 191) | func (c *configuration) hasOtherChildren(i int, goldParse *lingo.Depen...
method isTerminal (line 200) | func (c *configuration) isTerminal() bool {
method shift (line 205) | func (c *configuration) shift() bool {
function newConfiguration (line 26) | func newConfiguration(sentence lingo.AnnotatedSentence, fromGold bool) *...
FILE: dep/configuration_test.go
function TestStackAppendRemove (line 10) | func TestStackAppendRemove(t *testing.T) {
function TestConfiguration_StackValue (line 40) | func TestConfiguration_StackValue(t *testing.T) {
FILE: dep/debug.go
constant BUILD_DEBUG (line 16) | BUILD_DEBUG = "PARSER: DEBUG BUILD"
constant BUILD_DIAG (line 17) | BUILD_DIAG = "Diagnostic Build"
constant DEBUG (line 19) | DEBUG = true
function tabcount (line 25) | func tabcount() int {
function enterLoggingContext (line 29) | func enterLoggingContext() {
function leaveLoggingContext (line 35) | func leaveLoggingContext() {
function logf (line 48) | func logf(format string, others ...interface{}) {
function logTrainingProgress (line 55) | func logTrainingProgress(iteration, correct, total, length, possibles in...
function logMemStats (line 64) | func logMemStats() {
function recoverFrom (line 79) | func recoverFrom(format string, attrs ...interface{}) {
method SprintFeatures (line 87) | func (d *Parser) SprintFeatures(features []int) string {
function SprintScores (line 120) | func SprintScores(scores []float64, ts []transition) string {
function SprintFloatSlice (line 132) | func SprintFloatSlice(a []float64) string {
FILE: dep/dependencyParser.go
type Parser (line 17) | type Parser struct
method Run (line 38) | func (d *Parser) Run() {
method predict (line 52) | func (d *Parser) predict(sentence lingo.AnnotatedSentence) (*lingo.Dep...
method String (line 111) | func (d *Parser) String() string {
function New (line 26) | func New(m *Model) *Parser {
FILE: dep/errors.go
type componentUnavailable (line 9) | type componentUnavailable
method Error (line 11) | func (c componentUnavailable) Error() string { return fmt.Sprintf(...
method Component (line 12) | func (c componentUnavailable) Component() string { return string(c) }
type TarpitError (line 17) | type TarpitError struct
method Error (line 19) | func (err TarpitError) Error() string { return "Tarpit Error" }
type NonProjectiveError (line 22) | type NonProjectiveError struct
method Error (line 24) | func (err NonProjectiveError) Error() string { return "Non-projective ...
FILE: dep/evaluation.go
type Performance (line 12) | type Performance struct
method String (line 20) | func (p Performance) String() string {
function Evaluate (line 33) | func Evaluate(predictedTrees, goldTrees []*lingo.Dependency) Performance {
method crossValidate (line 87) | func (t *Trainer) crossValidate(st []treebank.SentenceTag) Performance {
method predMany (line 97) | func (t *Trainer) predMany(sentenceTags []treebank.SentenceTag) []*lingo...
method pred (line 110) | func (t *Trainer) pred(as lingo.AnnotatedSentence) (*lingo.Dependency, e...
FILE: dep/example.go
type example (line 12) | type example struct
function makeExamples (line 19) | func makeExamples(sentenceTags []treebank.SentenceTag, conf NNConfig, di...
function makeOneExample (line 43) | func makeOneExample(i int, sentenceTag treebank.SentenceTag, dict *corpu...
function shuffleExamples (line 84) | func shuffleExamples(a []example) {
FILE: dep/example_test.go
function TestMakeExamples (line 9) | func TestMakeExamples(t *testing.T) {
FILE: dep/featureExtraction.go
function getFeatures (line 9) | func getFeatures(c *configuration, dict *corpus.Corpus) []int {
constant POS_OFFSET (line 141) | POS_OFFSET int = 18
constant DEP_OFFSET (line 142) | DEP_OFFSET = 36
constant STACK_OFFSET (line 143) | STACK_OFFSET = 6
constant STACK_NUMBER (line 144) | STACK_NUMBER = 6
FILE: dep/features.go
type feature (line 8) | type feature
constant s0w (line 15) | s0w feature = iota
constant s1w (line 16) | s1w
constant s2w (line 17) | s2w
constant b0w (line 19) | b0w
constant b1w (line 20) | b1w
constant b2w (line 21) | b2w
constant s0l1w (line 23) | s0l1w
constant s0r1w (line 24) | s0r1w
constant s0l2w (line 25) | s0l2w
constant s0r2w (line 26) | s0r2w
constant s0llw (line 27) | s0llw
constant s0rrw (line 28) | s0rrw
constant s1l1w (line 30) | s1l1w
constant s1r1w (line 31) | s1r1w
constant s1l2w (line 32) | s1l2w
constant s1r2w (line 33) | s1r2w
constant s1llw (line 34) | s1llw
constant s1rrw (line 35) | s1rrw
constant s0t (line 38) | s0t
constant s1t (line 39) | s1t
constant s2t (line 40) | s2t
constant b0t (line 42) | b0t
constant b1t (line 43) | b1t
constant b2t (line 44) | b2t
constant s0l1t (line 46) | s0l1t
constant s0r1t (line 47) | s0r1t
constant s0l2t (line 48) | s0l2t
constant s0r2t (line 49) | s0r2t
constant s0llt (line 50) | s0llt
constant s0rrt (line 51) | s0rrt
constant s1l1t (line 53) | s1l1t
constant s1r1t (line 54) | s1r1t
constant s1l2t (line 55) | s1l2t
constant s1r2t (line 56) | s1r2t
constant s1llt (line 57) | s1llt
constant s1rrt (line 58) | s1rrt
constant s0l1d (line 61) | s0l1d
constant s0r1d (line 62) | s0r1d
constant s0l2d (line 63) | s0l2d
constant s0r2d (line 64) | s0r2d
constant s0lld (line 65) | s0lld
constant s0rrd (line 66) | s0rrd
constant s1l1d (line 68) | s1l1d
constant s1r1d (line 69) | s1r1d
constant s1l2d (line 70) | s1l2d
constant s1r2d (line 71) | s1r2d
constant s1lld (line 72) | s1lld
constant s1rrd (line 73) | s1rrd
constant MAXFEATURE (line 75) | MAXFEATURE
constant wordFeatsStartAt (line 79) | wordFeatsStartAt int = int(lingo.MAXTAG) + int(lingo.MAXDEPTYPE)
constant labelFeatsStartAt (line 80) | labelFeatsStartAt = int(lingo.MAXTAG)
constant posFeatsStartAt (line 81) | posFeatsStartAt = 0
FILE: dep/features_string.go
constant _feature_name (line 7) | _feature_name = "s0ws1ws2wb0wb1wb2ws0l1ws0r1ws0l2ws0r2ws0llws0rrws1l1ws1...
method String (line 11) | func (i feature) String() string {
FILE: dep/fix.go
function fix (line 10) | func fix(d *lingo.Dependency) {
function properNounSpans (line 98) | func properNounSpans(d *lingo.Dependency) (retVal []span) {
FILE: dep/init.go
function init (line 5) | func init() {
FILE: dep/models.go
type Model (line 17) | type Model struct
method Corpus (line 23) | func (m *Model) Corpus() *corpus.Corpus { return m.corpus }
method WordEmbeddings (line 25) | func (m *Model) WordEmbeddings() *tensor.Dense {
method POSTagEmbeddings (line 31) | func (m *Model) POSTagEmbeddings() *tensor.Dense {
method String (line 37) | func (m *Model) String() string {
method Save (line 48) | func (m *Model) Save(filename string) error {
method SaveWriter (line 60) | func (m *Model) SaveWriter(f io.WriteCloser) error {
function Load (line 81) | func Load(filename string) (*Model, error) {
function LoadReader (line 89) | func LoadReader(rd io.ReadCloser) (*Model, error) {
FILE: dep/models_test.go
function TestModel_SaveLoad (line 11) | func TestModel_SaveLoad(t *testing.T) {
FILE: dep/move.go
type Move (line 4) | type Move
constant Shift (line 9) | Shift Move = iota
constant Left (line 10) | Left
constant Right (line 11) | Right
constant MAXMOVE (line 13) | MAXMOVE
FILE: dep/move_string.go
constant _Move_name (line 7) | _Move_name = "ShiftLeftRightMAXMOVE"
method String (line 11) | func (i Move) String() string {
FILE: dep/nn2.go
type may (line 12) | type may struct
method doUnary (line 17) | func (m *may) doUnary(fn func(*G.Node) (*G.Node, error)) {
method doBinary (line 24) | func (m *may) doBinary(fn func(a, b *G.Node) (*G.Node, error), other *...
method doSwapBinary (line 31) | func (m *may) doSwapBinary(fn func(a, b *G.Node) (*G.Node, error), oth...
type neuralnetwork2 (line 38) | type neuralnetwork2 struct
method initialized (line 92) | func (nn *neuralnetwork2) initialized() bool {
method init (line 102) | func (nn *neuralnetwork2) init() error {
method fwd (line 211) | func (nn *neuralnetwork2) fwd() error {
method costProgress (line 288) | func (nn *neuralnetwork2) costProgress() <-chan G.Value {
method train (line 296) | func (nn *neuralnetwork2) train(examples []example) error {
method pred (line 347) | func (nn *neuralnetwork2) pred(ind []int) (int, error) {
method feats2vec (line 372) | func (nn *neuralnetwork2) feats2vec(indicators []int) error {
FILE: dep/nn2_io.go
method String (line 15) | func (nn *neuralnetwork2) String() string {
method GobEncode (line 41) | func (nn *neuralnetwork2) GobEncode() ([]byte, error) {
method GobDecode (line 87) | func (nn *neuralnetwork2) GobDecode(buf []byte) error {
FILE: dep/nn2_io_test.go
function TestNNIO (line 14) | func TestNNIO(t *testing.T) {
FILE: dep/nn2_test.go
function TestNN2 (line 12) | func TestNN2(t *testing.T) {
FILE: dep/nnconfig.go
type NNConfig (line 13) | type NNConfig struct
method String (line 28) | func (c NNConfig) String() string {
method GobEncode (line 48) | func (c NNConfig) GobEncode() ([]byte, error) {
method GobDecode (line 73) | func (c *NNConfig) GobDecode(p []byte) error {
function init (line 101) | func init() {
FILE: dep/release.go
constant BUILD_DEBUG (line 5) | BUILD_DEBUG = "PARSER: RELEASE BUILD"
constant BUILD_DIAG (line 6) | BUILD_DIAG = "Non-Diagnostic Build"
constant DEBUG (line 8) | DEBUG = false
function enterLoggingContext (line 14) | func enterLoggingContext() {}
function leaveLoggingContext (line 16) | func leaveLoggingContext() {}
function logTrainingProgress (line 18) | func logTrainingProgress(iteration, correct, total, length, possibles in...
function logMemStats (line 20) | func logMemStats() {}
function logf (line 22) | func logf(format string, others ...interface{}) {}
function recoverFrom (line 24) | func recoverFrom(format string, attrs ...interface{}) {}
method SprintFeatures (line 26) | func (d *Parser) SprintFeatures(feature []int) string { return "" }
function SprintScores (line 28) | func SprintScores(scores []float64, ts []transition) string { return "" }
FILE: dep/span.go
type span (line 3) | type span struct
method combine (line 14) | func (s span) combine(other span) span {
function makeSpan (line 7) | func makeSpan(start, end int) span {
FILE: dep/test_test.go
type dummyLem (line 18) | type dummyLem struct
method Lemmatize (line 20) | func (dummyLem) Lemmatize(s string, pt lingo.POSTag) ([]string, error) {
type dummyStemmer (line 24) | type dummyStemmer struct
method Stem (line 26) | func (dummyStemmer) Stem(s string) (string, error) {
type dummyFix (line 30) | type dummyFix struct
method Clusters (line 35) | func (dummyFix) Clusters() (map[string]lingo.Cluster, error) {
constant nnps (line 39) | nnps = `1 Guerrillas guerrilla NOUN NNS Number=Plur 2 nsubj _ _
constant simple (line 61) | simple = `1 Yet yet CONJ CC _ 5 cc _ _
constant med (line 74) | med = `1 President President PROPN NNP Number=Sing 2 compound _ _
constant long (line 96) | long = `1 Now now ADV RB _ 5 advmod _ _
constant cvconllu (line 142) | cvconllu = `1 Google Google PROPN NNP Number=Sing 6 nsubj _ _
function lotsaNNP (line 161) | func lotsaNNP() *lingo.Dependency {
function simpleSentence (line 169) | func simpleSentence() []treebank.SentenceTag {
function mediumSentence (line 174) | func mediumSentence() []treebank.SentenceTag {
function longSentence (line 180) | func longSentence() []treebank.SentenceTag {
function allSentences (line 185) | func allSentences() []treebank.SentenceTag {
function cvSentences (line 193) | func cvSentences() []treebank.SentenceTag {
function hash (line 197) | func hash(s string) string {
function cache (line 203) | func cache(input string, s lingo.AnnotatedSentence) {
function useCached (line 221) | func useCached(filename string) *lingo.Dependency {
FILE: dep/train.go
type TrainerConsOpt (line 15) | type TrainerConsOpt
function WithTrainingModel (line 18) | func WithTrainingModel(m *Model) TrainerConsOpt {
function WithTrainingSet (line 26) | func WithTrainingSet(st []treebank.SentenceTag) TrainerConsOpt {
function WithCrossValidationSet (line 34) | func WithCrossValidationSet(st []treebank.SentenceTag) TrainerConsOpt {
function WithConfig (line 42) | func WithConfig(conf NNConfig) TrainerConsOpt {
function WithLemmatizer (line 53) | func WithLemmatizer(l lingo.Lemmatizer) TrainerConsOpt {
function WithStemmer (line 66) | func WithStemmer(s lingo.Stemmer) TrainerConsOpt {
function WithCluster (line 78) | func WithCluster(c map[string]lingo.Cluster) TrainerConsOpt {
function WithCorpus (line 86) | func WithCorpus(c *corpus.Corpus) TrainerConsOpt {
function WithGeneratedCorpus (line 95) | func WithGeneratedCorpus(sts ...treebank.SentenceTag) TrainerConsOpt {
type Trainer (line 110) | type Trainer struct
method Lemmatize (line 153) | func (t *Trainer) Lemmatize(a string, pt lingo.POSTag) ([]string, erro...
method Stem (line 161) | func (t *Trainer) Stem(a string) (string, error) {
method Clusters (line 169) | func (t *Trainer) Clusters() (map[string]lingo.Cluster, error) {
method Cost (line 180) | func (t *Trainer) Cost() <-chan float64 {
method Perf (line 188) | func (t *Trainer) Perf() <-chan Performance {
method Init (line 198) | func (t *Trainer) Init() (err error) {
method Train (line 209) | func (t *Trainer) Train(epochs int) error {
method TrainWithoutCrossValidation (line 220) | func (t *Trainer) TrainWithoutCrossValidation(epochs int) error {
method train (line 225) | func (t *Trainer) train(epochs int) error {
method crossValidateTrain (line 257) | func (t *Trainer) crossValidateTrain(epochs int) error {
method pretrainCheck (line 319) | func (t *Trainer) pretrainCheck() error {
method handleCosts (line 337) | func (t *Trainer) handleCosts() (epochChan chan struct{}) {
function NewTrainer (line 133) | func NewTrainer(opts ...TrainerConsOpt) *Trainer {
FILE: dep/train_test.go
function TestTrainerInitializations (line 11) | func TestTrainerInitializations(t *testing.T) {
function TestTrainer_train (line 49) | func TestTrainer_train(t *testing.T) {
function TestTestTrainer_crossValidateTrain (line 136) | func TestTestTrainer_crossValidateTrain(t *testing.T) {
FILE: dep/transition.go
type transition (line 10) | type transition struct
method String (line 51) | func (t transition) String() string {
function buildTransitions (line 18) | func buildTransitions(labels []lingo.DependencyType) []transition {
function lookupTransition (line 55) | func lookupTransition(t transition, table []transition) int {
function init (line 65) | func init() {
FILE: dep/util.go
function minInt (line 3) | func minInt(a, b int) int {
function maxInt (line 10) | func maxInt(a, b int) int {
FILE: dependency.go
type Dependency (line 13) | type Dependency struct
method Sentence (line 64) | func (d *Dependency) Sentence() AnnotatedSentence { return d.Annotated...
method Lefts (line 65) | func (d *Dependency) Lefts() [][]int { return d.lefts }
method Rights (line 66) | func (d *Dependency) Rights() [][]int { return d.rights }
method WordCount (line 67) | func (d *Dependency) WordCount() int { return d.wordCount }
method N (line 68) | func (d *Dependency) N() int { return d.n }
method SetLefts (line 71) | func (d *Dependency) SetLefts(l [][]int) { d.lefts = l }
method SetRights (line 72) | func (d *Dependency) SetRights(r [][]int) { d.rights = r }
method Head (line 74) | func (d *Dependency) Head(i int) int {
method Label (line 82) | func (d *Dependency) Label(i int) DependencyType {
method Annotation (line 90) | func (d *Dependency) Annotation(i int) *Annotation {
method AddArc (line 98) | func (d *Dependency) AddArc(head, child int, label DependencyType) {
method AddChild (line 103) | func (d *Dependency) AddChild(head, child int) {
method AddRel (line 116) | func (d *Dependency) AddRel(child int, rel DependencyType) {
method HasSingleRoot (line 121) | func (d *Dependency) HasSingleRoot() bool {
method IsLegal (line 133) | func (d *Dependency) IsLegal() bool {
method IsProjective (line 159) | func (d *Dependency) IsProjective() bool {
method projectiveVisit (line 164) | func (d *Dependency) projectiveVisit(w int) bool {
method Root (line 186) | func (d *Dependency) Root() int {
method SprintRel (line 196) | func (d *Dependency) SprintRel() string {
type depConsOpt (line 26) | type depConsOpt
function FromAnnotatedSentence (line 29) | func FromAnnotatedSentence(s AnnotatedSentence) depConsOpt {
function AllocTree (line 40) | func AllocTree() depConsOpt {
function NewDependency (line 55) | func NewDependency(opts ...depConsOpt) *Dependency {
type DependencyEdge (line 206) | type DependencyEdge struct
type edgeByID (line 214) | type edgeByID
method Len (line 216) | func (b edgeByID) Len() int { return len(b) }
method Swap (line 217) | func (b edgeByID) Swap(i, j int) { b[i], b[j] = b[j], b[i] }
method Less (line 218) | func (b edgeByID) Less(i, j int) bool { return b[i].Dep.ID < b[j].Dep....
FILE: dependencyTree.go
type DependencyTree (line 13) | type DependencyTree struct
method AddChild (line 32) | func (d *DependencyTree) AddChild(child *DependencyTree) {
method AddRel (line 36) | func (d *DependencyTree) AddRel(rel DependencyType) {
method walk (line 40) | func (d *DependencyTree) walk(c chan *DependencyTree, wg *sync.WaitGro...
method Dot (line 50) | func (d *DependencyTree) Dot() string {
method Walk (line 91) | func (d *DependencyTree) Walk(fn func(interface{})) {
function NewDependencyTree (line 23) | func NewDependencyTree(parent *DependencyTree, ID int, ann *Annotation) ...
function dotString (line 65) | func dotString(c chan *DependencyTree, out chan string) {
FILE: dependencyType.go
type DependencyType (line 9) | type DependencyType
method MarshalText (line 22) | func (dt DependencyType) MarshalText() ([]byte, error) {
method UnmarshalText (line 26) | func (dt *DependencyType) UnmarshalText(text []byte) error {
function init (line 13) | func init() {
function InDepTypes (line 35) | func InDepTypes(x DependencyType, set []DependencyType) bool {
function IsModifier (line 44) | func IsModifier(x DependencyType) bool { return InDepTypes(x, Modif...
function IsCompound (line 45) | func IsCompound(x DependencyType) bool { return InDepTypes(x, Compo...
function IsDeterminerRel (line 46) | func IsDeterminerRel(x DependencyType) bool { return InDepTypes(x, Deter...
function IsMultiword (line 47) | func IsMultiword(x DependencyType) bool { return InDepTypes(x, Multi...
function IsQuantifier (line 48) | func IsQuantifier(x DependencyType) bool { return InDepTypes(x, Quant...
FILE: dependencyType_stanford.go
constant BUILD_RELSET (line 5) | BUILD_RELSET = "stanfordrel"
constant NoDepType (line 11) | NoDepType DependencyType = iota
constant Dep (line 12) | Dep
constant Root (line 13) | Root
constant Aux (line 14) | Aux
constant AuxPass (line 15) | AuxPass
constant Cop (line 16) | Cop
constant Arg (line 17) | Arg
constant Agent (line 18) | Agent
constant Comp (line 19) | Comp
constant AComp (line 20) | AComp
constant CComp (line 21) | CComp
constant XComp (line 22) | XComp
constant Obj (line 23) | Obj
constant DObj (line 24) | DObj
constant IObj (line 25) | IObj
constant PObj (line 26) | PObj
constant Subj (line 27) | Subj
constant NSubj (line 28) | NSubj
constant NSubjPass (line 29) | NSubjPass
constant CSubj (line 30) | CSubj
constant CSubjPass (line 31) | CSubjPass
constant Coordination (line 32) | Coordination
constant Conj (line 33) | Conj
constant Expl (line 34) | Expl
constant Mod (line 35) | Mod
constant AMod (line 36) | AMod
constant Appos (line 37) | Appos
constant Advcl (line 38) | Advcl
constant Det (line 39) | Det
constant Predet (line 40) | Predet
constant Preconj (line 41) | Preconj
constant Vmod (line 42) | Vmod
constant MWE (line 43) | MWE
constant Mark (line 44) | Mark
constant AdvMod (line 45) | AdvMod
constant Neg (line 46) | Neg
constant RCMod (line 47) | RCMod
constant QuantMod (line 48) | QuantMod
constant NounMod (line 49) | NounMod
constant NPAdvMod (line 50) | NPAdvMod
constant TMod (line 51) | TMod
constant Num (line 52) | Num
constant NumberElement (line 53) | NumberElement
constant Prep (line 54) | Prep
constant Poss (line 55) | Poss
constant Possessive (line 56) | Possessive
constant PRT (line 57) | PRT
constant Parataxis (line 58) | Parataxis
constant GoesWith (line 59) | GoesWith
constant Punct (line 60) | Punct
constant Ref (line 61) | Ref
constant SDep (line 62) | SDep
constant XSubj (line 63) | XSubj
constant Case (line 66) | Case
constant Compound (line 67) | Compound
constant NMod (line 68) | NMod
constant Discourse (line 69) | Discourse
constant NumMod (line 70) | NumMod
constant RelCl (line 71) | RelCl
constant NFinCl (line 72) | NFinCl
constant NMod_Poss (line 73) | NMod_Poss
constant NMod_NPMod (line 74) | NMod_NPMod
constant Vocative (line 75) | Vocative
constant List (line 76) | List
constant MWPrep (line 77) | MWPrep
constant Remnant (line 78) | Remnant
constant Acl (line 79) | Acl
constant NPMod (line 80) | NPMod
constant MDVod (line 81) | MDVod
constant DetMod (line 82) | DetMod
constant PComp (line 85) | PComp
constant MAXDEPTYPE (line 87) | MAXDEPTYPE
FILE: dependencyType_stanford_string.go
constant _DependencyType_name (line 9) | _DependencyType_name = "NoDepTypeDepRootNSubjNSubjPassDObjIObjCSubjCSubj...
method String (line 13) | func (i DependencyType) String() string {
FILE: dependencyType_universal.go
constant BUILD_RELSET (line 5) | BUILD_RELSET = "universalrel"
constant NoDepType (line 11) | NoDepType DependencyType = iota
constant Dep (line 12) | Dep
constant Root (line 13) | Root
constant NSubj (line 18) | NSubj
constant NSubjPass (line 19) | NSubjPass
constant DObj (line 20) | DObj
constant IObj (line 21) | IObj
constant CSubj (line 24) | CSubj
constant CSubjPass (line 25) | CSubjPass
constant CComp (line 26) | CComp
constant XComp (line 28) | XComp
constant NumMod (line 33) | NumMod
constant Appos (line 34) | Appos
constant NMod (line 35) | NMod
constant ACl (line 38) | ACl
constant ACl_RelCl (line 39) | ACl_RelCl
constant Det (line 40) | Det
constant Det_PreDet (line 41) | Det_PreDet
constant AMod (line 44) | AMod
constant Neg (line 45) | Neg
constant Case (line 48) | Case
constant NMod_NPMod (line 53) | NMod_NPMod
constant NMod_TMod (line 54) | NMod_TMod
constant NMod_Poss (line 55) | NMod_Poss
constant AdvCl (line 58) | AdvCl
constant AdvMod (line 61) | AdvMod
constant Compound (line 64) | Compound
constant Compound_Part (line 65) | Compound_Part
constant Name (line 66) | Name
constant MWE (line 67) | MWE
constant Foreign (line 68) | Foreign
constant GoesWith (line 69) | GoesWith
constant List (line 72) | List
constant Dislocated (line 73) | Dislocated
constant Parataxis (line 74) | Parataxis
constant Remnant (line 75) | Remnant
constant Reparandum (line 76) | Reparandum
constant Vocative (line 81) | Vocative
constant Discourse (line 82) | Discourse
constant Expl (line 83) | Expl
constant Aux (line 86) | Aux
constant AuxPass (line 87) | AuxPass
constant Cop (line 88) | Cop
constant Mark (line 91) | Mark
constant Punct (line 92) | Punct
constant Conj (line 96) | Conj
constant Coordination (line 97) | Coordination
constant CC_PreConj (line 98) | CC_PreConj
constant MAXDEPTYPE (line 100) | MAXDEPTYPE
FILE: dependencyType_universal_string.go
constant _DependencyType_name (line 9) | _DependencyType_name = "NoDepTypeDepRootNSubjNSubjPassDObjIObjCSubjCSubj...
method String (line 13) | func (i DependencyType) String() string {
FILE: errors.go
type componentUnavailable (line 3) | type componentUnavailable interface
FILE: interfaces.go
type Lemmatizer (line 10) | type Lemmatizer interface
type Stemmer (line 15) | type Stemmer interface
type Sentencer (line 20) | type Sentencer interface
type Corpus (line 25) | type Corpus interface
type WordEmbeddings (line 59) | type WordEmbeddings interface
FILE: io.go
type dummyAnnotation (line 12) | type dummyAnnotation struct
method MarshalJSON (line 39) | func (a *Annotation) MarshalJSON() ([]byte, error) {
method UnmarshalJSON (line 81) | func (a *Annotation) UnmarshalJSON(b []byte) error {
method MarshalJSON (line 105) | func (as AnnotatedSentence) MarshalJSON() ([]byte, error) {
method UnmarshalJSON (line 122) | func (as *AnnotatedSentence) UnmarshalJSON(b []byte) error {
FILE: io_test.go
function TestAnnotationJSON (line 8) | func TestAnnotationJSON(t *testing.T) {
function TestAnnotatedSentenceJSON (line 40) | func TestAnnotatedSentenceJSON(t *testing.T) {
FILE: lexeme.go
type LexemeType (line 10) | type LexemeType
constant EOF (line 13) | EOF LexemeType = iota
constant Word (line 14) | Word
constant Disambig (line 15) | Disambig
constant URI (line 16) | URI
constant Number (line 17) | Number
constant Date (line 18) | Date
constant Time (line 19) | Time
constant Punctuation (line 20) | Punctuation
constant Symbol (line 21) | Symbol
constant Space (line 22) | Space
constant SystemUse (line 23) | SystemUse
type Lexeme (line 26) | type Lexeme struct
method Fix (line 45) | func (l Lexeme) Fix() Lexeme {
method String (line 53) | func (l Lexeme) String() string {
method GoString (line 62) | func (l Lexeme) GoString() string {
function MakeLexeme (line 35) | func MakeLexeme(s string, t LexemeType) Lexeme {
function StartLexeme (line 75) | func StartLexeme() Lexeme { return startLexeme }
function RootLexeme (line 76) | func RootLexeme() Lexeme { return rootLexeme }
function NullLexeme (line 77) | func NullLexeme() Lexeme { return nullLexeme }
FILE: lexemetype_string.go
constant _LexemeType_name (line 7) | _LexemeType_name = "EOFWordDisambigURINumberDateTimePunctuationSymbolSpa...
method String (line 11) | func (i LexemeType) String() string {
FILE: lexer/lexer.go
constant eof (line 15) | eof rune = -1
type Lexer (line 17) | type Lexer struct
method Run (line 54) | func (l *Lexer) Run() {
method Reset (line 64) | func (l *Lexer) Reset(r io.Reader) {
method next (line 73) | func (l *Lexer) next() rune {
method nextUntilEOF (line 87) | func (l *Lexer) nextUntilEOF(s string) bool {
method backup (line 98) | func (l *Lexer) backup() {
method peek (line 104) | func (l *Lexer) peek() rune {
method lineCount (line 118) | func (l *Lexer) lineCount() {
method accept (line 127) | func (l *Lexer) accept() {
method acceptRun (line 131) | func (l *Lexer) acceptRun(valid string) (accepted bool) {
method acceptRunFn (line 140) | func (l *Lexer) acceptRunFn(fn func(rune) bool) (accepted int) {
method ignore (line 149) | func (l *Lexer) ignore() {
method emit (line 154) | func (l *Lexer) emit(t lingo.LexemeType) {
function New (line 38) | func New(name string, r io.Reader) *Lexer {
FILE: lexer/lexer_test.go
type lexerTest (line 10) | type lexerTest struct
function testLexer (line 203) | func testLexer(lts *lexerTest) []lingo.Lexeme {
function TestLexer (line 214) | func TestLexer(t *testing.T) {
FILE: lexer/stateFn.go
type stateFn (line 9) | type stateFn
function lexText (line 11) | func lexText(l *Lexer) (fn stateFn) {
function lexNumber (line 120) | func lexNumber(l *Lexer) (fn stateFn) {
function lexWhitespace (line 165) | func lexWhitespace(l *Lexer) (fn stateFn) {
function lexPunctuation (line 190) | func lexPunctuation(l *Lexer) (fn stateFn) {
function lexSymbol (line 228) | func lexSymbol(l *Lexer) (fn stateFn) {
function lexURI (line 235) | func lexURI(l *Lexer) (fn stateFn) {
function lexDate (line 252) | func lexDate(l *Lexer) (fn stateFn) {
function lexTime (line 268) | func lexTime(l *Lexer) (fn stateFn) {
FILE: pos/allinone_test.go
function TestEverything (line 13) | func TestEverything(t *testing.T) {
FILE: pos/context.go
type contextType (line 30) | type contextType
constant featuresPerContext (line 32) | featuresPerContext = 8
constant contexts (line 33) | contexts = 5
constant prev2Word (line 36) | prev2Word contextType = iota
constant prev2Lemma (line 37) | prev2Lemma
constant prev2Cluster (line 38) | prev2Cluster
constant prev2Shape (line 39) | prev2Shape
constant prev2Prefix1 (line 40) | prev2Prefix1
constant prev2Suffix3 (line 41) | prev2Suffix3
constant prev2POSTag (line 42) | prev2POSTag
constant prev2Flags (line 43) | prev2Flags
constant prevWord (line 46) | prevWord
constant prevLemma (line 47) | prevLemma
constant prevCluster (line 48) | prevCluster
constant prevShape (line 49) | prevShape
constant prevPrefix1 (line 50) | prevPrefix1
constant prevSuffix3 (line 51) | prevSuffix3
constant prevPOSTag (line 52) | prevPOSTag
constant prevFlags (line 53) | prevFlags
constant ithWord (line 56) | ithWord
constant ithLemma (line 57) | ithLemma
constant ithCluster (line 58) | ithCluster
constant ithShape (line 59) | ithShape
constant ithPrefix1 (line 60) | ithPrefix1
constant ithSuffix3 (line 61) | ithSuffix3
constant ithPOSTag (line 62) | ithPOSTag
constant ithFlags (line 63) | ithFlags
constant nextWord (line 66) | nextWord
constant nextLemma (line 67) | nextLemma
constant nextCluster (line 68) | nextCluster
constant nextShape (line 69) | nextShape
constant nextPrefix1 (line 70) | nextPrefix1
constant nextSuffix3 (line 71) | nextSuffix3
constant nextPOSTag (line 72) | nextPOSTag
constant nextFlags (line 73) | nextFlags
constant next2Word (line 76) | next2Word
constant next2Lemma (line 77) | next2Lemma
constant next2Cluster (line 78) | next2Cluster
constant next2Shape (line 79) | next2Shape
constant next2Prefix1 (line 80) | next2Prefix1
constant next2Suffix3 (line 81) | next2Suffix3
constant next2POSTag (line 82) | next2POSTag
constant next2Flags (line 83) | next2Flags
constant MAXCONTEXTTYPE (line 85) | MAXCONTEXTTYPE
type contextMap (line 88) | type contextMap
function getContext (line 90) | func getContext(prev2, prev, ith, next, next2 *lingo.Annotation) (retVal...
function extractContext (line 120) | func extractContext(a *lingo.Annotation) (retVal [featuresPerContext]str...
FILE: pos/context_test.go
function TestExtractContext (line 26) | func TestExtractContext(t *testing.T) {
FILE: pos/contexttype_string.go
constant _contextType_name (line 7) | _contextType_name = "prev2Wordprev2Lemmaprev2Clusterprev2Shapeprev2Prefi...
method String (line 11) | func (i contextType) String() string {
FILE: pos/debug.go
constant BUILD_DEBUG (line 11) | BUILD_DEBUG = "POS TAGGER: Debug Build"
function tabcount (line 17) | func tabcount() int {
function enterLoggingContext (line 21) | func enterLoggingContext() {
function leaveLoggingContext (line 27) | func leaveLoggingContext() {
function logf (line 40) | func logf(format string, others ...interface{}) {
function recoverFrom (line 44) | func recoverFrom(format string, attrs ...interface{}) {
FILE: pos/errors.go
type componentUnavailable (line 5) | type componentUnavailable
method Error (line 7) | func (c componentUnavailable) Error() string { return fmt.Sprintf(...
method Component (line 8) | func (c componentUnavailable) Component() string { return string(c) }
FILE: pos/features.go
type featureType (line 10) | type featureType
constant bias (line 14) | bias featureType = iota
constant ithWord_ (line 16) | ithWord_
constant nextWord_ (line 17) | nextWord_
constant next2Word_ (line 18) | next2Word_
constant ithSuffix3_ (line 20) | ithSuffix3_
constant ithPrefix1_ (line 21) | ithPrefix1_
constant prevPOSTag_ (line 23) | prevPOSTag_
constant prev2POSTag_ (line 24) | prev2POSTag_
constant prevSuffix3_ (line 25) | prevSuffix3_
constant nextSuffix3_ (line 26) | nextSuffix3_
constant ithShape_ (line 28) | ithShape_
constant ithCluster_ (line 29) | ithCluster_
constant nextCluster_ (line 30) | nextCluster_
constant next2Cluster_ (line 31) | next2Cluster_
constant prevCluster_ (line 32) | prevCluster_
constant prev2Cluster_ (line 33) | prev2Cluster_
constant ithFlags_ (line 35) | ithFlags_
constant nextFlags_ (line 36) | nextFlags_
constant next2Flags_ (line 37) | next2Flags_
constant prevFlags_ (line 38) | prevFlags_
constant prev2Flags_ (line 39) | prev2Flags_
constant prevLemma_prevPOSTag (line 41) | prevLemma_prevPOSTag
constant prevPOSTag_ithWord (line 42) | prevPOSTag_ithWord
constant prevPOSTag_prev2POSTag (line 43) | prevPOSTag_prev2POSTag
constant prev2Lemma_prev2POSTag (line 44) | prev2Lemma_prev2POSTag
constant MAXFEATURETYPE (line 46) | MAXFEATURETYPE
type feature (line 76) | type feature interface
type singleFeature (line 81) | type singleFeature struct
method FeatType (line 86) | func (sf singleFeature) FeatType() featureType { return sf.featureType }
method String (line 87) | func (sf singleFeature) String() string {
type tupleFeature (line 91) | type tupleFeature struct
method FeatType (line 97) | func (tf tupleFeature) FeatType() featureType { return tf.featureType }
method String (line 98) | func (tf tupleFeature) String() string {
type featureMap (line 102) | type featureMap
method String (line 104) | func (fm featureMap) String() string {
method add (line 112) | func (fm *featureMap) add(f feature) { (*fm)[f]++ }
type sfFeatures (line 114) | type sfFeatures
type tfFeatures (line 115) | type tfFeatures
function fillFromContext (line 117) | func fillFromContext(c contextMap) (sf sfFeatures, tf tfFeatures) {
function getFeatures (line 130) | func getFeatures(s lingo.AnnotatedSentence, i int) (sfFeatures, tfFeatur...
FILE: pos/features_test.go
function TestGetFeatures (line 12) | func TestGetFeatures(t *testing.T) {
FILE: pos/featuretype_string.go
constant _featureType_name (line 7) | _featureType_name = "biasithWord_prevLemma_prevPOSTagprev2Lemma_prev2POS...
method String (line 11) | func (i featureType) String() string {
FILE: pos/models.go
type Model (line 13) | type Model struct
method Save (line 19) | func (m *Model) Save(filename string) error {
method SaveWriter (line 27) | func (m *Model) SaveWriter(f io.WriteCloser) error {
function Load (line 47) | func Load(filename string) (*Model, error) {
function LoadReader (line 55) | func LoadReader(rd io.ReadCloser) (*Model, error) {
method Load (line 76) | func (p *Tagger) Load(filename string) error {
FILE: pos/models_test.go
function TestSaveLoad (line 12) | func TestSaveLoad(t *testing.T) {
FILE: pos/perceptron.go
type perceptron (line 5) | type perceptron struct
method updateWeightsSF (line 35) | func (p *perceptron) updateWeightsSF(f singleFeature, tag lingo.POSTag...
method updateWeightsTF (line 46) | func (p *perceptron) updateWeightsTF(f tupleFeature, tag lingo.POSTag,...
method update (line 57) | func (p *perceptron) update(guess, truth lingo.POSTag, sf sfFeatures, ...
method predict (line 90) | func (p *perceptron) predict(sf sfFeatures, tf tfFeatures) lingo.POSTag {
method average (line 111) | func (p *perceptron) average() {
type fctuple (line 18) | type fctuple struct
function newPerceptron (line 23) | func newPerceptron() *perceptron {
FILE: pos/perceptron_io.go
method GobEncode (line 10) | func (sf singleFeature) GobEncode() ([]byte, error) {
method GobDecode (line 25) | func (sf *singleFeature) GobDecode(buf []byte) error {
method GobEncode (line 41) | func (tf tupleFeature) GobEncode() ([]byte, error) {
method GobDecode (line 60) | func (tf *tupleFeature) GobDecode(buf []byte) error {
method GobEncode (line 81) | func (fc fctuple) GobEncode() ([]byte, error) {
method GobDecode (line 96) | func (fc *fctuple) GobDecode(buf []byte) error {
method GobEncode (line 112) | func (p *perceptron) GobEncode() ([]byte, error) {
method GobDecode (line 142) | func (p *perceptron) GobDecode(buf []byte) error {
function init (line 173) | func init() {
FILE: pos/perceptron_io_test.go
function TestFeatureSerialization (line 14) | func TestFeatureSerialization(t *testing.T) {
function TestPerceptron_Serialize (line 44) | func TestPerceptron_Serialize(t *testing.T) {
FILE: pos/postagger.go
type Tagger (line 16) | type Tagger struct
method Clone (line 98) | func (p *Tagger) Clone() *Tagger {
method Run (line 114) | func (p *Tagger) Run() {
method Lemmatize (line 138) | func (p *Tagger) Lemmatize(a string, pt lingo.POSTag) ([]string, error) {
method Stem (line 146) | func (p *Tagger) Stem(a string) (string, error) {
method Clusters (line 154) | func (p *Tagger) Clusters() (map[string]lingo.Cluster, error) {
method Progress (line 162) | func (p *Tagger) Progress() <-chan Progress {
method Train (line 170) | func (p *Tagger) Train(sentences []treebank.SentenceTag, iterations in...
method LoadShortcuts (line 248) | func (p *Tagger) LoadShortcuts(shortcuts map[string]lingo.POSTag) {
method fillCache (line 254) | func (p *Tagger) fillCache(sentences []treebank.SentenceTag) {
method shortcut (line 296) | func (p *Tagger) shortcut(l lingo.Lexeme) (lingo.POSTag, bool) {
method setTag (line 304) | func (p *Tagger) setTag(a *lingo.Annotation, tag lingo.POSTag) {
type ConsOpt (line 32) | type ConsOpt
function WithCorpus (line 35) | func WithCorpus(c *corpus.Corpus) ConsOpt {
function WithLemmatizer (line 44) | func WithLemmatizer(l lingo.Lemmatizer) ConsOpt {
function WithStemmer (line 53) | func WithStemmer(s lingo.Stemmer) ConsOpt {
function WithCluster (line 62) | func WithCluster(c map[string]lingo.Cluster) ConsOpt {
function WithModel (line 70) | func WithModel(m *Model) ConsOpt {
function New (line 78) | func New(opts ...ConsOpt) *Tagger {
type Progress (line 322) | type Progress struct
FILE: pos/release.go
constant BUILD_DEBUG (line 5) | BUILD_DEBUG = "POS TAGGER: Release Build"
function tabcount (line 10) | func tabcount() int { return 0 }
function enterLoggingContext (line 11) | func enterLoggingContext() {}
function leaveLoggingContext (line 12) | func leaveLoggingContext() {}
function logf (line 13) | func logf(format string, others ...interface{}) {}
function recoverFrom (line 14) | func recoverFrom(format string, attrs ...interface{}) {}
method ShowWeights (line 16) | func (p *Tagger) ShowWeights() {}
function printShortcuts (line 17) | func printShortcuts(p *Tagger) {}
FILE: pos/sentence.go
method getSentences (line 7) | func (p *Tagger) getSentences() {
FILE: pos/test_test.go
type dummyLem (line 8) | type dummyLem struct
method Lemmatize (line 10) | func (dummyLem) Lemmatize(s string, pt lingo.POSTag) ([]string, error) {
type dummyStemmer (line 19) | type dummyStemmer struct
method Stem (line 21) | func (dummyStemmer) Stem(s string) (string, error) {
type dummyFix (line 31) | type dummyFix struct
method Clusters (line 36) | func (dummyFix) Clusters() (map[string]lingo.Cluster, error) { return ...
constant conllu (line 38) | conllu = `1 From from ADP IN _ 3 case _ _
FILE: pos/util.go
function maxScore (line 9) | func maxScore(scores *[lingo.MAXTAG]float64) lingo.POSTag {
FILE: pos/util_test.go
function TestMaxScore (line 11) | func TestMaxScore(t *testing.T) {
FILE: sentence.go
type LexemeSentence (line 13) | type LexemeSentence
method String (line 17) | func (ls LexemeSentence) String() string {
function NewLexemeSentence (line 15) | func NewLexemeSentence() LexemeSentence { return LexemeSentence(make([]L...
type AnnotatedSentence (line 29) | type AnnotatedSentence
method Clone (line 33) | func (as AnnotatedSentence) Clone() AnnotatedSentence {
method SetID (line 47) | func (as AnnotatedSentence) SetID() {
method Fix (line 56) | func (as AnnotatedSentence) Fix() {
method IsValid (line 74) | func (as AnnotatedSentence) IsValid() bool {
method Phrase (line 96) | func (as AnnotatedSentence) Phrase(start, end int) (AnnotatedSentence,...
method IDs (line 107) | func (as AnnotatedSentence) IDs() []int {
method Tags (line 116) | func (as AnnotatedSentence) Tags() []POSTag {
method Heads (line 125) | func (as AnnotatedSentence) Heads() []int {
method Leaves (line 134) | func (as AnnotatedSentence) Leaves() (retVal []int) {
method Labels (line 144) | func (as AnnotatedSentence) Labels() []DependencyType {
method StringSlice (line 153) | func (as AnnotatedSentence) StringSlice() []string {
method LoweredStringSlice (line 162) | func (as AnnotatedSentence) LoweredStringSlice() []string {
method Lemmas (line 171) | func (as AnnotatedSentence) Lemmas() []string {
method Stems (line 180) | func (as AnnotatedSentence) Stems() []string {
method Children (line 188) | func (as AnnotatedSentence) Children(h int) (retVal []int) {
method Edges (line 197) | func (as AnnotatedSentence) Edges() (retVal []DependencyEdge) {
method Dependency (line 217) | func (as AnnotatedSentence) Dependency() *Dependency {
method Tree (line 221) | func (as AnnotatedSentence) Tree() *DependencyTree {
method String (line 263) | func (as AnnotatedSentence) String() string {
method ValueString (line 274) | func (as AnnotatedSentence) ValueString() string {
method LoweredString (line 285) | func (as AnnotatedSentence) LoweredString() string {
method LemmaString (line 296) | func (as AnnotatedSentence) LemmaString() string {
method StemString (line 307) | func (as AnnotatedSentence) StemString() string {
method Len (line 319) | func (as AnnotatedSentence) Len() int { return len(as) }
method Swap (line 320) | func (as AnnotatedSentence) Swap(i, j int) { as[i], as[j] = as[j]...
method Less (line 321) | func (as AnnotatedSentence) Less(i, j int) bool { return as[i].ID < as...
function NewAnnotatedSentence (line 31) | func NewAnnotatedSentence() AnnotatedSentence { return make(AnnotatedSen...
FILE: sets.go
type TagSet (line 11) | type TagSet
method String (line 13) | func (ts TagSet) String() string {
type DependencyTypeSet (line 22) | type DependencyTypeSet
method String (line 24) | func (dts DependencyTypeSet) String() string {
FILE: shape.go
type Shape (line 9) | type Shape
method Shape (line 11) | func (l Lexeme) Shape() Shape {
FILE: stopwords.go
constant sw (line 5) | sw = `a about above across after afterwards again against all almost alo...
function init (line 9) | func init() {
function UnescapeSpecials (line 18) | func UnescapeSpecials(word string) string {
FILE: treebank/sentenceTag.go
type SentenceTag (line 10) | type SentenceTag struct
method AnnotatedSentence (line 17) | func (s SentenceTag) AnnotatedSentence(f lingo.AnnotationFixer) lingo....
method Dependency (line 48) | func (s SentenceTag) Dependency(f lingo.AnnotationFixer) *lingo.Depend...
method String (line 55) | func (s SentenceTag) String() string {
function ShuffleSentenceTag (line 59) | func ShuffleSentenceTag(s []SentenceTag) []SentenceTag {
function WrapLexemeSentence (line 71) | func WrapLexemeSentence(sentence lingo.LexemeSentence) lingo.LexemeSente...
function WrapTags (line 79) | func WrapTags(tagList []lingo.POSTag) []lingo.POSTag {
function WrapHeads (line 85) | func WrapHeads(heads []int) []int {
function WrapDeps (line 91) | func WrapDeps(deps []lingo.DependencyType) []lingo.DependencyType {
FILE: treebank/sentenceTag_test.go
function TestSentenceTag (line 10) | func TestSentenceTag(t *testing.T) {
FILE: treebank/treebank.go
type Loader (line 19) | type Loader
function LoadUniversal (line 22) | func LoadUniversal(fileName string) []SentenceTag {
function ReadConllu (line 34) | func ReadConllu(reader io.Reader) []SentenceTag {
function LoadEWT (line 111) | func LoadEWT(filename string) []SentenceTag {
FILE: treebank/treebank_test.go
constant sampleConllu (line 11) | sampleConllu = `1 President President PROPN NNP Number=Sing 2 compound _ _
function Test_ReadConllu (line 33) | func Test_ReadConllu(t *testing.T) {
function ttos (line 119) | func ttos(ts []lingo.POSTag) []string {
function ltos (line 127) | func ltos(ls []lingo.DependencyType) []string {
FILE: treebank/util.go
function StringToLexType (line 8) | func StringToLexType(tag string) lingo.LexemeType {
function StringToPOSTag (line 23) | func StringToPOSTag(tag string) (lingo.POSTag, bool) {
function StringToDependencyType (line 29) | func StringToDependencyType(ud string) (lingo.DependencyType, bool) {
function reset (line 35) | func reset() (lingo.LexemeSentence, []lingo.POSTag, []int, []lingo.Depen...
function finish (line 44) | func finish(s lingo.LexemeSentence, st []lingo.POSTag, sh []int, sdt []l...
FILE: utils.go
function InStringSlice (line 3) | func InStringSlice(s string, l []string) bool {
type is (line 12) | type is
function StringIs (line 14) | func StringIs(s string, f is) bool {
function isAscii (line 23) | func isAscii(r rune) bool {
function EqStringSlice (line 30) | func EqStringSlice(a, b []string) bool {
FILE: wordFlags.go
type WordFlag (line 10) | type WordFlag
method String (line 31) | func (f WordFlag) String() string {
constant NoFlag (line 13) | NoFlag WordFlag = iota
constant IsLetter (line 14) | IsLetter
constant IsAscii (line 15) | IsAscii
constant IsDigit (line 16) | IsDigit
constant IsLower (line 17) | IsLower
constant IsPunct (line 18) | IsPunct
constant IsSpace (line 19) | IsSpace
constant IsTitle (line 20) | IsTitle
constant IsUpper (line 21) | IsUpper
constant LikeURL (line 22) | LikeURL
constant LikeNum (line 23) | LikeNum
constant LikeEmail (line 24) | LikeEmail
constant IsStopWord (line 25) | IsStopWord
constant IsOOV (line 26) | IsOOV
constant MAXFLAG (line 28) | MAXFLAG
method Flags (line 35) | func (l Lexeme) Flags() WordFlag {
Condensed preview — 128 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (320K chars).
[
{
"path": ".gitignore",
"chars": 266,
"preview": "# Compiled Object files, Static and Dynamic libs (Shared Objects)\n*.o\n*.a\n*.so\n\n# Folders\n_obj\n_test\n\n# Architecture spe"
},
{
"path": ".travis.yml",
"chars": 157,
"preview": "language: go\n\nbranches:\n only:\n - master\n\ngo:\n - 1.11.x\n - 1.12.x\n - 1.13.x\n - tip\n\nenv:\n - GO111MODULE=on\n\nmat"
},
{
"path": "CONTRIBUTING.md",
"chars": 1832,
"preview": "# Contributing #\n\nContributors are welcome! We want to make contributing as easy as possible, and the process is very Gi"
},
{
"path": "CONTRIBUTORS.md",
"chars": 59,
"preview": "# Contributors #\n\n* Xuanyi Chew (@chewxy) - initial package"
},
{
"path": "LICENSE",
"chars": 1063,
"preview": "MIT License\n\nCopyright (c) 2017 Chewxy\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof "
},
{
"path": "POSTag.go",
"chars": 1397,
"preview": "package lingo\n\nimport (\n\t\"fmt\"\n\t\"strings\"\n)\n\n// POSTag represents a Part of Speech Tag.\ntype POSTag byte\n\nvar posTagLook"
},
{
"path": "POSTag_stanford.go",
"chars": 4115,
"preview": "// +build stanfordtags\n\npackage lingo\n\n//go:generate stringer -type=POSTag -output=POSTag_stanford_string.go\n\nconst BUIL"
},
{
"path": "POSTag_stanford_string.go",
"chars": 829,
"preview": "// +build stanfordtags\n\n// Code generated by \"stringer -type=POSTag -output=POSTag_stanford_string.go\"; DO NOT EDIT\n\npac"
},
{
"path": "POSTag_universal.go",
"chars": 1342,
"preview": "// +build !stanfordtags\n\npackage lingo\n\n//go:generate stringer -type=POSTag -output=POSTag_universal_string.go\n\nconst BU"
},
{
"path": "POSTag_universal_string.go",
"chars": 548,
"preview": "// +build !stanfordtags\n\n// Code generated by \"stringer -type=POSTag -output=POSTag_universal_string.go\"; DO NOT EDIT\n\np"
},
{
"path": "README.md",
"chars": 5130,
"preview": "# lingo #\n\n<img src=\"https://raw.githubusercontent.com/chewxy/lingo/master/media/gopher_small.png\" align=\"right\" />\n\n[!["
},
{
"path": "annotation.go",
"chars": 3999,
"preview": "package lingo\n\nimport (\n\t\"errors\"\n\t\"fmt\"\n\t\"strings\"\n)\n\n// Annotation is the word and it's metadata.\n// This includes the"
},
{
"path": "annotationSet.go",
"chars": 826,
"preview": "package lingo\n\nimport (\n\t\"sort\"\n\t\"unsafe\"\n\n\t\"github.com/xtgo/set\"\n)\n\ntype AnnotationSet []*Annotation\n\nfunc (as Annotati"
},
{
"path": "annotationSet_bench_test.go",
"chars": 2366,
"preview": "package lingo\n\nimport (\n\t\"sort\"\n\t\"testing\"\n)\n\nfunc (as AnnotationSet) index2(a *Annotation) int {\n\tsort.Sort(as)\n\tf := f"
},
{
"path": "browncluster.go",
"chars": 1509,
"preview": "package lingo\n\nimport (\n\t\"bufio\"\n\t\"io\"\n\t\"strconv\"\n\t\"strings\"\n)\n\n// this file provides IO support and type safety for bro"
},
{
"path": "cmd/demo/io.go",
"chars": 694,
"preview": "package main\n\nimport (\n\t\"log\"\n\t\"os\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/dep\"\n\t\"github.com/chewxy/lingo"
},
{
"path": "cmd/demo/main.go",
"chars": 1264,
"preview": "package main\n\nimport (\n\t\"io/ioutil\"\n\t\"os\"\n\t\"os/exec\"\n\n\t\"github.com/abiosoft/ishell\"\n\t\"github.com/chewxy/lingo\"\n\t\"github."
},
{
"path": "cmd/demo/nlp.go",
"chars": 1348,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\t\"strings\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/dep\"\n\t\"github.com/chewxy/"
},
{
"path": "cmd/dep/fixer.go",
"chars": 589,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/kljensen/snowball\"\n)\n\ntype stemmer struct{}\n\nfunc"
},
{
"path": "cmd/dep/io.go",
"chars": 1143,
"preview": "package main\n\nimport (\n\t\"log\"\n\n\t\"github.com/chewxy/lingo/dep\"\n\t\"github.com/chewxy/lingo/pos\"\n\t\"github.com/chewxy/lingo/t"
},
{
"path": "cmd/dep/main.go",
"chars": 2527,
"preview": "package main\n\nimport (\n\t\"flag\"\n\t\"log\"\n\t\"os\"\n\t\"os/signal\"\n\t\"runtime/pprof\"\n\t\"syscall\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"githu"
},
{
"path": "cmd/dep/pipeline.go",
"chars": 893,
"preview": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"strings\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/dep\"\n\t\"g"
},
{
"path": "cmd/dep/train.go",
"chars": 1298,
"preview": "package main\n\nimport (\n\t\"log\"\n\n\t\"github.com/chewxy/lingo/dep\"\n\t\"github.com/chewxy/lingo/treebank\"\n\t\"gorgonia.org/tensor\""
},
{
"path": "cmd/lexer/main.go",
"chars": 413,
"preview": "package main\n\nimport (\n\t\"flag\"\n\t\"fmt\"\n\t\"strings\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/lexer\"\n)\n\nvar inp"
},
{
"path": "cmd/pos/crossvalidation.go",
"chars": 2608,
"preview": "package main\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n\t\"log\"\n\t\"os\"\n\t\"strings\"\n\t\"sync\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/l"
},
{
"path": "cmd/pos/fixer.go",
"chars": 608,
"preview": "// +build !chewxy\n\npackage main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/kljensen/snowball\"\n)\n\ntype ste"
},
{
"path": "cmd/pos/main.go",
"chars": 4478,
"preview": "package main\n\nimport (\n\t\"flag\"\n\t\"fmt\"\n\t\"log\"\n\t\"os\"\n\t\"os/signal\"\n\t\"runtime/pprof\"\n\t\"strings\"\n\t\"sync\"\n\t\"syscall\"\n\t\"time\"\n\n"
},
{
"path": "const.go",
"chars": 1625,
"preview": "package lingo\n\n// constants that are not pertaining to build tags\n\nvar empty struct{}\n\n// NumberWords was generated with"
},
{
"path": "corpus/consopt.go",
"chars": 3707,
"preview": "package corpus\n\nimport (\n\t\"log\"\n\t\"sort\"\n\t\"sync/atomic\"\n\t\"unicode/utf8\"\n\n\t\"github.com/pkg/errors\"\n\t\"github.com/xtgo/set\"\n"
},
{
"path": "corpus/corpus.go",
"chars": 4775,
"preview": "package corpus\n\nimport (\n\t\"sync/atomic\"\n\t\"unicode/utf8\"\n\n\t\"github.com/pkg/errors\"\n)\n\n// Corpus is a data structure holdi"
},
{
"path": "corpus/corpus_test.go",
"chars": 1553,
"preview": "package corpus\n\nimport (\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/assert\"\n)\n\nfunc TestCorpus(t *testing.T) {\n\tassert :="
},
{
"path": "corpus/functions.go",
"chars": 8180,
"preview": "package corpus\n\nimport (\n\t\"math\"\n\t\"strings\"\n\t\"unicode/utf8\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/treeba"
},
{
"path": "corpus/functions_test.go",
"chars": 3629,
"preview": "package corpus\n\nimport (\n\t\"strings\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/assert\"\n)\n\nfunc Test_GenerateCorpus(t *tes"
},
{
"path": "corpus/inflection.go",
"chars": 3816,
"preview": "package corpus\n\nimport (\n\t\"regexp\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\ntype conversionPattern struct {\n\tpattern *regexp.R"
},
{
"path": "corpus/inflection_test.go",
"chars": 901,
"preview": "package corpus\n\nimport \"testing\"\n\nvar pluralizeTest = []struct {\n\tword, correct string\n}{\n\t{\"friend\", \"friends\"},\n\t{\"tom"
},
{
"path": "corpus/io.go",
"chars": 3123,
"preview": "package corpus\n\nimport (\n\t\"bufio\"\n\t\"bytes\"\n\t\"encoding/gob\"\n\t\"io\"\n\t\"strconv\"\n\t\"strings\"\n)\n\n// sortutil is a utility struc"
},
{
"path": "corpus/io_test.go",
"chars": 1992,
"preview": "package corpus\n\nimport (\n\t\"bytes\"\n\t\"encoding/gob\"\n\t\"strings\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/assert\"\n)\n\nfunc T"
},
{
"path": "corpus/lda.go",
"chars": 973,
"preview": "package corpus\n\nimport (\n\t\"gorgonia.org/tensor\"\n)\n\n// LDAModel ... TODO\n//https://en.wikipedia.org/wiki/Latent_Dirichlet"
},
{
"path": "corpus/test_test.go",
"chars": 1382,
"preview": "package corpus\n\nimport (\n\t\"strings\"\n\n\t\"github.com/chewxy/lingo/treebank\"\n)\n\nconst sample1Gram = `the\t23135851162\nof\t1315"
},
{
"path": "corpus/utils.go",
"chars": 530,
"preview": "package corpus\n\nimport (\n\t\"errors\"\n\t\"math\"\n)\n\nfunc minInt(a, b int) int {\n\tif a < b {\n\t\treturn a\n\t}\n\treturn b\n}\n\nfunc ma"
},
{
"path": "dep/README.md",
"chars": 8658,
"preview": "# Dependency Parser #\n\nPackage `dependencyparser` is a package that provides data structures and algorithms for a depend"
},
{
"path": "dep/arcStandard.go",
"chars": 1628,
"preview": "package dep\n\nimport \"github.com/chewxy/lingo\"\n\n// var SingleRoot bool = true // make this part of a build process\n\n// ca"
},
{
"path": "dep/arcStandard_test.go",
"chars": 2385,
"preview": "package dep\n\nimport (\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/stretchr/testify/assert\"\n)\n\nfunc TestCanApply("
},
{
"path": "dep/configuration.go",
"chars": 4431,
"preview": "package dep\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\n// describes the current state of the parser\n\ntype head int\n"
},
{
"path": "dep/configuration_test.go",
"chars": 1678,
"preview": "package dep\n\nimport (\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/stretchr/testify/assert\"\n)\n\nfunc TestStackAppe"
},
{
"path": "dep/debug.go",
"chars": 3086,
"preview": "// +build debug\n\npackage dep\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n\t\"log\"\n\t\"runtime\"\n\t\"strings\"\n\t\"sync/atomic\"\n\n\t\"github.com/chewxy/"
},
{
"path": "dep/dependencyParser.go",
"chars": 2824,
"preview": "package dep\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/corpus\"\n\t\"github.com/pkg/errors\"\n)\n\nv"
},
{
"path": "dep/documentation/iamhuman.dot",
"chars": 391,
"preview": "digraph G {\n\tNode_0xc425b88740->Node_0xc425b88780[ label=Root ];\n\tNode_0xc425b88780->Node_0xc425b88800[ label=Cop ];\n\tNo"
},
{
"path": "dep/documentation/thecatsatonthemat.dot",
"chars": 706,
"preview": "digraph G {\n\tNode_0xc4349eeec0->Node_0xc4349eef80[ label=Root ];\n\tNode_0xc4349eef80->Node_0xc4349eefc0[ label=NMod ];\n\tN"
},
{
"path": "dep/errors.go",
"chars": 864,
"preview": "package dep\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\ntype componentUnavailable string\n\nfunc (c componentUnavailab"
},
{
"path": "dep/evaluation.go",
"chars": 2968,
"preview": "package dep\n\nimport (\n\t\"fmt\"\n\t\"io/ioutil\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/treebank\"\n)\n\n// Performa"
},
{
"path": "dep/example.go",
"chars": 2183,
"preview": "package dep\n\nimport (\n\t\"math/rand\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/corpus\"\n\t\"github.com/chewxy/lin"
},
{
"path": "dep/example_test.go",
"chars": 339,
"preview": "package dep\n\nimport (\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo/corpus\"\n)\n\nfunc TestMakeExamples(t *testing.T) {\n\tst := simp"
},
{
"path": "dep/featureExtraction.go",
"chars": 3542,
"preview": "package dep\n\nimport (\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/corpus\"\n)\n\n// getFeatures extracts the IDs to"
},
{
"path": "dep/features.go",
"chars": 844,
"preview": "package dep\n\nimport \"github.com/chewxy/lingo\"\n\n// the features are used as columns in the matrix\n\n// go:generate stringe"
},
{
"path": "dep/features_string.go",
"chars": 804,
"preview": "// generated by stringer -type=feature -output=features_string.go; DO NOT EDIT\n\npackage dep\n\nimport \"fmt\"\n\nconst _featur"
},
{
"path": "dep/fix.go",
"chars": 2856,
"preview": "package dep\n\nimport (\n\t\"log\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\n// applies common fixes\nfunc fix(d *lingo.Dependency) {\n\t// "
},
{
"path": "dep/init.go",
"chars": 135,
"preview": "package dep\n\nimport \"github.com/chewxy/lingo/corpus\"\n\nfunc init() {\n\tc := corpus.New()\n\tc.Add(\"\") // add null words\n\n\tKn"
},
{
"path": "dep/models.go",
"chars": 2079,
"preview": "package dep\n\nimport (\n\t\"bufio\"\n\t\"bytes\"\n\t\"encoding/gob\"\n\t\"fmt\"\n\t\"io\"\n\t\"os\"\n\n\t\"github.com/chewxy/lingo/corpus\"\n\t\"github.c"
},
{
"path": "dep/models_test.go",
"chars": 1080,
"preview": "package dep\n\nimport (\n\t\"os\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/assert\"\n\tG \"gorgonia.org/gorgonia\"\n)\n\nfunc TestMod"
},
{
"path": "dep/move.go",
"chars": 312,
"preview": "package dep\n\n// Move is an action that the dependency parser can take - whether to Shift, Attach-Left, or AttachRight\nty"
},
{
"path": "dep/move_string.go",
"chars": 329,
"preview": "// generated by stringer -type=Move; DO NOT EDIT\n\npackage dep\n\nimport \"fmt\"\n\nconst _Move_name = \"ShiftLeftRightMAXMOVE\"\n"
},
{
"path": "dep/nn2.go",
"chars": 10402,
"preview": "package dep\n\nimport (\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/corpus\"\n\t\"github.com/pkg/errors\"\n\tG \"gorgonia"
},
{
"path": "dep/nn2_io.go",
"chars": 3260,
"preview": "package dep\n\nimport (\n\t\"bytes\"\n\t\"encoding/gob\"\n\t\"fmt\"\n\n\t\"github.com/pkg/errors\"\n\tG \"gorgonia.org/gorgonia\"\n\tT \"gorgonia."
},
{
"path": "dep/nn2_io_test.go",
"chars": 3370,
"preview": "package dep\n\nimport (\n\t\"bytes\"\n\t\"encoding/gob\"\n\t\"fmt\"\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/c"
},
{
"path": "dep/nn2_test.go",
"chars": 1901,
"preview": "package dep\n\nimport (\n\t\"math/rand\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/chewxy/lingo/corpus\"\n\t\"gorgonia.org/gorgonia\"\n)\n\nfun"
},
{
"path": "dep/nnconfig.go",
"chars": 2914,
"preview": "package dep\n\nimport (\n\t\"bytes\"\n\t\"encoding/gob\"\n\t\"fmt\"\n\n\t\"github.com/pkg/errors\"\n\t\"gorgonia.org/tensor\"\n)\n\n// NNConfig co"
},
{
"path": "dep/release.go",
"chars": 607,
"preview": "// +build !debug\n\npackage dep\n\nconst BUILD_DEBUG = \"PARSER: RELEASE BUILD\"\nconst BUILD_DIAG = \"Non-Diagnostic Build\"\n\nco"
},
{
"path": "dep/span.go",
"chars": 313,
"preview": "package dep\n\ntype span struct {\n\tstart, end int\n}\n\nfunc makeSpan(start, end int) span {\n\tif end <= start {\n\t\tpanic(\"Impo"
},
{
"path": "dep/test_test.go",
"chars": 7717,
"preview": "package dep\n\nimport (\n\t\"bufio\"\n\t\"crypto/md5\"\n\t\"encoding/gob\"\n\t\"fmt\"\n\t\"io\"\n\t\"log\"\n\t\"os\"\n\t\"strings\"\n\n\t\"github.com/chewxy/l"
},
{
"path": "dep/train.go",
"chars": 9031,
"preview": "package dep\n\nimport (\n\t\"fmt\"\n\t\"os\"\n\t\"sync\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/corpus\"\n\t\"github.com/ch"
},
{
"path": "dep/train_test.go",
"chars": 4827,
"preview": "package dep\n\nimport (\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo/corpus\"\n\n\tG \"gorgonia.org/gorgonia\"\n)\n\nfunc TestTrainerIniti"
},
{
"path": "dep/transition.go",
"chars": 1431,
"preview": "package dep\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\n// transition is a tuple of Move and label\ntype transition s"
},
{
"path": "dep/util.go",
"chars": 146,
"preview": "package dep\n\nfunc minInt(a, b int) int {\n\tif a < b {\n\t\treturn a\n\t}\n\treturn b\n}\n\nfunc maxInt(a, b int) int {\n\tif a > b {\n"
},
{
"path": "dependency.go",
"chars": 4869,
"preview": "package lingo\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n)\n\n// Dependency represents the dependency parse of a sentence. While AnnotatedS"
},
{
"path": "dependencyTree.go",
"chars": 2069,
"preview": "package lingo\n\nimport (\n\t\"github.com/awalterschulze/gographviz\"\n\n\t\"fmt\"\n\n\t\"sync\"\n)\n\n// A DependencyTree is an alternate "
},
{
"path": "dependencyType.go",
"chars": 1323,
"preview": "package lingo\n\nimport (\n\t\"fmt\"\n\t\"strings\"\n)\n\n// DependencyType represents the relation between two words\ntype Dependency"
},
{
"path": "dependencyType_stanford.go",
"chars": 2858,
"preview": "// +build stanfordrel\n\npackage lingo\n\nconst BUILD_RELSET = \"stanfordrel\"\n\n//go:generate stringer -type=DependencyType -o"
},
{
"path": "dependencyType_stanford_string.go",
"chars": 921,
"preview": "// +build stanfordrel\n\n// Code generated by \"stringer -type=DependencyType -output=dependencyType_stanford_string.go\"; D"
},
{
"path": "dependencyType_universal.go",
"chars": 1644,
"preview": "// +build !stanfordrel\n\npackage lingo\n\nconst BUILD_RELSET = \"universalrel\"\n\n//go:generate stringer -type=DependencyType "
},
{
"path": "dependencyType_universal_string.go",
"chars": 1012,
"preview": "// +build !stanfordrel\n\n// Code generated by \"stringer -type=DependencyType -output=dependencyType_universal_string.go\";"
},
{
"path": "errors.go",
"chars": 82,
"preview": "package lingo\n\ntype componentUnavailable interface {\n\terror\n\tComponent() string\n}\n"
},
{
"path": "go.mod",
"chars": 1710,
"preview": "module github.com/chewxy/lingo\n\nrequire (\n\tgithub.com/abiosoft/ishell v2.0.0+incompatible\n\tgithub.com/abiosoft/readline "
},
{
"path": "go.sum",
"chars": 6315,
"preview": "github.com/abiosoft/ishell v2.0.0+incompatible/go.mod h1:HQR9AqF2R3P4XXpMpI0NAzgHf/aS6+zVXRj14cVk9qg=\ngithub.com/abiosof"
},
{
"path": "interfaces.go",
"chars": 1988,
"preview": "package lingo\n\nimport (\n\t\"encoding/gob\"\n\n\t\"gorgonia.org/tensor\"\n)\n\n// Lemmatizer is anything that can lemmatize\ntype Lem"
},
{
"path": "io.go",
"chars": 3552,
"preview": "package lingo\n\nimport (\n\t\"bytes\"\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"strings\"\n\n\t\"github.com/pkg/errors\"\n)\n\ntype dummyAnnotation st"
},
{
"path": "io_test.go",
"chars": 1849,
"preview": "package lingo\n\nimport (\n\t\"encoding/json\"\n\t\"testing\"\n)\n\nfunc TestAnnotationJSON(t *testing.T) {\n\ta := NewAnnotation()\n\ta."
},
{
"path": "lexeme.go",
"chars": 1298,
"preview": "package lingo\n\nimport (\n\t\"fmt\"\n\t\"unicode\"\n)\n\n//go:generate stringer -type=LexemeType\n\ntype LexemeType byte\n\nconst (\n\tEOF"
},
{
"path": "lexemetype_string.go",
"chars": 468,
"preview": "// Code generated by \"stringer -type=LexemeType\"; DO NOT EDIT\n\npackage lingo\n\nimport \"fmt\"\n\nconst _LexemeType_name = \"EO"
},
{
"path": "lexer/lexer.go",
"chars": 2750,
"preview": "package lexer\n\nimport (\n\t\"bufio\"\n\t\"bytes\"\n\t\"io\"\n\t\"strings\"\n\t\"sync\"\n\n\t\"golang.org/x/text/unicode/norm\"\n\n\t\"github.com/chew"
},
{
"path": "lexer/lexer_test.go",
"chars": 6700,
"preview": "package lexer\n\nimport (\n\t\"strings\"\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\ntype lexerTest struct {\n\tname string\n\ts "
},
{
"path": "lexer/stateFn.go",
"chars": 5750,
"preview": "package lexer\n\nimport (\n\t\"unicode\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\ntype stateFn func(*Lexer) stateFn\n\nfunc lexText(l *Lex"
},
{
"path": "lingo.go",
"chars": 117,
"preview": "// package lingo provides the data structures and algorithms required for natural language processing.\npackage lingo\n"
},
{
"path": "pos/allinone_test.go",
"chars": 1046,
"preview": "package pos\n\nimport (\n\t\"log\"\n\t\"strings\"\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/lexer\"\n\t\"github"
},
{
"path": "pos/context.go",
"chars": 2674,
"preview": "package pos\n\nimport (\n\t\"strconv\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\n/*\nA context is which word in the current state the POST"
},
{
"path": "pos/context_test.go",
"chars": 1561,
"preview": "package pos\n\nimport (\n\t\"strings\"\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\nvar extractContextTest = []struct {\n\tval stri"
},
{
"path": "pos/contexttype_string.go",
"chars": 972,
"preview": "// generated by stringer -type=contextType; DO NOT EDIT\n\npackage pos\n\nimport \"fmt\"\n\nconst _contextType_name = \"prev2Word"
},
{
"path": "pos/debug.go",
"chars": 789,
"preview": "// +build debug\n\npackage pos\n\nimport (\n\t\"log\"\n\t\"strings\"\n\t\"sync/atomic\"\n)\n\nconst BUILD_DEBUG = \"POS TAGGER: Debug Build\""
},
{
"path": "pos/errors.go",
"chars": 224,
"preview": "package pos\n\nimport \"fmt\"\n\ntype componentUnavailable string\n\nfunc (c componentUnavailable) Error() string { return f"
},
{
"path": "pos/features.go",
"chars": 3168,
"preview": "package pos\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\ntype featureType byte\n\n//go:generate stringer -type"
},
{
"path": "pos/features_test.go",
"chars": 6517,
"preview": "// +build stanfordtags\n\npackage pos\n\nimport (\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/stretchr/testify/asser"
},
{
"path": "pos/featuretype_string.go",
"chars": 804,
"preview": "// generated by stringer -type=featureType; DO NOT EDIT\n\npackage pos\n\nimport \"fmt\"\n\nconst _featureType_name = \"biasithWo"
},
{
"path": "pos/models.go",
"chars": 1295,
"preview": "package pos\n\nimport (\n\t\"bufio\"\n\t\"encoding/gob\"\n\t\"io\"\n\t\"os\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\n// Model is the model that the"
},
{
"path": "pos/models_test.go",
"chars": 658,
"preview": "package pos\n\nimport (\n\t\"os\"\n\t\"strings\"\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo/treebank\"\n\t\"github.com/stretchr/testify/ass"
},
{
"path": "pos/perceptron.go",
"chars": 3257,
"preview": "package pos\n\nimport \"github.com/chewxy/lingo\"\n\ntype perceptron struct {\n\t// weights map[feature]*[lingo.MAXTAG]float64 /"
},
{
"path": "pos/perceptron_io.go",
"chars": 3163,
"preview": "package pos\n\nimport (\n\t\"bytes\"\n\t\"encoding/gob\"\n)\n\n/* Feature Gob interface */\n\nfunc (sf singleFeature) GobEncode() ([]by"
},
{
"path": "pos/perceptron_io_test.go",
"chars": 1781,
"preview": "// +build stanfordtags\n\npackage pos\n\nimport (\n\t\"bytes\"\n\t\"encoding/gob\"\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.c"
},
{
"path": "pos/postagger.go",
"chars": 7725,
"preview": "package pos\n\nimport (\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/chewxy/lingo/corpus\"\n\t\"github.com/chewxy/lingo/treebank\"\n)"
},
{
"path": "pos/release.go",
"chars": 490,
"preview": "// +build !debug\n\npackage pos\n\nconst BUILD_DEBUG = \"POS TAGGER: Release Build\"\n\nvar TABCOUNT uint32 = 0\nvar tracking = f"
},
{
"path": "pos/sentence.go",
"chars": 588,
"preview": "package pos\n\nimport \"github.com/chewxy/lingo\"\n\n// \"log\"\n\nfunc (p *Tagger) getSentences() {\n\tdefer close(p.sentences)\n\n\tv"
},
{
"path": "pos/test_test.go",
"chars": 3456,
"preview": "package pos\n\nimport (\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/kljensen/snowball\"\n)\n\ntype dummyLem struct{}\n\nfunc (dummyL"
},
{
"path": "pos/util.go",
"chars": 293,
"preview": "package pos\n\nimport (\n\t\"math\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\nfunc maxScore(scores *[lingo.MAXTAG]float64) lingo.POSTag {"
},
{
"path": "pos/util_test.go",
"chars": 435,
"preview": "package pos\n\nimport (\n\t\"math\"\n\t\"math/rand\"\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\nfunc TestMaxScore(t *testing.T) {\n\t"
},
{
"path": "sentence.go",
"chars": 7282,
"preview": "package lingo\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n\t\"sort\"\n\t\"strings\"\n\n\t\"github.com/pkg/errors\"\n)\n\n/* Lexeme Sentence */\ntype Lexem"
},
{
"path": "sets.go",
"chars": 570,
"preview": "package lingo\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n)\n\n/* TAG SET */\n\n// TagSet is a set of all the POSTags\ntype TagSet [MAXTAG]bool"
},
{
"path": "shape.go",
"chars": 899,
"preview": "package lingo\n\nimport (\n\t\"bytes\"\n\t\"unicode\"\n)\n\n// Shape represents the shape of a word. It's currently implemented as a "
},
{
"path": "stopwords.go",
"chars": 2417,
"preview": "package lingo\n\nimport \"strings\"\n\nconst sw = `a about above across after afterwards again against all almost alone along "
},
{
"path": "treebank/const_postag_stanford.go",
"chars": 1364,
"preview": "// +build stanfordtags\n\npackage treebank\n\nimport \"github.com/chewxy/lingo\"\n\nvar posTagTable map[string]lingo.POSTag = ma"
},
{
"path": "treebank/const_postag_universal.go",
"chars": 600,
"preview": "// +build !stanfordtags\n\npackage treebank\n\nimport \"github.com/chewxy/lingo\"\n\nvar posTagTable map[string]lingo.POSTag = m"
},
{
"path": "treebank/const_rel_stanford.go",
"chars": 2261,
"preview": "// +build stanfordrel\n\npackage treebank\n\nimport \"github.com/chewxy/lingo\"\n\nvar dependencyTable map[string]lingo.Dependen"
},
{
"path": "treebank/const_rel_universal.go",
"chars": 1819,
"preview": "// +build !stanfordrel\n\npackage treebank\n\nimport \"github.com/chewxy/lingo\"\n\nvar dependencyTable map[string]lingo.Depende"
},
{
"path": "treebank/sentenceTag.go",
"chars": 2083,
"preview": "package treebank\n\nimport (\n\t\"math/rand\"\n\n\t\"github.com/chewxy/lingo\"\n)\n\n// SentenceTag is a struc that holds a sentence, "
},
{
"path": "treebank/sentenceTag_test.go",
"chars": 425,
"preview": "package treebank\n\nimport (\n\t\"strings\"\n\t\"testing\"\n\n\t\"github.com/stretchr/testify/assert\"\n)\n\nfunc TestSentenceTag(t *testi"
},
{
"path": "treebank/treebank.go",
"chars": 2654,
"preview": "package treebank\n\nimport (\n\t\"archive/zip\"\n\t\"io\"\n\t\"log\"\n\n\t\"github.com/chewxy/lingo\"\n\n\t\"bufio\"\n\t\"os\"\n\t\"strconv\"\n\t\"strings\""
},
{
"path": "treebank/treebank_test.go",
"chars": 2654,
"preview": "package treebank\n\nimport (\n\t\"strings\"\n\t\"testing\"\n\n\t\"github.com/chewxy/lingo\"\n\t\"github.com/stretchr/testify/assert\"\n)\n\nco"
},
{
"path": "treebank/util.go",
"chars": 1099,
"preview": "package treebank\n\nimport \"github.com/chewxy/lingo\"\n\nvar alreadyLogged map[string]bool = make(map[string]bool)\n\n// TODO :"
},
{
"path": "utils.go",
"chars": 513,
"preview": "package lingo\n\nfunc InStringSlice(s string, l []string) bool {\n\tfor _, v := range l {\n\t\tif s == v {\n\t\t\treturn true\n\t\t}\n\t"
},
{
"path": "wordFlags.go",
"chars": 1260,
"preview": "package lingo\n\nimport (\n\t\"fmt\"\n\t\"strings\"\n\t\"unicode\"\n)\n\n// WordFlags represent the types a word may be. A word may have "
}
]
About this extraction
This page contains the full source code of the chewxy/lingo GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 128 files (278.9 KB), approximately 93.4k tokens, and a symbol index with 992 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.