Repository: interrogator/corpkit
Branch: master
Commit: c54be1f8c83d
Files: 98
Total size: 2.3 MB
Directory structure:
gitextract_mzzg7lm1/
├── .gitattributes
├── .gitmodules
├── .travis.yml
├── API-README.md
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── bld.bat
├── build.sh
├── conf.py
├── corpkit/
│ ├── __init__.py
│ ├── annotate.py
│ ├── blanknotebook.ipynb
│ ├── build.py
│ ├── completer.py
│ ├── configurations.py
│ ├── conll.py
│ ├── constants.py
│ ├── corpkit
│ ├── corpkit.1
│ ├── corpus.py
│ ├── cql.py
│ ├── dictionaries/
│ │ ├── __init__.py
│ │ ├── bnc.p
│ │ ├── bnc.py
│ │ ├── eng_verb_lexicon.p
│ │ ├── process_types.py
│ │ ├── queries.py
│ │ ├── roles.py
│ │ ├── stopwords.py
│ │ ├── verblist.py
│ │ ├── word_transforms.py
│ │ └── wordlists.py
│ ├── download/
│ │ ├── __init__.py
│ │ └── corenlp.py
│ ├── editor.py
│ ├── env.py
│ ├── gui.py
│ ├── inflect.py
│ ├── interpreter_tests.cki
│ ├── interrogation.py
│ ├── interrogator.py
│ ├── keys.py
│ ├── layouts.py
│ ├── lazyprop.py
│ ├── make.py
│ ├── model.py
│ ├── multiprocess.py
│ ├── new_project
│ ├── noseinstall.py
│ ├── nosetests.py
│ ├── other.py
│ ├── parse
│ ├── plotter.py
│ ├── plugins.py
│ ├── process.py
│ ├── stanford-tregex.jar
│ ├── stats.py
│ ├── textprogressbar.py
│ ├── tokenise.py
│ └── tregex.sh
├── data/
│ ├── corpus-filelist.txt
│ ├── test/
│ │ ├── first/
│ │ │ └── intro.txt
│ │ └── second/
│ │ └── body.txt
│ ├── test-plain-parsed/
│ │ ├── first/
│ │ │ └── intro.txt.conll
│ │ └── second/
│ │ └── body.txt.conll
│ ├── test-speak-parsed/
│ │ ├── first/
│ │ │ └── intro.txt.conll
│ │ └── second/
│ │ └── body.txt.conll
│ └── test-stripped/
│ ├── first/
│ │ └── intro.txt
│ └── second/
│ └── body.txt
├── index.rst
├── make.bat
├── meta.yaml
├── requirements.txt
├── rst_docs/
│ ├── API/
│ │ ├── corpkit.building.rst
│ │ ├── corpkit.concordancing.rst
│ │ ├── corpkit.editing.rst
│ │ ├── corpkit.interrogating.rst
│ │ ├── corpkit.langmodel.rst
│ │ ├── corpkit.managing.rst
│ │ └── corpkit.visualising.rst
│ ├── API-ref/
│ │ ├── corpkit.corpus.rst
│ │ ├── corpkit.dictionaries.rst
│ │ ├── corpkit.interrogation.rst
│ │ └── corpkit.other.rst
│ └── interpreter/
│ ├── corpkit.interpreter.annotating.rst
│ ├── corpkit.interpreter.concordancing.rst
│ ├── corpkit.interpreter.editing.rst
│ ├── corpkit.interpreter.interrogating.rst
│ ├── corpkit.interpreter.making.rst
│ ├── corpkit.interpreter.managing.rst
│ ├── corpkit.interpreter.overview.rst
│ ├── corpkit.interpreter.setup.rst
│ └── corpkit.interpreter.visualising.rst
├── setup.cfg
├── setup.py
└── talks/
└── IDL_seminar.tex
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitattributes
================================================
*.p linguist-language=Python
================================================
FILE: .gitmodules
================================================
[submodule "corpkit-app"]
path = corpkit-app
url = https://github.com/interrogator/corpkit-app.git
================================================
FILE: .travis.yml
================================================
language: python
python:
- '2.7'
- '3.5'
install:
- pip install --install-option="--no-cython-compile" cython
- pip install -r requirements.txt
- nltkd=$(python -c 'from __future__ import print_function; import nltk; print(nltk.data.path[0])')
- python -m nltk.downloader punkt -d "$nltkd"
- python -m nltk.downloader wordnet -d "$nltkd"
- python -m nltk.downloader averaged_perceptron_tagger -d "$nltkd"
script:
- nosetests corpkit/nosetests.py -a '!slow'
deploy:
provider: pypi
user: mcddjx
password:
secure: I7K+LWe37vRytA0QpF9sAdGaTYbwq0NuN6Xi6QgrSYr08WO5wKSZJ9bkBtJF4U9OCAtRjM64hOY+eobnKfwbNE+IHG8znI9z40jHyyCayYtk5P5UOG6OtB5wBbhXLjb9qXzy21byFcY1zM7iEUKw8D+Q4nu8cENFmx9agG025jet4MHXqtQlQYxTVr7GLK0oAqxO19J/D7F6Ykn2UEHw9dm3X0gu94gM6fMN1lIS74DM4d2IzRWOZrIaYigL8ckDSkWP9taVM553aI9qrLCz/4prCKwxo0QAINExiPYjSwG1swzTfabZvPI5bVdxY23TTx86Af6z3BQuhpIY1fspDTaw/Gn527XWFeOuqI8jhf6pP6ZdOo7qiyVwqU33/5CoTW+A/o1o963SDHSjyarxbz+De10zLScCvfIsZ2uHnh3CFnlWUeprjV09QIuz2lQbZoQP817/CAdxqLaMl/aG7Wcf4X7MI/SQauLVYR91gkhiBWzBdrYNGOEsrr7dzc5tbqBLeupF6Nf811BR2SdoGIfmihQGrYdC271/HuHTLsrcvXaCyXWElA1ATSRy6XfC8IsljU695Bm6kSrb4pG4V64P2Lhe2F8wtu4L1IzP+w7NRbeZNntMqMfksZz5vNe3CVhqcPy8VmOZGsmOaa9PIFHzZ7pM1Pxybt25Hz+GXBQ=
on:
tags: true
distributions: sdist bdist_wheel
repo: interrogator/corpkit
git:
submodules: false
================================================
FILE: API-README.md
================================================
## *corpkit*: API readme
> This file is a deprecated introduction to the *corpkit* Python API. It still exists because it contains a lot of useful information and advanced examples that are not found elsewhere. It is deprecated because better documentation is available at [ReadTheDocs](http://corpkit.readthedocs.org/en/latest/).
- [What's in here?](#whats-in-here)
- [`Corpus()`](#corpus)
- [Navigating `Corpus` objects](#navigating-corpus-objects)
- [`interrogate()` method](#interrogate-method)
- [`concordance()` method](#concordance-method)
- [`Interrogation`](#interrogation)
- [`edit()` method](#edit-method)
- [`visualise()` method](#visualise-method)
- [Functions, lists, etc.](#functions-lists-etc)
- [Installation](#installation)
- [By downloading the repository](#by-downloading-the-repository)
- [By cloning the repository](#by-cloning-the-repository)
- [Via `pip`](#via-pip)
- [Quickstart](#quickstart)
- [More detailed examples](#more-detailed-examples)
- [`search`, `exclude` and `show`](#search-exclude-and-show)
- [Working with coreferences](#working-with-coreferences)
- [Building corpora](#building-corpora)
- [Speaker IDs](#speaker-ids)
- [Navigating parsed corpora](#navigating-parsed-corpora)
- [Getting general stats](#getting-general-stats)
- [Concordancing](#concordancing)
- [Systemic functional stuff](#systemic-functional-stuff)
- [Keywording](#keywording)
- [Visualising keywords](#visualising-keywords)
- [Traditional reference corpora](#traditional-reference-corpora)
- [Parallel processing](#parallel-processing)
- [Multiple corpora](#multiple-corpora)
- [Multiple speakers](#multiple-speakers)
- [Multiple queries](#multiple-queries)
- [More complex queries and plots](#more-complex-queries-and-plots)
- [Visualisation options](#visualisation-options)
- [Contact](#contact)
- [Cite](#cite)
<a name="whats-in-here"></a>
## What's in here?
Essentially, the module contains classes, methods and functions for building and interrogating corpora, then manipulating or visualising the results.
<a name="corpus"></a>
### `Corpus()`
First, there's a `Corpus()` class, which models a corpus of CoreNLP XML, lists of tokens, or plaintext files, creating subclasses for subcorpora and corpus files.
To use it, simply feed it a path to a directory containing `.txt` files, or subfolders containing `.txt` files.
```python
>>> from corpkit import Corpus
>>> unparsed = Corpus('path/to/data')
```
With the `Corpus()` class, the following attributes are available:
| Attribute | Purpose |
|-----------|---------|
| `corpus.subcorpora` | list of subcorpus objects with indexing/slicing methods |
| `corpus.features` | Corpus features (characters, clauses, words, tokens, process types, passives, etc.) |
| `corpus.postags` | Distribution of parts of speech |
| `corpus.wordclasses` | Distribution of word classes |
as well as the following methods:
| Method | Purpose |
|--------|---------|
| `corpus.parse()` | Create a parsed version of a plaintext corpus |
| `corpus.tokenise()` | Create a tokenised version of a plaintext corpus |
| `corpus.interrogate()` | Interrogate the corpus for lexicogrammatical features |
| `corpus.concordance()` | Concordance via lexis and/or grammar |
<a name="navigating-corpus-objects"></a>
#### Navigating `Corpus` objects
Once you've defined a Corpus, you can move around it very easily:
```python
### corpus containing annual subcorpora of NYT articles
>>> corpus = Corpus('data/NYT-parsed')
>>> list(corpus.subcorpora)[:3]
### [<corpkit.corpus.Subcorpus instance: 1987>,
### <corpkit.corpus.Subcorpus instance: 1988>,
### <corpkit.corpus.Subcorpus instance: 1989>]
>>> corpus.subcorpora[0].path, corpus.subcorpora[0].datatype
### ('/Users/daniel/Work/risk/data/NYT-parsed/1987', 'parse')
>>> corpus.subcorpora.c1989.files[10:13]
### [<corpkit.corpus.File instance: NYT-1989-01-01-10-1.txt.xml>,
### <corpkit.corpus.File instance: NYT-1989-01-01-10-2.txt.xml>,
### <corpkit.corpus.File instance: NYT-1989-01-01-11-1.txt.xml>]
```
Most attributes, and the `.interrogate()` and `.concordance()` methods, can also be called on `Subcorpus` and `File` objects. `File` objects also have a `.read()` method.
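For illustration, here is a minimal sketch of working at those lower levels (the indices and query are hypothetical; lowercase string keys such as `'w'` can stand in for the search constants introduced below):
```python
### interrogate one subcorpus, concordance one file, and read its raw text
>>> sub = corpus.subcorpora[0]
>>> res = sub.interrogate({'w': r'(?i)^risk'}, show='l')
>>> f = sub.files[0]
>>> lines = f.concordance({'w': r'(?i)^risk'})
>>> print f.read()[:200]
```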
<a name="interrogate-method"></a>
#### `interrogate()` method
* Use [Tregex](http://nlp.stanford.edu/~manning/courses/ling289/Tregex.html), regular expressions or wordlists to search parse trees, dependencies, token lists or plain text for complex lexicogrammatical phenomena
* Search for, exclude and show word, lemma, POS tag, semantic role, governor, dependent, index (etc) of a token
* N-gramming
* Two-way UK-US spelling conversion
* Output Pandas DataFrames that can be easily edited and visualised
* Use parallel processing to search for a number of patterns, or search for the same pattern in multiple corpora
* Restrict searches to particular speakers in a corpus
* Works on collections of corpora, corpora, subcorpora, single files, or slices thereof
* Quickly save to and load from disk with `save()` and `load()`
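As a hedged sketch combining a few of the options above (the query and save name are hypothetical; lowercase string keys work if you haven't imported the shorthand constants):
```python
### search lemmata, show lemma and POS, parallel-process, then store the result
>>> res = corpus.interrogate({'l': r'(?i)^risk'}, show=['l', 'p'], multiprocess=True)
>>> res.save('risk_lemmata')  # retrievable in a later session with load()
```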
<a name="concordance-method"></a>
#### `concordance()` method
* Equivalent to `interrogate()`, but return DataFrame of concordance lines
* Return any combination and order of words, lemmas, indices, functions, or POS tags
* Editable and saveable
* Output to LaTeX, CSV or string with `format()`
The code below demonstrates the complex kinds of queries that can be handled by the `interrogate()` and `concordance()` methods:
```python
### import * mostly so that we can access global variables like G, P, V
### otherwise, use 'w' instead of W, 'p' instead of P, etc.
>>> from corpkit import *
### select parsed corpus
>>> corpus = Corpus('data/postcounts-parsed')
### import process type lists and closed class wordlists
>>> from corpkit.dictionaries import *
### match tokens with governor that is in relational process wordlist,
### and whose function is `nsubj(pass)` or `csubj(pass)`:
>>> criteria = {GL: processes.relational.lemmata, F: r'^.subj'}
### exclude tokens whose part-of-speech is verbal,
### or whose word is in a list of pronouns
>>> exc = {P: r'^V', W: wordlists.pronouns}
# interrogate, returning slash-delimited function/lemma
>>> data = corpus.interrogate(criteria, exclude=exc, show=[F,L])
>>> lines = corpus.concordance(criteria, exclude=exc, show=[F,L])
### show results
>>> print data, lines.format(n=10, window=40, columns=[L,M,R])
```
Output sample:
```
nsubj/thing nsubj/person nsubj/problem nsubj/way nsubj/son
01 296 168 134 69 73
02 233 147 88 70 70
03 250 160 95 80 67
04 247 205 88 93 71
05 275 193 68 75 61
0 nk nsubj/it cop/be ccomp/sad advmod/when nsubj/person aux/do neg/not advcl/look ./at prep_at/w
1 /my dobj/Fluoxetine advmod/now mark/that nsubj/spring ccomp/be advmod/here ./, ./but nsubj/I a
2 y mark/because expl/there advcl/be det/a nsubj/woman ./across det/the prep_across/hall ./from
3 num/114 ccomp/pound ./, mark/so det/any nsubj/med nsubj/I rcmod/take aux/can advcl/have de
4 nsubj/Kat ./, root/be nsubj/you dep/taper ./off ./
5 /to xcomp/explain prep_from/what det/the nsubj/mark ./on poss/my prep_on/arm ./, conj_and/ne
6 det/the amod/first ./and conj_and/third nsubj/hospital nsubj/I rcmod/be advmod/at root/have num
7 e dobj/tv mark/while det/the amod/second nsubj/hospital nsubj/I cop/be rcmod/IP prep/at pcomp/in
8 nsubj/Ben ./, mark/if nsubj/you cop/be advcl/unhap
9 h ./of prep_of/sleep advmod/when det/the nsubj/reality advcl/be ./, nsubj/everyone ccomp/need n
```
<a name="interrogation"></a>
### `Interrogation`
The `corpus.interrogate()` method returns an `Interrogation` object. These have attributes:
| Attribute | Contains |
| ---------------|----------|
| `interrogation.results` | Pandas DataFrame of counts in each subcorpus |
| `interrogation.totals` | Pandas Series of totals for each subcorpus/result |
| `interrogation.query` | a `dict` of values used to generate the interrogation |
and methods:
| Method | Purpose |
|------------|---------|
| `interrogation.edit()` | Get relative frequencies, merge/remove results/subcorpora, calculate keywords, sort using linear regression, etc. |
| `interrogation.visualise()` | visualise results via *matplotlib* |
| `interrogation.save()` | Save data as pickle |
| `interrogation.quickview()` | Show top results and their absolute/relative frequency |
These methods have been monkey-patched to Pandas' DataFrame and Series objects, as well, so any slice of a result can be edited or plotted easily.
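For example, because the patching applies to ordinary pandas objects, a slice of the results can be edited and plotted directly (the column selection below is illustrative):
```python
### take the first five result columns and treat them like an Interrogation
>>> sliced = interrogation.results.iloc[:, :5]
>>> rel = sliced.edit('%', interrogation.totals)
>>> rel.visualise('Top five results', kind='line')
```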
<a name="edit-method"></a>
#### `edit()` method
* Remove, keep or merge interrogation results or subcorpora using indices, words or regular expressions (see below)
* Sort results by name or total frequency
* Use linear regression to figure out the trajectories of results, and sort by the most increasing, decreasing or static values
* Show the *p*-value for linear regression slopes, or exclude results above *p*
* Work with absolute frequency, or determine ratios/percentage of another list:
* determine the total number of verbs, or total number of verbs that are *be*
* determine the percentage of verbs that are *be*
* determine the percentage of *be* verbs that are *was*
* determine the ratio of *was/were* ...
* etc.
* Plot more advanced kinds of relative frequency: for example, find all proper nouns that are subjects of clauses, and plot each word as a percentage of all instances of that word in the corpus (see below)
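Here is a hedged sketch of the *be*-verb examples in the list above (the POS and lemma queries are illustrative):
```python
### percentage of all verbs that are forms of 'be'
>>> verbs = corpus.interrogate({P: r'^VB'}, show=L)
>>> be_share = verbs.edit('%', verbs.totals, just_entries=['be'])
### percentage of 'be' tokens that are 'was'
>>> be_forms = corpus.interrogate({L: r'^be$'}, show=W)
>>> was_share = be_forms.edit('%', be_forms.totals, just_entries=['was'])
```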
<a name="visualise-method"></a>
#### `visualise()` method
* Plot using *Matplotlib*
* Plot anything you like: words, tags, counts for grammatical features ...
* Create line charts, bar charts, pie charts, etc. with the `kind` argument
* Use `subplots=True` to produce individual charts for each result
* Customisable figure titles, axes labels, legends, image size, colormaps, etc.
* Use `TeX` if you have it
* Use log scales if you really want
* Use a number of chart styles, such as `ggplot`, `fivethirtyeight` or `seaborn-talk` (if you've got `seaborn` installed)
* Save images to file, as `.pdf` or `.png`
* Experimental interactive plots (hover-over text, interactive legends) using *mpld3*
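A small sketch combining options that appear elsewhere in this README (the title and sizes are arbitrary):
```python
### bar chart of the top seven results, one panel per entry
>>> interrogation.visualise('Top results', kind='bar', num_to_plot=7,
...        subplots=True, figsize=(10, 5), style='ggplot',
...        y_label='Absolute frequency')
```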
<a name="functions-lists-etc"></a>
### Functions, lists, etc.
There are quite a few helper functions for making regular expressions, making new projects, and so on, with more documentation forthcoming. Also included are some lists of words and dependency roles, which can be used to match functional linguistic categories. These are explained in more detail [here](#systemic-functional-stuff).
<a name="installation"></a>
## Installation
You can get *corpkit* running by downloading or cloning this repository, or via `pip`.
<a name="by-downloading-the-repository"></a>
### By downloading the repository
Hit 'Download ZIP' and unzip the file. Then `cd` into the newly created directory and install:
```shell
cd corpkit-master
# might need sudo:
python setup.py install
```
<a name="by-cloning-the-repository"></a>
### By cloning the repository
Clone the repo, `cd` into it and run the setup:
```shell
git clone https://github.com/interrogator/corpkit.git
cd corpkit
# might need sudo:
python setup.py install
```
<a name="via-pip"></a>
### Via `pip`
```shell
# might need sudo:
pip install corpkit
# or, for a local install:
# pip install --user corpkit
```
*corpkit* should install all the necessary dependencies, including *pandas*, *NLTK*, *matplotlib*, etc, as well as some NLTK data files.
<a name="quickstart"></a>
## Quickstart
Once you've got *corpkit*, and a folder containing text files, you're ready to go:
```python
### import everything
>>> from corpkit import *
### Make corpus object from path to subcorpora/text files
>>> unparsed = Corpus('data/nyt/years')
### parse it, return the new parsed corpus object
>>> corpus = unparsed.parse()
### search corpus for modal auxiliaries and plot the top results
>>> corpus.interroplot('MD')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/md2.png" />
<br>
<a name="more-detailed-examples"></a>
## More detailed examples
`interroplot()` is just a demo method that does three things in order:
1. uses `interrogate()` to search corpus for a (Regex- or Tregex-based) query
2. uses `edit()` to calculate the relative frequencies of each result
3. uses `visualise()` to show the top seven results
Here's an example of the three methods at work:
```python
### make tregex query: head of NP in PP containing 'of' in NP headed by risk word:
>>> q = r'/NN.?/ >># (NP > (PP <<# /(?i)of/ > (NP <<# (/NN.?/ < /(?i).?\brisk.?/))))'
### search trees, exclude 'risk of rain', output lemma
>>> risk_of = corpus.interrogate({T: q}, exclude={W: '^rain$'}, show=L)
### alternative syntax which may be easier when there's only a single search criterion:
# >>> risk_of = corpus.interrogate(T, q, exclude={W: '^rain$'}, show=L)
### use edit() to turn absolute into relative frequencies
>>> to_plot = risk_of.edit('%', risk_of.totals)
### plot the results
>>> to_plot.visualise('Risk of (noun)', y_label='Percentage of all results',
... style='fivethirtyeight')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/risk-of-noun.png" />
<br>
<a name="search-exclude-and-show"></a>
### `search`, `exclude` and `show`
In the example above, parse trees are searched, a particular match is excluded, and lemmata are shown. These three arguments (`search`, `exclude` and `show`) are the core of the `interrogate()` and `concordance()` methods.
The `search` and `exclude` arguments take a `dict`, with the **things to be searched as keys** and the **search patterns as values**. Here is a list of available keys for plaintext, tokenised and parsed corpora:
| Key | Gloss |
|-----|-------|
| `W` | Word |
| `L` | Lemma |
| `I` | Index of token in sentence |
| `N` | N-gram |
For parsed corpora, there are many other possible keys:
| Key | Gloss |
|-----|-------|
| `P` | Part of speech tag |
| `X` | Word class |
| `G` | Governor word |
| `GL` | Governor lemma form |
| `GP` | Governor POS |
| `GF` | Governor function |
| `D` | Dependent word |
| `DL` | Dependent lemma form |
| `DP` | Dependent POS |
| `DF` | Dependent function |
| `F` | Dependency function |
| `R` | Distance from 'root' |
| `T` | Tree |
| `S` | Predefined general stats |
Allowable combinations are subject to common sense. If you're searching trees, you can't also search governors or dependents. If you're searching an unparsed corpus, you can't search for information provided by the parser. Here are some example `search`/`exclude` values:
| search/exclude | Gloss |
|--------|-------|
| `{W: r'^p'}` | Tokens starting with 'p' |
| `{L: r'any'}` | Any lemma (often equivalent to `r'.*'`) |
| `{G: r'ing$'}` | Tokens with governor word ending in 'ing' |
| `{F: funclist}` | Tokens whose dependency function matches a `str` in `funclist` |
| `{D: r'^br', GL: r'^have$'}` | Tokens with dependent starting with 'br' and 'have' as governor lemma |
| `{I: '0', F: '^nsubj$'}` | Sentence initial tokens with role of `nsubj` |
| `{T: r'NP !<<# /NN.?'}` | NPs with non-nominal heads |
If you'd prefer, you can make a `dict` to handle dependent and governor information, instead of using things like `GL` or `DF`. The following searches produce the same output:
```python
>>> crit = {W: r'^friend$',
... D: {F: 'amod',
... W: 'great'}}
>>> crit = {W: r'^friend$', DF: 'amod', D: 'great'}
```
By default, all `search` criteria must match, but any `exclude` criterion is enough to exclude a match. This behaviour can be changed with the `searchmode` and `excludemode` arguments:
```python
### get words that end in 'ing' OR are nominal:
>>> out = corpus.interrogate({W: 'ing$', P: r'^N'}, searchmode='any')
### get any word, but exclude words that end in 'ing' AND are nominal:
>>> out = corpus.interrogate({W: 'any'}, exclude={W: 'ing$', P: r'^N'}, excludemode='all')
```
The `show` argument wants a list of keys you'd like to return for each result. The order will be respected. If you only want one thing, a `str` is OK. One additional possibility is `C`, which returns the number of occurrences only.
| `show` | return |
|--------|--------|
| `W` | `'champions'` |
| `[W]` | `'champions'` |
| `L` | `'champion'` |
| `P` | `'NNS'` |
| `X` | `'Noun'` |
| `T` | `'(np (jj prevailing) (nns champions))'` (depending on Tregex query) |
| `[P, W]` | `'NNS/champions'` |
| `[W, P]` | `'champions/NNS'` |
| `[I, L, R]` | `'2/champion/1'` |
| `[L, D, F]` | `'champion/prevailing/nsubj'` |
| `[G, GL, I]` | `'are/be/2'` |
| `[GL, GF, GP]` | `'be/root/vb'` |
| `[L, L]` | `'champion/champion'` |
| `[C]` | `24` |
Again, common sense dictates what is possible. When searching trees, only trees, words, lemmata, POS and counts can be returned. If showing trees, you can't show anything else. If you use `C`, you can't use anything else.
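For instance, two hedged calls matching rows in the table above (the queries are illustrative):
```python
### slash-joined function and lemma for each match
>>> res = corpus.interrogate({F: r'^nsubj'}, show=[F, L])
### counts only, as in the final row of the table
>>> counts = corpus.interrogate({T: r'NP <<# /NN.?/'}, show=C)
```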
<a name="working-with-coreferences"></a>
## Working with coreferences
One major challenge in corpus linguistics is the fact that pronouns stand in for other words. Parsing provides coreference resolution, which maps pronouns to the things they denote. You can enable this kind of parsing by specifying the `dcoref` annotator:
```python
>>> ops = 'tokenize,ssplit,pos,lemma,parse,ner,dcoref'
>>> parsed = corpus.interrogate(operations=ops)
```
If you have done this, you can use `coref=True` while interrogating to allow coreferents to be mapped together:
```python
>>> corpus.interrogate(query, coref=True)
```
So, if you wanted to find all the processes a certain entity is engaged in, you can get a more complete result with:
```python
>>> from corpkit.dictionaries import roles
>>> corpus.interrogate({W: 'clinton', GF: roles.process}, coref=True)
```
This will count `support` in `Clinton supported the independence of Kosovo`, and also potentially `authorize` in `He authorized the use of force`. You can also toggle the `representative=True` and `non_representative=True` arguments if you want to distinguish between copula and non-copula coreference.
```python
>>> corpus.interrogate({W: 'clinton', GF: roles.process}, coref=True, representative=False)
```
<a name="building-corpora"></a>
## Building corpora
*corpkit*'s `Corpus()` class contains `parse()` and `tokenise()` methods for creating parsed and/or tokenised corpora. The main thing you need is **a folder, containing either text files, or subfolders that contain text files**. [Stanford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml) is required to parse corpora. If you don't have it, *corpkit* can download and install it for you. If you're tokenising, you'll need to make sure you have NLTK's tokeniser data. You can then run:
```python
>>> unparsed = Corpus('path/to/unparsed/files')
### to parse, you can set a path to corenlp
>>> corpus = unparsed.parse(corenlppath='Downloads/corenlp')
### to tokenise, point to nltk:
# >>> corpus = unparsed.tokenise(nltk_data_path='Downloads/nltk_data')
```
which creates the parsed/tokenised corpora, and returns `Corpus()` objects representing them. When parsing, you can also optionally pass in a string of annotators, as per the [CoreNLP documentation](http://nlp.stanford.edu/software/corenlp.shtml):
```python
>>> ans = 'tokenize,ssplit,pos'
### you can also set memory and turn off copula head parsing,
### or multiprocess the parsing job (though you'll want a big machine)
>>> corpus = unparsed.parse(operations=ans, memory_mb=3000,
... copula_head=False, multiprocess=4)
```
<a name="speaker-ids"></a>
#### Speaker IDs
Something novel about *corpkit* is that it can work with corpora containing speaker IDs (scripts, transcripts, logs, etc.), like this:
```
JOHN: Why did they change the signs above all the bins?
SPEAKER23: I know why. But I'm not telling.
```
If you use:
```python
>>> corpus = unparsed.parse(speaker_segmentation=True)
```
This will:
1. Detect any IDs in any file
2. Create a duplicate version of the corpus with IDs removed
3. Parse this 'cleaned' corpus
4. Add an XML tag to each sentence with the name of the speaker
5. Return the parsed corpus as a `Corpus()` object
When interrogating or concordancing, you can then pass in a keyword argument to restrict searches to one or more speakers:
```python
>>> s = ['BRISCOE', 'LOGAN']
>>> npheads = corpus.interrogate(T, r'/NN.?/ >># NP', just_speakers=s)
```
This makes it possible to not only investigate individual speakers, but to form an understanding of the overall tenor/tone of the text as well: *Who does most of the talking? Who is asking the questions? Who issues commands?*
<a name="navigating-parsed-corpora"></a>
### Navigating parsed corpora
When your data is parsed, `Corpus` objects draw on [CoreNLP XML](http://corenlp-xml-library.readthedocs.org/en/latest/) to keep everything seamlessly connected:
```python
>>> corp = Corpus('data/CHT-parsed')
>>> corp.subcorpora['2013'].files[1].document.sentences[4235].parse_string
### '(ROOT (FRAG (CC And) (NP (NP (RB not) (RB just)) (NP (NP (NNP Metrione) ... '
>>> corp.subcorpora['1997'].files[0].document.sentences[3509].tokens[30].word
### 'linguistics'
```
<a name="getting-general-stats"></a>
### Getting general stats
Once you have a parsed `Corpus()` object, enter `corpus.features` to interrogate the corpus for some basic frequencies:
```python
>>> corpus = Corpus('data/sessions-parsed')
>>> corpus.features
```
Output:
```
Characters Tokens Words Closed class words Open class words Clauses Sentences Unmodalised declarative Mental processes Relational processes Interrogative Passives Verbal processes Modalised declarative Open interrogative Imperative Closed interrogative
01 26873 8513 7308 4809 3704 2212 577 280 156 98 76 35 39 26 8 2 3
02 25844 7933 6920 4313 3620 2270 266 130 195 109 29 19 35 11 5 1 3
03 18376 5683 4877 3067 2616 1640 330 174 132 68 30 40 29 8 12 6 1
04 20066 6354 5366 3587 2767 1775 319 174 176 83 33 30 20 9 9 4 1
05 23461 7627 6217 4400 3227 1978 479 245 154 93 45 51 28 20 5 3 1
06 19164 6777 5200 4151 2626 1684 298 111 165 83 43 56 14 10 6 6 2
07 22349 7039 5951 4012 3027 1947 343 183 195 82 29 30 38 12 5 5 0
08 26494 8760 7124 4960 3800 2379 545 263 170 87 66 36 32 10 6 5 4
09 23073 7747 6193 4524 3223 2056 310 149 164 88 21 26 22 10 5 3 0
10 20648 6789 5608 3817 2972 1795 437 265 139 101 34 34 39 18 5 3 2
11 25366 8533 6899 4925 3608 2207 457 230 203 116 39 48 47 15 10 4 0
12 16976 5742 4624 3274 2468 1567 258 135 183 72 23 43 22 4 3 1 6
13 25807 8546 6966 4768 3778 2345 477 257 200 124 45 50 36 15 12 3 2
```
Features such as *relational/mental/verbal* processes are difficult to locate automatically, so these counts are perhaps best seen as approximations. Even so, this data can be very helpful when using `edit()` to generate relative frequencies, for example.
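Since `edit()` is also patched onto pandas objects, one hedged way to do this is to use a `features` column as a denominator; note that the exact indexing of `.features` may differ from this sketch:
```python
### passives per clause, as a percentage (column names from the table above)
>>> feats = corpus.features
>>> passive_rate = feats['Passives'].edit('%', feats['Clauses'])
>>> passive_rate.visualise('Passives per clause', kind='line', y_label='Percentage')
```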
<a name="concordancing"></a>
## Concordancing
Unlike most concordancers, which are based on plaintext corpora, *corpkit* can concordance grammatically, using the same kind of `search`, `exclude` and `show` values as `interrogate()`.
```python
>>> subcorpus = corpus.subcorpora.c2005
### 'c' is added above to make a valid variable name from an int
### can also be accessed as corpus.subcorpora['2005']
### or corpus.subcorpora[index]
>>> query = r'/JJ.?/ > (NP <<# (/NN.?/ < /\brisk/))'
### T option for tree searching
>>> lines = subcorpus.concordance(T, query, window=50, n=10, random=True)
```
Output (a `Pandas DataFrame`):
```
0 hedge funds or high-risk stocks obviously poses a greater risk to the pension program than a portfolio of
1 contaminated water pose serious health and environmental risks
2 a cash break-even pace '' was intended to minimize financial risk to the parent company
3 Other major risks identified within days of the attack
4 One seeks out stocks ; the other monitors risks
5 men and women in Colorado Springs who were at high risk for H.I.V. infection , because of
6 by the marketing consultant Seth Godin , to taking calculated risks , in the opinion of two longtime business
7 to happen '' in premises '' where there was a high risk of fire
8 As this was match points , some of them took a slight risk at the second trick by finessing the heart
9 said that the agency 's continuing review of how Guidant treated patient risks posed by devices like the
```
You can also concordance via dependencies:
```python
### match words starting with 'st' filling function of nsubj
>>> criteria = {W: r'^st', F: r'nsubj$'}
### show function, pos and lemma (in that order)
>>> lines = subcorpus.concordance(criteria, show=[F, P, L])
>>> lines.format(window=30, n=10, columns=[L,M,R])
```
Output:
```
0 ime ./:/; cc/CC/and det/DT/the nsubj/NN/stock conj:and/VBZ/be advmod/RB/hist
1 vmod/RB/even compound/NN/sleep nsubj/NNS/study ./,/, appos/NNS/evaluation cas
2 od:poss/NNS/veteran case/POS/' nsubj/NN/study ccomp/VBZ/suggest mark/IN/that
3 det/DT/a nsubj/NN/study case/IN/in nmod:poss/NN/today
4 cc/CC/but det/DT/the nsubj/NN/study root/VBD/find mark/IN/that cas
5 pound/NN/a amod/JJ/preliminary nsubj/NN/study case/IN/of nmod:of/NNS/woman c
6 case/IN/for nmod:for/WDT/which nsubj/NNS/statistics acl:relcl/VBD/be xcomp/JJ/avai
7 amod/JJR/earlier nsubj/NNS/study aux/VBD/have root/VBN/show mar
8 ay det/DT/the amod/JJR/earlier nsubj/NNS/study aux/VBD/do neg/RB/not ccomp/VB
9 /there root/VBP/be det/DT/some nsubj/NNS/strategy ./:/- dep/JJS/most case/IN/of
```
You can search tokenised corpora or plaintext corpora for regular expressions or lists of words to match. The two queries below will return identical results:
```python
>>> r_query = r'^fr?iends?$'
>>> l_query = ['friend', 'friends', 'fiend', 'fiends']
>>> lines = subcorpus.concordance({W: r_query})
>>> lines = subcorpus.concordance({W: l_query})
```
If you really wanted, you can then go on to use `concordance()` output as a dictionary, or extract keywords and ngrams from it, or keep or remove certain results with `edit()`. If you want to [give the GUI a try](http://interrogator.github.io/corpkit/), you can colour-code and create thematic categories for concordance lines as well.
<a name="systemic-functional-stuff"></a>
## Systemic functional stuff
Because I mostly use systemic functional grammar, there is also a simple tool for distinguishing between process types (relational, mental, verbal) when interrogating a corpus. If you add words to the lists in `dictionaries/process_types.py`, corpkit will get their inflections automatically.
```python
>>> from corpkit.dictionaries import processes
### match nsubj with verbal process as governor
>>> crit = {F: '^nsubj$', G: processes.verbal}
### return lemma of the nsubj
>>> sayers = corpus.interrogate(crit, show=L)
### have a look at the top results
>>> sayers.quickview(n=20)
```
Output:
```
0: he (n=24530)
1: she (n=5558)
2: they (n=5510)
3: official (n=4348)
4: it (n=3752)
5: who (n=2940)
6: that (n=2665)
7: i (n=2062)
8: expert (n=2057)
9: analyst (n=1369)
10: we (n=1214)
11: report (n=1103)
12: company (n=1070)
13: which (n=1043)
14: you (n=987)
15: researcher (n=987)
16: study (n=901)
17: critic (n=826)
18: person (n=802)
19: agency (n=798)
20: doctor (n=770)
```
First, let's try removing the pronouns using `edit()`. The quickest way is to use the editable wordlists stored in `dictionaries/wordlists`:
```python
>>> from corpkit.dictionaries import wordlists
>>> prps = wordlists.pronouns
# alternative approaches:
# >>> prps = [0, 1, 2, 4, 5, 6, 7, 10, 13, 14, 24]
# >>> prps = ['he', 'she', 'you']
# >>> prps = as_regex(wl.pronouns, boundaries='line')
# or, by re-interrogating:
# >>> sayers = corpus.interrogate(crit, show=L, exclude={W: wordlists.pronouns})
### give edit() indices, words, wordlists or regexes to keep remove or merge
>>> sayers_no_prp = sayers.edit(skip_entries=prps, skip_subcorpora=[1963])
>>> sayers_no_prp.quickview(n=10)
```
Output:
```
0: official (n=4342)
1: expert (n=2055)
2: analyst (n=1369)
3: report (n=1098)
4: company (n=1066)
5: researcher (n=987)
6: study (n=900)
7: critic (n=825)
8: person (n=801)
9: agency (n=796)
```
Great. Now, let's sort the entries by trajectory, and then plot:
```python
### sort with edit()
### use scipy.stats.linregress to sort by 'increase', 'decrease', 'static', 'turbulent' or P
### other sort_by options: 'name', 'total', 'infreq'
>>> sayers_no_prp = sayers_no_prp.edit('%', sayers.totals, sort_by='increase')
### make an area chart with custom y label
>>> sayers_no_prp.visualise('Sayers, increasing', kind='area',
... y_label='Percentage of all sayers')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/sayers-increasing.png" />
<br>
We can also merge subcorpora. Let's look for changes in gendered pronouns:
```python
>>> merges = {'1960s': r'^196',
... '1980s': r'^198',
... '1990s': r'^199',
... '2000s': r'^200',
... '2010s': r'^201'}
>>> sayers = sayers.edit(merge_subcorpora=merges)
### now, get relative frequencies for he and she
### SELF calculates percentage after merging/removing etc has been performed,
### so that he and she will sum to 100%. Pass in `sayers.totals` to calculate
### he/she as percentage of all sayers
>>> genders = sayers.edit('%', SELF, just_entries=['he','she'])
### and plot it as a series of pie charts, showing totals on the slices:
>>> genders.visualise('Pronominal sayers in the NYT', kind='pie',
... subplots=True, figsize=(15,2.75), show_totals='plot')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/ann_he_she.png" />
<br>
Woohoo, a decreasing gender divide!
<a name="keywording"></a>
## Keywording
As I see it, there are two main problems with keywording, as typically performed in corpus linguistics. First is the reliance on 'balanced'/'general' reference corpora, which are obviously a fiction. Second is the idea of stopwords. Essentially, when most people calculate keywords, they use stopword lists to automatically filter out words that they think will not be of interest to them. These words are generally closed class words, like determiners, prepositions, or pronouns. This is not a good way to go about things: the relative frequencies of *I*, *you* and *one* can tell us a lot about the kinds of language in a corpus. More seriously, stopwords mean adding subjective judgements about what is interesting language into a process that is useful precisely because it is not subjective or biased.
So, what to do? Well, first, don't use 'general reference corpora' unless you really really have to. With *corpkit*, you can use your entire corpus as the reference corpus, and look for keywords in subcorpora. Second, rather than using lists of stopwords, simply do not send all words in the corpus to the keyworder for calculation. Instead, try looking for key *predicators* (rightmost verbs in the VP), or key *participants* (heads of arguments of these VPs):
```python
### just heads of participants' lemma form (no pronouns, though!)
>>> part = r'/(NN|JJ).?/ >># (/(NP|ADJP)/ $ VP | > VP)'
>>> p = corpus.interrogate(T, part, show=L)
```
When using `edit()` to calculate keywords, there are a few default parameters that can be easily changed:
| Keyword argument | Function | Default setting | Type |
|---|---|---|---|
| `threshold` | Remove words occurring fewer than `n` times in reference corpus | `False` | `'high'`/`'medium'`/`'low'`, `True`/`False`, or `int` |
| `calc_all` | Calculate keyness for words in both reference and target corpus, rather than just target corpus | `True` | `True`/`False` |
| `selfdrop` | Attempt to remove target data from reference data when calculating keyness | `True` | `True`/`False` |
Let's have a look at how these options change the output:
```python
### SELF as reference corpus uses p.results
>>> options = {'selfdrop': False,
... 'calc_all': False,
... 'threshold': False}
>>> for k, v in options.items():
...     key = p.edit('keywords', SELF, **{k: v})
... print key.results.ix['2011'].order(ascending=False)
```
Output:
| #1: default | | #2: no `selfdrop` | | #3: no `calc_all` | | #4: no `threshold` | |
|---|---:|---|---:|---|---:|---|---:|
| risk | 1941.47 | risk | 1909.79 | risk | 1941.47 | bank | 668.19 |
| bank | 1365.70 | bank | 1247.51 | bank | 1365.70 | crisis | 242.05 |
| crisis | 431.36 | crisis | 388.01 | crisis | 431.36 | obama | 172.41 |
| investor | 410.06 | investor | 387.08 | investor | 410.06 | demiraj | 161.90 |
| rule | 316.77 | rule | 293.33 | rule | 316.77 | regulator | 144.91 |
| | ... | | ... | | ... | | ... |
| clinton | -37.80 | tactic | -35.09 | hussein | -25.42 | clinton | -87.33 |
| vioxx | -38.00 | vioxx | -35.29 | clinton | -37.80 | today | -89.49 |
| greenspan | -54.35 | greenspan | -51.38 | vioxx | -38.00 | risky | -125.76 |
| bush | -153.06 | bush | -143.02 | bush | -153.06 | bush | -253.95 |
| yesterday | -162.30 | yesterday | -151.71 | yesterday | -162.30 | yesterday | -268.29 |
As you can see, slight variations on keywording give different impressions of the same corpus!
A key strength of *corpkit*'s approach to keywording is that you can generate new keyword lists without re-interrogating the corpus. We can use some Pandas syntax to do this more quickly.
```python
>>> yrs = ['2011', '2012', '2013', '2014']
>>> keys = p.results.ix[yrs].sum().edit('keywords', p.results.drop(yrs),
... threshold=False)
>>> print keys.results
```
Output:
```
bank 1795.24
obama 722.36
romney 560.67
jpmorgan 527.57
rule 413.94
dimon 389.86
draghi 349.80
regulator 317.82
italy 282.00
crisis 243.43
putin 209.51
greece 208.80
snowden 208.35
mf 192.78
adoboli 161.30
```
... or track the keyness of a set of words over time:
```python
>>> twords = ['terror', 'terrorism', 'terrorist']
>>> terr = p.edit(K, SELF, merge_entries={'terror': twords})
>>> print terr.results.terror
```
Output:
```
1963 -2.51
1987 -3.67
1988 -16.09
1989 -6.24
1990 -16.24
... ...
Name: terror, dtype: float64
```
<a name="visualising-keywords"></a>
### Visualising keywords
Naturally, we can use `visualise()` for our keywords too:
```python
>>> terr.results.terror.visualise('Terror* as Participant in the \emph{NYT}',
...        kind='area', stacked=False, y_label='L/L Keyness')
>>> politicians = ['bush', 'obama', 'gore', 'clinton', 'mccain',
...        'romney', 'dole', 'reagan', 'gorbachev']
>>> k.results[politicians].visualise('Keyness of politicians in the \emph{NYT}',
...        num_to_plot='all', y_label='L/L Keyness', kind='area', legend_pos='center left')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/terror-as-participant-in-the-emphnyt.png" />
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/keyness-of-politicians-in-the-emphnyt.png" />
<br>
<a name="traditional-reference-corpora"></a>
### Traditional reference corpora
If you still want to use a standard reference corpus, you can do that (and a dictionary version of the BNC is included). For the reference corpus, `edit()` recognises `dicts`, `DataFrames`, `Series`, files containing `dicts`, or paths to plain text files or trees.
```python
### arbitrary list of common/boring words
>>> from corpkit.dictionaries import stopwords
>>> print p.results.ix['2013'].edit(K, 'bnc.p', skip_entries=stopwords).results
>>> print p.results.ix['2013'].edit(K, 'bnc.p', calc_all=False).results
```
Output (not so useful):
```
#1 #2
bank 5568.25 bank 5568.25
person 5423.24 person 5423.24
company 3839.14 company 3839.14
way 3537.16 way 3537.16
state 2873.94 state 2873.94
... ...
three -691.25 ten -199.36
people -829.97 bit -205.97
going -877.83 sort -254.71
erm -2429.29 thought -255.72
yeah -3179.90 will -679.06
```
<a name="parallel-processing"></a>
## Parallel processing
`interrogate()` can also do parallel-processing. You can generally improve the speed of an interrogation by setting the `multiprocess` argument:
```python
### set num of parallel processes manually
>>> data = corpus.interrogate({T: r'/NN.?/ >># NP'}, multiprocess=3)
### set num of parallel processes automatically
>>> data = corpus.interrogate({T: r'/NN.?/ >># NP'}, multiprocess=True)
```
Multiprocessing is particularly useful, however, when you are interested in multiple corpora, speaker IDs, or search queries. The sections below explain how.
<a name="multiple-corpora"></a>
#### Multiple corpora
To parallel-process multiple corpora, first, wrap them up as a `Corpora()` object. To do this, you can pass in:
1. a list of paths
2. a list of `Corpus()` objects
3. A single path string that contains corpora
```python
>>> from corpkit.corpus import Corpora
>>> corpora = Corpora('./data') # path containing corpora
>>> corpora
### <corpkit.corpus.Corpora instance: 6 items>
### interrogate by parallel processing, 4 at a time
>>> output = corpora.interrogate(T, r'/NN.?/ < /(?i)^h/', show=L, multiprocess=4)
```
The output of a multiprocessed interrogation will generally be a `dict` with corpus/speaker/query names as keys. The main exception to this is if you use `show=C`, which will concatenate results from each query into a single `Interrogation` object, using corpus/speaker/query names as column names.
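A minimal sketch of handling that dictionary-style output (names follow the example above):
```python
### each value is an ordinary Interrogation, keyed by corpus name
>>> for name, result in output.items():
...     print name, result.totals.sum()
```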
<a name="multiple-speakers"></a>
#### Multiple speakers
Passing in a list of speaker names will also trigger multiprocessing:
```python
>>> from corpkit.dictionaries import wordlists
>>> spkrs = ['MEYER', 'JAY']
>>> each_speaker = corpus.interrogate(W, wordlists.closedclass, just_speakers=spkrs)
```
There is also `just_speakers='each'`, which will be automatically expanded to include every speaker name found in the corpus.
<a name="multiple-queries"></a>
#### Multiple queries
You can also run a number of queries over the same corpus in parallel. There are two ways to do this.
```python
### method one
>>> query = {'Noun phrases': r'NP', 'Verb phrases': r'VP'}
>>> phrases = corpus.interrogate(T, query, show=C)
### method two
>>> query = {'-ing words': {W: r'ing$'}, '-ed verbs': {P: r'^V', W: r'ed$'}}
>>> patterns = corpus.interrogate(query, show=L)
```
Let's try multiprocessing with multiple queries, showing count (i.e. returning a single results DataFrame). We can look at different risk processes (e.g. *risk*, *take risk*, *run risk*, *pose risk*, *put at risk*) using constituency parses:
```python
>>> q = {'risk': r'VP <<# (/VB.?/ < /(?i).?\brisk.?\b/)',
... 'take risk': r'VP <<# (/VB.?/ < /(?i)\b(take|takes|taking|took|taken)+\b/) < (NP <<# /(?i).?\brisk.?\b/)',
... 'run risk': r'VP <<# (/VB.?/ < /(?i)\b(run|runs|running|ran)+\b/) < (NP <<# /(?i).?\brisk.?\b/)',
... 'put at risk': r'VP <<# /(?i)(put|puts|putting)\b/ << (PP <<# /(?i)at/ < (NP <<# /(?i).?\brisk.?/))',
... 'pose risk': r'VP <<# (/VB.?/ < /(?i)\b(pose|poses|posed|posing)+\b/) < (NP <<# /(?i).?\brisk.?\b/)'}
# show=C will collapse results from each search into single dataframe
>>> processes = corpus.interrogate(T, q, show=C)
>>> proc_rel = processes.edit('%', processes.totals)
>>> proc_rel.visualise('Risk processes')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/risk_processes-2.png" />
<br>
<a name="more-complex-queries-and-plots"></a>
## More complex queries and plots
Next, let's find out what kinds of noun lemmas are subjects of any of these risk processes:
```python
### a query to find heads of nps that are subjects of risk processes
>>> query = r'/^NN(S|)$/ !< /(?i).?\brisk.?/ >># (@NP $ (VP <+(VP) (VP ( <<# (/VB.?/ < /(?i).?\brisk.?/) ' \
... r'| <<# (/VB.?/ < /(?i)\b(take|taking|takes|taken|took|run|running|runs|ran|put|putting|puts)/) < ' \
... r'(NP <<# (/NN.?/ < /(?i).?\brisk.?/))))))'
>>> noun_riskers = corpus.interrogate(T, query, show=L)
>>> noun_riskers.quickview(10)
```
Output:
```
0: person (n=195)
1: company (n=139)
2: bank (n=80)
3: investor (n=66)
4: government (n=63)
5: man (n=51)
6: leader (n=48)
7: woman (n=43)
8: official (n=40)
9: player (n=39)
```
We can use `edit()` to make some thematic categories:
```python
### get everyday people
>>> p = ['person', 'man', 'woman', 'child', 'consumer', 'baby', 'student', 'patient']
### get business, gov, institutions
>>> i = ['company', 'bank', 'investor', 'government', 'leader', 'president', 'officer',
... 'politician', 'institution', 'agency', 'candidate', 'firm']
>>> merges = {'Everyday people': p, 'Institutions': i}
>>> them_cat = noun_riskers.edit('%', noun_riskers.totals,
... merge_entries=merges,
... sort_by='total',
... skip_subcorpora=1963,
... just_entries=merges.keys())
### plot result
>>> them_cat.visualise('Types of riskers', y_label='Percentage of all riskers')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/types-of-riskers.png" />
<br>
Let's also find out what percentage of the time some nouns appear as riskers:
```python
### find any head of an np not containing risk
>>> query = r'/NN.?/ >># NP !< /(?i).?\brisk.?/'
>>> noun_lemmata = corpus.interrogate(T, query, show=L)
### get some key terms
>>> people = ['man', 'woman', 'child', 'baby', 'politician',
... 'senator', 'obama', 'clinton', 'bush']
>>> selected = noun_riskers.edit('%', noun_lemmata.results,
... just_entries=people, just_totals=True, threshold=0, sort_by='total')
### make a bar chart:
>>> selected.visualise('Risk and power', num_to_plot='all', kind='bar',
... x_label='Word', y_label='Risker percentage', fontsize=15)
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/risk-and-power-2.png" />
<br>
<a name="visualisation-options"></a>
### Visualisation options
With a bit of creativity, you can do some pretty awesome data-viz, thanks to *Pandas* and *Matplotlib*. The following plots require only one interrogation:
```python
>>> modals = corpus.interrogate(T, 'MD < __', show=L)
### simple stuff: make relative frequencies for individual or total results
>>> rel_modals = modals.edit('%', modals.totals)
### trickier: make an 'others' result from low-total entries
>>> low_indices = range(7, modals.results.shape[1])
>>> each_md = modals.edit('%', modals.totals, merge_entries={'other': low_indices},
... sort_by='total', just_totals=True, keep_top=7)
### complex stuff: merge results
>>> entries_to_merge = [r'(^w|\'ll|\'d)', r'^c', r'^m', r'^sh']
>>> modals = modals.edit(merge_entries=entries_to_merge)
### complex stuff: merge subcorpora
>>> merges = {'1960s': r'^196',
... '1980s': r'^198',
... '1990s': r'^199',
... '2000s': r'^200',
... '2010s': r'^201'}
>>> modals = modals.edit(merge_subcorpora=merges)
### make relative, sort, remove what we don't want
>>> modals = modals.edit('%', modals.totals, keep_stats=False,
... just_subcorpora=merges.keys(), sort_by='total', keep_top=4)
### show results
>>> print rel_modals.results, each_md.results, modals.results
```
Output:
```
would will can could ... need shall dare shalt
1963 22.326833 23.537323 17.955615 6.590451 ... 0.000000 0.537996 0.000000 0
1987 24.750614 18.505132 15.512505 11.117537 ... 0.072286 0.260228 0.014457 0
1988 23.138986 19.257117 16.182067 11.219364 ... 0.091338 0.060892 0.000000 0
... ... ... ... ... ... ... ... ... ...
2012 23.097345 16.283186 15.132743 15.353982 ... 0.029499 0.029499 0.000000 0
2013 22.136269 17.286522 16.349301 15.620351 ... 0.029753 0.029753 0.000000 0
2014 21.618357 17.101449 16.908213 14.347826 ... 0.024155 0.000000 0.000000 0
[29 rows x 17 columns]
would 23.235853
will 17.484034
can 15.844070
could 13.243449
may 9.581255
should 7.292294
other 7.290155
Name: Combined total, dtype: float64
would/will/'ll... can/could/ca may/might/must should/shall/shalt
1960s 47.276395 25.016812 19.569603 7.800941
1980s 44.756285 28.050776 19.224476 7.566817
1990s 44.481957 29.142571 19.140310 6.892708
2000s 42.386571 30.710739 19.182867 7.485681
2010s 42.581666 32.045745 17.777845 7.397044
```
Now, some intense plotting:
```python
### exploded pie chart
>>> each_md.visualise('Pie chart of common modals in the NYT', explode=['other'],
... num_to_plot='all', kind='pie', colours='Accent', figsize=(11,11))
### bar chart, transposing and reversing the data
>>> modals.results.iloc[::-1].T.iloc[::-1].visualise('Modals use by decade', kind='barh',
... x_label='Percentage of all modals', y_label='Modal group')
### stacked area chart
>>> rel_modals.results.drop('1963').visualise('An ocean of modals', kind='area',
... stacked=True, colours='summer', figsize =(8,10), num_to_plot='all',
... legend_pos='lower right', y_label='Percentage of all modals')
```
Output:
<p align="center">
<img src="https://raw.githubusercontent.com/interrogator/risk/master/images/pie-chart-of-common-modals-in-the-nyt2.png" height="400" width="400"/>
<img src="https://raw.githubusercontent.com/interrogator/risk/master/images/modals-use-by-decade.png" height="230" width="500"/>
<img src="https://raw.githubusercontent.com/interrogator/risk/master/images/an-ocean-of-modals2.png" height="600" width="500"/>
</p>
<a name="contact"></a>
## Contact
Twitter: [@interro_gator](https://twitter.com/interro_gator)
<a name="cite"></a>
## Cite
> `McDonald, D. (2015). corpkit: a toolkit for corpus linguistics. Retrieved from https://www.github.com/interrogator/corpkit. DOI: http://doi.org/10.5281/zenodo.28361`
================================================
FILE: Dockerfile
================================================
FROM alpine:latest
MAINTAINER interro_gator
# set up a workspace so we can cache python stuff
RUN rm -rf /.src && mkdir /.src
COPY requirements.txt /.src/requirements.txt
# add corenlp
# COPY ~/corenlp /.src
# use the workspace for everything
WORKDIR /.src
# install the basics
RUN apk add --update \
python3 \
python-dev \
py-pip \
build-base \
git \
libpng \
freetype \
pkgconf \
libxft-dev \
libxml2-dev \
readline
# install java for parsing
RUN apk --update add openjdk8-jre-base
# needed for numpy
RUN ln -s /usr/include/locale.h /usr/include/xlocale.h
RUN ln -s /usr/include/libxml2/libxml/xmlversion.h /usr/include/xmlversion.h
RUN mkdir /usr/include/libxml
RUN ln -s /usr/include/libxml2/libxml/xmlversion.h /usr/include/libxml/xmlversion.h
RUN ln -s /usr/include/libxml2/libxml/xmlexports.h /usr/include/xmlexports.h
RUN ln -s /usr/include/libxml2/libxml/xmlexports.h /usr/include/libxml/xmlexports.h
# stop pip from complaining
RUN pip install --upgrade pip
# python heavyweight stuff
RUN pip install cython
RUN pip install numpy
RUN pip install colorama
# remove old stuff --- not sure it does much
RUN rm -rf /var/cache/apk/*
# get matplotlib github version
RUN git clone git://github.com/matplotlib/matplotlib.git
RUN cd matplotlib && python setup.py install && cd ..
# install corpkit requirements
RUN pip install -r requirements.txt
RUN pip install docker-py
# add everything from corpkit to working dir
COPY . /.src
# install corpkit itself
RUN python /.src/setup.py install
# download might be needed for licence issues
#RUN python -m corpkit.download.corenlp /
CMD python -m corpkit.env docker=corpkit
WORKDIR /projects
================================================
FILE: LICENSE
================================================
The MIT License (MIT)
Copyright (c) 2015 Daniel McDonald
mcdonaldd, at, unimelb.edu
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: Makefile
================================================
# Makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = _build
# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest coverage gettext
help:
@echo "Please use \`make <target>' where <target> is one of"
@echo " html to make standalone HTML files"
@echo " dirhtml to make HTML files named index.html in directories"
@echo " singlehtml to make a single large HTML file"
@echo " pickle to make pickle files"
@echo " json to make JSON files"
@echo " htmlhelp to make HTML files and a HTML help project"
@echo " qthelp to make HTML files and a qthelp project"
@echo " applehelp to make an Apple Help Book"
@echo " devhelp to make HTML files and a Devhelp project"
@echo " epub to make an epub"
@echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
@echo " latexpdf to make LaTeX files and run them through pdflatex"
@echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
@echo " text to make text files"
@echo " man to make manual pages"
@echo " texinfo to make Texinfo files"
@echo " info to make Texinfo files and run them through makeinfo"
@echo " gettext to make PO message catalogs"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " xml to make Docutils-native XML files"
@echo " pseudoxml to make pseudoxml-XML files for display purposes"
@echo " linkcheck to check all external links for integrity"
@echo " doctest to run all doctests embedded in the documentation (if enabled)"
@echo " coverage to run coverage check of the documentation (if enabled)"
clean:
rm -rf $(BUILDDIR)/*
html:
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
dirhtml:
$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
singlehtml:
$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
@echo
@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
pickle:
$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
@echo
@echo "Build finished; now you can process the pickle files."
json:
$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
@echo
@echo "Build finished; now you can process the JSON files."
htmlhelp:
$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
@echo
@echo "Build finished; now you can run HTML Help Workshop with the" \
".hhp project file in $(BUILDDIR)/htmlhelp."
qthelp:
$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
@echo
@echo "Build finished; now you can run "qcollectiongenerator" with the" \
".qhcp project file in $(BUILDDIR)/qthelp, like this:"
@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/corpkit.qhcp"
@echo "To view the help file:"
@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/corpkit.qhc"
applehelp:
$(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp
@echo
@echo "Build finished. The help book is in $(BUILDDIR)/applehelp."
@echo "N.B. You won't be able to view it unless you put it in" \
"~/Library/Documentation/Help or install it in your application" \
"bundle."
devhelp:
$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
@echo
@echo "Build finished."
@echo "To view the help file:"
@echo "# mkdir -p $$HOME/.local/share/devhelp/corpkit"
@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/corpkit"
@echo "# devhelp"
epub:
$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
@echo
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
latex:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."
latexpdf:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through pdflatex..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
latexpdfja:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through platex and dvipdfmx..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
text:
$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
@echo
@echo "Build finished. The text files are in $(BUILDDIR)/text."
man:
$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
@echo
@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
texinfo:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo
@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
@echo "Run \`make' in that directory to run these through makeinfo" \
"(use \`make info' here to do that automatically)."
info:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo "Running Texinfo files through makeinfo..."
make -C $(BUILDDIR)/texinfo info
@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
gettext:
$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
@echo
@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
changes:
$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
@echo
@echo "The overview file is in $(BUILDDIR)/changes."
linkcheck:
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo
@echo "Link check complete; look for any errors in the above output " \
"or in $(BUILDDIR)/linkcheck/output.txt."
doctest:
$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
@echo "Testing of doctests in the sources finished, look at the " \
"results in $(BUILDDIR)/doctest/output.txt."
coverage:
$(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage
@echo "Testing of coverage in the sources finished, look at the " \
"results in $(BUILDDIR)/coverage/python.txt."
xml:
$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
@echo
@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
pseudoxml:
$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
@echo
@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
================================================
FILE: README.md
================================================
# corpkit: sophisticated corpus linguistics
[Gitter chat](https://gitter.im/interrogator/corpkit?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) · [DOI](https://zenodo.org/badge/latestdoi/14568/interrogator/corpkit) · [Travis CI](https://travis-ci.org/interrogator/corpkit) · [PyPI](https://pypi.python.org/pypi/corpkit) · [Documentation](http://corpkit.readthedocs.org/en/latest/) · [Docker Hub](https://hub.docker.com/r/interrogator/corpkit/) · [Anaconda](https://anaconda.org/interro_gator/corpkit)
## **NOTICE: corpkit is now deprecated and unmaintained. It is superseded by [`buzz`](https://github.com/interrogator/buzz), which is better in every way.**
> **corpkit** is a module for doing more sophisticated corpus linguistics. It links state-of-the-art natural language processing technologies to functional linguistic research aims, allowing you to easily build, search and visualise grammatically annotated corpora in novel ways.
The basic workflow involves making corpora, parsing them, and searching them. The results of searches are [CONLL-U formatted](http://universaldependencies.org/format.html) files, represented as [pandas](http://pandas.pydata.org/) objects, which can be edited, visualised or exported in a lot of ways. The tool has three interfaces, each with its own documentation:
1. [A Python API](http://corpkit.readthedocs.io)
2. [A natural language interpreter](http://corpkit.readthedocs.io/en/latest/rst_docs/interpreter/corpkit.interpreter.overview.html)
3. [A graphical interface](http://interrogator.github.io/corpkit/)
A quick demo for each interface is provided in this document.
## Feature summary
From all three interfaces, you can do a lot of neat things. In general:
### Parsing
> Corpora are stored as `Corpus` objects, with methods for viewing, parsing, interrogating and concordancing.
* A very simple wrapper around the full Stanford CoreNLP pipeline
* Automatically add annotations, speaker names and metadata to parser output
* Detect speaker names and make these into metadata features
* Multiprocessing
* Store dependency parsed texts, parse trees and metadata in CONLL-U format
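For instance, a minimal API sketch of this parsing step might look like the following. The `parse` method and its `speaker_segmentation`, `metadata` and `multiprocess` keyword arguments are assumed here, mirroring the interpreter command shown later in this README:
```python
>>> from corpkit import Corpus
### an unparsed corpus: a folder of (subfolders of) text files
>>> unparsed = Corpus('data/chapters')
### wrap CoreNLP, keeping speaker names and metadata, with two processes
>>> parsed = unparsed.parse(speaker_segmentation=True, metadata=True, multiprocess=2)
```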
### Interrogating corpora
> Interrogating a corpus produces an `Interrogation` object, with results as Pandas DataFrame attributes.
* Search corpora using regular expressions, wordlists, CQL, Tregex, or a rich, purpose built dependency searching syntax
* Interrogate any dataset in CONLL-U format (e.g. [the Universal Dependencies Treebanks](https://github.com/UniversalDependencies))
* Collocation, n-gramming
* Restrict searches by metadata feature
* Use metadata as symbolic subcorpora
* Choose what search results return: show any combination of words, lemmata, POS, indices, distance from root node, syntax tree, etc.
* Generate concordances alongside interrogations
* Work with coreference annotation
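As a quick sketch (a fuller API demo appears later in this README), matching every noun and showing lemma forms might look like the example below. `L` (lemma) is used in that later demo; `P` is assumed here to be the part-of-speech column name:
```python
>>> from corpkit import *
>>> corp = Corpus('chapters-parsed')
### match tokens whose POS begins with N, showing lemma forms
>>> nouns = corp.interrogate(search={P: r'^N'}, show=[L])
>>> nouns.results
```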
### Editing results
> `Interrogation` objects have `edit`, `visualise` and `save` methods, to name just a few. Editing creates a new `Interrogation` object.
* Quickly delete, sort, merge entries and subcorpora
* Make relative frequencies (e.g. calculate results as percentage of all words/clauses/nouns ...)
* Use linear regression sorting to find increasing, decreasing, turbulent or static trajectories
* Calculate p values, etc.
* Keywording
* Simple multiprocessing available for parsing and interrogating
* Results are Pandas objects, so you can do fast, good statistical work on them
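Continuing the noun search sketched above, a typical edit might look like this. The `'%'` operation and `SELF` appear in the API demo later in this README; `skip_subcorpora` and `sort_by='increase'` are assumptions, mirroring the interpreter's skip and sort commands:
```python
### drop two subcorpora, convert to relative frequencies,
### and sort by increasing trajectory (the sort needs scipy)
>>> edited = nouns.edit('%', SELF,
...                     skip_subcorpora=['chapter1', 'chapter2'],
...                     sort_by='increase')
```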
### Visualising results
> The `visualise` method of `Interrogation` objects uses matplotlib and seaborn if installed to produce high quality figures.
* Many chart types
* Easily customise titles, axis labels, colours, sizes, number of results to show, etc.
* Make subplots
* Save figures in a number of formats
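A hedged sketch of plotting the edited result from above, with the `kind`, `x_label` and `y_label` keyword arguments assumed to mirror the interpreter's plot command:
```python
>>> plt = edited.visualise('Common nouns', kind='line',
...                        x_label='Chapter', y_label='Frequency')
>>> plt.show()
```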
### Concordancing
> When interrogating a corpus, concordances are also produced, allowing you to check that your query matched what you intended.
* Colour, sort, delete lines using regular expressions
* Recalculate results from edited concordance lines (great for removing false positives)
* Format lines for publication with TeX
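A rough sketch of the recalculation workflow, where the `concordance` attribute name is an assumption, and `format` and `calculate` are the helpers patched onto pandas objects in `corpkit/__init__.py`:
```python
### concordance lines are generated alongside the interrogation
>>> lines = nouns.concordance
### pretty-print the lines
>>> lines.format()
### after deleting false positives, rebuild the result from what is left
>>> fixed = lines.calculate()
```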
### Other stuff
* Language modelling
* Save and load results, images, concordances
* Export data to other tools
* Switch between API, GUI and interpreter whenever you like
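For example, saving and loading a result (the `save` method and `load` function are exported by `corpkit`; the name `'nouns'` is just an example):
```python
>>> from corpkit import load
### store the interrogation inside the project
>>> nouns.save('nouns')
### get it back in a later session
>>> nouns = load('nouns')
```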
## Installation
Via pip:
```shell
pip install corpkit
```
Via Git:
```shell
git clone https://github.com/interrogator/corpkit
cd corpkit
python setup.py install
```
Via Anaconda:
```shell
conda install -c interro_gator corpkit
```
## Creating a project
Once you've got everything installed, you'll want to create a project---this is just a folder hierarchy that stores your corpora, saved results, figures and so on. You can do this in a number of ways:
### Shell
```shell
new_project junglebook
cp -R chapters junglebook/data
```
### Interpreter
```shell
> new project named junglebook
> add ../chapters
```
### Python
```python
>>> import shutil
>>> from corpkit import new_project
>>> new_project('junglebook')
>>> shutil.copytree('../chapters', 'junglebook/data')
```
You can create projects and add data via the file menu of the graphical interface as well.
## Ways to use *corpkit*
As explained earlier, there are three ways to use the tool. Each has unique strengths and weaknesses. To summarise them, the Python API is the most powerful, but has the steepest learning curve. The GUI is the least powerful, but easy to learn (though it is still arguably the most powerful linguistics GUI available). The interpreter strikes a happy middle ground, especially for those who are not familiar with Python.
## Interpreter
The first way to use *corpkit* is by entering its natural language interpreter. To activate it, use the `corpkit` command:
```shell
$ cd junglebook
$ corpkit
```
You'll get a lovely new prompt into which you can type commands:
```none
corpkit@junglebook:no-corpus>
```
Generally speaking, it has the comforts of home, such as history, search, backslash line breaking, variable creation and `ls` and `cd` commands. As in `IPython`, any command beginning with an exclamation mark will be executed by the shell. You can also write scripts and execute them with `corpkit script.ck`, or `./script.ck` if you have a shebang.
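A small script might look like the example below (saved here as `findings.ck`; the shebang line assumes `corpkit` is on your `PATH`). Every command in it appears elsewhere in this README:
```shell
#!/usr/bin/env corpkit
# findings.ck: run a search non-interactively
set chapters-parsed as corpus
search corpus for pos matching '^N' showing lemma
call result nouns
```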
### Making projects and parsing corpora
```shell
# make new project
> new project named junglebook
# add folder of (subfolders of) text files
> add '../chapters'
# specify corpus to work on
> set chapters as corpus
# parse the corpus
> parse corpus with speaker_segmentation and metadata and multiprocess as 2
```
### Searching and concordancing
```shell
# search and exclude
> search corpus for governor-function matching 'root' \
... excluding governor-lemma matching 'be'
# show pos, lemma, index, (e.g. 'NNS/thing/3')
> search corpus for pos matching '^N' showing pos and lemma and index
# further arguments and dynamic structuring
> search corpus for word matching any \
... with subcorpora as pagenum and preserve_case
# show concordance lines
> show concordance with window as 50 and columns as LMR
# colouring concordances
> mark m matching 'have' blue
# recalculate results
> calculate result from concordance
```
### Variables, editing results
```shell
# variable naming
> call result root_deps
# skip some numerical subcorpora
> edit root_deps by skipping subcorpora matching [1,2,3,4,5]
# make relative frequencies
> calculate edited as percentage of self
# use scipy to calculate trends and sort by them
> sort edited by decrease
```
### Visualise edited results
```shell
> plot edited as line chart \
... with x_label as 'Subcorpus' and \
... y_label as 'Frequency' and \
... colours as 'summer'
```
### Switching interfaces
```shell
# open graphical interface
> gui
# enter ipython with current namespace
> ipython
# use a new/existing jupyter notebook
> jupyter notebook findings.ipynb
```
## API
Straight Python is the most powerful way to use *corpkit*, because you can manipulate results with Pandas syntax, construct loops, make recursive queries, and so on. Here are some simple examples of the API syntax:
### Instantiate and search a parsed corpus
```python
### import everything
>>> from corpkit import *
>>> from corpkit.dictionaries import *
### instantiate corpus
>>> corp = Corpus('chapters-parsed')
### search for any participant whose governor
### is a process, excluding closed class words, and
### showing lemma forms. also, generate a concordance.
>>> sch = {GF: roles.process, F: roles.actor}
>>> part = corp.interrogate(search=sch,
... exclude={W: wordlists.closedclass},
... show=[L],
... conc=True)
```
You get an `Interrogation` object back, with a `results` attribute that is a Pandas DataFrame:
```
daisy gatsby tom wilson eye man jordan voice michaelis \
chapter1 13 2 6 0 3 3 0 2 0
chapter2 1 0 12 10 1 1 0 0 0
chapter3 0 3 0 0 3 8 6 1 0
chapter4 6 9 2 0 1 3 1 1 0
chapter5 8 14 0 0 3 3 0 2 0
chapter6 7 14 9 0 1 2 0 3 0
chapter7 26 20 35 10 12 3 16 9 5
chapter8 5 4 1 10 2 2 0 1 10
chapter9 1 1 1 0 3 3 1 1 0
```
### Edit and visualise the result
Below, we make normalised frequencies and plot:
```python
### make relative frequencies and sort (the sort requires scipy)
>>> part = part.edit('%', SELF, sort_by='increase')
### make line subplots for the first nine results
>>> plt = part.visualise('Processes, increasing', subplots=True, layout=(3,3))
>>> plt.show()
```
<img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/actors.png" width="450">
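Because `results` is an ordinary DataFrame, plain pandas operations also work on it, which is what makes loops and further statistics straightforward. A small illustration, assuming nothing corpkit-specific beyond the `results` attribute shown above:
```python
### total mentions of each participant across all chapters
>>> part.results.sum().sort_values(ascending=False).head()
### loop over subcorpora, reporting the most frequent participant in each
>>> for chapter, row in part.results.iterrows():
...     print(chapter, row.idxmax())
```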
There are also some [more detailed API examples over here](https://github.com/interrogator/corpkit/blob/master/API-README.md). This document is fairly thorough, but now deprecated, because the official docs are now over at [ReadTheDocs](http://corpkit.readthedocs.io/en/latest/).
## Example figures
<p align="center"> <i>
<img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/inc-proc.png" width="350"> <img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/best-derived.png" width="350">
<br>Shifting register of scientific English<br>
<br><br>
<img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/symlog-part2.png" width="310"> <img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/process-types-for-part-types.png" width="390">
<br>Participants and processes in online forum talk<br>
<br><br>
<img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/risk-and-power-2.png" width="370"> <img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/mood-role-risk.png" width="330">
<br>Riskers and mood role of risk words in print news journalism<br>
</i></p>
## Graphical interface
Screenshots coming soon! For now, just head [here](http://interrogator.github.io/corpkit/).
## Contact
Twitter: [@interro_gator](https://twitter.com/interro_gator)
## Cite
> `McDonald, D. (2015). corpkit: a toolkit for corpus linguistics. Retrieved from https://www.github.com/interrogator/corpkit. DOI: http://doi.org/10.5281/zenodo.28361`
================================================
FILE: bld.bat
================================================
"%PYTHON%" setup.py install
if errorlevel 1 exit 1
:: Add more build steps here, if they are necessary.
:: See
:: http://docs.continuum.io/conda/build.html
:: for a list of environment variables that are set during the build process.
================================================
FILE: build.sh
================================================
#!/bin/bash
$PYTHON setup.py install
# Add more build steps here, if they are necessary.
# See
# http://docs.continuum.io/conda/build.html
# for a list of environment variables that are set during the build process.
================================================
FILE: conf.py
================================================
# -*- coding: utf-8 -*-
#
# corpkit documentation build configuration file, created by
# sphinx-quickstart on Thu Nov 5.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
from sphinx.highlighting import PygmentsBridge
from pygments.formatters.latex import LatexFormatter
class CustomLatexFormatter(LatexFormatter):
def __init__(self, **options):
super(CustomLatexFormatter, self).__init__(**options)
self.verboptions = r"formatcom=\footnotesize"
PygmentsBridge.latex_formatter = CustomLatexFormatter
import sys
import os
import shlex
from recommonmark.parser import CommonMarkParser
source_parsers = {
'.md': CommonMarkParser,
}
source_suffix = ['.rst', '.md']
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#sys.path.insert(0, os.path.abspath('.'))
sys.path.insert(0,"/Users/daniel/work/corpkit/corpkit")
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.viewcode',
'alabaster'
]
# Napoleon settings (all default)
#napoleon_google_docstring = True
#napoleon_numpy_docstring = True
#napoleon_include_init_with_doc = False
#napoleon_include_private_with_doc = False
#napoleon_include_special_with_doc = False
#napoleon_use_admonition_for_examples = False
#napoleon_use_admonition_for_notes = False
#napoleon_use_admonition_for_references = False
#napoleon_use_ivar = False
#napoleon_use_param = True
#napoleon_use_rtype = True
#napoleon_use_keyword = True
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
# source_suffix = ['.rst', '.md']
# The encoding of source files.
#source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = u'corpkit'
copyright = u'2016, Daniel McDonald'
author = u'Daniel McDonald'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '2.3.8'
# The full version, including alpha/beta/rc tags.
release = '2.3.8'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['_build', '*/build.py']
# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []
# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
# import alabaster
#
# html_theme_path = [alabaster.get_path()]
# html_theme = 'alabaster'
# html_sidebars = {
# '**': [
# 'about.html',
# 'navigation.html',
# 'relations.html',
# 'searchbox.html',
# 'donate.html',
# ]
# }
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}
# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
html_logo = 'images/alpha_gator_small.png'
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
html_last_updated_fmt = '%b %d, %Y'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}
# If false, no module index is generated.
#html_domain_indices = True
# If false, no index is generated.
#html_use_index = True
# If true, the index is split into individual pages for each letter.
#html_split_index = False
# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
html_show_sphinx = False
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None
# Language to be used for generating the HTML full-text search index.
# Sphinx supports the following languages:
# 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja'
# 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr'
#html_search_language = 'en'
# A dictionary with options for the search language support, empty by default.
# Now only 'ja' uses this config value
#html_search_options = {'type': 'default'}
# The name of a javascript file (relative to the configuration directory) that
# implements a search results scorer. If empty, the default will be used.
#html_search_scorer = 'scorer.js'
# Output file base name for HTML help builder.
htmlhelp_basename = 'corpkitdoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
'papersize': 'a4paper',
# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
# This should help with line breaks in code cells
'preamble': '\\setcounter{tocdepth}{3} \\usepackage{pmboxdraw}',
# Latex figure (float) alignment
'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'corpkit.tex', u'corpkit documentation',
u'Daniel McDonald', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
latex_logo = 'images/alpha_gator_small.png'
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False
# If true, show page references after internal links.
#latex_show_pagerefs = False
# If true, show URL addresses after external links.
#latex_show_urls = False
# Documents to append as an appendix to all manuals.
#latex_appendices = []
# If false, no module index is generated.
#latex_domain_indices = True
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'corpkit', u'corpkit documentation',
[author], 1)
]
# If true, show URL addresses after external links.
#man_show_urls = False
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'corpkit', u'corpkit documentation',
author, 'corpkit', 'Corpus linguistic tools.',
'Linguistics'),
]
# Documents to append as an appendix to all manuals.
#texinfo_appendices = []
# If false, no module index is generated.
#texinfo_domain_indices = True
# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'
# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False
autodoc_member_order = 'bysource'
================================================
FILE: corpkit/__init__.py
================================================
"""
A toolkit for corpus linguistics
"""
from __future__ import print_function
#metadata
__version__ = "2.3.8"
__author__ = "Daniel McDonald"
__license__ = "MIT"
# probably not needed anymore, but adds corpkit to path for tregex.sh
import sys
import os
import inspect
from corpkit.constants import LETTERS
# asterisk import
__all__ = [
"load",
"loader",
"load_all_results",
"as_regex",
"new_project",
"Corpus",
"File",
"Corpora",
"gui"] + LETTERS
corpath = inspect.getfile(inspect.currentframe())
baspat = os.path.dirname(corpath)
#dicpath = os.path.join(baspat, 'dictionaries')
for p in [corpath, baspat]:
if p not in sys.path:
sys.path.append(p)
if p not in os.environ["PATH"].split(':'):
os.environ["PATH"] += os.pathsep + p
# import classes
from corpkit.corpus import Corpus, File, Corpora
#from corpkit.model import MultiModel
from corpkit.other import (load, loader, load_all_results,
quickview, as_regex, new_project)
from corpkit.lazyprop import lazyprop
#from corpkit.dictionaries.process_types import Wordlist
from corpkit.process import gui
# monkeypatch editing and plotting to pandas objects
from pandas import DataFrame, Series
# monkey patch functions
def _plot(self, *args, **kwargs):
from corpkit.plotter import plotter
return plotter(self, *args, **kwargs)
def _edit(self, *args, **kwargs):
from corpkit.editor import editor
return editor(self, *args, **kwargs)
def _save(self, savename, **kwargs):
from corpkit.other import save
save(self, savename, **kwargs)
def _quickview(self, n=25):
from corpkit.other import quickview
quickview(self, n=n)
def _format(self, *args, **kwargs):
from corpkit.other import concprinter
concprinter(self, *args, **kwargs)
def _texify(self, *args, **kwargs):
from corpkit.other import texify
texify(self, *args, **kwargs)
def _calculate(self, *args, **kwargs):
from corpkit.process import interrogation_from_conclines
return interrogation_from_conclines(self)
def _multiplot(self, leftdict={}, rightdict={}, **kwargs):
from corpkit.plotter import multiplotter
return multiplotter(self, leftdict=leftdict, rightdict=rightdict, **kwargs)
def _perplexity(self):
"""
Pythonification of the formal definition of perplexity.
input: a sequence of chances (any iterable will do)
output: perplexity value.
from https://github.com/zeffii/NLP_class_notes
"""
def _perplex(chances):
import math
chances = [i for i in chances if i]
N = len(chances)
product = 1
for chance in chances:
product *= chance
        # perplexity = (p1 * p2 * ... * pN) ** (-1/N); float division keeps Python 2 correct
        return math.pow(product, -1.0 / N)
return self.apply(_perplex, axis=1)
def _entropy(self):
"""
    Entropy for each subcorpus, e.g.
    entropy(pos.edit(merge_entries=mergetags, sort_by='total').results.T)
"""
from scipy.stats import entropy
import pandas as pd
escores = entropy(self.edit('/', SELF).results.T)
ser = pd.Series(escores, index=self.index)
ser.name = 'Entropy'
return ser
def _shannon(self):
from corpkit.stats import shannon
return shannon(self)
def _shuffle(self, inplace=False):
import random
index = list(self.index)
random.shuffle(index)
shuffled = self.ix[index]
shuffled.reset_index()
if inplace:
self = shuffled
else:
return shuffled
def _top(self):
"""Show as many rows and cols as possible without truncation"""
import pandas as pd
max_row = pd.options.display.max_rows
max_col = pd.options.display.max_columns
return self.iloc[:max_row, :max_col]
def _tabview(self, **kwargs):
import pandas as pd
import tabview
tabview.view(self, **kwargs)
def _rel(self, denominator='self', **kwargs):
from corpkit.editor import editor
return editor(self, '%', denominator, **kwargs)
def _keyness(self, measure='ll', denominator='self', **kwargs):
from corpkit.editor import editor
return editor(self, 'k', denominator, **kwargs)
def _plain(df):
return ' '.join(df['w'])
# monkey patching things
DataFrame.entropy = _entropy
DataFrame.perplexity = _perplexity
DataFrame.shannon = _shannon
DataFrame.edit = _edit
Series.edit = _edit
DataFrame.rel = _rel
Series.rel = _rel
DataFrame.keyness = _keyness
Series.keyness = _keyness
DataFrame.visualise = _plot
Series.visualise = _plot
DataFrame.tabview = _tabview
DataFrame.multiplot = _multiplot
Series.multiplot = _multiplot
DataFrame.save = _save
Series.save = _save
DataFrame.quickview = _quickview
Series.quickview = _quickview
DataFrame.format = _format
Series.format = _format
Series.texify = _texify
DataFrame.calculate = _calculate
Series.calculate = _calculate
DataFrame.shuffle = _shuffle
DataFrame.top = _top
DataFrame.plain = _plain
# Defining letters
module = sys.modules[__name__]
for letter in LETTERS:
if not letter.isalpha():
trans = letter.replace('A', '-', 1).replace('Z', '+', 1).lower()
else:
trans = letter.lower()
setattr(module, letter, trans)
# other methods:
# globals()[letter] = letter.lower()
# exec('%s = "%s"' % (letter, letter.lower()))
ANYWORD = r'[A-Za-z0-9:_]'
================================================
FILE: corpkit/annotate.py
================================================
"""
corpkit: add annotations to conll-u via concordancing
"""
def process_special_annotation(v, lin):
"""
If the user wants a fancy annotation, like 'add middle column',
this gets processed here. it's potentially the place where the
user could add entropy score, or something like that.
"""
if v.lower() not in ['i', 'index', 'm', 'scheme', 't', 'q']:
return v
if v == 'index':
return lin.name
elif v in ['m', 't']:
return str(lin[v])
else:
return v
def make_string_to_add(annotation, lin, replace=False):
"""
Make a string representing metadata to add
"""
from corpkit.constants import STRINGTYPE
if isinstance(annotation, STRINGTYPE):
if replace:
return annotation + '\n'
else:
return '# tags=' + annotation + '\n'
start = str()
for k, v in annotation.items():
# these are special names---add more?
v = process_special_annotation(v, lin)
if replace:
start = '%s\n' % v
else:
start += '# %s=%s\n' % (k, v)
return start
def get_line_number_for_entry(data, si, ti, annotation):
"""
Find the place in filename at which to add the string
"""
partstart = '# sent_id %d' % si
partend = '# sent_id %d' % (si + 1)
# this way iterates over the lines
# it could also just find the
lnum = data.split(partstart)[0].count('\n') + 2
sent = data.split(partstart)[1].split(partend)[0]
field = 'tags' if isinstance(annotation, str) else list(annotation.keys())[0]
ixx = next((i for i, l in enumerate(sent.splitlines()) \
if l.startswith('# %s=' % field)), False)
if ixx is False:
return lnum, False
else:
return lnum + ixx - 2, True
def update_contents(contents, place, text, do_replace=False):
"""
Open file, read lines, add or replace the line with the good one
"""
if do_replace:
contents[place] = contents[place].rstrip('\n').replace(text + ';', '') + ';' + text
else:
contents.insert(place, text)
return contents
def dry_run_text(filepath, contents, place, colours):
"""
Show a dry run of what the annotations would be
"""
import os
contents[place] = contents[place].rstrip('\n') + ' <==========\n'
try:
contents[place] = colours['green'] + contents[place] + colours['reset']
except:
pass
max_lines = next((i for i, l in enumerate(contents[place:]) if l == '\n'), 10)
max_lines = 30 if max_lines > 30 else max_lines
formline = ' Add metadata: %s \n' % (os.path.basename(filepath))
bars = '=' * len(formline)
print(bars + '\n' + formline + bars)
print(''.join(contents[place-3:max_lines+place]))
def annotate(open_file, contents):
"""
Add annotation to a single file
"""
from corpkit.constants import PYTHON_VERSION
contents = ''.join(contents)
if PYTHON_VERSION == 2:
contents = contents.encode('utf-8', errors='ignore')
open_file.seek(0)
open_file.write(contents)
open_file.truncate()
def delete_lines(corpus, annotation, dry_run=True, colour={}):
"""
Show or delete the necessary lines
"""
from corpkit.constants import OPENER, PYTHON_VERSION
import re
import os
tagmode = True
no_can_do = ['sent_id', 'parse']
if isinstance(annotation, dict):
tagmode = False
for k, v in annotation.items():
if k in no_can_do:
print("You aren't allowed to delete '%s', sorry." % k)
return
if not v:
v = r'.*?'
regex = re.compile(r'(# %s=%s)\n' % (k, v), re.MULTILINE)
else:
if annotation in no_can_do:
print("You aren't allowed to delete '%s', sorry." % k)
return
regex = re.compile(r'((# tags=.*?)%s;?(.*?))\n' % annotation, re.MULTILINE)
fs = []
for (root, dirs, fls) in os.walk(corpus):
for f in fls:
fs.append(os.path.join(root, f))
for f in fs:
if PYTHON_VERSION == 2:
from corpkit.process import saferead
data = saferead(f)[0]
else:
with open(f, 'rb') as fo:
data = fo.read().decode('utf-8', errors='ignore')
if dry_run:
if tagmode:
repl_str = r'\1 <=======\n%s\2\3 <=======\n' % colour.get('green', '')
else:
repl_str = r'\1 <=======\n'
try:
repl_str = colour['red'] + repl_str + colour['reset']
except:
pass
data, n = re.subn(regex, repl_str, data)
nspl = 100 if tagmode else 50
delim = '<======='
data = re.split(delim, data, maxsplit=nspl)
toshow = delim.join(data[:nspl+1])
toshow = toshow.rsplit('\n\n', 1)[0]
print(toshow)
if n > 50:
n = n - 50
print('\n... and %d more changes ... ' % n)
else:
if tagmode:
repl_str = r'\2\3\n'
else:
repl_str = ''
data = re.sub(regex, repl_str, data)
with OPENER(f, 'w') as fo:
from corpkit.constants import PYTHON_VERSION
if PYTHON_VERSION == 2:
data = data.encode('utf-8', errors='ignore')
fo.write(data)
def annotator(df_or_corpus, annotation, dry_run=True, deletemode=False):
"""
Run the annotator pipeline over multiple files
:param corpus: a Corpus object containing the files
:param annotation: a str or dict containing annotation text
"""
import re
import os
from corpkit.constants import OPENER, STRINGTYPE, PYTHON_VERSION
colour = {}
try:
from colorama import Fore, init, Style
init(autoreset=True)
colour = {'green': Fore.GREEN, 'reset': Style.RESET_ALL, 'red': Fore.RED}
except ImportError:
pass
if deletemode:
delete_lines(df_or_corpus.path, annotation, dry_run=dry_run, colour=colour)
return
file_sent_words = df_or_corpus.reset_index()[['index', 'f', 'i']].values.tolist()
from collections import defaultdict
outt = defaultdict(list)
for index, fn, ix in file_sent_words:
s, i = ix.split(',', 1)
outt[fn].append((int(s), int(i), index))
for i, (fname, entries) in enumerate(sorted(outt.items()), start=1):
with OPENER(fname, 'r+') as fo:
data = fo.read()
contents = [i + '\n' for i in data.split('\n')]
for si, ti, index in list(reversed(sorted(set(entries)))):
line_num, do_replace = get_line_number_for_entry(data, si, ti, annotation)
anno_text = make_string_to_add(annotation, df_or_corpus.ix[index], replace=do_replace)
contents = update_contents(contents, line_num, anno_text, do_replace=do_replace)
if dry_run and i < 50:
dry_run_text(fname,
contents,
line_num,
colours=colour)
if not dry_run:
annotate(fo, contents=contents)
if not dry_run:
print('%d annotations made in %s' % (len(entries), fname))
if dry_run and i > 50:
break
if dry_run:
if len(file_sent_words) > 50:
n = len(file_sent_words) - 50
print('... and %d more changes ... ' % n)
================================================
FILE: corpkit/blanknotebook.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# blanknotebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialisation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, import `corpkit`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import corpkit\n",
"from corpkit import (\n",
" interrogator, plotter, table, quickview, \n",
" tally, surgeon, merger, conc, keywords, \n",
" collocates, multiquery, report_display,\n",
" save_result, load_result\n",
" )\n",
"# show figures in browser\n",
"% matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, set a path to your corpus:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"corpus = 'data/corpus'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a query to match any word:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# any token containing letters or numbers (i.e. no punctuation):\n",
"allwords_query = r'/[A-Za-z0-9]/ !< __' "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Interrogate the corpus with the `allwords_query`, and store the results as `allwords`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"allwords = interrogator(annual_trees, '-C', allwords_query) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that it worked:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"print allwords.query\n",
"print allwords.totals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, plot something:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"plotter('Word count', allwords.total)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, save this result so that you can access it any time:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"save_result(allwords, 'allwords')\n",
"\n",
"# load it again with:\n",
"# allwords = load_result('allwords')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use the space below to interrogate and plot whatever you like!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
================================================
FILE: corpkit/build.py
================================================
from __future__ import print_function
from corpkit.constants import STRINGTYPE, PYTHON_VERSION, INPUTFUNC
"""
This file contains a number of functions used in the corpus building process.
None of them is intended to be called by the user him/herself.
"""
def download_large_file(proj_path, url, actually_download=True, root=False, **kwargs):
"""
Download something to proj_path, unless it's CoreNLP, which goes to ~/corenlp
"""
import os
import shutil
import glob
import zipfile
from time import localtime, strftime
from corpkit.textprogressbar import TextProgressBar
from corpkit.process import animator
file_name = url.split('/')[-1]
home = os.path.expanduser("~")
customdir = kwargs.get('custom_corenlp_dir', False)
# if it's corenlp, put it in home/corenlp
# if that dir exists, check if for a zip file
# if there's a zipfile and it works, move on
# if there's a zipfile and it's broken, delete it
if 'stanford' in url:
if customdir:
downloaded_dir = customdir
else:
downloaded_dir = os.path.join(home, 'corenlp')
if not os.path.isdir(downloaded_dir):
os.makedirs(downloaded_dir)
else:
poss_zips = glob.glob(os.path.join(downloaded_dir, 'stanford-corenlp-full*.zip'))
if poss_zips:
fullfile = poss_zips[-1]
from zipfile import BadZipfile
try:
the_zip_file = zipfile.ZipFile(fullfile)
ret = the_zip_file.testzip()
if ret is None:
return downloaded_dir, fullfile
else:
os.remove(fullfile)
except BadZipfile:
os.remove(fullfile)
#else:
# shutil.rmtree(downloaded_dir)
else:
downloaded_dir = os.path.join(proj_path, 'temp')
try:
os.makedirs(downloaded_dir)
except OSError:
pass
fullfile = os.path.join(downloaded_dir, file_name)
if actually_download:
import __main__ as main
if not root and not hasattr(main, '__file__'):
txt = 'CoreNLP not found. Download latest version (%s)? (y/n) ' % url
selection = INPUTFUNC(txt)
if 'n' in selection.lower():
return None, None
try:
import requests
# NOTE the stream=True parameter
r = requests.get(url, stream=True, verify=False)
file_size = int(r.headers['content-length'])
file_size_dl = 0
block_sz = 8192
showlength = file_size / block_sz
thetime = strftime("%H:%M:%S", localtime())
print('\n%s: Downloading ... \n' % thetime)
par_args = {'printstatus': kwargs.get('printstatus', True),
'length': showlength}
if not root:
tstr = '%d/%d' % (file_size_dl + 1 / block_sz, showlength)
p = animator(None, None, init=True, tot_string=tstr, **par_args)
animator(p, file_size_dl + 1, tstr)
with open(fullfile, 'wb') as f:
for chunk in r.iter_content(chunk_size=block_sz):
if chunk: # filter out keep-alive new chunks
f.write(chunk)
file_size_dl += len(chunk)
#print file_size_dl * 100.0 / file_size
if kwargs.get('note'):
kwargs['note'].progvar.set(file_size_dl * 100.0 / int(file_size))
else:
tstr = '%d/%d' % (file_size_dl / block_sz, showlength)
animator(p, file_size_dl / block_sz, tstr, **par_args)
if root:
root.update()
except Exception as err:
import traceback
print(traceback.format_exc())
thetime = strftime("%H:%M:%S", localtime())
print('%s: Download failed' % thetime)
try:
f.close()
except:
pass
if root:
root.update()
return None, None
if kwargs.get('note'):
kwargs['note'].progvar.set(100)
else:
p.animate(int(file_size))
thetime = strftime("%H:%M:%S", localtime())
        print('\n%s: Downloaded successfully.' % thetime)
try:
f.close()
except:
pass
return downloaded_dir, fullfile
def extract_cnlp(fullfilepath, corenlppath=False, root=False):
"""
Extract corenlp zip file
"""
import zipfile
import os
from time import localtime, strftime
time = strftime("%H:%M:%S", localtime())
print('%s: Extracting CoreNLP files ...' % time)
if root:
root.update()
if corenlppath is False:
home = os.path.expanduser("~")
corenlppath = os.path.join(home, 'corenlp')
from zipfile import BadZipfile
try:
with zipfile.ZipFile(fullfilepath) as zf:
zf.extractall(corenlppath)
except BadZipfile:
        # the downloaded archive is corrupt; remove it so it can be fetched again
        os.remove(fullfilepath)
return False
time = strftime("%H:%M:%S", localtime())
print('%s: CoreNLP extracted. ' % time)
return True
def get_corpus_filepaths(projpath=False, corpuspath=False,
restart=False, out_ext='conll'):
"""
get a list of filepaths, a la find . -type f
restart mode will look in restart dir and remove any existing files
"""
import fnmatch
import os
matches = []
# get a list of done files minus their paths and extensions
# this handles if they have been moved to the right dir or not
already_done = get_filepaths(restart, out_ext) if restart else []
already_done = [os.path.splitext(os.path.basename(x))[0] for x in already_done]
for root, dirnames, filenames in os.walk(corpuspath):
for filename in fnmatch.filter(filenames, '*.txt'):
if filename not in already_done:
matches.append(os.path.join(root, filename))
if len(matches) == 0:
return False, False
matchstring = '\n'.join(matches)
# maybe not good:
if projpath is False:
projpath = os.path.dirname(os.path.abspath(corpuspath.rstrip('/')))
corpname = os.path.basename(corpuspath)
fp = os.path.join(projpath, 'data', corpname + '-filelist.txt')
# definitely not good.
if os.path.join('data', 'data') in fp:
fp = fp.replace(os.path.join('data', 'data'), 'data')
with open(fp, "w") as f:
f.write(matchstring + '\n')
return fp, matchstring
def check_jdk():
"""
Check for a Java/OpenJDK
"""
import corpkit
import subprocess
from subprocess import PIPE, STDOUT, Popen
# add any other version string to here
javastrings = ['java version "1.8', 'openjdk version "1.8']
p = Popen(["java", "-version"], stdout=PIPE, stderr=PIPE)
_, stderr = p.communicate()
encoded = stderr.decode(encoding='utf-8').lower()
return any(j in encoded for j in javastrings)
def parse_corpus(proj_path=False,
corpuspath=False,
filelist=False,
corenlppath=False,
operations=False,
root=False,
stdout=False,
memory_mb=2000,
copula_head=True,
multiprocessing=False,
outname=False,
coref=True,
**kwargs
):
"""
Create a CoreNLP-parsed and/or NLTK tokenised corpus
"""
import subprocess
from subprocess import PIPE, STDOUT, Popen
from corpkit.process import get_corenlp_path
import os
import sys
import re
import chardet
from time import localtime, strftime
import time
fileparse = kwargs.get('fileparse', False)
from corpkit.constants import CORENLP_URL as url
if not check_jdk():
        print('Java 8 (1.8) or OpenJDK 8 is needed to run CoreNLP. Please install it and try again.')
return
curdir = os.getcwd()
note = kwargs.get('note', False)
if proj_path is False:
proj_path = os.path.dirname(os.path.abspath(corpuspath.rstrip('/')))
basecp = os.path.basename(corpuspath)
if fileparse:
new_corpus_path = os.path.dirname(corpuspath)
else:
if outname:
new_corpus_path = os.path.join(proj_path, 'data', outname)
else:
new_corpus_path = os.path.join(proj_path, 'data', '%s-parsed' % basecp)
new_corpus_path = new_corpus_path.replace('-stripped-', '-')
# todo:
# this is not stable
if os.path.join('data', 'data') in new_corpus_path:
new_corpus_path = new_corpus_path.replace(os.path.join('data', 'data'), 'data')
# this caused errors when multiprocessing
# it used to be isdir, but supposedly there was a file there
# i don't see how it's possible ...
# i think it is a 'race condition', so we'll also put a try/except there
if not os.path.exists(new_corpus_path):
try:
os.makedirs(new_corpus_path)
except OSError:
pass
else:
if not os.path.isfile(new_corpus_path):
fs = get_filepaths(new_corpus_path, ext=False)
if not multiprocessing:
if any([f.endswith('.conll') for f in fs]) or \
any([f.endswith('.conllu') for f in fs]):
print('Folder containing .conll files already exists: %s' % new_corpus_path)
return False
corenlppath = get_corenlp_path(corenlppath)
success = bool(corenlppath)
if not corenlppath:
from corpkit.constants import CORENLP_VERSION
print("CoreNLP not found. Auto-installing CoreNLP v%s..." % CORENLP_VERSION)
cnlp_dir = os.path.join(os.path.expanduser("~"), 'corenlp')
corenlppath, fpath = download_large_file(cnlp_dir, url,
root=root,
note=note,
actually_download=True,
custom_corenlp_dir=corenlppath)
# cleanup
if corenlppath is None and fpath is None:
import shutil
shutil.rmtree(new_corpus_path)
shutil.rmtree(new_corpus_path.replace('-parsed', '-stripped'))
os.remove(new_corpus_path.replace('-parsed', '-filelist.txt'))
raise ValueError('CoreNLP needed to parse texts.')
success = extract_cnlp(fpath)
if not success:
raise ValueError('CoreNLP installation failed for some reason. Try deleting the ~/corenlp directory and starting over.')
import glob
globpath = os.path.join(corenlppath, 'stanford-corenlp*')
corenlppath = [i for i in glob.glob(globpath) if os.path.isdir(i)]
if corenlppath:
corenlppath = corenlppath[-1]
else:
raise ValueError('CoreNLP installation failed for some reason. Try deleting the ~/corenlp directory and starting over.')
# if not gui, don't mess with stdout
if stdout is False:
stdout = sys.stdout
os.chdir(corenlppath)
if root:
root.update_idletasks()
# not sure why reloading sys, but seems needed
# in order to show files in the gui
try:
reload(sys)
except NameError:
import importlib
importlib.reload(sys)
pass
if memory_mb is False:
memory_mb = 2024
# you can pass in 'coref' as kwarg now
cof = ',dcoref' if coref else ''
if operations is False:
operations = 'tokenize,ssplit,pos,lemma,parse,ner' + cof
if isinstance(operations, list):
operations = ','.join([i.lower() for i in operations])
with open(filelist, 'r') as fo:
dat = fo.read()
num_files_to_parse = len([l for l in dat.splitlines() if l])
# get corenlp version number
reg = re.compile(r'stanford-corenlp-([0-9].[0-9].[0-9])-javadoc.jar')
fver = next(re.search(reg, s).group(1) for s in os.listdir('.') if re.search(reg, s))
if fver == '3.6.0':
extra_jar = 'slf4j-api.jar:slf4j-simple.jar:'
else:
extra_jar = ''
out_form = 'xml' if kwargs.get('output_format') == 'xml' else 'json'
out_ext = 'xml' if kwargs.get('output_format') == 'xml' else 'conll'
arglist = ['java', '-cp',
'stanford-corenlp-%s.jar:stanford-corenlp-%s-models.jar:xom.jar:joda-time.jar:%sjollyday.jar:ejml-0.23.jar' % (fver, fver, extra_jar),
'-Xmx%sm' % str(memory_mb),
'edu.stanford.nlp.pipeline.StanfordCoreNLP',
'-annotators',
operations,
'-filelist', filelist,
'-noClobber',
'-outputExtension', '.%s' % out_ext,
'-outputFormat', out_form,
'-outputDirectory', new_corpus_path]
if copula_head:
arglist.append('--parse.flags')
arglist.append(' -makeCopulaHead')
print('Java command:')
print(arglist)
try:
proc = subprocess.Popen(arglist, stdout=sys.stdout)
# maybe a problem with stdout. sacrifice it if need be
except:
proc = subprocess.Popen(arglist)
#p = TextProgressBar(num_files_to_parse)
while proc.poll() is None:
sys.stdout = stdout
thetime = strftime("%H:%M:%S", localtime())
if not fileparse:
num_parsed = len([f for f in os.listdir(new_corpus_path) if f.endswith(out_ext)])
if num_parsed == 0:
if root:
print('%s: Initialising parser ... ' % (thetime))
if num_parsed > 0 and (num_parsed + 1) <= num_files_to_parse:
if root:
print('%s: Parsing file %d/%d ... ' % \
(thetime, num_parsed + 1, num_files_to_parse))
if kwargs.get('note'):
kwargs['note'].progvar.set((num_parsed) * 100.0 / num_files_to_parse)
#p.animate(num_parsed - 1, str(num_parsed) + '/' + str(num_files_to_parse))
time.sleep(1)
if root:
root.update()
#p.animate(num_files_to_parse)
if kwargs.get('note'):
kwargs['note'].progvar.set(100)
sys.stdout = stdout
thetime = strftime("%H:%M:%S", localtime())
print('%s: Parsing finished. Moving parsed files into place ...' % thetime)
os.chdir(curdir)
return new_corpus_path
def move_parsed_files(proj_path, old_corpus_path, new_corpus_path,
ext='conll', restart=False):
"""
Make parsed files follow existing corpus structure
"""
import corpkit
import shutil
import os
import fnmatch
cwd = os.getcwd()
basecp = os.path.basename(old_corpus_path)
dir_list = []
# go through old path, make file list
for path, dirs, files in os.walk(old_corpus_path):
for bit in dirs:
# is the last bit of the line below windows safe?
dir_list.append(os.path.join(path, bit).replace(old_corpus_path, '')[1:])
for d in dir_list:
if not restart:
os.makedirs(os.path.join(new_corpus_path, d))
else:
try:
os.makedirs(os.path.join(new_corpus_path, d))
except OSError:
pass
# make list of parsed filenames that haven't been moved already
parsed_fs = [f for f in os.listdir(new_corpus_path) if f.endswith('.%s' % ext)]
# make a dictionary of the right paths
pathdict = {}
for rootd, dirnames, filenames in os.walk(old_corpus_path):
for filename in fnmatch.filter(filenames, '*.txt'):
pathdict[filename] = rootd
# move each file
for f in parsed_fs:
noxml = f.replace('.%s' % ext, '')
right_dir = pathdict[noxml].replace(old_corpus_path, new_corpus_path)
frm = os.path.join(new_corpus_path, f)
tom = os.path.join(right_dir, f)
# forgive errors on restart mode, because some files
# might already have been moved into place
if restart:
try:
os.rename(frm, tom)
except OSError:
pass
else:
os.rename(frm, tom)
return new_corpus_path
def corenlp_exists(corenlppath=False):
import corpkit
import os
from corpkit.constants import CORENLP_VERSION
important_files = ['stanford-corenlp-%s-javadoc.jar' % CORENLP_VERSION,
'stanford-corenlp-%s-models.jar' % CORENLP_VERSION,
'stanford-corenlp-%s-sources.jar' % CORENLP_VERSION,
'stanford-corenlp-%s.jar' % CORENLP_VERSION]
if corenlppath is False:
home = os.path.expanduser("~")
corenlppath = os.path.join(home, 'corenlp')
if os.path.isdir(corenlppath):
find_install = [d for d in os.listdir(corenlppath) \
if os.path.isdir(os.path.join(corenlppath, d)) \
and os.path.isfile(os.path.join(corenlppath, d, 'jollyday.jar'))]
if len(find_install) > 0:
find_install = find_install[0]
else:
return False
javalib = os.path.join(corenlppath, find_install)
if len(javalib) == 0:
return False
if not any([f.endswith('-models.jar') for f in os.listdir(javalib)]):
return False
return True
else:
return False
return True
def get_filepaths(a_path, ext='txt'):
"""
Make list of txt files in a_path and remove non txt files
"""
import os
files = []
if os.path.isfile(a_path):
return [a_path]
for (root, dirs, fs) in os.walk(a_path):
for f in fs:
if ext:
if not f.endswith('.' + ext):
continue
if 'Unidentified' not in f \
and 'unknown' not in f \
and not f.startswith('.'):
files.append(os.path.join(root, f))
#if ext:
# if not f.endswith('.' + ext):
# os.remove(os.path.join(root, f))
return files
def make_no_id_corpus(pth, newpth, metadata_mode=False, speaker_segmentation=False):
"""
Make version of pth without ids
"""
import os
import re
import shutil
from corpkit.process import saferead
# define regex broadly enough to accept timestamps, locations if need be
from corpkit.constants import MAX_SPEAKERNAME_SIZE
idregex = re.compile(r'(^.{,%d}?):\s+(.*$)' % MAX_SPEAKERNAME_SIZE)
try:
shutil.copytree(pth, newpth)
except OSError:
shutil.rmtree(newpth)
shutil.copytree(pth, newpth)
files = get_filepaths(newpth)
names = []
metadata = []
for f in files:
good_data = []
fo, enc = saferead(f)
data = fo.splitlines()
# for each line in the file, remove speaker and metadata
for datum in data:
if speaker_segmentation:
matched = re.search(idregex, datum)
if matched:
names.append(matched.group(1))
datum = matched.group(2)
if metadata_mode:
splitmet = datum.rsplit('<metadata ', 1)
# for the impossibly rare case of a line that is '<metadata '
if not splitmet:
continue
datum = splitmet[0]
if datum:
good_data.append(datum)
with open(f, "w") as fo:
if PYTHON_VERSION == 2:
fo.write('\n'.join(good_data).encode('utf-8'))
else:
fo.write('\n'.join(good_data))
if speaker_segmentation:
from time import localtime, strftime
thetime = strftime("%H:%M:%S", localtime())
if len(names) == 0:
print('%s: No speaker names found. Turn off speaker segmentation.' % thetime)
shutil.rmtree(newpth)
else:
try:
if len(sorted(set(names))) < 19:
print('%s: Speaker names found: %s' % (thetime, ', '.join(sorted(set(names)))))
else:
print('%s: Speaker names found: %s ... ' % (thetime, ', '.join(sorted(set(names[:20])))))
except:
pass
def get_all_metadata_fields(corpus, include_speakers=False):
"""
Get a list of metadata fields in a corpus
    This could take a while for very little information
"""
from corpkit.corpus import Corpus
from corpkit.constants import OPENER, PYTHON_VERSION, MAX_METADATA_FIELDS
# allow corpus object
if not isinstance(corpus, Corpus):
corpus = Corpus(corpus, print_info=False)
if not corpus.datatype == 'conll':
return []
path = getattr(corpus, 'path', corpus)
fs = []
import os
for root, dirnames, filenames in os.walk(path):
for filename in filenames:
fs.append(os.path.join(root, filename))
badfields = ['parse', 'sent_id']
if not include_speakers:
badfields.append('speaker')
fields = set()
for f in fs:
if PYTHON_VERSION == 2:
from corpkit.process import saferead
lines = saferead(f)[0].splitlines()
else:
with OPENER(f, 'rb') as fo:
lines = fo.read().decode('utf-8', errors='ignore')
lines = lines.strip('\n')
lines = lines.splitlines()
lines = [l[2:].split('=', 1)[0] for l in lines if l.startswith('# ') \
if not l.startswith('# sent_id')]
for l in lines:
if l not in fields and l not in badfields:
fields.add(l)
if len(fields) > MAX_METADATA_FIELDS:
break
return list(fields)
def get_names(filepath, speakid):
"""
Get a list of speaker names from a file
"""
import re
from corpkit.process import saferead
txt, enc = saferead(filepath)
res = re.findall(speakid, txt)
if res:
return sorted(list(set([i.strip() for i in res])))
def get_speaker_names_from_parsed_corpus(corpus, feature='speaker'):
"""
Use regex to get speaker names from parsed data without parsing it
"""
import os
import re
from corpkit.constants import MAX_METADATA_VALUES
path = corpus.path if hasattr(corpus, 'path') else corpus
list_of_files = []
names = []
    # MULTILINE is needed so that ^ matches at the start of every line,
    # not just at the start of the whole file
speakid = re.compile(r'^# %s=(.*)' % re.escape(feature), re.MULTILINE)
# if passed a dir, do it for every file
if os.path.isdir(path):
for (root, dirs, fs) in os.walk(path):
for f in fs:
list_of_files.append(os.path.join(root, f))
elif os.path.isfile(path):
list_of_files.append(path)
for filepath in list_of_files:
res = get_names(filepath, speakid)
if not res:
continue
for i in res:
if i not in names:
names.append(i)
if len(names) > MAX_METADATA_VALUES:
break
return list(sorted(set(names)))
def rename_all_files(dirs_to_do):
"""
Get rid of the inserted dirname in filenames after parsing
"""
import os
if isinstance(dirs_to_do, STRINGTYPE):
dirs_to_do = [dirs_to_do]
for d in dirs_to_do:
if d.endswith('-parsed'):
ext = 'txt.xml'
elif d.endswith('-tokenised'):
ext = '.p'
else:
ext = '.txt'
fs = get_filepaths(d, ext)
for f in fs:
fname = os.path.basename(f)
justdir = os.path.dirname(f)
subcorpus = os.path.basename(justdir)
newname = fname.replace('-%s.%s' % (subcorpus, ext), '.%s' % ext)
os.rename(f, os.path.join(justdir, newname))
def flatten_treestring(tree):
"""
Turn bracketed tree string into something looking like English
"""
import re
tree = re.sub(r'\(.*? ', '', tree).replace(')', '')
tree = tree.replace('$ ', '$').replace('`` ', '``').replace(' ,', ',').replace(' .', '.').replace("'' ", "''").replace(" n't", "n't").replace(" 're","'re").replace(" 'm","'m").replace(" 's","'s").replace(" 'd","'d").replace(" 'll","'ll").replace(' ', ' ')
return tree
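# For example (a sketch, not from the source), flatten_treestring("(NP (DT the) (NN dog))")
# returns 'the dog': labelled opening brackets are stripped by the regex, closing
# brackets are removed, and spacing around punctuation and clitics is tidied up.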
def can_folderise(folder):
"""
Check if corpus can be put into folders
"""
import os
from glob import glob
if os.path.isfile(folder):
return False
fs = glob(os.path.join(folder, '*.txt'))
if len(fs) > 1:
if not any(os.path.isdir(x) for x in glob(os.path.join(folder, '*'))):
return True
return False
def folderise(folder):
"""
Move each file into a folder
"""
import os
import shutil
from glob import glob
from corpkit.process import makesafe
fs = glob(os.path.join(folder, '*.txt'))
for f in fs:
newname = makesafe(os.path.splitext(os.path.basename(f))[0])
newpath = os.path.join(folder, newname)
if not os.path.exists(newpath):
os.makedirs(newpath)
shutil.move(f, os.path.join(newpath))
================================================
FILE: corpkit/completer.py
================================================
class Completer(object):
"""
Tab completion for interpreter
"""
def __init__(self, words):
self.words = words
self.prefix = None
def complete(self, prefix, index):
"""
Add paths etc to this
"""
if prefix != self.prefix:
# we have a new prefix!
# find all words that start with this prefix
self.matching_words = [
w for w in self.words if w.startswith(prefix)
]
self.prefix = prefix
try:
return self.matching_words[index]
except IndexError:
return None
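# A minimal usage sketch (not in the original file), assuming the standard-library
# readline module; the word list here is invented for illustration:
#   import readline
#   readline.set_completer(Completer(['search', 'show', 'exit']).complete)
#   readline.parse_and_bind('tab: complete')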
================================================
FILE: corpkit/configurations.py
================================================
def configurations(corpus, search, **kwargs):
"""
Get summary of behaviour of a word
see corpkit.corpus.Corpus.configurations() for docs
"""
from corpkit.dictionaries.wordlists import wordlists
from corpkit.dictionaries.roles import roles
from corpkit.interrogation import Interrodict
from corpkit.interrogator import interrogator
from collections import OrderedDict
if search.get('l') and search.get('w'):
raise ValueError('Search only for a word or a lemma, not both.')
# are we searching words or lemmata?
if search.get('l'):
dep_word_or_lemma = 'dl'
gov_word_or_lemma = 'gl'
word_or_token = search.get('l')
    elif search.get('w'):
        dep_word_or_lemma = 'd'
        gov_word_or_lemma = 'g'
        word_or_token = search.get('w')
# make nested query dicts for each semantic role
queries = {'participant':
{'left_participant_in':
{dep_word_or_lemma: word_or_token,
'df': roles.participant1,
'f': roles.event},
'right_participant_in':
{dep_word_or_lemma: word_or_token,
'df': roles.participant2,
'f': roles.event},
'premodified':
{'f': roles.premodifier,
gov_word_or_lemma: word_or_token},
'postmodified':
{'f': roles.postmodifier,
gov_word_or_lemma: word_or_token},
'and_or':
{'f': 'conj:(?:and|or)',
'gf': roles.participant,
gov_word_or_lemma: word_or_token},
},
'process':
{'has_subject':
{'f': roles.participant1,
gov_word_or_lemma: word_or_token},
'has_object':
{'f': roles.participant2,
gov_word_or_lemma: word_or_token},
'modalised_by':
{'f': r'aux',
'w': wordlists.modals,
gov_word_or_lemma: word_or_token},
'modulated_by':
{'f': 'advmod',
'gf': roles.event,
gov_word_or_lemma: word_or_token},
'and_or':
{'f': 'conj:(?:and|or)',
'gf': roles.event,
gov_word_or_lemma: word_or_token},
},
'modifier':
{'modifies':
{'df': roles.modifier,
dep_word_or_lemma: word_or_token},
'modulated_by':
{'f': 'advmod',
'gf': roles.modifier,
gov_word_or_lemma: word_or_token},
'and_or':
{'f': 'conj:(?:and|or)',
'gf': roles.modifier,
gov_word_or_lemma: word_or_token},
}
}
# allow passing in of single function
if search.get('f'):
if search.get('f').lower().startswith('part'):
queries = queries['participant']
elif search.get('f').lower().startswith('proc'):
queries = queries['process']
elif search.get('f').lower().startswith('mod'):
queries = queries['modifier']
else:
newqueries = {}
for k, v in queries.items():
for name, pattern in v.items():
newqueries[name] = pattern
queries = newqueries
queries['and_or'] = {'f': 'conj:(?:and|or)', gov_word_or_lemma: word_or_token}
# count all queries to be done
# total_queries = 0
# for k, v in queries.items():
# total_queries += len(v)
kwargs['search'] = queries
# do interrogation
data = corpus.interrogate(**kwargs)
# remove result itself
# not ideal, but it's much more impressive this way.
if isinstance(data, Interrodict):
for k, v in data.items():
v.results = v.results.drop(word_or_token, axis=1, errors='ignore')
v.totals = v.results.sum(axis=1)
data[k] = v
return Interrodict(data)
else:
return data
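# Usage sketch (illustrative only; the corpus path is hypothetical and kwargs
# are passed straight through to Corpus.interrogate):
#   from corpkit.corpus import Corpus
#   conf = configurations(Corpus('data/mycorpus-parsed'), search={'w': 'risk'})
# The result maps names like 'has_subject' or 'premodified' to interrogations.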
================================================
FILE: corpkit/conll.py
================================================
"""
corpkit: process CONLL formatted data
"""
def parse_conll(f,
first_time=False,
just_meta=False,
usecols=None):
"""
Make a pandas.DataFrame with metadata from a CONLL-U file
Args:
f (str): Filepath
first_time (bool, optional): If True, add in sent index
just_meta (bool, optional): Return only a metadata `dict`
        usecols (None, optional): Which columns pandas.read_csv should parse
Returns:
pandas.DataFrame: DataFrame containing tokens and a ._metadata attribute
"""
import pandas as pd
try:
from StringIO import StringIO
except ImportError:
from io import StringIO
from collections import defaultdict
# go to corpkit.constants to modify the order of columns if yours are different
from corpkit.constants import CONLL_COLUMNS as head
with open(f, 'r') as fo:
data = fo.read().strip('\n')
splitdata = []
metadata = {}
sents = data.split('\n\n')
for count, sent in enumerate(sents, start=1):
metadata[count] = defaultdict(set)
for line in sent.split('\n'):
if line and not line.startswith('#') \
and not just_meta:
splitdata.append('\n%d\t%s' % (count, line))
else:
line = line.lstrip('# ')
if '=' in line:
field, val = line.split('=', 1)
metadata[count][field].add(val)
metadata[count] = {k: ','.join(v) for k, v in metadata[count].items()}
if just_meta:
return metadata
# happens with empty files
if not splitdata:
return
# head can only be as long as the list of cols in the df
num_tabs = splitdata[0].strip('\t').count('\t')
head = head[:num_tabs]
# introduce sentence index for multiindex
#for i, d in enumerate(splitdata, start=1):
# d = d.replace('\n', '\n%s\t' % str(i))
# splitdata[i-1] = d
# turn into something pandas can read
data = '\n'.join(splitdata)
data = data.replace('\n\n', '\n') + '\n'
# remove slashes as early as possible
data = data.replace('/', '-slash-')
# open with sent and token as multiindex
try:
df = pd.read_csv(StringIO(data), sep='\t', header=None,
names=['s'] + head, index_col=['s', 'i'], usecols=usecols)
#df.index = pd.MultiIndex.from_tuples([(1, i) for i in df.index])
except ValueError:
return
df._metadata = metadata
return df
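# Usage sketch (the path below is hypothetical):
#   df = parse_conll('data/mycorpus-parsed/first/01.txt.conll')
#   df._metadata[1]   # metadata dict for the first sentence
# The DataFrame is indexed by (sentence, token) pairs, with CONLL_COLUMNS as columns.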
def get_dependents_of_id(idx, df=False, repeat=False, attr=False, coref=False):
"""
Get dependents of a token
"""
sent_id, tok_id = getattr(idx, 'name', idx)
deps = df.ix[sent_id, tok_id]['d'].split(',')
out = []
for govid in deps:
if attr:
# might not exist...
try:
tok = getattr(df.ix[sent_id,int(govid)], attr, False)
if tok:
out.append(tok)
except (KeyError, IndexError):
pass
else:
out.append((sent_id, int(govid)))
return out
def get_governors_of_id(idx, df=False, repeat=False, attr=False, coref=False):
"""
Get governors of a token
"""
# it can be a series or a tuple
sent_id, tok_id = getattr(idx, 'name', idx)
# get the governor id
govid = df['g'].loc[sent_id, tok_id]
if attr:
return getattr(df.loc[sent_id,govid], attr, 'root')
return [(sent_id, govid)]
def get_match(idx, df=False, repeat=False, attr=False, **kwargs):
"""
Dummy function, for the most part
"""
sent_id, tok_id = getattr(idx, 'name', idx)
if attr:
return df[attr].ix[sent_id, tok_id]
return [(sent_id, tok_id)]
def get_head(idx, df=False, repeat=False, attr=False, **kwargs):
"""
    Get the head of a 'constituent'---
for 'corpus linguistics', if 'corpus' is searched, return 'linguistics'
"""
sent_id, tok_id = getattr(idx, 'name', idx)
#sent = df.ix[sent_id]
token = df.ix[sent_id, tok_id]
if not hasattr(token, 'c'):
# this should error, because the data isn't there at all
lst_of_ixs = [(sent_id, tok_id)]
elif token['c'] == '_':
lst_of_ixs = [(sent_id, tok_id)]
# if it is the head, return it
elif token['c'].endswith('*'):
lst_of_ixs = [(sent_id, tok_id)]
else:
# should be able to speed this one up!
just_same_coref = df.loc[sent_id][df.loc[sent_id]['c'] == token['c'] + '*']
if not just_same_coref.empty:
lst_of_ixs = [(sent_id, i) for i in just_same_coref.index]
else:
lst_of_ixs = [(sent_id, tok_id)]
if attr:
lst_of_ixs = [df.loc[i][attr] for i in lst_of_ixs]
return lst_of_ixs
def get_representative(idx,
df=False,
repeat=False,
attr=False,
**kwargs):
"""
Get the representative coref head
"""
sent_id, tok_id = getattr(idx, 'name', idx)
token = df.ix[sent_id, tok_id]
# if no corefs at all
if not hasattr(token, 'c'):
# this should error, because the data isn't there at all
lst_of_ixs = [(sent_id, tok_id)]
# if no coref available
elif token['c'] == '_':
lst_of_ixs = [(sent_id, tok_id)]
else:
just_same_coref = df.loc[df['c'] == token['c'] + '*']
if not just_same_coref.empty:
lst_of_ixs = [just_same_coref.iloc[0].name]
else:
lst_of_ixs = [(sent_id, tok_id)]
if attr:
lst_of_ixs = [df.ix[i][attr] for i in lst_of_ixs]
return lst_of_ixs
def get_all_corefs(s, i, df, coref=False):
# if not in coref mode, skip
if not coref:
return [(s, i)]
# if the word was not a head, forget it
if not df.ix[s,i]['c'].endswith('*'):
return [(s, i)]
try:
# get any other mention head for this coref chain
just_same_coref = df.loc[df['c'] == df.ix[s,i]['c']]
return list(just_same_coref.index)
except:
return [(s, i)]
def search_this(df, obj, attrib, pattern, adjacent=False, coref=False):
"""
Search the dataframe for a single criterion
"""
import re
out = []
# if searching by head, they need to be heads
if obj == 'h':
        df = df.loc[df['c'].str.endswith('*')]
# cut down to just tokens with matching attr
# but, if the pattern is 'any', don't bother
if hasattr(pattern, 'pattern') and pattern.pattern == r'.*':
matches = df
else:
matches = df[df[attrib].fillna('').str.contains(pattern)]
# functions for getting the needed object
revmapping = {'g': get_dependents_of_id,
'd': get_governors_of_id,
'm': get_match,
'h': get_all_corefs,
'r': get_representative}
getfunc = revmapping.get(obj)
for idx in list(matches.index):
if adjacent:
            if adjacent[0] == '+':
                tomove = -int(adjacent[1])
            elif adjacent[0] == '-':
                tomove = int(adjacent[1])
idx = (idx[0], idx[1] + tomove)
for mindex in getfunc(idx, df=df, coref=coref):
if mindex:
out.append(mindex)
return list(set(out))
def show_fix(show):
"""show everything"""
objmapping = {'d': get_dependents_of_id,
'g': get_governors_of_id,
'm': get_match,
'h': get_head}
out = []
for val in show:
adj, val = determine_adjacent(val)
obj, attr = val[0], val[-1]
obj_getter = objmapping.get(obj)
        out.append((adj, val, obj, attr, obj_getter))
return out
def dummy(x, *args, **kwargs):
return x
def format_toks(to_process, show, df):
"""
Format matches by show values
"""
import pandas as pd
objmapping = {'d': get_dependents_of_id,
'g': get_governors_of_id,
'm': get_match,
'h': get_head}
sers = []
dmode = any(x.startswith('d') for x in show)
if dmode:
from collections import defaultdict
dicts = defaultdict(dict)
for val in show:
adj, val = determine_adjacent(val)
if adj:
if adj[0] == '+':
tomove = int(adj[1])
elif adj[0] == '-':
tomove = -int(adj[1])
obj, attr = val[0], val[-1]
func = objmapping.get(obj, dummy)
out = defaultdict(dict) if dmode else []
for ix in list(to_process.index):
piece = False
if adj:
ix = (ix[0], ix[1] + tomove)
if ix not in df.index:
piece = 'none'
if not piece:
if obj == 'm':
piece = df.loc[ix][attr.replace('x', 'p')]
if attr == 'x':
from corpkit.dictionaries.word_transforms import taglemma
piece = taglemma.get(piece.lower(), piece.lower())
piece = [piece]
else:
piece = func(ix, df=df, attr=attr)
if not isinstance(piece, list):
piece = [piece]
if dmode:
dicts[ix][val] = piece
else:
out.append(piece[0])
if not dmode:
ser = pd.Series(out, index=to_process.index)
ser.name = val
sers.append(ser)
if not dmode:
dx = pd.concat(sers, axis=1)
if len(dx.columns) == 1:
return dx.iloc[:,0]
else:
return dx.apply('/'.join, axis=1)
else:
index = []
data = []
for ix, dct in dicts.items():
max_key, max_value = max(dct.items(), key=lambda x: len(x[1]))
for val, pieces in dct.items():
if len(pieces) == 1:
dicts[ix][val] = pieces * len(max_value)
for tup in list(zip(*[i for i in dct.values()])):
index.append(ix)
data.append('/'.join(tup))
return pd.Series(data, index=pd.MultiIndex.from_tuples(index))
def make_series(ser, df=False, obj=False,
att=False, adj=False):
"""
To apply to a DataFrame to add complex criteria, like 'gf'
"""
# distance mode
if att == 'a':
count = 0
if obj == 'g':
if ser[obj] == 0:
return '-1'
ser = df.loc[ser.name[0], ser['g']]
while count < 20:
if ser['mf'].lower() == 'root':
return str(count)
ser = df.loc[ser.name[0], ser['g']]
count += 1
return '20+'
# h is head of this particular group
if obj == 'h':
cohead = ser['c']
if cohead.endswith('*'):
return ser['m' + att]
elif cohead == '_':
return 'none'
else:
sent = df.loc[ser.name[0]]
just_cof = sent[sent['c'] == cohead + '*']
if just_cof.empty:
return ser['m' + att]
else:
return just_cof.iloc[0]['m' + att]
# r is the representative mention head
if obj == 'r':
cohead = ser['c']
if cohead == '_':
return 'none'
if not cohead.endswith('*'):
cohead = cohead + '*'
# iterrows is slow, but we only need the first instance
just_cof = df[df['c'] == cohead]
if just_cof.empty:
return ser['m' + att]
else:
return just_cof.iloc[0]['m' + att]
if obj == 'g':
if ser[obj] == 0:
return 'root'
else:
try:
return df[att][ser.name[0], ser[obj]]
# this keyerror can happen if governor is punctuation, for example
except KeyError:
return
# if dependent, we need to return a df-like thing instead
elif obj == 'd':
#import pandas as pd
idxs = [(ser.name[0], int(i)) for i in ser[obj].split(',')]
dat = df[att].ix[idxs]
return dat
# todo: fix everything below here
elif obj == 'r': # get the representative
cohead = ser['c'].rstrip('*')
refs = df[df['c'] == cohead + '*']
return refs[att].ix[0]
elif obj == 'h': # get head
cohead = ser['c']
if cohead.endswith('*'):
return ser[att]
else:
sent = df[att].loc[ser.name[0]]
return sent[sent['c'] == cohead + '*']
# potential naming conflict with sent index ...
    elif obj == 's': # get whole phrase
cohead = ser['c']
sent = df[att].loc[ser.name[0]]
return sent[sent['c'] == cohead.rstrip('*')].values
def joiner(ser):
return ser.str.cat(sep='/')
def make_new_for_dep(dfmain, dfdep, name):
"""
    If showing a dependent, we have to make a whole new dataframe
:param dfmain: dataframe with everything in it
:param dfdep: dataframe with just dependent
"""
import pandas as pd
import numpy as np
new = []
newd = []
index = []
for (i, ml), (_, dl) in zip(dfmain.iterrows(), dfdep.iterrows()):
if all(pd.isnull(i) for i in dl.values):
index.append(i)
new.append(ml)
newd.append('none')
continue
else:
for bit in dl:
if pd.isnull(bit):
continue
index.append(i)
new.append(ml)
newd.append(bit)
#todo: account for no matches
index = pd.MultiIndex.from_tuples(index, names=['s', 'i'])
newdf = pd.DataFrame(new, index=index)
newdf[name] = newd
return newdf
def turn_pos_to_wc(ser, showval):
if not showval:
return ser
import pandas as pd
from corpkit.dictionaries.word_transforms import taglemma
vals = [taglemma.get(piece.lower(), piece.lower())
for piece in ser.values]
news = pd.Series(vals, index=ser.index)
news.name = ser.name[:-1] + 'x'
return news
def concline_generator(matches, idxs, df, metadata,
add_meta, category, fname, preserve_case=False):
"""
Get all conclines
:param matches: a list of formatted matches
:param idxs: their (sent, word) idx
"""
conc_res = []
# potential speedup: turn idxs into dict
from collections import defaultdict
mdict = defaultdict(list)
# if remaking idxs here, don't need to do it earlier
idxs = list(matches.index)
for mid, (s, i) in zip(matches, idxs):
#for s, i in matches:
mdict[s].append((i, mid))
# shorten df to just relevant sents to save lookup time
df = df.loc[list(mdict.keys())]
# don't look up the same sentence multiple times
for s, tup in sorted(mdict.items()):
sent = df.loc[s]
if not preserve_case:
sent = sent.str.lower()
meta = metadata[s]
sname = meta.get('speaker', 'none')
for i, mid in tup:
if not preserve_case:
mid = mid.lower()
ix = '%d,%d' % (s, i)
start = ' '.join(sent.loc[:i-1].values)
end = ' '.join(sent.loc[i+1:].values)
lin = [ix, category, fname, sname, start, mid, end]
if add_meta:
for k, v in sorted(meta.items()):
if k in ['speaker', 'parse', 'sent_id']:
continue
if isinstance(add_meta, list):
if k in add_meta:
lin.append(v)
elif add_meta is True:
lin.append(v)
conc_res.append(lin)
return conc_res
def p_series_to_x_series(val):
return taglemma.get(val.lower(), val.lower())
def fast_simple_conc(dfss, idxs, show,
metadata=False,
add_meta=False,
fname=False,
category=False,
only_format_match=True,
conc=False,
preserve_case=False,
gramsize=1,
window=None):
"""
Fast, simple concordancer, heavily conditional
to save time.
"""
if dfss.empty:
return [], []
import pandas as pd
# best case, the user doesn't want any gov-dep stuff
simple = all(i.startswith('m') and not i.endswith('a') for i in show)
# worst case, the user wants something from dep
dmode = any(x.startswith('d') for x in show)
# make a quick copy if need be because we modify the df
df = dfss.copy() if not simple else dfss
# add text to df columns so that it resembles 'show' values
lst = ['s', 'i', 'w', 'l', 'e', 'p', 'f']
# for ner, change O to 'none'
if 'e' in df.columns:
df['e'] = df['e'].str.replace('^O$', 'none')
df.columns = ['m' + i if len(i) == 1 and i in lst \
else i for i in list(df.columns)]
# this is the data needed for concordancing
df_for_lr = df['mw'] if only_format_match else df
just_matches = df.loc[idxs]
# if the showing can't come straight out of the df,
# we can add columns with the necessary information
if not simple:
formatted = []
import numpy as np
for ind, i in enumerate(show):
# nothing to do if it's an m feature
if i.startswith('m') and not i.endswith('a'):
continue
# defaults for adjacent work
adj, tomove, adjname = False, False, ''
adj, i = determine_adjacent(i)
adjname = ''.join(adj) if hasattr(adj, '__iter__') else ''
# get number of places to shift left or right
if adj:
if adj[0] == '+':
tomove = -int(adj[1])
elif adj[0] == '-':
tomove = int(adj[1])
# cut df down to just needed bits for the sake of speed
# i.e. if we want gov func, get only gov and func cols
ob, att = i[0], i[-1]
xmode = att == 'x'
if xmode:
att = 'p'
show[ind] = show[ind][:-1] + 'p'
# for corefs, we also need the coref data
if ob in ['h', 'r']:
dfx = df[['c', 'm' + att]]
else:
lst = ['s', 'i', 'w', 'l', 'f', 'p']
if att in lst and ob != 'm':
att = 'm' + att
if ob == 'm' and att != 'a':
dfx = df[['m' + att]]
elif att == 'a':
dfx = df[['mf', 'g']]
else:
dfx = df[[ob, att]]
# decide if we need to format everything
if (not conc or only_format_match) and not adj:
to_proc = just_matches
else:
to_proc = df
# now we get or generate the new column
if ob == 'm' and att != 'a':
ser = to_proc['m' + att]
else:
ser = to_proc.apply(make_series, df=dfx, obj=ob, att=att, axis=1)
if xmode:
ser = ser.apply(p_series_to_x_series)
# adjmode simply shifts series and index
if adj:
#todo: this shifts next sent into previous sent!
ser = ser.shift(tomove)
ser = ser.fillna('none')
# dependent mode produces multiple matches
# so, we have to make a new dataframe with duplicate indexes
# todo: what about when there are two dep options?
ser.name = adjname + i
if ob != 'd':
df[ser.name] = ser
else:
df = make_new_for_dep(df, ser, i)
df = df.fillna('none')
# x is wordclass. so, we just get pos and translate it
nshow = [(i.replace('x', 'p'), i.endswith('x')) for i in show]
# generate a series of matches with slash sep if multiple show vals
if len(nshow) > 1:
if conc and not only_format_match:
first = turn_pos_to_wc(df[nshow[0][0]], nshow[0][1])
llist = [turn_pos_to_wc(df[sho], xmode) for sho, xmode in nshow[1:]]
df = first.str.cat(others=llist, sep='/')
matches = df[idxs]
else:
justm = df.loc[idxs]
first = turn_pos_to_wc(justm[nshow[0][0]], nshow[0][1])
llist = [turn_pos_to_wc(justm[sho], xmode) for sho, xmode in nshow[1:]]
matches = first.str.cat(others=llist, sep='/')
if conc:
df = df_for_lr
else:
if conc and not only_format_match:
df = turn_pos_to_wc(df[nshow[0][0]], nshow[0][1])
matches = df[idxs]
else:
matches = turn_pos_to_wc(df[nshow[0][0]][idxs], nshow[0][1])
if conc:
df = df_for_lr
# get rid of (e.g.) nan caused by no_punct=True
matches = matches.dropna(axis=0, how='all')
if not preserve_case:
matches = matches.str.lower()
if not conc:
# todo: is matches.values faster?
return list(matches), []
else:
conc_res = concline_generator(matches, idxs, df,
metadata, add_meta,
category, fname,
preserve_case=preserve_case)
return list(matches), conc_res
def make_collocate_show(show, current):
"""
    Turn show values into collocate show values for a given window position
"""
out = []
for i in show:
out.append(i)
for i in show:
newn = '%s%s' % (str(current), i)
if not newn.startswith('-'):
newn = '+' + newn
out.append(newn)
return out
def show_this(df, matches, show, metadata, conc=False,
coref=False, category=False, show_conc_metadata=False, **kwargs):
only_format_match = kwargs.pop('only_format_match', True)
ngram_mode = kwargs.get('ngram_mode', True)
preserve_case = kwargs.get('preserve_case', False)
gramsize = kwargs.get('gramsize', 1)
window = kwargs.get('window', None)
matches = sorted(list(matches))
# add index as column if need be
if any(i.endswith('s') for i in show):
df['ms'] = [str(i) for i in df.index.labels[0]]
if any(i.endswith('i') for i in show):
df['mi'] = [str(i) for i in df.index.labels[1]]
# attempt to leave really fast
if kwargs.get('countmode'):
return len(matches), {}
if len(show) == 1 and not conc and gramsize == 1 and not window:
if show[0] in ['ms', 'mi', 'mw', 'ml', 'mp', 'mf']:
get_fast = df.loc[matches][show[0][-1]]
if not preserve_case:
get_fast = get_fast.str.lower()
return list(get_fast), {}
# todo: make work for ngram, collocate and coref
if all(i[0] in ['m', 'g', '+', '-', 'd', 'h', 'r'] for i in show):
if gramsize == 1 and not window:
return fast_simple_conc(df,
matches,
show,
metadata,
show_conc_metadata,
kwargs.get('filename', ''),
category,
only_format_match,
conc=conc,
preserve_case=preserve_case,
gramsize=gramsize,
window=window)
else:
resbit = []
concbit = []
iterab = range(1, gramsize + 1) if gramsize > 1 else range(-window, window+1)
for i in iterab:
if i == 0:
continue
if window:
nnshow = make_collocate_show(show, i)
else:
nnshow = show
r, c = fast_simple_conc(df,
matches,
nnshow,
metadata,
show_conc_metadata,
kwargs.get('filename', ''),
category,
only_format_match,
conc=conc,
preserve_case=preserve_case,
gramsize=gramsize,
window=window)
resbit.append(r)
concbit.append(c)
if not window:
df = df.shift(1)
df = df.fillna('none')
resbit = list(zip(*resbit))
concbit = list(zip(*concbit))
out = []
conc_out = []
# this is slow but keeps the order
# remove it esp for resbit where it doesn't matter
for r in resbit:
for b in r:
out.append(b)
for c in concbit:
for b in c:
conc_out.append(b)
return out, conc_out
def remove_by_mode(matches, mode, criteria):
"""
If mode is all, remove any entry that occurs < len(criteria)
"""
if mode == 'any':
return set(matches)
if mode == 'all':
from collections import Counter
counted = Counter(matches)
return set(k for k, v in counted.items() if v >= len(criteria))
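# For example, with mode='all' and two criteria, a match must have been returned
# by both searches to be kept:
#   remove_by_mode([(1, 2), (1, 2), (1, 5)], 'all', ['w', 'f'])  ->  {(1, 2)}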
def determine_adjacent(original):
"""
Figure out if we're doing an adjacent location, get the co-ordinates
and return them and the stripped original
"""
if original[0] in ['+', '-']:
adj = (original[0], original[1:-2])
original = original[-2:]
else:
adj = False
return adj, original
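# For example, determine_adjacent('+2gw') returns (('+', '2'), 'gw'),
# while determine_adjacent('gw') returns (False, 'gw').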
def cut_df_by_metadata(df, metadata, criteria, coref=False,
feature='speaker', method='just'):
"""
Keep or remove parts of the DataFrame based on metadata criteria
"""
if not criteria:
df._metadata = metadata
return df
# maybe could be sped up, but let's not for now:
if coref:
df._metadata = metadata
return df
import re
good_sents = []
new_metadata = {}
from corpkit.constants import STRINGTYPE
# could make the below more elegant ...
for sentid, data in sorted(metadata.items()):
meta_value = data.get(feature, 'none')
lst_met_vl = meta_value.split(';')
if isinstance(criteria, (list, set, tuple)):
criteria = [i.lower() for i in criteria]
if method == 'just':
if any(i.lower() in criteria for i in lst_met_vl):
good_sents.append(sentid)
new_metadata[sentid] = data
elif method == 'skip':
if not any(i in criteria for i in lst_met_vl):
good_sents.append(sentid)
new_metadata[sentid] = data
elif isinstance(criteria, (re._pattern_type, STRINGTYPE)):
if method == 'just':
if any(re.search(criteria, i, re.IGNORECASE) for i in lst_met_vl):
good_sents.append(sentid)
new_metadata[sentid] = data
elif method == 'skip':
if not any(re.search(criteria, i, re.IGNORECASE) for i in lst_met_vl):
good_sents.append(sentid)
new_metadata[sentid] = data
df = df.loc[good_sents]
df = df.fillna('')
df._metadata = new_metadata
return df
def cut_df_by_meta(df, just_metadata, skip_metadata):
"""
Reshape a DataFrame based on filters
"""
if df is not None:
if just_metadata:
for k, v in just_metadata.items():
df = cut_df_by_metadata(df, df._metadata, v, feature=k)
if skip_metadata:
for k, v in skip_metadata.items():
df = cut_df_by_metadata(df, df._metadata, v, feature=k, method='skip')
return df
def tgrep_searcher(f=False,
metadata=False,
from_df=False,
search=False,
searchmode=False,
exclude=False,
excludemode=False,
translated_option=False,
subcorpora=False,
conc=False,
root=False,
preserve_case=False,
countmode=False,
show=False,
lem_instance=False,
lemtag=False,
category=False,
fname=False,
show_conc_metadata=False,
only_format_match=True,
**kwargs):
"""
Use tgrep for constituency grammar search
"""
from corpkit.process import show_tree_as_per_option, tgrep
matches = []
conc_out = []
# in case search was a dict
srch = search.get('t') if isinstance(search, dict) else search
metcat = category if category else ''
for i, sent in metadata.items():
results = tgrep(sent['parse'], srch)
sname = sent.get('speaker')
metcat = category
for res in results:
tok_id, start, middle, end = show_tree_as_per_option(show, res, sent,
df=from_df, sent_id=i, conc=conc,
only_format_match=only_format_match)
#middle, idx = show_tree_as_per_option(show, res, 'conll', sent, df=df, sent_id=i)
matches.append(middle)
if conc:
form_ix = '%d,%d' % (i, tok_id)
lin = [form_ix, metcat, fname, sname, start, middle, end]
if show_conc_metadata:
for k, v in sorted(sent.items()):
if k in ['speaker', 'parse', 'sent_id']:
continue
if isinstance(show_conc_metadata, list):
if k in show_conc_metadata:
lin.append(v)
elif show_conc_metadata is True:
lin.append(v)
conc_out.append(lin)
return matches, conc_out
def slow_tregex(metadata=False,
search=False,
searchmode=False,
exclude=False,
excludemode=False,
translated_option=False,
subcorpora=False,
conc=False,
root=False,
preserve_case=False,
countmode=False,
show=False,
lem_instance=False,
lemtag=False,
from_df=False,
fname=False,
category=False,
only_format_match=False,
**kwargs):
"""
Do the metadata specific version of tregex queries
"""
from corpkit.process import tregex_engine, format_tregex, make_conc_lines_from_whole_mid
if isinstance(search, dict):
search = list(search.values())[0]
speak_tree = [(x.get(subcorpora, 'none'), x['parse']) for x in metadata.values()]
if speak_tree:
speak, tree = list(zip(*speak_tree))
else:
speak, tree = [], []
if all(not x for x in speak):
speak = False
to_open = '\n'.join(tree)
concs = []
if not to_open.strip('\n'):
if subcorpora:
return {}, {}
ops = ['-%s' % i for i in translated_option] + ['-o', '-n']
res = tregex_engine(query=search,
options=ops,
corpus=to_open,
root=root,
preserve_case=preserve_case,
speaker_data=False)
res = format_tregex(res, show, exclude=exclude, excludemode=excludemode,
translated_option=translated_option,
lem_instance=lem_instance, countmode=countmode, speaker_data=False,
lemtag=lemtag)
if not res:
if subcorpora:
return [], []
if conc:
ops += ['-w']
whole_res = tregex_engine(query=search,
options=ops,
corpus=to_open,
root=root,
preserve_case=preserve_case,
speaker_data=speak)
# format match too depending on option
if not only_format_match:
whole_res = format_tregex(whole_res, show, exclude=exclude, excludemode=excludemode,
translated_option=translated_option,
lem_instance=lem_instance, countmode=countmode,
speaker_data=speak, whole=True,
lemtag=lemtag)
# make conc lines from conc results
concs = make_conc_lines_from_whole_mid(whole_res, res, filename=fname, show=show)
else:
concs = [False for i in res]
if len(res) > 0 and isinstance(res[0], tuple):
res = [i[-1] for i in res]
if countmode:
if isinstance(res, int):
return res, False
else:
return len(res), False
else:
return res, concs
def get_stats(from_df=False, metadata=False, feature=False, root=False, **kwargs):
"""
Get general statistics for a DataFrame
"""
import re
from corpkit.dictionaries.process_types import processes
from collections import Counter, defaultdict
from corpkit.process import tregex_engine
def ispunct(s):
import string
return all(c in string.punctuation for c in s)
tree = [x['parse'] for x in metadata.values()]
tregex_qs = {'Imperative': r'ROOT < (/(S|SBAR)/ < (VP !< VBD !< VBG !$ NP !$ SBAR < NP !$-- S '\
'!$-- VP !$ VP)) !<< (/\?/ !< __) !<<- /-R.B-/ !<<, /(?i)^(-l.b-|hi|hey|hello|oh|wow|thank|thankyou|thanks|welcome)$/',
'Open interrogative': r'ROOT < SBARQ <<- (/\?/ !< __)',
'Closed interrogative': r'ROOT ( < (SQ < (NP $+ VP)) << (/\?/ !< __) | < (/(S|SBAR)/ < (VP $+ NP)) <<- (/\?/ !< __))',
'Unmodalised declarative': r'ROOT < (S < (/(NP|SBAR|VP)/ $+ (VP !< MD)))',
'Modalised declarative': r'ROOT < (S < (/(NP|SBAR|VP)/ $+ (VP < MD)))',
'Clauses': r'/^S/ < __',
'Interrogative': r'ROOT << (/\?/ !< __)',
'Processes': r'/VB.?/ >># (VP !< VP >+(VP) /^(S|ROOT)/)'}
result = Counter()
for name in tregex_qs.keys():
result[name] = 0
result['Sentences'] = len(set(from_df.index.labels[0]))
result['Passives'] = len(from_df[from_df['f'] == 'nsubjpass'])
result['Tokens'] = len(from_df)
    # the below has returned a float before, presumably a NaN
result['Words'] = len([w for w in list(from_df['w']) if w and not ispunct(str(w))])
result['Characters'] = sum([len(str(w)) for w in list(from_df['w']) if w])
result['Open class'] = sum([1 for x in list(from_df['p']) if x and x[0] in ['N', 'J', 'V', 'R']])
result['Punctuation'] = result['Tokens'] - result['Words']
result['Closed class'] = result['Words'] - result['Open class']
to_open = '\n'.join(tree)
if not to_open.strip('\n'):
return {}, {}
for name, q in sorted(tregex_qs.items()):
options = ['-o', '-t'] if name == 'Processes' else ['-o']
# c option removed, could cause memory problems
#ops = ['-%s' % i for i in translated_option] + ['-o', '-n']
res = tregex_engine(query=q,
options=options,
corpus=to_open,
root=root)
#res = format_tregex(res)
if not res:
continue
concs = [False for i in res]
for (_, met, r), line in zip(res, concs):
result[name] = len(res)
if name != 'Processes':
continue
non_mat = 0
for ptype in ['mental', 'relational', 'verbal']:
reg = getattr(processes, ptype).words.as_regex(boundaries='l')
count = len([i for i in res if re.search(reg, i[-1])])
nname = ptype.title() + ' processes'
result[nname] = count
if root:
root.update()
return result, {}
def get_corefs(df, matches):
"""
Add corefs to a set of matches
"""
out = set()
df = df['c']
for s, i in matches:
# keep original
out.add((s,i))
coline = df[(s, i)]
if coline.endswith('*'):
same_co = df[df == coline]
for ix in same_co.index:
out.add(ix)
return out
def pipeline(f=False,
search=False,
show=False,
exclude=False,
searchmode='all',
excludemode='any',
conc=False,
coref=False,
from_df=False,
just_metadata=False,
skip_metadata=False,
category=False,
show_conc_metadata=False,
statsmode=False,
search_trees=False,
lem_instance=False,
**kwargs):
"""
A basic pipeline for conll querying---some options still to do
"""
if isinstance(show, str):
show = [show]
all_matches = []
all_exclude = []
if from_df is False or from_df is None:
df = parse_conll(f, usecols=kwargs.get('usecols'))
# can fail here if df is none
if df is None:
print('Problem reading data from %s.' % f)
return [], []
metadata = df._metadata
else:
df = from_df
metadata = kwargs.pop('metadata')
feature = kwargs.pop('by_metadata', False)
df = cut_df_by_meta(df, just_metadata, skip_metadata)
searcher = pipeline
if statsmode:
searcher = get_stats
if search_trees == 'tregex':
searcher = slow_tregex
elif search_trees == 'tgrep':
searcher = tgrep_searcher
if feature:
if df is None:
print('Problem reading data from %s.' % f)
return {}, {}
# determine searcher
resultdict = {}
concresultdict = {}
# get all the possible values in the df for the feature of interest
all_cats = set([i.get(feature, 'none') for i in list(df._metadata.values())])
for category in all_cats:
new_df = cut_df_by_metadata(df, df._metadata, category, feature=feature, method='just')
r, c = searcher(f=False,
fname=f,
search=search,
exclude=exclude,
show=show,
searchmode=searchmode,
excludemode=excludemode,
conc=conc,
coref=coref,
from_df=new_df,
by_metadata=False,
category=category,
show_conc_metadata=show_conc_metadata,
lem_instance=lem_instance,
root=kwargs.pop('root', False),
subcorpora=feature,
metadata=new_df._metadata,
**kwargs)
resultdict[category] = r
concresultdict[category] = c
return resultdict, concresultdict
if df is None:
print('Problem reading data from %s.' % f)
return [], []
kwargs['ngram_mode'] = any(x.startswith('n') for x in show)
#df = cut_df_by_metadata(df, df._metadata, kwargs.get('just_speakers'), coref=coref)
metadata = df._metadata
try:
df['w'].str
except AttributeError:
raise AttributeError("CONLL data doesn't match expectations. " \
"Try the corpus.conll_conform() method to " \
"convert the corpus to the latest format.")
if kwargs.get('no_punct', True):
df = df[df['w'].fillna('').str.contains(kwargs.get('is_a_word', r'[A-Za-z0-9]'))]
# remove brackets --- could it be done in one regex?
df = df[~df['w'].str.contains(r'^-.*B-$')]
if kwargs.get('no_closed'):
from corpkit.dictionaries import wordlists
crit = wordlists.closedclass.as_regex(boundaries='l', case_sensitive=False)
df = df[~df['w'].str.contains(crit)]
if statsmode:
return get_stats(df, metadata, False, root=kwargs.pop('root', False), **kwargs)
elif search_trees:
return searcher(from_df=df,
search=search,
searchmode=searchmode,
exclude=exclude,
excludemode=excludemode,
conc=conc,
by_metadata=False,
metadata=metadata,
root=kwargs.pop('root', False),
fname=f,
show=show,
**kwargs)
# do no searching if 'any' is requested
if len(search) == 1 and list(search.keys())[0] == 'w' \
and hasattr(list(search.values())[0], 'pattern') \
and list(search.values())[0].pattern == r'.*':
all_matches = list(df.index)
else:
for k, v in search.items():
adj, k = determine_adjacent(k)
res = search_this(df, k[0], k[-1], v, adjacent=adj, coref=coref)
for r in res:
all_matches.append(r)
all_matches = remove_by_mode(all_matches, searchmode, search)
if exclude:
for k, v in exclude.items():
adj, k = determine_adjacent(k)
res = search_this(df, k[0], k[-1], v, adjacent=adj, coref=coref)
for r in res:
all_exclude.append(r)
all_exclude = remove_by_mode(all_exclude, excludemode, exclude)
all_matches = all_matches.difference(all_exclude)
if coref:
all_matches = get_corefs(df, all_matches)
out, conc_out = show_this(df, all_matches, show, metadata, conc,
coref=coref, category=category,
show_conc_metadata=show_conc_metadata,
**kwargs)
return out, conc_out
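# Usage sketch (illustrative only; the path is hypothetical). Search keys pair an
# object letter with an attribute letter, so 'mw' means the match's word form:
#   import re
#   res, conc_lines = pipeline(f='data/mycorpus-parsed/first/01.txt.conll',
#                              search={'mw': re.compile(r'^corpus$')},
#                              show=['mw'],
#                              conc=False)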
def load_raw_data(f):
"""
Loads the stripped and raw versions of a parsed file
"""
from corpkit.process import saferead
# open the unparsed version of the file, read into memory
stripped_txtfile = f.replace('.conll', '').replace('-parsed', '-stripped')
stripped_txtdata, enc = saferead(stripped_txtfile)
# open the unparsed version with speaker ids
id_txtfile = f.replace('.conll', '').replace('-parsed', '')
id_txtdata, enc = saferead(id_txtfile)
return stripped_txtdata, id_txtdata
def get_speaker_from_offsets(stripped, plain, sent_offsets,
metadata_mode=False,
speaker_segmentation=False):
"""
Take offsets and get a speaker ID or metadata from them
"""
if not stripped and not plain:
return {}
start, end = sent_offsets
sent = stripped[start:end]
# find out line number
# sever at start of match
cut_old_text = stripped[:start]
line_index = cut_old_text.count('\n')
# lookup this text
with_id = plain.splitlines()[line_index]
# parse xml tags in original file ...
meta_dict = {'speaker': 'none'}
if metadata_mode:
metad = with_id.strip().rstrip('>').rsplit('<metadata ', 1)
import shlex
from corpkit.constants import PYTHON_VERSION
try:
shxed = shlex.split(metad[-1].encode('utf-8')) if PYTHON_VERSION == 2 \
else shlex.split(metad[-1])
except:
shxed = metad[-1].split("' ")
for m in shxed:
if PYTHON_VERSION == 2:
m = m.decode('utf-8')
# in rare cases of weirdly formatted xml:
try:
k, v = m.split('=', 1)
v = v.replace(u"\u2018", "'").replace(u"\u2019", "'").strip("'").strip('"')
meta_dict[k] = v
except ValueError:
continue
if speaker_segmentation:
split_line = with_id.split(': ', 1)
# handle multiple tags?
if len(split_line) > 1:
speakerid = split_line[0]
else:
speakerid = 'UNIDENTIFIED'
meta_dict['speaker'] = speakerid
return meta_dict
def convert_json_to_conll(path,
speaker_segmentation=False,
coref=False,
metadata=False,
just_files=False):
"""
    Take JSON CoreNLP output and convert it to CONLL-U, with
    dependents, speaker IDs and so on added.
    Path is the parsed corpus, or a list of files within a parsed corpus.
    Might need fixing if outname is used?
"""
import json
import re
from corpkit.build import get_filepaths
from corpkit.constants import CORENLP_VERSION, OPENER
# todo: stabilise this
#if CORENLP_VERSION == '3.7.0':
# coldeps = 'enhancedPlusPlusDependencies'
#else:
# coldeps = 'collapsed-ccprocessed-dependencies'
print('Converting files to CONLL-U...')
if just_files:
files = just_files
else:
if isinstance(path, list):
files = path
else:
files = get_filepaths(path, ext='conll')
for f in files:
if speaker_segmentation or metadata:
stripped, raw = load_raw_data(f)
else:
stripped, raw = None, None
main_out = ''
# if the file has already been converted, don't worry about it
# untested?
with OPENER(f, 'r') as fo:
#try:
try:
data = json.load(fo)
except ValueError:
continue
# todo: differentiate between json errors
# rsc corpus had one json file with an error
# outputted by corenlp, and the conversion
# failed silently here
#except ValueError:
# continue
for idx, sent in enumerate(data['sentences'], start=1):
tree = sent['parse'].replace('\n', '')
tree = re.sub(r'\s+', ' ', tree)
# offsets for speaker_id
sent_offsets = (sent['tokens'][0]['characterOffsetBegin'], \
sent['tokens'][-1]['characterOffsetEnd'])
metad = get_speaker_from_offsets(stripped,
raw,
sent_offsets,
metadata_mode=True,
speaker_segmentation=speaker_segmentation)
# currently there is no standard for sent_id, so i'm leaving it out, but
# if https://github.com/UniversalDependencies/docs/issues/273 is updated
# then i could switch it back
#output = '# sent_id %d\n# parse=%s\n' % (idx, tree)
output = '# parse=%s\n' % tree
for k, v in sorted(metad.items()):
output += '# %s=%s\n' % (k, v)
for token in sent['tokens']:
index = str(token['index'])
# this got a stopiteration on rsc data
governor, func = next(((i['governor'], i['dep']) \
for i in sent.get('enhancedPlusPlusDependencies',
sent.get('collapsed-ccprocessed-dependencies')) \
if i['dependent'] == int(index)), ('_', '_'))
if governor is '_':
depends = False
else:
depends = [str(i['dependent']) for i in sent.get('enhancedPlusPlusDependencies',
sent.get('collapsed-ccprocessed-dependencies')) if i['governor'] == int(index)]
if not depends:
depends = '0'
#offsets = '%d,%d' % (token['characterOffsetBegin'], token['characterOffsetEnd'])
line = [index,
token['word'],
token['lemma'],
token['pos'],
token.get('ner', '_'),
'_', # this is morphology, which is unannotated always, but here to conform to conll u
governor,
func,
','.join(depends)]
# no ints
line = [str(l) if isinstance(l, int) else l for l in line]
from corpkit.constants import PYTHON_VERSION
if PYTHON_VERSION == 2:
try:
[unicode(l, errors='ignore') for l in line]
except TypeError:
pass
output += '\t'.join(line) + '\n'
main_out += output + '\n'
# post process corefs
if coref:
import re
dct = {}
idxreg = re.compile('^([0-9]+)\t([0-9]+)')
splitmain = main_out.split('\n')
# add tab _ to each line, make dict of sent-token: line index
for i, line in enumerate(splitmain):
if line and not line.startswith('#'):
splitmain[i] += '\t_'
match = re.search(idxreg, line)
if match:
l, t = match.group(1), match.group(2)
dct[(int(l), int(t))] = i
# for each coref chain, if there are corefs
for numstring, list_of_dicts in sorted(data.get('corefs', {}).items()):
# for each mention
for d in list_of_dicts:
snum = d['sentNum']
# get head?
# this has been fixed in dev corenlp: 'headIndex' --- could simply use that
# ref : https://github.com/stanfordnlp/CoreNLP/issues/231
for i in range(d['startIndex'], d['endIndex']):
try:
ix = dct[(snum, i)]
fixed_line = splitmain[ix].rstrip('\t_') + '\t%s' % numstring
gv = fixed_line.split('\t')[6]
try:
gov_s = int(gv)
except ValueError:
continue
if gov_s < d['startIndex'] or gov_s > d['endIndex']:
fixed_line += '*'
splitmain[ix] = fixed_line
dct.pop((snum, i))
except KeyError:
pass
main_out = '\n'.join(splitmain)
from corpkit.constants import OPENER
with OPENER(f, 'w', encoding='utf-8') as fo:
main_out = main_out.replace(u"\u2018", "'").replace(u"\u2019", "'")
fo.write(main_out)
================================================
FILE: corpkit/constants.py
================================================
import sys
import codecs
# python 2/3 compatibility
PYTHON_VERSION = sys.version_info.major
STRINGTYPE = str if PYTHON_VERSION == 3 else basestring
INPUTFUNC = input if PYTHON_VERSION == 3 else raw_input
OPENER = open if PYTHON_VERSION == 3 else codecs.open
# quicker access to search, exclude, show types
from itertools import product
_starts = ['M', 'N', 'B', 'G', 'D', 'H', 'R']
_ends = ['W', 'L', 'I', 'S', 'P', 'X', 'R', 'F', 'E']
_others = ['A', 'ANY', 'ANYWORD', 'C', 'SELF', 'V', 'K', 'T']
_prod = list(product(_starts, _ends))
_prod = [''.join(i) for i in _prod]
_letters = sorted(_prod + _starts + _ends + _others)
_adjacent_start = ['A{}'.format(i) for i in range(1, 9)] + \
['Z{}'.format(i) for i in range(1, 9)]
_adjacent = [''.join(i) for i in list(product(_adjacent_start, _prod))]
LETTERS = sorted(_letters + _adjacent)
# translating search values into words
transshow = {'f': 'Function',
'l': 'Lemma',
'a': 'Distance from root',
'w': 'Word',
't': 'Trees',
'i': 'Index',
'n': 'N-grams',
'p': 'POS',
'e': 'NER',
'c': 'Count',
'x': 'Word class',
's': 'Sentence index'}
transobjs = {'g': 'Governor',
'd': 'Dependent',
'm': 'Match',
'h': 'Head'}
# below are the column names for the conll-u formatted data
# corpkit's format is slightly different, but largely compatible.
# Key differences:
#
# 1. 'e' is used for NER, rather than lang specific POS
# 2. 'd' gives a comma-sep list of dependents, rather than head-deprel pairs
# this is done for processing speed.
# 3. 'c' is used for corefs, not 'misc comment'. it has an arbitrary number
# representing a coreference chain. the head of a mention is marked with an asterisk.
# 4. the morphology column ('v' below) is never annotated by corpkit
# default: index, word, lem, pos, ner, morph, gov, func, deps, coref
CONLL_COLUMNS = ['i', 'w', 'l', 'p', 'e', 'v', 'g', 'f', 'd', 'c']
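# An illustrative, made-up token line in this format (tab-separated in real files):
#   1   Linguistics   linguistics   NNP   O   _   2   nsubj   0   _
# i.e. index, word, lemma, POS, NER ('O' = none), morphology (unannotated),
# governor index, function, dependent indices ('0' = none), coref ('_' = none)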
# what the longest possible speaker ID is. this prevents huge lines with colons
# from getting matched unintentionally
MAX_SPEAKERNAME_SIZE = 40
# parsing sometimes fails with a java error. if corpus.parse(restart=True), this will try
# parsing n times before giving up
REPEAT_PARSE_ATTEMPTS = 3
# location of the current corenlp and its version
# old, stable
#CORENLP_URL = 'http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip'
#CORENLP_VERSION = '3.6.0'
# newest, beta
CORENLP_VERSION = '3.7.0'
CORENLP_URL = 'http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip'
# it can be very slow to load a bunch of unused metadata categories
MAX_METADATA_FIELDS = 99
MAX_METADATA_VALUES = 99
================================================
FILE: corpkit/corpkit
================================================
#!/usr/bin/env python
"""
A script to start the corpkit interpreter with options
"""
import sys
import os
# determine if we're running a script
if len(sys.argv) > 1 and os.path.isfile(sys.argv[-1]):
fromscript = sys.argv[-1]
else:
fromscript = False
def install(name, loc):
"""
If we don't have a module, download it
"""
import pip
import importlib
try:
importlib.import_module(name)
except ImportError:
pip.main(['install', loc])
tabview = ('tabview', 'git+https://github.com/interrogator/tabview@93644dd1f410de4e47466ea8083bb628b9ccc471#egg=tabview')
colorama = ('colorama', 'colorama')
# run a command a la python -c
command = sys.argv[sys.argv.index('-c') + 1] if '-c' in sys.argv else False
debug = any(i in sys.argv for i in ['--debug', '-d', 'debug'])
quiet = any(i in sys.argv for i in ['--q', '--quiet'])
load = any(i in sys.argv for i in ['--load', '-l'])
profile = any(i in sys.argv for i in ['--profile', '-p'])
version = any(i in sys.argv for i in ['--version', '-v'])
if not any('noinstall' in arg.lower() for arg in sys.argv):
install(*tabview)
install(*colorama)
if version:
import corpkit
print(corpkit.__version__)
elif any(i in sys.argv for i in ['--help', '-h']):
from corpkit.env import help_text
import pydoc
pydoc.pipepager(help_text, cmd='less -X -R -S')
else:
from corpkit.env import interpreter
interpreter(debug=debug, fromscript=fromscript,
quiet=quiet, python_c_mode=command,
profile=profile, loadcurrent=load)
================================================
FILE: corpkit/corpkit.1
================================================
.TH corpkit 1
.SH NAME
corpkit \- corpus linguistics interface
.SH SYNOPSIS
.B corpkit
[\fB\-c\fR \fICOMMAND\fR]
[\fB\-d\fR]
[\fB\-h\fR]
[\fB\-l\fR]
.IR [file]
.SH DESCRIPTION
.B corpkit
builds and searches parsed and/or structured linguistic corpora. It also edits and visualises results, and manages projects.
.SH OPTIONS
.TP
.BR " \-\-c COMMAND"\fR
A quoted command or series of commands to pass to the corpkit interpreter. Use a semicolon to delimit each command. Disables interactivity and exits on completion.
.TP
.IR "file"\fR
Pass in a script for the interpreter to run.
.TP
.BR "\-d, " \-\-debug\fR
Debug mode: print info about how command was parsed.
.TP
.BR "\-l, " \-\-load\fR
Load all saved results into store on startup.
.TP
.BR "\-h, " \-\-help\fR
Show help.
================================================
FILE: corpkit/corpus.py
================================================
"""
corpkit: Corpus and Corpus-like objects
"""
from __future__ import print_function
from lazyprop import lazyprop
from corpkit.process import classname
from corpkit.constants import STRINGTYPE, PYTHON_VERSION
class Corpus(object):
"""
A class representing a linguistic text corpus, which contains files,
optionally within subcorpus folders.
Methods for concordancing, interrogating, getting general stats, getting
behaviour of particular word, etc.
Unparsed, tokenised and parsed corpora use the same class, though some
methods are available only to one or the other. Only unparsed corpora
can be parsed, and only parsed/tokenised corpora can be interrogated.
"""
def __init__(self, path, **kwargs):
import re
import operator
import glob
import os
from os.path import join, isfile, isdir, abspath, dirname, basename
from corpkit.process import determine_datatype
# levels are 'c' for corpus, 's' for subcorpus and 'f' for file. Which
# one is determined automatically below, and processed accordingly. We
# assume it is a full corpus to begin with.
def get_symbolics(self):
return {'skip': self.skip,
'just': self.just,
'symbolic': self.symbolic}
self.data = None
self._dlist = None
self.level = kwargs.pop('level', 'c')
self.datatype = kwargs.pop('datatype', None)
self.print_info = kwargs.pop('print_info', True)
self.symbolic = kwargs.get('subcorpora', False)
self.skip = kwargs.get('skip', False)
self.just = kwargs.get('just', False)
self.kwa = get_symbolics(self)
if isinstance(path, (list, Datalist)):
self.path = abspath(dirname(path[0].path.rstrip('/')))
self.name = basename(self.path)
self.data = path
if self.level == 'd':
self._dlist = path
elif isinstance(path, STRINGTYPE):
self.path = abspath(path)
self.name = basename(path)
elif hasattr(path, 'path') and path.path:
self.path = abspath(path.path)
self.name = basename(path.path)
        # this messy code figures out as quickly as possible what the datatype
        # and singlefile status of the path is. it's messy because it shortcuts
        # full checking where possible. some of the shortcutting could maybe be
        # moved into the determine_datatype() funct.
if self.level == 'd':
self.singlefile = len(self._dlist) > 1
else:
self.singlefile = False
if os.path.isfile(self.path):
self.singlefile = True
else:
if not isdir(self.path):
if isdir(join('data', path)):
self.path = abspath(join('data', path))
if self.path.endswith('-parsed') or self.path.endswith('-tokenised'):
for r, d, f in os.walk(self.path):
if not f:
continue
if isinstance(f, str) and f.startswith('.'):
continue
if f[0].endswith('conll') or f[0].endswith('conllu'):
self.datatype = 'conll'
break
if len([d for d in os.listdir(self.path)
if isdir(join(self.path, d))]) > 0:
self.singlefile = False
if len([d for d in os.listdir(self.path)
if isdir(join(self.path, d))]) == 0:
self.level = 's'
else:
if self.level == 'c':
if not self.datatype:
self.datatype, self.singlefile = determine_datatype(
self.path)
if isdir(self.path):
if len([d for d in os.listdir(self.path)
if isdir(join(self.path, d))]) == 0:
self.level = 's'
# if initialised on a file, process as file
if self.singlefile and self.level == 'c':
self.level = 'f'
# load each interrogation as an attribute
if kwargs.get('load_saved', False):
from corpkit.other import load
from corpkit.process import makesafe
if os.path.isdir('saved_interrogations'):
saved_files = glob.glob(r'saved_interrogations/*')
for filepath in saved_files:
filename = os.path.basename(filepath)
if not filename.startswith(self.name):
continue
not_filename = filename.replace(self.name + '-', '')
not_filename = os.path.splitext(not_filename)[0]
if not_filename in ['features', 'wordclasses', 'postags']:
continue
variable_safe = makesafe(not_filename)
try:
setattr(self, variable_safe, load(filename))
if self.print_info:
print(
'\tLoaded %s as %s attribute.' %
(filename, variable_safe))
except AttributeError:
if self.print_info:
print(
'\tFailed to load %s as %s attribute. Name conflict?' %
(filename, variable_safe))
if self.print_info:
print('Corpus: %s' % self.path)
@lazyprop
def subcorpora(self):
"""
A list-like object containing a corpus' subcorpora.
"""
import re
import os
import operator
from os.path import join, isdir
if self.level == 'd':
return
if self.data.__class__ == Datalist or isinstance(self.data, (Datalist, list)):
return self.data
if self.level == 'c':
variable_safe_r = re.compile(r'[\W0-9_]+', re.UNICODE)
sbs = Datalist(sorted([Subcorpus(join(self.path
SYMBOL INDEX (376 symbols across 36 files)
FILE: conf.py
class CustomLatexFormatter (line 18) | class CustomLatexFormatter(LatexFormatter):
method __init__ (line 19) | def __init__(self, **options):
FILE: corpkit/__init__.py
function _plot (line 55) | def _plot(self, *args, **kwargs):
function _edit (line 59) | def _edit(self, *args, **kwargs):
function _save (line 63) | def _save(self, savename, **kwargs):
function _quickview (line 67) | def _quickview(self, n=25):
function _format (line 71) | def _format(self, *args, **kwargs):
function _texify (line 75) | def _texify(self, *args, **kwargs):
function _calculate (line 79) | def _calculate(self, *args, **kwargs):
function _multiplot (line 83) | def _multiplot(self, leftdict={}, rightdict={}, **kwargs):
function _perplexity (line 87) | def _perplexity(self):
function _entropy (line 107) | def _entropy(self):
function _shannon (line 118) | def _shannon(self):
function _shuffle (line 123) | def _shuffle(self, inplace=False):
function _top (line 134) | def _top(self):
function _tabview (line 141) | def _tabview(self, **kwargs):
function _rel (line 146) | def _rel(self, denominator='self', **kwargs):
function _keyness (line 150) | def _keyness(self, measure='ll', denominator='self', **kwargs):
function _plain (line 154) | def _plain(df):
FILE: corpkit/annotate.py
function process_special_annotation (line 5) | def process_special_annotation(v, lin):
function make_string_to_add (line 20) | def make_string_to_add(annotation, lin, replace=False):
function get_line_number_for_entry (line 41) | def get_line_number_for_entry(data, si, ti, annotation):
function update_contents (line 61) | def update_contents(contents, place, text, do_replace=False):
function dry_run_text (line 71) | def dry_run_text(filepath, contents, place, colours):
function annotate (line 91) | def annotate(open_file, contents):
function delete_lines (line 103) | def delete_lines(corpus, annotation, dry_run=True, colour={}):
function annotator (line 175) | def annotator(df_or_corpus, annotation, dry_run=True, deletemode=False):
FILE: corpkit/build.py
function download_large_file (line 9) | def download_large_file(proj_path, url, actually_download=True, root=Fal...
function extract_cnlp (line 123) | def extract_cnlp(fullfilepath, corenlppath=False, root=False):
function get_corpus_filepaths (line 148) | def get_corpus_filepaths(projpath=False, corpuspath=False,
function check_jdk (line 186) | def check_jdk():
function parse_corpus (line 201) | def parse_corpus(proj_path=False,
function move_parsed_files (line 399) | def move_parsed_files(proj_path, old_corpus_path, new_corpus_path,
function corenlp_exists (line 451) | def corenlp_exists(corenlppath=False):
function get_filepaths (line 481) | def get_filepaths(a_path, ext='txt'):
function make_no_id_corpus (line 503) | def make_no_id_corpus(pth, newpth, metadata_mode=False, speaker_segmenta...
function get_all_metadata_fields (line 565) | def get_all_metadata_fields(corpus, include_speakers=False):
function get_names (line 612) | def get_names(filepath, speakid):
function get_speaker_names_from_parsed_corpus (line 623) | def get_speaker_names_from_parsed_corpus(corpus, feature='speaker'):
function rename_all_files (line 659) | def rename_all_files(dirs_to_do):
function flatten_treestring (line 681) | def flatten_treestring(tree):
function can_folderise (line 690) | def can_folderise(folder):
function folderise (line 704) | def folderise(folder):
FILE: corpkit/completer.py
class Completer (line 1) | class Completer(object):
method __init__ (line 6) | def __init__(self, words):
method complete (line 10) | def complete(self, prefix, index):
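
The Completer above matches the completer protocol used by Python's readline module: a callable invoked repeatedly with a prefix and an increasing index until it returns None. A minimal, self-contained sketch of how such a class is typically wired up for tab completion; the vocabulary here is invented for illustration and is not corpkit's actual command list.

import readline

class Completer(object):
    """Complete a prefix against a fixed vocabulary (same signature as above)."""
    def __init__(self, words):
        self.words = sorted(words)

    def complete(self, prefix, index):
        # readline calls this with index 0, 1, 2, ... until it gets None back
        matches = [w for w in self.words if w.startswith(prefix)]
        return matches[index] if index < len(matches) else None

readline.set_completer(Completer(['search', 'show', 'set', 'save']).complete)
readline.parse_and_bind('tab: complete')
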
FILE: corpkit/configurations.py
function configurations (line 1) | def configurations(corpus, search, **kwargs):
FILE: corpkit/conll.py
function parse_conll (line 5) | def parse_conll(f,
function get_dependents_of_id (line 83) | def get_dependents_of_id(idx, df=False, repeat=False, attr=False, coref=...
function get_governors_of_id (line 103) | def get_governors_of_id(idx, df=False, repeat=False, attr=False, coref=F...
function get_match (line 116) | def get_match(idx, df=False, repeat=False, attr=False, **kwargs):
function get_head (line 125) | def get_head(idx, df=False, repeat=False, attr=False, **kwargs):
function get_representative (line 155) | def get_representative(idx,
function get_all_corefs (line 183) | def get_all_corefs(s, i, df, coref=False):
function search_this (line 197) | def search_this(df, obj, attrib, pattern, adjacent=False, coref=False):
function show_fix (line 240) | def show_fix(show):
function dummy (line 255) | def dummy(x, *args, **kwargs):
function format_toks (line 258) | def format_toks(to_process, show, df):
function make_series (line 336) | def make_series(ser, df=False, obj=False,
function joiner (line 421) | def joiner(ser):
function make_new_for_dep (line 424) | def make_new_for_dep(dfmain, dfdep, name):
function turn_pos_to_wc (line 456) | def turn_pos_to_wc(ser, showval):
function concline_generator (line 467) | def concline_generator(matches, idxs, df, metadata,
function p_series_to_x_series (line 512) | def p_series_to_x_series(val):
function fast_simple_conc (line 515) | def fast_simple_conc(dfss, idxs, show,
function make_collocate_show (line 672) | def make_collocate_show(show, current):
function show_this (line 686) | def show_this(df, matches, show, metadata, conc=False,
function remove_by_mode (line 771) | def remove_by_mode(matches, mode, criteria):
function determine_adjacent (line 782) | def determine_adjacent(original):
function cut_df_by_metadata (line 794) | def cut_df_by_metadata(df, metadata, criteria, coref=False,
function cut_df_by_meta (line 839) | def cut_df_by_meta(df, just_metadata, skip_metadata):
function tgrep_searcher (line 853) | def tgrep_searcher(f=False,
function slow_tregex (line 911) | def slow_tregex(metadata=False,
function get_stats (line 1007) | def get_stats(from_df=False, metadata=False, feature=False, root=False, ...
function get_corefs (line 1081) | def get_corefs(df, matches):
function pipeline (line 1097) | def pipeline(f=False,
function load_raw_data (line 1254) | def load_raw_data(f):
function get_speaker_from_offsets (line 1270) | def get_speaker_from_offsets(stripped, plain, sent_offsets,
function convert_json_to_conll (line 1324) | def convert_json_to_conll(path,
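
The .conll files under data/ (previewed later in this document) store one token per tab-separated line, with sentence-level metadata such as '# parse=(ROOT ...' in comment lines. The following is a generic illustration of loading such a file into pandas, not corpkit's parse_conll, which presumably builds a richer, indexed DataFrame for the id-based lookups listed above.

import pandas as pd

def read_conll(path):
    """Naive CoNLL reader: skip comment/blank lines, split token rows on tabs."""
    rows = []
    with open(path) as fo:
        for line in fo:
            line = line.rstrip('\n')
            if not line or line.startswith('#'):
                continue  # e.g. '# parse=(ROOT ...' sentence metadata
            rows.append(line.split('\t'))
    return pd.DataFrame(rows)

df = read_conll('data/test-plain-parsed/first/intro.txt.conll')
print(df.shape)
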
FILE: corpkit/corpus.py
class Corpus (line 11) | class Corpus(object):
method __init__ (line 24) | def __init__(self, path, **kwargs):
method subcorpora (line 142) | def subcorpora(self):
method speakerlist (line 167) | def speakerlist(self):
method files (line 175) | def files(self):
method all_filepaths (line 195) | def all_filepaths(self):
method conll_conform (line 209) | def conll_conform(self, errors='raise'):
method all_files (line 251) | def all_files(self):
method tfidf (line 265) | def tfidf(self, search={'w': 'any'}, show=['w'], **kwargs):
method __str__ (line 293) | def __str__(self):
method __repr__ (line 318) | def __repr__(self):
method __getitem__ (line 330) | def __getitem__(self, key):
method __delitem__ (line 353) | def __delitem__(self, key):
method features (line 370) | def features(self):
method _get_postags_and_wordclasses (line 410) | def _get_postags_and_wordclasses(self):
method wordclasses (line 446) | def wordclasses(self):
method postags (line 464) | def postags(self):
method lexicon (line 483) | def lexicon(self, **kwargs):
method configurations (line 512) | def configurations(self, search, **kwargs):
method interrogate (line 551) | def interrogate(self, search='w', *args, **kwargs):
method sample (line 813) | def sample(self, n, level='f'):
method delete_metadata (line 849) | def delete_metadata(self):
method metadata (line 857) | def metadata(self):
method parse (line 864) | def parse(self,
method tokenise (line 963) | def tokenise(self, postag=True, lemmatise=True, *args, **kwargs):
method concordance (line 997) | def concordance(self, *args, **kwargs):
method interroplot (line 1035) | def interroplot(self, search, **kwargs):
method save (line 1058) | def save(self, savename=False, **kwargs):
method make_language_model (line 1074) | def make_language_model(self,
method annotate (line 1128) | def annotate(self, conclines, annotation, dry_run=True):
method unannotate (line 1154) | def unannotate(annotation, dry_run=True):
class Subcorpus (line 1166) | class Subcorpus(Corpus):
method __init__ (line 1174) | def __init__(self, path, datatype, **kwa):
method __str__ (line 1181) | def __str__(self):
method __repr__ (line 1184) | def __repr__(self):
method __getitem__ (line 1187) | def __getitem__(self, key):
class File (line 1207) | class File(Corpus):
method __init__ (line 1216) | def __init__(self, path, dirname=False, datatype=False, **kwa):
method __repr__ (line 1231) | def __repr__(self):
method __str__ (line 1234) | def __str__(self):
method read (line 1237) | def read(self, **kwargs):
method document (line 1248) | def document(self):
method trees (line 1264) | def trees(self):
method plain (line 1277) | def plain(self):
class Datalist (line 1290) | class Datalist(list):
method __init__ (line 1292) | def __init__(self, data, **kwargs):
method __repr__ (line 1299) | def __repr__(self):
method __getattr__ (line 1302) | def __getattr__(self, key):
method __getitem__ (line 1307) | def __getitem__(self, key):
method __delitem__ (line 1328) | def __delitem__(self, key):
method interrogate (line 1337) | def interrogate(self, *args, **kwargs):
method concordance (line 1353) | def concordance(self, *args, **kwargs):
method configurations (line 1364) | def configurations(self, search, **kwargs):
class Corpora (line 1375) | class Corpora(Datalist):
method __init__ (line 1387) | def __init__(self, data=False, **kwargs):
method __repr__ (line 1420) | def __repr__(self):
method parse (line 1423) | def parse(self, **kwargs):
method features (line 1441) | def features(self):
method postags (line 1453) | def postags(self):
method wordclasses (line 1461) | def wordclasses(self):
method lexicon (line 1469) | def lexicon(self):
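
Taken together, these signatures suggest the basic workflow: build a Corpus from a path, then interrogate it to get an Interrogation whose results and totals are pandas objects. A hedged sketch using the test data shipped in this repo; the search dict and show list follow the convention visible elsewhere in this index (e.g. tfidf's search={'w': 'any'}, show=['w'] defaults), and exact keys and return values may differ.

from corpkit import Corpus

corpus = Corpus('data/test-plain-parsed')   # parsed test corpus included in the repo
print(corpus.subcorpora)                    # Datalist of Subcorpus objects
print(corpus.files)                         # Datalist of File objects

# search is a dict of {attribute: pattern}; 'w' appears to mean word form
result = corpus.interrogate(search={'w': r'lingu'}, show=['w'])
print(result.results)                       # frequencies by subcorpus (pandas)
print(result.totals)
lines = corpus.concordance(search={'w': r'lingu'})
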
FILE: corpkit/cql.py
function remake_special (line 5) | def remake_special(querybit, customs=False, return_list=False, **kwargs):
function parse_quant (line 49) | def parse_quant(quant):
function process_piece (line 61) | def process_piece(piece, op='=', quant=False, **kwargs):
function tokenise_cql (line 90) | def tokenise_cql(query):
function to_corpkit (line 144) | def to_corpkit(cstring, **kwargs):
function to_cql (line 178) | def to_cql(dquery, exclude=False, **kwargs):
FILE: corpkit/dictionaries/bnc.py
function _get_bnc (line 1) | def _get_bnc():
FILE: corpkit/dictionaries/process_types.py
function _verbs (line 13) | def _verbs():
function load_verb_data (line 19) | def load_verb_data():
function find_lexeme (line 49) | def find_lexeme(verb):
function get_both_spellings (line 81) | def get_both_spellings(verb_list):
function add_verb_inflections (line 95) | def add_verb_inflections(verb_list):
class Wordlist (line 164) | class Wordlist(list):
method __init__ (line 167) | def __init__(self, data, **kwargs):
method words (line 175) | def words(self):
method lemmata (line 183) | def lemmata(self):
method as_regex (line 190) | def as_regex(self, boundaries='w', case_sensitive=False, inverse=False...
class Processes (line 201) | class Processes(object):
method __init__ (line 203) | def __init__(self):
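
Wordlist is a list subclass whose entries can be collapsed into a single regular expression via as_regex. The following is a rough illustration of that idea, not corpkit's implementation; the boundary handling and flags are assumptions based only on the keyword arguments shown above.

import re

class Wordlist(list):
    def as_regex(self, boundaries='w', case_sensitive=False):
        bound = r'\b' if boundaries == 'w' else ''
        flags = 0 if case_sensitive else re.IGNORECASE
        body = '|'.join(re.escape(w) for w in self)
        return re.compile(r'%s(?:%s)%s' % (bound, body, bound), flags)

mental = Wordlist(['think', 'know', 'believe'])
print(bool(mental.as_regex().search('We know this.')))  # True
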
FILE: corpkit/dictionaries/queries.py
class Queries (line 8) | class Queries(object):
method __init__ (line 10) | def __init__(self):
FILE: corpkit/dictionaries/roles.py
function translator (line 3) | def translator():
FILE: corpkit/dictionaries/wordlists.py
function closed_class_wordlists (line 7) | def closed_class_wordlists():
FILE: corpkit/download/corenlp.py
function corenlp_downloader (line 1) | def corenlp_downloader(custompath=False):
FILE: corpkit/editor.py
function editor (line 7) | def editor(interrogation,
FILE: corpkit/env.py
function save_history (line 144) | def save_history(history_path=history_path):
class Objects (line 177) | class Objects(object):
method __init__ (line 183) | def __init__(self):
method _get (line 247) | def _get(self, name):
function interpreter (line 269) | def interpreter(debug=False,
function install (line 2385) | def install(name, loc):
FILE: corpkit/gui.py
class SplashScreen (line 73) | class SplashScreen(object):
method __init__ (line 77) | def __init__(self, tkRoot, imageFilename, minSplashTime=0):
method __enter__ (line 97) | def __enter__(self):
method __exit__ (line 136) | def __exit__(self, exc_type, exc_value, traceback ):
class RedirectText (line 155) | class RedirectText(object):
method __init__ (line 158) | def __init__(self, text_ctrl, log_text, text_widget):
method write (line 171) | def write(self, string):
class Label2 (line 198) | class Label2(Frame):
method __init__ (line 200) | def __init__(self, master, width=0, height=0, **kwargs):
method pack (line 209) | def pack(self, *args, **kwargs):
method grid (line 213) | def grid(self, *args, **kwargs):
class HyperlinkManager (line 218) | class HyperlinkManager:
method __init__ (line 220) | def __init__(self, text):
method reset (line 227) | def reset(self):
method add (line 229) | def add(self, action):
method _enter (line 235) | def _enter(self, event):
method _leave (line 237) | def _leave(self, event):
method _click (line 239) | def _click(self, event):
class Notebook (line 245) | class Notebook(Frame):
method __init__ (line 247) | def __init__(self, parent, activerelief=RAISED, inactiverelief=FLAT,
method change_tab (line 327) | def change_tab(self, IDNum):
method add_tab (line 345) | def add_tab(self, width=2, **kw):
method destroy_tab (line 360) | def destroy_tab(self, tab):
method focus_on (line 373) | def focus_on(self, tab):
function corpkit_gui (line 384) | def corpkit_gui(noupdate=False, loadcurrent=False, debug=False):
function install (line 7138) | def install(name, loc):
FILE: corpkit/inflect.py
function definite_article (line 73) | def definite_article(word):
function indefinite_article (line 76) | def indefinite_article(word):
function article (line 88) | def article(word, function=INDEFINITE):
function referenced (line 95) | def referenced(word, article=INDEFINITE):
function pluralize (line 389) | def pluralize(word, pos=NOUN, custom={}, classical=True):
function singularize (line 594) | def singularize(word, pos=NOUN, custom={}):
function _count_syllables (line 652) | def _count_syllables(word):
function grade (line 663) | def grade(adjective, suffix=COMPARATIVE):
function comparative (line 695) | def comparative(adjective):
function superlative (line 698) | def superlative(adjective):
function attributive (line 703) | def attributive(adjective):
function predicative (line 706) | def predicative(adjective):
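
inflect.py is adapted from the pattern.en inflection module (see its preview later in this file). A small usage sketch follows; the outputs in the comments are expectations rather than values verified against this copy.

from corpkit.inflect import pluralize, singularize, comparative, superlative

print(pluralize('analysis'))   # expected: 'analyses'
print(singularize('words'))    # expected: 'word'
print(comparative('large'))    # expected: 'larger'
print(superlative('large'))    # expected: 'largest'
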
FILE: corpkit/interrogation.py
class Interrogation (line 11) | class Interrogation(object):
method __init__ (line 18) | def __init__(self, results=None, totals=None, query=None, concordance=...
method __str__ (line 29) | def __str__(self):
method __repr__ (line 40) | def __repr__(self):
method edit (line 46) | def edit(self, *args, **kwargs):
method sort (line 244) | def sort(self, way, **kwargs):
method visualise (line 248) | def visualise(self,
method multiplot (line 372) | def multiplot(self, leftdict={}, rightdict={}, **kwargs):
method language_model (line 376) | def language_model(self, name, *args, **kwargs):
method save (line 387) | def save(self, savename, savedir='saved_interrogations', **kwargs):
method quickview (line 411) | def quickview(self, n=25):
method tabview (line 430) | def tabview(self, **kwargs):
method asciiplot (line 434) | def asciiplot(self,
method rel (line 478) | def rel(self, denominator='self', **kwargs):
method keyness (line 481) | def keyness(self, measure='ll', denominator='self', **kwargs):
method multiindex (line 484) | def multiindex(self, indexnames=None):
method topwords (line 513) | def topwords(self, datatype='n', n=10, df=False, sort=True, precision=2):
method perplexity (line 549) | def perplexity(self):
method entropy (line 570) | def entropy(self):
method shannon (line 581) | def shannon(self):
class Concordance (line 585) | class Concordance(pd.core.frame.DataFrame):
method __init__ (line 590) | def __init__(self, data):
method format (line 595) | def format(self, kind='string', n=100, window=35,
method calculate (line 629) | def calculate(self):
method shuffle (line 634) | def shuffle(self, inplace=False):
method edit (line 659) | def edit(self, *args, **kwargs):
method __str__ (line 669) | def __str__(self):
method __repr__ (line 672) | def __repr__(self):
method less (line 675) | def less(self, **kwargs):
class Interrodict (line 679) | class Interrodict(OrderedDict):
method __init__ (line 698) | def __init__(self, data):
method __getitem__ (line 708) | def __getitem__(self, key):
method __setitem__ (line 730) | def __setitem__(self, key, value):
method __repr__ (line 735) | def __repr__(self):
method __str__ (line 738) | def __str__(self):
method edit (line 741) | def edit(self, *args, **kwargs):
method multiindex (line 752) | def multiindex(self, indexnames=False):
method save (line 843) | def save(self, savename, savedir='saved_interrogations', **kwargs):
method collapse (line 867) | def collapse(self, axis='y'):
method topwords (line 945) | def topwords(self, datatype='n', n=10, df=False, sort=True, precision=2):
method visualise (line 978) | def visualise(self, shape='auto', truncate=8, **kwargs):
method copy (line 1004) | def copy(self):
method flip (line 1011) | def flip(self, truncate=30, transpose=True, repeat=False, *args, **kwa...
method get_totals (line 1065) | def get_totals(self):
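
An Interrogation can then be transformed and stored with the methods listed above. This is again a hedged sketch that relies only on the argument defaults visible in the index; the 'total' sort key is an assumed value for the 'way' parameter.

from corpkit import Corpus

result = Corpus('data/test-plain-parsed').interrogate(search={'w': r'lingu'}, show=['w'])
rel = result.rel(denominator='self')   # relative frequencies
key = result.keyness(measure='ll')     # log-likelihood keyness scores
result.topwords(n=10)                  # show the ten most frequent items
srt = result.sort('total')             # 'total' is an assumed sort key
result.save('lingu_words')             # stored under ./saved_interrogations/ by default
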
FILE: corpkit/interrogator.py
function interrogator (line 8) | def interrogator(corpus,
FILE: corpkit/keys.py
function keywords (line 6) | def keywords(target_corpus,
FILE: corpkit/lazyprop.py
function lazyprop (line 4) | def lazyprop(fn):
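
lazyprop is a cached-property decorator: the wrapped method runs once, its value is stored on the instance, and later accesses return the cached value. A generic sketch of the pattern is below; corpkit's own file repeats the definition many times so each property keeps its docstring, per its own comment shown later in this document.

from functools import wraps

def lazyprop(fn):
    attr_name = '_lazy_' + fn.__name__

    @property
    @wraps(fn)
    def _lazyprop(self):
        # compute on first access, then reuse the stored value
        if not hasattr(self, attr_name):
            setattr(self, attr_name, fn(self))
        return getattr(self, attr_name)
    return _lazyprop

class Demo(object):
    @lazyprop
    def expensive(self):
        print('computing once')
        return 42

d = Demo()
print(d.expensive, d.expensive)  # 'computing once' is printed a single time
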
FILE: corpkit/make.py
function make_corpus (line 4) | def make_corpus(unparsed_corpus_path,
FILE: corpkit/model.py
class LanguageModel (line 11) | class LanguageModel(object):
method __init__ (line 12) | def __init__(self, order, alpha, data):
method _logprob (line 42) | def _logprob(self, ngram):
method _prob (line 45) | def _prob(self, ngram):
class MultiModel (line 57) | class MultiModel(dict):
method __init__ (line 59) | def __init__(self, data, order, name='', **kwargs):
method score (line 77) | def score(self, data, **kwargs):
method _score_counts_against_model (line 111) | def _score_counts_against_model(self, counts, model):
method _turn_file_obj_into_counts (line 122) | def _turn_file_obj_into_counts(self, data, *args, **kwargs):
method score_subcorpora (line 132) | def score_subcorpora(self):
function _make_model_from_interro (line 142) | def _make_model_from_interro(self, name, order, **kwargs):
function _train (line 187) | def _train(data, name, corpusname, order=3, **kwargs):
FILE: corpkit/multiprocess.py
function pmultiquery (line 5) | def pmultiquery(corpus,
FILE: corpkit/noseinstall.py
function test_import (line 8) | def test_import():
FILE: corpkit/nosetests.py
function test_import (line 31) | def test_import():
function test_corpus_class (line 37) | def test_corpus_class():
function test_parse (line 42) | def test_parse():
function test_tokenise (line 58) | def test_tokenise():
function test_speak_parse (line 73) | def test_speak_parse():
function test_interro1 (line 93) | def test_interro1():
function test_interro2 (line 99) | def test_interro2():
function test_interro3 (line 105) | def test_interro3():
function test_interro_multiindex_tregex_justspeakers (line 131) | def test_interro_multiindex_tregex_justspeakers():
function test_conc (line 143) | def test_conc():
function test_edit (line 150) | def test_edit():
function test_tok1_interro (line 160) | def test_tok1_interro():
function test_tok2_interro (line 171) | def test_tok2_interro():
function document_check (line 187) | def document_check():
function test_conc_edit (line 201) | def test_conc_edit():
function test_symbolic_subcorpora (line 211) | def test_symbolic_subcorpora():
function test_symbolic_multiindex (line 220) | def test_symbolic_multiindex():
function check_skip_filt (line 231) | def check_skip_filt():
function check_just_filt (line 239) | def check_just_filt():
function test_interpreter (line 247) | def test_interpreter():
function check_interpreter_res_csv (line 269) | def check_interpreter_res_csv():
function check_interpreter_conc_csv (line 277) | def check_interpreter_conc_csv():
function check_interpreter_saved_interro (line 287) | def check_interpreter_saved_interro():
FILE: corpkit/other.py
function quickview (line 9) | def quickview(results, n=25):
function concprinter (line 104) | def concprinter(dataframe, kind='string', n=100,
function save (line 202) | def save(interrogation, savename, savedir='saved_interrogations', **kwar...
function load (line 314) | def load(savename, loaddir='saved_interrogations'):
function loader (line 353) | def loader(savedir='saved_interrogations'):
function new_project (line 380) | def new_project(name, loc='.', **kwargs):
function load_all_results (line 446) | def load_all_results(data_dir='saved_interrogations', **kwargs):
function texify (line 494) | def texify(series, n=20, colname='Keyness', toptail=False, sort=False):
function as_regex (line 522) | def as_regex(lst, boundaries='w', case_sensitive=False, inverse=False, c...
function make_multi (line 581) | def make_multi(interrogation, indexnames=None):
function topwords (line 701) | def topwords(self, datatype='n', n=10, df=False, sort=True, precision=2):
FILE: corpkit/plotter.py
function plotter (line 4) | def plotter(df,
function multiplotter (line 1169) | def multiplotter(df, leftdict={},rightdict={}, **kwargs):
FILE: corpkit/plugins.py
class HighlightLines (line 5) | class HighlightLines(plugins.PluginBase):
method __init__ (line 32) | def __init__(self, lines):
class InteractiveLegendPlugin (line 42) | class InteractiveLegendPlugin(plugins.PluginBase):
method __init__ (line 193) | def __init__(self, plot_elements, labels, ax=None,
method _determine_mpld3ids (line 211) | def _determine_mpld3ids(self, plot_elements):
FILE: corpkit/process.py
function tregex_engine (line 9) | def tregex_engine(corpus=False,
function show (line 255) | def show(lines, index, show='thread'):
function add_corpkit_to_path (line 261) | def add_corpkit_to_path():
function add_nltk_data_to_nltk_path (line 274) | def add_nltk_data_to_nltk_path(**kwargs):
function get_gui_resource_dir (line 288) | def get_gui_resource_dir():
function get_fullpath_to_jars (line 313) | def get_fullpath_to_jars(path_var):
function determine_datatype (line 363) | def determine_datatype(path):
function filtermaker (line 398) | def filtermaker(the_filter, case_sensitive=False, **kwargs):
function searchfixer (line 446) | def searchfixer(search, query, datatype=False):
function is_number (line 460) | def is_number(s):
function animator (line 476) | def animator(progbar,
function parse_just_speakers (line 567) | def parse_just_speakers(just_speakers, corpus):
function get_deps (line 581) | def get_deps(sentence, dep_type):
function timestring (line 589) | def timestring(input):
function makesafe (line 595) | def makesafe(variabletext, drop_datatype=True, hyphens_ok=False):
function interrogation_from_conclines (line 616) | def interrogation_from_conclines(newdata):
function checkstack (line 641) | def checkstack(the_string):
function check_tex (line 651) | def check_tex(have_ipython=True):
function get_corenlp_path (line 673) | def get_corenlp_path(corenlppath):
function unsplitter (line 729) | def unsplitter(data):
function classname (line 764) | def classname(cls):
function format_middle (line 769) | def format_middle(tree, show, df=False, sent_id=False, ixs=False):
function format_conc (line 794) | def format_conc(tups, show, df=False, sent_id=False, root=False, ixs=Fal...
function show_tree_as_per_option (line 824) | def show_tree_as_per_option(show, tree, sent=False, df=False,
function tgrep (line 866) | def tgrep(parse_string, search):
function canpickle (line 882) | def canpickle(obj):
function sanitise_dict (line 907) | def sanitise_dict(d):
function saferead (line 923) | def saferead(path):
function urlify (line 951) | def urlify(s):
function gui (line 962) | def gui():
function dictformat (line 971) | def dictformat(d, query=False):
function fix_search (line 1017) | def fix_search(search, case_sensitive=False, root=False):
function pat_format (line 1069) | def pat_format(pat, case_sensitive=False, root=False):
function make_name_to_query_dict (line 1094) | def make_name_to_query_dict(existing={}, cols=False, dtype=False):
function auto_usecols (line 1124) | def auto_usecols(search, exclude, show, usecols, coref=False):
function format_tregex (line 1192) | def format_tregex(results,
function make_conc_lines_from_whole_mid (line 1279) | def make_conc_lines_from_whole_mid(wholes,
function gettag (line 1323) | def gettag(query, lemmatag=False):
function lemmatiser (line 1341) | def lemmatiser(list_of_words, tag, translated_option,
function get_first_df (line 1359) | def get_first_df(corpus):
function make_dotfile (line 1375) | def make_dotfile(path, return_json=False, data_dict=False):
function get_corpus_metadata (line 1401) | def get_corpus_metadata(path, generate=False):
function make_df_json_name (line 1434) | def make_df_json_name(typ, subcorpora=False):
function add_df_to_dotfile (line 1442) | def add_df_to_dotfile(path, df, typ='features', subcorpora=False):
function delete_files_and_subcorpora (line 1453) | def delete_files_and_subcorpora(corpus, skip_metadata, just_metadata):
FILE: corpkit/stats.py
function tfidf (line 5) | def tfidf(self, search={'w': 'any'}, show=['w'], **kwargs):
function translate_show_for_surprisal (line 39) | def translate_show_for_surprisal(show, gramsize):
function surprisal (line 45) | def surprisal(self,
function shannon (line 64) | def shannon(self):
FILE: corpkit/textprogressbar.py
class TextProgressBar (line 5) | class TextProgressBar:
method __init__ (line 10) | def __init__(self, iterations, dirname=False, quiet=False):
method animate_ipython (line 20) | def animate_ipython(self, iter, dirname=None, quiet=False):
method update_iteration (line 31) | def update_iteration(self, elapsed_iter, dirname=None):
method __update_amount (line 40) | def __update_amount(self, new_amount, dirname=None):
method __str__ (line 62) | def __str__(self):
FILE: corpkit/tokenise.py
function nested_list_to_pandas (line 7) | def nested_list_to_pandas(toks):
function pos_tag_series (line 23) | def pos_tag_series(ser, tagger):
function lemmatise_series (line 34) | def lemmatise_series(words, postags, lemmatiser):
function write_df_to_conll (line 56) | def write_df_to_conll(df, newf, plain=False, stripped=False,
function new_fname (line 92) | def new_fname(oldpath, inpath):
function process_meta (line 104) | def process_meta(data, speaker_segmentation, metadata):
function plaintext_to_conll (line 123) | def plaintext_to_conll(inpath,
FILE: setup.py
class CustomInstallCommand (line 7) | class CustomInstallCommand(install):
method run (line 12) | def run(self):
================================================
CONDENSED PREVIEW (98 files; each entry shows path, character count, and a content snippet of the full 2,627K-character structured content)
================================================
[
{
"path": ".gitattributes",
"chars": 29,
"preview": "*.p linguist-language=Python\n"
},
{
"path": ".gitmodules",
"chars": 101,
"preview": "[submodule \"corpkit-app\"]\n\tpath = corpkit-app\n\turl = https://github.com/interrogator/corpkit-app.git\n"
},
{
"path": ".travis.yml",
"chars": 1320,
"preview": "language: python\npython:\n- '2.7'\n- '3.5'\ninstall: \n- pip install --install-option=\"--no-cython-compile\" cython\n- pip ins"
},
{
"path": "API-README.md",
"chars": 51037,
"preview": "## *corpkit*: API readme\n\n> This file is a deprecated introduction to the *corpkit* Python API. It still exists because "
},
{
"path": "Dockerfile",
"chars": 1700,
"preview": "FROM alpine:latest\nMAINTAINER interro_gator\n\n# set up a workspace so we can cache python stuff\nRUN rm -rf /.src && mkdir"
},
{
"path": "LICENSE",
"chars": 1110,
"preview": "The MIT License (MIT)\n\nCopyright (c) 2015 Daniel McDonald\nmcdonaldd, at, unimelb.edu\n\nPermission is hereby granted, free"
},
{
"path": "Makefile",
"chars": 7413,
"preview": "# Makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS =\nSPHINXBUILD "
},
{
"path": "README.md",
"chars": 12062,
"preview": "# corpkit: sophisticated corpus linguistics\n\n[:\n \"\"\"\n If th"
},
{
"path": "corpkit/blanknotebook.ipynb",
"chars": 4566,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# blanknotebook\"\n ]\n },\n {\n \""
},
{
"path": "corpkit/build.py",
"chars": 25507,
"preview": "from __future__ import print_function\nfrom corpkit.constants import STRINGTYPE, PYTHON_VERSION, INPUTFUNC\n\n\"\"\"\nThis file"
},
{
"path": "corpkit/completer.py",
"chars": 642,
"preview": "class Completer(object):\n \"\"\"\n Tab completion for interpreter\n \"\"\"\n\n def __init__(self, words):\n self"
},
{
"path": "corpkit/configurations.py",
"chars": 4322,
"preview": "def configurations(corpus, search, **kwargs):\n \"\"\"\n Get summary of behaviour of a word\n\n see corpkit.corpus.Cor"
},
{
"path": "corpkit/conll.py",
"chars": 52013,
"preview": "\"\"\"\ncorpkit: process CONLL formatted data\n\"\"\"\n\ndef parse_conll(f,\n first_time=False,\n just"
},
{
"path": "corpkit/constants.py",
"chars": 2815,
"preview": "import sys\nimport codecs\n\n# python 2/3 coompatibility\nPYTHON_VERSION = sys.version_info.major\nSTRINGTYPE = str if PYTHON"
},
{
"path": "corpkit/corpkit",
"chars": 1569,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nA script to start the corpkit interpeter with options\n\"\"\"\n\nimport sys\nimport os\n\n# determine "
},
{
"path": "corpkit/corpkit.1",
"chars": 773,
"preview": ".TH corpkit 1\n.SH NAME\ncorpkit \\- corpus linguistics interface\n.SH SYNOPSIS\n.B corpkit\n[\\fB\\-c\\fR \\fICOMMAND\\fR]\n[\\fB\\-d"
},
{
"path": "corpkit/corpus.py",
"chars": 56841,
"preview": "\"\"\"\ncorpkit: Corpus and Corpus-like objects\n\"\"\"\n\nfrom __future__ import print_function\n\nfrom lazyprop import lazyprop\nfr"
},
{
"path": "corpkit/cql.py",
"chars": 6040,
"preview": "\"\"\"\nTranslating between CQL and corpkit's native \n\"\"\"\n\ndef remake_special(querybit, customs=False, return_list=False, **"
},
{
"path": "corpkit/dictionaries/__init__.py",
"chars": 774,
"preview": "__all__ = [\"wordlists\", \"roles\", \"bnc\", \"processes\", \"verbs\", \n \"uktous\", \"tagtoclass\", \"queries\", \"mergetags\""
},
{
"path": "corpkit/dictionaries/bnc.p",
"chars": 136956,
"preview": "ccopy_reg\n_reconstructor\np0\n(ccollections\nCounter\np1\nc__builtin__\ndict\np2\n(dp3\nVsecondly\np4\nI29\nsVwritings\np5\nI11\nsVpard"
},
{
"path": "corpkit/dictionaries/bnc.py",
"chars": 684,
"preview": "def _get_bnc():\n \"\"\"Load the BNC\"\"\"\n import corpkit\n try:\n import cPickle as pickle\n except ImportErr"
},
{
"path": "corpkit/dictionaries/eng_verb_lexicon.p",
"chars": 915423,
"preview": "(dp0\nFnan\n(lp1\nS\"belly-laughs'\"\np2\naS'belly-laughing'\np3\naS'belly-laughed'\np4\naS'belly-laughed'\np5\nasS'fawn'\np6\n(lp7\nS'f"
},
{
"path": "corpkit/dictionaries/process_types.py",
"chars": 20316,
"preview": "#!/usr/bin/python\n\n# dictionaries: process type wordlists\n# Author: Daniel McDonald\n\n# make regular expressions and "
},
{
"path": "corpkit/dictionaries/queries.py",
"chars": 911,
"preview": "\ntry:\n from corpkit.lazyprop import lazyprop\nexcept:\n import corpkit\n from lazyprop import lazyprop\n\nclass Quer"
},
{
"path": "corpkit/dictionaries/roles.py",
"chars": 3092,
"preview": "# This file translates CoreNLP labels into SFL categories\n\ndef translator():\n from collections import namedtuple\n "
},
{
"path": "corpkit/dictionaries/stopwords.py",
"chars": 13406,
"preview": "\n# stopwords from spindle\nfrom corpkit.dictionaries.process_types import Wordlist\nstopwords = Wordlist([\"yeah\", \"monday\""
},
{
"path": "corpkit/dictionaries/verblist.py",
"chars": 99966,
"preview": "allverbs = [\"bird's-nest\",\n 'abandon',\n 'abase',\n 'abash',\n 'abate',\n 'abbreviate',\n 'abdicate',\n 'abduct',\n 'abet',\n 'a"
},
{
"path": "corpkit/dictionaries/word_transforms.py",
"chars": 72019,
"preview": "\"\"\"\ncorpkit: manual word cludging\n\"\"\"\n\n# WordNet/CoreNLP lemmatiser are used for lemmatisation, but they can both \n# str"
},
{
"path": "corpkit/dictionaries/wordlists.py",
"chars": 17459,
"preview": "\"\"\"\nlists of closed class words\n\"\"\"\n\n# feel free to correct/add things---this was just a quick grab from the web\n\ndef cl"
},
{
"path": "corpkit/download/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "corpkit/download/corenlp.py",
"chars": 826,
"preview": "def corenlp_downloader(custompath=False):\n \"\"\"\n Very simple CoreNLP downloader\n\n :param custompath: A path wher"
},
{
"path": "corpkit/editor.py",
"chars": 35415,
"preview": "\"\"\"\ncorpkit: edit Interrogation, Concordance and Interrodict objects\n\"\"\"\nfrom __future__ import print_function\nfrom corp"
},
{
"path": "corpkit/env.py",
"chars": 91557,
"preview": "\"\"\"\nA corpkit interpreter, with natural language commands.\n\ntodo:\n\n* documentation\n* handling of kwargs tuples etc\n* che"
},
{
"path": "corpkit/gui.py",
"chars": 316400,
"preview": "#!/usr/bin/env python\n\n\"\"\"\n# corpkit GUI\n# Daniel McDonald\n\n# This file conains the frontend side of the corpkit gui.\n# "
},
{
"path": "corpkit/inflect.py",
"chars": 30573,
"preview": "#### PATTERN | EN | INFLECT ########################################################################\n# -*- coding: utf-8"
},
{
"path": "corpkit/interpreter_tests.cki",
"chars": 489,
"preview": "# this code is written in corpkit interpreter language.\n# it is used to test that the interpreter works properly\n\nset te"
},
{
"path": "corpkit/interrogation.py",
"chars": 39908,
"preview": "\"\"\"\ncorpkit: `Int`errogation and Interrogation-like classes\n\"\"\"\n\nfrom __future__ import print_function\n\nfrom collections"
},
{
"path": "corpkit/interrogator.py",
"chars": 40018,
"preview": "\"\"\"\ncorpkit: Interrogate a parsed corpus\n\"\"\"\n\nfrom __future__ import print_function\nfrom corpkit.constants import STRING"
},
{
"path": "corpkit/keys.py",
"chars": 5819,
"preview": "\"\"\"corpkit: simple keyworder\"\"\"\n\nfrom __future__ import print_function\nfrom corpkit.constants import STRINGTYPE, PYTHON_"
},
{
"path": "corpkit/layouts.py",
"chars": 2943,
"preview": "\"\"\"\nThis file contains a dictionary of matplotlib subplot layouts\n\nThey are used during multiplotting. Sooner or later t"
},
{
"path": "corpkit/lazyprop.py",
"chars": 3389,
"preview": "# this file duplicates lazyprop many times because i can't work out how to\n# automatically add the right docstring...\n\nd"
},
{
"path": "corpkit/make.py",
"chars": 15857,
"preview": "from __future__ import print_function\n\nfrom corpkit.constants import INPUTFUNC, PYTHON_VERSION\ndef make_corpus(unparsed_"
},
{
"path": "corpkit/model.py",
"chars": 6817,
"preview": "\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport math\nimport os\nfrom nltk import ngrams, s"
},
{
"path": "corpkit/multiprocess.py",
"chars": 15128,
"preview": "\"\"\"corpkit: multiprocessing of interrogations\"\"\"\n\nfrom __future__ import print_function\n\ndef pmultiquery(corpus, \n "
},
{
"path": "corpkit/new_project",
"chars": 296,
"preview": "#!/usr/bin/env python\n\nfrom __future__ import print_function\n\n\"\"\"\nA script to create a new corpkit project\n\"\"\"\n\nimport s"
},
{
"path": "corpkit/noseinstall.py",
"chars": 315,
"preview": "import os\nfrom nose.tools import assert_equals\nfrom corpkit import *\n\nunparsed_path = 'data/test'\nparsed_path = 'data/te"
},
{
"path": "corpkit/nosetests.py",
"chars": 9541,
"preview": "\"\"\"\nThis file contains tests for the corpkit API, to be run by Nose.\n\nThere are fast and slow tests. Slow tests include "
},
{
"path": "corpkit/other.py",
"chars": 27102,
"preview": "from __future__ import print_function\n\n\"\"\"\nIn here are functions used internally by corpkit, but also\nmight be called by"
},
{
"path": "corpkit/parse",
"chars": 847,
"preview": "#!/usr/bin/env python\n\nfrom __future__ import print_function\n\n\"\"\"\nA script to parse using corpkit\n\n:Example:\n\n$ parse ju"
},
{
"path": "corpkit/plotter.py",
"chars": 45579,
"preview": "from __future__ import print_function\nfrom corpkit.constants import STRINGTYPE, PYTHON_VERSION\n\ndef plotter(df,\n "
},
{
"path": "corpkit/plugins.py",
"chars": 8690,
"preview": "import mpld3\nimport collections\nfrom mpld3 import plugins, utils\n\nclass HighlightLines(plugins.PluginBase):\n\n \"\"\"A pl"
},
{
"path": "corpkit/process.py",
"chars": 52888,
"preview": "\"\"\"\nIn here are functions used internally by corpkit, \nnot intended to be called by users.\n\"\"\"\n\nfrom __future__ import "
},
{
"path": "corpkit/stats.py",
"chars": 2410,
"preview": "\"\"\"\nscikit-learn stuff\n\"\"\"\n\ndef tfidf(self, search={'w': 'any'}, show=['w'], **kwargs):\n \"\"\"\n Generate TF-IDF vect"
},
{
"path": "corpkit/textprogressbar.py",
"chars": 2515,
"preview": "#!/usr/bin/python\n\nfrom __future__ import print_function\n\nclass TextProgressBar:\n \"\"\"a text progress bar for CLI oper"
},
{
"path": "corpkit/tokenise.py",
"chars": 6993,
"preview": "from __future__ import print_function\n\n\"\"\"\nTokenise, POS tag and lemmatise a corpus, returning CONLL-U data\n\"\"\"\n\ndef nes"
},
{
"path": "corpkit/tregex.sh",
"chars": 134,
"preview": "#!/bin/sh\nscriptdir=`dirname $0`\n\njava -mx100m -cp \"$scriptdir/stanford-tregex.jar:\" edu.stanford.nlp.trees.tregex.Trege"
},
{
"path": "data/corpus-filelist.txt",
"chars": 139,
"preview": "/Users/daniel/Work/corpkit/corpkit/data/test-stripped/first/intro.txt\n/Users/daniel/Work/corpkit/corpkit/data/test-strip"
},
{
"path": "data/test/first/intro.txt",
"chars": 319,
"preview": "TESTER: This small corpus is used in corpkit's tests. Not a lot of data is required. <metadata test=\"on\" year=\"2004\">\nAN"
},
{
"path": "data/test/second/body.txt",
"chars": 374,
"preview": "Corpus linguistics and computational linguistics, like concordancing and interrogating, are situated on a vast continuum"
},
{
"path": "data/test-plain-parsed/first/intro.txt.conll",
"chars": 2515,
"preview": "# parse=(ROOT (FRAG (NP (NN TESTER)) (: :) (S (NP (DT This) (JJ small) (NN corpus)) (VP (VBZ is) (VP (VBN used) (PP (IN "
},
{
"path": "data/test-plain-parsed/second/body.txt.conll",
"chars": 2832,
"preview": "# parse=(ROOT (S (NP (NP (NNP Corpus) (NNS linguistics)) (CC and) (NP (JJ computational) (NNS linguistics))) (, ,) (PP ("
},
{
"path": "data/test-speak-parsed/first/intro.txt.conll",
"chars": 2221,
"preview": "# parse=(ROOT (S (NP (DT This) (JJ small) (NN corpus)) (VP (VBZ is) (VP (VBN used) (PP (IN in) (NP (NP (NN corpkit) (POS"
},
{
"path": "data/test-speak-parsed/second/body.txt.conll",
"chars": 2742,
"preview": "# parse=(ROOT (S (NP (NP (NNP Corpus) (NNS linguistics)) (CC and) (NP (JJ computational) (NNS linguistics))) (, ,) (PP ("
},
{
"path": "data/test-stripped/first/intro.txt",
"chars": 195,
"preview": "This small corpus is used in corpkit's tests. Not a lot of data is required. \nHere, we're testing the speaker_segmentati"
},
{
"path": "data/test-stripped/second/body.txt",
"chars": 298,
"preview": "Corpus linguistics and computational linguistics, like concordancing and interrogating, are situated on a vast continuum"
},
{
"path": "index.rst",
"chars": 7122,
"preview": ".. corpkit documentation master file, created by\n sphinx-quickstart on Thu Nov 5 11:43:02 2015.\n You can adapt this"
},
{
"path": "make.bat",
"chars": 7246,
"preview": "@ECHO OFF\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sphinx-build\r\n)\r\n"
},
{
"path": "meta.yaml",
"chars": 1835,
"preview": "package:\n name: corpkit\n version: \"2.3.8\"\n\nsource:\n fn: corpkit-2.3.8.tar.gz\n url: https://pypi.python.org/packages/"
},
{
"path": "requirements.txt",
"chars": 243,
"preview": "git+https://github.com/interrogator/tkintertable.git#egg=tkintertable-1.2\ngit+https://github.com/interrogator/tabview#eg"
},
{
"path": "rst_docs/API/corpkit.building.rst",
"chars": 10307,
"preview": "Creating projects and building corpora\n=======================================\n\nDoing corpus linguistics involves buildi"
},
{
"path": "rst_docs/API/corpkit.concordancing.rst",
"chars": 5992,
"preview": "\nConcordancing\n==============\n\nConcordancing is the task of getting an aligned list of *keywords in context*. Here's a v"
},
{
"path": "rst_docs/API/corpkit.editing.rst",
"chars": 6824,
"preview": ".. _editing-page:\n\nEditing results\n=====================\n\nCorpus interrogation is the task of getting frequency counts f"
},
{
"path": "rst_docs/API/corpkit.interrogating.rst",
"chars": 21066,
"preview": "Interrogating corpora\n=====================\n\nOnce you've built a corpus, you can search it for linguistic phenomena. Thi"
},
{
"path": "rst_docs/API/corpkit.langmodel.rst",
"chars": 2382,
"preview": "Using language models \n======================\n\n.. warning::\n\n Language modelling is currently deprecated, while the to"
},
{
"path": "rst_docs/API/corpkit.managing.rst",
"chars": 5175,
"preview": "Managing projects\n=================\n\n``corpkit`` has a few other bits and pieces designed to make life easier when doing"
},
{
"path": "rst_docs/API/corpkit.visualising.rst",
"chars": 10228,
"preview": "Visualising results\n=====================\n\nOne thing missing in a lot of corpus linguistic tools is the ability to produ"
},
{
"path": "rst_docs/API-ref/corpkit.corpus.rst",
"chars": 869,
"preview": "Corpus classes\n=====================\n\nMuch of *corpkit*'s functionality comes from the ability to work with ``Corpus`` a"
},
{
"path": "rst_docs/API-ref/corpkit.dictionaries.rst",
"chars": 1060,
"preview": "Wordlists\n============================\n\nClosed class word types\n-------------------------------------------\n\nVarious wor"
},
{
"path": "rst_docs/API-ref/corpkit.interrogation.rst",
"chars": 751,
"preview": "Interrogation classes\n============================\n\nOnce you have searched a ``Corpus`` object, you'll want to be able t"
},
{
"path": "rst_docs/API-ref/corpkit.other.rst",
"chars": 398,
"preview": "Functions\n====================\n\n*corpkit* contains a small set of standalone functions.\n\n`as_regex`\n--------------------"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.annotating.rst",
"chars": 3251,
"preview": "Annotating your corpus\n========================\n\nAnother thing you might like to do is add metadata or annotations to yo"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.concordancing.rst",
"chars": 3592,
"preview": "Concordancing\n===============\n\nBy default, every search also produces concordance lines. You can view them by typing ``c"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.editing.rst",
"chars": 1885,
"preview": "Editing results\n================\n\nOnce you have generated a `result` object via the `search` command, you can edit the r"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.interrogating.rst",
"chars": 4784,
"preview": "Interrogating corpora\n=======================\n\nThe most powerful thing about *corpkit* is its ability to search parsed c"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.making.rst",
"chars": 3966,
"preview": "Making projects and corpora\n============================\n\nThe first two things you need to do when using *corpkit* are t"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.managing.rst",
"chars": 3140,
"preview": "Settings and management\n========================\n\nThe interpreter can do a number of other useful things. They are outli"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.overview.rst",
"chars": 14384,
"preview": ".. _interpreter-page:\n\nOverview\n=======================\n\n*corpkit* comes with a dedicated interpreter, which receives co"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.setup.rst",
"chars": 906,
"preview": "Setup\n==============================\n\n.. contents::\n :local:\n\nDependencies\n-------------\n\nTo use the interpreter, you'"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.visualising.rst",
"chars": 2694,
"preview": "\nPlotting\n=========\n\nYou can plot results and edited results using the `plot` method, which interfaces with *matplotlib*"
},
{
"path": "setup.cfg",
"chars": 249,
"preview": "[metadata]\nname = corpkit\ndescription-file = README.md\ndescription = A toolkit for working with parsed corpora\nurl = htt"
},
{
"path": "setup.py",
"chars": 2648,
"preview": "import setuptools\nfrom setuptools import setup, find_packages\nfrom setuptools.command.install import install\nimport os\nf"
},
{
"path": "talks/IDL_seminar.tex",
"chars": 9983,
"preview": "\\documentclass{beamer} % print frames\n%\\documentclass[notes=only]{beamer} % only notes\n%\\documentclass{beamer} "
}
]
// ... and 1 more file (not shown in this preview)
About this extraction
This document contains the full source code of the interrogator/corpkit GitHub repository, extracted and formatted as plain text for large language models and other AI tools that accept text input. The extraction covers 98 files (2.3 MB, roughly 614.9k tokens) plus a symbol index of 376 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract, a GitHub-repository-to-text converter by Nikandr Surkov.