Repository: interrogator/corpkit
Branch: master
Commit: c54be1f8c83d
Files: 98
Total size: 2.3 MB
Directory structure:
gitextract_mzzg7lm1/
├── .gitattributes
├── .gitmodules
├── .travis.yml
├── API-README.md
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── bld.bat
├── build.sh
├── conf.py
├── corpkit/
│ ├── __init__.py
│ ├── annotate.py
│ ├── blanknotebook.ipynb
│ ├── build.py
│ ├── completer.py
│ ├── configurations.py
│ ├── conll.py
│ ├── constants.py
│ ├── corpkit
│ ├── corpkit.1
│ ├── corpus.py
│ ├── cql.py
│ ├── dictionaries/
│ │ ├── __init__.py
│ │ ├── bnc.p
│ │ ├── bnc.py
│ │ ├── eng_verb_lexicon.p
│ │ ├── process_types.py
│ │ ├── queries.py
│ │ ├── roles.py
│ │ ├── stopwords.py
│ │ ├── verblist.py
│ │ ├── word_transforms.py
│ │ └── wordlists.py
│ ├── download/
│ │ ├── __init__.py
│ │ └── corenlp.py
│ ├── editor.py
│ ├── env.py
│ ├── gui.py
│ ├── inflect.py
│ ├── interpreter_tests.cki
│ ├── interrogation.py
│ ├── interrogator.py
│ ├── keys.py
│ ├── layouts.py
│ ├── lazyprop.py
│ ├── make.py
│ ├── model.py
│ ├── multiprocess.py
│ ├── new_project
│ ├── noseinstall.py
│ ├── nosetests.py
│ ├── other.py
│ ├── parse
│ ├── plotter.py
│ ├── plugins.py
│ ├── process.py
│ ├── stanford-tregex.jar
│ ├── stats.py
│ ├── textprogressbar.py
│ ├── tokenise.py
│ └── tregex.sh
├── data/
│ ├── corpus-filelist.txt
│ ├── test/
│ │ ├── first/
│ │ │ └── intro.txt
│ │ └── second/
│ │ └── body.txt
│ ├── test-plain-parsed/
│ │ ├── first/
│ │ │ └── intro.txt.conll
│ │ └── second/
│ │ └── body.txt.conll
│ ├── test-speak-parsed/
│ │ ├── first/
│ │ │ └── intro.txt.conll
│ │ └── second/
│ │ └── body.txt.conll
│ └── test-stripped/
│ ├── first/
│ │ └── intro.txt
│ └── second/
│ └── body.txt
├── index.rst
├── make.bat
├── meta.yaml
├── requirements.txt
├── rst_docs/
│ ├── API/
│ │ ├── corpkit.building.rst
│ │ ├── corpkit.concordancing.rst
│ │ ├── corpkit.editing.rst
│ │ ├── corpkit.interrogating.rst
│ │ ├── corpkit.langmodel.rst
│ │ ├── corpkit.managing.rst
│ │ └── corpkit.visualising.rst
│ ├── API-ref/
│ │ ├── corpkit.corpus.rst
│ │ ├── corpkit.dictionaries.rst
│ │ ├── corpkit.interrogation.rst
│ │ └── corpkit.other.rst
│ └── interpreter/
│ ├── corpkit.interpreter.annotating.rst
│ ├── corpkit.interpreter.concordancing.rst
│ ├── corpkit.interpreter.editing.rst
│ ├── corpkit.interpreter.interrogating.rst
│ ├── corpkit.interpreter.making.rst
│ ├── corpkit.interpreter.managing.rst
│ ├── corpkit.interpreter.overview.rst
│ ├── corpkit.interpreter.setup.rst
│ └── corpkit.interpreter.visualising.rst
├── setup.cfg
├── setup.py
└── talks/
└── IDL_seminar.tex
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitattributes
================================================
*.p linguist-language=Python
================================================
FILE: .gitmodules
================================================
[submodule "corpkit-app"]
path = corpkit-app
url = https://github.com/interrogator/corpkit-app.git
================================================
FILE: .travis.yml
================================================
language: python
python:
- '2.7'
- '3.5'
install:
- pip install --install-option="--no-cython-compile" cython
- pip install -r requirements.txt
- nltkd=$(python -c 'from __future__ import print_function; import nltk; print(nltk.data.path[0])')
- python -m nltk.downloader punkt -d "$nltkd"
- python -m nltk.downloader wordnet -d "$nltkd"
- python -m nltk.downloader averaged_perceptron_tagger -d "$nltkd"
script:
- nosetests corpkit/nosetests.py -a '!slow'
deploy:
provider: pypi
user: mcddjx
password:
secure: I7K+LWe37vRytA0QpF9sAdGaTYbwq0NuN6Xi6QgrSYr08WO5wKSZJ9bkBtJF4U9OCAtRjM64hOY+eobnKfwbNE+IHG8znI9z40jHyyCayYtk5P5UOG6OtB5wBbhXLjb9qXzy21byFcY1zM7iEUKw8D+Q4nu8cENFmx9agG025jet4MHXqtQlQYxTVr7GLK0oAqxO19J/D7F6Ykn2UEHw9dm3X0gu94gM6fMN1lIS74DM4d2IzRWOZrIaYigL8ckDSkWP9taVM553aI9qrLCz/4prCKwxo0QAINExiPYjSwG1swzTfabZvPI5bVdxY23TTx86Af6z3BQuhpIY1fspDTaw/Gn527XWFeOuqI8jhf6pP6ZdOo7qiyVwqU33/5CoTW+A/o1o963SDHSjyarxbz+De10zLScCvfIsZ2uHnh3CFnlWUeprjV09QIuz2lQbZoQP817/CAdxqLaMl/aG7Wcf4X7MI/SQauLVYR91gkhiBWzBdrYNGOEsrr7dzc5tbqBLeupF6Nf811BR2SdoGIfmihQGrYdC271/HuHTLsrcvXaCyXWElA1ATSRy6XfC8IsljU695Bm6kSrb4pG4V64P2Lhe2F8wtu4L1IzP+w7NRbeZNntMqMfksZz5vNe3CVhqcPy8VmOZGsmOaa9PIFHzZ7pM1Pxybt25Hz+GXBQ=
on:
tags: true
distributions: sdist bdist_wheel
repo: interrogator/corpkit
git:
submodules: false
================================================
FILE: API-README.md
================================================
## *corpkit*: API readme
> This file is a deprecated introduction to the *corpkit* Python API. It still exists because it contains a lot of useful information and advanced examples that are not found elsewhere. It is deprecated because better documentation is available at [ReadTheDocs](http://corpkit.readthedocs.org/en/latest/).
- [What's in here?](#whats-in-here)
- [`Corpus()`](#corpus)
- [Navigating `Corpus` objects](#navigating-corpus-objects)
- [`interrogate()` method](#interrogate-method)
- [`concordance()` method](#concordance-method)
- [`Interrogation`](#interrogation)
- [`edit()` method](#edit-method)
- [`visualise()` method](#visualise-method)
- [Functions, lists, etc.](#functions-lists-etc)
- [Installation](#installation)
- [By downloading the repository](#by-downloading-the-repository)
- [By cloning the repository](#by-cloning-the-repository)
- [Via `pip`](#via-pip)
- [Quickstart](#quickstart)
- [More detailed examples](#more-detailed-examples)
- [`search`, `exclude` and `show`](#search-exclude-and-show)
- [Working with coreferences](#working-with-coreferences)
- [Building corpora](#building-corpora)
- [Speaker IDs](#speaker-ids)
- [Navigating parsed corpora](#navigating-parsed-corpora)
- [Getting general stats](#getting-general-stats)
- [Concordancing](#concordancing)
- [Systemic functional stuff](#systemic-functional-stuff)
- [Keywording](#keywording)
- [Visualising keywords](#visualising-keywords)
- [Traditional reference corpora](#traditional-reference-corpora)
- [Parallel processing](#parallel-processing)
- [Multiple corpora](#multiple-corpora)
- [Multiple speakers](#multiple-speakers)
- [Multiple queries](#multiple-queries)
- [More complex queries and plots](#more-complex-queries-and-plots)
- [Visualisation options](#visualisation-options)
- [Contact](#contact)
- [Cite](#cite)
<a name="whats-in-here"></a>
## What's in here?
Essentially, the module contains classes, methods and functions for building and interrogating corpora, then manipulating or visualising the results.
<a name="corpus"></a>
### `Corpus()`
First, there's a `Corpus()` class, which models a corpus of CoreNLP XML, lists of tokens, or plaintext files, creating subclasses for subcorpora and corpus files.
To use it, simply feed it a path to a directory containing `.txt` files, or subfolders containing `.txt` files.
```python
>>> from corpkit import Corpus
>>> unparsed = Corpus('path/to/data')
```
With the `Corpus()` class, the following attributes are available:
| Attribute | Purpose |
|-----------|---------|
| `corpus.subcorpora` | list of subcorpus objects with indexing/slicing methods |
| `corpus.features` | Corpus features (characters, clauses, words, tokens, process types, passives, etc.) |
| `corpus.postags` | Distribution of parts of speech |
| `corpus.wordclasses` | Distribution of word classes |
as well as the following methods:
| Method | Purpose |
|--------|---------|
| `corpus.parse()` | Create a parsed version of a plaintext corpus |
| `corpus.tokenise()` | Create a tokenised version of a plaintext corpus |
| `corpus.interrogate()` | Interrogate the corpus for lexicogrammatical features |
| `corpus.concordance()` | Concordance via lexis and/or grammar |
<a name="navigating-corpus-objects"></a>
#### Navigating `Corpus` objects
Once you've defined a Corpus, you can move around it very easily:
```python
### corpus containing annual subcorpora of NYT articles
>>> corpus = Corpus('data/NYT-parsed')
>>> list(corpus.subcorpora)[:3]
### [<corpkit.corpus.Subcorpus instance: 1987>,
### <corpkit.corpus.Subcorpus instance: 1988>,
### <corpkit.corpus.Subcorpus instance: 1989>]
>>> corpus.subcorpora[0].path, corpus.subcorpora[0].datatype
### ('/Users/daniel/Work/risk/data/NYT-parsed/1987', 'parse')
>>> corpus.subcorpora.c1989.files[10:13]
### [<corpkit.corpus.File instance: NYT-1989-01-01-10-1.txt.xml>,
### <corpkit.corpus.File instance: NYT-1989-01-01-10-2.txt.xml>,
### <corpkit.corpus.File instance: NYT-1989-01-01-11-1.txt.xml>]
```
Most attributes, and the `.interrogate()` and `.concordance()` methods, can also be called on `Subcorpus` and `File` objects. `File` objects also have a `.read()` method.
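For illustration, here is a minimal sketch of working at those lower levels (the indices and query are hypothetical; lowercase string keys such as `'w'` can stand in for the search constants introduced below):
```python
### interrogate one subcorpus, concordance one file, and read its raw text
>>> sub = corpus.subcorpora[0]
>>> res = sub.interrogate({'w': r'(?i)^risk'}, show='l')
>>> f = sub.files[0]
>>> lines = f.concordance({'w': r'(?i)^risk'})
>>> print f.read()[:200]
```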
<a name="interrogate-method"></a>
#### `interrogate()` method
* Use [Tregex](http://nlp.stanford.edu/~manning/courses/ling289/Tregex.html), regular expressions or wordlists to search parse trees, dependencies, token lists or plain text for complex lexicogrammatical phenomena
* Search for, exclude and show word, lemma, POS tag, semantic role, governor, dependent, index (etc) of a token
* N-gramming
* Two-way UK-US spelling conversion
* Output Pandas DataFrames that can be easily edited and visualised
* Use parallel processing to search for a number of patterns, or search for the same pattern in multiple corpora
* Restrict searches to particular speakers in a corpus
* Works on collections of corpora, corpora, subcorpora, single files, or slices thereof
* Quickly save to and load from disk with `save()` and `load()`
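As a hedged sketch combining a few of the options above (the query and save name are hypothetical; lowercase string keys work if you haven't imported the shorthand constants):
```python
### search lemmata, show lemma and POS, parallel-process, then store the result
>>> res = corpus.interrogate({'l': r'(?i)^risk'}, show=['l', 'p'], multiprocess=True)
>>> res.save('risk_lemmata')  # retrievable in a later session with load()
```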
<a name="concordance-method"></a>
#### `concordance()` method
* Equivalent to `interrogate()`, but return DataFrame of concordance lines
* Return any combination and order of words, lemmas, indices, functions, or POS tags
* Editable and saveable
* Output to LaTeX, CSV or string with `format()`
The code below demonstrates the complex kinds of queries that can be handled by the `interrogate()` and `concordance()` methods:
```python
### import * mostly so that we can access global variables like G, P, V
### otherwise, use 'w' instead of W, 'p' instead of P, etc.
>>> from corpkit import *
### select parsed corpus
>>> corpus = Corpus('data/postcounts-parsed')
### import process type lists and closed class wordlists
>>> from corpkit.dictionaries import *
### match tokens with governor that is in relational process wordlist,
### and whose function is `nsubj(pass)` or `csubj(pass)`:
>>> criteria = {GL: processes.relational.lemmata, F: r'^.subj'}
### exclude tokens whose part-of-speech is verbal,
### or whose word is in a list of pronouns
>>> exc = {P: r'^V', W: wordlists.pronouns}
# interrogate, returning slash-delimited function/lemma
>>> data = corpus.interrogate(criteria, exclude=exc, show=[F,L])
>>> lines = corpus.concordance(criteria, exclude=exc, show=[F,L])
### show results
>>> print data, lines.format(n=10, window=40, columns=[L,M,R])
```
Output sample:
```
nsubj/thing nsubj/person nsubj/problem nsubj/way nsubj/son
01 296 168 134 69 73
02 233 147 88 70 70
03 250 160 95 80 67
04 247 205 88 93 71
05 275 193 68 75 61
0 nk nsubj/it cop/be ccomp/sad advmod/when nsubj/person aux/do neg/not advcl/look ./at prep_at/w
1 /my dobj/Fluoxetine advmod/now mark/that nsubj/spring ccomp/be advmod/here ./, ./but nsubj/I a
2 y mark/because expl/there advcl/be det/a nsubj/woman ./across det/the prep_across/hall ./from
3 num/114 ccomp/pound ./, mark/so det/any nsubj/med nsubj/I rcmod/take aux/can advcl/have de
4 nsubj/Kat ./, root/be nsubj/you dep/taper ./off ./
5 /to xcomp/explain prep_from/what det/the nsubj/mark ./on poss/my prep_on/arm ./, conj_and/ne
6 det/the amod/first ./and conj_and/third nsubj/hospital nsubj/I rcmod/be advmod/at root/have num
7 e dobj/tv mark/while det/the amod/second nsubj/hospital nsubj/I cop/be rcmod/IP prep/at pcomp/in
8 nsubj/Ben ./, mark/if nsubj/you cop/be advcl/unhap
9 h ./of prep_of/sleep advmod/when det/the nsubj/reality advcl/be ./, nsubj/everyone ccomp/need n
```
<a name="interrogation"></a>
### `Interrogation`
The `corpus.interrogate()` method returns an `Interrogation` object. These have attributes:
| Attribute | Contains |
| ---------------|----------|
| `interrogation.results` | Pandas DataFrame of counts in each subcorpus |
| `interrogation.totals` | Pandas Series of totals for each subcorpus/result |
| `interrogation.query` | a `dict` of values used to generate the interrogation |
and methods:
| Method | Purpose |
|------------|---------|
| `interrogation.edit()` | Get relative frequencies, merge/remove results/subcorpora, calculate keywords, sort using linear regression, etc. |
| `interrogation.visualise()` | visualise results via *matplotlib* |
| `interrogation.save()` | Save data as pickle |
| `interrogation.quickview()` | Show top results and their absolute/relative frequency |
These methods have been monkey-patched to Pandas' DataFrame and Series objects, as well, so any slice of a result can be edited or plotted easily.
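For example, because the patching applies to ordinary pandas objects, a slice of the results can be edited and plotted directly (the column selection below is illustrative):
```python
### take the first five result columns and treat them like an Interrogation
>>> sliced = interrogation.results.iloc[:, :5]
>>> rel = sliced.edit('%', interrogation.totals)
>>> rel.visualise('Top five results', kind='line')
```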
<a name="edit-method"></a>
#### `edit()` method
* Remove, keep or merge interrogation results or subcorpora using indices, words or regular expressions (see below)
* Sort results by name or total frequency
* Use linear regression to figure out the trajectories of results, and sort by the most increasing, decreasing or static values
* Show the *p*-value for linear regression slopes, or exclude results above *p*
* Work with absolute frequency, or determine ratios/percentage of another list:
* determine the total number of verbs, or total number of verbs that are *be*
* determine the percentage of verbs that are *be*
* determine the percentage of *be* verbs that are *was*
* determine the ratio of *was/were* ...
* etc.
* Plot more advanced kinds of relative frequency: for example, find all proper nouns that are subjects of clauses, and plot each word as a percentage of all instances of that word in the corpus (see below)
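Here is a hedged sketch of the *be*-verb examples in the list above (the POS and lemma queries are illustrative):
```python
### percentage of all verbs that are forms of 'be'
>>> verbs = corpus.interrogate({P: r'^VB'}, show=L)
>>> be_share = verbs.edit('%', verbs.totals, just_entries=['be'])
### percentage of 'be' tokens that are 'was'
>>> be_forms = corpus.interrogate({L: r'^be$'}, show=W)
>>> was_share = be_forms.edit('%', be_forms.totals, just_entries=['was'])
```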
<a name="visualise-method"></a>
#### `visualise()` method
* Plot using *Matplotlib*
* Plot anything you like: words, tags, counts for grammatical features ...
* Create line charts, bar charts, pie charts, etc. with the `kind` argument
* Use `subplots=True` to produce individual charts for each result
* Customisable figure titles, axes labels, legends, image size, colormaps, etc.
* Use `TeX` if you have it
* Use log scales if you really want
* Use a number of chart styles, such as `ggplot`, `fivethirtyeight` or `seaborn-talk` (if you've got `seaborn` installed)
* Save images to file, as `.pdf` or `.png`
* Experimental interactive plots (hover-over text, interactive legends) using *mpld3*
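A small sketch combining options that appear elsewhere in this README (the title and sizes are arbitrary):
```python
### bar chart of the top seven results, one panel per entry
>>> interrogation.visualise('Top results', kind='bar', num_to_plot=7,
...        subplots=True, figsize=(10, 5), style='ggplot',
...        y_label='Absolute frequency')
```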
<a name="functions-lists-etc"></a>
### Functions, lists, etc.
There are quite a few helper functions for making regular expressions, making new projects, and so on, with more documentation forthcoming. Also included are some lists of words and dependency roles, which can be used to match functional linguistic categories. These are explained in more detail [here](#systemic-functional-stuff).
<a name="installation"></a>
## Installation
You can get *corpkit* running by downloading or cloning this repository, or via `pip`.
<a name="by-downloading-the-repository"></a>
### By downloading the repository
Hit 'Download ZIP' and unzip the file. Then `cd` into the newly created directory and install:
```shell
cd corpkit-master
# might need sudo:
python setup.py install
```
<a name="by-cloning-the-repository"></a>
### By cloning the repository
Clone the repo, `cd` into it and run the setup:
```shell
git clone https://github.com/interrogator/corpkit.git
cd corpkit
# might need sudo:
python setup.py install
```
<a name="via-pip"></a>
### Via `pip`
```shell
# might need sudo:
pip install corpkit
# or, for a local install:
# pip install --user corpkit
```
*corpkit* should install all the necessary dependencies, including *pandas*, *NLTK*, *matplotlib*, etc, as well as some NLTK data files.
<a name="quickstart"></a>
## Quickstart
Once you've got *corpkit*, and a folder containing text files, you're ready to go:
```python
### import everything
>>> from corpkit import *
### Make corpus object from path to subcorpora/text files
>>> unparsed = Corpus('data/nyt/years')
### parse it, return the new parsed corpus object
>>> corpus = unparsed.parse()
### search corpus for modal auxiliaries and plot the top results
>>> corpus.interroplot('MD')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/md2.png" />
<br>
<a name="more-detailed-examples"></a>
## More detailed examples
`interroplot()` is just a demo method that does three things in order:
1. uses `interrogate()` to search corpus for a (Regex- or Tregex-based) query
2. uses `edit()` to calculate the relative frequencies of each result
3. uses `visualise()` to show the top seven results
Here's an example of the three methods at work:
```python
### make tregex query: head of NP in PP containing 'of' in NP headed by risk word:
>>> q = r'/NN.?/ >># (NP > (PP <<# /(?i)of/ > (NP <<# (/NN.?/ < /(?i).?\brisk.?/))))'
### search trees, exclude 'risk of rain', output lemma
>>> risk_of = corpus.interrogate({T: q}, exclude={W: '^rain$'}, show=L)
### alternative syntax which may be easier when there's only a single search criterion:
# >>> risk_of = corpus.interrogate(T, q, exclude={W: '^rain$'}, show=L)
### use edit() to turn absolute into relative frequencies
>>> to_plot = risk_of.edit('%', risk_of.totals)
### plot the results
>>> to_plot.visualise('Risk of (noun)', y_label='Percentage of all results',
... style='fivethirtyeight')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/risk-of-noun.png" />
<br>
<a name="search-exclude-and-show"></a>
### `search`, `exclude` and `show`
In the example above, parse trees are searched, a particular match is excluded, and lemmata are shown. These three arguments (`search`, `exclude` and `show`) are the core of the `interrogate()` and `concordance()` methods.
The `search` and `exclude` arguments take a `dict`, with the **things to be searched as keys** and the **search patterns as values**. Here is a list of available keys for plaintext, tokenised and parsed corpora:
| Key | Gloss |
|-----|-------|
| `W` | Word |
| `L` | Lemma |
| `I` | Index of token in sentence |
| `N` | N-gram |
For parsed corpora, there are many other possible keys:
| Key | Gloss |
|-----|-------|
| `P` | Part of speech tag |
| `X` | Word class |
| `G` | Governor word |
| `GL` | Governor lemma form |
| `GP` | Governor POS |
| `GF` | Governor function |
| `D` | Dependent word |
| `DL` | Dependent lemma form |
| `DP` | Dependent POS |
| `DF` | Dependent function |
| `F` | Dependency function |
| `R` | Distance from 'root' |
| `T` | Tree |
| `S` | Predefined general stats |
Allowable combinations are subject to common sense. If you're searching trees, you can't also search governors or dependents. If you're searching an unparsed corpus, you can't search for information provided by the parser. Here are some example `search`/`exclude` values:
| search/exclude | Gloss |
|--------|-------|
| `{W: r'^p'}` | Tokens starting with 'p' |
| `{L: r'any'}` | Any lemma (often equivalent to `r'.*'`) |
| `{G: r'ing$'}` | Tokens with governor word ending in 'ing' |
| `{F: funclist}` | Tokens whose dependency function matches a `str` in `funclist` |
| `{D: r'^br', GL: r'^have$'}` | Tokens with dependent starting with 'br' and 'have' as governor lemma |
| `{I: '0', F: '^nsubj$'}` | Sentence initial tokens with role of `nsubj` |
| `{T: r'NP !<<# /NN.?'}` | NPs with non-nominal heads |
If you'd prefer, you can make a `dict` to handle dependent and governor information, instead of using things like `GL` or `DF`. The following searches produce the same output:
```python
>>> crit = {W: r'^friend$',
... D: {F: 'amod',
... W: 'great'}}
>>> crit = {W: r'^friend$', DF: 'amod', D: 'great'}
```
By default, all `search` criteria must match, but any `exclude` criterion is enough to exclude a match. This behaviour can be changed with the `searchmode` and `excludemode` arguments:
```python
### get words that end in 'ing' OR are nominal:
>>> out = corpus.interrogate({W: 'ing$', P: r'^N'}, searchmode='any')
### get any word, but exclude words that end in 'ing' AND are nominal:
>>> out = corpus.interrogate({W: 'any'}, exclude={W: 'ing$', P: r'^N'}, excludemode='all')
```
The `show` argument wants a list of keys you'd like to return for each result. The order will be respected. If you only want one thing, a `str` is OK. One additional possibility is `C`, which returns the number of occurrences only.
| `show` | return |
|--------|--------|
| `W` | `'champions'` |
| `[W]` | `'champions'` |
| `L` | `'champion'` |
| `P` | `'NNS'` |
| `X` | `'Noun'` |
| `T` | `'(np (jj prevailing) (nns champions))'` (depending on Tregex query) |
| `[P, W]` | `'NNS/champions'` |
| `[W, P]` | `'champions/NNS'` |
| `[I, L, R]` | `'2/champion/1'` |
| `[L, D, F]` | `'champion/prevailing/nsubj'` |
| `[G, GL, I]` | `'are/be/2'` |
| `[GL, GF, GP]` | `'be/root/vb'` |
| `[L, L]` | `'champion/champion'` |
| `[C]` | `24` |
Again, common sense dictates what is possible. When searching trees, only trees, words, lemmata, POS and counts can be returned. If showing trees, you can't show anything else. If you use `C`, you can't use anything else.
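For instance, two hedged calls matching rows in the table above (the queries are illustrative):
```python
### slash-joined function and lemma for each match
>>> res = corpus.interrogate({F: r'^nsubj'}, show=[F, L])
### counts only, as in the final row of the table
>>> counts = corpus.interrogate({T: r'NP <<# /NN.?/'}, show=C)
```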
<a name="working-with-coreferences"></a>
## Working with coreferences
One major challenge in corpus linguistics is the fact that pronouns stand in for other words. Parsing provides coreference resolution, which maps pronouns to the things they denote. You can enable this kind of parsing by specifying the `dcoref` annotator:
```python
>>> ops = 'tokenize,ssplit,pos,lemma,parse,ner,dcoref'
>>> parsed = corpus.interrogate(operations=ops)
```
If you have done this, you can use `coref=True` while interrogating to allow coreferents to be mapped together:
```python
>>> corpus.interrogate(query, coref=True)
```
So, if you wanted to find all the processes a certain entity is engaged in, you can get a more complete result with:
```python
>>> from corpkit.dictionaries import roles
>>> corpus.interrogate({W: 'clinton', GF: roles.process}, coref=True)
```
This will count `support` in `Clinton supported the independence of Kosovo`, and also potentially `authorize` in `He authorized the use of force`. You can also toggle the `representative=True` and `non_representative=True` arguments if you want to distinguish between copula and non-copula coreference.
```python
>>> corpus.interrogate({W: 'clinton', GF: roles.process}, coref=True, representative=False)
```
<a name="building-corpora"></a>
## Building corpora
*corpkit*'s `Corpus()` class contains `parse()` and `tokenise()` methods for creating parsed and/or tokenised corpora. The main thing you need is **a folder, containing either text files, or subfolders that contain text files**. [Stanford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml) is required to parse corpora. If you don't have it, *corpkit* can download and install it for you. If you're tokenising, you'll need to make sure you have NLTK's tokeniser data. You can then run:
```python
>>> unparsed = Corpus('path/to/unparsed/files')
### to parse, you can set a path to corenlp
>>> corpus = unparsed.parse(corenlppath='Downloads/corenlp')
### to tokenise, point to nltk:
# >>> corpus = unparsed.tokenise(nltk_data_path='Downloads/nltk_data')
```
which creates the parsed/tokenised corpora, and returns `Corpus()` objects representing them. When parsing, you can also optionally pass in a string of annotators, as per the [CoreNLP documentation](http://nlp.stanford.edu/software/corenlp.shtml):
```python
>>> ans = 'tokenize,ssplit,pos'
### you can also set memory and turn off copula head parsing,
### or multiprocess the parsing job (though you'll want a big machine)
>>> corpus = unparsed.parse(operations=ans, memory_mb=3000,
... copula_head=False, multiprocess=4)
```
<a name="speaker-ids"></a>
#### Speaker IDs
Something novel about *corpkit* is that it can work with corpora containing speaker IDs (scripts, transcripts, logs, etc.), like this:
```
JOHN: Why did they change the signs above all the bins?
SPEAKER23: I know why. But I'm not telling.
```
If you use:
```python
>>> corpus = unparsed.parse(speaker_segmentation=True)
```
This will:
1. Detect any IDs in any file
2. Create a duplicate version of the corpus with IDs removed
3. Parse this 'cleaned' corpus
4. Add an XML tag to each sentence with the name of the speaker
5. Return the parsed corpus as a `Corpus()` object
When interrogating or concordancing, you can then pass in a keyword argument to restrict searches to one or more speakers:
```python
>>> s = ['BRISCOE', 'LOGAN']
>>> npheads = corpus.interrogate(T, r'/NN.?/ >># NP', just_speakers=s)
```
This makes it possible to not only investigate individual speakers, but to form an understanding of the overall tenor/tone of the text as well: *Who does most of the talking? Who is asking the questions? Who issues commands?*
<a name="navigating-parsed-corpora"></a>
### Navigating parsed corpora
When your data is parsed, `Corpus` objects draw on [CoreNLP XML](http://corenlp-xml-library.readthedocs.org/en/latest/) to keep everything seamlessly connected:
```python
>>> corp = Corpus('data/CHT-parsed')
>>> corp.subcorpora['2013'].files[1].document.sentences[4235].parse_string
### '(ROOT (FRAG (CC And) (NP (NP (RB not) (RB just)) (NP (NP (NNP Metrione) ... '
>>> corp.subcorpora['1997'].files[0].document.sentences[3509].tokens[30].word
### 'linguistics'
```
<a name="getting-general-stats"></a>
### Getting general stats
Once you have a parsed `Corpus()` object, enter `corpus.features` to interrogate the corpus for some basic frequencies:
```python
>>> corpus = Corpus('data/sessions-parsed')
>>> corpus.features
```
Output:
```
Characters Tokens Words Closed class words Open class words Clauses Sentences Unmodalised declarative Mental processes Relational processes Interrogative Passives Verbal processes Modalised declarative Open interrogative Imperative Closed interrogative
01 26873 8513 7308 4809 3704 2212 577 280 156 98 76 35 39 26 8 2 3
02 25844 7933 6920 4313 3620 2270 266 130 195 109 29 19 35 11 5 1 3
03 18376 5683 4877 3067 2616 1640 330 174 132 68 30 40 29 8 12 6 1
04 20066 6354 5366 3587 2767 1775 319 174 176 83 33 30 20 9 9 4 1
05 23461 7627 6217 4400 3227 1978 479 245 154 93 45 51 28 20 5 3 1
06 19164 6777 5200 4151 2626 1684 298 111 165 83 43 56 14 10 6 6 2
07 22349 7039 5951 4012 3027 1947 343 183 195 82 29 30 38 12 5 5 0
08 26494 8760 7124 4960 3800 2379 545 263 170 87 66 36 32 10 6 5 4
09 23073 7747 6193 4524 3223 2056 310 149 164 88 21 26 22 10 5 3 0
10 20648 6789 5608 3817 2972 1795 437 265 139 101 34 34 39 18 5 3 2
11 25366 8533 6899 4925 3608 2207 457 230 203 116 39 48 47 15 10 4 0
12 16976 5742 4624 3274 2468 1567 258 135 183 72 23 43 22 4 3 1 6
13 25807 8546 6966 4768 3778 2345 477 257 200 124 45 50 36 15 12 3 2
```
Features such as *relational/mental/verbal* processes are difficult to locate automatically, so these counts are perhaps best seen as approximations. Even so, this data can be very helpful when using `edit()` to generate relative frequencies, for example.
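Since `edit()` is also patched onto pandas objects, one hedged way to do this is to use a `features` column as a denominator; note that the exact indexing of `.features` may differ from this sketch:
```python
### passives per clause, as a percentage (column names from the table above)
>>> feats = corpus.features
>>> passive_rate = feats['Passives'].edit('%', feats['Clauses'])
>>> passive_rate.visualise('Passives per clause', kind='line', y_label='Percentage')
```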
<a name="concordancing"></a>
## Concordancing
Unlike most concordancers, which are based on plaintext corpora, *corpkit* can concordance grammatically, using the same kind of `search`, `exclude` and `show` values as `interrogate()`.
```python
>>> subcorpus = corpus.subcorpora.c2005
### 'c' is added above to make a valid variable name from an int
### can also be accessed as corpus.subcorpora['2005']
### or corpus.subcorpora[index]
>>> query = r'/JJ.?/ > (NP <<# (/NN.?/ < /\brisk/))'
### T option for tree searching
>>> lines = subcorpus.concordance(T, query, window=50, n=10, random=True)
```
Output (a `Pandas DataFrame`):
```
0 hedge funds or high-risk stocks obviously poses a greater risk to the pension program than a portfolio of
1 contaminated water pose serious health and environmental risks
2 a cash break-even pace '' was intended to minimize financial risk to the parent company
3 Other major risks identified within days of the attack
4 One seeks out stocks ; the other monitors risks
5 men and women in Colorado Springs who were at high risk for H.I.V. infection , because of
6 by the marketing consultant Seth Godin , to taking calculated risks , in the opinion of two longtime business
7 to happen '' in premises '' where there was a high risk of fire
8 As this was match points , some of them took a slight risk at the second trick by finessing the heart
9 said that the agency 's continuing review of how Guidant treated patient risks posed by devices like the
```
You can also concordance via dependencies:
```python
### match words starting with 'st' filling function of nsubj
>>> criteria = {W: r'^st', F: r'nsubj$'}
### show function, pos and lemma (in that order)
>>> lines = subcorpus.concordance(criteria, show=[F, P, L])
>>> lines.format(window=30, n=10, columns=[L,M,R])
```
Output:
```
0 ime ./:/; cc/CC/and det/DT/the nsubj/NN/stock conj:and/VBZ/be advmod/RB/hist
1 vmod/RB/even compound/NN/sleep nsubj/NNS/study ./,/, appos/NNS/evaluation cas
2 od:poss/NNS/veteran case/POS/' nsubj/NN/study ccomp/VBZ/suggest mark/IN/that
3 det/DT/a nsubj/NN/study case/IN/in nmod:poss/NN/today
4 cc/CC/but det/DT/the nsubj/NN/study root/VBD/find mark/IN/that cas
5 pound/NN/a amod/JJ/preliminary nsubj/NN/study case/IN/of nmod:of/NNS/woman c
6 case/IN/for nmod:for/WDT/which nsubj/NNS/statistics acl:relcl/VBD/be xcomp/JJ/avai
7 amod/JJR/earlier nsubj/NNS/study aux/VBD/have root/VBN/show mar
8 ay det/DT/the amod/JJR/earlier nsubj/NNS/study aux/VBD/do neg/RB/not ccomp/VB
9 /there root/VBP/be det/DT/some nsubj/NNS/strategy ./:/- dep/JJS/most case/IN/of
```
You can search tokenised corpora or plaintext corpora for regular expressions or lists of words to match. The two queries below will return identical results:
```python
>>> r_query = r'^fr?iends?$'
>>> l_query = ['friend', 'friends', 'fiend', 'fiends']
>>> lines = subcorpus.concordance({W: r_query})
>>> lines = subcorpus.concordance({W: l_query})
```
If you really wanted, you can then go on to use `concordance()` output as a dictionary, or extract keywords and ngrams from it, or keep or remove certain results with `edit()`. If you want to [give the GUI a try](http://interrogator.github.io/corpkit/), you can colour-code and create thematic categories for concordance lines as well.
<a name="systemic-functional-stuff"></a>
## Systemic functional stuff
Because I mostly use systemic functional grammar, there is also a simple tool for distinguishing between process types (relational, mental, verbal) when interrogating a corpus. If you add words to the lists in `dictionaries/process_types.py`, corpkit will get their inflections automatically.
```python
>>> from corpkit.dictionaries import processes
### match nsubj with verbal process as governor
>>> crit = {F: '^nsubj$', G: processes.verbal}
### return lemma of the nsubj
>>> sayers = corpus.interrogate(crit, show=L)
### have a look at the top results
>>> sayers.quickview(n=20)
```
Output:
```
0: he (n=24530)
1: she (n=5558)
2: they (n=5510)
3: official (n=4348)
4: it (n=3752)
5: who (n=2940)
6: that (n=2665)
7: i (n=2062)
8: expert (n=2057)
9: analyst (n=1369)
10: we (n=1214)
11: report (n=1103)
12: company (n=1070)
13: which (n=1043)
14: you (n=987)
15: researcher (n=987)
16: study (n=901)
17: critic (n=826)
18: person (n=802)
19: agency (n=798)
20: doctor (n=770)
```
First, let's try removing the pronouns using `edit()`. The quickest way is to use the editable wordlists stored in `dictionaries/wordlists`:
```python
>>> from corpkit.dictionaries import wordlists
>>> prps = wordlists.pronouns
# alternative approaches:
# >>> prps = [0, 1, 2, 4, 5, 6, 7, 10, 13, 14, 24]
# >>> prps = ['he', 'she', 'you']
# >>> prps = as_regex(wl.pronouns, boundaries='line')
# or, by re-interrogating:
# >>> sayers = corpus.interrogate(crit, show=L, exclude={W: wordlists.pronouns})
### give edit() indices, words, wordlists or regexes to keep remove or merge
>>> sayers_no_prp = sayers.edit(skip_entries=prps, skip_subcorpora=[1963])
>>> sayers_no_prp.quickview(n=10)
```
Output:
```
0: official (n=4342)
1: expert (n=2055)
2: analyst (n=1369)
3: report (n=1098)
4: company (n=1066)
5: researcher (n=987)
6: study (n=900)
7: critic (n=825)
8: person (n=801)
9: agency (n=796)
```
Great. Now, let's sort the entries by trajectory, and then plot:
```python
### sort with edit()
### use scipy.stats.linregress to sort by 'increase', 'decrease', 'static', 'turbulent' or P
### other sort_by options: 'name', 'total', 'infreq'
>>> sayers_no_prp = sayers_no_prp.edit('%', sayers.totals, sort_by='increase')
### make an area chart with custom y label
>>> sayers_no_prp.visualise('Sayers, increasing', kind='area',
... y_label='Percentage of all sayers')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/sayers-increasing.png" />
<br>
We can also merge subcorpora. Let's look for changes in gendered pronouns:
```python
>>> merges = {'1960s': r'^196',
... '1980s': r'^198',
... '1990s': r'^199',
... '2000s': r'^200',
... '2010s': r'^201'}
>>> sayers = sayers.edit(merge_subcorpora=merges)
### now, get relative frequencies for he and she
### SELF calculates percentage after merging/removing etc has been performed,
### so that he and she will sum to 100%. Pass in `sayers.totals` to calculate
### he/she as percentage of all sayers
>>> genders = sayers.edit('%', SELF, just_entries=['he','she'])
### and plot it as a series of pie charts, showing totals on the slices:
>>> genders.visualise('Pronominal sayers in the NYT', kind='pie',
... subplots=True, figsize=(15,2.75), show_totals='plot')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/ann_he_she.png" />
<br>
Woohoo, a decreasing gender divide!
<a name="keywording"></a>
## Keywording
As I see it, there are two main problems with keywording, as typically performed in corpus linguistics. First is the reliance on 'balanced'/'general' reference corpora, which are obviously a fiction. Second is the idea of stopwords. Essentially, when most people calculate keywords, they use stopword lists to automatically filter out words that they think will not be of interest to them. These words are generally closed class words, like determiners, prepositions, or pronouns. This is not a good way to go about things: the relative frequencies of *I*, *you* and *one* can tell us a lot about the kinds of language in a corpus. More seriously, stopwords mean adding subjective judgements about what is interesting language into a process that is useful precisely because it is not subjective or biased.
So, what to do? Well, first, don't use 'general reference corpora' unless you really really have to. With *corpkit*, you can use your entire corpus as the reference corpus, and look for keywords in subcorpora. Second, rather than using lists of stopwords, simply do not send all words in the corpus to the keyworder for calculation. Instead, try looking for key *predicators* (rightmost verbs in the VP), or key *participants* (heads of arguments of these VPs):
```python
### just heads of participants' lemma form (no pronouns, though!)
>>> part = r'/(NN|JJ).?/ >># (/(NP|ADJP)/ $ VP | > VP)'
>>> p = corpus.interrogate(T, part, show=L)
```
When using `edit()` to calculate keywords, there are a few default parameters that can be easily changed:
| Keyword argument | Function | Default setting | Type |
|---|---|---|---|
| `threshold` | Remove words occurring fewer than `n` times in reference corpus | `False` | `'high'`/`'medium'`/`'low'`, `True`/`False`, or `int` |
| `calc_all` | Calculate keyness for words in both reference and target corpus, rather than just target corpus | `True` | `True`/`False` |
| `selfdrop` | Attempt to remove target data from reference data when calculating keyness | `True` | `True`/`False` |
Let's have a look at how these options change the output:
```python
### SELF as reference corpus uses p.results
>>> options = {'selfdrop': False,
... 'calc_all': False,
... 'threshold': False}
>>> for k, v in options.items():
...     key = p.edit('keywords', SELF, **{k: v})
... print key.results.ix['2011'].order(ascending=False)
```
Output:
| #1: default | | #2: no `selfdrop` | | #3: no `calc_all` | | #4: no `threshold` | |
|---|---:|---|---:|---|---:|---|---:|
| risk | 1941.47 | risk | 1909.79 | risk | 1941.47 | bank | 668.19 |
| bank | 1365.70 | bank | 1247.51 | bank | 1365.70 | crisis | 242.05 |
| crisis | 431.36 | crisis | 388.01 | crisis | 431.36 | obama | 172.41 |
| investor | 410.06 | investor | 387.08 | investor | 410.06 | demiraj | 161.90 |
| rule | 316.77 | rule | 293.33 | rule | 316.77 | regulator | 144.91 |
| | ... | | ... | | ... | | ... |
| clinton | -37.80 | tactic | -35.09 | hussein | -25.42 | clinton | -87.33 |
| vioxx | -38.00 | vioxx | -35.29 | clinton | -37.80 | today | -89.49 |
| greenspan | -54.35 | greenspan | -51.38 | vioxx | -38.00 | risky | -125.76 |
| bush | -153.06 | bush | -143.02 | bush | -153.06 | bush | -253.95 |
| yesterday | -162.30 | yesterday | -151.71 | yesterday | -162.30 | yesterday | -268.29 |
As you can see, slight variations on keywording give different impressions of the same corpus!
A key strength of *corpkit*'s approach to keywording is that you can generate new keyword lists without re-interrogating the corpus. We can use some Pandas syntax to do this more quickly.
```python
>>> yrs = ['2011', '2012', '2013', '2014']
>>> keys = p.results.ix[yrs].sum().edit('keywords', p.results.drop(yrs),
... threshold=False)
>>> print keys.results
```
Output:
```
bank 1795.24
obama 722.36
romney 560.67
jpmorgan 527.57
rule 413.94
dimon 389.86
draghi 349.80
regulator 317.82
italy 282.00
crisis 243.43
putin 209.51
greece 208.80
snowden 208.35
mf 192.78
adoboli 161.30
```
... or track the keyness of a set of words over time:
```python
>>> twords = ['terror', 'terrorism', 'terrorist']
>>> terr = p.edit(K, SELF, merge_entries={'terror': twords})
>>> print terr.results.terror
```
Output:
```
1963 -2.51
1987 -3.67
1988 -16.09
1989 -6.24
1990 -16.24
... ...
Name: terror, dtype: float64
```
<a name="visualising-keywords"></a>
### Visualising keywords
Naturally, we can use `visualise()` for our keywords too:
```python
>>> terr.results.terror.visualise('Terror* as Participant in the \emph{NYT}',
...        kind='area', stacked=False, y_label='L/L Keyness')
>>> politicians = ['bush', 'obama', 'gore', 'clinton', 'mccain',
...        'romney', 'dole', 'reagan', 'gorbachev']
>>> k.results[politicians].visualise('Keyness of politicians in the \emph{NYT}',
...        num_to_plot='all', y_label='L/L Keyness', kind='area', legend_pos='center left')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/terror-as-participant-in-the-emphnyt.png" />
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/keyness-of-politicians-in-the-emphnyt.png" />
<br>
<a name="traditional-reference-corpora"></a>
### Traditional reference corpora
If you still want to use a standard reference corpus, you can do that (and a dictionary version of the BNC is included). For the reference corpus, `edit()` recognises `dicts`, `DataFrames`, `Series`, files containing `dicts`, or paths to plain text files or trees.
```python
### arbitrary list of common/boring words
>>> from corpkit.dictionaries import stopwords
>>> print p.results.ix['2013'].edit(K, 'bnc.p', skip_entries=stopwords).results
>>> print p.results.ix['2013'].edit(K, 'bnc.p', calc_all=False).results
```
Output (not so useful):
```
#1 #2
bank 5568.25 bank 5568.25
person 5423.24 person 5423.24
company 3839.14 company 3839.14
way 3537.16 way 3537.16
state 2873.94 state 2873.94
... ...
three -691.25 ten -199.36
people -829.97 bit -205.97
going -877.83 sort -254.71
erm -2429.29 thought -255.72
yeah -3179.90 will -679.06
```
<a name="parallel-processing"></a>
## Parallel processing
`interrogate()` can also do parallel-processing. You can generally improve the speed of an interrogation by setting the `multiprocess` argument:
```python
### set num of parallel processes manually
>>> data = corpus.interrogate({T: r'/NN.?/ >># NP'}, multiprocess=3)
### set num of parallel processes automatically
>>> data = corpus.interrogate({T: r'/NN.?/ >># NP'}, multiprocess=True)
```
Multiprocessing is particularly useful, however, when you are interested in multiple corpora, speaker IDs, or search queries. The sections below explain how.
<a name="multiple-corpora"></a>
#### Multiple corpora
To parallel-process multiple corpora, first, wrap them up as a `Corpora()` object. To do this, you can pass in:
1. a list of paths
2. a list of `Corpus()` objects
3. A single path string that contains corpora
```python
>>> from corpkit.corpus import Corpora
>>> corpora = Corpora('./data') # path containing corpora
>>> corpora
### <corpkit.corpus.Corpora instance: 6 items>
### interrogate by parallel processing, 4 at a time
>>> output = corpora.interrogate(T, r'/NN.?/ < /(?i)^h/', show=L, multiprocess=4)
```
The output of a multiprocessed interrogation will generally be a `dict` with corpus/speaker/query names as keys. The main exception to this is if you use `show=C`, which will concatenate results from each query into a single `Interrogation` object, using corpus/speaker/query names as column names.
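A minimal sketch of handling that dictionary-style output (names follow the example above):
```python
### each value is an ordinary Interrogation, keyed by corpus name
>>> for name, result in output.items():
...     print name, result.totals.sum()
```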
<a name="multiple-speakers"></a>
#### Multiple speakers
Passing in a list of speaker names will also trigger multiprocessing:
```python
>>> from corpkit.dictionaries import wordlists
>>> spkrs = ['MEYER', 'JAY']
>>> each_speaker = corpus.interrogate(W, wordlists.closedclass, just_speakers=spkrs)
```
There is also `just_speakers='each'`, which will be automatically expanded to include every speaker name found in the corpus.
<a name="multiple-queries"></a>
#### Multiple queries
You can also run a number of queries over the same corpus in parallel. There are two ways to do this.
```python
### method one
>>> query = {'Noun phrases': r'NP', 'Verb phrases': r'VP'}
>>> phrases = corpus.interrogate(T, query, show=C)
### method two
>>> query = {'-ing words': {W: r'ing$'}, '-ed verbs': {P: r'^V', W: r'ed$'}}
>>> patterns = corpus.interrogate(query, show=L)
```
Let's try multiprocessing with multiple queries, showing count (i.e. returning a single results DataFrame). We can look at different risk processes (e.g. *risk*, *take risk*, *run risk*, *pose risk*, *put at risk*) using constituency parses:
```python
>>> q = {'risk': r'VP <<# (/VB.?/ < /(?i).?\brisk.?\b/)',
... 'take risk': r'VP <<# (/VB.?/ < /(?i)\b(take|takes|taking|took|taken)+\b/) < (NP <<# /(?i).?\brisk.?\b/)',
... 'run risk': r'VP <<# (/VB.?/ < /(?i)\b(run|runs|running|ran)+\b/) < (NP <<# /(?i).?\brisk.?\b/)',
... 'put at risk': r'VP <<# /(?i)(put|puts|putting)\b/ << (PP <<# /(?i)at/ < (NP <<# /(?i).?\brisk.?/))',
... 'pose risk': r'VP <<# (/VB.?/ < /(?i)\b(pose|poses|posed|posing)+\b/) < (NP <<# /(?i).?\brisk.?\b/)'}
# show=C will collapse results from each search into single dataframe
>>> processes = corpus.interrogate(T, q, show=C)
>>> proc_rel = processes.edit('%', processes.totals)
>>> proc_rel.visualise('Risk processes')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/risk_processes-2.png" />
<br>
<a name="more-complex-queries-and-plots"></a>
## More complex queries and plots
Next, let's find out what kinds of noun lemmas are subjects of any of these risk processes:
```python
### a query to find heads of nps that are subjects of risk processes
>>> query = r'/^NN(S|)$/ !< /(?i).?\brisk.?/ >># (@NP $ (VP <+(VP) (VP ( <<# (/VB.?/ < /(?i).?\brisk.?/) ' \
... r'| <<# (/VB.?/ < /(?i)\b(take|taking|takes|taken|took|run|running|runs|ran|put|putting|puts)/) < ' \
... r'(NP <<# (/NN.?/ < /(?i).?\brisk.?/))))))'
>>> noun_riskers = corpus.interrogate(T, query, show=L)
>>> noun_riskers.quickview(10)
```
Output:
```
0: person (n=195)
1: company (n=139)
2: bank (n=80)
3: investor (n=66)
4: government (n=63)
5: man (n=51)
6: leader (n=48)
7: woman (n=43)
8: official (n=40)
9: player (n=39)
```
We can use `edit()` to make some thematic categories:
```python
### get everyday people
>>> p = ['person', 'man', 'woman', 'child', 'consumer', 'baby', 'student', 'patient']
### get business, gov, institutions
>>> i = ['company', 'bank', 'investor', 'government', 'leader', 'president', 'officer',
... 'politician', 'institution', 'agency', 'candidate', 'firm']
>>> merges = {'Everyday people': p, 'Institutions': i}
>>> them_cat = noun_riskers.edit('%', noun_riskers.totals,
... merge_entries=merges,
... sort_by='total',
... skip_subcorpora=1963,
... just_entries=merges.keys())
### plot result
>>> them_cat.visualise('Types of riskers', y_label='Percentage of all riskers')
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/types-of-riskers.png" />
<br>
Let's also find out what percentage of the time some nouns appear as riskers:
```python
### find any head of an np not containing risk
>>> query = r'/NN.?/ >># NP !< /(?i).?\brisk.?/'
>>> noun_lemmata = corpus.interrogate(T, query, show=L)
### get some key terms
>>> people = ['man', 'woman', 'child', 'baby', 'politician',
... 'senator', 'obama', 'clinton', 'bush']
>>> selected = noun_riskers.edit('%', noun_lemmata.results,
... just_entries=people, just_totals=True, threshold=0, sort_by='total')
### make a bar chart:
>>> selected.visualise('Risk and power', num_to_plot='all', kind='bar',
... x_label='Word', y_label='Risker percentage', fontsize=15)
```
Output:
<img style="float:left" src="https://raw.githubusercontent.com/interrogator/risk/master/images/risk-and-power-2.png" />
<br>
<a name="visualisation-options"></a>
### Visualisation options
With a bit of creativity, you can do some pretty awesome data-viz, thanks to *Pandas* and *Matplotlib*. The following plots require only one interrogation:
```python
>>> modals = corpus.interrogate(T, 'MD < __', show=L)
### simple stuff: make relative frequencies for individual or total results
>>> rel_modals = modals.edit('%', modals.totals)
### trickier: make an 'others' result from low-total entries
>>> low_indices = range(7, modals.results.shape[1])
>>> each_md = modals.edit('%', modals.totals, merge_entries={'other': low_indices},
... sort_by='total', just_totals=True, keep_top=7)
### complex stuff: merge results
>>> entries_to_merge = [r'(^w|\'ll|\'d)', r'^c', r'^m', r'^sh']
>>> modals = modals.edit(merge_entries=entries_to_merge)
### complex stuff: merge subcorpora
>>> merges = {'1960s': r'^196',
... '1980s': r'^198',
... '1990s': r'^199',
... '2000s': r'^200',
... '2010s': r'^201'}
>>> modals = modals.edit(merge_subcorpora=merges)
### make relative, sort, remove what we don't want
>>> modals = modals.edit('%', modals.totals, keep_stats=False,
... just_subcorpora=merges.keys(), sort_by='total', keep_top=4)
### show results
>>> print rel_modals.results, each_md.results, modals.results
```
Output:
```
would will can could ... need shall dare shalt
1963 22.326833 23.537323 17.955615 6.590451 ... 0.000000 0.537996 0.000000 0
1987 24.750614 18.505132 15.512505 11.117537 ... 0.072286 0.260228 0.014457 0
1988 23.138986 19.257117 16.182067 11.219364 ... 0.091338 0.060892 0.000000 0
... ... ... ... ... ... ... ... ... ...
2012 23.097345 16.283186 15.132743 15.353982 ... 0.029499 0.029499 0.000000 0
2013 22.136269 17.286522 16.349301 15.620351 ... 0.029753 0.029753 0.000000 0
2014 21.618357 17.101449 16.908213 14.347826 ... 0.024155 0.000000 0.000000 0
[29 rows x 17 columns]
would 23.235853
will 17.484034
can 15.844070
could 13.243449
may 9.581255
should 7.292294
other 7.290155
Name: Combined total, dtype: float64
would/will/'ll... can/could/ca may/might/must should/shall/shalt
1960s 47.276395 25.016812 19.569603 7.800941
1980s 44.756285 28.050776 19.224476 7.566817
1990s 44.481957 29.142571 19.140310 6.892708
2000s 42.386571 30.710739 19.182867 7.485681
2010s 42.581666 32.045745 17.777845 7.397044
```
Now, some intense plotting:
```python
### exploded pie chart
>>> each_md.visualise('Pie chart of common modals in the NYT', explode=['other'],
... num_to_plot='all', kind='pie', colours='Accent', figsize=(11,11))
### bar chart, transposing and reversing the data
>>> modals.results.iloc[::-1].T.iloc[::-1].visualise('Modals use by decade', kind='barh',
... x_label='Percentage of all modals', y_label='Modal group')
### stacked area chart
>>> rel_modals.results.drop('1963').visualise('An ocean of modals', kind='area',
... stacked=True, colours='summer', figsize =(8,10), num_to_plot='all',
... legend_pos='lower right', y_label='Percentage of all modals')
```
Output:
<p align="center">
<img src="https://raw.githubusercontent.com/interrogator/risk/master/images/pie-chart-of-common-modals-in-the-nyt2.png" height="400" width="400"/>
<img src="https://raw.githubusercontent.com/interrogator/risk/master/images/modals-use-by-decade.png" height="230" width="500"/>
<img src="https://raw.githubusercontent.com/interrogator/risk/master/images/an-ocean-of-modals2.png" height="600" width="500"/>
</p>
<a name="contact"></a>
## Contact
Twitter: [@interro_gator](https://twitter.com/interro_gator)
<a name="cite"></a>
## Cite
> `McDonald, D. (2015). corpkit: a toolkit for corpus linguistics. Retrieved from https://www.github.com/interrogator/corpkit. DOI: http://doi.org/10.5281/zenodo.28361`
================================================
FILE: Dockerfile
================================================
FROM alpine:latest
MAINTAINER interro_gator
# set up a workspace so we can cache python stuff
RUN rm -rf /.src && mkdir /.src
COPY requirements.txt /.src/requirements.txt
# add corenlp
# COPY ~/corenlp /.src
# use the workspace for everything
WORKDIR /.src
# install the basics
RUN apk add --update \
python3 \
python-dev \
py-pip \
build-base \
git \
libpng \
freetype \
pkgconf \
libxft-dev \
libxml2-dev \
readline
# install java for parsing
RUN apk --update add openjdk8-jre-base
# needed for numpy
RUN ln -s /usr/include/locale.h /usr/include/xlocale.h
RUN ln -s /usr/include/libxml2/libxml/xmlversion.h /usr/include/xmlversion.h
RUN mkdir /usr/include/libxml
RUN ln -s /usr/include/libxml2/libxml/xmlversion.h /usr/include/libxml/xmlversion.h
RUN ln -s /usr/include/libxml2/libxml/xmlexports.h /usr/include/xmlexports.h
RUN ln -s /usr/include/libxml2/libxml/xmlexports.h /usr/include/libxml/xmlexports.h
# stop pip from complaining
RUN pip install --upgrade pip
# python heavyweight stuff
RUN pip install cython
RUN pip install numpy
RUN pip install colorama
# remove old stuff --- not sure it does much
RUN rm -rf /var/cache/apk/*
# get matplotlib github version
RUN git clone git://github.com/matplotlib/matplotlib.git
RUN cd matplotlib && python setup.py install && cd ..
# install corpkit requirements
RUN pip install -r requirements.txt
RUN pip install docker-py
# add everything from corpkit to working dir
COPY . /.src
# install corpkit itself
RUN python /.src/setup.py install
# download might be needed for licence issues
#RUN python -m corpkit.download.corenlp /
CMD python -m corpkit.env docker=corpkit
WORKDIR /projects
================================================
FILE: LICENSE
================================================
The MIT License (MIT)
Copyright (c) 2015 Daniel McDonald
mcdonaldd, at, unimelb.edu
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: Makefile
================================================
# Makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = _build
# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest coverage gettext
help:
@echo "Please use \`make <target>' where <target> is one of"
@echo " html to make standalone HTML files"
@echo " dirhtml to make HTML files named index.html in directories"
@echo " singlehtml to make a single large HTML file"
@echo " pickle to make pickle files"
@echo " json to make JSON files"
@echo " htmlhelp to make HTML files and a HTML help project"
@echo " qthelp to make HTML files and a qthelp project"
@echo " applehelp to make an Apple Help Book"
@echo " devhelp to make HTML files and a Devhelp project"
@echo " epub to make an epub"
@echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
@echo " latexpdf to make LaTeX files and run them through pdflatex"
@echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
@echo " text to make text files"
@echo " man to make manual pages"
@echo " texinfo to make Texinfo files"
@echo " info to make Texinfo files and run them through makeinfo"
@echo " gettext to make PO message catalogs"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " xml to make Docutils-native XML files"
@echo " pseudoxml to make pseudoxml-XML files for display purposes"
@echo " linkcheck to check all external links for integrity"
@echo " doctest to run all doctests embedded in the documentation (if enabled)"
@echo " coverage to run coverage check of the documentation (if enabled)"
clean:
rm -rf $(BUILDDIR)/*
html:
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
dirhtml:
$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
singlehtml:
$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
@echo
@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
pickle:
$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
@echo
@echo "Build finished; now you can process the pickle files."
json:
$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
@echo
@echo "Build finished; now you can process the JSON files."
htmlhelp:
$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
@echo
@echo "Build finished; now you can run HTML Help Workshop with the" \
".hhp project file in $(BUILDDIR)/htmlhelp."
qthelp:
$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
@echo
@echo "Build finished; now you can run "qcollectiongenerator" with the" \
".qhcp project file in $(BUILDDIR)/qthelp, like this:"
@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/corpkit.qhcp"
@echo "To view the help file:"
@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/corpkit.qhc"
applehelp:
$(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp
@echo
@echo "Build finished. The help book is in $(BUILDDIR)/applehelp."
@echo "N.B. You won't be able to view it unless you put it in" \
"~/Library/Documentation/Help or install it in your application" \
"bundle."
devhelp:
$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
@echo
@echo "Build finished."
@echo "To view the help file:"
@echo "# mkdir -p $$HOME/.local/share/devhelp/corpkit"
@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/corpkit"
@echo "# devhelp"
epub:
$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
@echo
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
latex:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."
latexpdf:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through pdflatex..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
latexpdfja:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through platex and dvipdfmx..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
text:
$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
@echo
@echo "Build finished. The text files are in $(BUILDDIR)/text."
man:
$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
@echo
@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
texinfo:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo
@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
@echo "Run \`make' in that directory to run these through makeinfo" \
"(use \`make info' here to do that automatically)."
info:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo "Running Texinfo files through makeinfo..."
make -C $(BUILDDIR)/texinfo info
@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
gettext:
$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
@echo
@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
changes:
$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
@echo
@echo "The overview file is in $(BUILDDIR)/changes."
linkcheck:
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo
@echo "Link check complete; look for any errors in the above output " \
"or in $(BUILDDIR)/linkcheck/output.txt."
doctest:
$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
@echo "Testing of doctests in the sources finished, look at the " \
"results in $(BUILDDIR)/doctest/output.txt."
coverage:
$(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage
@echo "Testing of coverage in the sources finished, look at the " \
"results in $(BUILDDIR)/coverage/python.txt."
xml:
$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
@echo
@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
pseudoxml:
$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
@echo
@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
================================================
FILE: README.md
================================================
# corpkit: sophisticated corpus linguistics
[Gitter chat](https://gitter.im/interrogator/corpkit?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) · [DOI](https://zenodo.org/badge/latestdoi/14568/interrogator/corpkit) · [Travis CI](https://travis-ci.org/interrogator/corpkit) · [PyPI](https://pypi.python.org/pypi/corpkit) · [Documentation](http://corpkit.readthedocs.org/en/latest/) · [Docker Hub](https://hub.docker.com/r/interrogator/corpkit/) · [Anaconda](https://anaconda.org/interro_gator/corpkit)
## **NOTICE: corpkit is now deprecated and unmaintained. It is superseded by [`buzz`](https://github.com/interrogator/buzz), which is better in every way.**
> **corpkit** is a module for doing more sophisticated corpus linguistics. It links state-of-the-art natural language processing technologies to functional linguistic research aims, allowing you to easily build, search and visualise grammatically annotated corpora in novel ways.
The basic workflow involves making corpora, parsing them, and searching them. The results of searches are [CONLL-U formatted](http://universaldependencies.org/format.html) files, represented as [pandas](http://pandas.pydata.org/) objects, which can be edited, visualised or exported in a lot of ways. The tool has three interfaces, each with its own documentation:
1. [A Python API](http://corpkit.readthedocs.io)
2. [A natural language interpreter](http://corpkit.readthedocs.io/en/latest/rst_docs/interpreter/corpkit.interpreter.overview.html)
3. [A graphical interface](http://interrogator.github.io/corpkit/)
A quick demo for each interface is provided in this document.
## Feature summary
From all three interfaces, you can do a lot of neat things. In general:
### Parsing
> Corpora are stored as `Corpus` objects, with methods for viewing, parsing, interrogating and concordancing.
* A very simple wrapper around the full Stanford CoreNLP pipeline
* Automatically add annotations, speaker names and metadata to parser output
* Detect speaker names and make these into metadata features
* Multiprocessing
* Store dependency parsed texts, parse trees and metadata in CONLL-U format
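For instance, a minimal API sketch of this parsing step might look like the following. The `parse` method and its `speaker_segmentation`, `metadata` and `multiprocess` keyword arguments are assumed here, mirroring the interpreter command shown later in this README:
```python
>>> from corpkit import Corpus
### an unparsed corpus: a folder of (subfolders of) text files
>>> unparsed = Corpus('data/chapters')
### wrap CoreNLP, keeping speaker names and metadata, with two processes
>>> parsed = unparsed.parse(speaker_segmentation=True, metadata=True, multiprocess=2)
```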
### Interrogating corpora
> Interrogating a corpus produces an `Interrogation` object, with results as Pandas DataFrame attributes.
* Search corpora using regular expressions, wordlists, CQL, Tregex, or a rich, purpose built dependency searching syntax
* Interrogate any dataset in CONLL-U format (e.g. [the Universal Dependencies Treebanks](https://github.com/UniversalDependencies))
* Collocation, n-gramming
* Restrict searches by metadata feature
* Use metadata as symbolic subcorpora
* Choose what search results return: show any combination of words, lemmata, POS, indices, distance from root node, syntax tree, etc.
* Generate concordances alongside interrogations
* Work with coreference annotation
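As a quick sketch (a fuller API demo appears later in this README), matching every noun and showing lemma forms might look like the example below. `L` (lemma) is used in that later demo; `P` is assumed here to be the part-of-speech column name:
```python
>>> from corpkit import *
>>> corp = Corpus('chapters-parsed')
### match tokens whose POS begins with N, showing lemma forms
>>> nouns = corp.interrogate(search={P: r'^N'}, show=[L])
>>> nouns.results
```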
### Editing results
> `Interrogation` objects have `edit`, `visualise` and `save` methods, to name just a few. Editing creates a new `Interrogation` object.
* Quickly delete, sort, merge entries and subcorpora
* Make relative frequencies (e.g. calculate results as percentage of all words/clauses/nouns ...)
* Use linear regression sorting to find increasing, decreasing, turbulent or static trajectories
* Calculate p values, etc.
* Keywording
* Simple multiprocessing available for parsing and interrogating
* Results are Pandas objects, so you can do fast, good statistical work on them
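Continuing the noun search sketched above, a typical edit might look like this. The `'%'` operation and `SELF` appear in the API demo later in this README; `skip_subcorpora` and `sort_by='increase'` are assumptions, mirroring the interpreter's skip and sort commands:
```python
### drop two subcorpora, convert to relative frequencies,
### and sort by increasing trajectory (the sort needs scipy)
>>> edited = nouns.edit('%', SELF,
...                     skip_subcorpora=['chapter1', 'chapter2'],
...                     sort_by='increase')
```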
### Visualising results
> The `visualise` method of `Interrogation` objects uses matplotlib and seaborn if installed to produce high quality figures.
* Many chart types
* Easily customise titles, axis labels, colours, sizes, number of results to show, etc.
* Make subplots
* Save figures in a number of formats
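A hedged sketch of plotting the edited result from above, with the `kind`, `x_label` and `y_label` keyword arguments assumed to mirror the interpreter's plot command:
```python
>>> plt = edited.visualise('Common nouns', kind='line',
...                        x_label='Chapter', y_label='Frequency')
>>> plt.show()
```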
### Concordancing
> When interrogating a corpus, concordances are also produced, allowing you to check that your query matched what you intended.
* Colour, sort, delete lines using regular expressions
* Recalculate results from edited concordance lines (great for removing false positives)
* Format lines for publication with TeX
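A rough sketch of the recalculation workflow, where the `concordance` attribute name is an assumption, and `format` and `calculate` are the helpers patched onto pandas objects in `corpkit/__init__.py`:
```python
### concordance lines are generated alongside the interrogation
>>> lines = nouns.concordance
### pretty-print the lines
>>> lines.format()
### after deleting false positives, rebuild the result from what is left
>>> fixed = lines.calculate()
```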
### Other stuff
* Language modelling
* Save and load results, images, concordances
* Export data to other tools
* Switch between API, GUI and interpreter whenever you like
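For example, saving and loading a result (the `save` method and `load` function are exported by `corpkit`; the name `'nouns'` is just an example):
```python
>>> from corpkit import load
### store the interrogation inside the project
>>> nouns.save('nouns')
### get it back in a later session
>>> nouns = load('nouns')
```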
## Installation
Via pip:
```shell
pip install corpkit
```
Via Git:
```shell
git clone https://github.com/interrogator/corpkit
cd corpkit
python setup.py install
```
Via Anaconda:
```shell
conda install -c interro_gator corpkit
```
## Creating a project
Once you've got everything installed, you'll want to create a project---this is just a folder hierarchy that stores your corpora, saved results, figures and so on. You can do this in a number of ways:
### Shell
```shell
new_project junglebook
cp -R chapters junglebook/data
```
### Interpreter
```shell
> new project named junglebook
> add ../chapters
```
### Python
```python
>>> import shutil
>>> from corpkit import new_project
>>> new_project('junglebook')
>>> shutil.copytree('../chapters', 'junglebook/data')
```
You can create projects and add data via the file menu of the graphical interface as well.
## Ways to use *corpkit*
As explained earlier, there are three ways to use the tool. Each has unique strengths and weaknesses. To summarise them, the Python API is the most powerful, but has the steepest learning curve. The GUI is the least powerful, but easy to learn (though it is still arguably the most powerful linguistics GUI available). The interpreter strikes a happy middle ground, especially for those who are not familiar with Python.
## Interpreter
The first way to use *corpkit* is by entering its natural language interpreter. To activate it, use the `corpkit` command:
```shell
$ cd junglebook
$ corpkit
```
You'll get a lovely new prompt into which you can type commands:
```none
corpkit@junglebook:no-corpus>
```
Generally speaking, it has the comforts of home, such as history, search, backslash line breaking, variable creation and `ls` and `cd` commands. As in `IPython`, any command beginning with an exclamation mark will be executed by the shell. You can also write scripts and execute them with `corpkit script.ck`, or `./script.ck` if you have a shebang.
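A small script might look like the example below (saved here as `findings.ck`; the shebang line assumes `corpkit` is on your `PATH`). Every command in it appears elsewhere in this README:
```shell
#!/usr/bin/env corpkit
# findings.ck: run a search non-interactively
set chapters-parsed as corpus
search corpus for pos matching '^N' showing lemma
call result nouns
```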
### Making projects and parsing corpora
```shell
# make new project
> new project named junglebook
# add folder of (subfolders of) text files
> add '../chapters'
# specify corpus to work on
> set chapters as corpus
# parse the corpus
> parse corpus with speaker_segmentation and metadata and multiprocess as 2
```
### Searching and concordancing
```shell
# search and exclude
> search corpus for governor-function matching 'root' \
... excluding governor-lemma matching 'be'
# show pos, lemma, index, (e.g. 'NNS/thing/3')
> search corpus for pos matching '^N' showing pos and lemma and index
# further arguments and dynamic structuring
> search corpus for word matching any \
... with subcorpora as pagenum and preserve_case
# show concordance lines
> show concordance with window as 50 and columns as LMR
# colouring concordances
> mark m matching 'have' blue
# recalculate results
> calculate result from concordance
```
### Variables, editing results
```shell
# variable naming
> call result root_deps
# skip some numerical subcorpora
> edit root_deps by skipping subcorpora matching [1,2,3,4,5]
# make relative frequencies
> calculate edited as percentage of self
# use scipy to calculate trends and sort by them
> sort edited by decrease
```
### Visualise edited results
```shell
> plot edited as line chart \
... with x_label as 'Subcorpus' and \
... y_label as 'Frequency' and \
... colours as 'summer'
```
### Switching interfaces
```shell
# open graphical interface
> gui
# enter ipython with current namespace
> ipython
# use a new/existing jupyter notebook
> jupyter notebook findings.ipynb
```
## API
Straight Python is the most powerful way to use *corpkit*, because you can manipulate results with Pandas syntax, construct loops, make recursive queries, and so on. Here are some simple examples of the API syntax:
### Instantiate and search a parsed corpus
```python
### import everything
>>> from corpkit import *
>>> from corpkit.dictionaries import *
### instantiate corpus
>>> corp = Corpus('chapters-parsed')
### search for any participant whose governor
### is a process, excluding closed class words, and
### showing lemma forms. also, generate a concordance.
>>> sch = {GF: roles.process, F: roles.actor}
>>> part = corp.interrogate(search=sch,
... exclude={W: wordlists.closedclass},
... show=[L],
... conc=True)
```
You get an `Interrogation` object back, with a `results` attribute that is a Pandas DataFrame:
```
daisy gatsby tom wilson eye man jordan voice michaelis \
chapter1 13 2 6 0 3 3 0 2 0
chapter2 1 0 12 10 1 1 0 0 0
chapter3 0 3 0 0 3 8 6 1 0
chapter4 6 9 2 0 1 3 1 1 0
chapter5 8 14 0 0 3 3 0 2 0
chapter6 7 14 9 0 1 2 0 3 0
chapter7 26 20 35 10 12 3 16 9 5
chapter8 5 4 1 10 2 2 0 1 10
chapter9 1 1 1 0 3 3 1 1 0
```
### Edit and visualise the result
Below, we make normalised frequencies and plot:
```python
### make relative frequencies and sort (the sort requires scipy)
>>> part = part.edit('%', SELF, sort_by='increase')
### make line subplots for the first nine results
>>> plt = part.visualise('Processes, increasing', subplots=True, layout=(3,3))
>>> plt.show()
```
<img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/actors.png" width="450">
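Because `results` is an ordinary DataFrame, plain pandas operations also work on it, which is what makes loops and further statistics straightforward. A small illustration, assuming nothing corpkit-specific beyond the `results` attribute shown above:
```python
### total mentions of each participant across all chapters
>>> part.results.sum().sort_values(ascending=False).head()
### loop over subcorpora, reporting the most frequent participant in each
>>> for chapter, row in part.results.iterrows():
...     print(chapter, row.idxmax())
```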
There are also some [more detailed API examples over here](https://github.com/interrogator/corpkit/blob/master/API-README.md). This document is fairly thorough, but now deprecated, because the official docs are now over at [ReadTheDocs](http://corpkit.readthedocs.io/en/latest/).
## Example figures
<p align="center"> <i>
<img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/inc-proc.png" width="350"> <img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/best-derived.png" width="350">
<br>Shifting register of scientific English<br>
<br><br>
<img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/symlog-part2.png" width="310"> <img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/process-types-for-part-types.png" width="390">
<br>Participants and processes in online forum talk<br>
<br><br>
<img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/risk-and-power-2.png" width="370"> <img src="https://raw.githubusercontent.com/interrogator/corpkit/master/images/mood-role-risk.png" width="330">
<br>Riskers and mood role of risk words in print news journalism<br>
</i></p>
## Graphical interface
Screenshots coming soon! For now, just head [here](http://interrogator.github.io/corpkit/).
## Contact
Twitter: [@interro_gator](https://twitter.com/interro_gator)
## Cite
> `McDonald, D. (2015). corpkit: a toolkit for corpus linguistics. Retrieved from https://www.github.com/interrogator/corpkit. DOI: http://doi.org/10.5281/zenodo.28361`
================================================
FILE: bld.bat
================================================
"%PYTHON%" setup.py install
if errorlevel 1 exit 1
:: Add more build steps here, if they are necessary.
:: See
:: http://docs.continuum.io/conda/build.html
:: for a list of environment variables that are set during the build process.
================================================
FILE: build.sh
================================================
#!/bin/bash
$PYTHON setup.py install
# Add more build steps here, if they are necessary.
# See
# http://docs.continuum.io/conda/build.html
# for a list of environment variables that are set during the build process.
================================================
FILE: conf.py
================================================
# -*- coding: utf-8 -*-
#
# corpkit documentation build configuration file, created by
# sphinx-quickstart on Thu Nov 5.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
from sphinx.highlighting import PygmentsBridge
from pygments.formatters.latex import LatexFormatter
class CustomLatexFormatter(LatexFormatter):
def __init__(self, **options):
super(CustomLatexFormatter, self).__init__(**options)
self.verboptions = r"formatcom=\footnotesize"
PygmentsBridge.latex_formatter = CustomLatexFormatter
import sys
import os
import shlex
from recommonmark.parser import CommonMarkParser
source_parsers = {
'.md': CommonMarkParser,
}
source_suffix = ['.rst', '.md']
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#sys.path.insert(0, os.path.abspath('.'))
sys.path.insert(0,"/Users/daniel/work/corpkit/corpkit")
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.viewcode',
'alabaster'
]
# Napoleon settings (all default)
#napoleon_google_docstring = True
#napoleon_numpy_docstring = True
#napoleon_include_init_with_doc = False
#napoleon_include_private_with_doc = False
#napoleon_include_special_with_doc = False
#napoleon_use_admonition_for_examples = False
#napoleon_use_admonition_for_notes = False
#napoleon_use_admonition_for_references = False
#napoleon_use_ivar = False
#napoleon_use_param = True
#napoleon_use_rtype = True
#napoleon_use_keyword = True
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
# source_suffix = ['.rst', '.md']
# The encoding of source files.
#source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = u'corpkit'
copyright = u'2016, Daniel McDonald'
author = u'Daniel McDonald'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '2.3.8'
# The full version, including alpha/beta/rc tags.
release = '2.3.8'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['_build', '*/build.py']
# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []
# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
# import alabaster
#
# html_theme_path = [alabaster.get_path()]
# html_theme = 'alabaster'
# html_sidebars = {
# '**': [
# 'about.html',
# 'navigation.html',
# 'relations.html',
# 'searchbox.html',
# 'donate.html',
# ]
# }
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}
# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
html_logo = 'images/alpha_gator_small.png'
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
html_last_updated_fmt = '%b %d, %Y'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}
# If false, no module index is generated.
#html_domain_indices = True
# If false, no index is generated.
#html_use_index = True
# If true, the index is split into individual pages for each letter.
#html_split_index = False
# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
html_show_sphinx = False
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None
# Language to be used for generating the HTML full-text search index.
# Sphinx supports the following languages:
# 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja'
# 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr'
#html_search_language = 'en'
# A dictionary with options for the search language support, empty by default.
# Now only 'ja' uses this config value
#html_search_options = {'type': 'default'}
# The name of a javascript file (relative to the configuration directory) that
# implements a search results scorer. If empty, the default will be used.
#html_search_scorer = 'scorer.js'
# Output file base name for HTML help builder.
htmlhelp_basename = 'corpkitdoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
'papersize': 'a4paper',
# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
# This should help with line breaks in code cells
'preamble': '\\setcounter{tocdepth}{3} \\usepackage{pmboxdraw}',
# Latex figure (float) alignment
'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'corpkit.tex', u'corpkit documentation',
u'Daniel McDonald', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
latex_logo = 'images/alpha_gator_small.png'
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False
# If true, show page references after internal links.
#latex_show_pagerefs = False
# If true, show URL addresses after external links.
#latex_show_urls = False
# Documents to append as an appendix to all manuals.
#latex_appendices = []
# If false, no module index is generated.
#latex_domain_indices = True
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'corpkit', u'corpkit documentation',
[author], 1)
]
# If true, show URL addresses after external links.
#man_show_urls = False
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'corpkit', u'corpkit documentation',
author, 'corpkit', 'Corpus linguistic tools.',
'Linguistics'),
]
# Documents to append as an appendix to all manuals.
#texinfo_appendices = []
# If false, no module index is generated.
#texinfo_domain_indices = True
# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'
# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False
autodoc_member_order = 'bysource'
================================================
FILE: corpkit/__init__.py
================================================
"""
A toolkit for corpus linguistics
"""
from __future__ import print_function
#metadata
__version__ = "2.3.8"
__author__ = "Daniel McDonald"
__license__ = "MIT"
# probably not needed anymore, but adds corpkit to path for tregex.sh
import sys
import os
import inspect
from corpkit.constants import LETTERS
# asterisk import
__all__ = [
"load",
"loader",
"load_all_results",
"as_regex",
"new_project",
"Corpus",
"File",
"Corpora",
"gui"] + LETTERS
corpath = inspect.getfile(inspect.currentframe())
baspat = os.path.dirname(corpath)
#dicpath = os.path.join(baspat, 'dictionaries')
for p in [corpath, baspat]:
if p not in sys.path:
sys.path.append(p)
if p not in os.environ["PATH"].split(':'):
os.environ["PATH"] += os.pathsep + p
# import classes
from corpkit.corpus import Corpus, File, Corpora
#from corpkit.model import MultiModel
from corpkit.other import (load, loader, load_all_results,
quickview, as_regex, new_project)
from corpkit.lazyprop import lazyprop
#from corpkit.dictionaries.process_types import Wordlist
from corpkit.process import gui
# monkeypatch editing and plotting to pandas objects
from pandas import DataFrame, Series
# monkey patch functions
def _plot(self, *args, **kwargs):
from corpkit.plotter import plotter
return plotter(self, *args, **kwargs)
def _edit(self, *args, **kwargs):
from corpkit.editor import editor
return editor(self, *args, **kwargs)
def _save(self, savename, **kwargs):
from corpkit.other import save
save(self, savename, **kwargs)
def _quickview(self, n=25):
from corpkit.other import quickview
quickview(self, n=n)
def _format(self, *args, **kwargs):
from corpkit.other import concprinter
concprinter(self, *args, **kwargs)
def _texify(self, *args, **kwargs):
from corpkit.other import texify
texify(self, *args, **kwargs)
def _calculate(self, *args, **kwargs):
from corpkit.process import interrogation_from_conclines
return interrogation_from_conclines(self)
def _multiplot(self, leftdict={}, rightdict={}, **kwargs):
from corpkit.plotter import multiplotter
return multiplotter(self, leftdict=leftdict, rightdict=rightdict, **kwargs)
def _perplexity(self):
"""
Pythonification of the formal definition of perplexity.
input: a sequence of chances (any iterable will do)
output: perplexity value.
from https://github.com/zeffii/NLP_class_notes
"""
def _perplex(chances):
import math
chances = [i for i in chances if i]
N = len(chances)
product = 1
for chance in chances:
product *= chance
        # perplexity = (p1 * p2 * ... * pN) ** (-1/N); float division keeps Python 2 correct
        return math.pow(product, -1.0 / N)
return self.apply(_perplex, axis=1)
def _entropy(self):
"""
    Entropy for each subcorpus, e.g.
    entropy(pos.edit(merge_entries=mergetags, sort_by='total').results.T)
"""
from scipy.stats import entropy
import pandas as pd
escores = entropy(self.edit('/', SELF).results.T)
ser = pd.Series(escores, index=self.index)
ser.name = 'Entropy'
return ser
def _shannon(self):
from corpkit.stats import shannon
return shannon(self)
def _shuffle(self, inplace=False):
import random
index = list(self.index)
random.shuffle(index)
shuffled = self.ix[index]
shuffled.reset_index()
if inplace:
self = shuffled
else:
return shuffled
def _top(self):
"""Show as many rows and cols as possible without truncation"""
import pandas as pd
max_row = pd.options.display.max_rows
max_col = pd.options.display.max_columns
return self.iloc[:max_row, :max_col]
def _tabview(self, **kwargs):
import pandas as pd
import tabview
tabview.view(self, **kwargs)
def _rel(self, denominator='self', **kwargs):
from corpkit.editor import editor
return editor(self, '%', denominator, **kwargs)
def _keyness(self, measure='ll', denominator='self', **kwargs):
from corpkit.editor import editor
return editor(self, 'k', denominator, **kwargs)
def _plain(df):
return ' '.join(df['w'])
# monkey patching things
DataFrame.entropy = _entropy
DataFrame.perplexity = _perplexity
DataFrame.shannon = _shannon
DataFrame.edit = _edit
Series.edit = _edit
DataFrame.rel = _rel
Series.rel = _rel
DataFrame.keyness = _keyness
Series.keyness = _keyness
DataFrame.visualise = _plot
Series.visualise = _plot
DataFrame.tabview = _tabview
DataFrame.multiplot = _multiplot
Series.multiplot = _multiplot
DataFrame.save = _save
Series.save = _save
DataFrame.quickview = _quickview
Series.quickview = _quickview
DataFrame.format = _format
Series.format = _format
Series.texify = _texify
DataFrame.calculate = _calculate
Series.calculate = _calculate
DataFrame.shuffle = _shuffle
DataFrame.top = _top
DataFrame.plain = _plain
# Defining letters
module = sys.modules[__name__]
for letter in LETTERS:
if not letter.isalpha():
trans = letter.replace('A', '-', 1).replace('Z', '+', 1).lower()
else:
trans = letter.lower()
setattr(module, letter, trans)
# other methods:
# globals()[letter] = letter.lower()
# exec('%s = "%s"' % (letter, letter.lower()))
ANYWORD = r'[A-Za-z0-9:_]'
================================================
FILE: corpkit/annotate.py
================================================
"""
corpkit: add annotations to conll-u via concordancing
"""
def process_special_annotation(v, lin):
"""
If the user wants a fancy annotation, like 'add middle column',
this gets processed here. it's potentially the place where the
user could add entropy score, or something like that.
"""
if v.lower() not in ['i', 'index', 'm', 'scheme', 't', 'q']:
return v
if v == 'index':
return lin.name
elif v in ['m', 't']:
return str(lin[v])
else:
return v
def make_string_to_add(annotation, lin, replace=False):
"""
Make a string representing metadata to add
"""
from corpkit.constants import STRINGTYPE
if isinstance(annotation, STRINGTYPE):
if replace:
return annotation + '\n'
else:
return '# tags=' + annotation + '\n'
start = str()
for k, v in annotation.items():
# these are special names---add more?
v = process_special_annotation(v, lin)
if replace:
start = '%s\n' % v
else:
start += '# %s=%s\n' % (k, v)
return start
def get_line_number_for_entry(data, si, ti, annotation):
"""
Find the place in filename at which to add the string
"""
partstart = '# sent_id %d' % si
partend = '# sent_id %d' % (si + 1)
# this way iterates over the lines
# it could also just find the
lnum = data.split(partstart)[0].count('\n') + 2
sent = data.split(partstart)[1].split(partend)[0]
field = 'tags' if isinstance(annotation, str) else list(annotation.keys())[0]
ixx = next((i for i, l in enumerate(sent.splitlines()) \
if l.startswith('# %s=' % field)), False)
if ixx is False:
return lnum, False
else:
return lnum + ixx - 2, True
def update_contents(contents, place, text, do_replace=False):
"""
Open file, read lines, add or replace the line with the good one
"""
if do_replace:
contents[place] = contents[place].rstrip('\n').replace(text + ';', '') + ';' + text
else:
contents.insert(place, text)
return contents
def dry_run_text(filepath, contents, place, colours):
"""
Show a dry run of what the annotations would be
"""
import os
contents[place] = contents[place].rstrip('\n') + ' <==========\n'
try:
contents[place] = colours['green'] + contents[place] + colours['reset']
except:
pass
max_lines = next((i for i, l in enumerate(contents[place:]) if l == '\n'), 10)
max_lines = 30 if max_lines > 30 else max_lines
formline = ' Add metadata: %s \n' % (os.path.basename(filepath))
bars = '=' * len(formline)
print(bars + '\n' + formline + bars)
print(''.join(contents[place-3:max_lines+place]))
def annotate(open_file, contents):
"""
Add annotation to a single file
"""
from corpkit.constants import PYTHON_VERSION
contents = ''.join(contents)
if PYTHON_VERSION == 2:
contents = contents.encode('utf-8', errors='ignore')
open_file.seek(0)
open_file.write(contents)
open_file.truncate()
def delete_lines(corpus, annotation, dry_run=True, colour={}):
"""
Show or delete the necessary lines
"""
from corpkit.constants import OPENER, PYTHON_VERSION
import re
import os
tagmode = True
no_can_do = ['sent_id', 'parse']
if isinstance(annotation, dict):
tagmode = False
for k, v in annotation.items():
if k in no_can_do:
print("You aren't allowed to delete '%s', sorry." % k)
return
if not v:
v = r'.*?'
regex = re.compile(r'(# %s=%s)\n' % (k, v), re.MULTILINE)
else:
if annotation in no_can_do:
print("You aren't allowed to delete '%s', sorry." % k)
return
regex = re.compile(r'((# tags=.*?)%s;?(.*?))\n' % annotation, re.MULTILINE)
fs = []
for (root, dirs, fls) in os.walk(corpus):
for f in fls:
fs.append(os.path.join(root, f))
for f in fs:
if PYTHON_VERSION == 2:
from corpkit.process import saferead
data = saferead(f)[0]
else:
with open(f, 'rb') as fo:
data = fo.read().decode('utf-8', errors='ignore')
if dry_run:
if tagmode:
repl_str = r'\1 <=======\n%s\2\3 <=======\n' % colour.get('green', '')
else:
repl_str = r'\1 <=======\n'
try:
repl_str = colour['red'] + repl_str + colour['reset']
except:
pass
data, n = re.subn(regex, repl_str, data)
nspl = 100 if tagmode else 50
delim = '<======='
data = re.split(delim, data, maxsplit=nspl)
toshow = delim.join(data[:nspl+1])
toshow = toshow.rsplit('\n\n', 1)[0]
print(toshow)
if n > 50:
n = n - 50
print('\n... and %d more changes ... ' % n)
else:
if tagmode:
repl_str = r'\2\3\n'
else:
repl_str = ''
data = re.sub(regex, repl_str, data)
with OPENER(f, 'w') as fo:
from corpkit.constants import PYTHON_VERSION
if PYTHON_VERSION == 2:
data = data.encode('utf-8', errors='ignore')
fo.write(data)
def annotator(df_or_corpus, annotation, dry_run=True, deletemode=False):
"""
Run the annotator pipeline over multiple files
:param corpus: a Corpus object containing the files
:param annotation: a str or dict containing annotation text
"""
import re
import os
from corpkit.constants import OPENER, STRINGTYPE, PYTHON_VERSION
colour = {}
try:
from colorama import Fore, init, Style
init(autoreset=True)
colour = {'green': Fore.GREEN, 'reset': Style.RESET_ALL, 'red': Fore.RED}
except ImportError:
pass
if deletemode:
delete_lines(df_or_corpus.path, annotation, dry_run=dry_run, colour=colour)
return
file_sent_words = df_or_corpus.reset_index()[['index', 'f', 'i']].values.tolist()
from collections import defaultdict
outt = defaultdict(list)
for index, fn, ix in file_sent_words:
s, i = ix.split(',', 1)
outt[fn].append((int(s), int(i), index))
for i, (fname, entries) in enumerate(sorted(outt.items()), start=1):
with OPENER(fname, 'r+') as fo:
data = fo.read()
contents = [i + '\n' for i in data.split('\n')]
for si, ti, index in list(reversed(sorted(set(entries)))):
line_num, do_replace = get_line_number_for_entry(data, si, ti, annotation)
anno_text = make_string_to_add(annotation, df_or_corpus.ix[index], replace=do_replace)
contents = update_contents(contents, line_num, anno_text, do_replace=do_replace)
if dry_run and i < 50:
dry_run_text(fname,
contents,
line_num,
colours=colour)
if not dry_run:
annotate(fo, contents=contents)
if not dry_run:
print('%d annotations made in %s' % (len(entries), fname))
if dry_run and i > 50:
break
if dry_run:
if len(file_sent_words) > 50:
n = len(file_sent_words) - 50
print('... and %d more changes ... ' % n)
================================================
FILE: corpkit/blanknotebook.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# blanknotebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialisation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, import `corpkit`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import corpkit\n",
"from corpkit import (\n",
" interrogator, plotter, table, quickview, \n",
" tally, surgeon, merger, conc, keywords, \n",
" collocates, multiquery, report_display,\n",
" save_result, load_result\n",
" )\n",
"# show figures in browser\n",
"% matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, set a path to your corpus:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"corpus = 'data/corpus'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a query to match any word:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# any token containing letters or numbers (i.e. no punctuation):\n",
"allwords_query = r'/[A-Za-z0-9]/ !< __' "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Interrogate the corpus with the `allwords_query`, and store the results as `allwords`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"allwords = interrogator(annual_trees, '-C', allwords_query) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that it worked:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"print allwords.query\n",
"print allwords.totals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, plot something:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"plotter('Word count', allwords.total)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, save this result so that you can access it any time:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"save_result(allwords, 'allwords')\n",
"\n",
"# load it again with:\n",
"# allwords = load_result('allwords')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use the space below to interrogate and plot whatever you like!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
================================================
FILE: corpkit/build.py
================================================
from __future__ import print_function
from corpkit.constants import STRINGTYPE, PYTHON_VERSION, INPUTFUNC
"""
This file contains a number of functions used in the corpus building process.
None of them is intended to be called by the user him/herself.
"""
def download_large_file(proj_path, url, actually_download=True, root=False, **kwargs):
"""
Download something to proj_path, unless it's CoreNLP, which goes to ~/corenlp
"""
import os
import shutil
import glob
import zipfile
from time import localtime, strftime
from corpkit.textprogressbar import TextProgressBar
from corpkit.process import animator
file_name = url.split('/')[-1]
home = os.path.expanduser("~")
customdir = kwargs.get('custom_corenlp_dir', False)
# if it's corenlp, put it in home/corenlp
# if that dir exists, check if for a zip file
# if there's a zipfile and it works, move on
# if there's a zipfile and it's broken, delete it
if 'stanford' in url:
if customdir:
downloaded_dir = customdir
else:
downloaded_dir = os.path.join(home, 'corenlp')
if not os.path.isdir(downloaded_dir):
os.makedirs(downloaded_dir)
else:
poss_zips = glob.glob(os.path.join(downloaded_dir, 'stanford-corenlp-full*.zip'))
if poss_zips:
fullfile = poss_zips[-1]
from zipfile import BadZipfile
try:
the_zip_file = zipfile.ZipFile(fullfile)
ret = the_zip_file.testzip()
if ret is None:
return downloaded_dir, fullfile
else:
os.remove(fullfile)
except BadZipfile:
os.remove(fullfile)
#else:
# shutil.rmtree(downloaded_dir)
else:
downloaded_dir = os.path.join(proj_path, 'temp')
try:
os.makedirs(downloaded_dir)
except OSError:
pass
fullfile = os.path.join(downloaded_dir, file_name)
if actually_download:
import __main__ as main
if not root and not hasattr(main, '__file__'):
txt = 'CoreNLP not found. Download latest version (%s)? (y/n) ' % url
selection = INPUTFUNC(txt)
if 'n' in selection.lower():
return None, None
try:
import requests
# NOTE the stream=True parameter
r = requests.get(url, stream=True, verify=False)
file_size = int(r.headers['content-length'])
file_size_dl = 0
block_sz = 8192
showlength = file_size / block_sz
thetime = strftime("%H:%M:%S", localtime())
print('\n%s: Downloading ... \n' % thetime)
par_args = {'printstatus': kwargs.get('printstatus', True),
'length': showlength}
if not root:
tstr = '%d/%d' % (file_size_dl + 1 / block_sz, showlength)
p = animator(None, None, init=True, tot_string=tstr, **par_args)
animator(p, file_size_dl + 1, tstr)
with open(fullfile, 'wb') as f:
for chunk in r.iter_content(chunk_size=block_sz):
if chunk: # filter out keep-alive new chunks
f.write(chunk)
file_size_dl += len(chunk)
#print file_size_dl * 100.0 / file_size
if kwargs.get('note'):
kwargs['note'].progvar.set(file_size_dl * 100.0 / int(file_size))
else:
tstr = '%d/%d' % (file_size_dl / block_sz, showlength)
animator(p, file_size_dl / block_sz, tstr, **par_args)
if root:
root.update()
except Exception as err:
import traceback
print(traceback.format_exc())
thetime = strftime("%H:%M:%S", localtime())
print('%s: Download failed' % thetime)
try:
f.close()
except:
pass
if root:
root.update()
return None, None
if kwargs.get('note'):
kwargs['note'].progvar.set(100)
else:
p.animate(int(file_size))
thetime = strftime("%H:%M:%S", localtime())
        print('\n%s: Downloaded successfully.' % thetime)
try:
f.close()
except:
pass
return downloaded_dir, fullfile
def extract_cnlp(fullfilepath, corenlppath=False, root=False):
"""
Extract corenlp zip file
"""
import zipfile
import os
from time import localtime, strftime
time = strftime("%H:%M:%S", localtime())
print('%s: Extracting CoreNLP files ...' % time)
if root:
root.update()
if corenlppath is False:
home = os.path.expanduser("~")
corenlppath = os.path.join(home, 'corenlp')
from zipfile import BadZipfile
try:
with zipfile.ZipFile(fullfilepath) as zf:
zf.extractall(corenlppath)
except BadZipfile:
        # the downloaded archive is corrupt; remove it so it can be fetched again
        os.remove(fullfilepath)
return False
time = strftime("%H:%M:%S", localtime())
print('%s: CoreNLP extracted. ' % time)
return True
def get_corpus_filepaths(projpath=False, corpuspath=False,
restart=False, out_ext='conll'):
"""
get a list of filepaths, a la find . -type f
restart mode will look in restart dir and remove any existing files
"""
import fnmatch
import os
matches = []
# get a list of done files minus their paths and extensions
# this handles if they have been moved to the right dir or not
already_done = get_filepaths(restart, out_ext) if restart else []
already_done = [os.path.splitext(os.path.basename(x))[0] for x in already_done]
for root, dirnames, filenames in os.walk(corpuspath):
for filename in fnmatch.filter(filenames, '*.txt'):
if filename not in already_done:
matches.append(os.path.join(root, filename))
if len(matches) == 0:
return False, False
matchstring = '\n'.join(matches)
# maybe not good:
if projpath is False:
projpath = os.path.dirname(os.path.abspath(corpuspath.rstrip('/')))
corpname = os.path.basename(corpuspath)
fp = os.path.join(projpath, 'data', corpname + '-filelist.txt')
# definitely not good.
if os.path.join('data', 'data') in fp:
fp = fp.replace(os.path.join('data', 'data'), 'data')
with open(fp, "w") as f:
f.write(matchstring + '\n')
return fp, matchstring
def check_jdk():
"""
Check for a Java/OpenJDK
"""
import corpkit
import subprocess
from subprocess import PIPE, STDOUT, Popen
# add any other version string to here
javastrings = ['java version "1.8', 'openjdk version "1.8']
p = Popen(["java", "-version"], stdout=PIPE, stderr=PIPE)
_, stderr = p.communicate()
encoded = stderr.decode(encoding='utf-8').lower()
return any(j in encoded for j in javastrings)
def parse_corpus(proj_path=False,
corpuspath=False,
filelist=False,
corenlppath=False,
operations=False,
root=False,
stdout=False,
memory_mb=2000,
copula_head=True,
multiprocessing=False,
outname=False,
coref=True,
**kwargs
):
"""
Create a CoreNLP-parsed and/or NLTK tokenised corpus
"""
import subprocess
from subprocess import PIPE, STDOUT, Popen
from corpkit.process import get_corenlp_path
import os
import sys
import re
import chardet
from time import localtime, strftime
import time
fileparse = kwargs.get('fileparse', False)
from corpkit.constants import CORENLP_URL as url
if not check_jdk():
        print('Java 8 (1.8) or OpenJDK 8 is needed to run CoreNLP. Please install it and try again.')
return
curdir = os.getcwd()
note = kwargs.get('note', False)
if proj_path is False:
proj_path = os.path.dirname(os.path.abspath(corpuspath.rstrip('/')))
basecp = os.path.basename(corpuspath)
if fileparse:
new_corpus_path = os.path.dirname(corpuspath)
else:
if outname:
new_corpus_path = os.path.join(proj_path, 'data', outname)
else:
new_corpus_path = os.path.join(proj_path, 'data', '%s-parsed' % basecp)
new_corpus_path = new_corpus_path.replace('-stripped-', '-')
# todo:
# this is not stable
if os.path.join('data', 'data') in new_corpus_path:
new_corpus_path = new_corpus_path.replace(os.path.join('data', 'data'), 'data')
# this caused errors when multiprocessing
# it used to be isdir, but supposedly there was a file there
# i don't see how it's possible ...
# i think it is a 'race condition', so we'll also put a try/except there
if not os.path.exists(new_corpus_path):
try:
os.makedirs(new_corpus_path)
except OSError:
pass
else:
if not os.path.isfile(new_corpus_path):
fs = get_filepaths(new_corpus_path, ext=False)
if not multiprocessing:
if any([f.endswith('.conll') for f in fs]) or \
any([f.endswith('.conllu') for f in fs]):
print('Folder containing .conll files already exists: %s' % new_corpus_path)
return False
corenlppath = get_corenlp_path(corenlppath)
success = bool(corenlppath)
if not corenlppath:
from corpkit.constants import CORENLP_VERSION
print("CoreNLP not found. Auto-installing CoreNLP v%s..." % CORENLP_VERSION)
cnlp_dir = os.path.join(os.path.expanduser("~"), 'corenlp')
corenlppath, fpath = download_large_file(cnlp_dir, url,
root=root,
note=note,
actually_download=True,
custom_corenlp_dir=corenlppath)
# cleanup
if corenlppath is None and fpath is None:
import shutil
shutil.rmtree(new_corpus_path)
shutil.rmtree(new_corpus_path.replace('-parsed', '-stripped'))
os.remove(new_corpus_path.replace('-parsed', '-filelist.txt'))
raise ValueError('CoreNLP needed to parse texts.')
success = extract_cnlp(fpath)
if not success:
raise ValueError('CoreNLP installation failed for some reason. Try deleting the ~/corenlp directory and starting over.')
import glob
globpath = os.path.join(corenlppath, 'stanford-corenlp*')
corenlppath = [i for i in glob.glob(globpath) if os.path.isdir(i)]
if corenlppath:
corenlppath = corenlppath[-1]
else:
raise ValueError('CoreNLP installation failed for some reason. Try deleting the ~/corenlp directory and starting over.')
# if not gui, don't mess with stdout
if stdout is False:
stdout = sys.stdout
os.chdir(corenlppath)
if root:
root.update_idletasks()
# not sure why reloading sys, but seems needed
# in order to show files in the gui
try:
reload(sys)
except NameError:
import importlib
importlib.reload(sys)
pass
if memory_mb is False:
memory_mb = 2024
# you can pass in 'coref' as kwarg now
cof = ',dcoref' if coref else ''
if operations is False:
operations = 'tokenize,ssplit,pos,lemma,parse,ner' + cof
if isinstance(operations, list):
operations = ','.join([i.lower() for i in operations])
with open(filelist, 'r') as fo:
dat = fo.read()
num_files_to_parse = len([l for l in dat.splitlines() if l])
# get corenlp version number
reg = re.compile(r'stanford-corenlp-([0-9].[0-9].[0-9])-javadoc.jar')
fver = next(re.search(reg, s).group(1) for s in os.listdir('.') if re.search(reg, s))
if fver == '3.6.0':
extra_jar = 'slf4j-api.jar:slf4j-simple.jar:'
else:
extra_jar = ''
out_form = 'xml' if kwargs.get('output_format') == 'xml' else 'json'
out_ext = 'xml' if kwargs.get('output_format') == 'xml' else 'conll'
arglist = ['java', '-cp',
'stanford-corenlp-%s.jar:stanford-corenlp-%s-models.jar:xom.jar:joda-time.jar:%sjollyday.jar:ejml-0.23.jar' % (fver, fver, extra_jar),
'-Xmx%sm' % str(memory_mb),
'edu.stanford.nlp.pipeline.StanfordCoreNLP',
'-annotators',
operations,
'-filelist', filelist,
'-noClobber',
'-outputExtension', '.%s' % out_ext,
'-outputFormat', out_form,
'-outputDirectory', new_corpus_path]
if copula_head:
arglist.append('--parse.flags')
arglist.append(' -makeCopulaHead')
print('Java command:')
print(arglist)
try:
proc = subprocess.Popen(arglist, stdout=sys.stdout)
# maybe a problem with stdout. sacrifice it if need be
except:
proc = subprocess.Popen(arglist)
#p = TextProgressBar(num_files_to_parse)
while proc.poll() is None:
sys.stdout = stdout
thetime = strftime("%H:%M:%S", localtime())
if not fileparse:
num_parsed = len([f for f in os.listdir(new_corpus_path) if f.endswith(out_ext)])
if num_parsed == 0:
if root:
print('%s: Initialising parser ... ' % (thetime))
if num_parsed > 0 and (num_parsed + 1) <= num_files_to_parse:
if root:
print('%s: Parsing file %d/%d ... ' % \
(thetime, num_parsed + 1, num_files_to_parse))
if kwargs.get('note'):
kwargs['note'].progvar.set((num_parsed) * 100.0 / num_files_to_parse)
#p.animate(num_parsed - 1, str(num_parsed) + '/' + str(num_files_to_parse))
time.sleep(1)
if root:
root.update()
#p.animate(num_files_to_parse)
if kwargs.get('note'):
kwargs['note'].progvar.set(100)
sys.stdout = stdout
thetime = strftime("%H:%M:%S", localtime())
print('%s: Parsing finished. Moving parsed files into place ...' % thetime)
os.chdir(curdir)
return new_corpus_path
def move_parsed_files(proj_path, old_corpus_path, new_corpus_path,
ext='conll', restart=False):
"""
Make parsed files follow existing corpus structure
"""
import corpkit
import shutil
import os
import fnmatch
cwd = os.getcwd()
basecp = os.path.basename(old_corpus_path)
dir_list = []
# go through old path, make file list
for path, dirs, files in os.walk(old_corpus_path):
for bit in dirs:
# is the last bit of the line below windows safe?
dir_list.append(os.path.join(path, bit).replace(old_corpus_path, '')[1:])
for d in dir_list:
if not restart:
os.makedirs(os.path.join(new_corpus_path, d))
else:
try:
os.makedirs(os.path.join(new_corpus_path, d))
except OSError:
pass
# make list of parsed filenames that haven't been moved already
parsed_fs = [f for f in os.listdir(new_corpus_path) if f.endswith('.%s' % ext)]
# make a dictionary of the right paths
pathdict = {}
for rootd, dirnames, filenames in os.walk(old_corpus_path):
for filename in fnmatch.filter(filenames, '*.txt'):
pathdict[filename] = rootd
# move each file
for f in parsed_fs:
noxml = f.replace('.%s' % ext, '')
right_dir = pathdict[noxml].replace(old_corpus_path, new_corpus_path)
frm = os.path.join(new_corpus_path, f)
tom = os.path.join(right_dir, f)
# forgive errors on restart mode, because some files
# might already have been moved into place
if restart:
try:
os.rename(frm, tom)
except OSError:
pass
else:
os.rename(frm, tom)
return new_corpus_path
def corenlp_exists(corenlppath=False):
import corpkit
import os
from corpkit.constants import CORENLP_VERSION
important_files = ['stanford-corenlp-%s-javadoc.jar' % CORENLP_VERSION,
'stanford-corenlp-%s-models.jar' % CORENLP_VERSION,
'stanford-corenlp-%s-sources.jar' % CORENLP_VERSION,
'stanford-corenlp-%s.jar' % CORENLP_VERSION]
if corenlppath is False:
home = os.path.expanduser("~")
corenlppath = os.path.join(home, 'corenlp')
if os.path.isdir(corenlppath):
find_install = [d for d in os.listdir(corenlppath) \
if os.path.isdir(os.path.join(corenlppath, d)) \
and os.path.isfile(os.path.join(corenlppath, d, 'jollyday.jar'))]
if len(find_install) > 0:
find_install = find_install[0]
else:
return False
javalib = os.path.join(corenlppath, find_install)
if len(javalib) == 0:
return False
if not any([f.endswith('-models.jar') for f in os.listdir(javalib)]):
return False
return True
else:
return False
return True
def get_filepaths(a_path, ext='txt'):
"""
Make list of txt files in a_path and remove non txt files
"""
import os
files = []
if os.path.isfile(a_path):
return [a_path]
for (root, dirs, fs) in os.walk(a_path):
for f in fs:
if ext:
if not f.endswith('.' + ext):
continue
if 'Unidentified' not in f \
and 'unknown' not in f \
and not f.startswith('.'):
files.append(os.path.join(root, f))
#if ext:
# if not f.endswith('.' + ext):
# os.remove(os.path.join(root, f))
return files
def make_no_id_corpus(pth, newpth, metadata_mode=False, speaker_segmentation=False):
"""
Make version of pth without ids
"""
import os
import re
import shutil
from corpkit.process import saferead
# define regex broadly enough to accept timestamps, locations if need be
from corpkit.constants import MAX_SPEAKERNAME_SIZE
idregex = re.compile(r'(^.{,%d}?):\s+(.*$)' % MAX_SPEAKERNAME_SIZE)
try:
shutil.copytree(pth, newpth)
except OSError:
shutil.rmtree(newpth)
shutil.copytree(pth, newpth)
files = get_filepaths(newpth)
names = []
metadata = []
for f in files:
good_data = []
fo, enc = saferead(f)
data = fo.splitlines()
# for each line in the file, remove speaker and metadata
for datum in data:
if speaker_segmentation:
matched = re.search(idregex, datum)
if matched:
names.append(matched.group(1))
datum = matched.group(2)
if metadata_mode:
splitmet = datum.rsplit('<metadata ', 1)
# for the impossibly rare case of a line that is '<metadata '
if not splitmet:
continue
datum = splitmet[0]
if datum:
good_data.append(datum)
with open(f, "w") as fo:
if PYTHON_VERSION == 2:
fo.write('\n'.join(good_data).encode('utf-8'))
else:
fo.write('\n'.join(good_data))
if speaker_segmentation:
from time import localtime, strftime
thetime = strftime("%H:%M:%S", localtime())
if len(names) == 0:
print('%s: No speaker names found. Turn off speaker segmentation.' % thetime)
shutil.rmtree(newpth)
else:
try:
if len(sorted(set(names))) < 19:
print('%s: Speaker names found: %s' % (thetime, ', '.join(sorted(set(names)))))
else:
print('%s: Speaker names found: %s ... ' % (thetime, ', '.join(sorted(set(names[:20])))))
except:
pass
def get_all_metadata_fields(corpus, include_speakers=False):
"""
Get a list of metadata fields in a corpus
    This could take a while for very little information
"""
from corpkit.corpus import Corpus
from corpkit.constants import OPENER, PYTHON_VERSION, MAX_METADATA_FIELDS
# allow corpus object
if not isinstance(corpus, Corpus):
corpus = Corpus(corpus, print_info=False)
if not corpus.datatype == 'conll':
return []
path = getattr(corpus, 'path', corpus)
fs = []
import os
for root, dirnames, filenames in os.walk(path):
for filename in filenames:
fs.append(os.path.join(root, filename))
badfields = ['parse', 'sent_id']
if not include_speakers:
badfields.append('speaker')
fields = set()
for f in fs:
if PYTHON_VERSION == 2:
from corpkit.process import saferead
lines = saferead(f)[0].splitlines()
else:
with OPENER(f, 'rb') as fo:
lines = fo.read().decode('utf-8', errors='ignore')
lines = lines.strip('\n')
lines = lines.splitlines()
lines = [l[2:].split('=', 1)[0] for l in lines if l.startswith('# ') \
if not l.startswith('# sent_id')]
for l in lines:
if l not in fields and l not in badfields:
fields.add(l)
if len(fields) > MAX_METADATA_FIELDS:
break
return list(fields)
def get_names(filepath, speakid):
"""
Get a list of speaker names from a file
"""
import re
from corpkit.process import saferead
txt, enc = saferead(filepath)
res = re.findall(speakid, txt)
if res:
return sorted(list(set([i.strip() for i in res])))
def get_speaker_names_from_parsed_corpus(corpus, feature='speaker'):
"""
Use regex to get speaker names from parsed data without parsing it
"""
import os
import re
from corpkit.constants import MAX_METADATA_VALUES
path = corpus.path if hasattr(corpus, 'path') else corpus
list_of_files = []
names = []
    # MULTILINE is needed so that ^ matches at the start of every line,
    # not just at the start of the whole file
speakid = re.compile(r'^# %s=(.*)' % re.escape(feature), re.MULTILINE)
# if passed a dir, do it for every file
if os.path.isdir(path):
for (root, dirs, fs) in os.walk(path):
for f in fs:
list_of_files.append(os.path.join(root, f))
elif os.path.isfile(path):
list_of_files.append(path)
for filepath in list_of_files:
res = get_names(filepath, speakid)
if not res:
continue
for i in res:
if i not in names:
names.append(i)
if len(names) > MAX_METADATA_VALUES:
break
return list(sorted(set(names)))
def rename_all_files(dirs_to_do):
"""
Get rid of the inserted dirname in filenames after parsing
"""
import os
if isinstance(dirs_to_do, STRINGTYPE):
dirs_to_do = [dirs_to_do]
for d in dirs_to_do:
if d.endswith('-parsed'):
ext = 'txt.xml'
elif d.endswith('-tokenised'):
ext = '.p'
else:
ext = '.txt'
fs = get_filepaths(d, ext)
for f in fs:
fname = os.path.basename(f)
justdir = os.path.dirname(f)
subcorpus = os.path.basename(justdir)
newname = fname.replace('-%s.%s' % (subcorpus, ext), '.%s' % ext)
os.rename(f, os.path.join(justdir, newname))
def flatten_treestring(tree):
"""
Turn bracketed tree string into something looking like English
"""
import re
tree = re.sub(r'\(.*? ', '', tree).replace(')', '')
tree = tree.replace('$ ', '$').replace('`` ', '``').replace(' ,', ',').replace(' .', '.').replace("'' ", "''").replace(" n't", "n't").replace(" 're","'re").replace(" 'm","'m").replace(" 's","'s").replace(" 'd","'d").replace(" 'll","'ll").replace(' ', ' ')
return tree
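# For example (a sketch, not from the source), flatten_treestring("(NP (DT the) (NN dog))")
# returns 'the dog': labelled opening brackets are stripped by the regex, closing
# brackets are removed, and spacing around punctuation and clitics is tidied up.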
def can_folderise(folder):
"""
Check if corpus can be put into folders
"""
import os
from glob import glob
if os.path.isfile(folder):
return False
fs = glob(os.path.join(folder, '*.txt'))
if len(fs) > 1:
if not any(os.path.isdir(x) for x in glob(os.path.join(folder, '*'))):
return True
return False
def folderise(folder):
"""
Move each file into a folder
"""
import os
import shutil
from glob import glob
from corpkit.process import makesafe
fs = glob(os.path.join(folder, '*.txt'))
for f in fs:
newname = makesafe(os.path.splitext(os.path.basename(f))[0])
newpath = os.path.join(folder, newname)
if not os.path.exists(newpath):
os.makedirs(newpath)
shutil.move(f, os.path.join(newpath))
================================================
FILE: corpkit/completer.py
================================================
class Completer(object):
"""
Tab completion for interpreter
"""
def __init__(self, words):
self.words = words
self.prefix = None
def complete(self, prefix, index):
"""
Add paths etc to this
"""
if prefix != self.prefix:
# we have a new prefix!
# find all words that start with this prefix
self.matching_words = [
w for w in self.words if w.startswith(prefix)
]
self.prefix = prefix
try:
return self.matching_words[index]
except IndexError:
return None
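# A minimal usage sketch (not in the original file), assuming the standard-library
# readline module; the word list here is invented for illustration:
#   import readline
#   readline.set_completer(Completer(['search', 'show', 'exit']).complete)
#   readline.parse_and_bind('tab: complete')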
================================================
FILE: corpkit/configurations.py
================================================
def configurations(corpus, search, **kwargs):
"""
Get summary of behaviour of a word
see corpkit.corpus.Corpus.configurations() for docs
"""
from corpkit.dictionaries.wordlists import wordlists
from corpkit.dictionaries.roles import roles
from corpkit.interrogation import Interrodict
from corpkit.interrogator import interrogator
from collections import OrderedDict
if search.get('l') and search.get('w'):
raise ValueError('Search only for a word or a lemma, not both.')
# are we searching words or lemmata?
if search.get('l'):
dep_word_or_lemma = 'dl'
gov_word_or_lemma = 'gl'
word_or_token = search.get('l')
    elif search.get('w'):
        dep_word_or_lemma = 'd'
        gov_word_or_lemma = 'g'
        word_or_token = search.get('w')
# make nested query dicts for each semantic role
queries = {'participant':
{'left_participant_in':
{dep_word_or_lemma: word_or_token,
'df': roles.participant1,
'f': roles.event},
'right_participant_in':
{dep_word_or_lemma: word_or_token,
'df': roles.participant2,
'f': roles.event},
'premodified':
{'f': roles.premodifier,
gov_word_or_lemma: word_or_token},
'postmodified':
{'f': roles.postmodifier,
gov_word_or_lemma: word_or_token},
'and_or':
{'f': 'conj:(?:and|or)',
'gf': roles.participant,
gov_word_or_lemma: word_or_token},
},
'process':
{'has_subject':
{'f': roles.participant1,
gov_word_or_lemma: word_or_token},
'has_object':
{'f': roles.participant2,
gov_word_or_lemma: word_or_token},
'modalised_by':
{'f': r'aux',
'w': wordlists.modals,
gov_word_or_lemma: word_or_token},
'modulated_by':
{'f': 'advmod',
'gf': roles.event,
gov_word_or_lemma: word_or_token},
'and_or':
{'f': 'conj:(?:and|or)',
'gf': roles.event,
gov_word_or_lemma: word_or_token},
},
'modifier':
{'modifies':
{'df': roles.modifier,
dep_word_or_lemma: word_or_token},
'modulated_by':
{'f': 'advmod',
'gf': roles.modifier,
gov_word_or_lemma: word_or_token},
'and_or':
{'f': 'conj:(?:and|or)',
'gf': roles.modifier,
gov_word_or_lemma: word_or_token},
}
}
# allow passing in of single function
if search.get('f'):
if search.get('f').lower().startswith('part'):
queries = queries['participant']
elif search.get('f').lower().startswith('proc'):
queries = queries['process']
elif search.get('f').lower().startswith('mod'):
queries = queries['modifier']
else:
newqueries = {}
for k, v in queries.items():
for name, pattern in v.items():
newqueries[name] = pattern
queries = newqueries
queries['and_or'] = {'f': 'conj:(?:and|or)', gov_word_or_lemma: word_or_token}
# count all queries to be done
# total_queries = 0
# for k, v in queries.items():
# total_queries += len(v)
kwargs['search'] = queries
# do interrogation
data = corpus.interrogate(**kwargs)
# remove result itself
# not ideal, but it's much more impressive this way.
if isinstance(data, Interrodict):
for k, v in data.items():
v.results = v.results.drop(word_or_token, axis=1, errors='ignore')
v.totals = v.results.sum(axis=1)
data[k] = v
return Interrodict(data)
else:
return data
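# Usage sketch (illustrative only; the corpus path is hypothetical and kwargs
# are passed straight through to Corpus.interrogate):
#   from corpkit.corpus import Corpus
#   conf = configurations(Corpus('data/mycorpus-parsed'), search={'w': 'risk'})
# The result maps names like 'has_subject' or 'premodified' to interrogations.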
================================================
FILE: corpkit/conll.py
================================================
"""
corpkit: process CONLL formatted data
"""
def parse_conll(f,
first_time=False,
just_meta=False,
usecols=None):
"""
Make a pandas.DataFrame with metadata from a CONLL-U file
Args:
f (str): Filepath
first_time (bool, optional): If True, add in sent index
just_meta (bool, optional): Return only a metadata `dict`
        usecols (None, optional): Which columns pandas.read_csv should parse
Returns:
pandas.DataFrame: DataFrame containing tokens and a ._metadata attribute
"""
import pandas as pd
try:
from StringIO import StringIO
except ImportError:
from io import StringIO
from collections import defaultdict
# go to corpkit.constants to modify the order of columns if yours are different
from corpkit.constants import CONLL_COLUMNS as head
with open(f, 'r') as fo:
data = fo.read().strip('\n')
splitdata = []
metadata = {}
sents = data.split('\n\n')
for count, sent in enumerate(sents, start=1):
metadata[count] = defaultdict(set)
for line in sent.split('\n'):
if line and not line.startswith('#') \
and not just_meta:
splitdata.append('\n%d\t%s' % (count, line))
else:
line = line.lstrip('# ')
if '=' in line:
field, val = line.split('=', 1)
metadata[count][field].add(val)
metadata[count] = {k: ','.join(v) for k, v in metadata[count].items()}
if just_meta:
return metadata
# happens with empty files
if not splitdata:
return
# head can only be as long as the list of cols in the df
num_tabs = splitdata[0].strip('\t').count('\t')
head = head[:num_tabs]
# introduce sentence index for multiindex
#for i, d in enumerate(splitdata, start=1):
# d = d.replace('\n', '\n%s\t' % str(i))
# splitdata[i-1] = d
# turn into something pandas can read
data = '\n'.join(splitdata)
data = data.replace('\n\n', '\n') + '\n'
# remove slashes as early as possible
data = data.replace('/', '-slash-')
# open with sent and token as multiindex
try:
df = pd.read_csv(StringIO(data), sep='\t', header=None,
names=['s'] + head, index_col=['s', 'i'], usecols=usecols)
#df.index = pd.MultiIndex.from_tuples([(1, i) for i in df.index])
except ValueError:
return
df._metadata = metadata
return df
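# Usage sketch (the path below is hypothetical):
#   df = parse_conll('data/mycorpus-parsed/first/01.txt.conll')
#   df._metadata[1]   # metadata dict for the first sentence
# The DataFrame is indexed by (sentence, token) pairs, with CONLL_COLUMNS as columns.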
def get_dependents_of_id(idx, df=False, repeat=False, attr=False, coref=False):
"""
Get dependents of a token
"""
sent_id, tok_id = getattr(idx, 'name', idx)
deps = df.ix[sent_id, tok_id]['d'].split(',')
out = []
for govid in deps:
if attr:
# might not exist...
try:
tok = getattr(df.ix[sent_id,int(govid)], attr, False)
if tok:
out.append(tok)
except (KeyError, IndexError):
pass
else:
out.append((sent_id, int(govid)))
return out
def get_governors_of_id(idx, df=False, repeat=False, attr=False, coref=False):
"""
Get governors of a token
"""
# it can be a series or a tuple
sent_id, tok_id = getattr(idx, 'name', idx)
# get the governor id
govid = df['g'].loc[sent_id, tok_id]
if attr:
return getattr(df.loc[sent_id,govid], attr, 'root')
return [(sent_id, govid)]
def get_match(idx, df=False, repeat=False, attr=False, **kwargs):
"""
Dummy function, for the most part
"""
sent_id, tok_id = getattr(idx, 'name', idx)
if attr:
return df[attr].ix[sent_id, tok_id]
return [(sent_id, tok_id)]
def get_head(idx, df=False, repeat=False, attr=False, **kwargs):
"""
    Get the head of a 'constituent'---
for 'corpus linguistics', if 'corpus' is searched, return 'linguistics'
"""
sent_id, tok_id = getattr(idx, 'name', idx)
#sent = df.ix[sent_id]
token = df.ix[sent_id, tok_id]
if not hasattr(token, 'c'):
# this should error, because the data isn't there at all
lst_of_ixs = [(sent_id, tok_id)]
elif token['c'] == '_':
lst_of_ixs = [(sent_id, tok_id)]
# if it is the head, return it
elif token['c'].endswith('*'):
lst_of_ixs = [(sent_id, tok_id)]
else:
# should be able to speed this one up!
just_same_coref = df.loc[sent_id][df.loc[sent_id]['c'] == token['c'] + '*']
if not just_same_coref.empty:
lst_of_ixs = [(sent_id, i) for i in just_same_coref.index]
else:
lst_of_ixs = [(sent_id, tok_id)]
if attr:
lst_of_ixs = [df.loc[i][attr] for i in lst_of_ixs]
return lst_of_ixs
def get_representative(idx,
df=False,
repeat=False,
attr=False,
**kwargs):
"""
Get the representative coref head
"""
sent_id, tok_id = getattr(idx, 'name', idx)
token = df.ix[sent_id, tok_id]
# if no corefs at all
if not hasattr(token, 'c'):
# this should error, because the data isn't there at all
lst_of_ixs = [(sent_id, tok_id)]
# if no coref available
elif token['c'] == '_':
lst_of_ixs = [(sent_id, tok_id)]
else:
just_same_coref = df.loc[df['c'] == token['c'] + '*']
if not just_same_coref.empty:
lst_of_ixs = [just_same_coref.iloc[0].name]
else:
lst_of_ixs = [(sent_id, tok_id)]
if attr:
lst_of_ixs = [df.ix[i][attr] for i in lst_of_ixs]
return lst_of_ixs
def get_all_corefs(s, i, df, coref=False):
# if not in coref mode, skip
if not coref:
return [(s, i)]
# if the word was not a head, forget it
if not df.ix[s,i]['c'].endswith('*'):
return [(s, i)]
try:
# get any other mention head for this coref chain
just_same_coref = df.loc[df['c'] == df.ix[s,i]['c']]
return list(just_same_coref.index)
except:
return [(s, i)]
def search_this(df, obj, attrib, pattern, adjacent=False, coref=False):
"""
Search the dataframe for a single criterion
"""
import re
out = []
# if searching by head, they need to be heads
if obj == 'h':
        df = df.loc[df['c'].str.endswith('*')]
# cut down to just tokens with matching attr
# but, if the pattern is 'any', don't bother
if hasattr(pattern, 'pattern') and pattern.pattern == r'.*':
matches = df
else:
matches = df[df[attrib].fillna('').str.contains(pattern)]
# functions for getting the needed object
revmapping = {'g': get_dependents_of_id,
'd': get_governors_of_id,
'm': get_match,
'h': get_all_corefs,
'r': get_representative}
getfunc = revmapping.get(obj)
for idx in list(matches.index):
if adjacent:
            if adjacent[0] == '+':
                tomove = -int(adjacent[1])
            elif adjacent[0] == '-':
                tomove = int(adjacent[1])
idx = (idx[0], idx[1] + tomove)
for mindex in getfunc(idx, df=df, coref=coref):
if mindex:
out.append(mindex)
return list(set(out))
def show_fix(show):
"""show everything"""
objmapping = {'d': get_dependents_of_id,
'g': get_governors_of_id,
'm': get_match,
'h': get_head}
out = []
for val in show:
adj, val = determine_adjacent(val)
obj, attr = val[0], val[-1]
obj_getter = objmapping.get(obj)
        out.append((adj, val, obj, attr, obj_getter))
return out
def dummy(x, *args, **kwargs):
return x
def format_toks(to_process, show, df):
"""
Format matches by show values
"""
import pandas as pd
objmapping = {'d': get_dependents_of_id,
'g': get_governors_of_id,
'm': get_match,
'h': get_head}
sers = []
dmode = any(x.startswith('d') for x in show)
if dmode:
from collections import defaultdict
dicts = defaultdict(dict)
for val in show:
adj, val = determine_adjacent(val)
if adj:
if adj[0] == '+':
tomove = int(adj[1])
elif adj[0] == '-':
tomove = -int(adj[1])
obj, attr = val[0], val[-1]
func = objmapping.get(obj, dummy)
out = defaultdict(dict) if dmode else []
for ix in list(to_process.index):
piece = False
if adj:
ix = (ix[0], ix[1] + tomove)
if ix not in df.index:
piece = 'none'
if not piece:
if obj == 'm':
piece = df.loc[ix][attr.replace('x', 'p')]
if attr == 'x':
from corpkit.dictionaries.word_transforms import taglemma
piece = taglemma.get(piece.lower(), piece.lower())
piece = [piece]
else:
piece = func(ix, df=df, attr=attr)
if not isinstance(piece, list):
piece = [piece]
if dmode:
dicts[ix][val] = piece
else:
out.append(piece[0])
if not dmode:
ser = pd.Series(out, index=to_process.index)
ser.name = val
sers.append(ser)
if not dmode:
dx = pd.concat(sers, axis=1)
if len(dx.columns) == 1:
return dx.iloc[:,0]
else:
return dx.apply('/'.join, axis=1)
else:
index = []
data = []
for ix, dct in dicts.items():
max_key, max_value = max(dct.items(), key=lambda x: len(x[1]))
for val, pieces in dct.items():
if len(pieces) == 1:
dicts[ix][val] = pieces * len(max_value)
for tup in list(zip(*[i for i in dct.values()])):
index.append(ix)
data.append('/'.join(tup))
return pd.Series(data, index=pd.MultiIndex.from_tuples(index))
def make_series(ser, df=False, obj=False,
att=False, adj=False):
"""
To apply to a DataFrame to add complex criteria, like 'gf'
"""
# distance mode
if att == 'a':
count = 0
if obj == 'g':
if ser[obj] == 0:
return '-1'
ser = df.loc[ser.name[0], ser['g']]
while count < 20:
if ser['mf'].lower() == 'root':
return str(count)
ser = df.loc[ser.name[0], ser['g']]
count += 1
return '20+'
# h is head of this particular group
if obj == 'h':
cohead = ser['c']
if cohead.endswith('*'):
return ser['m' + att]
elif cohead == '_':
return 'none'
else:
sent = df.loc[ser.name[0]]
just_cof = sent[sent['c'] == cohead + '*']
if just_cof.empty:
return ser['m' + att]
else:
return just_cof.iloc[0]['m' + att]
# r is the representative mention head
if obj == 'r':
cohead = ser['c']
if cohead == '_':
return 'none'
if not cohead.endswith('*'):
cohead = cohead + '*'
# iterrows is slow, but we only need the first instance
just_cof = df[df['c'] == cohead]
if just_cof.empty:
return ser['m' + att]
else:
return just_cof.iloc[0]['m' + att]
if obj == 'g':
if ser[obj] == 0:
return 'root'
else:
try:
return df[att][ser.name[0], ser[obj]]
# this keyerror can happen if governor is punctuation, for example
except KeyError:
return
# if dependent, we need to return a df-like thing instead
elif obj == 'd':
#import pandas as pd
idxs = [(ser.name[0], int(i)) for i in ser[obj].split(',')]
dat = df[att].ix[idxs]
return dat
# todo: fix everything below here
elif obj == 'r': # get the representative
cohead = ser['c'].rstrip('*')
refs = df[df['c'] == cohead + '*']
return refs[att].ix[0]
elif obj == 'h': # get head
cohead = ser['c']
if cohead.endswith('*'):
return ser[att]
else:
sent = df[att].loc[ser.name[0]]
return sent[sent['c'] == cohead + '*']
# potential naming conflict with sent index ...
    elif obj == 's': # get whole phrase
cohead = ser['c']
sent = df[att].loc[ser.name[0]]
return sent[sent['c'] == cohead.rstrip('*')].values
def joiner(ser):
return ser.str.cat(sep='/')
def make_new_for_dep(dfmain, dfdep, name):
"""
    If showing a dependent, we have to make a whole new dataframe
:param dfmain: dataframe with everything in it
:param dfdep: dataframe with just dependent
"""
import pandas as pd
import numpy as np
new = []
newd = []
index = []
for (i, ml), (_, dl) in zip(dfmain.iterrows(), dfdep.iterrows()):
if all(pd.isnull(i) for i in dl.values):
index.append(i)
new.append(ml)
newd.append('none')
continue
else:
for bit in dl:
if pd.isnull(bit):
continue
index.append(i)
new.append(ml)
newd.append(bit)
#todo: account for no matches
index = pd.MultiIndex.from_tuples(index, names=['s', 'i'])
newdf = pd.DataFrame(new, index=index)
newdf[name] = newd
return newdf
def turn_pos_to_wc(ser, showval):
if not showval:
return ser
import pandas as pd
from corpkit.dictionaries.word_transforms import taglemma
vals = [taglemma.get(piece.lower(), piece.lower())
for piece in ser.values]
news = pd.Series(vals, index=ser.index)
news.name = ser.name[:-1] + 'x'
return news
def concline_generator(matches, idxs, df, metadata,
add_meta, category, fname, preserve_case=False):
"""
Get all conclines
:param matches: a list of formatted matches
:param idxs: their (sent, word) idx
"""
conc_res = []
# potential speedup: turn idxs into dict
from collections import defaultdict
mdict = defaultdict(list)
# if remaking idxs here, don't need to do it earlier
idxs = list(matches.index)
for mid, (s, i) in zip(matches, idxs):
#for s, i in matches:
mdict[s].append((i, mid))
# shorten df to just relevant sents to save lookup time
df = df.loc[list(mdict.keys())]
# don't look up the same sentence multiple times
for s, tup in sorted(mdict.items()):
sent = df.loc[s]
if not preserve_case:
sent = sent.str.lower()
meta = metadata[s]
sname = meta.get('speaker', 'none')
for i, mid in tup:
if not preserve_case:
mid = mid.lower()
ix = '%d,%d' % (s, i)
start = ' '.join(sent.loc[:i-1].values)
end = ' '.join(sent.loc[i+1:].values)
lin = [ix, category, fname, sname, start, mid, end]
if add_meta:
for k, v in sorted(meta.items()):
if k in ['speaker', 'parse', 'sent_id']:
continue
if isinstance(add_meta, list):
if k in add_meta:
lin.append(v)
elif add_meta is True:
lin.append(v)
conc_res.append(lin)
return conc_res
def p_series_to_x_series(val):
return taglemma.get(val.lower(), val.lower())
def fast_simple_conc(dfss, idxs, show,
metadata=False,
add_meta=False,
fname=False,
category=False,
only_format_match=True,
conc=False,
preserve_case=False,
gramsize=1,
window=None):
"""
Fast, simple concordancer, heavily conditional
to save time.
"""
if dfss.empty:
return [], []
import pandas as pd
# best case, the user doesn't want any gov-dep stuff
simple = all(i.startswith('m') and not i.endswith('a') for i in show)
# worst case, the user wants something from dep
dmode = any(x.startswith('d') for x in show)
# make a quick copy if need be because we modify the df
df = dfss.copy() if not simple else dfss
# add text to df columns so that it resembles 'show' values
lst = ['s', 'i', 'w', 'l', 'e', 'p', 'f']
# for ner, change O to 'none'
if 'e' in df.columns:
df['e'] = df['e'].str.replace('^O$', 'none')
df.columns = ['m' + i if len(i) == 1 and i in lst \
else i for i in list(df.columns)]
# this is the data needed for concordancing
df_for_lr = df['mw'] if only_format_match else df
just_matches = df.loc[idxs]
# if the showing can't come straight out of the df,
# we can add columns with the necessary information
if not simple:
formatted = []
import numpy as np
for ind, i in enumerate(show):
# nothing to do if it's an m feature
if i.startswith('m') and not i.endswith('a'):
continue
# defaults for adjacent work
adj, tomove, adjname = False, False, ''
adj, i = determine_adjacent(i)
adjname = ''.join(adj) if hasattr(adj, '__iter__') else ''
# get number of places to shift left or right
if adj:
if adj[0] == '+':
tomove = -int(adj[1])
elif adj[0] == '-':
tomove = int(adj[1])
# cut df down to just needed bits for the sake of speed
# i.e. if we want gov func, get only gov and func cols
ob, att = i[0], i[-1]
xmode = att == 'x'
if xmode:
att = 'p'
show[ind] = show[ind][:-1] + 'p'
# for corefs, we also need the coref data
if ob in ['h', 'r']:
dfx = df[['c', 'm' + att]]
else:
lst = ['s', 'i', 'w', 'l', 'f', 'p']
if att in lst and ob != 'm':
att = 'm' + att
if ob == 'm' and att != 'a':
dfx = df[['m' + att]]
elif att == 'a':
dfx = df[['mf', 'g']]
else:
dfx = df[[ob, att]]
# decide if we need to format everything
if (not conc or only_format_match) and not adj:
to_proc = just_matches
else:
to_proc = df
# now we get or generate the new column
if ob == 'm' and att != 'a':
ser = to_proc['m' + att]
else:
ser = to_proc.apply(make_series, df=dfx, obj=ob, att=att, axis=1)
if xmode:
ser = ser.apply(p_series_to_x_series)
# adjmode simply shifts series and index
if adj:
#todo: this shifts next sent into previous sent!
ser = ser.shift(tomove)
ser = ser.fillna('none')
# dependent mode produces multiple matches
# so, we have to make a new dataframe with duplicate indexes
# todo: what about when there are two dep options?
ser.name = adjname + i
if ob != 'd':
df[ser.name] = ser
else:
df = make_new_for_dep(df, ser, i)
df = df.fillna('none')
# x is wordclass. so, we just get pos and translate it
nshow = [(i.replace('x', 'p'), i.endswith('x')) for i in show]
# generate a series of matches with slash sep if multiple show vals
if len(nshow) > 1:
if conc and not only_format_match:
first = turn_pos_to_wc(df[nshow[0][0]], nshow[0][1])
llist = [turn_pos_to_wc(df[sho], xmode) for sho, xmode in nshow[1:]]
df = first.str.cat(others=llist, sep='/')
matches = df[idxs]
else:
justm = df.loc[idxs]
first = turn_pos_to_wc(justm[nshow[0][0]], nshow[0][1])
llist = [turn_pos_to_wc(justm[sho], xmode) for sho, xmode in nshow[1:]]
matches = first.str.cat(others=llist, sep='/')
if conc:
df = df_for_lr
else:
if conc and not only_format_match:
df = turn_pos_to_wc(df[nshow[0][0]], nshow[0][1])
matches = df[idxs]
else:
matches = turn_pos_to_wc(df[nshow[0][0]][idxs], nshow[0][1])
if conc:
df = df_for_lr
# get rid of (e.g.) nan caused by no_punct=True
matches = matches.dropna(axis=0, how='all')
if not preserve_case:
matches = matches.str.lower()
if not conc:
# todo: is matches.values faster?
return list(matches), []
else:
conc_res = concline_generator(matches, idxs, df,
metadata, add_meta,
category, fname,
preserve_case=preserve_case)
return list(matches), conc_res
def make_collocate_show(show, current):
"""
    Turn show values into collocate show values for a given window position
"""
out = []
for i in show:
out.append(i)
for i in show:
newn = '%s%s' % (str(current), i)
if not newn.startswith('-'):
newn = '+' + newn
out.append(newn)
return out
def show_this(df, matches, show, metadata, conc=False,
coref=False, category=False, show_conc_metadata=False, **kwargs):
only_format_match = kwargs.pop('only_format_match', True)
ngram_mode = kwargs.get('ngram_mode', True)
preserve_case = kwargs.get('preserve_case', False)
gramsize = kwargs.get('gramsize', 1)
window = kwargs.get('window', None)
matches = sorted(list(matches))
# add index as column if need be
if any(i.endswith('s') for i in show):
df['ms'] = [str(i) for i in df.index.labels[0]]
if any(i.endswith('i') for i in show):
df['mi'] = [str(i) for i in df.index.labels[1]]
# attempt to leave really fast
if kwargs.get('countmode'):
return len(matches), {}
if len(show) == 1 and not conc and gramsize == 1 and not window:
if show[0] in ['ms', 'mi', 'mw', 'ml', 'mp', 'mf']:
get_fast = df.loc[matches][show[0][-1]]
if not preserve_case:
get_fast = get_fast.str.lower()
return list(get_fast), {}
# todo: make work for ngram, collocate and coref
if all(i[0] in ['m', 'g', '+', '-', 'd', 'h', 'r'] for i in show):
if gramsize == 1 and not window:
return fast_simple_conc(df,
matches,
show,
metadata,
show_conc_metadata,
kwargs.get('filename', ''),
category,
only_format_match,
conc=conc,
preserve_case=preserve_case,
gramsize=gramsize,
window=window)
else:
resbit = []
concbit = []
iterab = range(1, gramsize + 1) if gramsize > 1 else range(-window, window+1)
for i in iterab:
if i == 0:
continue
if window:
nnshow = make_collocate_show(show, i)
else:
nnshow = show
r, c = fast_simple_conc(df,
matches,
nnshow,
metadata,
show_conc_metadata,
kwargs.get('filename', ''),
category,
only_format_match,
conc=conc,
preserve_case=preserve_case,
gramsize=gramsize,
window=window)
resbit.append(r)
concbit.append(c)
if not window:
df = df.shift(1)
df = df.fillna('none')
resbit = list(zip(*resbit))
concbit = list(zip(*concbit))
out = []
conc_out = []
# this is slow but keeps the order
# remove it esp for resbit where it doesn't matter
for r in resbit:
for b in r:
out.append(b)
for c in concbit:
for b in c:
conc_out.append(b)
return out, conc_out
def remove_by_mode(matches, mode, criteria):
"""
If mode is all, remove any entry that occurs < len(criteria)
"""
if mode == 'any':
return set(matches)
if mode == 'all':
from collections import Counter
counted = Counter(matches)
return set(k for k, v in counted.items() if v >= len(criteria))
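# For example, with mode='all' and two criteria, a match must have been returned
# by both searches to be kept:
#   remove_by_mode([(1, 2), (1, 2), (1, 5)], 'all', ['w', 'f'])  ->  {(1, 2)}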
def determine_adjacent(original):
"""
Figure out if we're doing an adjacent location, get the co-ordinates
and return them and the stripped original
"""
if original[0] in ['+', '-']:
adj = (original[0], original[1:-2])
original = original[-2:]
else:
adj = False
return adj, original
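# For example, determine_adjacent('+2gw') returns (('+', '2'), 'gw'),
# while determine_adjacent('gw') returns (False, 'gw').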
def cut_df_by_metadata(df, metadata, criteria, coref=False,
feature='speaker', method='just'):
"""
Keep or remove parts of the DataFrame based on metadata criteria
"""
if not criteria:
df._metadata = metadata
return df
# maybe could be sped up, but let's not for now:
if coref:
df._metadata = metadata
return df
import re
good_sents = []
new_metadata = {}
from corpkit.constants import STRINGTYPE
# could make the below more elegant ...
for sentid, data in sorted(metadata.items()):
meta_value = data.get(feature, 'none')
lst_met_vl = meta_value.split(';')
if isinstance(criteria, (list, set, tuple)):
criteria = [i.lower() for i in criteria]
if method == 'just':
if any(i.lower() in criteria for i in lst_met_vl):
good_sents.append(sentid)
new_metadata[sentid] = data
elif method == 'skip':
if not any(i in criteria for i in lst_met_vl):
good_sents.append(sentid)
new_metadata[sentid] = data
elif isinstance(criteria, (re._pattern_type, STRINGTYPE)):
if method == 'just':
if any(re.search(criteria, i, re.IGNORECASE) for i in lst_met_vl):
good_sents.append(sentid)
new_metadata[sentid] = data
elif method == 'skip':
if not any(re.search(criteria, i, re.IGNORECASE) for i in lst_met_vl):
good_sents.append(sentid)
new_metadata[sentid] = data
df = df.loc[good_sents]
df = df.fillna('')
df._metadata = new_metadata
return df
def cut_df_by_meta(df, just_metadata, skip_metadata):
"""
Reshape a DataFrame based on filters
"""
if df is not None:
if just_metadata:
for k, v in just_metadata.items():
df = cut_df_by_metadata(df, df._metadata, v, feature=k)
if skip_metadata:
for k, v in skip_metadata.items():
df = cut_df_by_metadata(df, df._metadata, v, feature=k, method='skip')
return df
def tgrep_searcher(f=False,
metadata=False,
from_df=False,
search=False,
searchmode=False,
exclude=False,
excludemode=False,
translated_option=False,
subcorpora=False,
conc=False,
root=False,
preserve_case=False,
countmode=False,
show=False,
lem_instance=False,
lemtag=False,
category=False,
fname=False,
show_conc_metadata=False,
only_format_match=True,
**kwargs):
"""
Use tgrep for constituency grammar search
"""
from corpkit.process import show_tree_as_per_option, tgrep
matches = []
conc_out = []
# in case search was a dict
srch = search.get('t') if isinstance(search, dict) else search
metcat = category if category else ''
for i, sent in metadata.items():
results = tgrep(sent['parse'], srch)
sname = sent.get('speaker')
metcat = category
for res in results:
tok_id, start, middle, end = show_tree_as_per_option(show, res, sent,
df=from_df, sent_id=i, conc=conc,
only_format_match=only_format_match)
#middle, idx = show_tree_as_per_option(show, res, 'conll', sent, df=df, sent_id=i)
matches.append(middle)
if conc:
form_ix = '%d,%d' % (i, tok_id)
lin = [form_ix, metcat, fname, sname, start, middle, end]
if show_conc_metadata:
for k, v in sorted(sent.items()):
if k in ['speaker', 'parse', 'sent_id']:
continue
if isinstance(show_conc_metadata, list):
if k in show_conc_metadata:
lin.append(v)
elif show_conc_metadata is True:
lin.append(v)
conc_out.append(lin)
return matches, conc_out
def slow_tregex(metadata=False,
search=False,
searchmode=False,
exclude=False,
excludemode=False,
translated_option=False,
subcorpora=False,
conc=False,
root=False,
preserve_case=False,
countmode=False,
show=False,
lem_instance=False,
lemtag=False,
from_df=False,
fname=False,
category=False,
only_format_match=False,
**kwargs):
"""
Do the metadata specific version of tregex queries
"""
from corpkit.process import tregex_engine, format_tregex, make_conc_lines_from_whole_mid
if isinstance(search, dict):
search = list(search.values())[0]
speak_tree = [(x.get(subcorpora, 'none'), x['parse']) for x in metadata.values()]
if speak_tree:
speak, tree = list(zip(*speak_tree))
else:
speak, tree = [], []
if all(not x for x in speak):
speak = False
to_open = '\n'.join(tree)
concs = []
if not to_open.strip('\n'):
if subcorpora:
return {}, {}
ops = ['-%s' % i for i in translated_option] + ['-o', '-n']
res = tregex_engine(query=search,
options=ops,
corpus=to_open,
root=root,
preserve_case=preserve_case,
speaker_data=False)
res = format_tregex(res, show, exclude=exclude, excludemode=excludemode,
translated_option=translated_option,
lem_instance=lem_instance, countmode=countmode, speaker_data=False,
lemtag=lemtag)
if not res:
if subcorpora:
return [], []
if conc:
ops += ['-w']
whole_res = tregex_engine(query=search,
options=ops,
corpus=to_open,
root=root,
preserve_case=preserve_case,
speaker_data=speak)
# format match too depending on option
if not only_format_match:
whole_res = format_tregex(whole_res, show, exclude=exclude, excludemode=excludemode,
translated_option=translated_option,
lem_instance=lem_instance, countmode=countmode,
speaker_data=speak, whole=True,
lemtag=lemtag)
# make conc lines from conc results
concs = make_conc_lines_from_whole_mid(whole_res, res, filename=fname, show=show)
else:
concs = [False for i in res]
if len(res) > 0 and isinstance(res[0], tuple):
res = [i[-1] for i in res]
if countmode:
if isinstance(res, int):
return res, False
else:
return len(res), False
else:
return res, concs
def get_stats(from_df=False, metadata=False, feature=False, root=False, **kwargs):
"""
Get general statistics for a DataFrame
"""
import re
from corpkit.dictionaries.process_types import processes
from collections import Counter, defaultdict
from corpkit.process import tregex_engine
def ispunct(s):
import string
return all(c in string.punctuation for c in s)
tree = [x['parse'] for x in metadata.values()]
tregex_qs = {'Imperative': r'ROOT < (/(S|SBAR)/ < (VP !< VBD !< VBG !$ NP !$ SBAR < NP !$-- S '\
'!$-- VP !$ VP)) !<< (/\?/ !< __) !<<- /-R.B-/ !<<, /(?i)^(-l.b-|hi|hey|hello|oh|wow|thank|thankyou|thanks|welcome)$/',
'Open interrogative': r'ROOT < SBARQ <<- (/\?/ !< __)',
'Closed interrogative': r'ROOT ( < (SQ < (NP $+ VP)) << (/\?/ !< __) | < (/(S|SBAR)/ < (VP $+ NP)) <<- (/\?/ !< __))',
'Unmodalised declarative': r'ROOT < (S < (/(NP|SBAR|VP)/ $+ (VP !< MD)))',
'Modalised declarative': r'ROOT < (S < (/(NP|SBAR|VP)/ $+ (VP < MD)))',
'Clauses': r'/^S/ < __',
'Interrogative': r'ROOT << (/\?/ !< __)',
'Processes': r'/VB.?/ >># (VP !< VP >+(VP) /^(S|ROOT)/)'}
result = Counter()
for name in tregex_qs.keys():
result[name] = 0
result['Sentences'] = len(set(from_df.index.labels[0]))
result['Passives'] = len(from_df[from_df['f'] == 'nsubjpass'])
result['Tokens'] = len(from_df)
    # the below has returned a float before, presumably a NaN
result['Words'] = len([w for w in list(from_df['w']) if w and not ispunct(str(w))])
result['Characters'] = sum([len(str(w)) for w in list(from_df['w']) if w])
result['Open class'] = sum([1 for x in list(from_df['p']) if x and x[0] in ['N', 'J', 'V', 'R']])
result['Punctuation'] = result['Tokens'] - result['Words']
result['Closed class'] = result['Words'] - result['Open class']
to_open = '\n'.join(tree)
if not to_open.strip('\n'):
return {}, {}
for name, q in sorted(tregex_qs.items()):
options = ['-o', '-t'] if name == 'Processes' else ['-o']
# c option removed, could cause memory problems
#ops = ['-%s' % i for i in translated_option] + ['-o', '-n']
res = tregex_engine(query=q,
options=options,
corpus=to_open,
root=root)
#res = format_tregex(res)
if not res:
continue
concs = [False for i in res]
for (_, met, r), line in zip(res, concs):
result[name] = len(res)
if name != 'Processes':
continue
non_mat = 0
for ptype in ['mental', 'relational', 'verbal']:
reg = getattr(processes, ptype).words.as_regex(boundaries='l')
count = len([i for i in res if re.search(reg, i[-1])])
nname = ptype.title() + ' processes'
result[nname] = count
if root:
root.update()
return result, {}
def get_corefs(df, matches):
"""
Add corefs to a set of matches
"""
out = set()
df = df['c']
for s, i in matches:
# keep original
out.add((s,i))
coline = df[(s, i)]
if coline.endswith('*'):
same_co = df[df == coline]
for ix in same_co.index:
out.add(ix)
return out
def pipeline(f=False,
search=False,
show=False,
exclude=False,
searchmode='all',
excludemode='any',
conc=False,
coref=False,
from_df=False,
just_metadata=False,
skip_metadata=False,
category=False,
show_conc_metadata=False,
statsmode=False,
search_trees=False,
lem_instance=False,
**kwargs):
"""
A basic pipeline for conll querying---some options still to do
"""
if isinstance(show, str):
show = [show]
all_matches = []
all_exclude = []
if from_df is False or from_df is None:
df = parse_conll(f, usecols=kwargs.get('usecols'))
# can fail here if df is none
if df is None:
print('Problem reading data from %s.' % f)
return [], []
metadata = df._metadata
else:
df = from_df
metadata = kwargs.pop('metadata')
feature = kwargs.pop('by_metadata', False)
df = cut_df_by_meta(df, just_metadata, skip_metadata)
searcher = pipeline
if statsmode:
searcher = get_stats
if search_trees == 'tregex':
searcher = slow_tregex
elif search_trees == 'tgrep':
searcher = tgrep_searcher
if feature:
if df is None:
print('Problem reading data from %s.' % f)
return {}, {}
# determine searcher
resultdict = {}
concresultdict = {}
# get all the possible values in the df for the feature of interest
all_cats = set([i.get(feature, 'none') for i in list(df._metadata.values())])
for category in all_cats:
new_df = cut_df_by_metadata(df, df._metadata, category, feature=feature, method='just')
r, c = searcher(f=False,
fname=f,
search=search,
exclude=exclude,
show=show,
searchmode=searchmode,
excludemode=excludemode,
conc=conc,
coref=coref,
from_df=new_df,
by_metadata=False,
category=category,
show_conc_metadata=show_conc_metadata,
lem_instance=lem_instance,
root=kwargs.pop('root', False),
subcorpora=feature,
metadata=new_df._metadata,
**kwargs)
resultdict[category] = r
concresultdict[category] = c
return resultdict, concresultdict
if df is None:
print('Problem reading data from %s.' % f)
return [], []
kwargs['ngram_mode'] = any(x.startswith('n') for x in show)
#df = cut_df_by_metadata(df, df._metadata, kwargs.get('just_speakers'), coref=coref)
metadata = df._metadata
try:
df['w'].str
except AttributeError:
raise AttributeError("CONLL data doesn't match expectations. " \
"Try the corpus.conll_conform() method to " \
"convert the corpus to the latest format.")
if kwargs.get('no_punct', True):
df = df[df['w'].fillna('').str.contains(kwargs.get('is_a_word', r'[A-Za-z0-9]'))]
# remove brackets --- could it be done in one regex?
df = df[~df['w'].str.contains(r'^-.*B-$')]
if kwargs.get('no_closed'):
from corpkit.dictionaries import wordlists
crit = wordlists.closedclass.as_regex(boundaries='l', case_sensitive=False)
df = df[~df['w'].str.contains(crit)]
if statsmode:
return get_stats(df, metadata, False, root=kwargs.pop('root', False), **kwargs)
elif search_trees:
return searcher(from_df=df,
search=search,
searchmode=searchmode,
exclude=exclude,
excludemode=excludemode,
conc=conc,
by_metadata=False,
metadata=metadata,
root=kwargs.pop('root', False),
fname=f,
show=show,
**kwargs)
# do no searching if 'any' is requested
if len(search) == 1 and list(search.keys())[0] == 'w' \
and hasattr(list(search.values())[0], 'pattern') \
and list(search.values())[0].pattern == r'.*':
all_matches = list(df.index)
else:
for k, v in search.items():
adj, k = determine_adjacent(k)
res = search_this(df, k[0], k[-1], v, adjacent=adj, coref=coref)
for r in res:
all_matches.append(r)
all_matches = remove_by_mode(all_matches, searchmode, search)
if exclude:
for k, v in exclude.items():
adj, k = determine_adjacent(k)
res = search_this(df, k[0], k[-1], v, adjacent=adj, coref=coref)
for r in res:
all_exclude.append(r)
all_exclude = remove_by_mode(all_exclude, excludemode, exclude)
all_matches = all_matches.difference(all_exclude)
if coref:
all_matches = get_corefs(df, all_matches)
out, conc_out = show_this(df, all_matches, show, metadata, conc,
coref=coref, category=category,
show_conc_metadata=show_conc_metadata,
**kwargs)
return out, conc_out
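# Usage sketch (illustrative only; the path is hypothetical). Search keys pair an
# object letter with an attribute letter, so 'mw' means the match's word form:
#   import re
#   res, conc_lines = pipeline(f='data/mycorpus-parsed/first/01.txt.conll',
#                              search={'mw': re.compile(r'^corpus$')},
#                              show=['mw'],
#                              conc=False)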
def load_raw_data(f):
"""
Loads the stripped and raw versions of a parsed file
"""
from corpkit.process import saferead
# open the unparsed version of the file, read into memory
stripped_txtfile = f.replace('.conll', '').replace('-parsed', '-stripped')
stripped_txtdata, enc = saferead(stripped_txtfile)
# open the unparsed version with speaker ids
id_txtfile = f.replace('.conll', '').replace('-parsed', '')
id_txtdata, enc = saferead(id_txtfile)
return stripped_txtdata, id_txtdata
def get_speaker_from_offsets(stripped, plain, sent_offsets,
metadata_mode=False,
speaker_segmentation=False):
"""
Take offsets and get a speaker ID or metadata from them
"""
if not stripped and not plain:
return {}
start, end = sent_offsets
sent = stripped[start:end]
# find out line number
# sever at start of match
cut_old_text = stripped[:start]
line_index = cut_old_text.count('\n')
# lookup this text
with_id = plain.splitlines()[line_index]
# parse xml tags in original file ...
meta_dict = {'speaker': 'none'}
if metadata_mode:
metad = with_id.strip().rstrip('>').rsplit('<metadata ', 1)
import shlex
from corpkit.constants import PYTHON_VERSION
try:
shxed = shlex.split(metad[-1].encode('utf-8')) if PYTHON_VERSION == 2 \
else shlex.split(metad[-1])
except:
shxed = metad[-1].split("' ")
for m in shxed:
if PYTHON_VERSION == 2:
m = m.decode('utf-8')
# in rare cases of weirdly formatted xml:
try:
k, v = m.split('=', 1)
v = v.replace(u"\u2018", "'").replace(u"\u2019", "'").strip("'").strip('"')
meta_dict[k] = v
except ValueError:
continue
if speaker_segmentation:
split_line = with_id.split(': ', 1)
# handle multiple tags?
if len(split_line) > 1:
speakerid = split_line[0]
else:
speakerid = 'UNIDENTIFIED'
meta_dict['speaker'] = speakerid
return meta_dict
def convert_json_to_conll(path,
speaker_segmentation=False,
coref=False,
metadata=False,
just_files=False):
"""
    Take JSON CoreNLP output and convert it to CONLL-U, with
    dependents, speaker IDs and so on added.
    Path is the parsed corpus, or a list of files within a parsed corpus.
    Might need fixing if outname is used?
"""
import json
import re
from corpkit.build import get_filepaths
from corpkit.constants import CORENLP_VERSION, OPENER
# todo: stabilise this
#if CORENLP_VERSION == '3.7.0':
# coldeps = 'enhancedPlusPlusDependencies'
#else:
# coldeps = 'collapsed-ccprocessed-dependencies'
print('Converting files to CONLL-U...')
if just_files:
files = just_files
else:
if isinstance(path, list):
files = path
else:
files = get_filepaths(path, ext='conll')
for f in files:
if speaker_segmentation or metadata:
stripped, raw = load_raw_data(f)
else:
stripped, raw = None, None
main_out = ''
# if the file has already been converted, don't worry about it
# untested?
with OPENER(f, 'r') as fo:
#try:
try:
data = json.load(fo)
except ValueError:
continue
# todo: differentiate between json errors
# rsc corpus had one json file with an error
# outputted by corenlp, and the conversion
# failed silently here
#except ValueError:
# continue
for idx, sent in enumerate(data['sentences'], start=1):
tree = sent['parse'].replace('\n', '')
tree = re.sub(r'\s+', ' ', tree)
# offsets for speaker_id
sent_offsets = (sent['tokens'][0]['characterOffsetBegin'], \
sent['tokens'][-1]['characterOffsetEnd'])
metad = get_speaker_from_offsets(stripped,
raw,
sent_offsets,
metadata_mode=True,
speaker_segmentation=speaker_segmentation)
# currently there is no standard for sent_id, so i'm leaving it out, but
# if https://github.com/UniversalDependencies/docs/issues/273 is updated
# then i could switch it back
#output = '# sent_id %d\n# parse=%s\n' % (idx, tree)
output = '# parse=%s\n' % tree
for k, v in sorted(metad.items()):
output += '# %s=%s\n' % (k, v)
for token in sent['tokens']:
index = str(token['index'])
# this got a stopiteration on rsc data
governor, func = next(((i['governor'], i['dep']) \
for i in sent.get('enhancedPlusPlusDependencies',
sent.get('collapsed-ccprocessed-dependencies')) \
if i['dependent'] == int(index)), ('_', '_'))
if governor is '_':
depends = False
else:
depends = [str(i['dependent']) for i in sent.get('enhancedPlusPlusDependencies',
sent.get('collapsed-ccprocessed-dependencies')) if i['governor'] == int(index)]
if not depends:
depends = '0'
#offsets = '%d,%d' % (token['characterOffsetBegin'], token['characterOffsetEnd'])
line = [index,
token['word'],
token['lemma'],
token['pos'],
token.get('ner', '_'),
'_', # this is morphology, which is unannotated always, but here to conform to conll u
governor,
func,
','.join(depends)]
# no ints
line = [str(l) if isinstance(l, int) else l for l in line]
from corpkit.constants import PYTHON_VERSION
if PYTHON_VERSION == 2:
try:
[unicode(l, errors='ignore') for l in line]
except TypeError:
pass
output += '\t'.join(line) + '\n'
main_out += output + '\n'
# post process corefs
if coref:
import re
dct = {}
idxreg = re.compile('^([0-9]+)\t([0-9]+)')
splitmain = main_out.split('\n')
# add tab _ to each line, make dict of sent-token: line index
for i, line in enumerate(splitmain):
if line and not line.startswith('#'):
splitmain[i] += '\t_'
match = re.search(idxreg, line)
if match:
l, t = match.group(1), match.group(2)
dct[(int(l), int(t))] = i
# for each coref chain, if there are corefs
for numstring, list_of_dicts in sorted(data.get('corefs', {}).items()):
# for each mention
for d in list_of_dicts:
snum = d['sentNum']
# get head?
# this has been fixed in dev corenlp: 'headIndex' --- could simply use that
# ref : https://github.com/stanfordnlp/CoreNLP/issues/231
for i in range(d['startIndex'], d['endIndex']):
try:
ix = dct[(snum, i)]
fixed_line = splitmain[ix].rstrip('\t_') + '\t%s' % numstring
gv = fixed_line.split('\t')[6]
try:
gov_s = int(gv)
except ValueError:
continue
if gov_s < d['startIndex'] or gov_s > d['endIndex']:
fixed_line += '*'
splitmain[ix] = fixed_line
dct.pop((snum, i))
except KeyError:
pass
main_out = '\n'.join(splitmain)
from corpkit.constants import OPENER
with OPENER(f, 'w', encoding='utf-8') as fo:
main_out = main_out.replace(u"\u2018", "'").replace(u"\u2019", "'")
fo.write(main_out)
================================================
FILE: corpkit/constants.py
================================================
import sys
import codecs
# python 2/3 compatibility
PYTHON_VERSION = sys.version_info.major
STRINGTYPE = str if PYTHON_VERSION == 3 else basestring
INPUTFUNC = input if PYTHON_VERSION == 3 else raw_input
OPENER = open if PYTHON_VERSION == 3 else codecs.open
# quicker access to search, exclude, show types
from itertools import product
_starts = ['M', 'N', 'B', 'G', 'D', 'H', 'R']
_ends = ['W', 'L', 'I', 'S', 'P', 'X', 'R', 'F', 'E']
_others = ['A', 'ANY', 'ANYWORD', 'C', 'SELF', 'V', 'K', 'T']
_prod = list(product(_starts, _ends))
_prod = [''.join(i) for i in _prod]
_letters = sorted(_prod + _starts + _ends + _others)
_adjacent_start = ['A{}'.format(i) for i in range(1, 9)] + \
['Z{}'.format(i) for i in range(1, 9)]
_adjacent = [''.join(i) for i in list(product(_adjacent_start, _prod))]
LETTERS = sorted(_letters + _adjacent)
# translating search values into words
transshow = {'f': 'Function',
'l': 'Lemma',
'a': 'Distance from root',
'w': 'Word',
't': 'Trees',
'i': 'Index',
'n': 'N-grams',
'p': 'POS',
'e': 'NER',
'c': 'Count',
'x': 'Word class',
's': 'Sentence index'}
transobjs = {'g': 'Governor',
'd': 'Dependent',
'm': 'Match',
'h': 'Head'}
# below are the column names for the conll-u formatted data
# corpkit's format is slightly different, but largely compatible.
# Key differences:
#
# 1. 'e' is used for NER, rather than lang specific POS
# 2. 'd' gives a comma-sep list of dependents, rather than head-deprel pairs
# this is done for processing speed.
# 3. 'c' is used for corefs, not 'misc comment'. it has an arbitrary number
# representing a coreference chain. the head of a mention is marked with an asterisk.
# 4. the morphology column ('v' below) is never annotated by corpkit
# default: index, word, lem, pos, ner, morph, gov, func, deps, coref
CONLL_COLUMNS = ['i', 'w', 'l', 'p', 'e', 'v', 'g', 'f', 'd', 'c']
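# An illustrative, made-up token line in this format (tab-separated in real files):
#   1   Linguistics   linguistics   NNP   O   _   2   nsubj   0   _
# i.e. index, word, lemma, POS, NER ('O' = none), morphology (unannotated),
# governor index, function, dependent indices ('0' = none), coref ('_' = none)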
# what the longest possible speaker ID is. this prevents huge lines with colons
# from getting matched unintentionally
MAX_SPEAKERNAME_SIZE = 40
# parsing sometimes fails with a java error. if corpus.parse(restart=True), this will try
# parsing n times before giving up
REPEAT_PARSE_ATTEMPTS = 3
# location of the current corenlp and its version
# old, stable
#CORENLP_URL = 'http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip'
#CORENLP_VERSION = '3.6.0'
# newest, beta
CORENLP_VERSION = '3.7.0'
CORENLP_URL = 'http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip'
# it can be very slow to load a bunch of unused metadata categories
MAX_METADATA_FIELDS = 99
MAX_METADATA_VALUES = 99
================================================
FILE: corpkit/corpkit
================================================
#!/usr/bin/env python
"""
A script to start the corpkit interpreter with options
"""
import sys
import os
# determine if we're running a script
if len(sys.argv) > 1 and os.path.isfile(sys.argv[-1]):
fromscript = sys.argv[-1]
else:
fromscript = False
def install(name, loc):
"""
If we don't have a module, download it
"""
import pip
import importlib
try:
importlib.import_module(name)
except ImportError:
pip.main(['install', loc])
tabview = ('tabview', 'git+https://github.com/interrogator/tabview@93644dd1f410de4e47466ea8083bb628b9ccc471#egg=tabview')
colorama = ('colorama', 'colorama')
# run a command a la python -c
command = sys.argv[sys.argv.index('-c') + 1] if '-c' in sys.argv else False
debug = any(i in sys.argv for i in ['--debug', '-d', 'debug'])
quiet = any(i in sys.argv for i in ['--q', '--quiet'])
load = any(i in sys.argv for i in ['--load', '-l'])
profile = any(i in sys.argv for i in ['--profile', '-p'])
version = any(i in sys.argv for i in ['--version', '-v'])
if not any('noinstall' in arg.lower() for arg in sys.argv):
install(*tabview)
install(*colorama)
if version:
import corpkit
print(corpkit.__version__)
elif any(i in sys.argv for i in ['--help', '-h']):
from corpkit.env import help_text
import pydoc
pydoc.pipepager(help_text, cmd='less -X -R -S')
else:
from corpkit.env import interpreter
interpreter(debug=debug, fromscript=fromscript,
quiet=quiet, python_c_mode=command,
profile=profile, loadcurrent=load)
================================================
FILE: corpkit/corpkit.1
================================================
.TH corpkit 1
.SH NAME
corpkit \- corpus linguistics interface
.SH SYNOPSIS
.B corpkit
[\fB\-c\fR \fICOMMAND\fR]
[\fB\-d\fR]
[\fB\-h\fR]
[\fB\-l\fR]
.IR [file]
.SH DESCRIPTION
.B corpkit
builds and searches parsed and/or structured linguistic corpora. It also edits and visualises results, and manages projects.
.SH OPTIONS
.TP
.BR " \-\-c COMMAND"\fR
A quoted command or series of commands to pass to the corpkit interpreter. Use a semicolon to delimit each command. Disables interactivity and exits on completion.
.TP
.IR "file"\fR
Pass in a script for the interpreter to run.
.TP
.BR "\-d, " \-\-debug\fR
Debug mode: print info about how command was parsed.
.TP
.BR "\-l, " \-\-load\fR
Load all saved results into store on startup.
.TP
.BR "\-h, " \-\-help\fR
Show help.
================================================
FILE: corpkit/corpus.py
================================================
"""
corpkit: Corpus and Corpus-like objects
"""
from __future__ import print_function
from lazyprop import lazyprop
from corpkit.process import classname
from corpkit.constants import STRINGTYPE, PYTHON_VERSION
class Corpus(object):
"""
A class representing a linguistic text corpus, which contains files,
optionally within subcorpus folders.
Methods for concordancing, interrogating, getting general stats, getting
behaviour of particular word, etc.
Unparsed, tokenised and parsed corpora use the same class, though some
methods are available only to one or the other. Only unparsed corpora
can be parsed, and only parsed/tokenised corpora can be interrogated.
"""
def __init__(self, path, **kwargs):
import re
import operator
import glob
import os
from os.path import join, isfile, isdir, abspath, dirname, basename
from corpkit.process import determine_datatype
# levels are 'c' for corpus, 's' for subcorpus and 'f' for file. Which
# one is determined automatically below, and processed accordingly. We
# assume it is a full corpus to begin with.
def get_symbolics(self):
return {'skip': self.skip,
'just': self.just,
'symbolic': self.symbolic}
self.data = None
self._dlist = None
self.level = kwargs.pop('level', 'c')
self.datatype = kwargs.pop('datatype', None)
self.print_info = kwargs.pop('print_info', True)
self.symbolic = kwargs.get('subcorpora', False)
self.skip = kwargs.get('skip', False)
self.just = kwargs.get('just', False)
self.kwa = get_symbolics(self)
if isinstance(path, (list, Datalist)):
self.path = abspath(dirname(path[0].path.rstrip('/')))
self.name = basename(self.path)
self.data = path
if self.level == 'd':
self._dlist = path
elif isinstance(path, STRINGTYPE):
self.path = abspath(path)
self.name = basename(path)
elif hasattr(path, 'path') and path.path:
self.path = abspath(path.path)
self.name = basename(path.path)
        # this messy code figures out as quickly as possible what the datatype
        # and singlefile status of the path is. it's messy because it shortcuts
        # full checking where possible. some of the shortcutting could maybe be
        # moved into the determine_datatype() funct.
if self.level == 'd':
self.singlefile = len(self._dlist) > 1
else:
self.singlefile = False
if os.path.isfile(self.path):
self.singlefile = True
else:
if not isdir(self.path):
if isdir(join('data', path)):
self.path = abspath(join('data', path))
if self.path.endswith('-parsed') or self.path.endswith('-tokenised'):
for r, d, f in os.walk(self.path):
if not f:
continue
if isinstance(f, str) and f.startswith('.'):
continue
if f[0].endswith('conll') or f[0].endswith('conllu'):
self.datatype = 'conll'
break
if len([d for d in os.listdir(self.path)
if isdir(join(self.path, d))]) > 0:
self.singlefile = False
if len([d for d in os.listdir(self.path)
if isdir(join(self.path, d))]) == 0:
self.level = 's'
else:
if self.level == 'c':
if not self.datatype:
self.datatype, self.singlefile = determine_datatype(
self.path)
if isdir(self.path):
if len([d for d in os.listdir(self.path)
if isdir(join(self.path, d))]) == 0:
self.level = 's'
# if initialised on a file, process as file
if self.singlefile and self.level == 'c':
self.level = 'f'
# load each interrogation as an attribute
if kwargs.get('load_saved', False):
from corpkit.other import load
from corpkit.process import makesafe
if os.path.isdir('saved_interrogations'):
saved_files = glob.glob(r'saved_interrogations/*')
for filepath in saved_files:
filename = os.path.basename(filepath)
if not filename.startswith(self.name):
continue
not_filename = filename.replace(self.name + '-', '')
not_filename = os.path.splitext(not_filename)[0]
if not_filename in ['features', 'wordclasses', 'postags']:
continue
variable_safe = makesafe(not_filename)
try:
setattr(self, variable_safe, load(filename))
if self.print_info:
print(
'\tLoaded %s as %s attribute.' %
(filename, variable_safe))
except AttributeError:
if self.print_info:
print(
'\tFailed to load %s as %s attribute. Name conflict?' %
(filename, variable_safe))
if self.print_info:
print('Corpus: %s' % self.path)
@lazyprop
def subcorpora(self):
"""
A list-like object containing a corpus' subcorpora.
"""
import re
import os
import operator
from os.path import join, isdir
if self.level == 'd':
return
if self.data.__class__ == Datalist or isinstance(self.data, (Datalist, list)):
return self.data
if self.level == 'c':
variable_safe_r = re.compile(r'[\W0-9_]+', re.UNICODE)
sbs = Datalist(sorted([Subcorpus(join(self.path
SYMBOL INDEX (376 symbols across 36 files)
FILE: conf.py
class CustomLatexFormatter (line 18) | class CustomLatexFormatter(LatexFormatter):
method __init__ (line 19) | def __init__(self, **options):
FILE: corpkit/__init__.py
function _plot (line 55) | def _plot(self, *args, **kwargs):
function _edit (line 59) | def _edit(self, *args, **kwargs):
function _save (line 63) | def _save(self, savename, **kwargs):
function _quickview (line 67) | def _quickview(self, n=25):
function _format (line 71) | def _format(self, *args, **kwargs):
function _texify (line 75) | def _texify(self, *args, **kwargs):
function _calculate (line 79) | def _calculate(self, *args, **kwargs):
function _multiplot (line 83) | def _multiplot(self, leftdict={}, rightdict={}, **kwargs):
function _perplexity (line 87) | def _perplexity(self):
function _entropy (line 107) | def _entropy(self):
function _shannon (line 118) | def _shannon(self):
function _shuffle (line 123) | def _shuffle(self, inplace=False):
function _top (line 134) | def _top(self):
function _tabview (line 141) | def _tabview(self, **kwargs):
function _rel (line 146) | def _rel(self, denominator='self', **kwargs):
function _keyness (line 150) | def _keyness(self, measure='ll', denominator='self', **kwargs):
function _plain (line 154) | def _plain(df):
FILE: corpkit/annotate.py
function process_special_annotation (line 5) | def process_special_annotation(v, lin):
function make_string_to_add (line 20) | def make_string_to_add(annotation, lin, replace=False):
function get_line_number_for_entry (line 41) | def get_line_number_for_entry(data, si, ti, annotation):
function update_contents (line 61) | def update_contents(contents, place, text, do_replace=False):
function dry_run_text (line 71) | def dry_run_text(filepath, contents, place, colours):
function annotate (line 91) | def annotate(open_file, contents):
function delete_lines (line 103) | def delete_lines(corpus, annotation, dry_run=True, colour={}):
function annotator (line 175) | def annotator(df_or_corpus, annotation, dry_run=True, deletemode=False):
FILE: corpkit/build.py
function download_large_file (line 9) | def download_large_file(proj_path, url, actually_download=True, root=Fal...
function extract_cnlp (line 123) | def extract_cnlp(fullfilepath, corenlppath=False, root=False):
function get_corpus_filepaths (line 148) | def get_corpus_filepaths(projpath=False, corpuspath=False,
function check_jdk (line 186) | def check_jdk():
function parse_corpus (line 201) | def parse_corpus(proj_path=False,
function move_parsed_files (line 399) | def move_parsed_files(proj_path, old_corpus_path, new_corpus_path,
function corenlp_exists (line 451) | def corenlp_exists(corenlppath=False):
function get_filepaths (line 481) | def get_filepaths(a_path, ext='txt'):
function make_no_id_corpus (line 503) | def make_no_id_corpus(pth, newpth, metadata_mode=False, speaker_segmenta...
function get_all_metadata_fields (line 565) | def get_all_metadata_fields(corpus, include_speakers=False):
function get_names (line 612) | def get_names(filepath, speakid):
function get_speaker_names_from_parsed_corpus (line 623) | def get_speaker_names_from_parsed_corpus(corpus, feature='speaker'):
function rename_all_files (line 659) | def rename_all_files(dirs_to_do):
function flatten_treestring (line 681) | def flatten_treestring(tree):
function can_folderise (line 690) | def can_folderise(folder):
function folderise (line 704) | def folderise(folder):
FILE: corpkit/completer.py
class Completer (line 1) | class Completer(object):
method __init__ (line 6) | def __init__(self, words):
method complete (line 10) | def complete(self, prefix, index):
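
The Completer above matches the completer protocol used by Python's readline module: a callable invoked repeatedly with a prefix and an increasing index until it returns None. A minimal, self-contained sketch of how such a class is typically wired up for tab completion; the vocabulary here is invented for illustration and is not corpkit's actual command list.

import readline

class Completer(object):
    """Complete a prefix against a fixed vocabulary (same signature as above)."""
    def __init__(self, words):
        self.words = sorted(words)

    def complete(self, prefix, index):
        # readline calls this with index 0, 1, 2, ... until it gets None back
        matches = [w for w in self.words if w.startswith(prefix)]
        return matches[index] if index < len(matches) else None

readline.set_completer(Completer(['search', 'show', 'set', 'save']).complete)
readline.parse_and_bind('tab: complete')
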
FILE: corpkit/configurations.py
function configurations (line 1) | def configurations(corpus, search, **kwargs):
FILE: corpkit/conll.py
function parse_conll (line 5) | def parse_conll(f,
function get_dependents_of_id (line 83) | def get_dependents_of_id(idx, df=False, repeat=False, attr=False, coref=...
function get_governors_of_id (line 103) | def get_governors_of_id(idx, df=False, repeat=False, attr=False, coref=F...
function get_match (line 116) | def get_match(idx, df=False, repeat=False, attr=False, **kwargs):
function get_head (line 125) | def get_head(idx, df=False, repeat=False, attr=False, **kwargs):
function get_representative (line 155) | def get_representative(idx,
function get_all_corefs (line 183) | def get_all_corefs(s, i, df, coref=False):
function search_this (line 197) | def search_this(df, obj, attrib, pattern, adjacent=False, coref=False):
function show_fix (line 240) | def show_fix(show):
function dummy (line 255) | def dummy(x, *args, **kwargs):
function format_toks (line 258) | def format_toks(to_process, show, df):
function make_series (line 336) | def make_series(ser, df=False, obj=False,
function joiner (line 421) | def joiner(ser):
function make_new_for_dep (line 424) | def make_new_for_dep(dfmain, dfdep, name):
function turn_pos_to_wc (line 456) | def turn_pos_to_wc(ser, showval):
function concline_generator (line 467) | def concline_generator(matches, idxs, df, metadata,
function p_series_to_x_series (line 512) | def p_series_to_x_series(val):
function fast_simple_conc (line 515) | def fast_simple_conc(dfss, idxs, show,
function make_collocate_show (line 672) | def make_collocate_show(show, current):
function show_this (line 686) | def show_this(df, matches, show, metadata, conc=False,
function remove_by_mode (line 771) | def remove_by_mode(matches, mode, criteria):
function determine_adjacent (line 782) | def determine_adjacent(original):
function cut_df_by_metadata (line 794) | def cut_df_by_metadata(df, metadata, criteria, coref=False,
function cut_df_by_meta (line 839) | def cut_df_by_meta(df, just_metadata, skip_metadata):
function tgrep_searcher (line 853) | def tgrep_searcher(f=False,
function slow_tregex (line 911) | def slow_tregex(metadata=False,
function get_stats (line 1007) | def get_stats(from_df=False, metadata=False, feature=False, root=False, ...
function get_corefs (line 1081) | def get_corefs(df, matches):
function pipeline (line 1097) | def pipeline(f=False,
function load_raw_data (line 1254) | def load_raw_data(f):
function get_speaker_from_offsets (line 1270) | def get_speaker_from_offsets(stripped, plain, sent_offsets,
function convert_json_to_conll (line 1324) | def convert_json_to_conll(path,
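
The .conll files under data/ (previewed later in this document) store one token per tab-separated line, with sentence-level metadata such as '# parse=(ROOT ...' in comment lines. The following is a generic illustration of loading such a file into pandas, not corpkit's parse_conll, which presumably builds a richer, indexed DataFrame for the id-based lookups listed above.

import pandas as pd

def read_conll(path):
    """Naive CoNLL reader: skip comment/blank lines, split token rows on tabs."""
    rows = []
    with open(path) as fo:
        for line in fo:
            line = line.rstrip('\n')
            if not line or line.startswith('#'):
                continue  # e.g. '# parse=(ROOT ...' sentence metadata
            rows.append(line.split('\t'))
    return pd.DataFrame(rows)

df = read_conll('data/test-plain-parsed/first/intro.txt.conll')
print(df.shape)
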
FILE: corpkit/corpus.py
class Corpus (line 11) | class Corpus(object):
method __init__ (line 24) | def __init__(self, path, **kwargs):
method subcorpora (line 142) | def subcorpora(self):
method speakerlist (line 167) | def speakerlist(self):
method files (line 175) | def files(self):
method all_filepaths (line 195) | def all_filepaths(self):
method conll_conform (line 209) | def conll_conform(self, errors='raise'):
method all_files (line 251) | def all_files(self):
method tfidf (line 265) | def tfidf(self, search={'w': 'any'}, show=['w'], **kwargs):
method __str__ (line 293) | def __str__(self):
method __repr__ (line 318) | def __repr__(self):
method __getitem__ (line 330) | def __getitem__(self, key):
method __delitem__ (line 353) | def __delitem__(self, key):
method features (line 370) | def features(self):
method _get_postags_and_wordclasses (line 410) | def _get_postags_and_wordclasses(self):
method wordclasses (line 446) | def wordclasses(self):
method postags (line 464) | def postags(self):
method lexicon (line 483) | def lexicon(self, **kwargs):
method configurations (line 512) | def configurations(self, search, **kwargs):
method interrogate (line 551) | def interrogate(self, search='w', *args, **kwargs):
method sample (line 813) | def sample(self, n, level='f'):
method delete_metadata (line 849) | def delete_metadata(self):
method metadata (line 857) | def metadata(self):
method parse (line 864) | def parse(self,
method tokenise (line 963) | def tokenise(self, postag=True, lemmatise=True, *args, **kwargs):
method concordance (line 997) | def concordance(self, *args, **kwargs):
method interroplot (line 1035) | def interroplot(self, search, **kwargs):
method save (line 1058) | def save(self, savename=False, **kwargs):
method make_language_model (line 1074) | def make_language_model(self,
method annotate (line 1128) | def annotate(self, conclines, annotation, dry_run=True):
method unannotate (line 1154) | def unannotate(annotation, dry_run=True):
class Subcorpus (line 1166) | class Subcorpus(Corpus):
method __init__ (line 1174) | def __init__(self, path, datatype, **kwa):
method __str__ (line 1181) | def __str__(self):
method __repr__ (line 1184) | def __repr__(self):
method __getitem__ (line 1187) | def __getitem__(self, key):
class File (line 1207) | class File(Corpus):
method __init__ (line 1216) | def __init__(self, path, dirname=False, datatype=False, **kwa):
method __repr__ (line 1231) | def __repr__(self):
method __str__ (line 1234) | def __str__(self):
method read (line 1237) | def read(self, **kwargs):
method document (line 1248) | def document(self):
method trees (line 1264) | def trees(self):
method plain (line 1277) | def plain(self):
class Datalist (line 1290) | class Datalist(list):
method __init__ (line 1292) | def __init__(self, data, **kwargs):
method __repr__ (line 1299) | def __repr__(self):
method __getattr__ (line 1302) | def __getattr__(self, key):
method __getitem__ (line 1307) | def __getitem__(self, key):
method __delitem__ (line 1328) | def __delitem__(self, key):
method interrogate (line 1337) | def interrogate(self, *args, **kwargs):
method concordance (line 1353) | def concordance(self, *args, **kwargs):
method configurations (line 1364) | def configurations(self, search, **kwargs):
class Corpora (line 1375) | class Corpora(Datalist):
method __init__ (line 1387) | def __init__(self, data=False, **kwargs):
method __repr__ (line 1420) | def __repr__(self):
method parse (line 1423) | def parse(self, **kwargs):
method features (line 1441) | def features(self):
method postags (line 1453) | def postags(self):
method wordclasses (line 1461) | def wordclasses(self):
method lexicon (line 1469) | def lexicon(self):
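
Taken together, these signatures suggest the basic workflow: build a Corpus from a path, then interrogate it to get an Interrogation whose results and totals are pandas objects. A hedged sketch using the test data shipped in this repo; the search dict and show list follow the convention visible elsewhere in this index (e.g. tfidf's search={'w': 'any'}, show=['w'] defaults), and exact keys and return values may differ.

from corpkit import Corpus

corpus = Corpus('data/test-plain-parsed')   # parsed test corpus included in the repo
print(corpus.subcorpora)                    # Datalist of Subcorpus objects
print(corpus.files)                         # Datalist of File objects

# search is a dict of {attribute: pattern}; 'w' appears to mean word form
result = corpus.interrogate(search={'w': r'lingu'}, show=['w'])
print(result.results)                       # frequencies by subcorpus (pandas)
print(result.totals)
lines = corpus.concordance(search={'w': r'lingu'})
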
FILE: corpkit/cql.py
function remake_special (line 5) | def remake_special(querybit, customs=False, return_list=False, **kwargs):
function parse_quant (line 49) | def parse_quant(quant):
function process_piece (line 61) | def process_piece(piece, op='=', quant=False, **kwargs):
function tokenise_cql (line 90) | def tokenise_cql(query):
function to_corpkit (line 144) | def to_corpkit(cstring, **kwargs):
function to_cql (line 178) | def to_cql(dquery, exclude=False, **kwargs):
FILE: corpkit/dictionaries/bnc.py
function _get_bnc (line 1) | def _get_bnc():
FILE: corpkit/dictionaries/process_types.py
function _verbs (line 13) | def _verbs():
function load_verb_data (line 19) | def load_verb_data():
function find_lexeme (line 49) | def find_lexeme(verb):
function get_both_spellings (line 81) | def get_both_spellings(verb_list):
function add_verb_inflections (line 95) | def add_verb_inflections(verb_list):
class Wordlist (line 164) | class Wordlist(list):
method __init__ (line 167) | def __init__(self, data, **kwargs):
method words (line 175) | def words(self):
method lemmata (line 183) | def lemmata(self):
method as_regex (line 190) | def as_regex(self, boundaries='w', case_sensitive=False, inverse=False...
class Processes (line 201) | class Processes(object):
method __init__ (line 203) | def __init__(self):
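
Wordlist is a list subclass whose entries can be collapsed into a single regular expression via as_regex. The following is a rough illustration of that idea, not corpkit's implementation; the boundary handling and flags are assumptions based only on the keyword arguments shown above.

import re

class Wordlist(list):
    def as_regex(self, boundaries='w', case_sensitive=False):
        bound = r'\b' if boundaries == 'w' else ''
        flags = 0 if case_sensitive else re.IGNORECASE
        body = '|'.join(re.escape(w) for w in self)
        return re.compile(r'%s(?:%s)%s' % (bound, body, bound), flags)

mental = Wordlist(['think', 'know', 'believe'])
print(bool(mental.as_regex().search('We know this.')))  # True
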
FILE: corpkit/dictionaries/queries.py
class Queries (line 8) | class Queries(object):
method __init__ (line 10) | def __init__(self):
FILE: corpkit/dictionaries/roles.py
function translator (line 3) | def translator():
FILE: corpkit/dictionaries/wordlists.py
function closed_class_wordlists (line 7) | def closed_class_wordlists():
FILE: corpkit/download/corenlp.py
function corenlp_downloader (line 1) | def corenlp_downloader(custompath=False):
FILE: corpkit/editor.py
function editor (line 7) | def editor(interrogation,
FILE: corpkit/env.py
function save_history (line 144) | def save_history(history_path=history_path):
class Objects (line 177) | class Objects(object):
method __init__ (line 183) | def __init__(self):
method _get (line 247) | def _get(self, name):
function interpreter (line 269) | def interpreter(debug=False,
function install (line 2385) | def install(name, loc):
FILE: corpkit/gui.py
class SplashScreen (line 73) | class SplashScreen(object):
method __init__ (line 77) | def __init__(self, tkRoot, imageFilename, minSplashTime=0):
method __enter__ (line 97) | def __enter__(self):
method __exit__ (line 136) | def __exit__(self, exc_type, exc_value, traceback ):
class RedirectText (line 155) | class RedirectText(object):
method __init__ (line 158) | def __init__(self, text_ctrl, log_text, text_widget):
method write (line 171) | def write(self, string):
class Label2 (line 198) | class Label2(Frame):
method __init__ (line 200) | def __init__(self, master, width=0, height=0, **kwargs):
method pack (line 209) | def pack(self, *args, **kwargs):
method grid (line 213) | def grid(self, *args, **kwargs):
class HyperlinkManager (line 218) | class HyperlinkManager:
method __init__ (line 220) | def __init__(self, text):
method reset (line 227) | def reset(self):
method add (line 229) | def add(self, action):
method _enter (line 235) | def _enter(self, event):
method _leave (line 237) | def _leave(self, event):
method _click (line 239) | def _click(self, event):
class Notebook (line 245) | class Notebook(Frame):
method __init__ (line 247) | def __init__(self, parent, activerelief=RAISED, inactiverelief=FLAT,
method change_tab (line 327) | def change_tab(self, IDNum):
method add_tab (line 345) | def add_tab(self, width=2, **kw):
method destroy_tab (line 360) | def destroy_tab(self, tab):
method focus_on (line 373) | def focus_on(self, tab):
function corpkit_gui (line 384) | def corpkit_gui(noupdate=False, loadcurrent=False, debug=False):
function install (line 7138) | def install(name, loc):
FILE: corpkit/inflect.py
function definite_article (line 73) | def definite_article(word):
function indefinite_article (line 76) | def indefinite_article(word):
function article (line 88) | def article(word, function=INDEFINITE):
function referenced (line 95) | def referenced(word, article=INDEFINITE):
function pluralize (line 389) | def pluralize(word, pos=NOUN, custom={}, classical=True):
function singularize (line 594) | def singularize(word, pos=NOUN, custom={}):
function _count_syllables (line 652) | def _count_syllables(word):
function grade (line 663) | def grade(adjective, suffix=COMPARATIVE):
function comparative (line 695) | def comparative(adjective):
function superlative (line 698) | def superlative(adjective):
function attributive (line 703) | def attributive(adjective):
function predicative (line 706) | def predicative(adjective):
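
inflect.py is adapted from the pattern.en inflection module (see its preview later in this file). A small usage sketch follows; the outputs in the comments are expectations rather than values verified against this copy.

from corpkit.inflect import pluralize, singularize, comparative, superlative

print(pluralize('analysis'))   # expected: 'analyses'
print(singularize('words'))    # expected: 'word'
print(comparative('large'))    # expected: 'larger'
print(superlative('large'))    # expected: 'largest'
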
FILE: corpkit/interrogation.py
class Interrogation (line 11) | class Interrogation(object):
method __init__ (line 18) | def __init__(self, results=None, totals=None, query=None, concordance=...
method __str__ (line 29) | def __str__(self):
method __repr__ (line 40) | def __repr__(self):
method edit (line 46) | def edit(self, *args, **kwargs):
method sort (line 244) | def sort(self, way, **kwargs):
method visualise (line 248) | def visualise(self,
method multiplot (line 372) | def multiplot(self, leftdict={}, rightdict={}, **kwargs):
method language_model (line 376) | def language_model(self, name, *args, **kwargs):
method save (line 387) | def save(self, savename, savedir='saved_interrogations', **kwargs):
method quickview (line 411) | def quickview(self, n=25):
method tabview (line 430) | def tabview(self, **kwargs):
method asciiplot (line 434) | def asciiplot(self,
method rel (line 478) | def rel(self, denominator='self', **kwargs):
method keyness (line 481) | def keyness(self, measure='ll', denominator='self', **kwargs):
method multiindex (line 484) | def multiindex(self, indexnames=None):
method topwords (line 513) | def topwords(self, datatype='n', n=10, df=False, sort=True, precision=2):
method perplexity (line 549) | def perplexity(self):
method entropy (line 570) | def entropy(self):
method shannon (line 581) | def shannon(self):
class Concordance (line 585) | class Concordance(pd.core.frame.DataFrame):
method __init__ (line 590) | def __init__(self, data):
method format (line 595) | def format(self, kind='string', n=100, window=35,
method calculate (line 629) | def calculate(self):
method shuffle (line 634) | def shuffle(self, inplace=False):
method edit (line 659) | def edit(self, *args, **kwargs):
method __str__ (line 669) | def __str__(self):
method __repr__ (line 672) | def __repr__(self):
method less (line 675) | def less(self, **kwargs):
class Interrodict (line 679) | class Interrodict(OrderedDict):
method __init__ (line 698) | def __init__(self, data):
method __getitem__ (line 708) | def __getitem__(self, key):
method __setitem__ (line 730) | def __setitem__(self, key, value):
method __repr__ (line 735) | def __repr__(self):
method __str__ (line 738) | def __str__(self):
method edit (line 741) | def edit(self, *args, **kwargs):
method multiindex (line 752) | def multiindex(self, indexnames=False):
method save (line 843) | def save(self, savename, savedir='saved_interrogations', **kwargs):
method collapse (line 867) | def collapse(self, axis='y'):
method topwords (line 945) | def topwords(self, datatype='n', n=10, df=False, sort=True, precision=2):
method visualise (line 978) | def visualise(self, shape='auto', truncate=8, **kwargs):
method copy (line 1004) | def copy(self):
method flip (line 1011) | def flip(self, truncate=30, transpose=True, repeat=False, *args, **kwa...
method get_totals (line 1065) | def get_totals(self):
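
An Interrogation can then be transformed and stored with the methods listed above. This is again a hedged sketch that relies only on the argument defaults visible in the index; the 'total' sort key is an assumed value for the 'way' parameter.

from corpkit import Corpus

result = Corpus('data/test-plain-parsed').interrogate(search={'w': r'lingu'}, show=['w'])
rel = result.rel(denominator='self')   # relative frequencies
key = result.keyness(measure='ll')     # log-likelihood keyness scores
result.topwords(n=10)                  # show the ten most frequent items
srt = result.sort('total')             # 'total' is an assumed sort key
result.save('lingu_words')             # stored under ./saved_interrogations/ by default
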
FILE: corpkit/interrogator.py
function interrogator (line 8) | def interrogator(corpus,
FILE: corpkit/keys.py
function keywords (line 6) | def keywords(target_corpus,
FILE: corpkit/lazyprop.py
function lazyprop (line 4) | def lazyprop(fn):
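
lazyprop is a cached-property decorator: the wrapped method runs once, its value is stored on the instance, and later accesses return the cached value. A generic sketch of the pattern is below; corpkit's own file repeats the definition many times so each property keeps its docstring, per its own comment shown later in this document.

from functools import wraps

def lazyprop(fn):
    attr_name = '_lazy_' + fn.__name__

    @property
    @wraps(fn)
    def _lazyprop(self):
        # compute on first access, then reuse the stored value
        if not hasattr(self, attr_name):
            setattr(self, attr_name, fn(self))
        return getattr(self, attr_name)
    return _lazyprop

class Demo(object):
    @lazyprop
    def expensive(self):
        print('computing once')
        return 42

d = Demo()
print(d.expensive, d.expensive)  # 'computing once' is printed a single time
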
FILE: corpkit/make.py
function make_corpus (line 4) | def make_corpus(unparsed_corpus_path,
FILE: corpkit/model.py
class LanguageModel (line 11) | class LanguageModel(object):
method __init__ (line 12) | def __init__(self, order, alpha, data):
method _logprob (line 42) | def _logprob(self, ngram):
method _prob (line 45) | def _prob(self, ngram):
class MultiModel (line 57) | class MultiModel(dict):
method __init__ (line 59) | def __init__(self, data, order, name='', **kwargs):
method score (line 77) | def score(self, data, **kwargs):
method _score_counts_against_model (line 111) | def _score_counts_against_model(self, counts, model):
method _turn_file_obj_into_counts (line 122) | def _turn_file_obj_into_counts(self, data, *args, **kwargs):
method score_subcorpora (line 132) | def score_subcorpora(self):
function _make_model_from_interro (line 142) | def _make_model_from_interro(self, name, order, **kwargs):
function _train (line 187) | def _train(data, name, corpusname, order=3, **kwargs):
FILE: corpkit/multiprocess.py
function pmultiquery (line 5) | def pmultiquery(corpus,
FILE: corpkit/noseinstall.py
function test_import (line 8) | def test_import():
FILE: corpkit/nosetests.py
function test_import (line 31) | def test_import():
function test_corpus_class (line 37) | def test_corpus_class():
function test_parse (line 42) | def test_parse():
function test_tokenise (line 58) | def test_tokenise():
function test_speak_parse (line 73) | def test_speak_parse():
function test_interro1 (line 93) | def test_interro1():
function test_interro2 (line 99) | def test_interro2():
function test_interro3 (line 105) | def test_interro3():
function test_interro_multiindex_tregex_justspeakers (line 131) | def test_interro_multiindex_tregex_justspeakers():
function test_conc (line 143) | def test_conc():
function test_edit (line 150) | def test_edit():
function test_tok1_interro (line 160) | def test_tok1_interro():
function test_tok2_interro (line 171) | def test_tok2_interro():
function document_check (line 187) | def document_check():
function test_conc_edit (line 201) | def test_conc_edit():
function test_symbolic_subcorpora (line 211) | def test_symbolic_subcorpora():
function test_symbolic_multiindex (line 220) | def test_symbolic_multiindex():
function check_skip_filt (line 231) | def check_skip_filt():
function check_just_filt (line 239) | def check_just_filt():
function test_interpreter (line 247) | def test_interpreter():
function check_interpreter_res_csv (line 269) | def check_interpreter_res_csv():
function check_interpreter_conc_csv (line 277) | def check_interpreter_conc_csv():
function check_interpreter_saved_interro (line 287) | def check_interpreter_saved_interro():
FILE: corpkit/other.py
function quickview (line 9) | def quickview(results, n=25):
function concprinter (line 104) | def concprinter(dataframe, kind='string', n=100,
function save (line 202) | def save(interrogation, savename, savedir='saved_interrogations', **kwar...
function load (line 314) | def load(savename, loaddir='saved_interrogations'):
function loader (line 353) | def loader(savedir='saved_interrogations'):
function new_project (line 380) | def new_project(name, loc='.', **kwargs):
function load_all_results (line 446) | def load_all_results(data_dir='saved_interrogations', **kwargs):
function texify (line 494) | def texify(series, n=20, colname='Keyness', toptail=False, sort=False):
function as_regex (line 522) | def as_regex(lst, boundaries='w', case_sensitive=False, inverse=False, c...
function make_multi (line 581) | def make_multi(interrogation, indexnames=None):
function topwords (line 701) | def topwords(self, datatype='n', n=10, df=False, sort=True, precision=2):
FILE: corpkit/plotter.py
function plotter (line 4) | def plotter(df,
function multiplotter (line 1169) | def multiplotter(df, leftdict={},rightdict={}, **kwargs):
FILE: corpkit/plugins.py
class HighlightLines (line 5) | class HighlightLines(plugins.PluginBase):
method __init__ (line 32) | def __init__(self, lines):
class InteractiveLegendPlugin (line 42) | class InteractiveLegendPlugin(plugins.PluginBase):
method __init__ (line 193) | def __init__(self, plot_elements, labels, ax=None,
method _determine_mpld3ids (line 211) | def _determine_mpld3ids(self, plot_elements):
FILE: corpkit/process.py
function tregex_engine (line 9) | def tregex_engine(corpus=False,
function show (line 255) | def show(lines, index, show='thread'):
function add_corpkit_to_path (line 261) | def add_corpkit_to_path():
function add_nltk_data_to_nltk_path (line 274) | def add_nltk_data_to_nltk_path(**kwargs):
function get_gui_resource_dir (line 288) | def get_gui_resource_dir():
function get_fullpath_to_jars (line 313) | def get_fullpath_to_jars(path_var):
function determine_datatype (line 363) | def determine_datatype(path):
function filtermaker (line 398) | def filtermaker(the_filter, case_sensitive=False, **kwargs):
function searchfixer (line 446) | def searchfixer(search, query, datatype=False):
function is_number (line 460) | def is_number(s):
function animator (line 476) | def animator(progbar,
function parse_just_speakers (line 567) | def parse_just_speakers(just_speakers, corpus):
function get_deps (line 581) | def get_deps(sentence, dep_type):
function timestring (line 589) | def timestring(input):
function makesafe (line 595) | def makesafe(variabletext, drop_datatype=True, hyphens_ok=False):
function interrogation_from_conclines (line 616) | def interrogation_from_conclines(newdata):
function checkstack (line 641) | def checkstack(the_string):
function check_tex (line 651) | def check_tex(have_ipython=True):
function get_corenlp_path (line 673) | def get_corenlp_path(corenlppath):
function unsplitter (line 729) | def unsplitter(data):
function classname (line 764) | def classname(cls):
function format_middle (line 769) | def format_middle(tree, show, df=False, sent_id=False, ixs=False):
function format_conc (line 794) | def format_conc(tups, show, df=False, sent_id=False, root=False, ixs=Fal...
function show_tree_as_per_option (line 824) | def show_tree_as_per_option(show, tree, sent=False, df=False,
function tgrep (line 866) | def tgrep(parse_string, search):
function canpickle (line 882) | def canpickle(obj):
function sanitise_dict (line 907) | def sanitise_dict(d):
function saferead (line 923) | def saferead(path):
function urlify (line 951) | def urlify(s):
function gui (line 962) | def gui():
function dictformat (line 971) | def dictformat(d, query=False):
function fix_search (line 1017) | def fix_search(search, case_sensitive=False, root=False):
function pat_format (line 1069) | def pat_format(pat, case_sensitive=False, root=False):
function make_name_to_query_dict (line 1094) | def make_name_to_query_dict(existing={}, cols=False, dtype=False):
function auto_usecols (line 1124) | def auto_usecols(search, exclude, show, usecols, coref=False):
function format_tregex (line 1192) | def format_tregex(results,
function make_conc_lines_from_whole_mid (line 1279) | def make_conc_lines_from_whole_mid(wholes,
function gettag (line 1323) | def gettag(query, lemmatag=False):
function lemmatiser (line 1341) | def lemmatiser(list_of_words, tag, translated_option,
function get_first_df (line 1359) | def get_first_df(corpus):
function make_dotfile (line 1375) | def make_dotfile(path, return_json=False, data_dict=False):
function get_corpus_metadata (line 1401) | def get_corpus_metadata(path, generate=False):
function make_df_json_name (line 1434) | def make_df_json_name(typ, subcorpora=False):
function add_df_to_dotfile (line 1442) | def add_df_to_dotfile(path, df, typ='features', subcorpora=False):
function delete_files_and_subcorpora (line 1453) | def delete_files_and_subcorpora(corpus, skip_metadata, just_metadata):
FILE: corpkit/stats.py
function tfidf (line 5) | def tfidf(self, search={'w': 'any'}, show=['w'], **kwargs):
function translate_show_for_surprisal (line 39) | def translate_show_for_surprisal(show, gramsize):
function surprisal (line 45) | def surprisal(self,
function shannon (line 64) | def shannon(self):
FILE: corpkit/textprogressbar.py
class TextProgressBar (line 5) | class TextProgressBar:
method __init__ (line 10) | def __init__(self, iterations, dirname=False, quiet=False):
method animate_ipython (line 20) | def animate_ipython(self, iter, dirname=None, quiet=False):
method update_iteration (line 31) | def update_iteration(self, elapsed_iter, dirname=None):
method __update_amount (line 40) | def __update_amount(self, new_amount, dirname=None):
method __str__ (line 62) | def __str__(self):
FILE: corpkit/tokenise.py
function nested_list_to_pandas (line 7) | def nested_list_to_pandas(toks):
function pos_tag_series (line 23) | def pos_tag_series(ser, tagger):
function lemmatise_series (line 34) | def lemmatise_series(words, postags, lemmatiser):
function write_df_to_conll (line 56) | def write_df_to_conll(df, newf, plain=False, stripped=False,
function new_fname (line 92) | def new_fname(oldpath, inpath):
function process_meta (line 104) | def process_meta(data, speaker_segmentation, metadata):
function plaintext_to_conll (line 123) | def plaintext_to_conll(inpath,
FILE: setup.py
class CustomInstallCommand (line 7) | class CustomInstallCommand(install):
method run (line 12) | def run(self):
================================================
CONDENSED PREVIEW (98 files; each entry shows path, character count, and a content snippet of the full 2,627K-character structured content)
================================================
[
{
"path": ".gitattributes",
"chars": 29,
"preview": "*.p linguist-language=Python\n"
},
{
"path": ".gitmodules",
"chars": 101,
"preview": "[submodule \"corpkit-app\"]\n\tpath = corpkit-app\n\turl = https://github.com/interrogator/corpkit-app.git\n"
},
{
"path": ".travis.yml",
"chars": 1320,
"preview": "language: python\npython:\n- '2.7'\n- '3.5'\ninstall: \n- pip install --install-option=\"--no-cython-compile\" cython\n- pip ins"
},
{
"path": "API-README.md",
"chars": 51037,
"preview": "## *corpkit*: API readme\n\n> This file is a deprecated introduction to the *corpkit* Python API. It still exists because "
},
{
"path": "Dockerfile",
"chars": 1700,
"preview": "FROM alpine:latest\nMAINTAINER interro_gator\n\n# set up a workspace so we can cache python stuff\nRUN rm -rf /.src && mkdir"
},
{
"path": "LICENSE",
"chars": 1110,
"preview": "The MIT License (MIT)\n\nCopyright (c) 2015 Daniel McDonald\nmcdonaldd, at, unimelb.edu\n\nPermission is hereby granted, free"
},
{
"path": "Makefile",
"chars": 7413,
"preview": "# Makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS =\nSPHINXBUILD "
},
{
"path": "README.md",
"chars": 12062,
"preview": "# corpkit: sophisticated corpus linguistics\n\n[:\n \"\"\"\n If th"
},
{
"path": "corpkit/blanknotebook.ipynb",
"chars": 4566,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# blanknotebook\"\n ]\n },\n {\n \""
},
{
"path": "corpkit/build.py",
"chars": 25507,
"preview": "from __future__ import print_function\nfrom corpkit.constants import STRINGTYPE, PYTHON_VERSION, INPUTFUNC\n\n\"\"\"\nThis file"
},
{
"path": "corpkit/completer.py",
"chars": 642,
"preview": "class Completer(object):\n \"\"\"\n Tab completion for interpreter\n \"\"\"\n\n def __init__(self, words):\n self"
},
{
"path": "corpkit/configurations.py",
"chars": 4322,
"preview": "def configurations(corpus, search, **kwargs):\n \"\"\"\n Get summary of behaviour of a word\n\n see corpkit.corpus.Cor"
},
{
"path": "corpkit/conll.py",
"chars": 52013,
"preview": "\"\"\"\ncorpkit: process CONLL formatted data\n\"\"\"\n\ndef parse_conll(f,\n first_time=False,\n just"
},
{
"path": "corpkit/constants.py",
"chars": 2815,
"preview": "import sys\nimport codecs\n\n# python 2/3 coompatibility\nPYTHON_VERSION = sys.version_info.major\nSTRINGTYPE = str if PYTHON"
},
{
"path": "corpkit/corpkit",
"chars": 1569,
"preview": "#!/usr/bin/env python\n\n\"\"\"\nA script to start the corpkit interpeter with options\n\"\"\"\n\nimport sys\nimport os\n\n# determine "
},
{
"path": "corpkit/corpkit.1",
"chars": 773,
"preview": ".TH corpkit 1\n.SH NAME\ncorpkit \\- corpus linguistics interface\n.SH SYNOPSIS\n.B corpkit\n[\\fB\\-c\\fR \\fICOMMAND\\fR]\n[\\fB\\-d"
},
{
"path": "corpkit/corpus.py",
"chars": 56841,
"preview": "\"\"\"\ncorpkit: Corpus and Corpus-like objects\n\"\"\"\n\nfrom __future__ import print_function\n\nfrom lazyprop import lazyprop\nfr"
},
{
"path": "corpkit/cql.py",
"chars": 6040,
"preview": "\"\"\"\nTranslating between CQL and corpkit's native \n\"\"\"\n\ndef remake_special(querybit, customs=False, return_list=False, **"
},
{
"path": "corpkit/dictionaries/__init__.py",
"chars": 774,
"preview": "__all__ = [\"wordlists\", \"roles\", \"bnc\", \"processes\", \"verbs\", \n \"uktous\", \"tagtoclass\", \"queries\", \"mergetags\""
},
{
"path": "corpkit/dictionaries/bnc.p",
"chars": 136956,
"preview": "ccopy_reg\n_reconstructor\np0\n(ccollections\nCounter\np1\nc__builtin__\ndict\np2\n(dp3\nVsecondly\np4\nI29\nsVwritings\np5\nI11\nsVpard"
},
{
"path": "corpkit/dictionaries/bnc.py",
"chars": 684,
"preview": "def _get_bnc():\n \"\"\"Load the BNC\"\"\"\n import corpkit\n try:\n import cPickle as pickle\n except ImportErr"
},
{
"path": "corpkit/dictionaries/eng_verb_lexicon.p",
"chars": 915423,
"preview": "(dp0\nFnan\n(lp1\nS\"belly-laughs'\"\np2\naS'belly-laughing'\np3\naS'belly-laughed'\np4\naS'belly-laughed'\np5\nasS'fawn'\np6\n(lp7\nS'f"
},
{
"path": "corpkit/dictionaries/process_types.py",
"chars": 20316,
"preview": "#!/usr/bin/python\n\n# dictionaries: process type wordlists\n# Author: Daniel McDonald\n\n# make regular expressions and "
},
{
"path": "corpkit/dictionaries/queries.py",
"chars": 911,
"preview": "\ntry:\n from corpkit.lazyprop import lazyprop\nexcept:\n import corpkit\n from lazyprop import lazyprop\n\nclass Quer"
},
{
"path": "corpkit/dictionaries/roles.py",
"chars": 3092,
"preview": "# This file translates CoreNLP labels into SFL categories\n\ndef translator():\n from collections import namedtuple\n "
},
{
"path": "corpkit/dictionaries/stopwords.py",
"chars": 13406,
"preview": "\n# stopwords from spindle\nfrom corpkit.dictionaries.process_types import Wordlist\nstopwords = Wordlist([\"yeah\", \"monday\""
},
{
"path": "corpkit/dictionaries/verblist.py",
"chars": 99966,
"preview": "allverbs = [\"bird's-nest\",\n 'abandon',\n 'abase',\n 'abash',\n 'abate',\n 'abbreviate',\n 'abdicate',\n 'abduct',\n 'abet',\n 'a"
},
{
"path": "corpkit/dictionaries/word_transforms.py",
"chars": 72019,
"preview": "\"\"\"\ncorpkit: manual word cludging\n\"\"\"\n\n# WordNet/CoreNLP lemmatiser are used for lemmatisation, but they can both \n# str"
},
{
"path": "corpkit/dictionaries/wordlists.py",
"chars": 17459,
"preview": "\"\"\"\nlists of closed class words\n\"\"\"\n\n# feel free to correct/add things---this was just a quick grab from the web\n\ndef cl"
},
{
"path": "corpkit/download/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "corpkit/download/corenlp.py",
"chars": 826,
"preview": "def corenlp_downloader(custompath=False):\n \"\"\"\n Very simple CoreNLP downloader\n\n :param custompath: A path wher"
},
{
"path": "corpkit/editor.py",
"chars": 35415,
"preview": "\"\"\"\ncorpkit: edit Interrogation, Concordance and Interrodict objects\n\"\"\"\nfrom __future__ import print_function\nfrom corp"
},
{
"path": "corpkit/env.py",
"chars": 91557,
"preview": "\"\"\"\nA corpkit interpreter, with natural language commands.\n\ntodo:\n\n* documentation\n* handling of kwargs tuples etc\n* che"
},
{
"path": "corpkit/gui.py",
"chars": 316400,
"preview": "#!/usr/bin/env python\n\n\"\"\"\n# corpkit GUI\n# Daniel McDonald\n\n# This file conains the frontend side of the corpkit gui.\n# "
},
{
"path": "corpkit/inflect.py",
"chars": 30573,
"preview": "#### PATTERN | EN | INFLECT ########################################################################\n# -*- coding: utf-8"
},
{
"path": "corpkit/interpreter_tests.cki",
"chars": 489,
"preview": "# this code is written in corpkit interpreter language.\n# it is used to test that the interpreter works properly\n\nset te"
},
{
"path": "corpkit/interrogation.py",
"chars": 39908,
"preview": "\"\"\"\ncorpkit: `Int`errogation and Interrogation-like classes\n\"\"\"\n\nfrom __future__ import print_function\n\nfrom collections"
},
{
"path": "corpkit/interrogator.py",
"chars": 40018,
"preview": "\"\"\"\ncorpkit: Interrogate a parsed corpus\n\"\"\"\n\nfrom __future__ import print_function\nfrom corpkit.constants import STRING"
},
{
"path": "corpkit/keys.py",
"chars": 5819,
"preview": "\"\"\"corpkit: simple keyworder\"\"\"\n\nfrom __future__ import print_function\nfrom corpkit.constants import STRINGTYPE, PYTHON_"
},
{
"path": "corpkit/layouts.py",
"chars": 2943,
"preview": "\"\"\"\nThis file contains a dictionary of matplotlib subplot layouts\n\nThey are used during multiplotting. Sooner or later t"
},
{
"path": "corpkit/lazyprop.py",
"chars": 3389,
"preview": "# this file duplicates lazyprop many times because i can't work out how to\n# automatically add the right docstring...\n\nd"
},
{
"path": "corpkit/make.py",
"chars": 15857,
"preview": "from __future__ import print_function\n\nfrom corpkit.constants import INPUTFUNC, PYTHON_VERSION\ndef make_corpus(unparsed_"
},
{
"path": "corpkit/model.py",
"chars": 6817,
"preview": "\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport math\nimport os\nfrom nltk import ngrams, s"
},
{
"path": "corpkit/multiprocess.py",
"chars": 15128,
"preview": "\"\"\"corpkit: multiprocessing of interrogations\"\"\"\n\nfrom __future__ import print_function\n\ndef pmultiquery(corpus, \n "
},
{
"path": "corpkit/new_project",
"chars": 296,
"preview": "#!/usr/bin/env python\n\nfrom __future__ import print_function\n\n\"\"\"\nA script to create a new corpkit project\n\"\"\"\n\nimport s"
},
{
"path": "corpkit/noseinstall.py",
"chars": 315,
"preview": "import os\nfrom nose.tools import assert_equals\nfrom corpkit import *\n\nunparsed_path = 'data/test'\nparsed_path = 'data/te"
},
{
"path": "corpkit/nosetests.py",
"chars": 9541,
"preview": "\"\"\"\nThis file contains tests for the corpkit API, to be run by Nose.\n\nThere are fast and slow tests. Slow tests include "
},
{
"path": "corpkit/other.py",
"chars": 27102,
"preview": "from __future__ import print_function\n\n\"\"\"\nIn here are functions used internally by corpkit, but also\nmight be called by"
},
{
"path": "corpkit/parse",
"chars": 847,
"preview": "#!/usr/bin/env python\n\nfrom __future__ import print_function\n\n\"\"\"\nA script to parse using corpkit\n\n:Example:\n\n$ parse ju"
},
{
"path": "corpkit/plotter.py",
"chars": 45579,
"preview": "from __future__ import print_function\nfrom corpkit.constants import STRINGTYPE, PYTHON_VERSION\n\ndef plotter(df,\n "
},
{
"path": "corpkit/plugins.py",
"chars": 8690,
"preview": "import mpld3\nimport collections\nfrom mpld3 import plugins, utils\n\nclass HighlightLines(plugins.PluginBase):\n\n \"\"\"A pl"
},
{
"path": "corpkit/process.py",
"chars": 52888,
"preview": "\"\"\"\nIn here are functions used internally by corpkit, \nnot intended to be called by users.\n\"\"\"\n\nfrom __future__ import "
},
{
"path": "corpkit/stats.py",
"chars": 2410,
"preview": "\"\"\"\nscikit-learn stuff\n\"\"\"\n\ndef tfidf(self, search={'w': 'any'}, show=['w'], **kwargs):\n \"\"\"\n Generate TF-IDF vect"
},
{
"path": "corpkit/textprogressbar.py",
"chars": 2515,
"preview": "#!/usr/bin/python\n\nfrom __future__ import print_function\n\nclass TextProgressBar:\n \"\"\"a text progress bar for CLI oper"
},
{
"path": "corpkit/tokenise.py",
"chars": 6993,
"preview": "from __future__ import print_function\n\n\"\"\"\nTokenise, POS tag and lemmatise a corpus, returning CONLL-U data\n\"\"\"\n\ndef nes"
},
{
"path": "corpkit/tregex.sh",
"chars": 134,
"preview": "#!/bin/sh\nscriptdir=`dirname $0`\n\njava -mx100m -cp \"$scriptdir/stanford-tregex.jar:\" edu.stanford.nlp.trees.tregex.Trege"
},
{
"path": "data/corpus-filelist.txt",
"chars": 139,
"preview": "/Users/daniel/Work/corpkit/corpkit/data/test-stripped/first/intro.txt\n/Users/daniel/Work/corpkit/corpkit/data/test-strip"
},
{
"path": "data/test/first/intro.txt",
"chars": 319,
"preview": "TESTER: This small corpus is used in corpkit's tests. Not a lot of data is required. <metadata test=\"on\" year=\"2004\">\nAN"
},
{
"path": "data/test/second/body.txt",
"chars": 374,
"preview": "Corpus linguistics and computational linguistics, like concordancing and interrogating, are situated on a vast continuum"
},
{
"path": "data/test-plain-parsed/first/intro.txt.conll",
"chars": 2515,
"preview": "# parse=(ROOT (FRAG (NP (NN TESTER)) (: :) (S (NP (DT This) (JJ small) (NN corpus)) (VP (VBZ is) (VP (VBN used) (PP (IN "
},
{
"path": "data/test-plain-parsed/second/body.txt.conll",
"chars": 2832,
"preview": "# parse=(ROOT (S (NP (NP (NNP Corpus) (NNS linguistics)) (CC and) (NP (JJ computational) (NNS linguistics))) (, ,) (PP ("
},
{
"path": "data/test-speak-parsed/first/intro.txt.conll",
"chars": 2221,
"preview": "# parse=(ROOT (S (NP (DT This) (JJ small) (NN corpus)) (VP (VBZ is) (VP (VBN used) (PP (IN in) (NP (NP (NN corpkit) (POS"
},
{
"path": "data/test-speak-parsed/second/body.txt.conll",
"chars": 2742,
"preview": "# parse=(ROOT (S (NP (NP (NNP Corpus) (NNS linguistics)) (CC and) (NP (JJ computational) (NNS linguistics))) (, ,) (PP ("
},
{
"path": "data/test-stripped/first/intro.txt",
"chars": 195,
"preview": "This small corpus is used in corpkit's tests. Not a lot of data is required. \nHere, we're testing the speaker_segmentati"
},
{
"path": "data/test-stripped/second/body.txt",
"chars": 298,
"preview": "Corpus linguistics and computational linguistics, like concordancing and interrogating, are situated on a vast continuum"
},
{
"path": "index.rst",
"chars": 7122,
"preview": ".. corpkit documentation master file, created by\n sphinx-quickstart on Thu Nov 5 11:43:02 2015.\n You can adapt this"
},
{
"path": "make.bat",
"chars": 7246,
"preview": "@ECHO OFF\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sphinx-build\r\n)\r\n"
},
{
"path": "meta.yaml",
"chars": 1835,
"preview": "package:\n name: corpkit\n version: \"2.3.8\"\n\nsource:\n fn: corpkit-2.3.8.tar.gz\n url: https://pypi.python.org/packages/"
},
{
"path": "requirements.txt",
"chars": 243,
"preview": "git+https://github.com/interrogator/tkintertable.git#egg=tkintertable-1.2\ngit+https://github.com/interrogator/tabview#eg"
},
{
"path": "rst_docs/API/corpkit.building.rst",
"chars": 10307,
"preview": "Creating projects and building corpora\n=======================================\n\nDoing corpus linguistics involves buildi"
},
{
"path": "rst_docs/API/corpkit.concordancing.rst",
"chars": 5992,
"preview": "\nConcordancing\n==============\n\nConcordancing is the task of getting an aligned list of *keywords in context*. Here's a v"
},
{
"path": "rst_docs/API/corpkit.editing.rst",
"chars": 6824,
"preview": ".. _editing-page:\n\nEditing results\n=====================\n\nCorpus interrogation is the task of getting frequency counts f"
},
{
"path": "rst_docs/API/corpkit.interrogating.rst",
"chars": 21066,
"preview": "Interrogating corpora\n=====================\n\nOnce you've built a corpus, you can search it for linguistic phenomena. Thi"
},
{
"path": "rst_docs/API/corpkit.langmodel.rst",
"chars": 2382,
"preview": "Using language models \n======================\n\n.. warning::\n\n Language modelling is currently deprecated, while the to"
},
{
"path": "rst_docs/API/corpkit.managing.rst",
"chars": 5175,
"preview": "Managing projects\n=================\n\n``corpkit`` has a few other bits and pieces designed to make life easier when doing"
},
{
"path": "rst_docs/API/corpkit.visualising.rst",
"chars": 10228,
"preview": "Visualising results\n=====================\n\nOne thing missing in a lot of corpus linguistic tools is the ability to produ"
},
{
"path": "rst_docs/API-ref/corpkit.corpus.rst",
"chars": 869,
"preview": "Corpus classes\n=====================\n\nMuch of *corpkit*'s functionality comes from the ability to work with ``Corpus`` a"
},
{
"path": "rst_docs/API-ref/corpkit.dictionaries.rst",
"chars": 1060,
"preview": "Wordlists\n============================\n\nClosed class word types\n-------------------------------------------\n\nVarious wor"
},
{
"path": "rst_docs/API-ref/corpkit.interrogation.rst",
"chars": 751,
"preview": "Interrogation classes\n============================\n\nOnce you have searched a ``Corpus`` object, you'll want to be able t"
},
{
"path": "rst_docs/API-ref/corpkit.other.rst",
"chars": 398,
"preview": "Functions\n====================\n\n*corpkit* contains a small set of standalone functions.\n\n`as_regex`\n--------------------"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.annotating.rst",
"chars": 3251,
"preview": "Annotating your corpus\n========================\n\nAnother thing you might like to do is add metadata or annotations to yo"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.concordancing.rst",
"chars": 3592,
"preview": "Concordancing\n===============\n\nBy default, every search also produces concordance lines. You can view them by typing ``c"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.editing.rst",
"chars": 1885,
"preview": "Editing results\n================\n\nOnce you have generated a `result` object via the `search` command, you can edit the r"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.interrogating.rst",
"chars": 4784,
"preview": "Interrogating corpora\n=======================\n\nThe most powerful thing about *corpkit* is its ability to search parsed c"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.making.rst",
"chars": 3966,
"preview": "Making projects and corpora\n============================\n\nThe first two things you need to do when using *corpkit* are t"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.managing.rst",
"chars": 3140,
"preview": "Settings and management\n========================\n\nThe interpreter can do a number of other useful things. They are outli"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.overview.rst",
"chars": 14384,
"preview": ".. _interpreter-page:\n\nOverview\n=======================\n\n*corpkit* comes with a dedicated interpreter, which receives co"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.setup.rst",
"chars": 906,
"preview": "Setup\n==============================\n\n.. contents::\n :local:\n\nDependencies\n-------------\n\nTo use the interpreter, you'"
},
{
"path": "rst_docs/interpreter/corpkit.interpreter.visualising.rst",
"chars": 2694,
"preview": "\nPlotting\n=========\n\nYou can plot results and edited results using the `plot` method, which interfaces with *matplotlib*"
},
{
"path": "setup.cfg",
"chars": 249,
"preview": "[metadata]\nname = corpkit\ndescription-file = README.md\ndescription = A toolkit for working with parsed corpora\nurl = htt"
},
{
"path": "setup.py",
"chars": 2648,
"preview": "import setuptools\nfrom setuptools import setup, find_packages\nfrom setuptools.command.install import install\nimport os\nf"
},
{
"path": "talks/IDL_seminar.tex",
"chars": 9983,
"preview": "\\documentclass{beamer} % print frames\n%\\documentclass[notes=only]{beamer} % only notes\n%\\documentclass{beamer} "
}
]
// ... and 1 more file (not shown in this preview)
About this extraction
This document contains the full source code of the interrogator/corpkit GitHub repository, extracted and formatted as plain text for large language models and other AI tools that accept text input. The extraction covers 98 files (2.3 MB, roughly 614.9k tokens) plus a symbol index of 376 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract, a GitHub-repository-to-text converter by Nikandr Surkov.