Repository: chbrown/liwc-python
Branch: master
Commit: b8e158c84b2a
Files: 12
Total size: 9.9 KB

Directory structure:
gitextract__4myuiyb/

├── .github/
│   └── ISSUE_TEMPLATE.md
├── .gitignore
├── .travis.yml
├── LICENSE.txt
├── README.md
├── liwc/
│   ├── __init__.py
│   ├── dic.py
│   └── trie.py
├── setup.cfg
├── setup.py
└── test/
    ├── alpha.dic
    └── test_alpha_dic.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/ISSUE_TEMPLATE.md
================================================
Please do not open an issue with the intent of subverting encryption implemented by the LIWC developers.

If the version of LIWC that you purchased (or otherwise legitimately obtained as a researcher at an academic institution) does not provide a machine-readable `*.dic` file, please contact the distributor directly.


================================================
FILE: .gitignore
================================================
build/
dist/


================================================
FILE: .travis.yml
================================================
language: python
python:
  - "2.7"
  - "3.6"
script:
  - python setup.py test


================================================
FILE: LICENSE.txt
================================================
Copyright © 2012-2020 Christopher Brown <io@henrian.com>

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


================================================
FILE: README.md
================================================
# `liwc`

[![PyPI version](https://badge.fury.io/py/liwc.svg)](https://pypi.org/project/liwc/)
[![Travis CI Build Status](https://travis-ci.org/chbrown/liwc-python.svg?branch=master)](https://travis-ci.org/chbrown/liwc-python)

This repository is a Python package implementing two basic functions:
1. Loading (parsing) a Linguistic Inquiry and Word Count (LIWC) dictionary from the `.dic` file format.
2. Using that dictionary to count category matches on provided texts.

This is not an official LIWC product nor is it in any way affiliated with the LIWC development team or Receptiviti.


## Obtaining LIWC

The LIWC lexicon is proprietary, so it is _not_ included in this repository.

The lexicon data can be acquired (purchased) from [liwc.net](http://liwc.net/).

* If you are a researcher at an academic institution, please contact [Dr. James W. Pennebaker](https://liberalarts.utexas.edu/psychology/faculty/pennebak) directly.
* For commercial use, contact [Receptiviti](https://www.receptiviti.com/), which is the company that holds the exclusive commercial license.

Finally, please do not open an issue in this repository with the intent of subverting encryption implemented by the LIWC developers.
If the version of LIWC that you purchased (or otherwise legitimately obtained as a researcher at an academic institution) does not provide a machine-readable `*.dic` file, please contact the distributor directly.


## Setup

Install from [PyPI](https://pypi.python.org/pypi/liwc):

    pip install liwc


## Example

This example reads the LIWC dictionary from a file named `LIWC2007_English100131.dic`, which looks like this:

    %
    1   funct
    2   pronoun
    [...]
    %
    a   1   10
    abdomen*    146 147
    about   1   16  17
    [...]


#### Loading the lexicon

```python
import liwc
parse, category_names = liwc.load_token_parser('LIWC2007_English100131.dic')
```

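* `parse` is a function from a token of text (a string) to an iterable of matching LIWC category names (strings); it is a generator function, so wrap the result in `list(...)` if you need an actual list
* `category_names` is a list of all LIWC category names in the lexicon (a list of strings)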
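
To make the file layout and the two return values concrete, here is a self-contained toy sketch that does not use this package. The helper names `parse_dic_text` and `match_token` are made up for illustration, and the lexicon text is the tiny `test/alpha.dic` fixture from this repository. The matching is a plain linear scan under the assumption that `*` only appears at the end of a pattern; it is not the trie the package builds, and its results may differ from the package when several patterns match the same token.

```python
# Toy lexicon text in the .dic layout (same content as test/alpha.dic).
ALPHA_DIC = "%\n1\tA\n2\tBravo\n%\na*\t1\nbravo\t2\n"


def parse_dic_text(text):
    """Split .dic text into (lexicon, category_names). Illustrative only."""
    category_mapping = {}
    lexicon = {}
    percent_lines_seen = 0
    for line in text.splitlines():
        line = line.strip()
        if line == "%":
            # the first "%" opens the categories section, the second closes it
            percent_lines_seen += 1
        elif not line:
            continue
        elif percent_lines_seen == 1:
            category_id, category_name = line.split("\t", 1)
            category_mapping[category_id] = category_name
        elif percent_lines_seen >= 2:
            parts = line.split("\t")
            lexicon[parts[0]] = [category_mapping[i] for i in parts[1:]]
    return lexicon, list(category_mapping.values())


def match_token(lexicon, token):
    """Yield category names for `token` using a linear scan (not a trie)."""
    for pattern, category_names in lexicon.items():
        if pattern.endswith("*"):
            matched = token.startswith(pattern[:-1])
        else:
            matched = token == pattern
        if matched:
            for category_name in category_names:
                yield category_name


lexicon, category_names = parse_dic_text(ALPHA_DIC)
print(category_names)  # ['A', 'Bravo']
print(list(match_token(lexicon, "alpha")))  # ['A']
print(list(match_token(lexicon, "Bravo")))  # [] (matching is case-sensitive)
```

Like `parse`, the toy `match_token` is a generator function, so its result is wrapped in `list(...)` before printing.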


#### Analyzing text

```python
import re

def tokenize(text):
    # you may want to use a smarter tokenizer
    for match in re.finditer(r'\w+', text, re.UNICODE):
        yield match.group(0)

gettysburg = '''Four score and seven years ago our fathers brought forth on
  this continent a new nation, conceived in liberty, and dedicated to the
  proposition that all men are created equal. Now we are engaged in a great
  civil war, testing whether that nation, or any nation so conceived and so
  dedicated, can long endure. We are met on a great battlefield of that war.
  We have come to dedicate a portion of that field, as a final resting place
  for those who here gave their lives that that nation might live. It is
  altogether fitting and proper that we should do this.'''.lower()
gettysburg_tokens = tokenize(gettysburg)
```

Now, count all the categories in all of the tokens, and print the results:

```python
from collections import Counter
gettysburg_counts = Counter(category for token in gettysburg_tokens for category in parse(token))
print(gettysburg_counts)
#=> Counter({'funct': 58, 'pronoun': 18, 'cogmech': 17, ...})
```
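
Raw counts grow with the length of the text, so it can be useful to divide them by the number of tokens. Note that `tokenize` above returns a generator, which can only be consumed once, so the sketch below first materializes the tokens in a list. The sketch is self-contained and assumes Python 3: `toy_parse` is a hypothetical stand-in for the `parse` function returned by `liwc.load_token_parser`, its two entries are invented for illustration, and `category_proportions` is not part of this package.

```python
import re
from collections import Counter


def tokenize(text):
    for match in re.finditer(r"\w+", text, re.UNICODE):
        yield match.group(0)


def toy_parse(token):
    # Hypothetical stand-in for `parse`; a real lexicon has many more entries.
    toy_lexicon = {"we": ["pronoun", "funct"], "that": ["funct"]}
    return toy_lexicon.get(token, [])


def category_proportions(tokens, parse):
    """Return {category: matches / number of tokens} for a list of tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(category for token in tokens for category in parse(token))
    return {category: count / total for category, count in counts.items()}


# list(...) keeps the tokens around so they can be both counted and parsed
tokens = list(tokenize("we are met on a great battlefield of that war"))
proportions = category_proportions(tokens, toy_parse)
print(len(tokens))  # 10
print(proportions)  # {'pronoun': 0.1, 'funct': 0.2}
```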

### _N.B._:

* The LIWC lexicon only matches lowercase strings, so you will most likely want to lowercase your input text before passing it to `parse(...)`.
  In the example above, I call `.lower()` on the entire string, but you could alternatively incorporate that into your tokenization process (e.g., by using [spaCy](https://spacy.io/api/token)'s `token.lower_`).
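
As a minimal sketch of the second option using only the standard library, the tokenizer from the example can lowercase each token as it is produced. The name `tokenize_lower` is made up for illustration.

```python
import re


def tokenize_lower(text):
    """Yield lowercased word tokens from `text`."""
    for match in re.finditer(r"\w+", text, re.UNICODE):
        yield match.group(0).lower()


print(list(tokenize_lower("Four score AND seven years ago")))
# ['four', 'score', 'and', 'seven', 'years', 'ago']
```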


## License

Copyright (c) 2012-2020 Christopher Brown.
[MIT Licensed](LICENSE.txt).


================================================
FILE: liwc/__init__.py
================================================
from .dic import read_dic
from .trie import build_trie, search_trie

try:
    import pkg_resources

    __version__ = pkg_resources.get_distribution("liwc").version
except Exception:
    __version__ = None


def load_token_parser(filepath):
    """
    Reads a LIWC lexicon from a file in the .dic format, returning a tuple of
    (parse, category_names), where:
    * `parse` is a function from a token to an iterable (a generator,
      potentially empty) of strings naming the matching categories
    * `category_names` is a list of strings representing all LIWC categories in
      the lexicon
    """
    lexicon, category_names = read_dic(filepath)
    trie = build_trie(lexicon)

    def parse_token(token):
        for category_name in search_trie(trie, token):
            yield category_name

    return parse_token, category_names


================================================
FILE: liwc/dic.py
================================================
def _parse_categories(lines):
    """
    Read (category_id, category_name) pairs from the categories section.
    Each line consists of an integer followed by a tab and then the category name.
    This section is separated from the lexicon by a line consisting of a single "%".
    """
    for line in lines:
        line = line.strip()
        if line == "%":
            return
        # skip lines that are not tab-separated (category_id, category_name) pairs
        if "\t" in line:
            category_id, category_name = line.split("\t", 1)
            yield category_id, category_name


def _parse_lexicon(lines, category_mapping):
    """
    Read (match_expression, category_names) pairs from the lexicon section.
    Each line consists of a match expression followed by a tab and then one or more
    tab-separated integers, which are mapped to category names using `category_mapping`.
    """
    for line in lines:
        line = line.strip()
        # skip blank lines so they do not add an empty pattern to the lexicon
        if not line:
            continue
        parts = line.split("\t")
        yield parts[0], [category_mapping[category_id] for category_id in parts[1:]]


def read_dic(filepath):
    """
    Reads a LIWC lexicon from a file in the .dic format, returning a tuple of
    (lexicon, category_names), where:
    * `lexicon` is a dict mapping string patterns to lists of category names
    * `category_names` is a list of category names (as strings)
    """
    with open(filepath) as lines:
        # read up to first "%" (should be very first line of file)
        for line in lines:
            if line.strip() == "%":
                break
        # read categories (a mapping from integer string to category name)
        category_mapping = dict(_parse_categories(lines))
        # read lexicon (a mapping from matching string to a list of category names)
        lexicon = dict(_parse_lexicon(lines, category_mapping))
    return lexicon, list(category_mapping.values())


================================================
FILE: liwc/trie.py
================================================
def build_trie(lexicon):
    """
    Build a character-trie from the plain pattern_string -> categories_list
    mapping provided by `lexicon`.

    Some LIWC patterns end with a `*` to indicate a wildcard match.
    """
    trie = {}
    for pattern, category_names in lexicon.items():
        cursor = trie
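        for char in pattern:
            if char == "*":
                # wildcard: any token that reaches this node matches
                cursor["*"] = category_names
                break
            if char not in cursor:
                cursor[char] = {}
            cursor = cursor[char]
        else:
            # no wildcard: mark the end of an exact-match pattern
            cursor["$"] = category_names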
    return trie


def search_trie(trie, token, token_i=0):
    """
    Search the given character-trie for paths that match the `token` string.
    """
    if "*" in trie:
        return trie["*"]
    if "$" in trie and token_i == len(token):
        return trie["$"]
    if token_i < len(token):
        char = token[token_i]
        if char in trie:
            return search_trie(trie[char], token, token_i + 1)
    return []


================================================
FILE: setup.cfg
================================================
[metadata]
name = liwc
author = Christopher Brown
author_email = chrisbrown@utexas.edu
url = https://github.com/chbrown/liwc-python
description = Linguistic Inquiry and Word Count (LIWC) analyzer (proprietary data not included)
long_description = file: README.md
long_description_content_type = text/markdown
license = MIT

[options]
packages = find:
zip_safe = True
setup_requires =
  pytest-runner
  setuptools-scm
tests_require =
  pytest
  pytest-black

[aliases]
test = pytest

[tool:pytest]
addopts =
  --black

[bdist_wheel]
universal = 1


================================================
FILE: setup.py
================================================
from setuptools import setup

setup(use_scm_version=True)


================================================
FILE: test/alpha.dic
================================================
%
1	A
2	Bravo
%
a*	1
bravo	2


================================================
FILE: test/test_alpha_dic.py
================================================
import os.path

import liwc

test_dir = os.path.dirname(__file__)


def test_category_names():
    _, category_names = liwc.load_token_parser(os.path.join(test_dir, "alpha.dic"))
    assert category_names == ["A", "Bravo"]


def test_parse():
    parse, _ = liwc.load_token_parser(os.path.join(test_dir, "alpha.dic"))
    sentence = "Any alpha a bravo charlie Bravo boy"
    tokens = sentence.split()
    matches = [category for token in tokens for category in parse(token)]
    # matching is case-sensitive, so the only matches are "alpha" (A), "a" (A) and "bravo" (Bravo)
    assert matches == ["A", "A", "Bravo"]