Repository: chbrown/liwc-python
Branch: master
Commit: b8e158c84b2a
Files: 12
Total size: 9.9 KB

Directory structure:
gitextract__4myuiyb/
├── .github/
│   └── ISSUE_TEMPLATE.md
├── .gitignore
├── .travis.yml
├── LICENSE.txt
├── README.md
├── liwc/
│   ├── __init__.py
│   ├── dic.py
│   └── trie.py
├── setup.cfg
├── setup.py
└── test/
    ├── alpha.dic
    └── test_alpha_dic.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/ISSUE_TEMPLATE.md
================================================
Please do not open an issue with the intent of subverting encryption implemented by the LIWC developers.
If the version of LIWC that you purchased (or otherwise legitimately obtained as a researcher at an academic institution) does not provide a machine-readable `*.dic` file, please contact the distributor directly.

================================================
FILE: .gitignore
================================================
build/
dist/

================================================
FILE: .travis.yml
================================================
language: python
python:
  - "2.7"
  - "3.6"
script:
  - python setup.py test

================================================
FILE: LICENSE.txt
================================================
Copyright © 2012-2020 Christopher Brown

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
# `liwc`

[![PyPI version](https://badge.fury.io/py/liwc.svg)](https://pypi.org/project/liwc/)
[![Travis CI Build Status](https://travis-ci.org/chbrown/liwc-python.svg?branch=master)](https://travis-ci.org/chbrown/liwc-python)

This repository is a Python package implementing two basic functions:

1. Loading (parsing) a Linguistic Inquiry and Word Count (LIWC) dictionary from the `.dic` file format.
2. Using that dictionary to count category matches on provided texts.

This is not an official LIWC product nor is it in any way affiliated with the LIWC development team or Receptiviti.


## Obtaining LIWC

The LIWC lexicon is proprietary, so it is _not_ included in this repository.

The lexicon data can be acquired (purchased) from [liwc.net](http://liwc.net/).

* If you are a researcher at an academic institution, please contact [Dr. James W. Pennebaker](https://liberalarts.utexas.edu/psychology/faculty/pennebak) directly.
* For commercial use, contact [Receptiviti](https://www.receptiviti.com/), which is the company that holds the exclusive commercial license.

Finally, please do not open an issue in this repository with the intent of subverting encryption implemented by the LIWC developers.
If the version of LIWC that you purchased (or otherwise legitimately obtained as a researcher at an academic institution) does not provide a machine-readable `*.dic` file, please contact the distributor directly.

## Setup

Install from [PyPI](https://pypi.python.org/pypi/liwc):

    pip install liwc


## Example

This example reads the LIWC dictionary from a file named `LIWC2007_English100131.dic`, which looks like this:

    %
    1	funct
    2	pronoun
    [...]
    %
    a	1	10
    abdomen*	146	147
    about	1	16	17
    [...]


#### Loading the lexicon

```python
import liwc
parse, category_names = liwc.load_token_parser('LIWC2007_English100131.dic')
```

* `parse` is a function from a token of text (a string) to a list of matching LIWC categories (a list of strings)
* `category_names` is all LIWC categories in the lexicon (a list of strings)


#### Analyzing text

```python
import re

def tokenize(text):
    # you may want to use a smarter tokenizer
    for match in re.finditer(r'\w+', text, re.UNICODE):
        yield match.group(0)

gettysburg = '''Four score and seven years ago our fathers brought forth on
this continent a new nation, conceived in liberty, and dedicated to the
proposition that all men are created equal. Now we are engaged in a great
civil war, testing whether that nation, or any nation so conceived and so
dedicated, can long endure. We are met on a great battlefield of that war.
We have come to dedicate a portion of that field, as a final resting place
for those who here gave their lives that that nation might live. It is
altogether fitting and proper that we should do this.'''.lower()
gettysburg_tokens = tokenize(gettysburg)
```

Now, count all the categories in all of the tokens, and print the results:

```python
from collections import Counter
gettysburg_counts = Counter(category for token in gettysburg_tokens for category in parse(token))
print(gettysburg_counts)
#=> Counter({'funct': 58, 'pronoun': 18, 'cogmech': 17, ...})
```

### _N.B._:

* The LIWC lexicon only matches lowercase strings, so you will most likely want to lowercase your input text before passing it to `parse(...)`.
  In the example above, I call `.lower()` on the entire string, but you could alternatively incorporate that into your tokenization process (e.g., by using [spaCy](https://spacy.io/api/token)'s `token.lower_`).


## License

Copyright (c) 2012-2020 Christopher Brown.
[MIT Licensed](LICENSE.txt).

================================================
FILE: liwc/__init__.py
================================================
from .dic import read_dic
from .trie import build_trie, search_trie

try:
    import pkg_resources

    __version__ = pkg_resources.get_distribution("liwc").version
except Exception:
    __version__ = None


def load_token_parser(filepath):
    """
    Reads a LIWC lexicon from a file in the .dic format, returning a tuple of
    (parse, category_names), where:
    * `parse` is a function from a token to a list of strings (potentially
      empty) of matching categories
    * `category_names` is a list of strings representing all LIWC categories in
      the lexicon
    """
    lexicon, category_names = read_dic(filepath)
    trie = build_trie(lexicon)

    def parse_token(token):
        for category_name in search_trie(trie, token):
            yield category_name

    return parse_token, category_names

================================================
FILE: liwc/dic.py
================================================
def _parse_categories(lines):
    """
    Read (category_id, category_name) pairs from the categories section.
    Each line consists of an integer followed by a tab and then the category name.
    This section is separated from the lexicon by a line consisting of a single "%".
    """
    for line in lines:
        line = line.strip()
        if line == "%":
            return
        # ignore non-matching groups of categories
        if "\t" in line:
            category_id, category_name = line.split("\t", 1)
            yield category_id, category_name


def _parse_lexicon(lines, category_mapping):
    """
    Read (match_expression, category_names) pairs from the lexicon section.
    Each line consists of a match expression followed by a tab and then one or more
    tab-separated integers, which are mapped to category names using `category_mapping`.
    """
    for line in lines:
        line = line.strip()
        parts = line.split("\t")
        yield parts[0], [category_mapping[category_id] for category_id in parts[1:]]


def read_dic(filepath):
    """
    Reads a LIWC lexicon from a file in the .dic format, returning a tuple of
    (lexicon, category_names), where:
    * `lexicon` is a dict mapping string patterns to lists of category names
    * `category_names` is a list of category names (as strings)
    """
    with open(filepath) as lines:
        # read up to first "%" (should be very first line of file)
        for line in lines:
            if line.strip() == "%":
                break
        # read categories (a mapping from integer string to category name)
        category_mapping = dict(_parse_categories(lines))
        # read lexicon (a mapping from matching string to a list of category names)
        lexicon = dict(_parse_lexicon(lines, category_mapping))
    return lexicon, list(category_mapping.values())

================================================
FILE: liwc/trie.py
================================================
def build_trie(lexicon):
    """
    Build a character-trie from the plain pattern_string -> categories_list
    mapping provided by `lexicon`.

    Some LIWC patterns end with a `*` to indicate a wildcard match.
    """
    trie = {}
    for pattern, category_names in lexicon.items():
        cursor = trie
        for char in pattern:
            if char == "*":
                cursor["*"] = category_names
                break
            if char not in cursor:
                cursor[char] = {}
            cursor = cursor[char]
        cursor["$"] = category_names
    return trie


def search_trie(trie, token, token_i=0):
    """
    Search the given character-trie for paths that match the `token` string.
""" if "*" in trie: return trie["*"] if "$" in trie and token_i == len(token): return trie["$"] if token_i < len(token): char = token[token_i] if char in trie: return search_trie(trie[char], token, token_i + 1) return [] ================================================ FILE: setup.cfg ================================================ [metadata] name = liwc author = Christopher Brown author_email = chrisbrown@utexas.edu url = https://github.com/chbrown/liwc-python description = Linguistic Inquiry and Word Count (LIWC) analyzer (proprietary data not included) long_description = file: README.md long_description_content_type = text/markdown license = MIT [options] packages = find: zip_safe = True setup_requires = pytest-runner setuptools-scm tests_require = pytest pytest-black [aliases] test = pytest [tool:pytest] addopts = --black [bdist_wheel] universal = 1 ================================================ FILE: setup.py ================================================ from setuptools import setup setup(use_scm_version=True) ================================================ FILE: test/alpha.dic ================================================ % 1 A 2 Bravo % a* 1 bravo 2 ================================================ FILE: test/test_alpha_dic.py ================================================ import os.path import liwc test_dir = os.path.dirname(__file__) def test_category_names(): _, category_names = liwc.load_token_parser(os.path.join(test_dir, "alpha.dic")) assert category_names == ["A", "Bravo"] def test_parse(): parse, _ = liwc.load_token_parser(os.path.join(test_dir, "alpha.dic")) sentence = "Any alpha a bravo charlie Bravo boy" tokens = sentence.split() matches = [category for token in tokens for category in parse(token)] # matching is case-sensitive, so the only matches are "alpha" (A), "a" (A) and "bravo" (Bravo) assert matches == ["A", "A", "Bravo"]
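
The block below is a minimal sketch, not a file in the repository, of how the wildcard trie in `liwc/trie.py` behaves on the `test/alpha.dic` lexicon. The two functions are condensed copies of `build_trie` and `search_trie`, included so the block runs without the package installed. The `lexicon` dict and the `matches` name are assumptions for illustration: the dict is written by hand rather than parsed from the `.dic` file.

```python
def build_trie(lexicon):
    # Condensed copy of liwc/trie.py: one nested dict per character.
    trie = {}
    for pattern, category_names in lexicon.items():
        cursor = trie
        for char in pattern:
            if char == "*":
                # A trailing "*" marks a prefix (wildcard) match at this node.
                cursor["*"] = category_names
                break
            cursor = cursor.setdefault(char, {})
        # "$" marks an exact end-of-pattern match at this node.
        cursor["$"] = category_names
    return trie


def search_trie(trie, token, token_i=0):
    # A wildcard at the current node matches regardless of what follows.
    if "*" in trie:
        return trie["*"]
    # An exact match requires the whole token to be consumed.
    if "$" in trie and token_i == len(token):
        return trie["$"]
    if token_i < len(token) and token[token_i] in trie:
        return search_trie(trie[token[token_i]], token, token_i + 1)
    return []


# Hand-written equivalent of test/alpha.dic (hypothetical stand-in for read_dic).
lexicon = {"a*": ["A"], "bravo": ["Bravo"]}
trie = build_trie(lexicon)

tokens = ["alpha", "a", "bravo", "Bravo", "boy"]
matches = {token: search_trie(trie, token) for token in tokens}
print(matches)
```

Under these assumptions, `alpha` and `a` both resolve through the `a*` wildcard, `bravo` matches exactly, and `Bravo` and `boy` return an empty list because matching is case-sensitive and `boy` leaves the `bravo` path after its first character.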