[
  {
    "path": ".gitignore",
    "content": "*.py[co]\n\n# Packages\n*.egg\n*.egg-info\ndist\nbuild\neggs\nparts\nbin\nvar\nsdist\ndevelop-eggs\n.installed.cfg\n\n# Installer logs\npip-log.txt\n\n# Unit test / coverage reports\n.coverage\n.tox\n\n#Translations\n*.mo\n\n#Mr Developer\n.mr.developer.cfg\n"
  },
  {
    "path": "LICENSE.TXT",
    "content": "Copyright (C) 2010-2012 Tristan Havelick\n\nLicensed under the Apache License, Version 2.0 (the 'License');\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n   http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an 'AS IS' BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License."
  },
  {
    "path": "README.md",
    "content": "summarize\n=========\n\nA python library for simple text summarization\n\nInstallation \n============\n\nFirst install nltk and numpy:\n\n    sudo pip install nltk\n    sudo pip install numpy\n\nThen install the punkt and stopwords nltk packages:\n\n    sudo python -m nltk.downloader -d /usr/share/nltk_data punkt\n    sudo python -m nltk.downloader -d /usr/share/nltk_data stopwords\n\nThen, run the tests:\n\n    python run-tests.py\n\nIf nothing is output, you're good to go!\n\nExamples\n========\n\nSee `test/summarize.doctest` for a few simple usage examples\n\n\n"
  },
  {
    "path": "run-tests.py",
    "content": "import doctest \nfrom glob import glob\nimport os.path\n\npaths = glob('test/*.doctest')\n\nfor file in paths:\n\tdoctest.testfile(file)"
  },
  {
    "path": "setup.py",
    "content": "#!/usr/bin/env python\n\n# http://docs.python.org/2/distutils/setupscript.html\nfrom distutils.core import setup\n\nsetup(\n  name='summarize',\n\tversion='20121029',\n\t# ...\n\tpy_modules=['summarize'],\n\t# ...\n)\n"
  },
  {
    "path": "summarize.py",
    "content": "# Simple Summarizer\n# Copyright (C) 2010-2012 Tristan Havelick\n# Author: Tristan Havelick <tristan@havelick.com>\n# URL: <https://github.com/thavelick/summarize/>\n# For license information, see LICENSE.TXT\n\n\"\"\"\nA summarizer based on the algorithm found in Classifier4J by Nick Lothan.\nIn order to summarize a document this algorithm first determines the\nfrequencies of the words in the document.  It then splits the document\ninto a series of sentences.  Then it creates a summary by including the\nfirst sentence that includes each of the most frequent words.  Finally\nsummary's sentences are reordered to reflect that of those in the original\ndocument.\n\"\"\"\n\n##//////////////////////////////////////////////////////\n##  Simple Summarizer\n##//////////////////////////////////////////////////////\n\nfrom nltk.probability import FreqDist\nfrom nltk.tokenize import RegexpTokenizer\nfrom nltk.corpus import stopwords\nimport nltk.data\n\nclass SimpleSummarizer:\n\n\tdef reorder_sentences( self, output_sentences, input ):\n\t\toutput_sentences.sort( lambda s1, s2:\n\t\t\tinput.find(s1) - input.find(s2) )\n\t\treturn output_sentences\n\t\n\tdef get_summarized(self, input, num_sentences ):\n\t\t# TODO: allow the caller to specify the tokenizer they want\n\t\t# TODO: allow the user to specify the sentence tokenizer they want\n\t\t\n\t\ttokenizer = RegexpTokenizer('\\w+')\n\t\t\n\t\t# get the frequency of each word in the input\n\t\tbase_words = [word.lower()\n\t\t\tfor word in tokenizer.tokenize(input)]\n\t\twords = [word for word in base_words if word not in stopwords.words()]\n\t\tword_frequencies = FreqDist(words)\n\t\t\n\t\t# now create a set of the most frequent words\n\t\tmost_frequent_words = [pair[0] for pair in\n\t\t\tword_frequencies.items()[:100]]\n\t\t\n\t\t# break the input up into sentences.  working_sentences is used\n\t\t# for the analysis, but actual_sentences is used in the results\n\t\t# so capitalization will be correct.\n\t\t\n\t\tsent_detector = nltk.data.load('tokenizers/punkt/english.pickle')\n\t\tactual_sentences = sent_detector.tokenize(input)\n\t\tworking_sentences = [sentence.lower()\n\t\t\tfor sentence in actual_sentences]\n\n\t\t# iterate over the most frequent words, and add the first sentence\n\t\t# that inclues each word to the result.\n\t\toutput_sentences = []\n\n\t\tfor word in most_frequent_words:\n\t\t\tfor i in range(0, len(working_sentences)):\n\t\t\t\tif (word in working_sentences[i]\n\t\t\t\t  and actual_sentences[i] not in output_sentences):\n\t\t\t\t\toutput_sentences.append(actual_sentences[i])\n\t\t\t\t\tbreak\n\t\t\t\tif len(output_sentences) >= num_sentences: break\n\t\t\tif len(output_sentences) >= num_sentences: break\n\t\t\t\n\t\t# sort the output sentences back to their original order\n\t\treturn self.reorder_sentences(output_sentences, input)\n\t\n\tdef summarize(self, input, num_sentences):\n\t\treturn \" \".join(self.get_summarized(input, num_sentences))\n"
  },
  {
    "path": "test/summarize.doctest",
    "content": ".. Copyright (C) 2010-2012 Tristan Havelick. \n.. For license information, see LICENSE.TXT\n\n===========\nSummarizers\n===========\n\nOverview\n~~~~~~~~\n\nSummarizers provide a short summary or abstract from a longer document.\n\n\t>>> import summarize\n\nSimpleSummarizer\n~~~~~~~~~~~~~~~~\n\t\nA SimpleSummarizer makes a summary by using sentences with the most frequent \nwords\n\t\n\t>>> ss = summarize.SimpleSummarizer()\n\t>>> input = \"NLTK is a python library for working human-written text. Summarize is a package that uses NLTK to create summaries.\"\n\t>>> ss.summarize(input, 1)\n\t'NLTK is a python library for working human-written text.'\n\t\nYou can specify any number of sentenecs in the summary as you like.\n\n\t>>> input = \"NLTK is a python library for working human-written text. Summarize is a package that uses NLTK to create summaries. A Summariser is really cool. I don't think there are any other python summarisers.\"\n\t>>> ss.summarize(input, 2)\n\t\"NLTK is a python library for working human-written text. I don't think there are any other python summarisers.\"\n\nUnlike the original algorithm from Classifier4J, this summarizer works\ncorrectly with punctuation other than periods:\n\n\t>>> input = \"NLTK is a python library for working human-written text! Summarize is a package that uses NLTK to create summaries.\"\n\t>>> ss.summarize(input, 1)\n\t'NLTK is a python library for working human-written text!'\n"
  }
]