[
  {
    "path": ".gitignore",
    "content": "*.pyc\n*.log\n.cache\n.venv\n.eggs\nPipfile*\nbuild\ndist\ntwarc.egg-info\n.pytest_cache\n.vscode\n.env\nsite\nuv.lock\n"
  },
  {
    "path": ".readthedocs.yaml",
    "content": "version: 2\n\nmkdocs:\n  configuration: mkdocs.yml\n\npython:\n  version: 3.8\n  install:\n  - requirements: requirements-mkdocs.txt\n"
  },
  {
    "path": "LICENSE",
    "content": "The MIT License (MIT)\n\nCopyright (c) Documenting the Now Project\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n\n"
  },
  {
    "path": "MANIFEST.in",
    "content": "include requirements.txt\ninclude docs/README.md\n"
  },
  {
    "path": "README.md",
    "content": "# twarc\n\n**Note: twarc is no longer actively supported after changes to Twitter's API quotas made it unusable.**\n\n---\n\n[![DOI](https://zenodo.org/badge/7605723.svg)](https://zenodo.org/badge/latestdoi/7605723)\n\ntwarc is a command line tool and Python library for collecting and archiving Twitter JSON\ndata via the Twitter API. It has separate commands (twarc and twarc2) for working with the older\nv1.1 API and the newer v2 API and Academic Access (respectively).\n\n* Read the [documentation](https://twarc-project.readthedocs.io)\n* Ask questions here in [GitHub](https://github.com/DocNow/twarc/discussions), in [Slack](https://bit.ly/docnow-slack) or [Matrix](https://matrix.to/#/#docnow:matrix.org?via=matrix.org&via=petrichor.me&via=converser.eu)\n\ntwarc has been developed with generous support from the [Mellon Foundation](https://mellon.org/).\n\n## Contributing \n\nNew features are welcome and encouraged for twarc. However, to keep the core twarc library and command line tool sustainable we will look at new functionality with the following principles in mind:\n\n1. Purpose: twarc is for *collection* and *archiving* of Twitter data via the Twitter API.\n2. Sustainability: keeping the surface area of twarc and it's dependencies small enough to ensure high quality.\n3. Utility: what is exposed by twarc should be applicable to different people, projects and domains, and not specific use cases.\n4. API consistency: as much as sensible we aim to make twarc consistent with the Twitter API, and also aim to make twarc consistent with itself - so commands in core twarc should work similarly to each other, and twarc functionality should align towards the Twitter API.\n\nFor features and approaches that fall outside of this, twarc enables external packages to hook into the twarc2 command line tool via [click-plugins](https://github.com/click-contrib/click-plugins). 
This means that if you want to propose new functionality, you can create your own package without coordinating with core twarc.\n\n### Documentation\n\nThe documentation is managed at ReadTheDocs. If you would like to improve the documentation you can edit the Markdown files in `docs` or add new ones. Then send a pull request and we can add it.\n\nTo view your documentation locally you should be able to:\n\n    pip install -r requirements-mkdocs.txt\n    pip install -e .\n    mkdocs serve\n    open http://127.0.0.1:8000/\n\nIf you prefer you can create a page on the [wiki](https://github.com/docnow/twarc/wiki/) to workshop the documentation, and then when/if you think it's ready to be merged with the documentation create an [issue](https://github.com/docnow/twarc/issues). Please feel free to create whatever documentation is useful in the wiki area.\n\n### Code\n\nIf you are interested in adding functionality to twarc or fixing something that's broken, here are the steps to set up your development environment:\n\n    git clone https://github.com/docnow/twarc\n    cd twarc\n\nCreate a .env file that includes Twitter App keys to use during testing:\n\n    BEARER_TOKEN=CHANGEME\n    CONSUMER_KEY=CHANGEME\n    CONSUMER_SECRET=CHANGEME\n    ACCESS_TOKEN=CHANGEME\n    ACCESS_TOKEN_SECRET=CHANGEME\n\nNow run the tests:\n\n    uv run pytest\n\nAdd your code and some new tests, and send a pull request!\n\n"
  },
  {
    "path": "RELEASING.md",
    "content": "# Releasing\n\nNew versions of twarc can be released by creating a release and assigning a new tag in the GitHub repo. The release, including upload of the new version to PyPI, is performed by GitHub actions when a new tag is created, using the PyPI token stored in the secrets associated with the repository. Anybody who has the permission to create a tag can perform a release.\n\nSteps in a release:\n\n1. Update the version number in `twarc/version.py` - the format is MAJOR.MINOR.PATCH and should always be increasing and unique.\n2. Make a new release from https://github.com/DocNow/twarc/releases (hit the 'draft new release' button on the top right).\n3. Create a new tag, matching the version number in `twarc/version.py`, with a v prefix (ie. vMAJOR.MINOR.PATCH)\n4. Write release notes.\n5. Publish the release.\n6. Make sure the GitHub action completes successfully.\n7. Double check that the new version correctly installs from PyPI: `pip install --upgrade twarc` should install the new version created above.\n"
  },
  {
    "path": "docs/README.md",
    "content": "# twarc\n\ntwarc is a command line tool and Python library for collecting and archiving Twitter JSON\ndata via the Twitter API. It has separate commands (twarc and twarc2) for working with the older\nv1.1 API and the newer v2 API and Academic Access (respectively). It also has an ecosystem of [plugins](plugins) for doing things with the collected data. \n\nSee the `twarc` documentation for running commands: [twarc2](twarc2_en_us.md) and [twarc1](twarc2_en_us.md) for using the v1.1 API. If you aren't sure about which one to use you'll want to start with twarc2 since the v1.1 is scheduled to be retired.\n\n## Install\n\nIf you have python installed, you can install twarc from a terminal (such as the Windows Command Prompt available in the \"start\" menu, or the [OSX Terminal application](https://support.apple.com/en-au/guide/terminal/apd5265185d-f365-44cb-8b09-71a064a42125/mac)):\n\n```\npip3 install twarc\n```\n\nOnce installed, you should be able to use the twarc and twarc2 command line utilities, or use it as a Python library - check the examples [here](api/library.md) for that.\n\n## Other Tools\n\nTwarc is purpose build for working with the twitter API for archiving and studying digital trace data. It is not built as a general purpose API library for Twitter. While the primary use is academic, it works just as well with \"Standard\" v2 API and \"Premium\" v1.1 APIs.\n\nFor a list of general purpose Twitter Libraries in different languages see the [Twitter Documentation](https://developer.twitter.com/en/docs/twitter-api/tools-and-libraries). For Python, [TwitterAPI](https://github.com/geduldig/TwitterAPI) and [tweepy](https://github.com/tweepy/tweepy) are both up to date and maintained. They also support v2 APIs, and their data format with expansions may differ from twarc. 
There is also a reference implementation of the [v2 Academic Access Search](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all) and [v1.1 Premium Search](https://developer.twitter.com/en/docs/twitter-api/premium/search-api/overview) from Twitter [here](https://github.com/twitterdev/search-tweets-python/). The [v2 version](https://github.com/twitterdev/search-tweets-python/tree/v2) of this script is compatible with twarc.\n\nFor `R` there is [academictwitteR](https://cran.r-project.org/web/packages/academictwitteR/vignettes/academictwitteR-intro.html). Unlike twarc, it focuses solely on querying the Twitter Academic Research Product Track v2 API endpoint. Data gathered in twarc can be imported into `R` for analysis as a dataframe if you export the data into CSV using [twarc-csv](https://pypi.org/project/twarc-csv/).\n\n## Getting Help\n\nCheck out the [tutorial](tutorial.md) to get started, or follow along with this [recorded stream](https://tube.nocturlab.fr/videos/watch/1d98d20e-a4fd-4594-aa94-9b1b1301cead) introducing twarc. You can also find additional resources linked from [resources](resources.md). If you run into trouble, feel free to make a post on the [Twarc Repository](https://github.com/DocNow/twarc/issues) or on the [Twitter Developer Forums](https://twittercommunity.com/c/academic-research/62).\n"
  },
  {
    "path": "docs/api/client.md",
    "content": "# twarc.Client\n\n::: twarc.client\n  handler: python\n"
  },
  {
    "path": "docs/api/client2.md",
    "content": "# twarc.Client2\n\n::: twarc.client2\n  handler: python\n"
  },
  {
    "path": "docs/api/expansions.md",
    "content": "# twarc.expansions\n\n[Expansions](https://developer.twitter.com/en/docs/twitter-api/expansions) are how the new v2 Twitter API includes optional metadata about Tweets. In contrast to v1.1, where each Tweet JSON object is self-contained, in v2 metadata about a whole \"page\" of requests is included in the response. This means that to get a self-contained Tweet JSON, additional processing is needed to look up each piece of extra metadata. Different tools and libraries may implement this in different ways. In twarc, the goal was to retain the original JSON format and only append extra fields, so that any code that expects original JSON will still work.\n\n::: twarc.expansions\n  handler: python\n"
  },
  {
    "path": "docs/api/library.md",
    "content": "# Examples of using twarc2 as a library\n\nPlease see [client2](client2.md) docs for the full list of available functions. Here are some minimal working snippets of code that use twarc2 as a library.\n\n## Search \n\nThe client implements the API as closely as possible - so if the API docs expect a parameter in a certain way, so does the twarc2 library.\n\n```python\nimport datetime\n\nfrom twarc.client2 import Twarc2\nfrom twarc.expansions import ensure_flattened\n\n# Your bearer token here\nt = Twarc2(bearer_token=\"A...z\")\n\n# Start and end times must be in UTC\nstart_time = datetime.datetime(2021, 3, 21, 0, 0, 0, 0, datetime.timezone.utc)\nend_time = datetime.datetime(2021, 3, 22, 0, 0, 0, 0, datetime.timezone.utc)\n\n# search_results is a generator, max_results is max tweets per page, 100 max for full archive search with all expansions.\nsearch_results = t.search_all(query=\"dogs lang:en -is:retweet\", start_time=start_time, end_time=end_time, max_results=100)\n\n# Get all results page by page:\nfor page in search_results:\n    # Do something with the whole page of results:\n    # print(page)\n    # or alternatively, \"flatten\" results returning 1 tweet at a time, with expansions inline:\n    for tweet in ensure_flattened(page):\n        # Do something with the tweet\n        print(tweet)\n\n    # Stop iteration prematurely, to only get 1 page of results.\n    break\n```\n\n## Working with Generators\n\nTwarc will try to retrieve all available results and handle retries and rate limits for you. This can potentially retrieve more tweets than your monthly limit will allow. 
The command line interface has a `--limit` option, but the library returns generators and it is up to you to stop iterating when you have retrieved enough results.\n\nFor example, to only get 2 \"pages\" of followers max per user:\n\n```python\nfrom twarc.client2 import Twarc2\n\n# Your bearer token here\nt = Twarc2(bearer_token=\"A...z\")\n\nuser_ids = [12, 2244994945, 4503599627370241] # @jack, @twitterdev, @overflow64\n\n# Iterate over our target users\nfor user_id in user_ids:\n\n    # Iterate over pages of followers\n    for i, follower_page in enumerate(t.followers(user_id)):\n\n        # Do something with the follower_page here\n        print(f\"Fetched a page of {len(follower_page['data'])} followers for {user_id}\")\n\n        if i == 1: # Only retrieve the first two pages (enumerate starts from 0)\n            break\n```\n\n## twarc CSV\n\n`twarc-csv` is an extra plugin you can install:\n\n```\npip install twarc-csv\n```\n\nThis can also be used as a library, for example:\n\nIf you have a bunch of data, and want a DataFrame:\n\n```\nfrom twarc_csv import DataFrameConverter\n\n# Default options for Dataframe converter\nconverter = DataFrameConverter()\n\n# this can be a list or generator of individual tweets or pages or results.\njson_objects = [...] 
\n\ndf = converter.process(json_objects)\n```\n\nThis doesn't save any files, and converts everything in memory.\n\nIf you have a large file, you should use `CSVConverter` as before\n\n```\nfrom twarc_csv import CSVConverter\n\nwith open(\"input.json\", \"r\") as infile:\n    with open(\"output.csv\", \"w\") as outfile:\n        converter = CSVConverter(infile=infile, outfile=outfile)\n        converter.process()\n```\n\nor with additional options:\n\n```\nfrom twarc_csv import CSVConverter, DataFrameConverter\n\nconverter = DataFrameConverter(\n    input_data_type=\"tweets\",\n    json_encode_all=False,\n    json_encode_text=False,\n    json_encode_lists=True,\n    inline_referenced_tweets=True,\n    merge_retweets=True,\n    allow_duplicates=False,\n)\n\nwith open(\"results.jsonl\", \"r\") as infile:\n    with open(\"results.csv\", \"w\") as outfile:\n        converter = CSVConverter(infile=infile, outfile=outfile, converter=converter)\n        converter.process()\n\n```\n\n`DataFrameConverter` parameters correspond to the command line options: https://github.com/DocNow/twarc-csv#extra-command-line-options\n\nThe full list of valid `output_columns` are: https://github.com/DocNow/twarc-csv/blob/main/dataframe_converter.py#L13-L85 when using `input_data_type=\"tweets\"` and https://github.com/DocNow/twarc-csv/blob/main/dataframe_converter.py#L90-L115 when using `input_data_type=\"users\"`. Note that it won't extract users from tweets, these have to be already extracted from the JSON. 
`twarc-csv` can also process compliance output and counts output.\n\n## Search and write results to CSV example\n\nHere is a complete working example that searches for all recent tweets in the last few hours, writes a `dogs_results.jsonl` with the original responses, and then converts this to CSV:\n\n```python\nimport json\nfrom datetime import datetime, timezone, timedelta\n\nfrom twarc.client2 import Twarc2\nfrom twarc_csv import CSVConverter\n\n# Your bearer token here\nt = Twarc2(bearer_token=\"A...z\")\n\n# Start and end times must be in UTC\nstart_time = datetime.now(timezone.utc) + timedelta(hours=-3)\n# end_time cannot be immediately now, has to be at least 30 seconds ago.\nend_time = datetime.now(timezone.utc) + timedelta(minutes=-1)\n\nquery = \"dogs lang:en -is:retweet has:media\"\n\nprint(f\"Searching for \\\"{query}\\\" tweets from {start_time} to {end_time}...\")\n\n# search_results is a generator, max_results is max tweets per page, not total, 100 is max when using all expansions.\nsearch_results = t.search_recent(query=query, start_time=start_time, end_time=end_time, max_results=100)\n\n# Open the output file once, so that each page is added rather than overwriting the previous one:\nwith open(\"dogs_results.jsonl\", \"w\") as f:\n    # Get all results page by page:\n    for page in search_results:\n        # Write each page of results as one line of JSON:\n        f.write(json.dumps(page) + \"\\n\")\n        print(\"Wrote a page of results...\")\n\nprint(\"Converting to CSV...\")\n\n# This assumes `dogs_results.jsonl` is finished writing.\nwith open(\"dogs_results.jsonl\", \"r\") as infile:\n    with open(\"dogs_output.csv\", \"w\") as outfile:\n        converter = CSVConverter(infile, outfile)\n        converter.process()\n\nprint(\"Finished.\")\n```\n"
  },
  {
    "path": "docs/plugins.md",
    "content": "# Plugins\n\ntwarc v1 collected a set of utilities for working with tweet json in the\n[utils] directory of the git repository. This was a handy way to develop and\nshare snippets of code. But some utilities had different dependencies which\nweren't managed in a uniform way. Some of the utilities had slightly different\ninterfaces. They needed to be downloaded from GitHub manually and weren't\neasily accessible at the command line if you remembered where you put them.\n\nWith *twarc2* these utilities are now installable as plugins, which are made\navailable as subcommands using the same twarc2 command line. Plugins are\npublished separately from twarc on [PyPI] and are installed with [pip]. Here is\na list of some known plugins (if you write one please [let us know] so we can\nadd it to this list):\n\n* [twarc-ids](https://pypi.org/project/twarc-ids/): a simple example of printing the ids for tweets to use as a reference for creating plugins\n* [twarc-csv](https://pypi.org/project/twarc-csv/): export tweets to CSV, which is probably the first thing a researcher will want to do\n* [twarc-videos](https://pypi.org/project/twarc-videos): extract videos from tweets \n* [twarc-network](https://pypi.org/project/twarc-network): visualize tweets and users as a network graph\n* [twarc-timeline-archive](https://pypi.org/project/twarc-timeline-archive): routinely download tweet timelines for a list of users\n* [twarc-hashtags](https://pypi.org/project/twarc-hashtags): create a report of hashtags that are used in collected tweet data\n* Write your own, and [let us know] so we can add it here!\n\n## Writing a Plugin\n\nThe [twarc-ids] plugin provides an example of how to write plugins. This\nreference plugin simply reads collected tweet JSON data and writes out the tweet\nidentifiers. 
First you install the plugin:\n\n    pip install twarc-ids\n\nand then you use it:\n\n    twarc2 ids tweets.json > ids.txt\n\nInternally twarc's command line is implemented using the [click] library. The\n[click-plugins] module is what manages twarc2 plugins. Basically you import\n`click` and implement your plugin as you would any other click utility, for\nexample:\n\n```python\nimport json\nimport click\n\n@click.command()\n@click.argument('infile', type=click.File('r'), default='-')\n@click.argument('outfile', type=click.File('w'), default='-')\ndef ids(infile, outfile):\n    \"\"\"\n    Extract tweet ids from tweet JSON.\n    \"\"\"\n    for line in infile:\n        tweet = json.loads(line)\n        click.echo(tweet['data']['id'], file=outfile)\n```\n\nNote that the plugin takes input file *infile* and writes to an output file\n*outfile* which default to stdin and stdout respectively. This allows plugin\nutilities to be used as part of pipelines. You can add options using the\nstandard facilities that click provides if your plugin needs them.\n\nIf your plugin needs to talk to the Twitter API then just add the\n`@click.pass_obj` decorator which will ensure that the first parameter in\nyour function will be a Twarc2 client that is configured to use the\nclient's keys.\n\n```python\n@click.command()\n@click.argument('infile', type=click.File('r'), default='-')\n@click.argument('outfile', type=click.File('w'), default='-')\n@click.pass_obj\ndef ids(twarc_client, infile, outfile):\n    # do something with the twarc client here\n    pass\n```\n\nFinally you just need to create a `setup.py` file for your project that\nlooks something like this:\n\n```python\n\nimport setuptools\n\nsetuptools.setup(\n    name='twarc-ids',\n    version='0.0.1',\n    url='https://github.com/docnow/twarc-ids',\n    author='Ed Summers',\n    author_email='ehs@pobox.com',\n    py_modules=['twarc_ids'],\n    description='A twarc plugin to read Twitter data and output the tweet ids',\n    
install_requires=['twarc'],\n    setup_requires=['pytest-runner'],\n    tests_require=['pytest'],\n    entry_points='''\n        [twarc.plugins]\n        ids=twarc_ids:ids\n    '''\n)\n```\n\nThe key part here is the `entry_points` section which is what allows twarc2 to\ndiscover twarc.plugins dynamically at runtime, and also defines how the\nsubcommand maps to the plugin's function.\n\nIt's good practice to include a test or two for your plugin to ensure it works\nover time. Check out the example [here] for how to test command line utilities\neasily with click.\n\nTo publish your plugin on PyPI:\n\n```\npip install twine\npython setup.py sdist\ntwine upload dist/*\n# enter pypi login details\n```\n\n[twarc-ids]: https://github.com/docnow/twarc-ids/\n[PyPI]: https://python.org/pypi/\n[pip]: https://pip.pypa.io/en/stable/\n[click]: https://click.palletsprojects.com/\n[click-plugins]: https://github.com/click-contrib/click-plugins\n[here]: https://github.com/DocNow/twarc-ids/blob/main/test_twarc_ids.py\n[let us know]: https://github.com/docnow/twarc/issues/\n[utils]: https://github.com/DocNow/twarc/tree/main/utils\n"
  },
  {
    "path": "docs/resources.md",
    "content": "# Twarc Tutorials and Other Resources\n\nDocumentation here is largely auto generated from the code, which may not always be the most user friendly. Others have written great tutorials and other resources relating to using twarc, or working with the data generated by twarc. If you'd like to suggest additional resources that are relevant, please feel to open a pull request or open an issue.\n\n## An Introductory Video from the Australian Digital Observatory\n\nA [six minute video](https://www.youtube.com/watch?v=4DXEeM2AA9Y) by the [Australian Digital Observatory](https://www.digitalobservatory.net.au/) that shows some of the functionality of `twarc2` search, as well as how to use [Twitter's Query Builder](https://developer.twitter.com/apitools/query?query=) in conjunction with twarc.\n\n<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/4DXEeM2AA9Y\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n\n## Carpentries Lesson\n\n<https://carpentries-incubator.github.io/twitter-with-twarc/index.html>\n\nIncludes a step by step guide to collecting Twitter data using `twarc2`. It includes information on Twitter's JSON format, and how to manage collected data. \n\n## UVA Library's Scholars' Lab Twarc Tutorial\n\n<https://scholarslab.github.io/learn-twarc/>\n\nA beginner guide that also goes through command line and Python setup. Uses `twarc` for v1.1 API examples, not `twarc2`.\n\n## Guide from TwitterDev\n\n<https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research>\n\nTwitter have released a 101 guide on using the Academic Access endpoints. 
It uses `twarc2` as a library as opposed to command line, and gives code examples in R too.\n\n## Twitter Data Collection & Analysis\n\n<https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/12-Twitter-Data.html>\n\nLesson from Introduction to Cultural Analytics & Python\n\n## Getting Data from Twitter: A twarc tutorial\n\n<https://github.com/alblaine/twarc-tutorial>\n\nUses `twarc` for `v1.1` endpoints and has step by step examples for using some of the `/utils` scripts.\n\n## UCSB Library Twarc Tutorials\n\n<https://ucsb-collaboratory.github.io/twitter/>\n\nUses both `twarc` and `twarc2`\n\n## Introduction to full archive searching using twarc v2\n\n<https://github.com/jeffcsauer/twarc-v2-tutorials/blob/master/twarc_fas.md>\n\nAn example of using `twarc2` search, but be sure to install twarc using `pip install twarc` not the link to the v2 branch zip.\n\n\n"
  },
  {
    "path": "docs/tutorial.md",
    "content": "# Twarc Tutorial\n\nTwarc is a command line tool for collecting Twitter data via Twitter's web Application Programming Interface (API). This tutorial is aimed at researchers who are new to collecting social media data, and who might be unfamiliar with command line interfaces.\n\nBy the end of this tutorial, you will have:\n\n1. Familiarised yourself with interacting with a command line application via a terminal\n2. Setup Twarc so you can collect data from the Twitter API (version 2)\n3. Constructed two Twitter search queries to address a specific research question\n4. Collected data for those two queries\n5. Processed the collected data into formats suitable for other analysis\n6. Performed a simple quantitative comparison of the two collections using Python\n7. Prepared a dataset of tweet identifiers that can be shared with other researchers\n\n\n## Motivating example\n\nThis tutorial is built around collecting data from Twitter to address the following research question:\n\n***Which monotreme is currently the coolest - the echidna or the platypus?***\n\nWe'll answer this question with a simple quantitative approach to analysing the collected data: counting the volume of likes that tweets mentioning each species of animal accrue. For this tutorial, the species that gets the most likes on tweets is going to be considered the \"coolest\". This is a very simplistic quantitative approach, just to get you started on collecting and analysing Twitter data. To seriously study the relative coolness of monotremes, there are a wide variety of more appropriate (but also more involved) methods.\n\n## Introduction to twarc and the Twitter API\n\n### What is an API?\n\nAn **Application Programming Interface** (API) is a common method for software applications and services to allow other systems or people to programmatically interact with them. For example, Twitter has an API which allows external systems to make requests to Twitter for information or actions. 
Twitter (and many other web apps and services) uses an HTTP REST API, meaning that to interact with Twitter through the API you can send an HTTP request to a specific URL provided by Twitter. Twitter affords many different URLs (also known as **endpoints**) which have been designed for different purposes (more about that later). Assuming that your HTTP request is valid, Twitter will respond with a bundle of information in [JSON format](https://en.wikipedia.org/wiki/JSON) for you.\n\nTwarc acts as a tool or an intermediary for you to interact with the Twitter API, so that you don't have to manage the details of how exactly to make requests to the Twitter API and handle Twitter's responses. Twarc commands correspond roughly with Twitter API endpoints. For example, when you use Twarc to fetch the timeline of a specific Twitter account (we'll use @Twitter in this example), this is the sequence of events:\n\n1. You run `twarc2 timeline Twitter tweets.jsonl`\n\n2. twarc2 makes a request on your behalf to the [Twitter v2 user lookup API endpoint](https://developer.twitter.com/en/docs/twitter-api/users/lookup/introduction) in order to find the user ID for the @Twitter account, and receives a response from the Twitter API server with that user ID\n\n3. twarc2 makes a request on your behalf to the [Twitter v2 timeline API endpoint](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/introduction), using the user ID determined in step 2, and receives a response (or several responses) from the Twitter API server with @Twitter's tweets\n\n4. twarc2 consolidates the timeline responses from step 3 and outputs them according to your initial command, in this case as `tweets.jsonl`\n\nThere are a great many resources on the internet to learn more about APIs more generally and how to use them in a variety of contexts. 
Here are a few introductory articles:\n\n- [How to Geek: What is an API, and how do developers use them?](https://www.howtogeek.com/343877/what-is-an-api/)\n- [IBM: What is an API?](https://www.ibm.com/cloud/learn/api)\n\nMore detailed information on APIs and working with them:\n\n- [Zapier: An introduction to APIs](https://zapier.com/learn/apis/)\n- [RealPython: Python and REST APIs: Interacting with web services](https://realpython.com/api-integration-in-python/)\n\n### What can you do with the Twitter API?\n\nThe Twitter API is very popular in academic communities for good reason: it is one of the most accessible and research-friendly of the popular social media platforms at present. The Twitter API is well-established and offers a broad range of possibilities for data collection.\n\nHere are some examples of things you can do with the Twitter API:\n\n- Find historical tweets containing words or phrases during a time window of interest\n- Collect live tweets as they are posted matching specific search criteria\n- Collect tweets using specific hashtags or mentioning particular users\n- Collect tweets made by a particular user account\n- Collect engagement metrics including likes and retweets for specific tweets of interest\n- Map Twitter account followers and followees within or around a group of users\n- Trace conversations and interactions around users or tweets of interest\n\nYou may notice as you read about the Twitter API that there are two versions of the Twitter API - version 1.1 and version 2. At the time of writing, Twitter is providing both versions of the API, but at some unknown point in the future version 1.1 may be discontinued. Twarc can handle either API version: the `twarc` command uses version 1.1 of the Twitter API, the `twarc2` command uses version 2. Take care when reading documentation and tutorials as to which Twitter API version is being referenced. 
**This tutorial uses version 2 of the Twitter API**.\n\nTwitter API endpoints can be structured either around tweets or around user accounts. For example, the search endpoint provides lists of tweets - user information is included, but the data is focused on the tweets.\n\nThe available endpoints and their details are evolving as Twitter develops and releases its API version 2, so for the most up to date information refer to [the Twitter API documentation](https://developer.twitter.com/en/docs/twitter-api). Some of the most used endpoints for research purposes are:\n\n- [search](https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction): This is the endpoint used to search tweets, whether recent or historical.\n- [lookup](https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/introduction): The lookup endpoints are useful when you have IDs of tweets of interest and want to fetch further data about those tweets - known in the Twarc community as **hydrating** the tweets.\n- [follows](https://developer.twitter.com/en/docs/twitter-api/users/follows/introduction): The follows endpoint allows collecting information about who follows who on Twitter.\n\nWith the Twitter API, you can get data related to all types of objects that make up the Twitter experience, including [tweets](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet) and [users](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user). 
The Twitter documentation provides full details, and these two pages are very useful to bookmark!\n\nThe Twitter documentation also provides some useful tools for constructing searches and queries:\n\n- [Twitter's v2 API Query Builder](https://developer.twitter.com/apitools/query?query=)\n- [Building high quality filters for getting Twitter data](https://developer.twitter.com/en/docs/tutorials/building-high-quality-filters)\n\nThe rest of this tutorial is going to focus on using the Twitter search API endpoint to retrieve tweets containing content relevant to the research question. We've chosen to focus on this because:\n\n1. With the rich functionality available in the search API the data collection for many projects can be condensed down to a few carefully chosen searches.\n2. With [academic research access](https://developer.twitter.com/en/products/twitter-api/academic-research) it's possible to search the entire Twitter archive, making search uniquely powerful among the endpoints Twitter supports.\n\n### Introduction to twarc\n\nTwarc is at its core an application for interacting with the Twitter API, reading results from the different functionality the API offers, and safely writing the collected data to your machine for further analysis. Twarc handles the mechanical details of interacting with the Twitter API like including information to authenticate yourself, making HTTP requests to the API, formatting data in the right way, and retrying when things on the internet fail. Your job is to work out:\n\n1. Which endpoint you want to call on from the Twitter API.\n2. Which data you want to retrieve from that endpoint.\n\nTwarc is a command line based application - to use twarc you type a command specifying a particular action, and the results of that command are shown as text on screen. If you haven't used a command line interface before, don't worry! 
Although there is a bit of a learning curve at the beginning, you will quickly get the hang of it - and because everything is a typed command, it is very easy to record and share _exactly_ how you collected data with other people.\n\n\n## Considerations when using social media data for research\n\nBefore we dive into the details, it's worth mentioning some broader issues you will need to keep in mind when working with social media data. This is by no means an exhaustive list of issues and is intended as a starting point for further enquiry.\n\n### Ethical use of \"public\" communication\n\nEven though most tweets on Twitter are public, in that they're accessible to anyone on the web, most users of Twitter don't have any expectation that researchers will be reading their tweets for the purpose of research. Researchers need to be mindful of this when working with data from Twitter, and user expectations should be considered as part of the study design. The Association of Internet Researchers has established [Ethical Guidelines for Internet Research](https://aoir.org/ethics/) which are a good starting point for the higher level considerations.\n\nWork has also been done specifically looking at [Twitter users' expectations](https://journals.sagepub.com/doi/10.1177/2056305118763366), with a number of key concerns outlined. For this tutorial we're going to be taking a high level quantitative evaluation of very recent Twitter data, which distances ourselves from the specific tweets and users creating them and aligns with these broader ethical considerations.\n\nFinally, because tweets (and the internet more generally) are searchable, we need to keep in mind that quoting a tweet in whole or part might allow easy reidentification of any specific user or tweet. 
For this reason care needs to be taken when reporting material from tweets, and common practices in qualitative research may not align with Twitter users' interests or expectations.\n\n### Copyright\n\nThis may vary according to where you are in the world, but tweets, including the text of the tweet and any attached photos and videos, are likely to be protected by copyright. As well as the Twitter Developer Agreement considerations in the next section, this may limit what you can do with tweets and media downloaded from Twitter.\n\n\n### Twitter's terms of service\n\nWhen you signed up for a Twitter developer account you agreed to follow Twitter's [Developer Agreement and Policy](https://developer.twitter.com/en/developer-terms/agreement-and-policy). This agreement constrains how you can use and share Twitter data. While the primary purpose of this agreement is to protect Twitter the company, it also incorporates some elements aimed at protecting users of Twitter.\n\nSome particular things to note from the Developer Agreement are:\n\n- Limits on how geolocation data can be used\n- How to share Twitter data\n- Dealing with deleted tweets\n\nNote that researchers' use of deleted tweets was also a key concern for [Twitter users](https://journals.sagepub.com/doi/10.1177/2056305118763366). This tutorial won't cover geolocation data at all, but will cover approaches to sharing Twitter data and removing deleted material from collections.\n\n\n## Setup\n\nTwarc is a command line application, written in the Python programming language. To get Twarc running on our machines, we're going to need to install Python, then install Twarc itself, and we will also need to set up a Twitter developer account.\n\n### Twitter developer access\n\n[Start here](https://developer.twitter.com/en/apply-for-access) to apply for a Twitter developer account and follow the steps in [our developer access guide](twitter-developer-access.md). 
For this tutorial, you can skip step 2, as we won't require academic access.\n\nOnce you have the **Bearer Token**, you are ready for the next step. This token is like a password, so you shouldn't share it with other people. You will also need to be able to enter this token once to configure Twarc, so it would be best to copy and paste it to a text file on your local machine until we've finished configuration.\n\n### Install Python\n\n#### Windows\n\nInstall the latest version [for Windows](https://www.python.org/downloads/windows/). During the installation, make sure the *Add Python to PATH* option is selected/ticked.\n\n![](images/win_installer.png)\n\n#### Mac\n\nInstall the latest version [for Mac](https://www.python.org/downloads/macos/). No additional setup should be necessary for Python.\n\n\n### Install Twarc and other utilities\n\nFor this tutorial we're going to install three Python packages, `twarc`, an extension called `twarc-csv`, and `pandas`, a Python library for data analysis. We will use a command line interface to install these packages. On Windows we will use the `cmd` console, which can be found by searching for `cmd` from the start menu - you should see a prompt like the below screenshot. On Mac you can open the `Terminal` app.\n\n![Screenshot showing the opening of the cmd window on windows](images/CMD.png)\n\nOnce you have a terminal open we can run the following command to install the necessary packages:\n\n```shell\npip install twarc twarc-csv pandas\n```\n\nYou should see output similar to the following:\n\n![](images/pip_install.png)\n\n### Our first command: making sure everything is working\n\nLet's open a terminal and get started - just like when installing twarc, you will want to use the `cmd` application on windows and the `Terminal` application on Mac.\n\nThe first command we want to run is to check if everything in twarc is installed and working correctly. We'll use twarc's builtin `help` for this. 
Running the following command should show you a brief overview of the functionality that the twarc2 command provides and some of the options available:\n\n```shell\ntwarc2 --help\n```\n\n![](images/twarc_help.png)\n\nTwarc is structured like many other command line applications: there is a single main command, `twarc2`, to launch the application, and then you provide a subcommand, or additional arguments, or flags to provide additional context about what that command should actually do. In this case we're only launching the `twarc2` command, and providing a single _flag_ `--help` (the double-dash syntax is usually used for this). Most terminal applications will have a `--help` or `-h` flag that will provide some useful information about the application you're running. This often includes example usage, options, and a short description.\n\nNote also that often when reading commands out loud, the space in between words is not mentioned explicitly: the command above (`twarc2 --help`) might be read as \"twarc-two dash dash help\".\n\nThough we won't cover the command line outside of using Twarc in this tutorial, your operating system's command line functionality is extensive and can help you automate a lot of otherwise tedious tasks. If you're interested in learning more the [Software Carpentry lesson on the shell](https://swcarpentry.github.io/shell-novice/) is a good starting point.\n\n\n### Configuring twarc with our bearer token\n\nThe next thing we want to do is tell twarc about our bearer token so we can authenticate ourselves with the Twitter API. This can be done using twarc's `configure` command. In this case we're going to use the `twarc2` main command, and provide it with the subcommand `configure` to tell twarc we want to start the configuration process.\n\n```\ntwarc2 configure\n```\n\nOn running this command twarc will prompt us to paste our bearer token, as shown in the screenshot below. 
Note that for many command line terminals on Windows, using the usual `Ctrl+V` keyboard shortcut will not work by default. If this happens, try right-clicking, then clicking `paste` to achieve the same thing. After entering our token, we will be prompted to enter additional information - this is not necessary for this tutorial, so we will skip this step by typing the letter `n` and hitting `enter`.\n\n![](images/twarc_configure.png)\n\n\n## Introduction to Twitter search and counts\n\nTo tackle the research question we're interested in we're going to use the search endpoint to retrieve two sets of tweets: those using the word echidna, and those using the word platypus.\n\nThere are two key endpoints that the Twitter API provides for search: a `search` endpoint to retrieve tweets matching a particular query, and a `counts` endpoint to tell you how many tweets match that query over time. It's always a good idea to start with the `counts` endpoint, because:\n\n- it lets you establish early on how many tweets you will need to deal with: too many or too few matching tweets will help you determine whether your search strategy is reasonable\n- it can take a long time to retrieve large numbers of tweets and it's better to know in advance how much data you will need to deal with\n- the count and trend over time are useful in and of themselves\n- if you accidentally search for the wrong thing you can consume your monthly quota of tweets without collecting anything useful\n\nLet's get started with the `counts` API - in twarc this is accessible by the command `counts`. As before `twarc2` is our entry command, `counts` is the subcommand we're interested in, and `echidna` is what we're interested in searching for on Twitter (the query).\n\n```shell\ntwarc2 counts echidna\n```\n\nYou should see something like the below screenshot - and yes, this output isn't very readable! 
By default twarc shows us the response in the JSON format directly from the Twitter API, so it's not great for using directly on the command line.\n\n![](images/twarc_count_echidna.png)\n\nLet's improve this by updating our command to:\n\n```shell\ntwarc2 counts echidna --text --granularity day\n```\n\nAnd we should see output like below (your results will be different, because you're searching on a different day to when these screenshots were captured). Note that the `--text` and `--granularity` are optional flags provided to the `twarc2 counts` command, we can see other options by running `twarc2 counts --help`. In this case `--text` returns a simplified text output for easier reading, and `--granularity day` is passed to the Twitter API to specify that we're interested only in daily counts of tweets, not the default hourly count.\n\n```shell\n2022-11-03T02:49:02.000Z - 2022-11-04T00:00:00.000Z: 974\n2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 802\n2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 527\n2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 554\n2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 883\n2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 723\n2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 1,567\n2022-11-10T00:00:00.000Z - 2022-11-10T02:49:02.000Z: 219\n```\n\nNote that this is only the count for the last seven days, which is the level of search functionality available for all developers via the standard track of the Twitter API. If you have access to the [Twitter Academic track](https://developer.twitter.com/en/use-cases/do-research/academic-research), you can switch to searching the full Twitter archive from the `counts` and `search` commands by adding the `--archive` flag.\n\nTwitter search is powerful and provides many rich options. However, it also functions a little differently to most other search engines, because Twitter search does not focus on _ranking_ tweets by relevance (like a web search engine does). 
Instead, Twitter search via the API focuses on retrieving all matching tweets in chronological order. In other words, Twitter search uses the [Boolean model of searching](https://nlp.stanford.edu/IR-book/html/htmledition/boolean-retrieval-1.html), and returns the documents that match exactly what you provide and nothing else.\n\nLet's work through this example a little further, first we want to expand to capture more variants of the word echidna - note that Twitter search via the API matches on the whole word, so `echidna` and `echidnas` are different. You can also see that we've added some double quotes around our query - without these quotes the individual pieces of our query might be interpreted as additional arguments to our search command:\n\n```shell\ntwarc2 counts \"echidna echidna's echidnas\" --granularity day --text\n```\n\n```console\n2022-11-03T03:40:44.000Z - 2022-11-04T00:00:00.000Z: 0\n2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 0\n2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 0\n2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 0\n2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 0\n2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 0\n2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 0\n2022-11-10T00:00:00.000Z - 2022-11-10T03:40:44.000Z: 0\n```\n\nSuddenly we're retrieving very few results! 
By default, if you don't specify an operator, the Twitter API assumes you mean AND, meaning that all of the words must be present - we will need to explicitly say that we want any of these words using the OR operator:\n\n```shell\ntwarc2 counts \"echidna OR echidna's OR echidnas\" --granularity day --text\n```\n\n```console\n2022-11-03T03:42:10.000Z - 2022-11-04T00:00:00.000Z: 964\n2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 846\n2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 552\n2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 573\n2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 962\n2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 758\n2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 1,591\n2022-11-10T00:00:00.000Z - 2022-11-10T03:42:10.000Z: 288\n```\n\nWe can also apply operators based on other content or properties of tweets (see more [search operators](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#list) in the Twitter API documentation). Because we've decided to focus on the number of likes on tweets as our measure of coolness, we want to exclude retweets. If we don't exclude retweets, our like measure might be heavily influenced by one highly retweeted tweet.\n\nWe can do this using the `-` (minus) operator, which allows us to exclude tweets matching a criterion, in conjunction with the `is:retweet` operator, which filters on whether the tweet is a retweet or not. 
If we applied just the `is:retweet` operator we'd only see the retweets, the opposite of what we want.\n\n```shell\ntwarc2 counts \"echidna OR echidna's OR echidnas -is:retweet\" --granularity day --text\n```\n\n```text\n2022-11-03T03:43:02.000Z - 2022-11-04T00:00:00.000Z: 957\n2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 826\n2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 546\n2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 570\n2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 931\n2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 750\n2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 1,587\n2022-11-10T00:00:00.000Z - 2022-11-10T03:43:02.000Z: 288\n```\n\nThere's one tiny gotcha from the Twitter API here, which is important to know about. AND operators are applied before OR operators, even if the AND is not specified by the user. The query we wrote above actually means something like below. We're only removing the retweets containing the word \"echidnas\", not all retweets:\n\n```\nechidna OR echidna's OR (echidnas AND -is:retweet)\n```\n\nWe can make our intent explicit by adding parentheses to group terms. This is a good idea in general to make your meaning clear, even if you know all of the operator rules.\n\n```shell\ntwarc2 counts \"(echidna OR echidna's OR echidnas) -is:retweet\" --granularity day --text\n```\n\nNow for the purposes of this tutorial we're going to stop exploring any further, but we could continue to refine and improve this query to match our research question. Twitter lets you build very long queries (up to 512 characters on the standard track and 1024 for the academic track) so you have plenty of scope to express yourself. As mentioned earlier, [Twitter's Query Builder](https://developer.twitter.com/apitools/query?query=) is an excellent tool for helping you to build your query.\n\nIf we apply the same kind of process to the platypus case, we might end up with something like the following. 
In this case it was necessary to use the [Twitter search web interface](https://twitter.com/explore) to find some of the variations in the word platypus:\n\n```shell\ntwarc2 counts \"(platypus OR platypus's OR platypi OR platypusses OR platypuses) -is:retweet\" --granularity day --text\n```\n\nHaving decided on the actual queries to run and examined the counts, now it's time to actually collect the tweets! We can take the queries we ran earlier, replace the `counts` command with the `search` command and remove the `counts` specific arguments to get:\n\n```shell\ntwarc2 search \"(echidna OR echidna's OR echidnas) -is:retweet\" echidna.json\n\ntwarc2 search \"(platypus OR platypus's OR platypi OR platypusses OR platypuses) -is:retweet\" platypus.json\n```\n\nRunning these two commands will save the tweets matching each of those searches to two files on our disk, which we will use for the next sessions.\n\n![Screenshot showing the progress of the tweets being downloaded](images/twarc_progress_download.png)\n\nTIP: if you're not sure where the files above have been saved, you can run the command `cd` on Windows, or `pwd` on Mac to have your shell print out the folder in the filesystem where twarc has been working.\n\n## Understanding and transforming Twitter JSON data\n\nNow that we've collected some data, it's time to take a look at it. Let's start by viewing the collected data in its plainest form: as a text file. Although we named the file with an extension of `.json`, this is just a convention: the actual file content is plain text in the [JSON](https://en.wikipedia.org/wiki/JSON) format. Let's open this file with our inbuilt text editor (Notepad on Windows, TextEdit on Mac).\n\n![Screenshot of the json file in notepad](images/json_echidna.png)\n\nYou'll notice immediately that there is a *lot* of data in that file: tweets are rich objects, and twarc by default captures as much information as Twitter makes available. 
Further, the Twitter API provides data in a format that makes it convenient for machines to work with, but not so much for humans.\n\n## Making a CSV file from our collected tweets\n\nWe don't recommend trying to manually parse this raw data unless you have specific needs that aren't covered by existing tools. So we're going to use the `twarc-csv` package that we installed earlier to do the heavy lifting of transforming the collected JSON into a more friendly comma-separated value ([CSV](https://en.wikipedia.org/wiki/Comma-separated_values)) file. CSV is a simple plaintext format, but unlike JSON format is easy to import or open with a spreadsheet.\n\nThe `twarc-csv` package lets us use a `csv` command to transform the files from twarc:\n\n```shell\ntwarc2 csv echidna.json echidna.csv\n\ntwarc2 csv platypus.json platypus.csv\n```\n\nIf we look at these files in our text editor again, we'll see a nice structure of one line per tweet, with all of the many columns for that tweet.\n\n![Screenshot of the plaintext CSV file in notepad](images/echidna_csv.png)\n\nSince we're going to do more analysis with the Pandas library to answer our question, we will want to create the CSV with only the columns of interest. This will reduce the time and amount of computer memory/RAM you need to load your dataset. For example, the following commands produce CSV files with a small number of fields:\n\n```shell\ntwarc2 csv --output-columns id,created_at,author_id,text,referenced_tweets.retweeted.id,public_metrics.like_count echidna.json echidna_minimal.csv\n\ntwarc2 csv --output-columns id,created_at,author_id,text,referenced_tweets.retweeted.id,public_metrics.like_count platypus.json platypus_minimal.csv\n```\n\n### The problem with Excel\n\nIt's tempting to try to open these CSV files directly in Excel, but if you do you're probably going to notice one or more of the following problems, as illustrated below:\n\n1. The ID columns are likely to be broken.\n2. 
Emoji and languages that don't use latin characters may not appear correctly.\n3. Tweets may be broken up on newlines.\n4. Excel can only support 1,048,576 rows - it's very easy to collect tweet datasets bigger than this.\n\n![Screenshot of the broken CSV file opened directly in excel](images/excel_echidna.png)\n\nIf you save a file from Excel with any of those problems that file is no longer useful for most purposes (this is a common and longstanding problem with using spreadsheet software, that affects many fields. For example in genomics: https://www.nature.com/articles/d41586-021-02211-4). While it is possible to make Excel do the right thing with your data, it takes more work, and a single mistake can lead to loss of important data. Therefore our recommendation is, if possible, to avoid the use of spreadsheets for analysing Twitter data.\n\n### Working with Pandas\n\nIf you are going to be using the scientific Python library [Pandas](https://pandas.pydata.org/) for any processing or analysis, you may wish to use Pandas methods. Pandas can be used to load and manipulate data like we have in our CSV file. Note that for this section we're going to run a very simple computation, the references will have links to more extensive resources for learning more.\n\n```python\n# process_monotremes.py\n\nimport pandas\n\nechidna = pandas.read_csv(\"echidna_minimal.csv\")\nplatypus = pandas.read_csv(\"platypus_minimal.csv\")\n\nechidna_likes = echidna[\"public_metrics.like_count\"].sum()\nplatypus_likes = platypus[\"public_metrics.like_count\"].sum()\n\nprint(f\"Total likes on echidna tweets: {echidna_likes}. 
Total likes on platypus tweets: {platypus_likes}.\")\n```\n\nRun this script through Python to see which of the monotremes is the coolest:\n\n```shell\npython process_monotremes.py\n```\n\n### Answering the research question: which monotreme is the coolest?\n\nAt the time of creating this tutorial, running the above script on the freshly collected data gives the following result:\n\n```shell\nTotal likes on echidna tweets: 1787652. Total likes on platypus tweets: 3462715.\n```\n\nOn that basis, we can conclude that at the time of running this search the platypus is nearly twice as cool as the echidna based on Twitter likes.\n\nOf course this is a simplistic approach to answering this specific research question - we could have made many other choices. Even within a simple quantitative approach based on metrics, we could have chosen to look at other engagement counts like the number of retweets, or looked at the number of followers of the accounts tweeting about each animal (because a \"cooler\" account will have more followers). Much of the challenge in using Twitter for research lies both in asking the right research question and in choosing the right approach to the data to address that research question.\n\n## Prepare a dataset for sharing/using a shared dataset\n\nHaving performed this analysis and come to a conclusion, it is good practice to share the underlying data so other people can reproduce these results (with some caveats). Because we want to preserve Twitter users' agency over the availability of their content, and to comply with Twitter's Developer Agreement, we can do this by creating a dataset of tweet IDs. 
Instead of sharing the content of the tweets, we can share the unique ID for that tweet, which allows others to `hydrate` the tweets by retrieving them again from the Twitter API.\n\nThis can be done as follows using twarc's `dehydrate` command:\n\n```shell\ntwarc2 dehydrate --id-type tweets platypus.json platypus_ids.txt\n\ntwarc2 dehydrate --id-type tweets echidna.json echidna_ids.txt\n```\n\nThese commands will produce the two text files, with each line in these files containing the unique ID of the tweet.\n\nTo `hydrate`, or retrieve the tweets again, we can use the corresponding commands:\n\n```shell\ntwarc2 hydrate platypus_ids.txt platypus_hydrated.json\n\ntwarc2 hydrate echidna_ids.txt echidna_hydrated.json\n```\n\nNote that the hydrated files will include fewer tweets: tweets that have been deleted, or tweets by accounts that have been deleted, suspended, or protected, will not be included in the file. Note also that hydrating a dataset also means that engagement metrics like retweets and likes will be up to date for tweets that are still available.\n\n\n## Suggested resources\n\nYou can find some additional links and resources in the [resources section](https://twarc-project.readthedocs.io/en/latest/resources/) of the twarc documentation.\n"
  },
  {
    "path": "docs/twarc1_en_us.md",
    "content": "twarc1\n=====\n\n***For information about working with the Twitter V2 API please see the [twarc2](https://twarc-project.readthedocs.io/en/latest/twarc2/) page.***\n\n---\n\ntwarc is a command line tool and Python library for archiving Twitter JSON data.\nEach tweet is represented as a JSON object that is\n[exactly](https://dev.twitter.com/overview/api/tweets) what was returned from\nthe Twitter API.  Tweets are stored as [line-oriented JSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON).  twarc will handle\nTwitter API's [rate limits](https://dev.twitter.com/rest/public/rate-limiting)\nfor you. In addition to letting you collect tweets twarc can also help you\ncollect users, trends and hydrate tweet ids.\n\ntwarc was developed as part of the [Documenting the Now](http://www.docnow.io)\nproject which was funded by the [Mellon Foundation](https://mellon.org/).\n\n## Install\n\nBefore using twarc you will need to register an application at\n[apps.twitter.com](http://apps.twitter.com). Once you've created your\napplication, note down the consumer key, consumer secret and then click to\ngenerate an access token and access token secret. With these four variables\nin hand you are ready to start using twarc.\n\n1. install [Python 3](http://python.org/download)\n2. 
[pip](https://pip.pypa.io/en/stable/installing/) install twarc:\n\n```\n    pip install --upgrade twarc\n```\n\n### Homebrew (macOS only)\n\nFor macOS users, you can also install `twarc` via [Homebrew](https://brew.sh/):\n\n```bash\n$ brew install twarc\n```\n\n### Windows\n\nIf you installed with pip and see a \"failed to create process\" when running twarc try reinstalling like this:\n\n    python -m pip install --upgrade --force-reinstall twarc\n\n## Quickstart:\n\nFirst you're going to need to tell twarc about your application API keys and\ngrant access to one or more Twitter accounts:\n\n    twarc configure\n\nThen try out a search:\n\n    twarc search blacklivesmatter > search.jsonl\n\nOr maybe you'd like to collect tweets as they happen?\n\n    twarc filter blacklivesmatter > stream.jsonl\n\nSee below for the details about these commands and more.\n\n## Usage\n\n### Configure\n\nOnce you've got your application keys you can tell twarc what they are with the\n`configure` command.\n\n    twarc configure\n\nThis will store your credentials in a file called `.twarc` in your home\ndirectory so you don't have to keep entering them in. If you would rather supply\nthem directly you can set them in the environment (`CONSUMER_KEY`,\n`CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) or using command line\noptions (`--consumer_key`, `--consumer_secret`, `--access_token`,\n`--access_token_secret`).\n\n### Search\n\nThis uses Twitter's [search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets) to download *pre-existing* tweets matching a given query.\n\n    twarc search blacklivesmatter > tweets.jsonl\n\nIt's important to note that `search` will return tweets that are found within a\n7 day window that Twitter's search API imposes. 
If this seems like a small\nwindow, it is, but you may be interested in collecting tweets as they happen\nusing the `filter` and `sample` commands below.\n\nThe best way to get familiar with Twitter's search syntax is to experiment with\n[Twitter's Advanced Search](https://twitter.com/search-advanced) and copy and\npaste the resulting query from the search box. For example here is a more\ncomplicated query that searches for tweets containing either the\n\\#blacklivesmatter or #blm hashtags that were sent to deray.\n\n    twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl\n\nYou also should definitely check out Igor Brigadir's *excellent* reference guide\nto the Twitter Search syntax:\n[Advanced Search on Twitter](https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md).\nThere are lots of hidden gems in there that the advanced search form doesn't\nmake readily apparent.\n\nTwitter attempts to code the language of a tweet, and you can limit your search\nto a particular language if you want using an [ISO 639-1] code:\n\n    twarc search '#blacklivesmatter' --lang fr > tweets.jsonl\n\nYou can also search for tweets with a given location, for example tweets\nmentioning *blacklivesmatter* that are within 1 mile of the center of Ferguson,\nMissouri:\n\n    twarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl\n\nIf a search query isn't supplied when using `--geocode` you will get all tweets\nrelevant for that location and radius:\n\n    twarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl\n\n### Filter\n\nThe `filter` command will use Twitter's [statuses/filter](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/api-reference/post-statuses-filter) API to collect tweets as they happen.\n\n    twarc filter blacklivesmatter,blm > tweets.jsonl\n\nPlease note that the syntax for Twitter's track queries is significantly\ndifferent from the query syntax of their search API. 
Consult the\n[track documentation](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/guides/basic-stream-parameters#track) for how best to express the filter option you are using.\n\nUse the `follow` command line argument if you would like to collect tweets from\na given user id as they happen. This includes retweets. For example this will\ncollect tweets and retweets from CNN:\n\n    twarc filter --follow 759251 > tweets.jsonl\n\nYou can also collect tweets using a bounding box. Note: the leading dash needs\nto be escaped in the bounding box or else it will be interpreted as a command\nline argument!\n\n    twarc filter --locations \"\\-74,40,-73,41\" > tweets.jsonl\n\nYou can use the `lang` command line argument to pass in an [ISO 639-1] language\ncode to limit to, and since the filter stream allows you to filter by one or more\nlanguages, it is repeatable. So this would collect tweets that mention Paris or\nMadrid that were made in French or Spanish:\n\n    twarc filter paris,madrid --lang fr --lang es\n\nIf you combine filter and follow options they are OR'ed together. For example\nthis will collect tweets that use the blacklivesmatter or blm hashtags and also\ntweets from user CNN:\n\n    twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl\n\nBut combining locations and languages will effectively result in an AND. 
For\nexample this will collect tweets from the greater New York area that are in\nSpanish or French:\n\n    twarc filter --locations \"\\-74,40,-73,41\" --lang es --lang fr\n\n### Sample\n\nUse the `sample` command to listen to Twitter's [statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) API for a \"random\" sample of recent public statuses.\n\n    twarc sample > tweets.jsonl\n\n### Dehydrate\n\nThe `dehydrate` command generates an id list from a file of tweets:\n\n    twarc dehydrate tweets.jsonl > tweet-ids.txt\n\n### Hydrate\n\ntwarc's `hydrate` command will read a file of tweet identifiers and write out the tweet JSON for them using Twitter's [status/lookup](https://dev.twitter.com/rest/reference/get/statuses/lookup) API.\n\n    twarc hydrate ids.txt > tweets.jsonl\n\nTwitter API's [Terms of Service](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) discourage people from making large amounts of raw Twitter data available on the Web.  The data can be used for research and archived for local use, but not shared with the world. Twitter does allow files of tweet identifiers to be shared, which can be useful when you would like to make a dataset of tweets available.  You can then use Twitter's API to *hydrate* the data, or to retrieve the full JSON for each identifier. 
This is particularly important for [verification](https://en.wikipedia.org/wiki/Reproducibility) of social media research.\n\n### Users\n\nThe `users` command will return user metadata for the given screen names.\n\n    twarc users deray,Nettaaaaaaaa > users.jsonl\n\nYou can also give it user ids:\n\n    twarc users 1232134,1413213 > users.jsonl\n\nIf you want you can also use a file of user ids, which can be useful if you are\nusing the `followers` and `friends` commands below:\n\n    twarc users ids.txt > users.jsonl\n\n### Followers\n\nThe `followers` command will use Twitter's [follower id API](https://dev.twitter.com/rest/reference/get/followers/ids) to collect the follower user ids for exactly one user screen name per request, specified as an argument:\n\n    twarc followers deray > follower_ids.txt\n\nThe result will include exactly one user id per line. The response order is\nreverse chronological: most recent followers first.\n\n### Friends\n\nLike the `followers` command, the `friends` command will use Twitter's [friend id API](https://dev.twitter.com/rest/reference/get/friends/ids) to collect the friend user ids for exactly one user screen name per request, specified as an argument:\n\n    twarc friends deray > friend_ids.txt\n\n### Trends\n\nThe `trends` command lets you retrieve information from Twitter's API about trending hashtags. You need to supply a [Where On Earth](https://web.archive.org/web/20180102203025/https://developer.yahoo.com/geo/geoplanet/) identifier (`woeid`) to indicate what trends you are interested in. 
For example, here's how you can get the current trends for St. Louis:\n\n    twarc trends 2486982\n\nUsing a `woeid` of 1 will return trends for the entire planet:\n\n    twarc trends 1\n\nIf you aren't sure what to use as a `woeid` just omit it and you will get a list\nof all the places for which Twitter tracks trends:\n\n    twarc trends\n\nIf you have a geo-location you can use it instead of the `woeid`.\n\n    twarc trends 39.9062,-79.4679\n\nBehind the scenes twarc will look up the location using Twitter's [trends/closest](https://dev.twitter.com/rest/reference/get/trends/closest) API to find the nearest `woeid`.\n\n### Timeline\n\nThe `timeline` command will use Twitter's [user timeline API](https://dev.twitter.com/rest/reference/get/statuses/user_timeline) to collect the most recent tweets posted by the user indicated by screen_name.\n\n    twarc timeline deray > tweets.jsonl\n\nYou can also look up users using a user id:\n\n    twarc timeline 12345 > tweets.jsonl\n\n### Retweets\n\nYou can get retweets for a given tweet id like so:\n\n    twarc retweets 824077910927691778 > retweets.jsonl\n\nIf you have a file of tweet ids that you would like to fetch the retweets for, you can:\n\n    twarc retweets ids.txt > retweets.jsonl\n\n### Replies\n\nUnfortunately Twitter's API does not currently support getting replies to a\ntweet. So twarc approximates it by using the search API. Since the search API\ndoes not support getting tweets older than a week, twarc can only get the\nreplies to a tweet that have been sent in the last week.\n\nIf you want to get the replies to a given tweet you can:\n\n    twarc replies 824077910927691778 > replies.jsonl\n\nUsing the `--recursive` option will also fetch replies to the replies as well as\nquotes.  
This can take a long time to complete for a large thread because of\nrate limiting by the search API.\n\n    twarc replies 824077910927691778 --recursive\n\n### Lists\n\nTo get the users that are on a list you can use the list URL with the\n`listmembers` command:\n\n    twarc listmembers https://twitter.com/edsu/lists/bots\n\n## Premium Search API\n\nTwitter introduced a Premium Search API that lets you pay Twitter money for tweets.\nOnce you have set up an environment in your\n[dashboard](https://developer.twitter.com/en/dashboard) you can use their 30day\nand fullarchive endpoints to search for tweets outside the 7 day window provided\nby the Standard Search API. To use the premium API from the command line you\nwill need to indicate which endpoint you are using, and the environment.\n\nTo avoid using up your entire budget you will likely want to limit the time\nrange using `--to_date` and `--from_date`. Additionally you can limit the\nmaximum number of tweets returned using `--limit`.\n\nSo for example, if I wanted to get all the blacklivesmatter tweets from a two-week\nperiod last month (assuming today is June 1, 2020) using my environment named\n*docnowdev* but not retrieving more than 1000 tweets, I could:\n\n    twarc search blacklivesmatter \\\n      --30day docnowdev \\\n      --from_date 2020-05-01 \\\n      --to_date 2020-05-14 \\\n      --limit 1000 \\\n      > tweets.jsonl\n\nSimilarly, to find tweets from 2014 using the full archive you can:\n\n    twarc search blacklivesmatter \\\n      --fullarchive docnowdev \\\n      --from_date 2014-08-04 \\\n      --to_date 2014-08-05 \\\n      --limit 1000 \\\n      > tweets.jsonl\n\nIf your environment is sandboxed you will need to use `--sandbox` so that twarc\nknows not to request more than 100 tweets at a time (the default for\nnon-sandboxed environments is 500):\n\n    twarc search blacklivesmatter \\\n      --fullarchive docnowdev \\\n      --from_date 2014-08-04 \\\n      --to_date 2014-08-05 \\\n      --limit 1000 
\\\n      --sandbox \\\n      > tweets.jsonl\n\n## Gnip Enterprise API\n\ntwarc supports integration with the Gnip Twitter Full-Archive Enterprise API.\nTo do so, you must pass in the `--gnip_auth` argument. Additionally, set the\n`GNIP_USERNAME`, `GNIP_PASSWORD`, and `GNIP_ACCOUNT` environment variables.\nYou can then run the following:\n\n    twarc search blacklivesmatter \\\n      --gnip_auth \\\n      --gnip_fullarchive prod \\\n      --from_date 2014-08-04 \\\n      --to_date 2015-08-05 \\\n      --limit 1000 \\\n      > tweets.jsonl\n\n## Use as a Library\n\nIf you want you can use twarc programmatically as a library to collect\ntweets. You first need to create a `twarc` instance (using your Twitter\ncredentials), and then use it to iterate through search results, filter\nresults or lookup results.\n\n```python\nfrom twarc import Twarc\n\nt = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)\nfor tweet in t.search(\"ferguson\"):\n    print(tweet[\"text\"])\n```\n\nYou can do the same for a filter stream of new tweets that match a track\nkeyword\n\n```python\nfor tweet in t.filter(track=\"ferguson\"):\n    print(tweet[\"text\"])\n```\n\nor location:\n\n```python\nfor tweet in t.filter(locations=\"-74,40,-73,41\"):\n    print(tweet[\"text\"])\n```\n\nor user ids:\n\n```python\nfor tweet in t.filter(follow='12345,678910'):\n    print(tweet[\"text\"])\n```\n\nSimilarly you can hydrate tweet identifiers by passing in a list of ids\nor a generator:\n\n```python\nfor tweet in t.hydrate(open('ids.txt')):\n    print(tweet[\"text\"])\n```\n\n## User vs App Auth\n\ntwarc will manage rate limiting by Twitter. However, you should know that\ntheir rate limiting varies based on the way that you authenticate. The two\noptions are User Auth and App Auth. 
twarc defaults to using User Auth but you\ncan tell it to use App Auth.\n\nSwitching to App Auth can be handy in some situations like when you are\nsearching tweets, since User Auth can only issue 180 requests every 15 minutes\n(1.7 million tweets per day), but App Auth can issue 450 (4.3 million tweets per\nday).\n\nBut be careful: the `statuses/lookup` endpoint used by the hydrate subcommand\nhas a rate limit of 900 requests per 15 minutes for User Auth, and 300 requests\nper 15 minutes for App Auth.\n\nIf you know what you are doing and want to force App Auth, you can use the\n`--app_auth` command line option:\n\n    twarc --app_auth search ferguson > tweets.jsonl\n\nSimilarly, if you are using twarc as a library you can:\n\n```python\nfrom twarc import Twarc\n\nt = Twarc(app_auth=True)\nfor tweet in t.search('ferguson'):\n    print(tweet['id_str'])\n```\n\n## Utilities\n\nIn the utils directory there are some simple command line utilities for\nworking with the line-oriented JSON, like printing out the archived tweets as\ntext or html, extracting the usernames, referenced URLs, etc.  
If you create a\nscript that you find handy please send a pull request.\n\nWhen you've got some tweets you can create a rudimentary wall of them:\n\n    utils/wall.py tweets.jsonl > tweets.html\n\nYou can create a word cloud of tweets you collected about nasa:\n\n    utils/wordcloud.py tweets.jsonl > wordcloud.html\n\nIf you've collected some tweets using `replies` you can create a static D3\nvisualization of them with:\n\n    utils/network.py tweets.jsonl tweets.html\n\nOptionally you can consolidate tweets by user, allowing you to see central accounts:\n\n    utils/network.py --users tweets.jsonl tweets.html\n\nAdditionally, you can create a network of hashtags, allowing you to view their co-occurrence:\n\n    utils/network.py --hashtags tweets.jsonl tweets.html\n\nAnd if you want to use the network graph in a program like [Gephi](https://gephi.org/),\nyou can generate a GEXF file with the following:\n\n    utils/network.py --users tweets.jsonl tweets.gexf\n    utils/network.py --hashtags tweets.jsonl tweets.gexf\n\nAdditionally, if you want to convert the network into a dynamic network with timeline enabled (i.e. nodes will appear and disappear according to their attributes), you can open up your GEXF file in Gephi and follow [these instructions](https://seinecle.github.io/gephi-tutorials/generated-html/converting-a-network-with-dates-into-dynamic.html). Note that in tweets.gexf there is a column for \"start_date\" (which is the day the post was created) but none for \"end_date\" and that in the dynamic timeline, the nodes will appear on the screen at their start date and stay on screen forever after.  
For the \"Time Interval creation options\" pop-up in Gephi, the \"Start time column\" should be \"start_date\", the \"End time column\" should be empty, \"Parse dates\" should be selected, and the Date format should be the last option, \"dd/MM/yyyy HH:mm:ss\".\n\ngender.py is a filter which allows you to filter tweets based on a guess about\nthe gender of the author. So for example you can select all the tweets that\nlook like they were from women, and create a word cloud for them:\n\n    utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html\n\nYou can output [GeoJSON](http://geojson.org/) from tweets where geo coordinates are available:\n\n    utils/geojson.py tweets.jsonl > tweets.geojson\n\nOptionally you can export GeoJSON with centroids replacing bounding boxes:\n\n    utils/geojson.py tweets.jsonl --centroid > tweets.geojson\n\nAnd if you do export GeoJSON with centroids, you can add some random fuzzing:\n\n    utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson\n\nTo filter tweets by presence or absence of geo coordinates (or Place, see\n[API documentation](https://dev.twitter.com/overview/api/places)):\n\n    utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl\n    cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl\n\nTo filter tweets by a GeoJSON fence (requires [Shapely](https://github.com/Toblerity/Shapely)):\n\n    utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl\n    cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl\n\nIf you suspect you have duplicates in your tweets you can dedupe them:\n\n    utils/deduplicate.py tweets.jsonl > deduped.jsonl\n\nYou can sort by ID, which is analogous to sorting by time:\n\n    utils/sort_by_id.py tweets.jsonl > sorted.jsonl\n\nYou can filter out all tweets before a certain date (for example, if a hashtag was used for another event before the 
one you're interested in):\n\n    utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl\n\nYou can get an HTML list of the clients used:\n\n    utils/source.py tweets.jsonl > sources.html\n\nIf you want to remove the retweets:\n\n    utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl\n\nOr unshorten urls (requires [unshrtn](https://github.com/docnow/unshrtn)):\n\n    cat tweets.jsonl | utils/unshrtn.py > unshortened.jsonl\n\nOnce you unshorten your URLs you can get a ranked list of most-tweeted URLs:\n\n    cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt\n\n## twarc-report\n\nSome further utility scripts to generate csv or json output suitable for\nuse with [D3.js](http://d3js.org/) visualizations are found in the\n[twarc-report](https://github.com/pbinkley/twarc-report) project. The\nutil `directed.py`, formerly part of twarc, has moved to twarc-report as\n`d3graph.py`.\n\nEach script can also generate an html demo of a D3 visualization, e.g.\n[timelines](https://wallandbinkley.com/twarc/bill10/) or a\n[directed graph of retweets](https://wallandbinkley.com/twarc/bill10/directed-retweets.html).\n\n[Chinese]: https://github.com/DocNow/twarc/blob/main/README_zw_zh.md\n[Japanese]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md\n[Portuguese]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md\n[Spanish]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md\n[Swedish]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md\n[Swahili]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md\n[ISO 639-1]: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes\n"
  },
  {
    "path": "docs/twarc1_es_mx.md",
    "content": "# twarc1\r\n\r\ntwarc es una recurso de línea de commando y catálogo de Python para archivar JSON dato de Twitter. Cada tweet se representa como\r\nun artículo de JSON que es [exactamente](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object) lo que fue capturado del API de Twitter. Los Tweets se archivan como [JSON de línea orientado](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON). twarc se encarga del [límite de tarifa](https://developer.twitter.com/en/docs/basics/rate-limiting) del API de Twitter. twarc también puede facilitar la colección de usuarios, tendencias y detallar las identificaciones de los tweets.\r\n\r\ntwarc fue desarrollado como parte del proyecto [Documenting the Now](http://www.docnow.io/) el cual fue financiado por el [Mellon Foundation](https://mellon.org/).\r\n\r\n## La Instalación\r\n\r\nAntes de usar twarc es necesario registrarse por [apps.twitter.com](https://apps.twitter.com/). Después de establecer la solicitud, se anota el clabe del consumidor, el secreto del consumidor, y entoces clickear para generar un access token y el secretro del access token. Con estos quatros requisitos, está listo para usar twarc.\r\n1. Instala [Python](https://www.python.org/downloads/) (2 ó 3)\r\n2. 
Instala twarc a través de pip (si estas acezando de categoría: pip install --upgrade twarc)\r\n\r\n## Quickstart:\r\n\r\nPara empezar, se necesita dirigir a twarc sobre los claves de API:\r\n\r\n  `twarc configure`\r\n\r\nPrueba una búsqueda:\r\n\r\n  `twarc search blacklivesmatter > search.jsonl`\r\n\r\n¿O quizás, preferirá coleccionar tweets en tiempo real?\r\n\r\n  `twarc filter blacklivesmatter > stream.jsonl`\r\n\r\nVea abajo por detalles sobre estos commandos y más.\r\n\r\n## Uso\r\n\r\n### Configure\r\nUna vez que tenga sus claves de aplicación, puede dirigir a twarc lo que son con el commando `configure`.\r\n\r\n  `twarc configure`\r\n\r\nEsto archiva sus credenciales en un archivo que se llama `.twarc` en su directorio personal\r\npara que no tenga que volver a ingresar los datos. Si prefiere ingresar los datos directamente, se\r\npuede establecer en el ambiente `(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)`\r\no usando las opciones de línea commando `(--consumer_key, --consumer_secret, --access_token, --access_token_secret)`.\r\n\r\n### Search\r\n\r\nEsto se usa para [las búsquedas](https://developer.twitter.com/en/docs/api-reference-index) de Twitter para descargar *preexistentes* tweets que corresponde a una consulta en particular.\r\n\r\n`twarc search blacklivesmatter > tweets.jsonl`\r\n\r\nEs importante a notar que este `search` dará resultados los tweets que se encuentran dentro de una ventana de siete días como se imponga la búsqueda del Twitter API. Si parece una ventana mínima, lo es, pero puede ser que el interés es en coleccionar tweets en tiempo real usando `filter` y `sample` commandos detallados abajo.\r\n\r\nLa mejor manera de familiares con la búsqueda de syntax de Twitter es experimentado con el [Búsqueda Avanzada de Twitter](https://twitter.com/search-advanced) y copiar y pegar la consulta de la caja de búsqueda. 
Por ejemplo, abajo hay una consulta más complicada que busca los tweets que contienen #blacklivesmatter OR #blm hastags que se enviaron a deray.\r\n\r\n`twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl`\r\n\r\nTwitter puede codificar el lenguaje de un tweet, y puede limitar su búsqueda a un lenguaje particular:\r\n\r\n`twarc search '#blacklivesmatter' --lang fr > tweets.jsonl`\r\n\r\nTambién, puede buscar tweets dentro de un lugar geográfico, por ejemplo, los tweets que menciona blacklivesmatter que están a una milla del centro de Ferguson, Missouri:\r\n\r\n`twarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl`\r\n\r\nSi una bsqueda no está identificado cuando se usa \"--geocode\" se regresa a los tweets en esa ubicación y radio:\r\n\r\n`twarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl`\r\n\r\n### Filter\r\n\r\nEl commando \"filter\" se usa Twitter's [\"status/filter\"](https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data) API para coleccionar tweets en tiempo real.\r\n\r\n`twarc filter blacklivesmatter,blm > tweets.jsonl`\r\n\r\nFavor de notar que el sintaxis para los track queries de Twitter es differente de las búsquedas en el search API. Favor de consultar la documentación.\r\n\r\nUse el commando `follow` para coleccionar tweets de una identificación de usuario en particular en tiempo real. Incluye retweets. Por ejemplo, esto colecciona tweets y retweets de CNN:\r\n\r\n`twarc filter --follow 759251 > tweets.jsonl`\r\n\r\nTambién se puede coleccionar tweets usando un \"bounding box\". Nota: ¡el primer guion necesita estar escapado en el \"bounding box\" si no, estará interpretado como un argumento de línea de commando!\r\n\r\n`twarc filter --locations \"\\-74,40,-73,41\" > tweets.jsonl`\r\n\r\nSi combina las opciones serán \"OR'ed\" juntos. 
Por ejemplo, esto colecciona los tweets que usan los hashtags de blacklivesmatter o blm y tambien tweets del usario CNN:\r\n\r\n`twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl`\r\n\r\n### Sample\r\n\r\nUsa el commando `sample` para probar a los [statuses/API de muestra](https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data) para una muestra \"azar\" de tweets recientes.\r\n\r\n`twarc sample > tweets.jsonl`\r\n\r\n### Dehydrate\r\n\r\nEl commando `dehydrate` genera una lista de id's de un archivo de tweets:\r\n\r\n`twarc dehydrate tweets.jsonl > tweet-ids.txt`\r\n\r\n### Hydrate\r\n\r\nEl mando `hydrate` busca a través de un archivo de identificadores y regresa el JSON del tweet usando el [\"status/lookup API\"](https://developer.twitter.com/en/docs/api-reference-index).\r\n\r\n`twarc hydrate ids.txt > tweets.jsonl`\r\n\r\nLos [términos de servicio](https://developer.twitter.com/en/developer-terms/policy#6._Be_a_Good_Partner_to_Twitter) del API de Twitter desalientan los usuarios a hacer público por el internet los datos de Twitter. Los datos se pueden usar para el estudio y archivado para uso local, pero no para compartir público. Aún, Twitter permite archivos de identificadores de Twitter ser compartidos. Puede usar el API de Twitter para hidratar los datos, o recuperar el completo JSON dato. 
Esto es importante para la [verificación](https://en.wikipedia.org/wiki/Reproducibility) del estudio de los redes sociales.\r\n\r\n### Users\r\n\r\nEl commando `user` regresa metadata de usuario para los nobres de pantalla.\r\n\r\n`twarc users deray,Nettaaaaaaaa > users.jsonl`\r\n\r\nTambién puede acceder ids de usuario:\r\n\r\n`twarc users 1232134,1413213 > users.jsonl`\r\n\r\nSi quiere, también se puede usar un archivo de user ids:\r\n\r\n`twarc users ids.txt > users.jsonl`\r\n\r\n### Followers\r\n\r\nEl commando `followers` usa el [follower id API](https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-followers-ids) para coleccionar los user ids para un nombre de pantalla por búsqueda:\r\n\r\n`twarc followers deray > follower_ids.txt`\r\n\r\nEl resultado incluye un user id por cada línea. El orden es en reversa cronológica, o los followers más recientes.\r\n\r\n### Friends\r\n\r\nEl commando `friends` usa el [friend id API](https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-friends-ids) de Twitter para coleccionar los friend user ids para un nombre de pantalla por búsqueda:\r\n\r\n`twarc friends deray > friend_ids.txt`\r\n\r\n### Trends\r\n\r\nEl commando `trends` regresa información del Twitter API sobre los hashtags populares. Necesita ingresar un [Where on Earth idenfier (`woeid`)](https://en.wikipedia.org/wiki/WOEID) para indicar cual temas quieres buscar. 
Por ejemplo:\r\n\r\n`twarc trends 2486982`\r\n\r\nUsando un woeid de 1 regresara temas para el planeta:\r\n\r\n`twarc trends 1`\r\n\r\nTambién se puede omitir el `woeid` y los datos que regresan serán una lista de los lugares por donde Twitter localiza las temas:\r\n\r\n`twarc trends`\r\n\r\nSi tiene un geo-location, puede usarlo.\r\n\r\n`twarc trends 39.9062,-79.4679`\r\n\r\ntwarc buscara el lugar usando el [trends/closest](https://developer.twitter.com/en/docs/api-reference-index) API para encontrar el `woeid` más cerca.\r\n\r\n### Timeline\r\n\r\nEl commando `timeline` usa el [user timeline API](https://developer.twitter.com/en/docs/api-reference-index) para coleccionar los tweets más recientes del usuario indicado por el nombre de pantalla.\r\n\r\n`twarc timeline deray > tweets.jsonl`\r\n\r\nTambién se puede buscar usuarios usando un user id:\r\n\r\n`twarc timeline 12345 > tweets.jsonl`\r\n\r\n### Retweets\r\n\r\nSe puede buscar retweets de un tweet específico:\r\n\r\n`twarc retweets 824077910927691778 > retweets.jsonl`\r\n\r\n### Replies\r\n\r\nDesafortunadamente, el API de Twitter no soporte buscando respuestas a un tweet. Entonces, twarc usa el search API. EL search API no regresa tweets mayores de siete días.\r\n\r\nSi quieres buscar las respuestas de un tweet:\r\n\r\n`twarc replies 824077910927691778 > replies.jsonl`\r\n\r\nEl commando `--recursive` regresa respuestos a los respuestos. Esto puede tomar mucho tiempo para un thread muy grande porque el rate liming por el search API.\r\n\r\n`twarc replies 824077910927691778 --recursive`\r\n\r\n### Lists\r\n\r\nPara conseguir los usuarios en una lista, se puede usar el list URL con el commando `listmembers`.\r\n\r\n`twarc listmembers https://twitter.com/edsu/lists/bots`\r\n\r\n## Use as a Library\r\n\r\ntwarc se puede usar programáticamente como una biblioteca para coleccionar tweets. 
Necesitas usar un `twarc` instance (usando tus credenciales de Twitter), y luego lo usas para buscar por resultados de búsqueda.\r\n\r\n`from twarc import Twarc\r\n\r\nt = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)\r\nfor tweet in t.search(\"ferguson\"):\r\n    print(tweet[\"text\"])`\r\n\r\nPuedes usar lo mismo para el filtro de stream de nuevos de tweets que sean iguales al track keyword.\r\n\r\n`for tweet in t.filter(track=\"ferguson\"):\r\n    print(tweet[\"text\"])`\r\n\r\no lugar:\r\n\r\n`for tweet in t.filter(locations=\"-74,40,-73,41\"):\r\n    print(tweet[\"text\"])`\r\n\r\no user ids:\r\n\r\n`for tweet in t.filter(follow='12345,678910'):\r\n    print(tweet[\"text\"])`\r\n\r\nTambién los identificados de tweets se pueden hydratar:\r\n\r\n`for tweet in t.hydrate(open('ids.txt')):\r\n    print(tweet[\"text\"])`\r\n\r\n## Utilities\r\n\r\nEn el directorio de utilidades hay algunos commando simple de line utilities para trabajar conel line-oriented JSON, Como imprimiendo out the archived tweets as texto o html, extracting the usernames, referenced URLs, etc. 
Si creas un script que tú puedas encontrar fácilmente por favor envía un pull request.\r\n\r\nCuando tengas algunos tweets puedes crear una pared rudimentaria de ellos:\r\n\r\n`% utils/wall.py tweets.jsonl > tweets.html`\r\n\r\nPuedes crear un word cloud de tweets que has coleccionado sobre nasa:\r\n\r\n`% utils/wordcloud.py tweets.jsonl > wordcloud.html`\r\n\r\nSi has coleccionado algunos tweets usando `replies` puedes crear a static D3 visualization de ellos con:\r\n\r\n`% utils/network.py tweets.jsonl tweets.html`\r\n\r\nTienes la opción de consolidar tweets por user, permitiéndote ver las cuentas centrales:\r\n\r\n`% utils/network.py --users tweets.jsonl tweets.html`\r\n\r\nY si quieres usar la graficas del network en un programa como [Gephi](https://gephi.org/), puedes generar un GEXF file con lo siguiente:\r\n\r\n`% utils/network.py --users tweets.jsonl tweets.gexf`\r\n\r\ngender.py es un filtro que te permite filtrar tweets basados en un guess sobre el género del autor. Por ejemplo, puedes filtrar todos los tweets que parecen ser de mujeres, y crear un word cloud para ellos:\r\n\r\n`% utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html`\r\n\r\nSe puede usar [GeoJSON](http://geojson.org/) de tweets que tienen geo coordiates:\r\n\r\n`% utils/geojson.py tweets.jsonl > tweets.geojson`\r\n\r\nTienes la opcion de exportar GeoJSON con centroids replacing bounding boxes:\r\n\r\n`% utils/geojson.py tweets.jsonl --centroid > tweets.geojson`\r\n\r\nY si exportas GeoJSON with centroids, puedes añadir algunos random fuzzing:\r\n\r\n`% utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson`\r\n\r\nPara filtrar tweets por presencia o ausencia de coordenadas geo (o por lugar Place, verifica [API documentacion](https://developer.twitter.com/en/docs/basics/getting-started)):\r\n\r\n`% utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl\r\n% cat tweets.jsonl | utils/geofilter.py --no-place > 
tweets-with-no-place.jsonl`\r\n\r\nPara filtrar con GeoJSON fence (se necesita [Shapely](https://github.com/Toblerity/Shapely)):\r\n\r\n`% utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl\r\n% cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl`\r\n\r\nSi sospechas que tienes un duplicado en tus tweets se puede usar \"dedupe\":\r\n\r\n`% utils/deduplicate.py tweets.jsonl > deduped.jsonl`\r\n\r\nPara ordernar por ID:\r\n\r\n`% utils/sort_by_id.py tweets.jsonl > sorted.jsonl`\r\n\r\nPuedes filtrar todos los tweets antes de una fecha exacta (Por ejemplo, si un hashtag fue usado para otro evento antes del que te interesaba):\r\n\r\n`% utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl`\r\n\r\nPuedes conseguir un listado de  HTML  de clientes usados:\r\n\r\n`% utils/source.py tweets.jsonl > sources.html`\r\n\r\nSi deseas remover los retweets:\r\n\r\n`% utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl`\r\n\r\nO unshorten urls (se necesita [unshrtn](https://github.com/DocNow/unshrtn)):\r\n\r\n`% cat tweets.jsonl | utils/unshorten.py > unshortened.jsonl`\r\n\r\nUna vez hayas unshorten tus URLs puedes obtener un listado de los  most-tweeted URLs:\r\n\r\n`% cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt`\r\n\r\n## twarc-report\r\n\r\nMás commandos de \"utility\" para generar csv or json output con uso con [D3.js](https://d3js.org/) visualizaciónes son encontrados en el [twarc-report](https://github.com/pbinkley/twarc-report) project. El util `directed.py` ahora es `d3graph.py`.\r\n\r\nCada script también puede generar un html demo de D3 visualization, e.g. 
[timelines](https://www.wallandbinkley.com/twarc/bill10/) o una [gráfica dirigida de retweets](https://www.wallandbinkley.com/twarc/bill10/directed-retweets.html).\r\n\r\n---\r\n\r\nCrédito de tradução: [Tina Figueroa]\r\n\r\n[japonés]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md\r\n[Portugués]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md\r\n[Inglés]: https://github.com/DocNow/twarc/blob/main/README.md\r\n[Sueco]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md\r\n[Swahili]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md\r\n[Tina Figueroa]: https://github.com/@tinafigueroa\r\n\r\n"
  },
  {
    "path": "docs/twarc1_ja_jp.md",
    "content": "twarc1\n=====\n\ntwarcは、TwitterのJSONデータをアーカイブするためのコマンドラインツールおよびPythonライブラリーのプログラムです。\n\n- 各ツイートは、Twitter APIから返された内容を[正確に](https://dev.twitter.com/overview/api/tweets)表すJSONオブジェクトとして表示されます。\n- ツイートは[line-oriented JSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON)として保存されます。\n- twarcがTwitterのAPI[レート制限](https://dev.twitter.com/rest/public/rate-limiting)を処理してくれます。\n- twarcはツイートを収集できるだけでなく、ユーザー、トレンド、ツイートIDの詳細な情報の収集（hydrate; ハイドレート）にも役立ちます。\n\ntwarcは[Mellon Foundation](https://mellon.org/)によって援助された[Documenting the Now](http://www.docnow.io)プロジェクトの一環として開発されました.\n\n## Install | インストール\n\ntwarcを使う前に[Twitter Developers](http://apps.twitter.com)にあなたのアプリケーションを登録する必要があります.\n\n登録したら, コンシューマーキーとその秘密鍵を控えておきます.\nそして「Create my access token」をクリックして、アクセストークンと秘密鍵を生成して控えておいてください.\nこれら4つの鍵が手元に揃えば, twarcを使い始める準備は完了です.\n\n1. [Python](http://python.org/download)をインストールする (Version2か3)\n2. [pip](https://pip.pypa.io/en/stable/installing/) install twarcする\n\n### Homebrew (macOSだけ)\n\n`twarc`は以下によってインストールできます.\n\n```bash\n$ brew install twarc\n```\n\n## Quickstart | クイックスタート\n\nまず初めに, アプリケーションのAPIキーをtwarcに教え, 1つ以上のTwitterアカウントへのアクセスを許可する必要があります.\n\n    twarc configure\n\n検索を試してみましょう.\n\n    twarc search blacklivesmatter > search.jsonl\n\nまたは, 呟かれたツイートを収集したいですか？\n\n    twarc filter blacklivesmatter > stream.jsonl\n\nコマンドなどの詳細については, 以下を参照してください.\n\n## Usage | 用法\n\n### Configure | 設定\n\n`configure`コマンドで, 取得したアプリケーションキーをtwarcに教えることができます.\n    break\n\n    twarc configure\n\nこれにより, ホームディレクトリの`.twarc`というファイルに資格情報が保存されるため, 常に入力し続ける必要はありません.\n\n直接指定したい場合は, 環境変数(`CONSUMER_KEY`, `CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`)か, コマンドラインオプション(`--consumer_key`, `--consumer_secret`, `--access_token`, `--access_token_secret`)を使用してください.\n\n### Search | 検索\n\n検索には, 与えられたクエリに適合する*既存の*ツイートをダウンロードするために, Twitterの[search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets) APIを使います.\n\n    twarc search blacklivesmatter > tweets.jsonl\n\nここで重要なのは, 
`search`コマンドがTwitter検索APIの課す7日間以内の期限中から見つかったツイートを返すということです.\nもし期限が「短すぎる」と思うのなら(まあそれはそうですが), 以下の`filter`コマンドや`sample`コマンドを使って収集してみると面白いかもしれません.\n\nTwitterの検索構文についてよく知るためのベストプラクティスは, [Twitter's Advanced Search](https://twitter.com/search-advanced)で試してみて, 検索窓からクエリ文の結果をコピペすることです.\n\n例えば以下の例は, `@deray`に送信された, ハッシュタグ`#blacklivesmatter`か`#blm`かの一方を含むツイートを検索する複雑なクエリです.\n\n    twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl\n\nまた, [Igor Brigadir](https://github.com/igorbrigadir)の*素晴らしい*Twitter検索構文のリファレンスを絶対にチェックしておくべきです.（[Advanced Search on Twitter](https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md)）\n高度な検索フォームには, すぐにはみつからない隠れた宝石がたくさんあります.\n\nTwitterはツイートの言語をコーディングしようとします.  [ISO 639-1]コードを使用すれば, 特定の言語に検索を制限できます.\n\n    twarc search '#blacklivesmatter' --lang fr > tweets.jsonl\n\n特定の場所でのツイートを検索することもできます.\n例えば, ミズーリ州ファーガソンの中心から1マイルの`blacklivesmatter`に言及するツイートなどを検索できます.\n\n    twarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl\n\n`--geocode`の使用時に検索クエリが提供されない場合, その場所と半径に関連する全てのツイートを返します.\n\n    twarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl\n\n### Filter | フィルター\n\n`filter`コマンドは, 呟かれたツイートを収集するために, Twitterの[statuses/filter](https://dev.twitter.com/streaming/reference/post/statuses/filter) APIを使います.\n\n    twarc filter blacklivesmatter,blm > tweets.jsonl\n\nここで注意すべきなのは, Twitterのトラッククエリの構文は, 検索APIのクエリとは少し異なるということです.\nそのため, 使用しているフィルターオプションの最も良い表現方法については, ドキュメントを参照してください.\n\n特定のユーザーIDから呟かれたツイートを収集したい場合は, `follow`引数を使いましょう.\nこれにはリツイートも含まれます. 例えば, これは`@CNN`のツイート及びリツイートを収集します.\n\n    twarc filter --follow 759251 > tweets.jsonl\n\n境界ボックス座標の数値(バウンディングボックス)を用いてツイートを収集することもできます.\n注意: 先頭のダッシュ(`-`)はバウンディングボックス内ではエスケープする必要があります. 
エスケープしないと, コマンドライン引数として解釈されてしまいます！\n\n    twarc filter --locations \"\\-74,40,-73,41\" > tweets.jsonl\n\n`lang`コマンドライン引数を使用して, 検索を制限する[ISO 639-1]の言語コードを渡すことができます.\nフィルターストリームでは, 1つ以上の言語でフィルタリングできるため, 繰り返し可能です.\n以下は, フランス語またはスペイン語で呟かれた, パリまたはマドリードに言及しているツイートを収集します.\n\n    twarc filter paris,madrid --lang fr --lang es\n\nフィルタを組み合わせてオプションの後ろに続けた場合には, それらは共にORで結ばれます.\n例えば, これはハッシュタグ`#blacklivesmatter`または`#blm`を使用するツイート, 及びユーザー`@CNN`からのツイートを収集します.\n\n    twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl\n\nただし, 場所と言語を組み合わせると, 結果的にANDになります.\n例えば, これは, スペイン語またはフランス語で呟かれた, ニューヨークあたりからのツイートを収集します.\n\n    twarc filter --locations \"\\-74,40,-73,41\" --lang es --lang fr\n\n### Sample | 抽出\n\n`sample`コマンドは, Twitterの[statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) APIに直近のパブリックステータスの「無作為な」抽出を尋ねるのに使えます.\n\n    twarc sample > tweets.jsonl\n\n### Dehydrate | デハイドレート\n\n`dehydrate`コマンドはツイートのJSONLファイルからツイートIDのリストを生成します.\n\n    twarc dehydrate tweets.jsonl > tweet-ids.txt\n\n### Hydrate | ハイドレート\n\ntwarcの`hydrate`コマンドは, ツイートの識別子のファイルを読み込んで, Twitterの[status/lookup](https://dev.twitter.com/rest/reference/get/statuses/lookup) APIを用いてそれらのツイートのJSONを書き出します.\n\n    twarc hydrate ids.txt > tweets.jsonl\n\nTwitter APIの[利用規約](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter)では, 人々が大量のTwitterの生データをWeb上で利用可能にすることを制限しています.\n\n- データは調査に使用したり, ローカルで使用するためにアーカイブしたりできますが, 世界と共有することはできません.\n- Twitterはツイートの識別子ファイルを共有することは許可しており, それはツイートのデータセットを利用可能にしたい場合に役立ちます.\n- それから, Twitter APIでデータを*ハイドレート*(注:水和)したり, またそれぞれの識別子のフルJSONデータを取得することは許可されています.\n- `hydrate`は特に, ソーシャルメディア研究を[検証](https://ja.wikipedia.org/wiki/再現性)する時に重要となります.\n\n### Users | ユーザー\n\n`users`コマンドは, 与えられたスクリーンネームを持つユーザーのメタデータを返します.\n\n    twarc users deray,Nettaaaaaaaa > users.jsonl\n\nまたユーザーidも与えることができます.\n\n    twarc users 1232134,1413213 > users.jsonl\n\nまた, 望むなら以下のようにユーザーidのファイルを使用可能で, `followers`や`friends`といったコマンドを使っているときに有効です.\n\n    twarc users ids.txt > 
users.jsonl\n\n### Followers | フォロワー\n\n`followers`コマンドは, Twitterの[follower id API](https://dev.twitter.com/rest/reference/get/followers/ids)を用い, 引数として指定されたリクエストごとに1つだけのスクリーン名を持つユーザーのフォロワーのユーザーIDを収集します.\n\n    twarc followers deray > follower_ids.txt\n\n結果には, 行ごとに1つのユーザーIDが含まれ, その応答順序は逆時系列順, すなわち最新のフォロワーが初めに来ます.\n\n### Friends | 友達\n\n`followers`コマンドと同じく, `friends`コマンドはTwitterの[friend id API](https://dev.twitter.com/rest/reference/get/friends/ids)を用いて, 引数として指定されたリクエストごとに1つだけのスクリーン名を持つユーザーのフレンド(フォロー)ユーザーIDを収集します.\n\n    twarc friends deray > friend_ids.txt\n\n### Trends | トレンド\n\n時に, 興味のあるトレンドの地域を示す[Where On Earth](https://web.archive.org/web/20180102203025/https://developer.yahoo.com/geo/geoplanet/)識別子(`WOE ID`)をオプションに与える必要があります.\n例としてセントルイスの現在のトレンドを取得するやり方を示します.\n\n    twarc trends 2486982\n\n`WOE ID`に`１`を用いることで, 全世界のトレンドが取得されます.\n\n    twarc trends 1\n\n`WOE ID`として何を使用すればよいかわからない場合は, 以下のように`WOE ID`を省略することで, Twitterがトレンドを追跡している全ての場所のリストを取得できます.\n\n    twarc trends\n\nGeolocationがあれば, `WOE ID`の代わりにジオロケーションを使用できます.\n\n    twarc trends 39.9062,-79.4679\n\nバックグラウンドでtwarcは, Twitterの[trends/closest](https://dev.twitter.com/rest/reference/get/trends/closest) APIを使用して, 場所を検索し, 最も近い`WOE ID`を見つけます.\n\n### Timeline | タイムライン\n\n`timeline`コマンドは, Twitterの[user timeline API](https://dev.twitter.com/rest/reference/get/statuses/user_timeline)を用いて, スクリーンネームで示されるユーザーが投稿した最新のツイートを収集します.\n\n    twarc timeline deray > tweets.jsonl\n\nまた, ユーザーIDからユーザーを調べることもできます.\n\n    twarc timeline 12345 > tweets.jsonl\n\n### Retweets | リツイート\n\n指定されたツイートIDのリツイートを以下のように取得できます.\n\n    twarc retweets 824077910927691778 > retweets.jsonl\n\n### Replies | 返信\n\n残念ながら, TwitterのAPIは現在, ツイートへの返信の取得をサポートしていません.\n代わりに, twarcは検索APIを使用してその機能の近似を行います.\n\nTwitterの検索APIは, 1週間以上前のツイートの取得をサポートしていません.\nそのため, twarcは先週までに送信されたツイートに対する返信のみを取得できます.\n\n特定のツイートへの返信を取得したい場合は以下のようにします.\n\n    twarc replies 824077910927691778 > replies.jsonl\n\n`--recursive`オプションを使用すると, 返信に対する返信や引用も取得されます.\n検索APIによるレート制限のために, 
長いスレッドの場合は完了するのに長時間かかる場合があります.\n\n    twarc replies 824077910927691778 --recursive\n\n### Lists | リスト\n\nリストにあるユーザを取得するには、`listmembers`コマンドで list URLを使用します。\n\n    twarc listmembers https://twitter.com/edsu/lists/bots\n\n## Premium Search API\n\nTwitterでは、Twitterにお金を支払ってツイートを取得できるプレミアム検索APIが導入されました。\n\n[ダッシュボード](https://developer.twitter.com/en/dashboard)で環境設定をした後、\n「Standard Search API」が提供する7日間のウィンドウ外で、30日間とフルアーカイブ\nでのエンドポイントを使ってツイートを検索することができます。コマンドラインから\nPremium APIを使用するには、使用しているエンドポイントと環境を指定する必要があります。\n\n予算全体を使い果たすことを避けるために、`--to_date`と`--from_date`を使用して\n時間範囲を制限することをおすすめします。また、`--limit`を使用して返される\nツイートの最大数を制限することができます。\n\n例えば、（今日が2020年6月1日だと仮定し）2週間前の全てのblacklivesmatterツイートを、\n*docnowdev*という名前の環境を使って取得したいが、1000件以上のツイートを取得しない\n場合は、次のような操作ができます。\n\n    twarc search blacklivesmatter \\\n      --30day docnowdev \\\n      --from_date 2020-05-01 \\\n      --to_date 2020-05-14 \\\n      --limit 1000 \\\n      > tweets.jsonl\n\n同様に、フルアーカイブを使用して2014年のツイートを検索するには、次の方法があります。\n\n    twarc search blacklivesmatter \\\n      --fullarchive docnowdev \\\n      --from_date 2014-08-04 \\\n      --to_date 2014-08-05 \\\n      --limit 1000 \\\n      > tweets.jsonl\n\n環境がサンドボックス化されている場合、twarcが一度に100件以上のツイートを要求しないように、\n`--sandbox`を使用する必要があります。（サンドボックス化されていない環境のデフォルトは 500）\n\n    twarc search blacklivesmatter \\\n      --fullarchive docnowdev \\\n      --from_date 2014-08-04 \\\n      --to_date 2014-08-05 \\\n      --limit 1000 \\\n      --sandbox \\\n      > tweets.jsonl\n\n## Use as a Library | ライブラリとして使用\n\n必要に応じてtwarcをプログラム的にライブラリとして使ってツイートを収集することができます。\n最初に（Twitterの資格情報を使用して）twarcインスタンスを作成し、検索結果、フィルタ結果、\nまたはルックアップ結果の反復を処理するために使用できます。\n\n```python\nfrom twarc import Twarc\n\nt = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)\nfor tweet in t.search(\"ferguson\"):\n    print(tweet[\"text\"])\n```\n\ntrackキーワードに一致する新しいツイートのフィルタストリームに対しても同じことができます。\n\n```python\nfor tweet in t.filter(track=\"ferguson\"):\n    
print(tweet[\"text\"])\n```\n\nまた`location`なら,\n\n```python\nfor tweet in t.filter(locations=\"-74,40,-73,41\"):\n    print(tweet[\"text\"])\n```\n\n`user id`なら,\n\n```python\nfor tweet in t.filter(follow='12345,678910'):\n    print(tweet[\"text\"])\n```\n\n同様に, IDのリストまたはジェネレーターを渡すことで, ツイートIDをハイドレートできます.\n\n```python\nfor tweet in t.hydrate(open('ids.txt')):\n    print(tweet[\"text\"])\n```\n\n## User vs App Auth\n\ntwarcはTwitterによるレート制限を管理しますが、 それらのレート制限は、認証方法によって\n異なります。ユーザー認証とアプリ認証の２つのオプションがありますが、twarcは\nデフォルトでユーザー認証を使用するので、アプリ認証を使用するように指示することもできます。\n\nアプリ認証への切り替えは、ツイートを検索するときなんかに便利です。ユーザー認証は\n15分ごとに180件(1日あたり160万件)しかリクエストできないのに対し、アプリ認証は450件\n(1日あたり430万件)のリクエストができるからです。\n\nただし注意すべきことは、ハイドレートサブコマンドで使用される`statuses / lookup`\nエンドポイントには、ユーザー認証の場合は15分あたり900件までリクエスト、アプリ\n認証の場合は15分あたり300件までのリクエストのレート制限があるということです。\n\n自分が何をしているかを知っていて、アプリ認証を強制したい場合は、次のように`--app_auth`\nコマンドラインオプションが使用できます。\n\n    twarc --app_auth search ferguson > tweets.jsonl\n\n同様に、twarcをライブラリとして使用している場合は、次のことができます。\n\n```python\nfrom twarc import Twarc\n\nt = Twarc(app_auth=True)\nfor tweet in t.search('ferguson'):\n    print(tweet['id_str'])\n```\n\n## Utilities | ユーティリティ\n\n`utils`ディレクトリには, line-oriented JSONを操作するための簡単なコマンドラインユーティリティがいくつかあります.\n例えばアーカイブされたツイートをテキストまたはHTMLとして出力したり, ユーザー名や参照URLなどを抽出したりするものです.\n\n便利なスクリプトを自作したら, 是非プルリクエストをください.\n\nいくつかツイートが手元にある時, それらを用いて初歩的なWallを作成できます.\n\n    utils/wall.py tweets.jsonl > tweets.html\n\n`NASA`について収集したツイートのワードクラウドを作成できます.\n\n    utils/wordcloud.py tweets.jsonl > wordcloud.html\n\n`replies`コマンドを用いていくつかのツイートを収集した場合, それらの静的な`D3.js`を用いたビジュアライゼーションを作成できます.\n\n    utils/network.py tweets.jsonl tweets.html\n\n必要に応じてユーザーごとにツイートを統合し, その中心のアカウントを表示できます.\n\n    utils/network.py --users tweets.jsonl tweets.html\n\n[Gephi](https://gephi.org/)などのプログラムでネットワークグラフを使用する場合は, 次のようにGEXFファイルを生成できます.\n\n    utils/network.py --users tweets.jsonl tweets.gexf\n\n`gender.py`は, 著者の性別に関する推測に基づいてツイートをフィルタリングできるフィルターです.\n例えば, 女性からのもののように見えるすべてのツイートを除外し, それらの単語クラウドを作成できます.\n\n    
utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html\n\n地理座標が利用可能なツイートから[GeoJSON](http://geojson.org/)を出力できます.\n\n    utils/geojson.py tweets.jsonl > tweets.geojson\n\n必要に応じて, バウンディングボックスを置き換える重心を用いたGeoJSONをエクスポートできます.\n\n    utils/geojson.py tweets.jsonl --centroid > tweets.geojson\n\nまた, 重心を用いたGeoJSONをエクスポートする場合に, ランダムファジングを追加することもできます.\n\n    utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson\n\n地理座標の有無でツイートをフィルタリングするには, (場所については以下を参照:[API documentation](https://dev.twitter.com/overview/api/places))\n\n    utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl\n    cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl\n\nGeoJSONのフェンスでツイートをフィルタリングするには, (要:[Shapely](https://github.com/Toblerity/Shapely))\n\n    utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl\n    cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl\n\nツイートに重複があると思われる場合は, 重複の排除が可能です.\n\n    utils/deduplicate.py tweets.jsonl > deduped.jsonl\n\nID順にソートできます. これは, 時間順ソートに似ています.\n\n    utils/sort_by_id.py tweets.jsonl > sorted.jsonl\n\n特定の日付以前のすべてのツイートを除外できます.\n例えば, 以下は関心のあるイベントの前, 別のイベントにハッシュタグが使用されていた場合です.\n\n    utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl\n\n使用されているクライアントのHTMLリストを取得できます.\n\n    utils/source.py tweets.jsonl > sources.html\n\nリツイートを削除する場合は,\n\n    utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl\n\nまたはURLの短縮を解除したい場合は, (要:[unshrtn](https://github.com/docnow/unshrtn))\n\n    cat tweets.jsonl | utils/unshrtn.py > unshortened.jsonl\n\nURLの短縮を解除すると, 最もよくツイートされたURLのランキングリストを取得できます.\n\n    cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt\n\n## twarc-report\n\n[twarc-report](https://github.com/pbinkley/twarc-report)プロジェクトでは, 
[D3.js](http://d3js.org/)でのビジュアライゼーションでの使用に適したCSVまたはJSONを生成・出力するユーティリティスクリプトを用意しています.\n以前はtwarcの一部であった`directed.py`は`d3graph.py`としてtwarc-reportプロジェクトに移管しました.\n\nまたそれぞれのスクリプトは, ビジュアライゼーションのHTMLでのデモを生成できます.\n\n具体例として,\n  - [タイムライン](https://www.wallandbinkley.com/twarc/bill10/)\n  - [リツイートの有向グラフ](https://www.wallandbinkley.com/twarc/bill10/directed-retweets.html)\n\nがあります.\n\n---\n\n翻訳クレジット: [Haruna]\n\n[英語]: https://github.com/DocNow/twarc/blob/main/README.md\n[ポルトガル語]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md\n[スペイン語]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md\n[スウェーデン語]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md\n[スワヒリ語]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md\n[ISO 639-1]: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes\n[Haruna]: https://github.com/eggplants\n"
  },
  {
    "path": "docs/twarc1_pt_br.md",
    "content": "twarc1\n=====\n\ntwarc é uma ferramenta de linha de comando e usa a biblioteca Python para arquivamento de dados do Twitter com JSON.\nCada tweet será representado como um objeto JSON\n[exatamente](https://dev.twitter.com/overview/api/tweets) o que foi devolvido pela\nAPI do Twitter.  Os Tweets serão armazenados como [JSON, um por linha](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON).  twarc controla totalmente a API [limites de uso](https://dev.twitter.com/rest/public/rate-limiting)\npara você. Além de permitir que você colete Tweets, twarc também pode ajudá-lo\nColetar usuários, tendências e hidratar tweet ids.\n\ntwarc Foi desenvolvido como parte [Documenting the Now](http://www.docnow.io)\nProjecto financiado pelo [Mellon Foundation](https://mellon.org/).\n\n## Instalação\n\nAntes de usar twarc você precisa registrar um aplicativo em\n[apps.twitter.com](http://apps.twitter.com). Depois de criar o aplicativo, anote a [consumer_key], [consumer_secret] e clique em  Gerar um [access_token] e um [access_token_secret]. Com estas quatro variáveis na mão você está pronto para começar a usar twarc.\nOBS: Se tiver alguma dúvida de como criar o aplicativo, consulte [como criar um app](http://blog.difluir.com/2013/06/como-criar-uma-app-no-twitter/)\n\n1. instalação [Python](http://python.org/download) (2 ou 3)\n2. 
pip install twarc\n\n### Homebrew (macOS apenas)\n\nPara usuários do macOS, você pode instalar o `twarc` via:\n\n```bash\n$ brew install twarc\n```\n\n## Início Rápido:\n\nPrimeiro você vai precisar configurar o twarc mostrando a ele suas chaves de API:\n\n    twarc configure\n\nEm seguida, experimente uma pesquisa rápida:\n\n    twarc search blacklivesmatter > search.jsonl\n\nOu talvez você gostaria de coletar tweets como eles acontecem?\n\n    twarc filter blacklivesmatter > stream.jsonl\n\nVeja abaixo os detalhes sobre esses comandos e muito mais.\n\n## Uso\n\n### Configurar\n\nUma vez que você tem suas chaves de aplicativo, você pode dizer ao twarc quais são com o comando\n`configure`.\n\n    twarc configure\n\nIsso irá armazenar as credenciais em um arquivo chamado `.twarc` em seu\ndiretório home. Este arquivo será usado como padrão em outras chamadas.\nSe preferir, você pode fornecer diretamente as chaves (`CONSUMER_KEY`,\n`CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) ou usando a linha de comando\ncom as opções (`--consumer_key`, `--consumer_secret`, `--access_token`,\n`--access_token_secret`).\n\n### Pesquisar\n\nOs usuários do Twitter [Pesquisar/tweets](https://dev.twitter.com/rest/reference/get/search/tweets) para baixar *pre-existing* tweets, correspondendo a uma determinada consulta que desejar.\n\n    twarc search blacklivesmatter > tweets.jsonl\n\nÉ importante notar que `search` Irá retornar tweets encontrados dentro de uma\nJanela de 7 dias imposta pela API de pesquisa do Twitter. Se isso parece uma pequena\nJanela,e é, mas você pode estar interessado em coletar tweets como eles acontecem\nUsando o `filter` e `sample` comandos abaixo.\n\nA melhor maneira de se familiarizar com a sintaxe de pesquisa do Twitter é experimentando\n[Pesquisa Avançada do Twitter](https://twitter.com/search-advanced) E copiar e\ncolar a consulta resultante da caixa de pesquisa. 
Por exemplo, aqui está uma\nconsulta complicada que procura por tweets que contenham\n\\#blacklivesmatter ou #blm hashtags que foram enviados para deray.\n\n    twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl\n\nVocê definitivamente também deve consultar o *excelente* guia de referência de\nIgor Brigadir à sintaxe de busca do Twitter:\n\n[Busca Avançada no Twitter](https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md)\n\nLá existem várias pérolas escondidas que não estão muito evidentes no\nformulário de pesquisa avançada.\n\nO Twitter tenta codificar a linguagem de um tweet e você pode limitar sua\npesquisa para um idioma específico se quiser usando um código [ISO 639-1]:\n\n    twarc search '#forabolsonaro' --lang pt > tweets.jsonl\n\nVocê também pode pesquisar tweets com um determinado local, por exemplo tweets\nMencionando *foratemer* das pessoas situadas a 1 milha na região de Brasília:\n\n    twarc search foratemer --geocode -16.050561,-47.814708,1mi > tweets.jsonl\n\nSe uma consulta de pesquisa não for fornecida`--geocode` Você receberá todos os tweets\nRelevantes para esse local e raio:\n\n    twarc search --geocode -16.050561,-47.814708,1mi > tweets.jsonl\n\n### Filter\n\nO comando `filter` Vai usar o Twitter [statuses/filter](https://dev.twitter.com/streaming/reference/post/statuses/filter) API to collect tweets as they happen.\n\n    twarc filter foratemer,blm > tweets.jsonl\n\nObserve que a sintaxe para consultas de queries do Twitter é ligeiramente\ndiferente do que as consultas em sua API de pesquisa. Por favor, consulte a\ndocumentação sobre a melhor forma de expressar a opção de filtro que você deseja.\n\nUse o comando de linha `follow` com argumento se você quer coletar tweets de\num determinado ID de usuário. Isso inclui retweets. 
Por exemplo, isso vai\ncoletar tweets e os retweets da CNN:\n\n    twarc filter --follow 759251 > tweets.jsonl\n\nVocê também pode coletar tweets usando uma caixa delimitadora.\nNota: o traço principal precisa ser escapado na caixa delimitadora ou então\nele será interpretado como um comando de linha como argumento!\nExemplo: escapando com a barra invertida após aspas \"\\\n\n    twarc filter --locations \"\\-74,40,-73,41\" > tweets.jsonl\n\nSe você combinar opções eles serão um OU outro juntos.\nPor exemplo, isso irá coletar Tweets que usam o hashtags foratemer\nOU blm e também tweets do usuário CNN:\n\n    twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl\n\nMas combinar locais e idiomas resultará efetivamente em um E. Para\nexemplo, isso irá coletar tweets da grande área de Nova York que estão em\nEspanhol ou francês:\n\n    twarc filter --locations \"\\-74,40,-73,41\" --lang es --lang fr\n\n### Sample\n\nUse o comando `sample` para ouvir/Status do Twitter [statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) API para uma amostra \"aleatória/ramdom\" de tweets públicos recentes. O status será do usuário ativo na API twarc.\n\n    twarc sample > tweets.jsonl\n\n### Dehydrate\n\nO comando `dehydrate` gera uma lista de id de um arquivo de tweets:\n\n    twarc dehydrate tweets.jsonl > tweet-ids.txt\n\n### Hydrate\n\nO comando do twarc `hydrate` Lê um arquivo de IDs de tweets e escreve o tweet em JSON para eles usando Twitter [status/lookup](https://dev.twitter.com/rest/reference/get/statuses/lookup) API.\n\n    twarc hydrate ids.txt > tweets.jsonl\n\nO [Termos do Serviço](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) do Twitter API's desencoraja pessoas na busca de grandes quantidades de dados brutos do Twitter e disponíbilizar na Web. Os dados podem ser usados para pesquisa e arquivados para uso local, mas não devem ser compartilhados com o mundo. 
O Twitter permite que arquivos de identificadores de tweet sejam compartilhados, o que pode ser útil quando você quer fazer um conjunto de dados de tweets disponíveis. Você pode usar a API do Twitter para *hydrate* dados ou para recuperar o JSON completo para cada identificador/usuário ID. Isto é particularmente importante para [verificação](https://en.wikipedia.org/wiki/Reproducibility) da rede social mundial.\n\n### Usuários\n\nO comando `users` retorna metadados do usuário fornecidos na tela,exemplo:\n\n    twarc users deray,Nettaaaaaaaa > users.jsonl\n\nVocê também pode usar os ids do usuário:\n\n    twarc users 1232134,1413213 > users.jsonl\n\nSe você quiser, você também pode usar um arquivo com ids de usuário, o que\npode ser útil se você estiver usando o `followers` e o `friends` conforme\ncomando abaixo:\n\n    twarc users ids.txt > users.jsonl\n\n### Seguidores (Quem me segue)\n\nO comando `followers` Vai usar o Twitter [API seguidores ID](https://dev.twitter.com/rest/reference/get/followers/ids) Para coletar os ids dos usuários que estão seguindo exatamente o nome informado na tela. Veja como é feita a solicitação usando o nome do user como argumento:\n\n    twarc followers deray > follower_ids.txt\n\nO resultado incluirá exatamente um ID de usuário por linha.\nA ordem de resposta é Invertida cronológicamente, o mais recente seguidores em primeiro lugar.\n\n### Amigos (Quem eu sigo)\n\nIgual o comando `followers`, o comando` friends` usará o Twitter [API amigos ID](https://dev.twitter.com/rest/reference/get/friends/ids) Para coletar os IDs de usuário amigo/friends com o nome que foi informado na tela no momento da solicitação,conforme especificado abaixo no argumento:\n\n    twarc friends deray > friend_ids.txt\n\n### Trends / tendências\n\nO comando `trends` permite recuperar informações da API do Twitter sobre hashtags tendências. 
Você precisa fornecer um [Onde na Terra](http://developer.yahoo.com/geo/geoplanet/) identificador (`woeid`) para indicar quais as tendências que você está interessado. Por exemplo, aqui é como você pode obter as tendências atuais para St Louis:\n\n    twarc trends 2486982\n\nUsando um `woeid` de 1 irá retornar tendências para todo o planeta, ou trends mundiais:\n\n    twarc trends 1\n\nSe você não tem certeza do que usar como um `woeid`, não se preocupe, apenas omita seu valor e você receberá uma lista\nde todos os lugares para os quais o Twitter acompanha as tendências:\n\n    twarc trends\n\nSe você já tem uma geolocalização, você pode usá-la diretamente no lugar do `woeid`.\n\n    twarc trends 39.9062,-79.4679\n\nPor trás das cenas, o twarc buscará o local usando o Twitter [trends/closest](https://dev.twitter.com/rest/reference/get/trends/closest) API para encontrar o `woeid` mais próximo.\n\n### Timeline\n\nO comando timeline usará do Twitter [API user timeline](https://dev.twitter.com/rest/reference/get/statuses/user_timeline) para coletar os tweets mais recentes postados pelo usuário indicado por um screen_name.\n\n    twarc timeline deray > tweets.jsonl\n\nVocê também pode procurar usuários usando um id de usuário:\n\n    twarc timeline 12345 > tweets.jsonl\n\n### Retuítes\n\nVocê pode obter retuítes para um determinado id de tweet como este:\n\n    twarc retweets 824077910927691778 > retweets.jsonl\n\nSe você tiver tweet_ids para os quais gostaria de buscar os retuítes, você pode:\n\n    twarc retweets ids.txt > retweets.jsonl\n\n### Respostas\n\nInfelizmente, a API do Twitter não suporta atualmente a obtenção de respostas\npara um tweet. Portanto, o twarc o aproxima usando a API de pesquisa. 
Como\na API de pesquisa não suporta a obtenção de tweets com mais de uma semana,\no twarc só pode obter todas as respostas a um tweet que foram enviadas na\núltima semana.\n\nSe você deseja obter respostas para um determinado tweet, você pode:\n\n    twarc replies 824077910927691778 > replies.jsonl\n\nUsar a opção `--recursive` também irá buscar respostas para as respostas, bem\ncomo citações. Isso pode levar muito tempo para ser concluído em um thread\ngrande por causa de limitação de taxa pela API de pesquisa.\n\n    twarc replies 824077910927691778 --recursive\n\n### Listas\n\nPara obter os usuários que estão em uma lista, você pode usar o URL da lista com o\ncomando `listmembers`:\n\n    twarc listmembers https://twitter.com/edsu/lists/bots\n\n## Premium Search API\n\nO Twitter introduziu uma API de pesquisa premium que permite que você pague dinheiro\nao Twitter por tweets. Depois de configurar um ambiente em seu\n[painel](https://developer.twitter.com/en/dashboard) você pode usar os endpoints de 30 dias\ne fullarchive para pesquisar tweets fora da janela de 7 dias fornecida\npela API de pesquisa padrão. Para usar a API premium na linha de comando, você\nprecisará indicar qual endpoint você está usando e o ambiente.\n\nPara evitar usar todo o seu orçamento, você provavelmente desejará limitar o\nintervalo de tempo usando `--to_date` e `--from_date`. 
Além disso, você pode\nlimitar o número máximo de tweets retornados usando `--limit`.\n\nPor exemplo, se eu quisesse obter todos os tweets blacklivesmatter de duas\nsemanas atrás (supondo que hoje seja 1 de Junho de 2020) usando meu ambiente\nchamado *docnowdev*, mas não recuperando mais de 1000 tweets, eu poderia:\n\n    twarc search blacklivesmatter \\\n      --30day docnowdev \\\n      --from_date 2020-05-01 \\\n      --to_date 2020-05-14 \\\n      --limit 1000 \\\n      > tweets.jsonl\n\nDa mesma forma, para encontrar tweets de 2014 usando o arquivo completo, você\npode:\n\n    twarc search blacklivesmatter \\\n      --fullarchive docnowdev \\\n      --from_date 2014-08-04 \\\n      --to_date 2014-08-05 \\\n      --limit 1000 \\\n      > tweets.jsonl\n\nSe o seu ambiente for sandbox, você precisará usar `--sandbox` para que o\ntwarc saiba que não deve solicitar mais de 100 tweets por vez (o padrão para\nambientes sem sandbox é 500)\n\n    twarc search blacklivesmatter \\\n      --fullarchive docnowdev \\\n      --from_date 2014-08-04 \\\n      --to_date 2014-08-05 \\\n      --limit 1000 \\\n      --sandbox \\\n      > tweets.jsonl\n\n## Usar twarc como uma biblioteca\n\nSe você quiser pode usar `twarc` programaticamente como uma biblioteca\npara coletar Tweets. 
Primeiro você precisa criar uma instância do `twarc`\n(usando as suas Credenciais do Twitter) e, em seguida, usá-lo para iterar\natravés de resultados de pesquisa ou filtrar resultados de pesquisa.\n\n```python\nfrom twarc import Twarc\n\nt = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)\nfor tweet in t.search(\"ferguson\"):\n    print(tweet[\"text\"])\n```\n\nVocê pode fazer o mesmo para um fluxo de filtro de novos tweets que\ncorrespondem a uma determinada faixa usando palavra-chave.\n\n```python\nfor tweet in t.filter(track=\"ferguson\"):\n    print(tweet[\"text\"])\n```\n\nou localização:\n\n```python\nfor tweet in t.filter(locations=\"-74,40,-73,41\"):\n    print(tweet[\"text\"])\n```\n\nou IDS do usuário:\n\n```python\nfor tweet in t.filter(follow='12345,678910'):\n    print(tweet[\"text\"])\n```\n\nDa mesma forma você pode hidratar os identificadores de tweet passando\nem uma lista de ids ou um gerador:\n\n```python\nfor tweet in t.hydrate(open('ids.txt')):\n    print(tweet[\"text\"])\n```\n\n## User x App Auth\n\ntwarc gerenciará a limitação de taxas pelo Twitter. No entanto, você deve\nsaber que a limitação de taxa varia de acordo com a maneira como você\nautentica. As duas opções são User Auth e App Auth. 
O padrão do twarc é usar a\nautenticação do usuário, mas você pode dizer a ele para usar o App Auth.\n\nMudar para App Auth pode ser útil em algumas situações, como quando você está\npesquisando tweets, já que o User Auth só pode emitir 180 solicitações a cada\n15 minutos (1,6 milhões de tweets por dia), mas o App Auth pode emitir 450\n(4,3 milhões de tweets por dia).\n\nMas tenha cuidado: o endpoint `statuses / lookup` usado pelo subcomando\nhydrate tem um limite de taxa de 900 solicitações por 15 minutos para\nautenticação do usuário e 300 solicitações por 15 minutos para App Auth.\n\nSe você sabe o que está fazendo e deseja forçar o App Auth, pode usar a opção\nde linha de comando `--app_auth`:\n\n    twarc --app_auth search ferguson > tweets.jsonl\n\nDa mesma forma, se você estiver usando twarc como uma biblioteca, você pode:\n\n```python\nfrom twarc import Twarc\n\nt = Twarc(app_auth=True)\nfor tweet in t.search('ferguson'):\n    print(tweet['id_str'])\n```\n\n## Utilitários\n\nNo diretório utils existem alguns utilitários simples de linha de comando para\ntrabalhar com o JSON gravado linha por linha, tais como:\n- Imprimir os tweets arquivados como texto ou HTML.\n- Extrair os nomes de usuários.\n- Extrair as URLs referenciadas.\n- Etc.\nSe você criar um script e achar útil, por favor envie um pull request no GitHub do projeto.\n\nQuando você tem alguns tweets você pode criar um mural rudimentar deles:\n\n    utils/wall.py tweets.jsonl > tweets.html\n\nVocê pode criar uma nuvem de palavras de tweets coletados sobre a nasa:\n\n    utils/wordcloud.py tweets.jsonl > wordcloud.html\n\nSe você coletou alguns tweets usando `replies`, você pode criar uma\nvisualização estática D3 deles com:\n\n    utils/network.py tweets.jsonl tweets.html\n\nOpcionalmente, você pode consolidar tweets por usuário, permitindo que você\nveja contas centrais:\n\n    utils/network.py --users tweets.jsonl tweets.html\n\nAlém disso, você pode criar uma rede de hashtags, permitindo que 
você\nvisualize sua coocorrência:\n\n    utils/network.py --hashtags tweets.jsonl tweets.html\n\nE se você quiser usar o gráfico de rede em um programa como\n[Gephi](https://gephi.org/), você pode gerar um arquivo GEXF com o seguinte:\n\n    utils/network.py --users tweets.jsonl tweets.gexf\n    utils/network.py --hashtags tweets.jsonl tweets.gexf\n\n`gender.py` é um filtro que permite filtrar tweets com base em um palpite sobre\no gênero do autor. Assim, por exemplo, você pode filtrar todos os tweets que\nem tese foram feitos por mulheres, e criar uma nuvem de palavras para eles:\n\n    utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html\n\nVocê pode gerar [GeoJSON](http://geojson.org/) a partir de tweets que tenham coordenadas geográficas disponíveis:\n\n    utils/geojson.py tweets.jsonl > tweets.geojson\n\nOpcionalmente você pode exportar GeoJSON com centróides substituindo as caixas delimitadoras:\n\n    utils/geojson.py tweets.jsonl --centroid > tweets.geojson\n\nE se você exportar GeoJSON com centróides, você pode adicionar algum fuzzing aleatório:\n\n    utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson\n\nPara filtrar tweets pela presença ou ausência de coordenadas geográficas (ou local, veja [Documentação da API locais](https://dev.twitter.com/overview/api/places)):\n\n    utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl\n    cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl\n\nPara filtrar tweets por uma área com GeoJSON (Requer [Shapely](https://github.com/Toblerity/Shapely)):\n\n    utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl\n    cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl\n\nSe você suspeitar que há tweets duplicados, você pode removê-los:\n\n    utils/deduplicate.py tweets.jsonl > deduped.jsonl\n\nVocê pode classificar por ID, o que é análogo à classificação por tempo:\n\n
    utils/sort_by_id.py tweets.jsonl > sorted.jsonl\n\nVocê pode filtrar todos os tweets antes de uma determinada data (por exemplo, se uma hashtag foi usada para outro evento antes do que você está interessado):\n\n    utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl\n\nVocê pode obter uma lista HTML dos clientes usados:\n\n    utils/source.py tweets.jsonl > sources.html\n\nSe você quiser remover os retweets:\n\n    utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl\n\nOu desencurtar URLs (Requer [unshrtn](https://github.com/edsu/unshrtn)):\n\n    cat tweets.jsonl | utils/unshorten.py > unshortened.jsonl\n\nDepois de desencurtar seus URLs, você pode obter uma lista classificada dos URLs mais tuitados:\n\n    cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt\n\n## twarc-report\n\nAlguns scripts utilitários adicionais para gerar saída CSV ou JSON adequada para\nuso com visualizações [D3.js](http://d3js.org/) são encontrados no projeto\n[twarc-report](https://github.com/pbinkley/twarc-report). O\nutilitário directed.py, anteriormente parte do twarc, mudou-se para twarc-report como\nd3graph.py.\n\nCada script também pode gerar uma demo HTML de uma visualização D3, por exemplo\n[linhas do tempo](https://wallandbinkley.com/twarc/bill10/) ou um\n[grafo dirigido de retuítes](https://wallandbinkley.com/twarc/bill10/directed-retweets.html).\n\n---\n\nTradução créditos: [Wilson Jr]\n\n[Espanhol]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md\n[Inglês]: https://github.com/DocNow/twarc/blob/main/README.md\n[Japonês]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md\n[Sueco]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md\n[Suaíli]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md\n[ISO 639-1]: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes\n[Wilson Jr]: https://github.com/py3in\n"
  },
  {
    "path": "docs/twarc1_sv_se.md",
    "content": "twarc1\n=====\n\ntwarc är ett kommandoradsverktyg twarc och ett Pythonbibliotek för arkivering av Twitter JSON data.\nVarje tweet är representerat som ett JSON-objekt som är [exakt](https://dev.twitter.com/overview/api/tweets) vad som returneras från Twitters API\nTweets lagras som [line-oriented JSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON).  twarc hanterar\nTwitter API:ets [rate limits](https://dev.twitter.com/rest/public/rate-limiting)\nåt dig. Förutom att kunna samla in tweets kan även twarc hjälpa dig att samla in användare, trender och omvandla tweet-id:n till tweets.\n\ntwarc har utvecklats som en del av [Documenting the Now](http://www.docnow.io)\nprojektet som finiansierades av [Mellon Foundation](https://mellon.org/).\n\n\n## Installera\n\nInnan du använder twarc behöver du registrera en applikation hos\n[apps.twitter.com](http://apps.twitter.com). När du har skapat din applikation, skriv ner consumer key, consumer secret och klicka för att generera en access token och en access token secret.\nMed dessa fyra variabler är du redo att börja använda twarc.\n\n1. Installera [Python](http://python.org/download) (2 eller 3)\n2. 
pip install twarc (om du uppgraderar: pip install --upgrade twarc)\n\n## Snabbstart:\n\nFörst måste du tala om för twarc vad dina API-nycklar är och tillåta åtkomst till ett\neller flera twitterkonton:\n\n    twarc configure\n\nProva att köra:\n\n    twarc search blacklivesmatter > search.jsonl\n\nEller om du vill samla in tweets i samma ögonblick de skapas:\n\n    twarc filter blacklivesmatter > stream.jsonl\n\nSe nedan för detaljer om dessa och fler kommandon.\n\n\n## Användning\n\n### Konfigurera\n\nNär du har dina applikationsnycklar så kan du tala om för twarc vilka de är med\n`configure` kommandot.\n\n    twarc configure\n\nDetta kommer att lagra dina nycklar i en fil som heter `.twarc` placerad i din hemkatalog så du slipper att skriva in dem varje gång.\nOm du hellre vill tilldela dom direkt så kan du göra det i environment (`CONSUMER_KEY`,\n`CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) eller genom att använda kommandoradsparameter\noptions (`--consumer_key`, `--consumer_secret`, `--access_token`,\n`--access_token_secret`).\n\n### Sök\nDetta använder Twitters [search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets) för att ladda ner *redan befintliga* tweets som matchar en given söksträng.\n\n    twarc search blacklivesmatter > tweets.jsonl\n\nDet är viktigt att notera att `search` retunerar tweets som hittas inom det 7-dagarsfönster som\nTwitters sök-API erbjuder. Känns det som ett smalt fönster? Det är det. 
Men du kanske är intresserad av att samla in tweets i samma ögonblick som de skapas\ngenom att använda `filter` och `sample` kommandona nedan.\n\nDet bästa sättet att bekanta sig med Twitters söksyntax är att experimentera med\n[Twitters Advancerade Sök](https://twitter.com/search-advanced) och kopiera och klistra in söksträngen från sökboxen.\nHär är till exempel en mer avancerad söksträng som matchar tweets innehållande antingen \\#blacklivesmatter eller #blm hashtaggar som skickats till deray\n\n    twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl\n\nTwitter försöker att koda en tweets språk, och du kan begränsa sökningen till ett specifikt språk om du vill:\n\n    twarc search '#blacklivesmatter' --lang fr > tweets.jsonl\n\nDu kan också söka efter tweets inom en given plats, till exempel tweets som nämner *blacklivesmatter*  som är 1 mile från centrala Ferguson, Missouri:\n\n    twarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl\n\nOm inte en söksträng ges när du använder `--geocode` kommer du få alla tweets som är relevanta för den platsen och radien.\n\n    twarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl\n\n### Filter\n\n`filter` Kommandot använder Twitters [statuses/filter](https://dev.twitter.com/streaming/reference/post/statuses/filter) API för att samla in tweets i samma ögonblick som de skapas.\n\n    twarc filter blacklivesmatter,blm > tweets.jsonl\n\nNotera att syntaxen för Twitters track söksträngar är något annorlunda än de som används i sök-API:et\nVar god läs dokumentationen för att se hur du bäst kan formulera sökningar.\n\n\nAnvänd `follow` kommandot om du vill samla in tweets från ett specifikt användar-id i samma ögonblick som de skapas. Detta inkluderar retweets.\nTill exempel så samlar detta in tweets och retweets från CNN:\n\n    twarc filter --follow 759251 > tweets.jsonl\n\nDu kan också samla in tweets genom att använda koordinater.  
Notera: det inledande bindestrecket behöver ignoreras, annars kommer det tolkas som en kommandoradsparameter!\n\n    twarc filter --locations \"\\-74,40,-73,41\" > tweets.jsonl\n\n\nOm du kombinerar parametrar så kommer de tolkas som OR\nTill exempel så kommer detta samla in tweets som använder blacklivesmatter eller blm hashtaggen och som också postats av användaren CNN:\n\n    twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl\n\n### Sample\n\nAnvänd `sample` kommandot för att \"lyssna\" på Twitters [statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) API för ett \"slumpmässigt\" prov av nyligen skapade publika tweets.\n\n    twarc sample > tweets.jsonl\n\n### Dehydrering\n\n`dehydrate` kommandot genererar en lista med identifierare från en fil med tweets:\n\n    twarc dehydrate tweets.jsonl > tweet-ids.txt\n\n### Hydrering\n\ntwarc's `hydrate` kommando läser en fil med tweetidentifierare och skriver ut som tweet JSON genom Twitters [status/lookup](https://dev.twitter.com/rest/reference/get/statuses/lookup) API.\n\n    twarc hydrate ids.txt > tweets.jsonl\n\nTwitter APIs [Terms of Service](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) uppmuntrar inte folk att tillgängliggöra stora mängder av rå Twitterdata på webben.\nDatan kan användas för forskning och arkiveras lokalt, men kan inte delas med världen. 
Twitter tillåter emellertid att identifierare delas, vilket kan vara bra när du vill tillgängliggöra ett dataset.\nDu kan då använda Twitters API för att *hydrera* datan, eller för att hämta den fulla JSON-objektet för varje identifierare.\nDetta är särskilt viktigt för [verifiering](https://en.wikipedia.org/wiki/Reproducibility) av forskning på social media.\n\n### Användare\n\n`users` kommandot retunerar metadata för angivna screen names.\n\n    twarc users deray,Nettaaaaaaaa > users.jsonl\n\nDu kan också använda användar-id:\n\n    twarc users 1232134,1413213 > users.jsonl\n\nOm du vill kan du också använda en fil med användar-id, vilket kan vara användbart om du använder\n`followers` och `friends` kommandona nedan:\n\n    twarc users ids.txt > users.jsonl\n\n### Följare\n\n`followers` kommandot använder Twitters [follower id API](https://dev.twitter.com/rest/reference/get/followers/ids) för att samla in följarens användar-id för exakt ett screen name per request specificerat som ett argument:\n\n    twarc followers deray > follower_ids.txt\n\nResultatet inkluderar exakt ett användar-id per linje ordnat i omvänd kronologisk ordning, alltså de senaste följarna först.\n\n\n### Vänner\n\nPrecis som `followers` kommandot, använder `friends` kommandot Twitters [friend id API](https://dev.twitter.com/rest/reference/get/friends/ids) för att samla in vänners användar-id för exakt ett screen name per request, specificerat som ett argument:\n\n    twarc friends deray > friend_ids.txt\n\n### Trender\n\n`trends` kommandot låter dig hämta information från Twitters API om trendande hashtags. Du måste bifoga en [Where On Earth](http://developer.yahoo.com/geo/geoplanet/) identifierare (`woeid`)\nför att precisera vilka trender du är intresserad av. Till exempel kan du hämta de senaste trenderna för St. 
Louis på det här viset:\n\n    twarc trends 2486982\n\nAnvänder du ett `woeid` på 1 så kommer du få trender för hela världen:\n\n    twarc trends 1\n\nOm du inte är säker på vad du ska använda för `woeid` så kan du helt enkelt utesluta det för att få en lista över alla platser Twitter har trender för:\n\n    twarc trends\n\nOm du har en geo-position så kan du använda den istället för `woeid`.\n\n    twarc trends 39.9062,-79.4679\n\nBakom kulisserna så hjälper twarc dig genom Twitters [trends/closest](https://dev.twitter.com/rest/reference/get/trends/closest) API att hitta närmaste `woeid`.\n\n### Tidslinje\n\n`timeline` kommandot använder Twitters [user timeline API](https://dev.twitter.com/rest/reference/get/statuses/user_timeline) för att samla in de senaste tweetsen skapade av en användare baserat på screen_name.\n\n    twarc timeline deray > tweets.jsonl\n\nDu kan också använda användar-id:\n\n    twarc timeline 12345 > tweets.jsonl\n\n### Retweets\n\nDu kan samla in retweets för ett givet tweetid genom:\n\n    twarc retweets 824077910927691778 > retweets.jsonl\n\n### Svar\n\nTyvärr så stödjer inte Twitters API att hämta svar till en tweet.\ntwarc använder istället sök-API:et för detta. Då sök-API:et inte kan användas för att samla in tweets äldre än en vecka kan twarc endast hämta svar till en tweet som har postats den senaste veckan.\n\nOm du vill hämta svaren till en tweet så kan du använda följande:\n\n    twarc replies 824077910927691778 > replies.jsonl\n\nGenom att använda `--recursive` parametern så hämtas även svar till svar såväl som citerade tweets. 
Detta kan ta mycket lång tid att köra på stora trådar på grund av\nrate limiting på sök-API:et.\n\n    twarc replies 824077910927691778 --recursive\n\n### Listor\n\nFör att hämta användare som är med på en lista kan du använda list-URL:en med\n`listmembers` kommandot:\n\n    twarc listmembers https://twitter.com/edsu/lists/bots\n\n## Använd som ett bibliotek\n\nDu kan också använda twarc programatiskt som ett bibliotek för att samla in tweets.\nDu behöver först skapa en instans av `twarc` (genom att använda dina nycklar)\n, och sedan använda det för att iterera genom sökresultat, filter och resultat.\n\n```python\nfrom twarc import Twarc\n\nt = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)\nfor tweet in t.search(\"ferguson\"):\n    print(tweet[\"text\"])\n```\n\nDu kan göra samma sak för en ström som matchar ett nyckelord\n\n```python\nfor tweet in t.filter(track=\"ferguson\"):\n    print(tweet[\"text\"])\n```\n\neller en position:\n\n```python\nfor tweet in t.filter(locations=\"-74,40,-73,41\"):\n    print(tweet[\"text\"])\n```\n\neller användar-id:\n\n```python\nfor tweet in t.filter(follow='12345,678910'):\n    print(tweet[\"text\"])\n```\n\nPå samma sätt kan du hydrera tweetid:n genom att bearbeta en lista med idn\neller en generator:\n\n```python\nfor tweet in t.hydrate(open('ids.txt')):\n    print(tweet[\"text\"])\n```\n\n## Verktyg\n\nI utils-mappen finns ett antal enkla kommandoradsverktyg för att bearbeta linjeorienterad JSON, så som att skriva ut arkiverade tweets som text eller html, extrahera användarnamn, refererade url:er, m.m.\nOm du skapar ett skript som du tycker är bra så får du gärna skicka en pull request.\n\nNär du samlat in lite tweets kan du skapa en rudimentär vägg av dem:\n\n    % utils/wall.py tweets.jsonl > tweets.html\n\nDu kan skapa ett ordmoln baserat på tweets du samlat in:\n\n    % utils/wordcloud.py tweets.jsonl > wordcloud.html\n\nOm du har samlat in tweets genom att använda `replies` kan du skapa en 
statisk D3\nvisualisering av dem med:\n\n    % utils/network.py tweets.jsonl tweets.html\n\nDu kan även slå samman tweets per användare, vilket gör att du kan se centrala konton.\n\n    % utils/network.py --users tweets.jsonl tweets.html\n\nOch om du vill använda nätverksgrafen i ett program som [Gephi](https://gephi.org/), så kan du generera en GEXF-fil med följande:\n\n    % utils/network.py --users tweets.jsonl tweets.gexf\n\ngender.py  är ett filter som låter dig filtrera tweets baserat på en gissining författarens kön. Till exempel kan du filtrera ut alla tweets som\nser ut som de var skrivna av kvinnor och skapa ett ordmoln:\n\n    % utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html\n\nDu kan få ut [GeoJSON](http://geojson.org/) från tweets där geo-koordinater finns tillgängliga:\n\n    % utils/geojson.py tweets.jsonl > tweets.geojson\n\nAlternativt kan du exportera GeoJSON med centroider som ersättning för bounding boxes:\n\n    % utils/geojson.py tweets.jsonl --centroid > tweets.geojson\n\nOch om du exporterar GeoJSON med centroider, så kan du lägga till lite slumpmässig fuzz:\n\n    % utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson\n\nFör att filtrera tweets baserat på tillgänglighet av geo-koordinater (eller plats, se [API documentation](https://dev.twitter.com/overview/api/places)):\n\n    % utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl\n    % cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl\n\nFör att filtrera tweets genom ett GeoJSON-staket (Kräver [Shapely](https://github.com/Toblerity/Shapely)):\n\n    % utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl\n    % cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl\n\nOm du misstänker att du har duplikat i dina tweetinsamlingar kan du ta bort duplikaten:\n\n    % utils/deduplicate.py tweets.jsonl > deduped.jsonl\n\nDu kan 
sortera efter ID, vilket är samma sak som att sortera efter tid.\n\n    % utils/sort_by_id.py tweets.jsonl > sorted.jsonl\n\nDu kan filtrera bort alla tweets före ett specifikt datum (till exempel, om en hashtag användes för en annan händelse före det du är intresserad av):\n\n    % utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl\n\nDu kan få en lista i HTML över vilka klienter som använts:\n\n    % utils/source.py tweets.jsonl > sources.html\n\nOm du vill ta bort retweets:\n\n    % utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl\n\nEller lösa förkortade url:er (kräver [unshrtn](https://github.com/edsu/unshrtn)):\n\n    % cat tweets.jsonl | utils/unshorten.py > unshortened.jsonl\n\nNär du har löst de förkortade url:erna kan du få en ranklista över de mest tweetade url:erna:\n\n    % cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt\n\n## twarc-report\n\nYtterligare verktyg för att generera CSV-filer eller json lämpad för att använda med\n[D3.js](http://d3js.org/) visualiseringar kan du hitta i\n[twarc-report](https://github.com/pbinkley/twarc-report) projektet. Verktyget\n `directed.py`, tidigare en del av twarc, har flyttat till twarc-report som\n`d3graph.py`.\n\nVarje skript kan också generera en html-demo av en D3 visualisering, t.ex.\n[timelines](https://wallandbinkley.com/twarc/bill10/) eller en\n[riktad graf av retweets](https://wallandbinkley.com/twarc/bill10/directed-retweets.html).\n\nÖversättning: [Andreas Segerberg]\n\n[Engelska]: https://github.com/DocNow/twarc/blob/main/README.md\n[Japanska]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md\n[Portugisiska]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md\n[Spanska]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md\n[Swahili]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md\n[Andreas Segerberg]: https://github.com/Segerberg\n"
  },
  {
    "path": "docs/twarc1_sw_ke.md",
    "content": "twarc1\n\n=====\n\ntwarc ni chombo ya command-line na Python Library ya kuhifadhi Twitter JSON\ndata. Kila Tweet ita akilishwa kama kitu ya JSON ita onyeshwa\n[hivi](https://dev.twitter.com/overview/api/tweets) kutoka kwa Twitter API.\nTweets zita wekwa kama [line-oriented\nJSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON). twarc\nita kusaidia ku chunga [rate\nlimits](https://dev.twitter.com/rest/public/rate-limiting) ya API ya Twitter.\ntwarc pia ita sanya tweets, watumiaji wa Twitter, uwenendo za Twitter na ita\nhydrate tweet ids.\n\ntwarc imeundwa kama sehemu ya [Documenting the Now](http://www.docnow.io) ambayo\nilifadhiliwa na [Mellon Foundation](https://mellon.org/).\n\n## Weka\n\nKabla kutumia twarc utahitaji kujiandikisha kwa\n[apps.twitter.com](http://apps.twitter.com). Mara baada ya kuunda programu yako\nandika `consumer key` and `consumer secret` yako alafu bonyeza kuzalisha `access\ntoken` na `access token secret`. Uta hitaji hizi vigezo nne ku tumia twarc\n\n1. weka [Python](http://python.org/download) (2 or 3)\n2. pip install twarc (ama kuboresha: pip install --upgrade twarc)\n\n## Haraka Haraka\n\nUtahitaji kuambia twarc vifunguo ya API ya Twitter\n\n    twarc configure\n\nalafu jaribu kuchungua na:\n\n    twarc search blacklivesmatter > search.jsonl\n\nAma wataka kusanya ma tweets kama zinatoka\n\n    twarc filter blacklivesmatter > stream.jsonl\n\nEndelea kusoma ku pata maelezo kuhusu utumizi wa twarc\n\n## Matumizi\n\n### Sanidi\n\nMara tu una vifunguo vya Twitter unaweza kuambia twarc ukitumia command ya\n`configure`.\n\n    twarc configure\n\ntwarc ita andika sifa zako kwenye file itayo itwa `.twarc` kwa saraka ya home.\nKama hutaki ama huwezi kuandika file hiyo unaweza kutumia command inayo tumia\nmazingira yako. 
(`CONSUMER_KEY`,\n`CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) ama chagua command line\n(`--consumer_key`, `--consumer_secret`, `--access_token`,\n`--access_token_secret`).\n\n### Uchunguzi\n\nHutumia [uchunguzi wa\ntweets](https://dev.twitter.com/rest/reference/get/search/tweets) kupakua tweets\nzilizoandikwa zinazo swala\n\n    twarc search blacklivesmatter > tweets.jsonl\n\nNi muhimu kukumbuka swali yako ita pakua tweets za mda wa siku 7 inayo tiwa na\nAPI ya Twitter. Kama swali yako inataka mda wa siku nane au zaidi waeza kutumia\n`filter` ama `sample` commands kama hizi.\n\nNjia bora ya kujifunza na uchunguzi wa Twitter Search API ni ku jaribu\n[Twitter's Advanced Search](https://twitter.com/search-advanced) alafu kuitumia\nkwa twarc. Kwa mfano hapa tuna tafuta ma tweets zinazo \\#blacklivesmatter ama\n#blm hashtags zilizo tumwa kwa deray.\n\n    twarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl\n\nTwitter hujaribu kuweka lugha ya tweet na unaweza kupunguza kikoma yako kwa\nlugha ukitaka\n\n    twarc search '#blacklivesmatter' --lang fr > tweets.jsonl\n\nUnaweza pia kutafuta tweets za mahali fulani kwa mfano tweets zinazo taja\n*blacklivesmatter* zilizo maili 1 kutoka katikati ya Ferguson, Missouri:\n\n    twarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl\n\nIkiwa swali yako haina maneno lakini umetumia `--geocode` utapata tweets zote za\neneo hio.\n\n    twarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl\n\n### Chuja\n\nUtumizi wa `filter` command husanya tweets zikiandikwa no hutumia\n[statuses/filter](https://dev.twitter.com/streaming/reference/post/statuses/filter)\nAPI.\n\n    twarc filter blacklivesmatter,blm > tweets.jsonl\n\nTafadhali kumbuka kuwa syntax ya Twitter ni tofauti na Twitter ya uchunguzi.\nTafadhali wasiliana na nyaraka jinsi ya kueleza chujia unayo tumia\n\nTumia command ya `follow` kama wataka kusanya tweets kutoka kwa mtumiaji kama\nzinatokea. Hi inajumuisha retweets. 
Kwa mfano hii itasanya tweets na retweets za\nCNN:\n\n    twarc filter --follow 759251 > tweets.jsonl\n\nWaeza kusanya tweets kwa kutumia sanduku linalozingatia. Kumbuka: dash\ninayoongoza inahitaji kutoroka katika sanduku linalozingatia ama ita fasiriwa\nkama command line argument!\n\n    twarc filter --locations \"\\-74,40,-73,41\" > tweets.jsonl\n\nIkiwa unachanganya chaguzi yako au OR'ed pamoja. Kwa mfano hii ita sanya tweets\nzinasotumia blacklivesmatter ama blm na pia tweets kutoka mtumiaji CNN:\n\n    twarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl\n\n### Sampuli\n\nTumia `sample` command kusikiliza kwa sampuli ya Twitter\n[statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample)\nstatuses hivi karibuni\n\n    twarc sample > tweets.jsonl\n\n### Punguza maji\n\ntwarc ina `dehydrate` command ita tengeneza orodha ya id kutoka faili ya tweets:\n\n    twarc dehydrate tweets.jsonl > tweet-ids.txt\n\n### Hydrate\n\ntwarc pia ina `hydrate` command ita soma faili inayo id na ita andika faili mpya\nya tweet JSON kwa kutumiya Twitter [status/lookup](https://dev.twitter.com/rest/reference/get/statuses/lookup) API.\n\n    twarc hydrate ids.txt > tweets.jsonl\n\nAPI ya Twitter [Masharti ya\nHuduma](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter)\nhuwazuia watu kutengeza kiasi kubwa ya Twitter data ipatikane kwenye Web. Hiyo\ndata yaeza kutumiwa kwa uchunguzi bora isi shirikiana na ulimwengu. Twitter\nhuruhusu mafaili ya tweet identifiers kugawanywa no hiyo inaweza kuwa na\nmanufaa. Waeza kutumia API ya Twitter ku *hydrate* hiyo data ama kupata kamili\nya JSON. 
Hi ni muhimu kwa\n[uthibitishaji](https://en.wikipedia.org/wiki/Reproducibility) ya social media\nresearch.\n\n### Watumiaji\n\nUtumizi was `users` command hurudisha metadata ya majina ya skrini iliyopewa\n\n    twarc users deray,Nettaaaaaaaa > users.jsonl\n\nWaeza pia kuipatia ids za watumiaji\n\n    twarc users 1232134,1413213 > users.jsonl\n\nWaeza kutumia faili iliyo na ids za watumiaji kwa mfano wataka `followers` na\n`friends` commands\n\n    twarc users ids.txt > users.jsonl\n\n### Wafuasi\n\nUtumizi wa `followers` hutegemeya [follower id\nAPI](https://dev.twitter.com/rest/reference/get/followers/ids) ku kusanya ids za\nmfuasi moja kwa kila ombi. Kwa mfano:\n\n    twarc followers deray > follower_ids.txt\n\nita rudisha mfuasi moja kwa kila laini. Faili yako ita andikwa na wafuasi wa\nhivi karibuni kwanza.\n\n### Mwelekeo\n\nUtumizi wa `trends` hutegemeya API ya Twitter ya mwelekeo wa hashtags. Unahitaji\nkuipatia [Where On Earth](http://developer.yahoo.com/geo/geoplanet/) identifier\n(`woeid`) kuiambia mwenendo unayopenda. 
Kwa mfano kama wataka maelekeo ya St.\nLouis:\n\n    twarc trends 2486982\n\nUkitumia `woeid` ya 1 itarudisha mwenendo wa dunia yote.\n\n    twarc trends 1\n\nIkiwa hujui nini cha kutumia ya `woeid` iache na utapata maeneo yote ambayo\nTwitter hufuata:\n\n    twarc trends\n\nKama una geo-location waeza kuitimia badala ya `woeid`\n\n    twarc trends 39.9062,-79.4679\n\nTwitter ita tumia API ya [trends/closest](https://dev.twitter.com/rest/reference/get/trends/closest) ili kupata `woeid` iliyo karibu nawe\n\n### Muda wa wakati\n\nUtumiaji wa `timeline` command hutegemeya kwa API ya [user timeline\nAPI](https://dev.twitter.com/rest/reference/get/statuses/user_timeline)\nkukusanya Tweets za mtumiaji alionyeshwa na `screen_name`:\n\n    twarc timeline deray > tweets.jsonl\n\nUnaweza pia kuangalia juu ya watumiaji kwa kutumia id ya mtumiaji\n\n    twarc timeline 12345 > tweets.jsonl\n\n### Retweets\n\nUnaweza kupata retweets kwa kuipeya id ya tweet hivi:\n\n    twarc retweets 824077910927691778 > retweets.jsonl\n\n### Majibu\n\nTwitter haina API ambayo inaweza kupata majibu za tweet. twarc hujaribu kwa\nkutumia search API. Lakino search API haiwezi kupata majibu zaidi ya siku saba.\nIkiwa unataka kupata majibu ya tweets fanya hivi:\n\n    twarc replies 824077910927691778 > replies.jsonl\n\nUtumizi wa `--recursive` utapata majibu ya majibu na quotes. Hii inaweza\nkuchukua muda mrefu kukamilisha kama una majibu mengi kwa sababu ya kiwango cha\nkupunguzwa search API.\n\n    twarc replies 824077910927691778 --recursive\n\n### Orodha\n\nIli kupata watumiaji walio kwenye orodha unaweza kutumia URL ya orodha na\ncommand ya `listmembers`\n\n    twarc listmembers https://twitter.com/edsu/lists/bots\n\n## Tumia kama Maktaba\n\nIkiwa unataka kutumia twarc programatically kama maktaba kukusanya tweets.\nKwanza utahitaji kuunda `twarc` instance yako. 
(utatumia sifa zako za Twitter),\nalafu utaitumia kutafuta matokeo ya utafutaji, futa matokeo au matokeo ya\nkufuatilia.\n\n```python\nfrom twarc import Twarc\n\nt = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)\nfor tweet in t.search(\"ferguson\"):\n    print(tweet[\"text\"])\n```\n\nUnaweza kufanya hivyo kwa mkondo wa machujio ya tweets ambazo zinafanana na\nkufuatilio neno muhimu:\n\n```python\nfor tweet in t.filter(track=\"ferguson\"):\n    print(tweet[\"text\"])\n```\n\nau mahali\n\n```python\nfor tweet in t.filter(locations=\"-74,40,-73,41\"):\n    print(tweet[\"text\"])\n```\n\nau ids za watumiaji\n\n```python\nfor tweet in t.filter(follow='12345,678910'):\n    print(tweet[\"text\"])\n```\n\nVivyo hivyo unaweza ku hydrate tweet identifiers kwa kupitisha orodha ya ids au\njenereta:\n\n```python\nfor tweet in t.hydrate(open('ids.txt')):\n    print(tweet[\"text\"])\n```\n\n## Vya Kutumia\n\nKatika saraka `utils` kuna commands zinazo weza kukusaidia kufanya kazi na\nline-oriented JSON kama kuchapisha ma tweets kwa text au html, kuchimba majina\nza watumiaji, URLS. If tengeneza script yako tafadhali tushirikiana na PR.\n\nUnapopata tweets unaweza kuunda ukuta mzuri wako:\n\n    % utils/wall.py tweets.jsonl > tweets.html\n\nUnaweza kuunda wingu ya maneno ya tweets ulizo sanya ambayo in neno nasa\n\n    % utils/wordcloud.py tweets.jsonl > wordcloud.html\n\nIkiwa umekusanya tweets kwa kutumia `majibu` unaweza kuunda taswira ya D3 na:\n\n    % utils/network.py tweets.jsonl tweets.html\n\nUnaweza kuimarisha tweets za mtumiaji, kukuruhusu kuona akaunti kuu:\n\n    % utils/network.py --users tweets.jsonl tweets.html\n\nNa kama unataka kutumia grafu ya mtandao katika mpango kama\n[Gephi](https://gephi.org/), unaweza kuuna faili ya GEXF na\n\n    % utils/network.py --users tweets.jsonl tweets.gexf\n\n`gender.py` ni chujio kinachokuwezesha kufuta tweets kulingana na nadhani kuhusu\njinsia ya mwandishi. 
Kwa mfano unaweza kufuta tweets zote ambazo\nkuangalia kama walikuwa kutoka kwa wanawake, na kuunda wingu neno na:\n\n    % utils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html\n\nUnaweza kutoa [GeoJSON](http://geojson.org/) ya tweets kama geo coordinates\nziko:\n\n    % utils/geojson.py tweets.jsonl > tweets.geojson\n\nUnaweza pia kuto GeoJSON na centriods, kubadilisha nafasi ya masanduku:\n\n    % utils/geojson.py tweets.jsonl --centroid > tweets.geojson\n\nNa ukitoa GeoJSON na centroids, unaweza kuongeza random fuzzing:\n\n    % utils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson\n\nIli kufuta tweets kwa kuwepo au kutokuwepo kwa kuratibu za geo (au Mahali, angalia nyaraka za [API](https://dev.twitter.com/overview/api/places)):\n\n    % utils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl\n    % cat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl\n\nIli kufuta tweets na uzio wa GeoJSON (inahitaji [Shapely](https://github.com/Toblerity/Shapely)):\n\n    % utils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl\n    % cat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl\n\nIkiwa unadhani una duplicate kwenye tweets zako unaweza kuwapunguza:\n\n    % utils/deduplicate.py tweets.jsonl > deduped.jsonl\n\nUnaweza kuchagua na ID, ambayo ni sawa na kutatua kwa wakati:\n\n    % utils/sort_by_id.py tweets.jsonl > sorted.jsonl\n\nUnaweza kufuta tweets zote kabla ya tarehe fulani (kwa mfano, kama hashtag ilitumiwa kwa tukio lingine kabla ya moja unayopenda):\n\n    % utils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl\n\nUnaweza kupata orodha ya HTML ya wateja kutumika:\n\n    % utils/source.py tweets.jsonl > sources.html\n\nIkiwa unataka kuondoa retweets:\n\n    % utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl\n\nAu unshorten urls (requires 
[unshrtn](https://github.com/docnow/unshrtn)):\n\n    % cat tweets.jsonl | utils/unshorten.py > unshortened.jsonl\n\nMara baada ya kufuta URL zako unaweza kupata orodha ya vya URL inayo tweets nyingi zaidi:\n\n    % cat unshortened.jsonl | utils/urls.py | sort | uniq -c | sort -nr > urls.txt\n\n## twarc-report\n\nBaadhi ya scripts zaidi ya huduma ili kuzalisha csv au json pato yanafaa kwa\nkutumia na [D3.js](http://d3js.org/) visualizations hupatikana katika\n[twarc-report](https://github.com/pbinkley/twarc-report). `directed.py` ilikuwa\nsehemu ya twarc imehama kwa twarc-report kama `d3graph.py`.\n\nKila script pia inaweza kuzalisha demo html ya taswira ya D3, kwa mfano [timelines](https://wallandbinkley.com/twarc/bill10/) au\n[directed graph of retweets](https://wallandbinkley.com/twarc/bill10/directed-retweets.html).\n\n[Kihispania]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md\n[Kiingereza]: https://github.com/DocNow/twarc/blob/main/README.md\n[Kijapani]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md\n[Kireno]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md\n[Kisweden]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md\n"
  },
  {
    "path": "docs/twarc1_zw_zh.md",
    "content": "twarc1\n=====\n\ntwarc 是一个用来处理并存档推特 JSON 数据的命令行工具和 Python 包。\n\n[正如](https://dev.twitter.com/overview/api/tweets)推特 API 返回的一样，twarc 处理的每一条推文都用一个 JSON 对象来表示。twarc 会自动处理推特 API 的[流量限制](https://dev.twitter.com/rest/public/rate-limiting)。除了可以让你收集推文之外，twarc 还可以帮助你收集用户信息、当下流行的标签和根据 id 获得推文的详细信息。\n\ntwarc 是作为 [Mellon Foundation](https://mellon.org/) 资助下的 [Documenting the Now](http://www.docnow.io) 项目的一部分开发的。\n\n## 安装\n\n在使用 twarc 之前，你需要在 [apps.twitter.com](http://apps.twitter.com) 注册一个应用。一旦你注册了你的应用，记下你的 `consumer key` 和 `consumer secret` 并点击生成一组 `access token` 和 `access token secret`. 这四个数据在手你就可以开始使用 twarc 了。\n\n1. 安装 [Python](http://python.org/download) (2 或者 3)\n2. [pip](https://pip.pypa.io/en/stable/installing/) install twarc\n\n### 使用Homebrew (仅限macOS 系统)\n\nmacOS系统用户, 你可以通过Homebrew安装 `twarc` :\n\n```shell\n$ brew install twarc\n```\n\n## 快速开始:\n\n首先你需要告诉 twarc 你的应用 keys 并授权它访问一个或者多个推特账号：\n\n```shell\ntwarc configure\n```\n\n然后尝试搜索\n\n```shell\ntwarc search blacklivesmatter > search.jsonl\n```\n\n或者你想试试实时搜索?\n\n```shell\ntwarc filter blacklivesmatter > stream.jsonl\n```\n\n请阅读下文了解更多这些命令的意义和更多内容。\n\n## 使用\n\n### 配置\n\n在获得应用 keys 之后你可以通过 `configure` 命令来告诉 twarc 它们的值。\n\n```shell\ntwarc configure\n```\n\n这样做会在你的 `~` 目录下创建一个名为 `.twarc` 的文件来储存你的这些凭证，这样你就不必每次使用 twarc 的时候输入它们。如果你倾向于每次使用 twarc 的时候输入 keys，你可以使用环境变量 (`CONSUMER_KEY`,\n`CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) 或者使用命令行工具选项 (`--consumer_key`, `--consumer_secret`, `--access_token`,\n`--access_token_secret`).\n\n### 搜索\n\n搜索功能使用推特的[搜索推文](https://dev.twitter.com/rest/reference/get/search/tweets) API endpoint 来下载*已经存在*的符合搜索字符串的推文。\n\n```shell\ntwarc search blacklivesmatter > tweets.jsonl\n```\n\n尤其需要注意的是 `search` 返回的是过去七天内的推文：这是推特搜索 API 的限制。如果你觉得这太短了——我们也觉得——你或许会更愿意尝试使用下文提到的 `filter` 和 `sample` 命令。\n\n最好的快速上手推特搜索语法的方法是实验[推特高级搜索](https://twitter.com/search-advanced)这个页面上的样例。你可以复制粘贴搜索框里的查询语句。比如这里有一个比较复杂的查询语句，它搜索包含有 `#blacklivesmatter` 和 `#blm` 关键字并发给 [deray](https://twitter.com/deray) 
的推文。\n\n```shell\ntwarc search '#blacklivesmatter OR #blm to:deray' > tweets.jsonl\n```\n\n你还应当看一看 Igor Brigadir 关于推特高级搜索语法`精彩绝伦`的指南: [推特高级搜索 (英文)](https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md). 这份指南里包含了很多阅读推特搜索文档后依然不显然的玄妙之处。\n\n推特尝试显式地定义推文的语言。你可以尝试使用 [ISO 639-1] 规范限制你获得的推文的语言。\n\n```shell\ntwarc search '#blacklivesmatter' --lang fr > tweets.jsonl\n```\n\n你还可以通过位置来搜索。比如你可以搜索包含 `#blacklivesmatter` 且位置定位在密苏里弗格森半径1英里之内的推文。\n\n```shell\ntwarc search blacklivesmatter --geocode 38.7442,-90.3054,1mi > tweets.jsonl\n```\n\n如果一个包含 `--geocode` 的搜索没有包含要查询的字符串，那么你将得到所有与该位置和其半径相关的推文。\n\n```shell\ntwarc search --geocode 38.7442,-90.3054,1mi > tweets.jsonl\n```\n\n### 过滤\n\n`filter` 命令使用推特的 [状态/过滤](https://dev.twitter.com/streaming/reference/post/statuses/filter) API 来搜集实时推文。\n\n```shell\ntwarc filter blacklivesmatter,blm > tweets.jsonl\n```\n\n请注意推特的 `track` 查询语句的语法和搜索 API 里的语法略有不同。请使用官方文档来了解如何最好地表达你的过滤命令选项。\n\n使用 `follow` 命令行参数和用户的 id 来实时收集某个具体用户的推文。注意这个命令的结果包含转推。举个例子，下面的命令搜索 `CNN` 的推文和转推。\n\n```shell\ntwarc filter --follow 759251 > tweets.jsonl\n```\n\n你还可以限制一个地理上的矩形边界来收集推文。注意经纬度数据中的短横线必须用`\\`转义，否则它将被理解成一个命令行参数！\n\n```shell\ntwarc filter --locations \"\\-74,40,-73,41\" > tweets.jsonl\n```\n\n你可以使用 `lang` 命令行参数来传入 [ISO 639-1] 语言代码来限制语言。你还可以多次使用这个参数指定多种语言。下面的例子实时收集提到了巴黎和马德里的法语推文和西班牙语推文：\n\n```shell\ntwarc filter paris,madrid --lang fr --lang es\n```\n\n`filter` 和 `follow` 命令是**或**关系。下面的例子将收集包含 `blacklivesmatter` 或者 `blm` 关键字的推文，或者是来自 CNN 的推文。\n\n```shell\ntwarc filter blacklivesmatter,blm --follow 759251 > tweets.jsonl\n```\n\n但是将位置和语言限制合并将得到**和**的关系，下面的例子收集来自纽约且被标记为法语或者西班牙语的推文。\n\n```shell\ntwarc filter --locations \"\\-74,40,-73,41\" --lang es --lang fr\n```\n\n### 采样\n\n使用 `sample` 命令来监听推特的 [状态/采样](https://dev.twitter.com/streaming/reference/get/statuses/sample) API 来“随机“采样最近的、公开的推文。\n\n```shell\ntwarc sample > tweets.jsonl\n```\n\n### `脱水`\n\n所谓的脱水 `dehydrate` 命令读取一个推文的 jsonl 文件，生成一个包含推文 id 的列表。\n\n```shell\ntwarc dehydrate tweets.jsonl > 
tweet-ids.txt\n```\n\n### `补水`\n\ntwarc 所谓的补水命令 `hydrate` 是 `dehydrate` 的反过程，它读取一个包含推文 id 的文件，使用推特的 [状态/检索](https://dev.twitter.com/rest/reference/get/statuses/lookup) API 重建包含完整推文 json 的 jsonl 文件。\n\n```shell\ntwarc hydrate ids.txt > tweets.jsonl\n```\n\n推特 API 的[服务条款](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) 反对用户将大量原始推文数据公布在网络上。数据可以被用来研究使用和保存在本地，但是不可以和世界分享。不过，推特确实允许用户大量地将推文 id 公开分享，而这些 id 可以用来重建推文 JSON 数据——通过 `hydrate` 命令和推特的 API. 这一点对于社交媒体研究中的[复现](https://en.wikipedia.org/wiki/Reproducibility)尤为重要。\n\n### 用户\n\n用户 `users` 命令可以返回（多个）用户的元数据。用户的名称由推特上的屏幕名称唯一确认。（译者注：屏幕名称即你 @ 某用户时所显示的字符串）。\n\n```shell\ntwarc users deray,Nettaaaaaaaa > users.jsonl\n```\n\n你也可以使用用户的 id.\n\n```shell\ntwarc users 1232134,1413213 > users.jsonl\n```\n\n你也可以使用一个包含用户 id 的文件作为输入，这在你同时使用 `followers` 和 `friends` 命令时尤其有用。举例如下：\n\n```shell\ntwarc users ids.txt > users.jsonl\n```\n\n### 粉丝\n\n粉丝 `followers` 命令使用推特的 [粉丝 id](https://dev.twitter.com/rest/reference/get/followers/ids) API 来收集推特用户粉丝的 id 信息。该命令的输入只能是一个用户的屏幕名称。举例如下：\n\n```shell\ntwarc followers deray > follower_ids.txt\n```\n\n输出的结果每一行是一个粉丝用户 id. 最新的粉丝将出现在最前面，依时间顺序倒序排列。\n\n### 朋友\n\n和粉丝 `followers` 命令类似，朋友 `friends` 命令将使用推特的 [朋友 id](https://dev.twitter.com/rest/reference/get/friends/ids) API 收集推特用户朋友的 id 信息。该命令的输入只能是一个用户的屏幕名称。举例如下：\n\n```shell\ntwarc friends deray > friend_ids.txt\n```\n\n### 当下流行\n\n当下流行 `trends` 命令可以用来搜索当下流行的标签。你需要一个 [地球上哪里](https://web.archive.org/web/20180102203025/https://developer.yahoo.com/geo/geoplanet/) 的 id (woeid) 来指明你对哪个地理位置的当下流行标签感兴趣。下面这个例子中的 `2486982` 代表圣路易斯：\n\n```shell\ntwarc trends 2486982\n```\n\n令 `woeid` 为 1 即为搜索全球范围内当下流行的标签：\n\n```shell\ntwarc trends 1\n```\n\n如果你不确定 `woeid`, 可以留空，这样推特会返回一个列表，包括全球各地的当下流行标签。\n\n```shell\ntwarc trends\n```\n\n如果你已经知道确切的地理信息，可以用它来替代 `woeid`. 
\n\n```shell\ntwarc trends 39.9062,-79.4679\n```\n\n这里的原理是 twarc 将使用推特的[趋势/最近位置](https://dev.twitter.com/rest/reference/get/trends/closest) API 找到距离指定地点最近的 `woeid`.\n\n### 时间线\n\n时间线 `timeline` 命令将通过推特的[时间线](https://dev.twitter.com/rest/reference/get/statuses/user_timeline) API 收集某个用户最近的推文。用户名称由其屏幕名称指定。\n\n```shell\ntwarc timeline deray > tweets.jsonl\n```\n\n你也可以使用用户 id.\n\n```shell\ntwarc timeline 12345 > tweets.jsonl\n```\n\n### 转推\n\n你可以使用下面这个例子的格式来获得 id 为 `824077910927691778` 这条推文的转推。\n\n```shell\ntwarc retweets 824077910927691778 > retweets.jsonl\n```\n\n输入也可以是一个包含推文 id 的文本。\n\n```shell\ntwarc retweets ids.txt > retweets.jsonl\n```\n\n### 回复\n\n推特的 API 不支持获得回复，但是 twarc 可以通过搜索 API 来近似模拟这一功能。因为搜索 API 的搜索时间区间只有过去一周，所以 twarc 只能得到某条推文过去一周的回复。\n\n下面这个例子使用推文 id 作为输入。\n\n```shell\ntwarc replies 824077910927691778 > replies.jsonl\n```\n\n使用 `--recursive` 选项可以获得回复的回复以及引用。注意这可能会花费很长时间，因为推特的搜索 API 有流量限制。\n\n```shell\ntwarc replies 824077910927691778 --recursive\n```\n\n### 列表\n\n你可以将推特用户列表的 URL 传入 `listmembers` 命令得到列表中的用户：\n\n```shell\ntwarc listmembers https://twitter.com/edsu/lists/bots\n```\n\n## 付费搜索 API\n\n推特引入了付费搜索 API. 它可以让你通过付款的方式实现更高级的搜索功能。你需要在[仪表板](https://developer.twitter.com/en/dashboard) 配置一个环境。在此之后，你的搜索将不再局限于最近7天内的推文：你可以搜索过去30天内的推文备份，甚至完整的推文备份。如果需要在命令行实现这一功能，你需要告诉 twarc 你在使用哪一个 endpoint 和环境。\n\n为了控制预算，你可能需要限制搜索的时间段：使用 `--to_date` 和 `--from_date`. 
除此之外，你还可以使用 `--limit` 参数来限制返回的推文数目上限。\n\n举例来看，假设今天是2020年6月1日，你想搜索从2020年5月1日到2020年5月14日所有提到 `blacklivesmatter` 的推文，且不超过1000条。如果我们的环境名为 `docnowdev`，那么命令如下。注意我们使用了 `--30day` 这个 endpoint:\n\n```shell\ntwarc search blacklivesmatter \\\n    --30day docnowdev \\\n    --from_date 2020-05-01 \\\n    --to_date 2020-05-14 \\\n    --limit 1000 \\\n    > tweets.jsonl\n```\n\n类似的，如果你要搜索超过30天期限的全部推文备份，你需要使用 fullarchive, 举例如下：\n\n```shell\ntwarc search blacklivesmatter \\\n    --fullarchive docnowdev \\\n    --from_date 2014-08-04 \\\n    --to_date 2014-08-05 \\\n    --limit 1000 \\\n    > tweets.jsonl\n```\n\n如果你的环境在沙盒之中，你需要使用 `--sandbox` 参数来告诉 twarc 不要获得超过100条推文。默认的非沙盒环境的上限是500条。\n\n```shell\ntwarc search blacklivesmatter \\\n    --fullarchive docnowdev \\\n    --from_date 2014-08-04 \\\n    --to_date 2014-08-05 \\\n    --limit 1000 \\\n    --sandbox \\\n    > tweets.jsonl\n```\n\n## Gnip 企业级 API\n\ntwarc 支持和 Gnip 推特全备份企业级 API 的完全整合。你需要使用 `--gnip_auth` 参数并设置好 `GNIP_USERNAME`、 `GNIP_PASSWORD`、 `GNIP_ACCOUNT` 三个环境变量。举例如下：\n\n```shell\ntwarc search blacklivesmatter \\\n    --gnip_auth \\\n    --gnip_fullarchive prod \\\n    --from_date 2014-08-04 \\\n    --to_date 2015-08-05 \\\n    --limit 1000 \\\n    > tweets.jsonl\n```\n\n## 作为一个 Python 包的 twarc\n\n如果你想在你自己的代码里使用 twarc 的话，你需要首先创建一个 `Twarc` 实例，传入你的推特应用凭证，然后用它进行搜索、过滤和检索。举例如下：\n\n```python\nfrom twarc import Twarc\n\nt = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)\nfor tweet in t.search(\"ferguson\"):\n    print(tweet[\"text\"])\n```\n\n你还可以用同样的语法过滤满足关键字匹配的实时信息流。举例如下：\n\n```python\nfor tweet in t.filter(track=\"ferguson\"):\n    print(tweet[\"text\"])\n```\n\n或者地点：\n\n```python\nfor tweet in t.filter(locations=\"-74,40,-73,41\"):\n    print(tweet[\"text\"])\n```\n\n或者用户 id:\n\n```python\nfor tweet in t.filter(follow='12345,678910'):\n    print(tweet[\"text\"])\n```\n\n类似的，你还可以传入一个包含推文 id 的文件，“补水”以获得完整信息。举例如下：\n\n```python\nfor tweet in t.hydrate(open('ids.txt')):\n    print(tweet[\"text\"])\n```\n\n## 
基于用户的验证和基于应用的验证\n\ntwarc 自动处理推特的流量限制。但是你应该了解流量限制会因为验证方式的不同而不同。推特有两种验证方式分别是基于用户的验证和基于应用的验证。 twarc 默认使用基于用户的验证方式但是你可以告诉 twarc 使用基于应用的验证。\n\n举个例子，转为基于应用的验证可以显著提高搜索功能的效率。基于用户的验证每分钟可以发出180个请求（每天160万条结果），而基于应用的验证每分钟可以发出450个请求（每天430万个结果）。\n\n需要注意的是，用 “补水”功能访问 `状态/检索 (status/lookup)` 这个 API endpoint 在基于用户的验证下有每15分钟900个请求的限制，而在基于应用的验证下是每15分钟300个。\n\n如果你确认你要使用基于应用的验证，你可以使用 `--app_auth` 这个命令行选项。举例如下：\n\n```shell\ntwarc --app_auth search ferguson > tweets.jsonl\n```\n\n类似的功能也可以在你的 Python 代码中实现。\n\n```python\nfrom twarc import Twarc\n\nt = Twarc(app_auth=True)\nfor tweet in t.search('ferguson'):\n    print(tweet['id_str'])\n```\n\n## 实用工具\n\n\n在 `utils` 文件夹下你可以找到几个脚本。这些脚本可以作用于 jsonl 文件上实现一些非常实用的功能：比如将 JSON 格式的推文输出为文本或者 HTML 格式, 提取用户名或者推文中引用的 URL 等等。如果你创作了一个好用的脚本，欢迎提出 PR.\n\n下面的命令可以创作一个简单的推文墙。\n\n```shell\nutils/wall.py tweets.jsonl > tweets.html\n```\n\n下面的命令可以创作一个简单的词云。\n\n```shell\nutils/wordcloud.py tweets.jsonl > wordcloud.html\n```\n\n如果你用 `replies` 命令收集了一些推文，你可以用下面的命令创作一个静态的 D3 可视化。\n\n```shell\nutils/network.py tweets.jsonl tweets.html\n```\n\n你可以增加可选参数根据用户组织推文，这样你可看到这个网络中的核心账号。\n\n```shell\nutils/network.py --users tweets.jsonl tweets.html\n```\n\n额外的，你可以创作一个标签的网络，从而看到它们彼此之间的（共存）关系。\n\n```shell\nutils/network.py --hashtags tweets.jsonl tweets.html\n```\n\n如果你想使用网络作图软件 [Gephi](https://gephi.org/),你可以用下面的命令生成一个 `GEXF` 格式的文件。\n\n```shell\nutils/network.py --users tweets.jsonl tweets.gexf\nutils/network.py --hashtags tweets.jsonl tweets.gexf\n```\n\n额外的，如果你想将网络转换成一个随时间线动态变化（节点会出现和消失）的动态网络，你可以在 Gephi 中打开生成的 `GEXF` 文件，跟随这个[教程](https://seinecle.github.io/gephi-tutorials/generated-html/converting-a-network-with-dates-into-dynamic.html)实现。注意在 `tweets.gexf` 文件里，仅有 `start_date` 一栏但是却没有 `end_date` 一栏，这会导致节点出现在屏幕上后便不再消失。对于 Gephi 中的 `Time interval creation options` 跳出窗口，`Start time column` 应该是 `start_date`, 而 `End time column` 则是空白的。`Parse dates` 应该勾选，同时选择最后一个日期格式选项：`dd/MM/yyyy HH:mm:ss`, 如下图所示。\n\n`gender.py` 
是一个可以猜测推文作者性别的脚本。比如下面的例子展示了如何保留看上去像是女性发出的推文并生成一个词云。\n\n```shell\nutils/gender.py --gender female tweets.jsonl | utils/wordcloud.py > tweets-female.html\n```\n\n你可以用含有地理定位信息的推文生成 [GeoJSON](http://geojson.org/) 格式的文件。\n\n```shell\nutils/geojson.py tweets.jsonl > tweets.geojson\n```\n\n你还可以用地理边界的[形心](https://en.wikipedia.org/wiki/Centroid)来取代地理位置矩形的边界。\n\n```shell\nutils/geojson.py tweets.jsonl --centroid > tweets.geojson\n```\n\n在此基础上你还可以加一些随机模糊。\n\n```shell\nutils/geojson.py tweets.jsonl --centroid --fuzz 0.01 > tweets.geojson\n```\n\n欲了解更多关于利用地理坐标（或地点）的存在与否过滤推文的内容，请参考[文档](https://dev.twitter.com/overview/api/places)。下面是两个例子。\n\n```shell\nutils/geofilter.py tweets.jsonl --yes-coordinates > tweets-with-geocoords.jsonl\n\ncat tweets.jsonl | utils/geofilter.py --no-place > tweets-with-no-place.jsonl\n```\n\n欲通过 GeoJSON 的边界过滤推文，请参考下面的例子。注意你需要安装 [Shapely](https://github.com/Toblerity/Shapely).\n\n```shell\nutils/geofilter.py tweets.jsonl --fence limits.geojson > fenced-tweets.jsonl\n\ncat tweets.jsonl | utils/geofilter.py --fence limits.geojson > fenced-tweets.jsonl\n```\n\n如果你怀疑你有重复的推文，可以用下面的命令去重。\n\n```shell\nutils/deduplicate.py tweets.jsonl > deduped.jsonl\n```\n\n你可以用下面的命令像根据时间线排序一样根据推文 id 排序。\n\n```shell\nutils/sort_by_id.py tweets.jsonl > sorted.jsonl\n```\n\n你可以过滤掉某一具体日期前的推文，举个例子，有可能这一日期前某个标签被用于另一个事件，而不是你感兴趣的那个。\n\n```shell\nutils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl\n```\n\n你还能够以列表的形式得到客户端信息。\n\n```shell\nutils/source.py tweets.jsonl > sources.html\n```\n\n下面的命令去除了转推。\n\n```shell\nutils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl\n```\n\n或者将短链接还原为原始 URL（需要安装 [unshrtn](https://github.com/docnow/unshrtn)）。\n\n```shell\ncat tweets.jsonl | utils/unshrtn.py > unshortened.jsonl\n```\n\n一旦你获得了原始的 URL, 你可以根据推文中提到的次数对这些 URL 排序。\n\n```shell\ncat unshortened.jsonl | utils/urls.py | sort | uniq -c 
| sort -nr > urls.txt\n```\n\n## twarc-report 项目\n\n还有一些可以生成 csv 或者 json 输出以供 [D3.js](http://d3js.org/) 可视化使用的脚本可以在 [twarc-report](https://github.com/pbinkley/twarc-report) 项目中找到。原本属于 twarc 一部分的 `directed.py` 脚本也已经被转移到了 twarc-report 项目并被重命名为 `d3graph.py`.\n\n下面的这两个链接包含了两个生成 HTML 格式的 D3 可视化文件的例子。 \n\n1. [timelines](https://wallandbinkley.com/twarc/bill10/)\n2. [directed graph of retweets](https://wallandbinkley.com/twarc/bill10/directed-retweets.html)\n\n[英语]: https://github.com/DocNow/twarc/blob/main/README.md\n[日语]: https://github.com/DocNow/twarc/blob/main/README_ja_jp.md\n[葡萄牙语]: https://github.com/DocNow/twarc/blob/main/README_pt_br.md\n[西班牙语]: https://github.com/DocNow/twarc/blob/main/README_es_mx.md\n[瑞典语]: https://github.com/DocNow/twarc/blob/main/README_sv_se.md\n[斯瓦希里语]: https://github.com/DocNow/twarc/blob/main/README_sw_ke.md\n[ISO 639-1]: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes\n"
  },
  {
    "path": "docs/twarc2_en_us.md",
    "content": "\n# twarc2\n\ntwarc2 is a command line tool and Python library for archiving Twitter JSON\ndata. Each tweet is represented as a JSON object that was returned from the\nTwitter API. Since Twitter's introduction of their [v2\nAPI](https://developer.twitter.com/en/docs/twitter-api/api-reference-index#v2)\nthe JSON representation of a tweet is conditional on the types of fields and\nexpansions that are requested. twarc2 does the work of requesting the highest\nfidelity representation of a tweet by requesting all the available data for\ntweets. \n\nTweets are streamed or stored as [line-oriented\nJSON](https://en.wikipedia.org/wiki/JSON_Streaming#Line-delimited_JSON). twarc2\nwill handle Twitter API's [rate\nlimits](https://dev.twitter.com/rest/public/rate-limiting) for you. In addition\nto letting you collect tweets twarc can also help you collect users and hydrate\ntweet ids. It also has a collection of [plugins](plugins) you can use to do\nthings with the collected JSON data (such as converting it to CSV).\n\ntwarc2 was developed as part of the [Documenting the Now](http://www.docnow.io)\nproject which was funded by the [Mellon Foundation](https://mellon.org/).\n\n## Install\n\nBefore using twarc you will need to create an application and attach it to an\nproject on your [Twitter Developer Portal](https://developer.twitter.com/en/portal/projects-and-apps). A [\"Project\"](https://developer.twitter.com/en/docs/projects/overview) is like a container for an \"Application\" with a specific purpose.\n\nIf you have Academic Access you should see an \"Academic Research\" Project,\nif not, you should see only \"Standard\" Project. 
Academic Access is a separate endpoint, see [here](twitter-developer-access.md) for notes on this.\n\nOnce you've created your application, note down the Bearer token, and or the consumer key, consumer secret,\nwhich may also be called API Key and API Secret and then optionally click to\ngenerate an access token and access token secret. With these four variables\nin hand you are ready to start using twarc.\n\n1. install [Python 3](http://python.org/download)\n2. [pip](https://pip.pypa.io/en/stable/installing/) install twarc from a terminal (such as the Windows Command Prompt available in the \"start\" menu, or the [OSX Terminal application](https://support.apple.com/en-au/guide/terminal/apd5265185d-f365-44cb-8b09-71a064a42125/mac)):\n\n```\npip install --upgrade twarc\n```\n\n### Homebrew (macOS only)\n\nFor macOS users, you can also install `twarc` via [Homebrew](https://brew.sh/):\n\n```bash\nbrew install twarc\n```\n\n### Windows\n\nIf you installed with pip and see a \"failed to create process\" when running twarc try reinstalling like this:\n\n    python -m pip install --upgrade --force-reinstall twarc\n\n## Quickstart:\n\nFirst you're going to need to tell twarc about your application API keys and\ngrant access to one or more Twitter accounts:\n\n    twarc2 configure\n\nThen try out a search:\n\n    twarc2 search \"blacklivesmatter\" results.jsonl\n\nOr maybe you'd like to collect tweets as they happen?\n\n    twarc2 filter \"blacklivesmatter\" results.jsonl\n\nSee below for the details about these commands and more.\n\n## Configure\n\nOnce you've got your Twitter developer access set up you can tell twarc what they are with the `configure` command.\n\n    twarc2 configure\n\nThis will store your credentials in your home directory so you don't have to\nkeep entering them in. 
You can use most of twarc's functionality by simply\nconfiguring the *bearer token*, but if you want it to be complete you can enter\nin the *API key* and *API secret*.\n\nYou can also set the keys in the system environment (`CONSUMER_KEY`,\n`CONSUMER_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET`) or using command line\noptions (`--consumer-key`, `--consumer-secret`, `--access-token`,\n`--access-token-secret`).\n\n## Search\n\nThis uses Twitter's [tweets/search/recent](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent) and [tweets/search/all](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all) endpoints to download *pre-existing* tweets matching a given query. This command will search for any tweets mentioning *blacklivesmatter* from the last 7 days.\n\n    twarc2 search \"blacklivesmatter\" results.jsonl\n\nIf you have access to the [Academic Research Product Track](https://developer.twitter.com/en/products/twitter-api/academic-research) you can search the full archive of tweets by using the `--archive` option.\n\n    twarc2 search --archive \"blacklivesmatter\" results.jsonl \n\nThe queries can be a lot more expressive than matching a single term. For\nexample this query will search for tweets containing either `blacklivesmatter`\nor `blm` that were sent to the user \\@deray. \n\n    twarc2 search \"(blacklivesmatter OR blm) to:deray\" results.jsonl\n\nThe best way to get familiar with Twitter's search syntax is to consult Twitter's [Building queries for Search Tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query) documentation. 
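As a rough illustration of how a query string like the one above is put together, here is a minimal Python sketch. The helper `build_query` is hypothetical (it is not part of twarc or of the Twitter API) and it assumes only the `OR` and `to:` operators used in the example; the full syntax is described in Twitter's query documentation.

```python
def build_query(terms, to_user=None):
    # Hypothetical helper, not part of twarc: joins search terms with OR and
    # optionally restricts the results to tweets sent to one user.
    if not terms:
        raise ValueError('at least one search term is required')
    # A single term needs no grouping; several terms are wrapped in
    # parentheses so the OR group is evaluated before the to: operator.
    if len(terms) == 1:
        query = terms[0]
    else:
        query = '(' + ' OR '.join(terms) + ')'
    if to_user:
        # Accept the username with or without a leading @ sign.
        query += ' to:' + to_user.lstrip('@')
    return query


# prints: (blacklivesmatter OR blm) to:deray
print(build_query(['blacklivesmatter', 'blm'], to_user='deray'))
```

The resulting string can then be passed, quoted, as the query argument of `twarc2 search`.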
\n\nYou also should definitely check out Igor Brigadir's *excellent* reference guide\nto the Twitter Search syntax:\n[Advanced Search on Twitter](https://github.com/igorbrigadir/twitter-advanced-search/blob/master/README.md).\nThere are lots of hidden gems in there that the advanced search form doesn't\nmake readily apparent.\n\n### Limit\n\nBecause there is a 500,000 tweet limit (5, or sometimes 10 million for Academic Research Track)\nyou may want to limit the number of tweets you retrieve by using `--limit`:\n\n    twarc2 search --limit 5000 \"blacklivesmatter\" results.jsonl\n\n### Time\n\nYou can also limit to a particular time range using `--start-time` and/or\n`--end-time`, which can be especially useful in conjunction with `--archive`\nwhen you are searching for historical tweets.\n\n    twarc2 search --start-time 2014-07-17 --end-time 2014-07-24 '\"eric garner\"' tweets.jsonl \n\nIf you leave off --start-time or --end-time it will be open on that side. So\nfor example to get all \"eric garner\" tweets before 2014-07-24 you would just\nleave off the `--start-time`:\n\n    twarc2 search --end-time 2014-07-24 '\"eric garner\"' tweets.jsonl \n\n### Sort Order\n\nBy default, Twitter returns the results ordered by their published date with the newest tweets being first.\nTo alter this behavior, it is possible to specify the `--sort-order` parameter.\nCurrently, it supports `recency` (the default) or `relevancy`.\nIn the latter case, tweets are ordered based on what Twitter determines to be the best results for your query.\n\n## Searches\n\nSearches works like the [search](#search) command, but instead of taking a single query, it reads from a file containing many queries. 
You can use the same limit and time options as with a single search command, but they will be applied to every query.\n\nThe input file for this command needs to be a plain text file, with one line for each query you want to run. For example, you might have a file called `animals.txt` with the following lines:\n\n    cat\n    dog\n    mouse OR mice\n\nNote that each line will be passed through directly to the Twitter API - if you have quoted strings, they will be treated as a phrase search by the Twitter API, which might not be what you intended.\n\nIf you run the following `searches` command, `animals.json` will contain at least 100 tweets for each query in the input file:\n\n    twarc2 searches --limit 100 animals.txt animals.json\n\nYou can use the `--archive` and `--start-time` flags just like a regular search command too, in this case to search the full archive of all tweets for the first day of 2020:\n\n    twarc2 searches --archive --start-time 2020-01-01 --end-time 2020-01-02 animals.txt animals.json\n\nYou can also use the `--counts-only` flag to check volumes first. This produces a csv file in the same format as the [counts](#counts) command with the `--csv` flag, with the addition of a column containing the query for that row.\n\n    twarc2 searches --counts-only animals.txt animals_counts.csv\n\nOne more thing - if you have a lot of searches you want to run, you might want to consider using the `--combine-queries` flag. This combines consecutive queries in the file into a single longer query, meaning you issue fewer API calls and potentially collect fewer duplicate tweets that match more than one query. 
Using this on the `animals.txt` file as input will combine the three queries into the single longer query `(cat) OR (dog) OR (mouse OR mice)`, and only issue one logical query.\n\n    twarc2 searches --combine-queries animals.txt animals_combined.json\n\n## Stream\n\nThe `stream` command will use Twitter's API\n[tweets/search/stream](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/api-reference/get-tweets-search-stream)\nendpoint to collect tweets as they happen. In order to use it you first need to\ncreate one or more [rules]. For example:\n\n    twarc2 stream-rules add blacklivesmatter\n\nYou can list your active stream rules:\n\n    twarc2 stream-rules list\n\nAnd you can collect the data from the stream, which will bring down any tweets that match your rules:\n\n    twarc2 stream stream.jsonl\n\nWhen you want to stop, press `ctrl-c`. This only stops the stream but doesn't delete your stream rule. To remove a rule, run:\n\n    twarc2 stream-rules delete blacklivesmatter\n\n## Sample\n\nUse the `sample` command to listen to Twitter's [tweets/sample/stream](https://developer.twitter.com/en/docs/twitter-api/tweets/sampled-stream/api-reference/get-tweets-sample-stream) API for a \"random\" sample of recent public statuses. The sampling is based on the millisecond part of the tweet timestamp.\n\n    twarc2 sample sample.jsonl\n\n## Users\n\nIf you have a file of user ids you can fetch the user metadata for them with\nthe `users` command:\n\n    twarc2 users users.txt users.jsonl\n\nIf the file contains usernames instead of user ids you can use the `--usernames` option:\n\n    twarc2 users --usernames users.txt users.jsonl\n\n## Followers\n\nYou can fetch the followers of an account using the `followers` command:\n\n    twarc2 followers deray users.jsonl\n\n## Following\n\nTo get the users that a user is following you can use `following`:\n\n    twarc2 following deray users.jsonl\n\nThe result will include exactly one user id per line. 
The response order is\nreverse chronological, or most recent followers first.\n\n## Timeline\n\nThe `timeline` command will use Twitter's [user timeline API](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-tweets) to collect the most recent tweets posted by the user indicated by screen_name.\n\n    twarc2 timeline deray tweets.jsonl\n\n## Conversation\n\nYou can retrieve a conversation thread using the tweet ID at the head of the\nconversation:\n\n    twarc2 conversation 266031293945503744 > conversation.jsonl\n\n## Likes\n\nTwarc supports the two approaches that the Twitter API exposes for collecting likes via the `liked-tweets` and `liking-users` commands. \n\nThe `liked-tweets` command returns the tweets that have been liked by a specific account. The account is specified by its user ID, which in the following example is that of Twitter's founder:\n\n    twarc2 liked-tweets 12 jacks-likes.jsonl\n\nIn this case the output file contains all of the likes of publicly accessible tweets. Note that the order of likes is not guaranteed by the API, but is probably reverse chronological, or most recent likes by that account first. The underlying tweet objects contain no information about when the tweet was liked.\n\nThe `liking-users` command returns the user profiles of the accounts that have liked a specific tweet (specified by the ID of the tweet):\n\n    twarc2 liking-users 1460417326130421765 liking-users.jsonl\n\nIn this example the output file contains all of the user profiles of the publicly accessible accounts that have liked that specific tweet. Note that the order of profiles is not guaranteed by the API, but is probably reverse chronological, or the profile of the account that most recently liked the tweet first. 
The underlying profile objects contain no information about when the tweet was liked.\n\nNote that likes of tweets that are not publicly accessible, or likes by accounts that are protected, will not be retrieved by either of these methods. Therefore, the metrics available on a tweet object (under the `public_metrics.like_count` field) will likely be higher than the number of likes you can retrieve via the Twitter API using these endpoints.\n\n## Retweets\n\nYou can retrieve the user profiles of publicly accessible accounts that have retweeted a specific tweet, using the `retweeted-by` command and the ID of the tweet as an identifier. For example:\n\n    twarc2 retweeted-by 1460417326130421765 retweeting-users.jsonl\n\nUnfortunately this only returns the user profiles (presumably in reverse chronological order) of the retweeters of that tweet - this means that important information, like when the tweet was retweeted, is not present in the returned object. \n\n## Dehydrate\n\nThe `dehydrate` command generates an id list from a file of tweets:\n\n    twarc2 dehydrate tweets.jsonl tweet-ids.txt\n\n## Hydrate\n\ntwarc's `hydrate` command will read a file of tweet identifiers and write out the tweet JSON for them using Twitter's [tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/api-reference/get-tweets)\nAPI endpoint:\n\n    twarc2 hydrate ids.txt tweets.jsonl\n\nThe input file, `ids.txt`, is expected to be a file that contains a tweet identifier on each line, without quotes or a header:\n\n```\n919505987303886849\n919505982882844672\n919505982602039297\n```\n\nTwitter API's [Terms of Service](https://dev.twitter.com/overview/terms/policy#6._Be_a_Good_Partner_to_Twitter) discourage people from making large amounts of raw Twitter data available on the Web.  The data can be used for research and archived for local use, but not shared with the world. 
Twitter does allow files of tweet identifiers to be shared, which can be useful when you would like to make a dataset of tweets available.  You can then use Twitter's API to *hydrate* the data, or to retrieve the full JSON for each identifier. This is particularly important for [verification](https://en.wikipedia.org/wiki/Reproducibility) of social media research.\n\n## Places\n\nThe search and stream APIs allow you to search by places. But in order to use\nthem you need to know the identifier for a specific place. twarc's\n`places` command will let you search by the place name, geo coordinates, or ip\naddress. For example: \n\n    twarc2 places Ferguson                 \n\nWhich will output something like:\n\n```shell\n$ twarc2 places Ferguson                 \nFerguson, MO, United States [id=0a62ce0f6aa37536]\nRuisseau-Ferguson, Québec, Canada [id=25283a1f59449e8f]\nFerguson, Victoria, Australia [id=2538e66b7e5c082c]\nFerguson Road Initiative, Dallas, United States [id=368aad647311292a]\nFerguson, Western Australia, Australia [id=45f20c78d803ad84]\nFerguson, PA, United States [id=00c92e14361c9674]\nFerguson, KY, United States [id=0190ea5612aaae32]\n```\n\nYou can then use one of the ids in a search:\n\n    twarc2 search \"place:0a62ce0f6aa37536\" tweets.jsonl\n\nYou can also search by geo-coordinates (lat,lon) and IP address. If you would prefer to see the full JSON response with the bounding boxes use the `--json` option.\n\n## Command Line Usage\n\nBelow is what you see when you run `twarc2 --help`.\n\n::: mkdocs-click:\n  :module: twarc.command2\n  :command: twarc2\n  :depth: 1\n"
  },
  {
    "path": "docs/twitter-developer-access.md",
    "content": "# Twitter Developer Access\n\nIf you have established that you would like to use Twitter Data in your study, you will need access to the API. There are several steps required to get access to the API. This is a guide on how best to engage with this process. Allow plenty of time for this.\n\nTwitter has made the process of accessing their API more strict. There are a number of restricted use cases that may require you implement additional safeguards. \n\nBefore applying, the Terms of Service for Developers and the [Restricted Use Cases](https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases) are very short and relevant to read.\n\n## Step 0: Have a Twitter account in good standing\n\nCreate and or edit your Twitter profile to fit your person or organization, preferably in English. Make sure it's public and you do the basic things like verifying your email and phone number (do not use a VoIP service), setting a non default profile picture and header, a description, links to your research group or website, a good description that identifies you as you, and preferably some friends and followers who are already on twitter in your research community. Use a good stable email provider (gmail) or your institution email as long as it is reliable and you can see any emails that may end up in spam, just in case.\n\n## Step 1: Applying for a Developer Account\n\nFill out the forms for a new Individual developer Account here: <https://developer.twitter.com/en/apply-for-access>. Team accounts are not supported with Academic Access, so do not apply for a Team account. Pay attention to the specifics of each question: especially about sharing data outside of your organization, and with other government entities. Wait for a reply. 
This may take a couple of weeks.\n\n## Step 2: Apply for the special Academic Access v2 Endpoint\n\nEven if you specify your use case as \"Academic\" in your developer application form, you will not automatically get access to the [new Search endpoint](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all) with higher limits for academic use. You must fill in an additional form: <https://developer.twitter.com/en/portal/petition/academic/is-it-right-for-you>\n\nTwitter generally prefers to grant access to faculty and postgrad researchers, not undergrad or masters students or contractors or collaborators. It may be better for the principal investigator or professor to log in from an institution account or their own one, provided it is in good standing and has an obviously identifiable online academic presence.\n\nThis application may also take a couple of days or weeks.\n\n## Step 3: Create a Project and App\n\nA Project with Academic Access should be created for you, or if you did not get Academic Access, you can create a new Standard Project. On your Dashboard <https://developer.twitter.com/en/portal/dashboard> you should see \"Academic Research\" or \"Standard\" and \"Standalone Apps\".\n\nBefore accessing the v2 API, you will need to create an App or use an existing one and add it to the Academic Access Project first. You can only have 1 App assigned to 1 Project.\n\nWhen creating an App, take note of the keys you are given:\n\nAPI Key: \n```\nhCe77nsrgew3gsdhSDGFSgsdf\n```\n\nAPI Secret: \n```\n1jWERGWBrtRTWBTwGFDHGFH66SDFGSDFGSSDFGSDFGSSDFGa11\n```\n\nBearer Token: \n```\nAAAAAAAAAAAAAAAAAAAAAAAsdfgsAAAAvSDFGSDRgssdfSDFGSDF44gsd4E%3Dkk33345336dfsgsdgsdgsdASGASDGadsGAFAKJGYIUYUIDGGKK\n```\n\nThese are fake but have the same format as real ones. Note the `%` sign in the Bearer Token - this can often cause errors when copy pasting or providing this token in a command line. 
Other common causes of errors are including a trailing space, or extra `\"` or `'` quotes or not quoting the string in code or command line. This depends on implementation.\n\nThese are important to save and [store as you would a password](https://developer.twitter.com/en/docs/authentication/guides/authentication-best-practices).\n\nContinue to \"App Settings\" and fill in the description field of the app. You don't need to change any other settings here. Generally you will only need Read Only Access and will not need \"3-legged OAuth\" or callback URLs unless you plan on using the [Account Activity API](https://developer.twitter.com/en/docs/twitter-api/enterprise/account-activity-api/overview), for example to make an interactive bot.\n\nA project must *contain* an app. The difference between a [Project](https://developer.twitter.com/en/docs/projects/overview) and [App](https://developer.twitter.com/en/docs/apps/overview) is sometimes confusing.\n\n*Standalone Apps* are for `v1.1` endpoints, Standard and Academic Access *Projects* are for `v2` endpoints.\n\n## Step 4: Collaborating with Others\n\nNow that you have your keys and tokens, you can start using the API. You may be working with other people on implementations, so you may have to share your keys with someone at some point. Do not share your Twitter user and password details for the Developer Dashboard. This is not a good idea. Currently Twitter's \"Teams\" functionality is also incompatible with Academic Access. The best way is to provide your collaborator with the keys in a plain text configuration file that you share securely, or as environment variables. When someone has your keys, they have full access to the API on your behalf.\n\nBe careful not to commit your keys into a public repository or make them visible to the public - do not include them in a client side js script for example. 
Most apps will ask for API Key and Secret, but \"Consumer Key\" is \"API Key\" and \"Consumer Secret\" is \"API Secret\".\n\nFor Academic Access, there is only one endpoint that takes Bearer (App Only) authentication, so in most cases, the Bearer Token is all you need to share.\n\n## Step 5: Next Steps\n\nInstall `twarc`, and run `twarc2 configure` to set it up.\n\nTo make arbitrary API calls for testing, [twurl](https://github.com/twitter/twurl) is a good tool, when combined with [jq](https://stedolan.github.io/jq/).\n\nTo get help, a good place is the [Developer Forums](https://twittercommunity.com/), or the [DocNow Slack](https://docs.google.com/forms/d/1Wk0JdF2Cty2VHMqpf_QlJXVKQdUtfeeFhaYRben3qaM/viewform), or [Stackoverflow](https://stackoverflow.com/) for implementation details, or the repository [Issues](https://github.com/DocNow/twarc) if it's an issue with twarc or one of the addons.\n\nTo share and publish a Twitter Dataset, extract the Tweet IDs and or User IDs, and format these as 1 ID per line in a plain text file (optionally, you can compress this file). This will make your dataset easier to process for others. See the [DocNow Catalog](https://catalog.docnow.io/) and tools like [Zenodo](https://zenodo.org/) and [Figshare](https://figshare.com/).\n"
  },
  {
    "path": "docs/windows10.md",
    "content": "# twarc2 on Windows 10\n\nThis guide assumes you already have a Twitter Developer Account, a registered App with your keys and a Bearer Token, and Python installed on Windows.\n\n## Prerequisites and Installation\n\nYou must have Python installed and working on Windows.\n\nPython will be located in different places on your computer if you installed Python from either the [official website](https://www.python.org/downloads/windows/), or from the [Microsoft App store](https://www.microsoft.com/en-us/p/python-38/9mssztt1n39l), or via [Anaconda](https://www.anaconda.com/products/individual#windows).\n\nCheck that you can run these successfully:\n\nOpen the command line `cmd.exe` or `PowerShell` or `Windows Terminal Preview` and run:\n\n`python --version`\n\nand\n\n`pip --version`\n\nIf both give you some version output without errors everything is ready to go. Otherwise, install and configure `python` and `pip`.\n\n`twarc2` CLI works best through [Windows Terminal Preview](https://www.microsoft.com/en-us/p/windows-terminal-preview/9n8g5rfz9xk3?activetab=pivot:overviewtab)\n\n## Setting up twarc2\n\nInstall `twarc2` with\n\n`pip install --upgrade twarc`\n\nIf you get a warning like \n\n```\nWARNING: The scripts twarc.exe and twarc2.exe are installed in 'C:\\Users\\t495\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python39\\Scripts' which is not on PATH.\n  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\n```\n\nYou will need to add that folder to the PATH.\n\nThis will be different for your machine, so make sure to copy the full folder location from the command prompt, without the `'` quotes with `CTRL+C`.\n\nMake sure that folder is set in PATH System Variables:\n\nIn Settings, find \"edit the system environment variables\"\n\nAfter clicking on \"Environment Variables\"\n\nEdit the \"Path\" variable in User Variables and add a 
new entry. In my case it was `C:\\Users\\t495\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python39\\Scripts`, but for you it will be different, so copy the path from the warning you received.\n\nYou should now be able to run `twarc2` from the command line:\n\n`twarc2`\n\nIf you can see the usage instructions, everything is ready to go.\n\nIn PowerShell or Command Prompt, run:\n\n`twarc2 configure`\n\nPaste in your Bearer Token, taking care not to accidentally copy an extra new line or space. Typing the keys in manually is not recommended. The API Secret prompt does not display what is being typed, but it still accepts input. If something goes wrong, you can repeat the command and start over. The keys are saved in a file that you can view with Notepad, usually `C:\\Users\\youraccount\\AppData\\Roaming\\twarc\\config`; twarc prints the location of this file after the command runs.\n\nWhen this is completed, twarc2 is ready to use.\n\n## Escaping `\"` Characters in Windows\n\nThe query you pass to a search can contain `\"` quotes for phrases, spaces, and other special characters like `:` and `()`. When entered directly at the prompt, these can be interpreted as part of the command rather than as part of the argument value. Windows has an unusual way of escaping characters on the command line.\n\nTo use a `\"` in a query, change it to `\"\"` in Windows. 
The more common escape `\\\"` does not work.\n\nFor example, if you want to search for tweets in English from the US that contain the phrase `\"live laugh love\"` or `\"home sweet home\"`, the query would be:\n\n```\nlang:en (\"live laugh love\" OR \"home sweet home\") place_country:US\n```\n\nAfter changing each `\"` to `\"\"`, the twarc2 command (`--limit` is optional) would be:\n\n```\ntwarc2 search --limit 500 \"lang:en (\"\"live laugh love\"\" OR \"\"home sweet home\"\") place_country:US\" output.json\n```\n\nThis Stack Overflow answer has the long version that explains why this works: https://stackoverflow.com/a/15262019\n\n## Output Format Errors\n\nIf you see this kind of error, for example when using `twarc2 flatten`:\n\n> ⚡ Expecting value: line 1 column 1 (char 0)\n\nIt means the file was saved incorrectly. There is an edge case in Windows when writing output: do not use `>` to redirect `stdout`. Redirection alters how files are written and adds a BOM (Byte Order Mark) that makes the files unreadable to twarc later, e.g. when using `twarc2 flatten`. To fix the file, edit it in a hex editor to remove the first 2 bytes.\n\nFor example, this will give you a bad file with a BOM:\n\n`twarc2 search --limit 100 \"dogs\" > dogs.json`\n\nWhile this will give you a correctly written UTF-8 file:\n\n`twarc2 search --limit 100 \"dogs\" dogs.json`\n\nDo not redirect stdout to a file in Windows. Instead, specify the output file as a command line argument.\n"
  },
  {
    "path": "mkdocs.yml",
    "content": "site_name: twarc\nsite_url: https://readthedocs.org/projects/twarc-project/\nsite_description: Collect Twitter JSON data from the command line.\n\nrepo_url: https://github.com/docnow/twarc\nrepo_name: twarc\nedit_uri: edit/main/docs/\n\ntheme:\n  name: \"material\"\n  logo: images/docnow.png\n  palette:\n    scheme: preference\n\nnav: \n  - Home: README.md\n  - twarc2: \n    - twarc2 (en): twarc2_en_us.md\n  - twarc1: \n    - twarc1 (en): twarc1_en_us.md\n    - twarc1 (es): twarc1_es_mx.md\n    - twarc1 (ja): twarc1_ja_jp.md\n    - twarc1 (pt): twarc1_pt_br.md\n    - twarc1 (sv): twarc1_sv_se.md\n    - twarc1 (sw): twarc1_sw_ke.md\n    - twarc1 (zw): twarc1_zw_zh.md\n  - Plugins: plugins.md\n  - Tutorial: tutorial.md\n  - Resources: resources.md\n  - Twitter Developer Access: twitter-developer-access.md \n  - Windows 10: windows10.md\n  - Library API:\n    - api/client.md\n    - api/client2.md\n    - api/library.md\n    - api/expansions.md\n\nplugins:\n- search\n- mkdocstrings\n\nmarkdown_extensions:\n- mkdocs-click\n- pymdownx.highlight\n- pymdownx.superfences\n"
  },
  {
    "path": "pyproject.toml",
    "content": "[project]\nname = \"twarc\"\nversion = \"2.14.1\"\ndescription = \"Archive tweets from the command line\"\nlicense = \"MIT\"\nreadme = \"README.md\"\nrequires-python = \">=3.13\"\ndependencies = [\n    \"click>=7,<9\",\n    \"click-config-file>=0.6\",\n    \"click-plugins>=1\",\n    \"humanize>=3.9\",\n    \"python-dateutil>=2.8\",\n    \"requests_oauthlib>=1.3\",\n    \"tqdm>=4.62\",\n    \"twarc-csv>=0.7.2\",\n]\n\n[dependency-groups]\ndev = [\n    \"black>=25.9.0\",\n    \"pytest>=8.4.2\",\n    \"pytest-black>=0.6.0\",\n    \"python-dotenv>=1.2.1\",\n    \"pytz>=2025.2\",\n    \"toml>=0.10.2\",\n]\n\n[project.scripts]\ntwarc = \"twarc.command:main\"\ntwarc2 = \"twarc.command2:twarc2\"\n\n[tool.pytest.ini_options]\naddopts = \"--verbose --black\"\n\n[tool.uv.workspace]\nmembers = [\n    \"tmp/twarc\",\n]\n\n[build-system]\nrequires = [\"uv_build>=0.8.3,<0.9.0\"]\nbuild-backend = \"uv_build\"\n"
  },
  {
    "path": "requirements-mkdocs.txt",
    "content": "click>=7,<9\nclick-config-file>=0.6\nclick-plugins>=1\nhumanize>=3.9\npython-dateutil>=2.8\nrequests_oauthlib>=1.3\ntqdm>=4.62\nmkdocs>=1.2\nmkdocs-click>=0.4\nmkdocs-material>=7.2\nmkdocstrings[python]>=0.15\n"
  },
  {
    "path": "setup.cfg",
    "content": "[tool:pytest]\naddopts=--verbose --black\n\n[aliases]\ntest=pytest\n"
  },
  {
    "path": "src/twarc/__init__.py",
    "content": "from .client import Twarc\nfrom .client2 import Twarc2\nfrom .version import version\nfrom .expansions import ensure_flattened\n"
  },
  {
    "path": "src/twarc/__main__.py",
    "content": "from twarc.command2 import twarc2\n\nif __name__ == \"__main__\":\n    twarc2(prog_name=\"python -m twarc2\")\n"
  },
  {
    "path": "src/twarc/client.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport os\nimport re\nimport sys\nimport json\nimport types\nimport logging\nimport datetime\nimport requests\n\nimport ssl\nfrom requests.exceptions import ConnectionError\nfrom requests.packages.urllib3.exceptions import ProtocolError\n\nfrom .decorators import *\nfrom twarc.version import version, user_agent\n\nfrom requests_oauthlib import OAuth1, OAuth1Session, OAuth2Session\nfrom oauthlib.oauth2 import BackendApplicationClient\n\nif sys.version_info[:2] <= (2, 7):\n    # Python 2\n    get_input = raw_input\n    str_type = unicode\n    import ConfigParser as configparser\n    from urlparse import parse_qs\nelse:\n    # Python 3\n    get_input = input\n    str_type = str\n    import configparser\n    from urllib.parse import parse_qs\n\nlog = logging.getLogger(\"twarc\")\n\n\nclass Twarc(object):\n    \"\"\"\n    Twarc allows you retrieve data from the Twitter API. Each method\n    is an iterator that runs to completion, and handles rate limiting so\n    that it will go to sleep when Twitter tells it to, and wake back up\n    when it is able to retrieve data from the API again.\n    \"\"\"\n\n    def __init__(\n        self,\n        consumer_key=None,\n        consumer_secret=None,\n        access_token=None,\n        access_token_secret=None,\n        connection_errors=0,\n        http_errors=0,\n        config=None,\n        profile=\"\",\n        protected=False,\n        tweet_mode=\"extended\",\n        app_auth=False,\n        validate_keys=True,\n        gnip_auth=False,\n        gnip_username=None,\n        gnip_password=None,\n        gnip_account=None,\n    ):\n        \"\"\"\n        Instantiate a Twarc instance. If keys aren't set we'll try to\n        discover them in the environment or a supplied profile. 
If no\n        profile is indicated the first section of the config files will\n        be used.\n        \"\"\"\n\n        self.api_version = \"1.1\"\n        self.consumer_key = consumer_key\n        self.consumer_secret = consumer_secret\n        self.access_token = access_token\n        self.access_token_secret = access_token_secret\n        self.connection_errors = connection_errors\n        self.http_errors = http_errors\n        self.profile = profile\n        self.client = None\n        self.last_response = None\n        self.tweet_mode = tweet_mode\n        self.protected = protected\n        self.app_auth = app_auth\n        self.gnip_auth = gnip_auth\n        self.gnip_username = gnip_username\n        self.gnip_password = gnip_password\n        self.gnip_account = gnip_account\n\n        if config:\n            self.config = config\n        else:\n            self.config = self.default_config()\n\n        self.get_keys()\n\n        if validate_keys:\n            self.validate_keys()\n\n    @filter_protected\n    def search(\n        self,\n        q,\n        max_id=None,\n        since_id=None,\n        lang=None,\n        result_type=\"recent\",\n        geocode=None,\n        max_pages=None,\n    ):\n        \"\"\"\n        Pass in a query with optional max_id, since_id, lang, geocode, or\n        max_pages, and get back an iterator for decoded tweets. Defaults to\n        recent (i.e. 
not mixed, the API default, or popular) tweets.\n        \"\"\"\n        url = \"https://api.twitter.com/1.1/search/tweets.json\"\n        params = {\n            \"count\": 100,\n            \"q\": q,\n            \"include_ext_alt_text\": \"true\",\n            \"include_ext_is_blue_verified\": \"true\",\n            \"include_entities\": \"true\",\n        }\n\n        if lang is not None:\n            params[\"lang\"] = lang\n        if geocode is not None:\n            params[\"geocode\"] = geocode\n        if since_id:\n            # Make the since_id inclusive, so we can avoid retrieving\n            # an empty page of results in some cases\n            params[\"since_id\"] = str(int(since_id) - 1)\n\n        if result_type in [\"mixed\", \"recent\", \"popular\"]:\n            params[\"result_type\"] = result_type\n        else:\n            params[\"result_type\"] = \"recent\"\n\n        retrieved_pages = 0\n        reached_end = False\n\n        while True:\n            # note: max_id changes as results are retrieved\n            if max_id:\n                params[\"max_id\"] = max_id\n\n            resp = self.get(url, params=params)\n\n            retrieved_pages += 1\n            statuses = resp.json()[\"statuses\"]\n\n            if len(statuses) == 0:\n                log.info(\"no new tweets matching %s\", params)\n                break\n\n            for status in statuses:\n                # We've certainly reached the end of new results\n                if since_id is not None and status[\"id_str\"] == str(since_id):\n                    reached_end = True\n                    break\n\n                yield status\n\n            if reached_end:\n                log.info(\"no new tweets matching %s\", params)\n                break\n\n            if max_pages is not None and retrieved_pages == max_pages:\n                log.info(\"reached max page limit for %s\", params)\n                break\n\n            max_id = str(int(status[\"id_str\"]) - 
1)\n\n    def premium_search(\n        self,\n        q,\n        product,\n        environment,\n        from_date=None,\n        to_date=None,\n        max_results=None,\n        sandbox=False,\n        limit=0,\n    ):\n        \"\"\"\n        Search using the Premium Search API. You will need to pass in a query,\n        a product (30day or fullarchive) and environment to use. Optionally\n        you can pass in a from_date and to_date to limit the search using\n        datetime objects. If you would like to set max_results you can, or\n        you can accept the maximum results (500). If using a sandbox\n        environment you will want to set sandbox=True to lower the max_results\n        to 100. The limit option will cause your search to finish after it has\n        returned that number of tweets (0 means no limit).\n        \"\"\"\n\n        if not self.app_auth and not self.gnip_auth:\n            raise RuntimeError(\n                \"This endpoint is only available with application authentication. 
\"\n                \"Pass app_auth=True in Python or --app-auth on the command line.\"\n            )\n\n        if from_date and not isinstance(from_date, datetime.date):\n            raise RuntimeError(\n                \"from_date must be a datetime.date or datetime.datetime object\"\n            )\n        if to_date and not isinstance(to_date, datetime.date):\n            raise RuntimeError(\n                \"to_date must be a datetime.date or datetime.datetime object\"\n            )\n\n        if product not in [\"30day\", \"gnip_fullarchive\", \"fullarchive\"]:\n            raise RuntimeError(\"Invalid Premium Search API product: {}\".format(product))\n\n        # set default max_results based on whether its sandboxed\n        if max_results is None:\n            if sandbox:\n                max_results = 100\n            else:\n                max_results = 500\n\n        if product == \"gnip_fullarchive\":\n            url = \"https://gnip-api.twitter.com/search/fullarchive/accounts/{}/{}.json\".format(\n                self.gnip_account, environment\n            )\n        else:\n            url = \"https://api.twitter.com/1.1/tweets/search/{}/{}.json\".format(\n                product, environment\n            )\n\n        params = {\n            \"query\": q,\n            \"fromDate\": from_date.strftime(\"%Y%m%d%H%M\") if from_date else None,\n            \"toDate\": to_date.strftime(\"%Y%m%d%H%M\") if to_date else None,\n            \"maxResults\": max_results,\n        }\n\n        count = 0\n        stop = False\n        while not stop:\n            resp = self.get(url, params=params)\n            if resp.status_code == 200:\n                data = resp.json()\n                for tweet in data[\"results\"]:\n                    count += 1\n                    yield tweet\n                    if limit != 0 and count >= limit:\n                        stop = True\n                        break\n                if \"next\" in data:\n                
    params[\"next\"] = data[\"next\"]\n                else:\n                    stop = True\n            elif resp.status_code == 422:\n                raise RuntimeError(\n                    \"Twitter API 422 response: are you using a premium search sandbox environment and forgot the --sandbox argument?\"\n                )\n\n    def timeline(\n        self, user_id=None, screen_name=None, max_id=None, since_id=None, max_pages=None\n    ):\n        \"\"\"\n        Returns a collection of the most recent tweets posted\n        by the user indicated by the user_id or screen_name parameter.\n        Provide a user_id or screen_name.\n        \"\"\"\n\n        if user_id and screen_name:\n            raise ValueError(\"only user_id or screen_name may be passed\")\n\n        # Strip if screen_name is prefixed with '@'\n        if screen_name:\n            screen_name = screen_name.lstrip(\"@\")\n        id = screen_name or str(user_id)\n        id_type = \"screen_name\" if screen_name else \"user_id\"\n        log.info(\"starting user timeline for user %s\", id)\n\n        if screen_name or user_id:\n            url = \"https://api.twitter.com/1.1/statuses/user_timeline.json\"\n        else:\n            url = \"https://api.twitter.com/1.1/statuses/home_timeline.json\"\n\n        params = {\n            \"count\": 200,\n            id_type: id,\n            \"include_ext_alt_text\": \"true\",\n            \"include_ext_is_blue_verified\": \"true\",\n        }\n\n        retrieved_pages = 0\n        reached_end = False\n\n        while True:\n            if since_id:\n                # Make the since_id inclusive, so we can avoid retrieving\n                # an empty page of results in some cases\n                params[\"since_id\"] = str(int(since_id) - 1)\n            if max_id:\n                params[\"max_id\"] = max_id\n\n            try:\n                resp = self.get(url, params=params, allow_404=True)\n                retrieved_pages += 1\n            
except requests.exceptions.HTTPError as e:\n                if e.response.status_code == 404:\n                    log.warning(\"no timeline available for %s\", id)\n                    break\n                elif e.response.status_code == 401:\n                    log.warning(\"protected account %s\", id)\n                    break\n                raise e\n\n            statuses = resp.json()\n\n            if len(statuses) == 0:\n                log.info(\"no new tweets matching %s\", params)\n                break\n\n            for status in statuses:\n                # We've certainly reached the end of new results\n                if since_id is not None and status[\"id_str\"] == str(since_id):\n                    reached_end = True\n                    break\n                # If you request an invalid user_id, you may still get\n                # results so need to check.\n                if not user_id or id == status.get(\"user\", {}).get(\"id_str\"):\n                    yield status\n\n            if reached_end:\n                log.info(\"no new tweets matching %s\", params)\n                break\n\n            if max_pages is not None and retrieved_pages == max_pages:\n                log.info(\"reached max page limit for %s\", params)\n                break\n\n            max_id = str(int(status[\"id_str\"]) - 1)\n\n    def user_lookup(self, ids, id_type=\"user_id\"):\n        \"\"\"\n        A generator that returns users for supplied iterator of user ids or screen_names.\n        Use the id_type to indicate which you are supplying (user_id or screen_name).\n        \"\"\"\n\n        if isinstance(ids, str):\n            raise TypeError(\"ids must be an iterable other than a string\")\n\n        if id_type not in [\"user_id\", \"screen_name\"]:\n            raise RuntimeError(\"id_type must be user_id or screen_name\")\n\n        if not isinstance(ids, types.GeneratorType):\n            ids = iter(ids)\n\n        # TODO: this is similar to hydrate, 
maybe they could share code?\n\n        lookup_ids = []\n\n        def do_lookup():\n            ids_str = \",\".join(lookup_ids)\n            log.info(\"looking up users %s\", ids_str)\n            url = \"https://api.twitter.com/1.1/users/lookup.json\"\n            params = {\n                id_type: ids_str,\n                \"include_ext_is_blue_verified\": \"true\",\n            }\n            try:\n                resp = self.get(url, params=params, allow_404=True)\n            except requests.exceptions.HTTPError as e:\n                if e.response.status_code == 404:\n                    log.warning(\"no users matching %s\", ids_str)\n                raise e\n            return resp.json()\n\n        for id in ids:\n            lookup_ids.append(str(id).strip())\n            if len(lookup_ids) == 100:\n                for u in do_lookup():\n                    yield u\n                lookup_ids = []\n\n        if len(lookup_ids) > 0:\n            for u in do_lookup():\n                yield u\n\n    def follower_ids(self, user, max_pages=None):\n        \"\"\"\n        Returns Twitter user id lists for the specified user's followers.\n        A user can be a specific using their screen_name or user_id\n        \"\"\"\n        user = str(user)\n        user = user.lstrip(\"@\")\n        url = \"https://api.twitter.com/1.1/followers/ids.json\"\n\n        if re.match(r\"^\\d+$\", user):\n            params = {\"user_id\": user, \"cursor\": -1}\n        else:\n            params = {\"screen_name\": user, \"cursor\": -1}\n\n        retrieved_pages = 0\n\n        while params[\"cursor\"] != 0:\n            try:\n                resp = self.get(url, params=params, allow_404=True)\n                retrieved_pages += 1\n            except requests.exceptions.HTTPError as e:\n                if e.response.status_code == 404:\n                    log.info(\"no users matching %s\", user)\n                raise e\n            user_ids = resp.json()\n            for 
user_id in user_ids[\"ids\"]:\n                yield str_type(user_id)\n            params[\"cursor\"] = user_ids[\"next_cursor\"]\n\n            if max_pages is not None and retrieved_pages == max_pages:\n                log.info(\"reached max follower page limit for %s\", params)\n                break\n\n    def friend_ids(self, user, max_pages=None):\n        \"\"\"\n        Returns Twitter user id lists for the specified user's friend. A user\n        can be specified using their screen_name or user_id.\n        \"\"\"\n        user = str(user)\n        user = user.lstrip(\"@\")\n        url = \"https://api.twitter.com/1.1/friends/ids.json\"\n\n        if re.match(r\"^\\d+$\", user):\n            params = {\"user_id\": user, \"cursor\": -1}\n        else:\n            params = {\"screen_name\": user, \"cursor\": -1}\n\n        retrieved_pages = 0\n\n        while params[\"cursor\"] != 0:\n            try:\n                resp = self.get(url, params=params, allow_404=True)\n                retrieved_pages += 1\n            except requests.exceptions.HTTPError as e:\n                if e.response.status_code == 404:\n                    log.error(\"no users matching %s\", user)\n                raise e\n\n            user_ids = resp.json()\n            for user_id in user_ids[\"ids\"]:\n                yield str_type(user_id)\n            params[\"cursor\"] = user_ids[\"next_cursor\"]\n\n            if max_pages is not None and retrieved_pages == max_pages:\n                log.info(\"reached max friend page limit for %s\", params)\n                break\n\n    @filter_protected\n    def filter(\n        self,\n        track=None,\n        follow=None,\n        locations=None,\n        lang=[],\n        event=None,\n        record_keepalive=False,\n    ):\n        \"\"\"\n        Returns an iterator for tweets that match a given filter track from\n        the livestream of tweets happening right now.\n\n        If a threading.Event is provided for event and the 
event is set,\n        the filter will be interrupted.\n        \"\"\"\n        if locations is not None:\n            if type(locations) == list:\n                locations = \",\".join(locations)\n            locations = locations.replace(\"\\\\\", \"\")\n\n        url = \"https://stream.twitter.com/1.1/statuses/filter.json\"\n        params = {\n            \"stall_warning\": True,\n            \"include_ext_alt_text\": True,\n            \"include_ext_is_blue_verified\": \"true\",\n        }\n        if track:\n            params[\"track\"] = track\n        if follow:\n            params[\"follow\"] = follow\n        if locations:\n            params[\"locations\"] = locations\n        if lang:\n            # should be a list, but just in case\n            if isinstance(lang, list):\n                params[\"language\"] = \",\".join(lang)\n            else:\n                params[\"language\"] = lang\n        headers = {\"accept-encoding\": \"deflate, gzip\"}\n        errors = 0\n        while True:\n            try:\n                log.info(\"connecting to filter stream for %s\", params)\n                resp = self.post(url, params, headers=headers, stream=True)\n                errors = 0\n                for line in resp.iter_lines(chunk_size=1024):\n                    if event and event.is_set():\n                        log.info(\"stopping filter\")\n                        # Explicitly close response\n                        resp.close()\n                        return\n                    if not line:\n                        log.info(\"keep-alive\")\n                        if record_keepalive:\n                            yield \"keep-alive\"\n                        continue\n                    try:\n                        yield json.loads(line.decode())\n                    except Exception as e:\n                        log.error(\"json parse error: %s - %s\", e, line)\n            except requests.exceptions.HTTPError as e:\n                
errors += 1\n                log.error(\"caught http error %s on %s try\", e, errors)\n                if self.http_errors and errors == self.http_errors:\n                    log.warning(\"too many errors\")\n                    raise e\n                if e.response.status_code == 420:\n                    if interruptible_sleep(errors * 60, event):\n                        log.info(\"stopping filter\")\n                        return\n                else:\n                    if interruptible_sleep(errors * 5, event):\n                        log.info(\"stopping filter\")\n                        return\n            except Exception as e:\n                errors += 1\n                log.error(\"caught exception %s on %s try\", e, errors)\n                if self.http_errors and errors == self.http_errors:\n                    log.warning(\"too many exceptions\")\n                    raise e\n                log.error(e)\n                if interruptible_sleep(errors, event):\n                    log.info(\"stopping filter\")\n                    return\n\n    def sample(self, event=None, record_keepalive=False):\n        \"\"\"\n        Returns a small random sample of all public statuses. 
The Tweets\n        returned by the default access level are the same, so if two different\n        clients connect to this endpoint, they will see the same Tweets.\n\n        If a threading.Event is provided for event and the event is set,\n        the sample will be interrupted.\n        \"\"\"\n        url = \"https://stream.twitter.com/1.1/statuses/sample.json\"\n        params = {\"stall_warning\": True}\n        headers = {\"accept-encoding\": \"deflate, gzip\"}\n        errors = 0\n        while True:\n            try:\n                log.info(\"connecting to sample stream\")\n                resp = self.post(url, params, headers=headers, stream=True)\n                errors = 0\n                for line in resp.iter_lines(chunk_size=512):\n                    if event and event.is_set():\n                        log.info(\"stopping sample\")\n                        # Explicitly close response\n                        resp.close()\n                        return\n                    # iter_lines yields bytes, so test for an empty line by truthiness\n                    if not line:\n                        log.info(\"keep-alive\")\n                        if record_keepalive:\n                            yield \"keep-alive\"\n                        continue\n                    try:\n                        yield json.loads(line.decode())\n                    except Exception as e:\n                        log.error(\"json parse error: %s - %s\", e, line)\n            except requests.exceptions.HTTPError as e:\n                errors += 1\n                log.error(\"caught http error %s on %s try\", e, errors)\n                if self.http_errors and errors == self.http_errors:\n                    log.warning(\"too many errors\")\n                    raise e\n                if e.response.status_code == 420:\n                    if interruptible_sleep(errors * 60, event):\n                        log.info(\"stopping filter\")\n                        return\n                else:\n                    if interruptible_sleep(errors * 5, 
event):\n                        log.info(\"stopping filter\")\n                        return\n\n            except Exception as e:\n                errors += 1\n                log.error(\"caught exception %s on %s try\", e, errors)\n                if self.http_errors and errors == self.http_errors:\n                    log.warning(\"too many errors\")\n                    raise e\n                if interruptible_sleep(errors, event):\n                    log.info(\"stopping filter\")\n                    return\n\n    def dehydrate(self, iterator):\n        \"\"\"\n        Pass in an iterator of tweets' JSON and get back an iterator of the\n        IDs of each tweet.\n        \"\"\"\n        for line in iterator:\n            try:\n                yield json.loads(line)[\"id_str\"]\n            except Exception as e:\n                log.error(\"uhoh: %s\\n\" % e)\n\n    def hydrate(self, iterator, trim_user=False):\n        \"\"\"\n        Pass in an iterator of tweet ids and get back an iterator for the\n        decoded JSON for each corresponding tweet.\n        \"\"\"\n        ids = []\n        url = \"https://api.twitter.com/1.1/statuses/lookup.json\"\n\n        # lookup 100 tweets at a time\n        for tweet_id in iterator:\n            tweet_id = str(tweet_id)\n            tweet_id = tweet_id.strip()  # remove new line if present\n            ids.append(tweet_id)\n            if len(ids) == 100:\n                log.info(\"hydrating %s ids\", len(ids))\n                resp = self.post(\n                    url,\n                    data={\n                        \"id\": \",\".join(ids),\n                        \"include_ext_alt_text\": \"true\",\n                        \"include_ext_is_blue_verified\": \"true\",\n                        \"include_entities\": \"true\",\n                        \"trim_user\": trim_user,\n                    },\n                )\n                tweets = resp.json()\n                tweets.sort(key=lambda t: 
t[\"id_str\"])\n                for tweet in tweets:\n                    yield tweet\n                ids = []\n\n        # hydrate any remaining ones\n        if len(ids) > 0:\n            log.info(\"hydrating %s\", ids)\n            resp = self.post(\n                url,\n                data={\n                    \"id\": \",\".join(ids),\n                    \"include_ext_alt_text\": \"true\",\n                    \"include_ext_is_blue_verified\": \"true\",\n                    \"include_entities\": \"true\",\n                    \"trim_user\": trim_user,\n                },\n            )\n            for tweet in resp.json():\n                yield tweet\n\n    def tweet(self, tweet_id):\n        try:\n            return next(self.hydrate([tweet_id]))\n        except StopIteration:\n            return []\n\n    def retweets(self, tweet_ids):\n        \"\"\"\n        Retrieves up to the last 100 retweets for the provided iterator of tweet_ids.\n        \"\"\"\n        if not isinstance(tweet_ids, types.GeneratorType):\n            tweet_ids = iter(tweet_ids)\n\n        for tweet_id in tweet_ids:\n            if hasattr(tweet_id, \"strip\"):\n                tweet_id = tweet_id.strip()\n            log.info(\"retrieving retweets of %s\", tweet_id)\n            url = \"https://api.twitter.com/1.1/statuses/retweets/\" \"{}.json\".format(\n                tweet_id\n            )\n            try:\n                resp = self.get(url, params={\"count\": 100}, allow_404=True)\n                for tweet in resp.json():\n                    yield tweet\n            except requests.exceptions.HTTPError as e:\n                if e.response.status_code == 404:\n                    log.info(\"can't get tweets for non-existent tweet: %s\", tweet_id)\n\n    def trends_available(self):\n        \"\"\"\n        Returns a list of regions for which Twitter tracks trends.\n        \"\"\"\n        url = \"https://api.twitter.com/1.1/trends/available.json\"\n        try:\n       
     resp = self.get(url)\n        except requests.exceptions.HTTPError as e:\n            raise e\n        return resp.json()\n\n    def trends_place(self, woeid, exclude=None):\n        \"\"\"\n        Returns recent Twitter trends for the specified WOEID. If\n        exclude == 'hashtags', Twitter will remove hashtag trends from the\n        response.\n        \"\"\"\n        url = \"https://api.twitter.com/1.1/trends/place.json\"\n        params = {\"id\": woeid}\n        if exclude:\n            params[\"exclude\"] = exclude\n        try:\n            resp = self.get(url, params=params, allow_404=True)\n        except requests.exceptions.HTTPError as e:\n            if e.response.status_code == 404:\n                log.info(\"no region matching WOEID %s\", woeid)\n            raise e\n        return resp.json()\n\n    def trends_closest(self, lat, lon):\n        \"\"\"\n        Returns the closest regions for the supplied lat/lon.\n        \"\"\"\n        url = \"https://api.twitter.com/1.1/trends/closest.json\"\n        params = {\"lat\": lat, \"long\": lon}\n        try:\n            resp = self.get(url, params=params)\n        except requests.exceptions.HTTPError as e:\n            raise e\n        return resp.json()\n\n    def replies(self, tweet, recursive=False, prune=()):\n        \"\"\"\n        replies returns a generator of tweets that are replies for a given\n        tweet. It includes the original tweet. If you would like to fetch the\n        replies to the replies use recursive=True which will do a depth-first\n        recursive walk of the replies. It also walk up the reply chain if you\n        supply a tweet that is itself a reply to another tweet. 
You can\n        optionally supply a tuple of tweet ids to ignore during this traversal\n        using the prune parameter.\n        \"\"\"\n\n        yield tweet\n\n        # get replies to the tweet\n        screen_name = tweet[\"user\"][\"screen_name\"]\n        tweet_id = tweet[\"id_str\"]\n        log.info(\"looking for replies to: %s\", tweet_id)\n        for reply in self.search(\"to:%s\" % screen_name, since_id=tweet_id):\n            if reply[\"in_reply_to_status_id_str\"] != tweet_id:\n                continue\n\n            if reply[\"id_str\"] in prune:\n                log.info(\"ignoring pruned tweet id %s\", reply[\"id_str\"])\n                continue\n\n            log.info(\"found reply: %s\", reply[\"id_str\"])\n\n            if recursive:\n                if reply[\"id_str\"] not in prune:\n                    prune = prune + (tweet_id,)\n                    for r in self.replies(reply, recursive, prune):\n                        yield r\n            else:\n                yield reply\n\n        # if this tweet is itself a reply to another tweet get it and\n        # get other potential replies to it\n\n        reply_to_id = tweet.get(\"in_reply_to_status_id_str\")\n        log.info(\"prune=%s\", prune)\n        if recursive and reply_to_id and reply_to_id not in prune:\n            t = self.tweet(reply_to_id)\n            if t:\n                log.info(\"found reply-to: %s\", t[\"id_str\"])\n                prune = prune + (tweet[\"id_str\"],)\n                for r in self.replies(t, recursive=True, prune=prune):\n                    yield r\n\n        # if this tweet is a quote go get that too whatever tweets it\n        # may be in reply to\n\n        quote_id = tweet.get(\"quoted_status_id_str\")\n        if recursive and quote_id and quote_id not in prune:\n            t = self.tweet(quote_id)\n            if t:\n                log.info(\"found quote: %s\", t[\"id_str\"])\n                prune = prune + (tweet[\"id_str\"],)\n            
    for r in self.replies(t, recursive=True, prune=prune):\n                    yield r\n\n    def list_members(\n        self, list_id=None, slug=None, owner_screen_name=None, owner_id=None\n    ):\n        \"\"\"\n        Returns the members of a list.\n\n        List id or (slug and (owner_screen_name or owner_id)) are required\n        \"\"\"\n        assert list_id or (slug and (owner_screen_name or owner_id))\n        url = \"https://api.twitter.com/1.1/lists/members.json\"\n        params = {\"cursor\": -1}\n        if list_id:\n            params[\"list_id\"] = list_id\n        else:\n            params[\"slug\"] = slug\n            if owner_screen_name:\n                params[\"owner_screen_name\"] = owner_screen_name\n            else:\n                params[\"owner_id\"] = owner_id\n\n        while params[\"cursor\"] != 0:\n            try:\n                resp = self.get(url, params=params, allow_404=True)\n            except requests.exceptions.HTTPError as e:\n                if e.response.status_code == 404:\n                    log.error(\"no matching list\")\n                raise e\n\n            users = resp.json()\n            for user in users[\"users\"]:\n                yield user\n            params[\"cursor\"] = users[\"next_cursor\"]\n\n    def oembed(self, tweet_url, **params):\n        \"\"\"\n        Returns the oEmbed JSON for a tweet. The JSON includes an html\n        key that contains the HTML for the embed. You can pass in\n        parameters that correspond to the parameters that Twitter's\n        statuses/oembed endpoint supports. 
For example:\n\n        o = client.oembed('https://twitter.com/biz/status/21', theme='dark')\n        \"\"\"\n        log.info(\"generating embedding for tweet %s\", tweet_url)\n        url = \"https://publish.twitter.com/oembed\"\n\n        params[\"url\"] = tweet_url\n        resp = self.get(url, params=params)\n\n        return resp.json()\n\n    @rate_limit\n    @catch_conn_reset\n    @catch_timeout\n    @catch_gzip_errors\n    def get(self, *args, **kwargs):\n        if not self.client:\n            self.connect()\n\n        # set default tweet_mode; only used for non-premium/non-gnip endpoints\n        if self.is_standard_v1(args[0]):\n            if \"params\" not in kwargs:\n                kwargs[\"params\"] = {\"tweet_mode\": self.tweet_mode}\n            else:\n                kwargs[\"params\"][\"tweet_mode\"] = self.tweet_mode\n\n        # Pass allow 404 to not retry on 404\n        allow_404 = kwargs.pop(\"allow_404\", False)\n        connection_error_count = kwargs.pop(\"connection_error_count\", 0)\n        try:\n            log.info(\"getting %s %s\", args, kwargs)\n            r = self.last_response = self.client.get(\n                *args, timeout=(3.05, 31), **kwargs\n            )\n            # this has been noticed, believe it or not\n            # https://github.com/edsu/twarc/issues/75\n            if r.status_code == 404 and not allow_404:\n                log.warning(\"404 from Twitter API! 
trying again\")\n                time.sleep(1)\n                r = self.get(*args, **kwargs)\n            return r\n        except (ssl.SSLError, ConnectionError, ProtocolError) as e:\n            connection_error_count += 1\n            log.error(\"caught connection error %s on %s try\", e, connection_error_count)\n            if (\n                self.connection_errors\n                and connection_error_count == self.connection_errors\n            ):\n                log.error(\"received too many connection errors\")\n                raise e\n            else:\n                self.connect()\n                kwargs[\"connection_error_count\"] = connection_error_count\n                kwargs[\"allow_404\"] = allow_404\n                return self.get(*args, **kwargs)\n\n    @rate_limit\n    @catch_conn_reset\n    @catch_timeout\n    @catch_gzip_errors\n    def post(self, *args, **kwargs):\n        if not self.client:\n            self.connect()\n\n        if \"data\" in kwargs:\n            kwargs[\"data\"][\"tweet_mode\"] = self.tweet_mode\n\n        connection_error_count = kwargs.pop(\"connection_error_count\", 0)\n        try:\n            log.info(\"posting %s %s\", args, kwargs)\n            self.last_response = self.client.post(*args, timeout=(3.05, 31), **kwargs)\n            return self.last_response\n        except (ssl.SSLError, ConnectionError, ProtocolError) as e:\n            connection_error_count += 1\n            log.error(\"caught connection error %s on %s try\", e, connection_error_count)\n            if (\n                self.connection_errors\n                and connection_error_count == self.connection_errors\n            ):\n                log.error(\"received too many connection errors\")\n                raise e\n            else:\n                self.connect()\n                kwargs[\"connection_error_count\"] = connection_error_count\n                return self.post(*args, **kwargs)\n\n    @catch_timeout\n    def 
connect(self):\n        \"\"\"\n        Sets up the HTTP session to talk to Twitter. If one is active it is\n        closed and another one is opened.\n        \"\"\"\n        if self.gnip_auth and not (\n            self.gnip_username and self.gnip_password and self.gnip_account\n        ):\n            raise RuntimeError(\"MissingKeys\")\n        elif not self.gnip_auth and not (\n            self.consumer_key\n            and self.consumer_secret\n            and self.access_token\n            and self.access_token_secret\n        ):\n            raise RuntimeError(\"MissingKeys\")\n\n        if self.client:\n            log.info(\"closing existing http session\")\n            self.client.close()\n        if self.last_response:\n            log.info(\"closing last response\")\n            self.last_response.close()\n        log.info(\"creating http session\")\n\n        if self.gnip_auth:\n            logging.info(\"creating basic user authentication for gnip\")\n            s = requests.Session()\n            s.auth = (self.gnip_username, self.gnip_password)\n            self.client = s\n        elif not self.app_auth:\n            logging.info(\"creating OAuth1 user authentication\")\n            self.client = OAuth1Session(\n                client_key=self.consumer_key,\n                client_secret=self.consumer_secret,\n                resource_owner_key=self.access_token,\n                resource_owner_secret=self.access_token_secret,\n            )\n        else:\n            logging.info(\"creating OAuth2 app authentication\")\n            client = BackendApplicationClient(client_id=self.consumer_key)\n            oauth = OAuth2Session(client=client)\n            token = oauth.fetch_token(\n                token_url=\"https://api.twitter.com/oauth2/token\",\n                client_id=self.consumer_key,\n                client_secret=self.consumer_secret,\n            )\n            self.client = oauth\n\n        if self.client:\n            
self.client.headers.update({\"User-Agent\": user_agent})\n\n    def get_keys(self):\n        \"\"\"\n        Get the Twitter API keys. Order of precedence is command line,\n        environment, config file. Return True if all the keys were found\n        and False if not.\n        \"\"\"\n        env = os.environ.get\n        if not self.consumer_key:\n            self.consumer_key = env(\"CONSUMER_KEY\")\n        if not self.consumer_secret:\n            self.consumer_secret = env(\"CONSUMER_SECRET\")\n        if not self.access_token:\n            self.access_token = env(\"ACCESS_TOKEN\")\n        if not self.access_token_secret:\n            self.access_token_secret = env(\"ACCESS_TOKEN_SECRET\")\n        if not self.gnip_username:\n            self.gnip_username = env(\"GNIP_USERNAME\")\n        if not self.gnip_password:\n            self.gnip_password = env(\"GNIP_PASSWORD\")\n        if not self.gnip_account:\n            self.gnip_account = env(\"GNIP_ACCOUNT\")\n\n        if self.config:\n            if self.gnip_auth and not (\n                self.gnip_username and self.gnip_password and self.gnip_account\n            ):\n                self.load_config()\n            elif not self.gnip_auth and not (\n                self.consumer_key\n                and self.consumer_secret\n                and self.access_token\n                and self.access_token_secret\n            ):\n                self.load_config()\n\n    def validate_keys(self):\n        \"\"\"\n        Validate the keys provided are authentic credentials.\n        \"\"\"\n        if self.gnip_auth:\n            url = \"https://gnip-api.twitter.com/metrics/usage/accounts/{}.json\".format(\n                self.gnip_account\n            )\n\n            keys_present = (\n                self.gnip_account and self.gnip_username and self.gnip_password\n            )\n        elif self.app_auth:\n            # no need to validate keys when using OAuth2 App Auth.\n            return True\n      
  else:\n            url = \"https://api.twitter.com/1.1/account/verify_credentials.json\"\n\n            keys_present = (\n                self.consumer_key\n                and self.consumer_secret\n                and self.access_token\n                and self.access_token_secret\n            )\n\n        if keys_present:\n            try:\n                # Need to explicitly reconnect to confirm the current creds\n                # are used in the session object.\n                self.connect()\n                self.get(url)\n                return True\n            except requests.HTTPError as e:\n                if e.response.status_code == 401:\n                    raise RuntimeError(\"Invalid credentials provided.\")\n                else:\n                    raise e\n        else:\n            print(\"Incomplete credentials provided.\")\n            print('Please run the command \"twarc configure\" to get started.')\n            sys.exit()\n\n    def load_config(self):\n        path = self.config\n        profile = self.profile\n        log.info(\"loading %s profile from config %s\", profile, path)\n\n        if not path or not os.path.isfile(path):\n            return {}\n\n        config = configparser.ConfigParser()\n        config.read(self.config)\n\n        if len(config.sections()) >= 1 and not profile:\n            profile = config.sections()[0]\n\n        data = {}\n        keys = (\n            [\"gnip_username\", \"gnip_password\", \"gnip_account\"]\n            if self.gnip_auth\n            else [\n                \"access_token\",\n                \"access_token_secret\",\n                \"consumer_key\",\n                \"consumer_secret\",\n            ]\n        )\n        for key in keys:\n            try:\n                setattr(self, key, config.get(profile, key))\n            except configparser.NoSectionError:\n                sys.exit(\"no such profile %s in %s\" % (profile, path))\n            except 
configparser.NoOptionError:\n                sys.exit(\"missing %s from profile %s in %s\" % (key, profile, path))\n        return data\n\n    def save_config(self, profile):\n        if not self.config:\n            return\n        config = configparser.ConfigParser()\n        config.read(self.config)\n\n        if config.has_section(profile):\n            config.remove_section(profile)\n\n        config.add_section(profile)\n        if self.gnip_auth:\n            config.set(profile, \"gnip_username\", self.gnip_username)\n            config.set(profile, \"gnip_password\", self.gnip_password)\n            config.set(profile, \"gnip_account\", self.gnip_account)\n        else:\n            config.set(profile, \"consumer_key\", self.consumer_key)\n            config.set(profile, \"consumer_secret\", self.consumer_secret)\n            config.set(profile, \"access_token\", self.access_token)\n            config.set(profile, \"access_token_secret\", self.access_token_secret)\n        with open(self.config, \"w\") as config_file:\n            config.write(config_file)\n\n        return config\n\n    def configure(self):\n        print(\n            \"\\nTwarc needs to know a few things before it can talk to Twitter on your behalf.\\n\"\n        )\n\n        reuse = False\n        if self.consumer_key and self.consumer_secret:\n            print(\n                \"You already have these application keys in your config %s\\n\"\n                % self.config\n            )\n            print(\"consumer key: %s\" % self.consumer_key)\n            print(\"consumer secret: %s\" % self.consumer_secret)\n            reuse = get_input(\n                \"\\nWould you like to use those for your new profile? 
[y/n] \"\n            )\n            reuse = reuse.lower() == \"y\"\n\n        if not reuse:\n            print(\n                \"\\nPlease enter your Twitter application credentials from apps.twitter.com:\\n\"\n            )\n\n            self.consumer_key = get_input(\"consumer key: \")\n            self.consumer_secret = get_input(\"consumer secret: \")\n\n        answered = False\n        while not answered:\n            print(\n                \"\\nHow would you like twarc to obtain your user keys?\\n\\n1) generate access keys by visiting Twitter\\n2) manually enter your access token and secret\\n\"\n            )\n            answer = get_input(\"Please enter your choice [1/2] \")\n            if answer == \"1\":\n                answered = True\n                generate = True\n            elif answer == \"2\":\n                answered = True\n                generate = False\n\n        if generate:\n            request_token_url = \"https://api.twitter.com/oauth/request_token\"\n            oauth = OAuth1(self.consumer_key, client_secret=self.consumer_secret)\n            r = requests.post(url=request_token_url, auth=oauth)\n\n            credentials = parse_qs(r.text)\n            if not credentials:\n                print(\"\\nError: invalid credentials.\")\n                print(\n                    \"Please check that you are copying and pasting correctly and try again.\\n\"\n                )\n                return\n\n            resource_owner_key = credentials.get(\"oauth_token\")[0]\n            resource_owner_secret = credentials.get(\"oauth_token_secret\")[0]\n\n            base_authorization_url = \"https://api.twitter.com/oauth/authorize\"\n            authorize_url = (\n                base_authorization_url + \"?oauth_token=\" + resource_owner_key\n            )\n            print(\n                \"\\nPlease log into Twitter and visit this URL in your browser:\\n%s\"\n                % authorize_url\n            )\n            verifier 
= get_input(\n                \"\\nAfter you have authorized the application please enter the displayed PIN: \"\n            )\n\n            access_token_url = \"https://api.twitter.com/oauth/access_token\"\n            oauth = OAuth1(\n                self.consumer_key,\n                client_secret=self.consumer_secret,\n                resource_owner_key=resource_owner_key,\n                resource_owner_secret=resource_owner_secret,\n                verifier=verifier,\n            )\n            r = requests.post(url=access_token_url, auth=oauth)\n            credentials = parse_qs(r.text)\n\n            if not credentials:\n                print(\"\\nError: invalid PIN\")\n                print(\n                    \"Please check that you entered the PIN correctly and try again.\\n\"\n                )\n                return\n\n            self.access_token = resource_owner_key = credentials.get(\"oauth_token\")[0]\n            self.access_token_secret = credentials.get(\"oauth_token_secret\")[0]\n\n            screen_name = credentials.get(\"screen_name\")[0]\n        else:\n            self.access_token = get_input(\"Enter your Access Token: \")\n            self.access_token_secret = get_input(\"Enter your Access Token Secret: \")\n            screen_name = \"default\"\n\n        config = self.save_config(screen_name)\n        print(\n            \"\\nThe credentials for %s have been saved to your configuration file at %s\"\n            % (screen_name, self.config)\n        )\n        print(\"\\n✨ ✨ ✨  Happy twarcing! 
✨ ✨ ✨\\n\")\n\n        if len(config.sections()) > 1:\n            print(\n                \"Note: you have multiple profiles in %s so in order to use %s you will use --profile\\n\"\n                % (self.config, screen_name)\n            )\n\n    def default_config(self):\n        return os.path.join(os.path.expanduser(\"~\"), \".twarc\")\n\n    def is_standard_v1(self, url):\n        result = True\n        if url.startswith(\"https://gnip-api.twitter.com\"):\n            result = False\n        elif url.startswith(\"https://api.twitter.com/1.1/tweets/search/30day\"):\n            result = False\n        elif url.startswith(\"https://api.twitter.com/1.1/tweets/search/fullarchive\"):\n            result = False\n        return result\n"
  },
  {
    "path": "src/twarc/client2.py",
    "content": "# -*- coding: utf-8 -*-\n\n\"\"\"\nSupport for the Twitter v2 API.\n\"\"\"\n\nimport re\nimport json\nimport time\nimport logging\nimport datetime\nimport requests\n\nfrom oauthlib.oauth2 import BackendApplicationClient\nfrom requests_oauthlib import OAuth1Session, OAuth2Session\n\nfrom twarc.expansions import (\n    EXPANSIONS,\n    TWEET_FIELDS,\n    USER_FIELDS,\n    MEDIA_FIELDS,\n    POLL_FIELDS,\n    PLACE_FIELDS,\n    LIST_FIELDS,\n)\nfrom twarc.decorators2 import *\nfrom twarc.version import version, user_agent\n\n\nlog = logging.getLogger(\"twarc\")\n\n\nclass Twarc2:\n    \"\"\"\n    A client for the Twitter v2 API.\n    \"\"\"\n\n    def __init__(\n        self,\n        consumer_key=None,\n        consumer_secret=None,\n        access_token=None,\n        access_token_secret=None,\n        bearer_token=None,\n        connection_errors=0,\n        metadata=True,\n    ):\n        \"\"\"\n        Instantiate a Twarc2 instance to talk to the Twitter V2+ API.\n\n        The client can use either App or User authentication, but only one at a\n        time. Whether app auth or user auth is used depends on which credentials\n        are provided on initialisation:\n\n        1. If a `bearer_token` is passed, app auth is always used.\n        2. If a `consumer_key` and `consumer_secret` are passed without an\n        `access_token` and `access_token_secret`, app auth is used.\n        3. 
If `consumer_key`, `consumer_secret`, `access_token` and\n        `access_token_secret` are all passed, then user authentication\n        is used instead.\n\n        Args:\n            consumer_key (str):\n                The API key.\n            consumer_secret (str):\n                The API secret.\n            access_token (str):\n                The Access Token\n            access_token_secret (str):\n                The Access Token Secret\n            bearer_token (str):\n                Bearer Token, can be generated from API keys.\n            connection_errors (int):\n                Number of retries for GETs\n            metadata (bool):\n                Append `__twarc` metadata to results.\n        \"\"\"\n        self.api_version = \"2\"\n        self.connection_errors = connection_errors\n        self.metadata = metadata\n        self.bearer_token = None\n\n        if bearer_token:\n            self.bearer_token = bearer_token\n            self.auth_type = \"application\"\n\n        elif consumer_key and consumer_secret:\n            if access_token and access_token_secret:\n                self.consumer_key = consumer_key\n                self.consumer_secret = consumer_secret\n                self.access_token = access_token\n                self.access_token_secret = access_token_secret\n                self.auth_type = \"user\"\n\n            else:\n                self.consumer_key = consumer_key\n                self.consumer_secret = consumer_secret\n                self.auth_type = \"application\"\n\n        else:\n            raise ValueError(\n                \"Must pass either a bearer_token or consumer/access_token keys and secrets\"\n            )\n\n        self.client = None\n        self.last_response = None\n\n        self.connect()\n\n    def _prepare_params(self, **kwargs):\n        \"\"\"\n        Prepare URL parameters and defaults for fields and expansions and others\n        \"\"\"\n        params = {}\n\n        # Defaults 
for fields and expansions\n        if \"expansions\" in kwargs:\n            params[\"expansions\"] = (\n                kwargs.pop(\"expansions\")\n                if kwargs[\"expansions\"]\n                else \",\".join(EXPANSIONS)\n            )\n\n        if \"tweet_fields\" in kwargs:\n            params[\"tweet.fields\"] = (\n                kwargs.pop(\"tweet_fields\")\n                if kwargs[\"tweet_fields\"]\n                else \",\".join(TWEET_FIELDS)\n            )\n\n        if \"user_fields\" in kwargs:\n            params[\"user.fields\"] = (\n                kwargs.pop(\"user_fields\")\n                if kwargs[\"user_fields\"]\n                else \",\".join(USER_FIELDS)\n            )\n\n        if \"media_fields\" in kwargs:\n            params[\"media.fields\"] = (\n                kwargs.pop(\"media_fields\")\n                if kwargs[\"media_fields\"]\n                else \",\".join(MEDIA_FIELDS)\n            )\n\n        if \"poll_fields\" in kwargs:\n            params[\"poll.fields\"] = (\n                kwargs.pop(\"poll_fields\")\n                if kwargs[\"poll_fields\"]\n                else \",\".join(POLL_FIELDS)\n            )\n\n        if \"place_fields\" in kwargs:\n            params[\"place.fields\"] = (\n                kwargs.pop(\"place_fields\")\n                if kwargs[\"place_fields\"]\n                else \",\".join(PLACE_FIELDS)\n            )\n\n        if \"list_fields\" in kwargs:\n            params[\"list.fields\"] = (\n                kwargs.pop(\"list_fields\")\n                if kwargs[\"list_fields\"]\n                else \",\".join(LIST_FIELDS)\n            )\n\n        # Format start_time and end_time\n        if \"start_time\" in kwargs:\n            start_time = kwargs[\"start_time\"]\n            params[\"start_time\"] = (\n                _ts(kwargs.pop(\"start_time\"))\n                if start_time and not isinstance(start_time, str)\n                else start_time\n            )\n\n    
    if \"end_time\" in kwargs:\n            end_time = kwargs[\"end_time\"]\n            params[\"end_time\"] = (\n                _ts(kwargs.pop(\"end_time\"))\n                if end_time and not isinstance(end_time, str)\n                else end_time\n            )\n\n        # Any other parameters passed as is,\n        # these include backfill_minutes, next_token, pagination_token, sort_order\n        params = {**params, **{k: v for k, v in kwargs.items() if v is not None}}\n\n        return params\n\n    def _search(\n        self,\n        url,\n        query,\n        since_id,\n        until_id,\n        start_time,\n        end_time,\n        max_results,\n        expansions,\n        tweet_fields,\n        user_fields,\n        media_fields,\n        poll_fields,\n        place_fields,\n        sort_order,\n        next_token=None,\n        granularity=None,\n        sleep_between=0,\n    ):\n        \"\"\"\n        Common function for search, counts endpoints.\n        \"\"\"\n\n        params = self._prepare_params(\n            query=query,\n            max_results=max_results,\n            since_id=since_id,\n            until_id=until_id,\n            start_time=start_time,\n            end_time=end_time,\n            next_token=next_token,\n            sort_order=sort_order,\n        )\n\n        if granularity:\n            # Do not specify anything else when calling counts endpoint\n            params[\"granularity\"] = granularity\n            # Mark that we're using counts, to workaround a limitation of the\n            # Twitter API with long running counts.\n            using_counts = True\n\n            # We need to use these as sentinel values, to differentiate\n            # between the count API returning zero prematurely, and queries\n            # like \"from:<no longer existing user_id>\". 
In the latter case\n            # instead of returning counts of 0 per day, it will just return\n            # an empty response with a total tweet count of zero. We can\n            # disambiguate the two cases by noting that the premature\n            # termination will already have counted some tweets correctly,\n            # while the latter will return immediately without any data\n            # rows.\n            time_periods_collected = 0\n            last_time_start = None\n        else:\n            params = self._prepare_params(\n                **params,\n                expansions=expansions,\n                tweet_fields=tweet_fields,\n                user_fields=user_fields,\n                media_fields=media_fields,\n                poll_fields=poll_fields,\n                place_fields=place_fields,\n            )\n            using_counts = False\n\n        # Workaround for observed odd behaviour in the Twitter counts\n        # functionality.\n        if using_counts:\n            while True:\n                for response in self.get_paginated(url, params=params):\n                    # Note that we're ensuring the appropriate amount of sleep is\n                    # taken before yielding every item. This ensures that we won't\n                    # exceed the rate limit even in cases where a response generator\n                    # is not completely consumed. 
This might be more conservative\n                    # than necessary.\n                    time.sleep(sleep_between)\n\n                    # can't return without 'data' if there are no results\n                    if \"data\" in response:\n                        last_time_start = response[\"data\"][0][\"start\"]\n                        time_periods_collected += len(response[\"data\"])\n                        yield response\n\n                    else:\n                        log.info(f\"Retrieved an empty page of results.\")\n\n                # Check that we've actually reached the end, and restart if necessary.\n                # Note we need to exactly match the Twitter format, which is a little\n                # fiddly because Python doesn't let you specify milliseconds only for\n                # strftime.\n                if (\n                    # If there's no explicit start time we're getting the last\n                    # 30 days by default, so don't need to do the tricky\n                    # things.\n                    start_time is None\n                    # We've actually reached the specified start time\n                    or (\n                        (start_time.strftime(\"%Y-%m-%dT%H:%M:%S.%f\")[:-3] + \"Z\")\n                        == last_time_start\n                    )\n                    # Or, we've hit one of the special cases that returns no rows\n                    # of data, and immediately indicates zero tweets returned, like\n                    # searching for a tweet that doesn't exist.\n                    or (time_periods_collected == 0)\n                ):\n                    break\n                else:\n                    # Note that we're passing the Twitter start_time straight\n                    # back to it - this avoids parsing and reformatting the date.\n                    params[\"end_time\"] = last_time_start\n\n                    # Remove the next_token reference, we're restarting the search.\n      
              if \"next_token\" in params:\n                        del params[\"next_token\"]\n\n                    log.info(\n                        \"Detected incomplete counts, restarting with \"\n                        f\"{last_time_start} as the new end_time\"\n                    )\n\n        else:\n            for response in self.get_paginated(url, params=params):\n                # Note that we're ensuring the appropriate amount of sleep is\n                # taken before yielding every item. This ensures that we won't\n                # exceed the rate limit even in cases where a response generator\n                # is not completely consumed. This might be more conservative\n                # than necessary.\n                time.sleep(sleep_between)\n\n                # can't return without 'data' if there are no results\n                if \"data\" in response:\n                    yield response\n\n                else:\n                    log.info(f\"Retrieved an empty page of results.\")\n\n        log.info(f\"No more results for search {query}.\")\n\n    def _lists(\n        self,\n        url,\n        expansions=None,\n        list_fields=None,\n        user_fields=None,\n        max_results=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Paginates and returns lists\n        \"\"\"\n        params = self._prepare_params(\n            list_fields=list_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n\n        if expansions:\n            params[\"expansions\"] = \"owner_id\"\n\n        for response in self.get_paginated(url, params=params):\n            # can return without 'data' if there are no results\n            if \"data\" in response:\n                yield response\n            else:\n                log.info(f\"Retrieved an empty page of results of lists for {url}\")\n\n    def list_followers(\n        self,\n        
list_id,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        max_results=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Returns a list of users who are followers of the specified List.\n\n        Calls [GET /2/lists/:id/followers](https://developer.twitter.com/en/docs/twitter-api/lists/list-follows/api-reference/get-lists-id-followers)\n\n        Args:\n            list_id (int): ID of the list.\n            expansions enum (pinned_tweet_id): Expansions, include pinned tweets.\n            max_results (int): the maximum number of results to retrieve. Between 1 and 100. Default is 100.\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n\n        \"\"\"\n        params = self._prepare_params(\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n\n        if expansions:\n            params[\"expansions\"] = \"pinned_tweet_id\"\n\n        url = f\"https://api.twitter.com/2/lists/{list_id}/followers\"\n        return self.get_paginated(url, params=params)\n\n    def list_members(\n        self,\n        list_id,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        max_results=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Returns a list of users who are members of the specified List.\n\n        Calls [GET /2/lists/:id/members](https://developer.twitter.com/en/docs/twitter-api/lists/list-members/api-reference/get-lists-id-members)\n\n        Args:\n            list_id (int): ID of the list.\n            expansions enum (pinned_tweet_id): Expansions, include pinned tweets.\n            max_results (int): The maximum number of results to be returned per page. 
This can be a number between 1 and 100.\n            pagination_token (string): Used to request the next page of results if all results weren't returned with the latest request, or to go back to the previous page of results.\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n\n        \"\"\"\n\n        params = self._prepare_params(\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n\n        if expansions:\n            params[\"expansions\"] = \"pinned_tweet_id\"\n\n        url = f\"https://api.twitter.com/2/lists/{list_id}/members\"\n        return self.get_paginated(url, params=params)\n\n    def list_memberships(\n        self,\n        user,\n        expansions=None,\n        list_fields=None,\n        user_fields=None,\n        max_results=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Returns all Lists a specified user is a member of.\n\n        Calls [GET /2/users/:id/list_memberships](https://developer.twitter.com/en/docs/twitter-api/lists/list-members/api-reference/get-users-id-list_memberships)\n\n        Args:\n            user (int): ID of the user.\n            expansions enum (owner_id): enable you to request additional data objects that relate to the originally returned List.\n            list_fields enum (created_at, follower_count, member_count, private, description, owner_id): This fields parameter enables you to select which specific List fields will deliver with each returned List objects.\n            user_fields enum (created_at, description, entities, id, location, name, pinned_tweet_id, profile_image_url, protected, public_metrics, url, username, verified, withheld):\n                This fields parameter enables you to select which specific user fields will deliver with the users object. 
Specify the desired fields in a comma-separated list without spaces between commas and fields.\n            max_results (int): The maximum number of results to be returned per page. This can be a number between 1 and 100.\n            pagination_token (string): Used to request the next page of results if all results weren't returned with the latest request, or to go back to the previous page of results.\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n        \"\"\"\n        user_id = self._ensure_user_id(user)\n        url = f\"https://api.twitter.com/2/users/{user_id}/list_memberships\"\n\n        return self._lists(\n            url=url,\n            expansions=expansions,\n            list_fields=list_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n\n    def owned_lists(\n        self,\n        user,\n        expansions=None,\n        list_fields=None,\n        user_fields=None,\n        max_results=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Returns all Lists owned by the specified user.\n\n        Calls [GET /2/users/:id/owned_lists](https://developer.twitter.com/en/docs/twitter-api/lists/list-lookup/api-reference/get-users-id-owned_lists)\n\n        Args:\n            user (int): ID of the user.\n            expansions enum (owner_id): enable you to request additional data objects that relate to the originally returned List.\n            list_fields enum (created_at, follower_count, member_count, private, description, owner_id): This fields parameter enables you to select which specific List fields will deliver with each returned List objects.\n            user_fields enum (created_at, description, entities, id, location, name, pinned_tweet_id, profile_image_url, protected, public_metrics, url, username, verified, withheld):\n                This fields parameter enables you to select which 
specific user fields will deliver with the users object. Specify the desired fields in a comma-separated list without spaces between commas and fields.\n            max_results (int): The maximum number of results to be returned per page. This can be a number between 1 and 100.\n            pagination_token (string): Used to request the next page of results if all results weren't returned with the latest request, or to go back to the previous page of results.\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n        \"\"\"\n        user_id = self._ensure_user_id(user)\n        url = f\"https://api.twitter.com/2/users/{user_id}/owned_lists\"\n\n        return self._lists(\n            url=url,\n            expansions=expansions,\n            list_fields=list_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n\n    def followed_lists(\n        self,\n        user,\n        expansions=None,\n        list_fields=None,\n        user_fields=None,\n        max_results=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Returns all Lists a specified user follows.\n\n        Calls [GET /2/users/:id/followed_lists](https://developer.twitter.com/en/docs/twitter-api/lists/list-follows/api-reference/get-users-id-followed_lists)\n\n        Args:\n            user (int): ID of the user.\n            expansions enum (owner_id): enable you to request additional data objects that relate to the originally returned List.\n            list_fields enum (created_at, follower_count, member_count, private, description, owner_id): This fields parameter enables you to select which specific List fields will deliver with each returned List objects.\n            user_fields enum (created_at, description, entities, id, location, name, pinned_tweet_id, profile_image_url, protected, public_metrics, url, username, verified, withheld):\n         
       This fields parameter enables you to select which specific user fields will deliver with the users object. Specify the desired fields in a comma-separated list without spaces between commas and fields.\n            max_results (int): The maximum number of results to be returned per page. This can be a number between 1 and 100.\n            pagination_token (string): Used to request the next page of results if all results weren't returned with the latest request, or to go back to the previous page of results.\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n        \"\"\"\n        user_id = self._ensure_user_id(user)\n        url = f\"https://api.twitter.com/2/users/{user_id}/followed_lists\"\n\n        return self._lists(\n            url=url,\n            expansions=expansions,\n            list_fields=list_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n\n    def pinned_lists(\n        self,\n        user,\n        expansions=None,\n        list_fields=None,\n        user_fields=None,\n        max_results=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Returns the Lists pinned by the authenticating user. 
Does not work with a Bearer token.\n\n        Calls [GET /2/users/:id/pinned_lists](https://developer.twitter.com/en/docs/twitter-api/lists/pinned-lists/api-reference/get-users-id-pinned_lists)\n\n        Args:\n            user (int): ID of the user.\n            expansions enum (owner_id): enable you to request additional data objects that relate to the originally returned List.\n            list_fields enum (created_at, follower_count, member_count, private, description, owner_id): This fields parameter enables you to select which specific List fields will deliver with each returned List objects.\n            user_fields enum (created_at, description, entities, id, location, name, pinned_tweet_id, profile_image_url, protected, public_metrics, url, username, verified, withheld):\n                This fields parameter enables you to select which specific user fields will deliver with the users object. Specify the desired fields in a comma-separated list without spaces between commas and fields.\n            max_results (int): The maximum number of results to be returned per page. 
This can be a number between 1 and 100.\n            pagination_token (string): Used to request the next page of results if all results weren't returned with the latest request, or to go back to the previous page of results.\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n        \"\"\"\n        user_id = self._ensure_user_id(user)\n        url = f\"https://api.twitter.com/2/users/{user_id}/pinned_lists\"\n\n        return self._lists(\n            url=url,\n            expansions=expansions,\n            list_fields=list_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n\n    def list_lookup(self, list_id, expansions=None, list_fields=None, user_fields=None):\n        \"\"\"\n        Returns the details of a specified List.\n\n        Calls [GET /2/lists/:id](https://developer.twitter.com/en/docs/twitter-api/lists/list-lookup/api-reference/get-lists-id)\n\n        Args:\n            list_id (int): ID of the list.\n            expansions enum (owner_id): enable you to request additional data objects that relate to the originally returned List.\n            list_fields enum (created_at, follower_count, member_count, private, description, owner_id): This fields parameter enables you to select which specific List fields will deliver with each returned List objects.\n            user_fields enum (created_at, description, entities, id, location, name, pinned_tweet_id, profile_image_url, protected, public_metrics, url, username, verified, withheld):\n                This fields parameter enables you to select which specific user fields will deliver with the users object. 
Specify the desired fields in a comma-separated list without spaces between commas and fields.\n\n        Returns:\n            dict: Result dictionary.\n        \"\"\"\n\n        params = self._prepare_params(\n            list_fields=list_fields,\n            user_fields=user_fields,\n        )\n\n        if expansions:\n            params[\"expansions\"] = \"owner_id\"\n        url = f\"https://api.twitter.com/2/lists/{list_id}\"\n        resp = self.get(url, params=params)\n        data = resp.json()\n\n        if self.metadata:\n            data = _append_metadata(data, resp.url)\n\n        return data\n\n    def list_tweets(\n        self,\n        list_id,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        max_results=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Returns Tweets from the specified List.\n\n        Calls [GET /2/lists/:id/tweets](https://developer.twitter.com/en/docs/twitter-api/lists/list-tweets/api-reference/get-lists-id-tweets)\n\n        Args:\n            list_id (int): ID of the list.\n            expansions enum (author_id): enable you to request additional data objects that relate to the originally returned List.\n            list_fields enum (created_at, follower_count, member_count, private, description, owner_id): This fields parameter enables you to select which specific List fields will deliver with each returned List objects.\n            user_fields enum (created_at, description, entities, id, location, name, pinned_tweet_id, profile_image_url, protected, public_metrics, url, username, verified, withheld):\n                This fields parameter enables you to select which specific user fields will deliver with the users object. 
Specify the desired fields in a comma-separated list without spaces between commas and fields.\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n        \"\"\"\n\n        params = self._prepare_params(\n            expansions=expansions,\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n\n        url = f\"https://api.twitter.com/2/lists/{list_id}/tweets\"\n        return self.get_paginated(url, params=params)\n\n    def search_recent(\n        self,\n        query,\n        since_id=None,\n        until_id=None,\n        start_time=None,\n        end_time=None,\n        max_results=100,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        media_fields=None,\n        poll_fields=None,\n        place_fields=None,\n        next_token=None,\n        sort_order=None,\n    ):\n        \"\"\"\n        Search Twitter for the given query in the last seven days,\n        using the `/search/recent` endpoint.\n\n        Calls [GET /2/tweets/search/recent](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent)\n\n        Args:\n            query (str):\n                The query string to be passed directly to the Twitter API.\n            since_id (int):\n                Return all tweets since this tweet_id.\n            until_id (int):\n                Return all tweets up to this tweet_id.\n            start_time (datetime):\n                Return all tweets after this time (UTC datetime).\n            end_time (datetime):\n                Return all tweets before this time (UTC datetime).\n            max_results (int):\n                The maximum number of results per request. 
Max is 100.\n            sort_order (str):\n                Order tweets based on relevancy or recency.\n\n        Returns:\n            generator[dict]: a generator, dict for each paginated response.\n        \"\"\"\n        return self._search(\n            url=\"https://api.twitter.com/2/tweets/search/recent\",\n            query=query,\n            since_id=since_id,\n            until_id=until_id,\n            start_time=start_time,\n            end_time=end_time,\n            max_results=max_results,\n            expansions=expansions,\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            media_fields=media_fields,\n            poll_fields=poll_fields,\n            place_fields=place_fields,\n            next_token=next_token,\n            sort_order=sort_order,\n        )\n\n    @requires_app_auth\n    def search_all(\n        self,\n        query,\n        since_id=None,\n        until_id=None,\n        start_time=None,\n        end_time=None,\n        max_results=100,  # temp fix for #504\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        media_fields=None,\n        poll_fields=None,\n        place_fields=None,\n        next_token=None,\n        sort_order=None,\n    ):\n        \"\"\"\n        Search Twitter for the given query in the full archive,\n        using the `/search/all` endpoint (Requires Academic Access).\n\n        Calls [GET /2/tweets/search/all](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all)\n\n        Args:\n            query (str):\n                The query string to be passed directly to the Twitter API.\n            since_id (int):\n                Return all tweets since this tweet_id.\n            until_id (int):\n                Return all tweets up to this tweet_id.\n            start_time (datetime):\n                Return all tweets after this time (UTC datetime). 
If none of start_time, since_id, or until_id\n                are specified, this defaults to 2006-3-21 to search the entire history of Twitter.\n            end_time (datetime):\n                Return all tweets before this time (UTC datetime).\n            max_results (int):\n                The maximum number of results per request. Max is 500.\n            sort_order (str):\n                Order tweets based on relevancy or recency.\n\n        Returns:\n            generator[dict]: a generator, dict for each paginated response.\n        \"\"\"\n\n        # start time defaults to the beginning of Twitter to override the\n        # default of the last month. Only do this if start_time is not already\n        # specified and since_id and until_id aren't being used\n        if start_time is None and since_id is None and until_id is None:\n            start_time = datetime.datetime(2006, 3, 21, tzinfo=datetime.timezone.utc)\n\n        return self._search(\n            url=\"https://api.twitter.com/2/tweets/search/all\",\n            query=query,\n            since_id=since_id,\n            until_id=until_id,\n            start_time=start_time,\n            end_time=end_time,\n            max_results=max_results,\n            expansions=expansions,\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            media_fields=media_fields,\n            poll_fields=poll_fields,\n            place_fields=place_fields,\n            next_token=next_token,\n            sleep_between=1.05,\n            sort_order=sort_order,\n        )\n\n    @requires_app_auth\n    def counts_recent(\n        self,\n        query,\n        since_id=None,\n        until_id=None,\n        start_time=None,\n        end_time=None,\n        granularity=\"hour\",\n    ):\n        \"\"\"\n        Retrieve counts for the given query in the last seven days,\n        using the `/counts/recent` endpoint.\n\n        Calls [GET 
/2/tweets/counts/recent](https://developer.twitter.com/en/docs/twitter-api/tweets/counts/api-reference/get-tweets-counts-recent)\n\n        Args:\n            query (str):\n                The query string to be passed directly to the Twitter API.\n            since_id (int):\n                Return all tweets since this tweet_id.\n            until_id (int):\n                Return all tweets up to this tweet_id.\n            start_time (datetime):\n                Return all tweets after this time (UTC datetime).\n            end_time (datetime):\n                Return all tweets before this time (UTC datetime).\n            granularity (str):\n                Count aggregation level: `day`, `hour`, `minute`.\n                Default is `hour`.\n\n        Returns:\n            generator[dict]: a generator, dict for each paginated response.\n        \"\"\"\n        return self._search(\n            url=\"https://api.twitter.com/2/tweets/counts/recent\",\n            query=query,\n            since_id=since_id,\n            until_id=until_id,\n            start_time=start_time,\n            end_time=end_time,\n            max_results=None,\n            expansions=None,\n            tweet_fields=None,\n            user_fields=None,\n            media_fields=None,\n            poll_fields=None,\n            place_fields=None,\n            granularity=granularity,\n            sort_order=None,\n        )\n\n    @requires_app_auth\n    def counts_all(\n        self,\n        query,\n        since_id=None,\n        until_id=None,\n        start_time=None,\n        end_time=None,\n        granularity=\"hour\",\n        next_token=None,\n    ):\n        \"\"\"\n        Retrieve counts for the given query in the full archive,\n        using the `/counts/all` endpoint (Requires Academic Access).\n\n        Calls [GET /2/tweets/counts/all](https://developer.twitter.com/en/docs/twitter-api/tweets/counts/api-reference/get-tweets-counts-all)\n\n        Args:\n            query 
(str):\n                The query string to be passed directly to the Twitter API.\n            since_id (int):\n                Return all tweets since this tweet_id.\n            until_id (int):\n                Return all tweets up to this tweet_id.\n            start_time (datetime):\n                Return all tweets after this time (UTC datetime).\n            end_time (datetime):\n                Return all tweets before this time (UTC datetime).\n            granularity (str):\n                Count aggregation level: `day`, `hour`, `minute`.\n                Default is `hour`.\n\n        Returns:\n            generator[dict]: a generator, dict for each paginated response.\n        \"\"\"\n        return self._search(\n            url=\"https://api.twitter.com/2/tweets/counts/all\",\n            query=query,\n            since_id=since_id,\n            until_id=until_id,\n            start_time=start_time,\n            end_time=end_time,\n            max_results=None,\n            expansions=None,\n            tweet_fields=None,\n            user_fields=None,\n            media_fields=None,\n            poll_fields=None,\n            place_fields=None,\n            next_token=next_token,\n            granularity=granularity,\n            sleep_between=1.05,\n            sort_order=None,\n        )\n\n    def tweet_lookup(\n        self,\n        tweet_ids,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        media_fields=None,\n        poll_fields=None,\n        place_fields=None,\n    ):\n        \"\"\"\n        Lookup tweets, taking an iterator of IDs and returning pages of fully\n        expanded tweet objects.\n\n        This can be used to rehydrate a collection shared as only tweet IDs.\n        Yields one page of tweets at a time, in blocks of up to 100.\n\n        Calls [GET /2/tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/api-reference/get-tweets)\n\n        Args:\n            tweet_ids 
(iterable): A list of tweet IDs\n\n        Returns:\n            generator[dict]: a generator, dict for each batch of 100 tweets.\n        \"\"\"\n\n        def lookup_batch(tweet_id):\n            url = \"https://api.twitter.com/2/tweets\"\n\n            params = self._prepare_params(\n                expansions=expansions,\n                tweet_fields=tweet_fields,\n                user_fields=user_fields,\n                media_fields=media_fields,\n                poll_fields=poll_fields,\n                place_fields=place_fields,\n            )\n            params[\"ids\"] = \",\".join(tweet_id)\n\n            resp = self.get(url, params=params)\n            data = resp.json()\n\n            if self.metadata:\n                data = _append_metadata(data, resp.url)\n\n            return data\n\n        tweet_id_batch = []\n\n        for tweet_id in tweet_ids:\n            tweet_id_batch.append(str(int(tweet_id)))\n\n            if len(tweet_id_batch) == 100:\n                yield lookup_batch(tweet_id_batch)\n                tweet_id_batch = []\n\n        if tweet_id_batch:\n            yield (lookup_batch(tweet_id_batch))\n\n    def user_lookup(\n        self,\n        users,\n        usernames=False,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n    ):\n        \"\"\"\n        Returns fully populated user profiles for the given iterator of\n        user_id or usernames. 
By default user_lookup expects user ids but if\n        you want to pass in usernames set usernames = True.\n\n        Yields one page of results at a time (in blocks of at most 100 user\n        profiles).\n\n        Calls [GET /2/users](https://developer.twitter.com/en/docs/twitter-api/users/lookup/api-reference/get-users)\n\n        Args:\n            users (iterable): User IDs or usernames to lookup.\n            usernames (bool): Parse `users` as usernames, not IDs.\n\n        Returns:\n            generator[dict]: a generator, dict for each batch of 100 users.\n        \"\"\"\n\n        if isinstance(users, str):\n            raise TypeError(\"users must be an iterable other than a string\")\n\n        if usernames:\n            url = \"https://api.twitter.com/2/users/by\"\n        else:\n            url = \"https://api.twitter.com/2/users\"\n\n        def lookup_batch(users):\n            params = self._prepare_params(\n                tweet_fields=tweet_fields,\n                user_fields=user_fields,\n            )\n            if expansions:\n                params[\"expansions\"] = \"pinned_tweet_id\"\n            if usernames:\n                params[\"usernames\"] = \",\".join(users)\n            else:\n                params[\"ids\"] = \",\".join(users)\n\n            resp = self.get(url, params=params)\n            data = resp.json()\n\n            if self.metadata:\n                data = _append_metadata(data, resp.url)\n\n            return data\n\n        batch = []\n        for item in users:\n            batch.append(str(item).strip())\n            if len(batch) == 100:\n                yield lookup_batch(batch)\n                batch = []\n\n        if batch:\n            yield (lookup_batch(batch))\n\n    @catch_request_exceptions\n    @requires_app_auth\n    def sample(\n        self,\n        event=None,\n        record_keepalive=False,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        
media_fields=None,\n        poll_fields=None,\n        place_fields=None,\n        backfill_minutes=None,\n    ):\n        \"\"\"\n        Returns a sample of all publicly posted tweets.\n\n        The sample is based on slices of each second, not truly randomised. The\n        same tweets are returned for all users of this endpoint.\n\n        If a `threading.Event` is provided and the event is set, the\n        sample will be interrupted. This can be used for coordination with other\n        programs.\n\n        Calls [GET /2/tweets/sample/stream](https://developer.twitter.com/en/docs/twitter-api/tweets/sampled-stream/api-reference/get-tweets-sample-stream)\n\n        Args:\n            event (threading.Event): Manages a flag to stop the process.\n            record_keepalive (bool): whether to output keep-alive events.\n\n        Returns:\n            generator[dict]: a generator, dict for each tweet.\n        \"\"\"\n        url = \"https://api.twitter.com/2/tweets/sample/stream\"\n        params = self._prepare_params(\n            expansions=expansions,\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            media_fields=media_fields,\n            poll_fields=poll_fields,\n            place_fields=place_fields,\n            backfill_minutes=backfill_minutes,\n        )\n        yield from self._stream(url, params, event, record_keepalive)\n\n    @requires_app_auth\n    def add_stream_rules(self, rules):\n        \"\"\"\n        Adds new rules to the filter stream.\n\n        Calls [POST /2/tweets/search/stream/rules](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/api-reference/post-tweets-search-stream-rules)\n\n        Args:\n            rules (list[dict]): A list of rules to add.\n\n        Returns:\n            dict: JSON Response from Twitter API.\n        \"\"\"\n        url = \"https://api.twitter.com/2/tweets/search/stream/rules\"\n        return self.post(url, {\"add\": rules}).json()\n\n    
@requires_app_auth\n    def get_stream_rules(self):\n        \"\"\"\n        Returns a list of rules for the filter stream.\n\n        Calls [GET /2/tweets/search/stream/rules](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/api-reference/get-tweets-search-stream-rules)\n\n        Returns:\n            dict: JSON Response from Twitter API with a list of defined rules.\n        \"\"\"\n        url = \"https://api.twitter.com/2/tweets/search/stream/rules\"\n        return self.get(url).json()\n\n    @requires_app_auth\n    def delete_stream_rule_ids(self, rule_ids):\n        \"\"\"\n        Deletes rules from the filter stream.\n\n        Calls [POST /2/tweets/search/stream/rules](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/api-reference/post-tweets-search-stream-rules)\n\n        Args:\n            rule_ids (list[int]): A list of rule ids to delete.\n\n        Returns:\n            dict: JSON Response from Twitter API.\n        \"\"\"\n        url = \"https://api.twitter.com/2/tweets/search/stream/rules\"\n        return self.post(url, {\"delete\": {\"ids\": rule_ids}}).json()\n\n    @requires_app_auth\n    def stream(\n        self,\n        event=None,\n        record_keepalive=False,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        media_fields=None,\n        poll_fields=None,\n        place_fields=None,\n        backfill_minutes=None,\n    ):\n        \"\"\"\n        Returns a stream of tweets matching the defined rules.\n\n        Rules can be added or removed out-of-band, without disconnecting.\n        Tweet results will contain metadata about the rule that matched it.\n\n        If event is set with a threading.Event object, the sample stream\n        will be interrupted. 
This can be used for coordination with other\n        programs.\n\n        Calls [GET /2/tweets/search/stream](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/api-reference/get-tweets-search-stream)\n\n        Args:\n            event (threading.Event): Manages a flag to stop the process.\n            record_keepalive (bool): whether to output keep-alive events.\n\n        Returns:\n            generator[dict]: a generator, dict for each tweet.\n        \"\"\"\n        url = \"https://api.twitter.com/2/tweets/search/stream\"\n        params = self._prepare_params(\n            expansions=expansions,\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            media_fields=media_fields,\n            poll_fields=poll_fields,\n            place_fields=place_fields,\n            backfill_minutes=backfill_minutes,\n        )\n        yield from self._stream(url, params, event, record_keepalive)\n\n    def _stream(self, url, params, event, record_keepalive, tries=30):\n        \"\"\"\n        A generator that handles streaming data from a response and catches and\n        logs any request exceptions, sleeps (quadratic backoff) and restarts\n        the stream.\n\n        Args:\n            url (str): the streaming endpoint URL\n            params (dict): any query parameters to use with the url\n            event (threading.Event): Manages a flag to stop the process.\n            record_keepalive (bool): whether to output keep-alive events.\n            tries (int): the number of times to retry connecting after an error\n        Returns:\n            generator[dict]: A generator of tweet dicts.\n        \"\"\"\n        errors = 0\n        while True:\n            log.info(f\"connecting to stream {url}\")\n            resp = self.get(url, params=params, stream=True)\n\n            try:\n                for line in resp.iter_lines():\n                    errors = 0\n\n                    # quit & close the stream if 
the event is set\n                    if event and event.is_set():\n                        log.info(\"stopping response stream\")\n                        resp.close()\n                        return\n\n                    # return the JSON data w/ optional keep-alive\n                    if not line:\n                        log.info(\"keep-alive\")\n                        if record_keepalive:\n                            yield \"keep-alive\"\n                        continue\n                    else:\n                        data = json.loads(line.decode())\n                        if self.metadata:\n                            data = _append_metadata(data, resp.url)\n                        yield data\n                        if self._check_for_disconnect(data):\n                            break\n\n            except requests.exceptions.RequestException as e:\n                log.warning(\"caught exception during streaming: %s\", e)\n                errors += 1\n                if errors > tries:\n                    log.error(f\"too many consecutive errors ({tries}). 
stopping\")\n                    return\n                else:\n                    secs = errors**2\n                    log.info(\"sleeping %s seconds before reconnecting\", secs)\n                    time.sleep(secs)\n\n    def _timeline(\n        self,\n        user_id,\n        timeline_type,\n        since_id,\n        until_id,\n        start_time,\n        end_time,\n        exclude_retweets,\n        exclude_replies,\n        max_results=None,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        media_fields=None,\n        poll_fields=None,\n        place_fields=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Helper function for user and mention timelines\n\n        Calls [GET /2/users/:id/tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-tweets)\n        or [GET /2/users/:id/mentions](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-mentions)\n\n        Args:\n            user_id (int): ID of the user.\n            timeline_type (str): timeline type: `tweets` or `mentions`\n            since_id (int): results with a Tweet ID greater than (newer than) the one specified\n            until_id (int): results with a Tweet ID less than (older than) the one specified\n            start_time (datetime): oldest UTC timestamp from which the Tweets will be provided\n            end_time (datetime): newest UTC timestamp from which the Tweets will be provided\n            exclude_retweets (boolean): remove retweets from timeline\n            exclude_replies (boolean): remove replies from timeline\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n        \"\"\"\n\n        url = f\"https://api.twitter.com/2/users/{user_id}/{timeline_type}\"\n\n        params = self._prepare_params(\n            since_id=since_id,\n            until_id=until_id,\n            start_time=start_time,\n         
   end_time=end_time,\n            max_results=max_results,\n            expansions=expansions,\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            media_fields=media_fields,\n            poll_fields=poll_fields,\n            place_fields=place_fields,\n            pagination_token=pagination_token,\n        )\n\n        excludes = []\n        if exclude_retweets:\n            excludes.append(\"retweets\")\n        if exclude_replies:\n            excludes.append(\"replies\")\n        if len(excludes) > 0:\n            params[\"exclude\"] = \",\".join(excludes)\n\n        for response in self.get_paginated(url, params=params):\n            # can return without 'data' if there are no results\n            if \"data\" in response:\n                yield response\n            else:\n                log.info(f\"Retrieved an empty page of results for timeline {user_id}\")\n\n        log.info(f\"No more results for timeline {user_id}.\")\n\n    def timeline(\n        self,\n        user,\n        since_id=None,\n        until_id=None,\n        start_time=None,\n        end_time=None,\n        exclude_retweets=False,\n        exclude_replies=False,\n        max_results=100,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        media_fields=None,\n        poll_fields=None,\n        place_fields=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Retrieve up to the 3200 most recent tweets made by the given user.\n\n        Calls [GET /2/users/:id/tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-tweets)\n\n        Args:\n            user (int): ID of the user.\n            since_id (int): results with a Tweet ID greater than (newer) than specified\n            until_id (int): results with a Tweet ID less than (older) than specified\n            start_time (datetime): oldest UTC timestamp from which the Tweets will be provided\n     
       end_time (datetime): newest UTC timestamp from which the Tweets will be provided\n            exclude_retweets (boolean): remove retweets from timeline results\n            exclude_replies (boolean): remove replies from timeline results\n            max_results (int): the maximum number of Tweets to retrieve. Between 5 and 100.\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n        \"\"\"\n        user_id = self._ensure_user_id(user)\n        return self._timeline(\n            user_id=user_id,\n            timeline_type=\"tweets\",\n            since_id=since_id,\n            until_id=until_id,\n            start_time=start_time,\n            end_time=end_time,\n            exclude_retweets=exclude_retweets,\n            exclude_replies=exclude_replies,\n            max_results=max_results,\n            expansions=expansions,\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            media_fields=media_fields,\n            poll_fields=poll_fields,\n            place_fields=place_fields,\n            pagination_token=pagination_token,\n        )\n\n    def mentions(\n        self,\n        user,\n        since_id=None,\n        until_id=None,\n        start_time=None,\n        end_time=None,\n        exclude_retweets=False,\n        exclude_replies=False,\n        max_results=100,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        media_fields=None,\n        poll_fields=None,\n        place_fields=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Retrieve up to the 800 most recent tweets mentioning the given user.\n\n        Calls [GET /2/users/:id/mentions](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-mentions)\n\n        Args:\n            user (int): ID of the user.\n            since_id (int): results with a Tweet ID greater than (newer) than specified\n            
until_id (int): results with a Tweet ID less than (older) than specified\n            start_time (datetime): oldest UTC timestamp from which the Tweets will be provided\n            end_time (datetime): newest UTC timestamp from which the Tweets will be provided\n            exclude_retweets (boolean): remove retweets from timeline results\n            exclude_replies (boolean): remove replies from timeline results\n            max_results (int): the maximum number of Tweets to retrieve. Between 5 and 100.\n\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n        \"\"\"\n        user_id = self._ensure_user_id(user)\n        return self._timeline(\n            user_id=user_id,\n            timeline_type=\"mentions\",\n            since_id=since_id,\n            until_id=until_id,\n            start_time=start_time,\n            end_time=end_time,\n            exclude_retweets=exclude_retweets,\n            exclude_replies=exclude_replies,\n            max_results=max_results,\n            expansions=expansions,\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            media_fields=media_fields,\n            poll_fields=poll_fields,\n            place_fields=place_fields,\n            pagination_token=pagination_token,\n        )\n\n    def following(\n        self,\n        user,\n        user_id=None,\n        max_results=1000,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Retrieve the user profiles of accounts followed by the given user.\n\n        Calls [GET /2/users/:id/following](https://developer.twitter.com/en/docs/twitter-api/users/follows/api-reference/get-users-id-following)\n\n        Args:\n            user (int): ID of the user.\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n        \"\"\"\n        user_id = 
self._ensure_user_id(user) if not user_id else user_id\n        params = self._prepare_params(\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n        if expansions:\n            params[\"expansions\"] = \"pinned_tweet_id\"\n        url = f\"https://api.twitter.com/2/users/{user_id}/following\"\n        return self.get_paginated(url, params=params)\n\n    def followers(\n        self,\n        user,\n        user_id=None,\n        max_results=1000,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Retrieve the user profiles of accounts following the given user.\n\n        Calls [GET /2/users/:id/followers](https://developer.twitter.com/en/docs/twitter-api/users/follows/api-reference/get-users-id-followers)\n\n        Args:\n            user (int): ID of the user.\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n        \"\"\"\n        user_id = self._ensure_user_id(user) if not user_id else user_id\n        params = self._prepare_params(\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n        if expansions:\n            params[\"expansions\"] = \"pinned_tweet_id\"\n        url = f\"https://api.twitter.com/2/users/{user_id}/followers\"\n        return self.get_paginated(url, params=params)\n\n    def liking_users(\n        self,\n        tweet_id,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        max_results=100,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Retrieve the user profiles of accounts that have liked the given tweet.\n\n        \"\"\"\n        url = 
f\"https://api.twitter.com/2/tweets/{tweet_id}/liking_users\"\n\n        params = self._prepare_params(\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n\n        if expansions:\n            params[\"expansions\"] = \"pinned_tweet_id\"\n\n        for page in self.get_paginated(url, params=params):\n            if \"data\" in page:\n                yield page\n            else:\n                log.info(\n                    f\"Retrieved an empty page of results for liking_users of {tweet_id}\"\n                )\n\n    def liked_tweets(\n        self,\n        user_id,\n        max_results=100,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        media_fields=None,\n        poll_fields=None,\n        place_fields=None,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Retrieve the tweets liked by the given user_id.\n\n        \"\"\"\n        user_id = self._ensure_user_id(user_id)\n        url = f\"https://api.twitter.com/2/users/{user_id}/liked_tweets\"\n\n        params = self._prepare_params(\n            max_results=100,\n            expansions=None,\n            tweet_fields=None,\n            user_fields=None,\n            media_fields=None,\n            poll_fields=None,\n            place_fields=None,\n            pagination_token=None,\n        )\n\n        for page in self.get_paginated(url, params=params):\n            if \"data\" in page:\n                yield page\n            else:\n                log.info(\n                    f\"Retrieved an empty page of results for liked_tweets of {user_id}\"\n                )\n\n    def retweeted_by(\n        self,\n        tweet_id,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        max_results=100,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Retrieve the user profiles of 
accounts that have retweeted the given tweet.\n\n        \"\"\"\n        url = f\"https://api.twitter.com/2/tweets/{tweet_id}/retweeted_by\"\n\n        params = self._prepare_params(\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n\n        if expansions:\n            params[\"expansions\"] = \"pinned_tweet_id\"\n\n        for page in self.get_paginated(url, params=params):\n            if \"data\" in page:\n                yield page\n            else:\n                log.info(\n                    f\"Retrieved an empty page of results for retweeted_by of {tweet_id}\"\n                )\n\n    def quotes(\n        self,\n        tweet_id,\n        expansions=None,\n        tweet_fields=None,\n        user_fields=None,\n        max_results=100,\n        pagination_token=None,\n    ):\n        \"\"\"\n        Retrieve the tweets that quote tweet the given tweet.\n\n        \"\"\"\n        url = f\"https://api.twitter.com/2/tweets/{tweet_id}/quote_tweets\"\n\n        params = self._prepare_params(\n            expansions=expansions,\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            max_results=max_results,\n            pagination_token=pagination_token,\n        )\n\n        for page in self.get_paginated(url, params=params):\n            if \"data\" in page:\n                yield page\n            else:\n                log.info(f\"Retrieved an empty page of results for quotes of {tweet_id}\")\n\n    @catch_request_exceptions\n    @rate_limit\n    def get(self, *args, **kwargs):\n        \"\"\"\n        Make a GET request to a specified URL.\n\n        Args:\n            *args: Variable length argument list.\n            **kwargs: Arbitrary keyword arguments.\n\n        Returns:\n            requests.Response: Response from Twitter API.\n        \"\"\"\n        if not self.client:\n            
self.connect()\n        log.info(\"getting %s %s\", args, kwargs)\n        r = self.last_response = self.client.get(*args, timeout=(3.05, 31), **kwargs)\n        return r\n\n    def get_paginated(self, *args, **kwargs):\n        \"\"\"\n        A wrapper around the `get` method that handles Twitter token based\n        pagination.\n\n        Yields one page (one API response) at a time.\n\n        Args:\n            *args: Variable length argument list.\n            **kwargs: Arbitrary keyword arguments.\n\n        Returns:\n            generator[dict]: A generator, dict for each page of results.\n        \"\"\"\n\n        resp = self.get(*args, **kwargs)\n        page = resp.json()\n\n        url = args[0]\n\n        if self.metadata:\n            page = _append_metadata(page, resp.url)\n\n        yield page\n\n        # Todo: Maybe this should be backwards.. check for `next_token`\n        endings = [\n            \"mentions\",\n            \"tweets\",\n            \"following\",\n            \"followers\",\n            \"liked_tweets\",\n            \"liking_users\",\n            \"retweeted_by\",\n            \"members\",\n            \"memberships\",\n            \"followed_lists\",\n            \"owned_lists\",\n            \"pinned_lists\",\n        ]\n\n        # The search endpoints only take a next_token, but the timeline\n        # endpoints take a pagination_token instead - this is a bit of a hack,\n        # but check the URL ending to see which we should use.\n        if any(url.endswith(end) for end in endings):\n            token_param = \"pagination_token\"\n        else:\n            token_param = \"next_token\"\n\n        while \"meta\" in page and \"next_token\" in page[\"meta\"]:\n            if \"params\" in kwargs:\n                kwargs[\"params\"][token_param] = page[\"meta\"][\"next_token\"]\n            else:\n                kwargs[\"params\"] = {token_param: page[\"meta\"][\"next_token\"]}\n\n            resp = self.get(*args, 
**kwargs)\n            page = resp.json()\n\n            if self.metadata:\n                page = _append_metadata(page, resp.url)\n\n            yield page\n\n    @catch_request_exceptions\n    @rate_limit\n    def post(self, url, json_data):\n        \"\"\"\n        Make a POST request to the specified URL.\n\n        Args:\n            url (str): URL to make a POST request\n            json_data (dict): JSON data to send.\n\n        Returns:\n            requests.Response: Response from Twitter API.\n        \"\"\"\n        if not self.client:\n            self.connect()\n        return self.client.post(url, json=json_data)\n\n    def connect(self):\n        \"\"\"\n        Sets up the HTTP session to talk to Twitter. If one is active it is\n        closed and another one is opened.\n        \"\"\"\n        if self.last_response:\n            self.last_response.close()\n\n        if self.client:\n            self.client.close()\n\n        if self.auth_type == \"application\" and self.bearer_token:\n            log.info(\"creating HTTP session headers for app auth.\")\n            auth = f\"Bearer {self.bearer_token}\"\n            log.debug(\"authorization: %s\", auth)\n            self.client = requests.Session()\n            self.client.headers.update({\"Authorization\": auth})\n        elif self.auth_type == \"application\":\n            log.info(\"creating app auth client via OAuth2\")\n            log.debug(\"client_id: %s\", self.consumer_key)\n            log.debug(\"client_secret: %s\", self.consumer_secret)\n            client = BackendApplicationClient(client_id=self.consumer_key)\n            self.client = OAuth2Session(client=client)\n            self.client.fetch_token(\n                token_url=\"https://api.twitter.com/oauth2/token\",\n                client_id=self.consumer_key,\n                client_secret=self.consumer_secret,\n            )\n        else:\n            log.info(\"creating user auth client\")\n            
log.debug(\"client_id: %s\", self.consumer_key)\n            log.debug(\"client_secret: %s\", self.consumer_secret)\n            log.debug(\"resource_owner_key: %s\", self.access_token)\n            log.debug(\"resource_owner_secret: %s\", self.access_token_secret)\n            self.client = OAuth1Session(\n                client_key=self.consumer_key,\n                client_secret=self.consumer_secret,\n                resource_owner_key=self.access_token,\n                resource_owner_secret=self.access_token_secret,\n            )\n\n        if self.client:\n            self.client.headers.update({\"User-Agent\": user_agent})\n\n    @requires_app_auth\n    def compliance_job_list(self, job_type, status):\n        \"\"\"\n        Returns list of compliance jobs.\n\n        Calls [GET /2/compliance/jobs](https://developer.twitter.com/en/docs/twitter-api/compliance/batch-compliance/api-reference/get-compliance-jobs)\n\n        Args:\n            job_type (str): Filter by job type - either tweets or users.\n            status (str): Filter by job status. Only one of 'created', 'in_progress', 'complete', 'failed' can be specified. 
If not set, returns all.\n\n        Returns:\n            list[dict]: A list of jobs.\n        \"\"\"\n        params = {}\n        if job_type:\n            params[\"type\"] = job_type\n        if status:\n            params[\"status\"] = status\n        result = self.client.get(\n            \"https://api.twitter.com/2/compliance/jobs\", params=params\n        ).json()\n        if \"data\" in result or not result:\n            return result\n        else:\n            raise ValueError(f\"Unknown response from twitter: {result}\")\n\n    @requires_app_auth\n    def compliance_job_get(self, job_id):\n        \"\"\"\n        Returns a compliance job.\n\n        Calls [GET /2/compliance/jobs/{job_id}](https://developer.twitter.com/en/docs/twitter-api/compliance/batch-compliance/api-reference/get-compliance-jobs-id)\n\n        Args:\n            job_id (int): The ID of the compliance job.\n\n        Returns:\n            dict: A compliance job.\n        \"\"\"\n        result = self.client.get(\n            \"https://api.twitter.com/2/compliance/jobs/{}\".format(job_id)\n        )\n        if result.status_code == 200:\n            result = result.json()\n        else:\n            raise ValueError(f\"Error from API, response: {result.status_code}\")\n        if \"data\" in result:\n            return result\n        else:\n            raise ValueError(f\"Unknown response from twitter: {result}\")\n\n    @requires_app_auth\n    def compliance_job_create(self, job_type, job_name, resumable=False):\n        \"\"\"\n        Creates a new compliance job.\n\n        Calls [POST /2/compliance/jobs](https://developer.twitter.com/en/docs/twitter-api/compliance/batch-compliance/api-reference/post-compliance-jobs)\n\n        Args:\n            job_type (str): The type of job to create. 
Either 'tweets' or 'users'.\n            job_name (str): Optional name for the job.\n            resumable (bool): Whether or not the job upload is resumable.\n        \"\"\"\n        payload = {}\n        payload[\"type\"] = job_type\n        payload[\"resumable\"] = resumable\n        if job_name:\n            payload[\"name\"] = job_name\n\n        result = self.client.post(\n            \"https://api.twitter.com/2/compliance/jobs\", json=payload\n        )\n\n        if result.status_code == 200:\n            result = result.json()\n        else:\n            raise ValueError(f\"Error from API, response: {result.status_code}\")\n        if \"data\" in result:\n            return result\n        else:\n            raise ValueError(f\"Unknown response from twitter: {result}\")\n\n    def geo(\n        self,\n        lat=None,\n        lon=None,\n        query=None,\n        ip=None,\n        granularity=\"neighborhood\",\n        max_results=None,\n    ):\n        \"\"\"\n        Gets geographic places that can be useful in queries. 
This is a v1.1\n        endpoint but is useful in querying the v2 API.\n\n        Calls [1.1/geo/search.json](https://api.twitter.com/1.1/geo/search.json)\n\n        Args:\n            lat (float): latitude to search around\n            lon (float): longitude to search around\n            query (str): text to match in the place name\n            ip (str): use the ip address to locate places\n            granularity (str): neighborhood, city, admin, country\n            max_results (int): maximum results to return\n        \"\"\"\n\n        params = {}\n        if lat and lon:\n            params[\"lat\"] = lat\n            params[\"long\"] = lon\n        elif query:\n            params[\"query\"] = query\n        elif ip:\n            params[\"ip\"] = ip\n        else:\n            raise ValueError(\"geo() needs either lat/lon, query or ip\")\n\n        if granularity not in [\"neighborhood\", \"city\", \"admin\", \"country\"]:\n            raise ValueError(\n                f\"{granularity} is not a valid value for granularity, please use neighborhood, city, admin or country\"\n            )\n        params[\"granularity\"] = granularity\n\n        if max_results and type(max_results) != int:\n            raise ValueError(\"max_results must be an int\")\n        params[\"max_results\"] = max_results\n\n        url = \"https://api.twitter.com/1.1/geo/search.json\"\n\n        result = self.get(url, params=params)\n        if result.status_code == 200:\n            result = result.json()\n        else:\n            raise ValueError(f\"Error from API, response: {result.status_code}\")\n\n        return result\n\n    def _id_exists(self, user):\n        \"\"\"\n        Returns True if the user id exists\n        \"\"\"\n        try:\n            error_name = next(self.user_lookup([user]))[\"errors\"][0][\"title\"]\n            return error_name != \"Not Found Error\"\n        except KeyError:\n            return True\n\n    def _ensure_user_id(self, user):\n        
\"\"\"\n        Always return a valid user id, look up if not numeric.\n        \"\"\"\n        user = str(user)\n        is_numeric = re.match(r\"^\\d+$\", user)\n\n        if len(user) > 15 or (is_numeric and self._id_exists(user)):\n            return user\n        else:\n            results = next(self.user_lookup([user], usernames=True))\n            if \"data\" in results and len(results[\"data\"]) > 0:\n                return results[\"data\"][0][\"id\"]\n            elif is_numeric:\n                return user\n            else:\n                raise ValueError(f\"No such user {user}\")\n\n    def _ensure_user(self, user):\n        \"\"\"\n        Always return a valid user object.\n        \"\"\"\n        user = str(user)\n        is_numeric = re.match(r\"^\\d+$\", user)\n\n        lookup = []\n        if len(user) > 15 or (is_numeric and self._id_exists(user)):\n            lookup = list(self.user_lookup([user]))[0]\n        else:\n            lookup = list(self.user_lookup([user], usernames=True))[0]\n\n        if \"data\" in lookup:\n            return lookup[\"data\"][0]\n        else:\n            raise ValueError(f\"No such user {user}\")\n\n    def _check_for_disconnect(self, data):\n        \"\"\"\n        Look for disconnect errors in a response, and reconnect if found. The\n        function returns True if a disconnect was found and False otherwise.\n        \"\"\"\n        for error in data.get(\"errors\", []):\n            if error.get(\"disconnect_type\") == \"OperationalDisconnect\":\n                log.info(\"Received operational disconnect message, reconnecting\")\n                self.connect()\n                return True\n        return False\n\n\ndef _ts(dt):\n    \"\"\"\n    Return ISO 8601 / RFC 3339 datetime in UTC. If no timezone is specified it\n    is assumed to be in UTC. 
The Twitter API does not accept microseconds.\n\n    Args:\n        dt (datetime): a `datetime` object to format.\n\n    Returns:\n        str: an ISO 8601 / RFC 3339 datetime in UTC.\n    \"\"\"\n    if dt.tzinfo:\n        dt = dt.astimezone(datetime.timezone.utc)\n    else:\n        dt = dt.replace(tzinfo=datetime.timezone.utc)\n    return dt.isoformat(timespec=\"seconds\")\n\n\ndef _utcnow():\n    \"\"\"\n    Return _now_ as an ISO 8601 / RFC 3339 string in UTC.\n\n    Returns:\n        str: Current timestamp in UTC, truncated to seconds.\n    \"\"\"\n    return datetime.datetime.now(datetime.timezone.utc).isoformat(timespec=\"seconds\")\n\n\ndef _append_metadata(result, url):\n    \"\"\"\n    Appends `__twarc` metadata to the result.\n    Adds the full URL with parameters used, the version\n    and current timestamp in seconds.\n\n    Args:\n        result (dict): API Response to append data to.\n        url (str): URL of the API endpoint called.\n\n    Returns:\n        dict: API Response with appended metadata\n    \"\"\"\n    result[\"__twarc\"] = {\"url\": url, \"version\": version, \"retrieved_at\": _utcnow()}\n    return result\n"
  },
  {
    "path": "src/twarc/command.py",
    "content": "from __future__ import print_function\n\nimport os\nimport re\nimport sys\nimport json\nimport signal\nimport codecs\nimport logging\nimport datetime\nimport argparse\nimport fileinput\n\nfrom twarc.client import Twarc\nfrom twarc.version import version\nfrom twarc.json2csv import csv, get_headings, get_row\nfrom dateutil.parser import parse as parse_dt\n\nif sys.version_info[:2] <= (2, 7):\n    # Python 2\n    pyv = 2\n    get_input = raw_input\n    str_type = unicode\n    import ConfigParser as configparser\nelse:\n    # Python 3\n    pyv = 3\n    get_input = input\n    str_type = str\n    import configparser\n\nlog = logging.getLogger(\"twarc\")\n\n\ncommands = [\n    \"configure\",\n    \"dehydrate\",\n    \"filter\",\n    \"followers\",\n    \"friends\",\n    \"help\",\n    \"hydrate\",\n    \"replies\",\n    \"retweets\",\n    \"sample\",\n    \"search\",\n    \"timeline\",\n    \"trends\",\n    \"tweet\",\n    \"users\",\n    \"listmembers\",\n    \"version\",\n]\n\n\ndef main():\n    parser = get_argparser()\n    args = parser.parse_args()\n\n    command = args.command\n    query = args.query or \"\"\n\n    logging.basicConfig(\n        filename=args.log,\n        level=logging.INFO,\n        format=\"%(asctime)s %(levelname)s %(message)s\",\n    )\n\n    # log and stop when process receives SIGINT\n    def stop(signal, frame):\n        log.warning(\"process received SIGINT, stopping\")\n        sys.exit(0)\n\n    signal.signal(signal.SIGINT, stop)\n\n    if command == \"version\":\n        print(\"twarc v%s\" % version)\n        sys.exit()\n    elif command == \"help\" or not command:\n        parser.print_help()\n        print(\"\\nPlease use one of the following commands:\\n\")\n        for cmd in commands:\n            print(\" - %s\" % cmd)\n        print(\"\\nFor example:\\n\\n    twarc search blacklivesmatter\")\n        sys.exit(1)\n\n    # Don't validate the keys if the command is \"configure\"\n    if command == \"configure\" or 
args.skip_key_validation:\n        validate_keys = False\n    else:\n        validate_keys = True\n\n    t = Twarc(\n        consumer_key=args.consumer_key,\n        consumer_secret=args.consumer_secret,\n        access_token=args.access_token,\n        access_token_secret=args.access_token_secret,\n        connection_errors=args.connection_errors,\n        http_errors=args.http_errors,\n        config=args.config,\n        profile=args.profile,\n        tweet_mode=args.tweet_mode,\n        protected=args.protected,\n        validate_keys=validate_keys,\n        app_auth=args.app_auth,\n        gnip_auth=args.gnip_auth,\n    )\n\n    # calls that return tweets\n    if command == \"search\":\n        if len(args.lang) > 0:\n            lang = args.lang[0]\n        else:\n            lang = None\n\n        # if not using a premium endpoint do a standard search\n        if not args.thirtyday and not args.fullarchive and not args.gnip_fullarchive:\n            things = t.search(\n                query,\n                since_id=args.since_id,\n                max_id=args.max_id,\n                lang=lang,\n                result_type=args.result_type,\n                geocode=args.geocode,\n            )\n        else:\n            # parse the dates if given\n            from_date = parse_dt(args.from_date) if args.from_date else None\n            to_date = parse_dt(args.to_date) if args.to_date else None\n            if args.gnip_fullarchive:\n                env = args.gnip_fullarchive\n                product = \"gnip_fullarchive\"\n            elif args.thirtyday:\n                env = args.thirtyday\n                product = \"30day\"\n            else:\n                env = args.fullarchive\n                product = \"fullarchive\"\n            things = t.premium_search(\n                query,\n                product,\n                env,\n                from_date=from_date,\n                to_date=to_date,\n                sandbox=args.sandbox,\n       
         limit=args.limit,\n            )\n\n    elif command == \"filter\":\n        things = t.filter(\n            track=query, follow=args.follow, locations=args.locations, lang=args.lang\n        )\n\n    elif command == \"dehydrate\":\n        input_iterator = fileinput.FileInput(\n            query,\n            mode=\"r\",\n            openhook=fileinput.hook_compressed,\n        )\n        things = t.dehydrate(input_iterator)\n\n    elif command == \"hydrate\":\n        input_iterator = fileinput.FileInput(\n            query,\n            mode=\"r\",\n            openhook=fileinput.hook_compressed,\n        )\n        things = t.hydrate(input_iterator)\n\n    elif command == \"tweet\":\n        things = [t.tweet(query)]\n\n    elif command == \"sample\":\n        things = t.sample()\n\n    elif command == \"timeline\":\n        kwargs = {\"max_id\": args.max_id, \"since_id\": args.since_id}\n        if re.match(\"^[0-9]+$\", query):\n            kwargs[\"user_id\"] = query\n        elif query:\n            kwargs[\"screen_name\"] = query\n        things = t.timeline(**kwargs)\n\n    elif command == \"retweets\":\n        if os.path.isfile(query):\n            iterator = fileinput.FileInput(\n                query,\n                mode=\"r\",\n                openhook=fileinput.hook_compressed,\n            )\n            things = t.retweets(tweet_ids=iterator)\n        else:\n            things = t.retweets(tweet_ids=query.split(\",\"))\n\n    elif command == \"users\":\n        if os.path.isfile(query):\n            iterator = fileinput.FileInput(\n                query,\n                mode=\"r\",\n                openhook=fileinput.hook_compressed,\n            )\n            if re.match(\"^[0-9,]+$\", next(open(query))):\n                id_type = \"user_id\"\n            else:\n                id_type = \"screen_name\"\n            things = t.user_lookup(ids=iterator, id_type=id_type)\n        elif re.match(\"^[0-9,]+$\", query):\n            
things = t.user_lookup(ids=query.split(\",\"))\n        else:\n            things = t.user_lookup(ids=query.split(\",\"), id_type=\"screen_name\")\n\n    elif command == \"followers\":\n        things = t.follower_ids(query)\n\n    elif command == \"friends\":\n        things = t.friend_ids(query)\n\n    elif command == \"trends\":\n        # lookup woeid for geo-coordinate if appropriate\n        geo = re.match(\"^([0-9-.]+),([0-9-.]+)$\", query)\n        if geo:\n            lat, lon = map(float, geo.groups())\n            if lat > 180 or lat < -180 or lon > 180 or lon < -180:\n                parser.error(\"LAT and LONG must be within [-180.0, 180.0]\")\n            places = list(t.trends_closest(lat, lon))\n            if len(places) == 0:\n                parser.error(\"Couldn't find WOE ID for %s\" % query)\n            query = places[0][\"woeid\"]\n\n        if not query:\n            things = t.trends_available()\n        else:\n            trends = t.trends_place(query)\n            if trends:\n                things = trends[0][\"trends\"]\n\n    elif command == \"replies\":\n        tweet = t.tweet(query)\n        if not tweet:\n            parser.error(\"tweet with id %s does not exist\" % query)\n        things = t.replies(tweet, args.recursive)\n\n    elif command == \"listmembers\":\n        list_parts = re.match(\"^https://twitter.com/(.+)/lists/(.+)$\", query)\n        if not list_parts:\n            parser.error(\n                \"provide the url for the list, e.g., https://twitter.com/USAFacts/lists/us-armed-forces\"\n            )\n        # group(1) is the list owner's screen name, group(2) is the list slug\n        things = t.list_members(\n            slug=list_parts.group(2), owner_screen_name=list_parts.group(1)\n        )\n\n    elif command == \"configure\":\n        t.configure()\n        sys.exit()\n\n    else:\n        parser.print_help()\n        print(\"\\nPlease use one of the following commands:\\n\")\n        for cmd in commands:\n            print(\" - %s\" % cmd)\n        print(\"\\nFor example:\\n\\n    
twarc search blacklivesmatter\")\n        sys.exit(1)\n\n    # get the output filehandle\n    if args.output:\n        if pyv == 3:\n            fh = codecs.open(args.output, \"wb\", \"utf8\")\n        else:\n            fh = open(args.output, \"w\")\n    else:\n        fh = sys.stdout\n\n    # optionally create a csv writer\n    csv_writer = None\n    if args.format in (\"csv\", \"csv-excel\") and command not in [\n        \"filter\",\n        \"hydrate\",\n        \"replies\",\n        \"retweets\",\n        \"sample\",\n        \"search\",\n        \"timeline\",\n        \"tweet\",\n    ]:\n        parser.error(\"csv output not available for %s\" % command)\n    elif args.format in (\"csv\", \"csv-excel\"):\n        csv_writer = csv.writer(fh)\n        csv_writer.writerow(get_headings())\n\n    line_count = 0\n    file_count = 0\n    for thing in things:\n        # rotate the files if necessary\n        if args.output and args.split and line_count % args.split == 0:\n            file_count += 1\n            fh = codecs.open(numbered_filepath(args.output, file_count), \"wb\", \"utf8\")\n            if csv_writer:\n                csv_writer = csv.writer(fh)\n                csv_writer.writerow(get_headings())\n\n        line_count += 1\n\n        # ready to output\n\n        kind_of = type(thing)\n        if kind_of == str_type:\n            # user or tweet IDs\n            print(thing, file=fh)\n            log.info(\"archived %s\" % thing)\n        elif \"id_str\" in thing:\n            # tweets and users\n            if args.format == \"json\":\n                print(json.dumps(thing), file=fh)\n            elif args.format == \"csv\":\n                csv_writer.writerow(get_row(thing))\n            elif args.format == \"csv-excel\":\n                csv_writer.writerow(get_row(thing, excel=True))\n            log.info(\"archived %s\", thing[\"id_str\"])\n        elif \"woeid\" in thing:\n            # places\n            print(json.dumps(thing), file=fh)\n   
     elif \"tweet_volume\" in thing:\n            # trends\n            print(json.dumps(thing), file=fh)\n        elif \"limit\" in thing:\n            # rate limits\n            t = datetime.datetime.utcfromtimestamp(\n                float(thing[\"limit\"][\"timestamp_ms\"]) / 1000\n            )\n            t = t.isoformat(\"T\") + \"Z\"\n            log.warning(\"%s tweets undelivered at %s\", thing[\"limit\"][\"track\"], t)\n            if args.warnings:\n                print(json.dumps(thing), file=fh)\n        elif \"warning\" in thing:\n            # other warnings\n            log.warning(thing[\"warning\"][\"message\"])\n            if args.warnings:\n                print(json.dumps(thing), file=fh)\n        elif \"data\" in thing:\n            # Labs style JSON schema.\n            print(json.dumps(thing), file=fh)\n\n\ndef get_argparser():\n    \"\"\"\n    Get the command line argument parser.\n    \"\"\"\n\n    parser = argparse.ArgumentParser(\"twarc\")\n    parser.add_argument(\"command\", choices=commands)\n    parser.add_argument(\"query\", nargs=\"?\", default=None)\n    parser.add_argument(\"--log\", dest=\"log\", default=\"twarc.log\", help=\"log file\")\n    parser.add_argument(\"--consumer_key\", default=None, help=\"Twitter API consumer key\")\n    parser.add_argument(\n        \"--consumer_secret\", default=None, help=\"Twitter API consumer secret\"\n    )\n    parser.add_argument(\"--access_token\", default=None, help=\"Twitter API access key\")\n    parser.add_argument(\n        \"--access_token_secret\", default=None, help=\"Twitter API access token secret\"\n    )\n    parser.add_argument(\n        \"--config\", help=\"Config file containing Twitter keys and secrets\"\n    )\n    parser.add_argument(\n        \"--profile\", help=\"Name of a profile in your configuration file\"\n    )\n    parser.add_argument(\n        \"--warnings\", action=\"store_true\", help=\"Include warning messages in output\"\n    )\n    parser.add_argument(\n 
       \"--connection_errors\",\n        type=int,\n        default=\"0\",\n        help=\"Number of connection errors before giving up\",\n    )\n    parser.add_argument(\n        \"--http_errors\",\n        type=int,\n        default=\"0\",\n        help=\"Number of http errors before giving up\",\n    )\n    parser.add_argument(\n        \"--max_id\", dest=\"max_id\", help=\"maximum tweet id to search for\"\n    )\n    parser.add_argument(\"--since_id\", dest=\"since_id\", help=\"smallest id to search for\")\n    parser.add_argument(\n        \"--result_type\",\n        dest=\"result_type\",\n        choices=[\"mixed\", \"recent\", \"popular\"],\n        default=\"recent\",\n        help=\"search result type\",\n    )\n    parser.add_argument(\n        \"--lang\",\n        dest=\"lang\",\n        action=\"append\",\n        default=[],\n        help=\"limit to ISO 639-1 language code\",\n    ),\n    parser.add_argument(\n        \"--geocode\", dest=\"geocode\", help=\"limit by latitude,longitude,radius\"\n    )\n    parser.add_argument(\n        \"--locations\", dest=\"locations\", help=\"limit filter stream to location(s)\"\n    )\n    parser.add_argument(\n        \"--follow\", dest=\"follow\", help=\"limit filter to tweets from given user id(s)\"\n    )\n    parser.add_argument(\n        \"--recursive\",\n        dest=\"recursive\",\n        action=\"store_true\",\n        help=\"also fetch replies to replies\",\n    )\n    parser.add_argument(\n        \"--tweet_mode\",\n        action=\"store\",\n        default=\"extended\",\n        dest=\"tweet_mode\",\n        choices=[\"compat\", \"extended\"],\n        help=\"set tweet mode\",\n    )\n    parser.add_argument(\n        \"--protected\",\n        dest=\"protected\",\n        action=\"store_true\",\n        help=\"include protected tweets\",\n    )\n    parser.add_argument(\n        \"--output\",\n        action=\"store\",\n        default=None,\n        dest=\"output\",\n        help=\"write output to 
file path\",\n    )\n    parser.add_argument(\n        \"--format\",\n        action=\"store\",\n        default=\"json\",\n        dest=\"format\",\n        choices=[\"json\", \"csv\", \"csv-excel\"],\n        help=\"set output format\",\n    )\n    parser.add_argument(\n        \"--split\",\n        action=\"store\",\n        type=int,\n        default=0,\n        help=\"used with --output to split into numbered files\",\n    )\n    parser.add_argument(\n        \"--skip_key_validation\",\n        action=\"store_true\",\n        help=\"skip checking keys are valid on startup\",\n    )\n    parser.add_argument(\n        \"--app_auth\",\n        action=\"store_true\",\n        default=False,\n        help=\"run in App Auth mode instead of User Auth\",\n    )\n    parser.add_argument(\n        \"--gnip_auth\",\n        action=\"store_true\",\n        default=False,\n        help=\"run in Gnip Auth mode (for enterprise APIs)\",\n    )\n    parser.add_argument(\n        \"--30day\",\n        action=\"store\",\n        dest=\"thirtyday\",\n        help=\"environment to use to search 30day premium endpoint\",\n    )\n    parser.add_argument(\n        \"--fullarchive\",\n        action=\"store\",\n        help=\"environment to use to search fullarchive premium endpoint\",\n    ),\n    parser.add_argument(\n        \"--gnip_fullarchive\",\n        action=\"store\",\n        help=\"environment to use to search gnip fullarchive enterprise endpoint\",\n    ),\n    parser.add_argument(\n        \"--from_date\",\n        action=\"store\",\n        default=None,\n        help=\"limit premium search to date e.g. 2012-05-01 03:04:01\",\n    )\n    parser.add_argument(\n        \"--to_date\",\n        action=\"store\",\n        default=None,\n        help=\"limit premium search to date e.g. 
2012-05-01 03:04:01\",\n    )\n    parser.add_argument(\n        \"--limit\",\n        type=int,\n        default=0,\n        help=\"limit number of tweets returned by Premium API\",\n    )\n    parser.add_argument(\n        \"--sandbox\",\n        action=\"store_true\",\n        default=False,\n        help=\"indicate that Premium API endpoint is a sandbox\",\n    )\n\n    return parser\n\n\ndef numbered_filepath(filepath, num):\n    path, ext = os.path.splitext(filepath)\n    return os.path.join(\"{}-{:0>3}{}\".format(path, num, ext))\n"
  },
  {
    "path": "src/twarc/command2.py",
    "content": "\"\"\"\nThe command line interface to the Twitter v2 API.\n\"\"\"\n\nimport os\nimport re\nimport json\nimport time\nimport twarc\nimport click\nimport logging\nimport pathlib\nimport datetime\nimport humanize\nimport requests\nimport configobj\nimport threading\n\nfrom tqdm.auto import tqdm\nfrom tqdm.utils import CallbackIOWrapper\n\nfrom datetime import timezone\nfrom click_plugins import with_plugins\nfrom importlib.metadata import entry_points\n\nfrom twarc.version import version\nfrom twarc.handshake import handshake\nfrom twarc.config import ConfigProvider\nfrom twarc.expansions import (\n    ensure_flattened,\n    EXPANSIONS,\n    TWEET_FIELDS,\n    USER_FIELDS,\n    MEDIA_FIELDS,\n    POLL_FIELDS,\n    PLACE_FIELDS,\n    LIST_FIELDS,\n)\nfrom click import Option, UsageError\nfrom click_config_file import configuration_option\nfrom twarc.decorators2 import (\n    cli_api_error,\n    TimestampProgressBar,\n    FileSizeProgressBar,\n    FileLineProgressBar,\n    _millis2snowflake,\n    _date2millis,\n)\n\n\nconfig_provider = ConfigProvider()\nlog = logging.getLogger(\"twarc\")\n\n\n@with_plugins(entry_points(group=\"twarc.plugins\"))\n@click.group()\n@click.option(\n    \"--consumer-key\",\n    type=str,\n    envvar=\"CONSUMER_KEY\",\n    help='Twitter app consumer key (aka \"App Key\")',\n)\n@click.option(\n    \"--consumer-secret\",\n    type=str,\n    envvar=\"CONSUMER_SECRET\",\n    help='Twitter app consumer secret (aka \"App Secret\")',\n)\n@click.option(\n    \"--access-token\",\n    type=str,\n    envvar=\"ACCESS_TOKEN\",\n    help=\"Twitter app access token for user authentication.\",\n)\n@click.option(\n    \"--access-token-secret\",\n    type=str,\n    envvar=\"ACCESS_TOKEN_SECRET\",\n    help=\"Twitter app access token secret for user authentication.\",\n)\n@click.option(\n    \"--bearer-token\",\n    type=str,\n    envvar=\"BEARER_TOKEN\",\n    help=\"Twitter app access bearer token.\",\n)\n@click.option(\n    
\"--app-auth/--user-auth\",\n    default=True,\n    help=\"Use application authentication or user authentication. Some rate limits are \"\n    \"higher with user authentication, but not all endpoints are supported.\",\n    show_default=True,\n)\n@click.option(\"--log\", \"-l\", \"log_file\", default=\"twarc.log\")\n@click.option(\"--verbose\", is_flag=True, default=False)\n@click.option(\n    \"--metadata/--no-metadata\",\n    default=True,\n    show_default=True,\n    help=\"Include/don't include metadata about when and how data was collected.\",\n)\n@configuration_option(\n    cmd_name=\"twarc\", config_file_name=\"config\", provider=config_provider\n)\n@click.pass_context\ndef twarc2(\n    ctx,\n    consumer_key,\n    consumer_secret,\n    access_token,\n    access_token_secret,\n    bearer_token,\n    log_file,\n    metadata,\n    app_auth,\n    verbose,\n):\n    \"\"\"\n    Collect data from the Twitter V2 API.\n    \"\"\"\n    logging.basicConfig(\n        filename=log_file,\n        level=logging.DEBUG if verbose else logging.INFO,\n        format=\"%(asctime)s %(levelname)s %(message)s\",\n    )\n\n    log.info(\"using config %s\", config_provider.file_path)\n\n    if bearer_token or (consumer_key and consumer_secret):\n        if app_auth and (bearer_token or (consumer_key and consumer_secret)):\n            ctx.obj = twarc.Twarc2(\n                consumer_key=consumer_key,\n                consumer_secret=consumer_secret,\n                bearer_token=bearer_token,\n                metadata=metadata,\n            )\n        # Check everything is present for user auth.\n        elif consumer_key and consumer_secret and access_token and access_token_secret:\n            ctx.obj = twarc.Twarc2(\n                consumer_key=consumer_key,\n                consumer_secret=consumer_secret,\n                access_token=access_token,\n                access_token_secret=access_token_secret,\n                metadata=metadata,\n            )\n        else:\n     
       click.echo(\n                click.style(\n                    \"🙃  To use user authentication, you need all of the following:\\n\"\n                    \"- consumer_key\\n\"\n                    \"- consumer_secret\\n\"\n                    \"- access_token\\n\"\n                    \"- access_token_secret\\n\",\n                    fg=\"red\",\n                ),\n                err=True,\n            )\n            click.echo(\"You can configure twarc2 using the `twarc2 configure` command.\")\n    else:\n        click.echo()\n        click.echo(\"👋  Hi I don't see a configuration file yet, so let's make one.\")\n        click.echo()\n        click.echo(\"Please follow these steps:\")\n        click.echo()\n        click.echo(\"1. visit https://developer.twitter.com/en/portal/\")\n        click.echo(\"2. create a project and an app\")\n        click.echo(\"3. go to your Keys and Tokens and generate your keys\")\n        click.echo()\n        ctx.invoke(configure)\n\n\n@twarc2.command(\"configure\")\n@click.pass_context\ndef configure(ctx):\n    \"\"\"\n    Set up your Twitter app keys.\n    \"\"\"\n\n    config_file = config_provider.file_path\n    log.info(\"creating config file: %s\", config_file)\n\n    config_dir = pathlib.Path(config_file).parent\n    if not config_dir.is_dir():\n        log.info(\"creating config directory: %s\", config_dir)\n        config_dir.mkdir(parents=True)\n\n    keys = handshake()\n    if keys is None:\n        raise click.ClickException(\"Unable to authenticate\")\n\n    config = configobj.ConfigObj(unrepr=True)\n    config.filename = config_file\n\n    # Only write non empty keys.\n    for key in [\n        \"consumer_key\",\n        \"consumer_secret\",\n        \"access_token\",\n        \"access_token_secret\",\n        \"bearer_token\",\n    ]:\n        if keys.get(key, None):\n            config[key] = keys[key]\n\n    config.write()\n\n    click.echo(\n        click.style(f\"\\nYour keys have been written to 
{config_file}\", fg=\"green\")\n    )\n    click.echo()\n    click.echo(\"\\n✨ ✨ ✨  Happy twarcing! ✨ ✨ ✨\\n\")\n\n    ctx.exit()\n\n\n@twarc2.command(\"version\")\ndef get_version():\n    \"\"\"\n    Return the version of twarc that is installed.\n    \"\"\"\n    click.echo(f\"twarc v{version}\")\n\n\ndef _search(\n    T,\n    query,\n    outfile,\n    since_id,\n    until_id,\n    start_time,\n    end_time,\n    limit,\n    max_results,\n    archive,\n    hide_progress,\n    expansions,\n    tweet_fields,\n    user_fields,\n    media_fields,\n    poll_fields,\n    place_fields,\n    sort_order,\n):\n    \"\"\"\n    Common function to Search for tweets.\n    \"\"\"\n    count = 0\n\n    # Make sure times are always in UTC, click sometimes doesn't add timezone:\n    if start_time is not None and start_time.tzinfo is None:\n        start_time = start_time.replace(tzinfo=timezone.utc)\n    if end_time is not None and end_time.tzinfo is None:\n        end_time = end_time.replace(tzinfo=timezone.utc)\n\n    if archive:\n        search_method = T.search_all\n\n        # start time defaults to the beginning of Twitter to override the\n        # default of the last month. 
Only do this if start_time is not already\n        # specified and since_id and until_id aren't being used\n        if start_time is None and since_id is None and until_id is None:\n            start_time = datetime.datetime(2006, 3, 21, tzinfo=datetime.timezone.utc)\n    else:\n        search_method = T.search_recent\n\n    hide_progress = True if (outfile.name == \"<stdout>\") else hide_progress\n\n    with TimestampProgressBar(\n        since_id, until_id, start_time, end_time, disable=hide_progress\n    ) as progress:\n        for result in search_method(\n            query=query,\n            since_id=since_id,\n            until_id=until_id,\n            start_time=start_time,\n            end_time=end_time,\n            max_results=max_results,\n            expansions=expansions,\n            tweet_fields=tweet_fields,\n            user_fields=user_fields,\n            media_fields=media_fields,\n            poll_fields=poll_fields,\n            place_fields=place_fields,\n            sort_order=sort_order,\n        ):\n            _write(result, outfile)\n            tweet_ids = [t[\"id\"] for t in result.get(\"data\", [])]\n            log.info(\"archived %s\", \",\".join(tweet_ids))\n            progress.update_with_result(result)\n            # a page may carry no \"data\" key, so count defensively\n            count += len(result.get(\"data\", []))\n            if limit != 0 and count >= limit:\n                # Display message when stopped early\n                progress.desc = f\"Set --limit of {limit} reached\"\n                break\n        else:\n            progress.early_stop = False\n\n\nclass MutuallyExclusiveOption(Option):\n    \"\"\"\n    Custom click class to make some options mutually exclusive\n    via https://gist.github.com/jacobtolar/fb80d5552a9a9dfc32b12a829fa21c0c\n    \"\"\"\n\n    def __init__(self, *args, **kwargs):\n        self.mutually_exclusive = set(kwargs.pop(\"mutually_exclusive\", []))\n        help = kwargs.get(\"help\", \"\")\n        if self.mutually_exclusive:\n            ex_str = \", 
\".join(\n                self.parse_name(name) for name in self.mutually_exclusive\n            )\n            kwargs[\"help\"] = help + (\n                \" NOTE: This argument is mutually exclusive with \"\n                \" arguments: [\" + ex_str + \"].\"\n            )\n        super(MutuallyExclusiveOption, self).__init__(*args, **kwargs)\n\n    def parse_name(self, name):\n        return f'--{name.replace(\"_\",\"-\")}'\n\n    def handle_parse_result(self, ctx, opts, args):\n        if self.mutually_exclusive.intersection(opts) and self.name in opts:\n            raise UsageError(\n                f\"Incorrect usage: {self.parse_name(self.name)} is mutually exclusive with \"\n                f\"arguments `{', '.join(self.parse_name(name) for name in self.mutually_exclusive)} use either one or the other.\"\n            )\n\n        return super(MutuallyExclusiveOption, self).handle_parse_result(ctx, opts, args)\n\n\ndef command_line_input_output_file_arguments(f):\n    \"\"\"\n    Decorator for specifying input and output file arguments in a command\n    \"\"\"\n    f = click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")(f)\n    f = click.argument(\"infile\", type=click.File(\"r\"), default=\"-\")(f)\n    return f\n\n\ndef command_line_progressbar_option(f):\n    \"\"\"\n    Decorator for specifying a progress bar option.\n    \"\"\"\n    f = click.option(\n        \"--hide-progress\",\n        is_flag=True,\n        default=False,\n        help=\"Hide the Progress bar. 
Default: show progress, unless using pipes.\",\n    )(f)\n    return f\n\n\ndef command_line_search_options(f):\n    \"\"\"\n    Decorator for specifying time range search API parameters.\n    \"\"\"\n    f = click.option(\n        \"--until-id\", type=int, help=\"Match tweets sent prior to tweet id\"\n    )(f)\n    f = click.option(\"--since-id\", type=int, help=\"Match tweets sent after tweet id\")(f)\n    f = click.option(\n        \"--end-time\",\n        type=click.DateTime(formats=(\"%Y-%m-%d\", \"%Y-%m-%dT%H:%M:%S\")),\n        help='Match tweets sent before UTC time (ISO 8601/RFC 3339), \\n e.g.  --end-time \"2021-01-01T12:31:04\"',\n    )(f)\n    f = click.option(\n        \"--start-time\",\n        type=click.DateTime(formats=(\"%Y-%m-%d\", \"%Y-%m-%dT%H:%M:%S\")),\n        help='Match tweets created after UTC time (ISO 8601/RFC 3339), \\n e.g.  --start-time \"2021-01-01T12:31:04\"',\n    )(f)\n    return f\n\n\ndef command_line_timelines_options(f):\n    \"\"\"\n    Decorator for common timelines command line options\n    \"\"\"\n    f = click.option(\n        \"--exclude-replies\",\n        is_flag=True,\n        default=False,\n        help=\"Exclude replies from timeline\",\n    )(f)\n    f = click.option(\n        \"--exclude-retweets\",\n        is_flag=True,\n        default=False,\n        help=\"Exclude retweets from timeline\",\n    )(f)\n    f = click.option(\n        \"--use-search\",\n        is_flag=True,\n        default=False,\n        help=\"Use the search/all API endpoint which is not limited to the last 3200 tweets, but requires Academic Product Track access.\",\n    )(f)\n    return f\n\n\ndef _validate_max_results(context, parameter, value):\n    \"\"\"\n    Validate and set appropriate max_results parameter.\n    \"\"\"\n\n    archive_set = \"archive\" in context.params and context.params[\"archive\"]\n    no_context_annotations_set = (\n        \"no_context_annotations\" in context.params\n        and 
context.params[\"no_context_annotations\"]\n    )\n    minimal_fields_set = (\n        \"minimal_fields\" in context.params and context.params[\"minimal_fields\"]\n    )\n    has_context_annotations = (\n        \"tweet_fields\" in context.params\n        and \"context_annotations\" in context.params[\"tweet_fields\"].split(\",\")\n    )\n\n    if value:\n        if not archive_set and value > 100:\n            raise click.BadParameter(\n                \"--max-results cannot be greater than 100 when using Standard Access. Specify --archive if you have Academic Access.\"\n            )\n        if value < 10 or value > 500:\n            raise click.BadParameter(\"--max-results must be between 10 and 500\")\n        if value > 100 and (has_context_annotations and not no_context_annotations_set):\n            raise click.BadParameter(\n                \"--max-results cannot be greater than 100 when using context annotations. Set --no-context-annotations to remove them, or don't specify them in --tweet-fields.\"\n            )\n\n        return value\n\n    else:\n        if archive_set and (\n            no_context_annotations_set\n            or minimal_fields_set\n            or not has_context_annotations\n        ):\n            return 500\n\n        return 100\n\n\ndef command_line_search_archive_options(f):\n    \"\"\"\n    Decorator for specifying additional search API parameters.\n    \"\"\"\n    f = click.option(\n        \"--limit\", default=0, help=\"Maximum number of tweets to save\", type=int\n    )(f)\n    f = click.option(\n        \"--max-results\",\n        default=None,\n        help=\"Maximum number of tweets per API response\",\n        callback=_validate_max_results,\n        type=int,\n    )(f)\n    f = click.option(\n        \"--archive\",\n        is_flag=True,\n        default=False,\n        is_eager=True,\n        help=\"Use the full archive (requires Academic Research track)\",\n    )(f)\n    return f\n\n\ndef _validate_expansions(context, 
parameter, value):\n    \"\"\"\n    Validate passed comma separated values for expansions.\n    \"\"\"\n    if value:\n        values = value.split(\",\")\n        valid = parameter.default.split(\",\")\n        for v in values:\n            if v not in valid:\n                raise click.BadOptionUsage(\n                    parameter.name,\n                    f'\"{v}\" is not a valid entry for --{parameter.name}. Must be a comma separated string, without spaces, like this:\\n--{parameter.name} \"{parameter.default}\"',\n                )\n        return \",\".join(values)\n\n\ndef command_line_expansions_options(f):\n    \"\"\"\n    Decorator for specifying custom fields and expansions\n    \"\"\"\n    f = click.option(\n        \"--poll-fields\",\n        default=\",\".join(POLL_FIELDS),\n        type=click.STRING,\n        help=\"Comma separated list of poll fields to retrieve. Default is all available.\",\n        callback=_validate_expansions,\n    )(f)\n    f = click.option(\n        \"--place-fields\",\n        default=\",\".join(PLACE_FIELDS),\n        type=click.STRING,\n        help=\"Comma separated list of place fields to retrieve. Default is all available.\",\n        callback=_validate_expansions,\n    )(f)\n    f = click.option(\n        \"--media-fields\",\n        default=\",\".join(MEDIA_FIELDS),\n        type=click.STRING,\n        help=\"Comma separated list of media fields to retrieve. Default is all available.\",\n        callback=_validate_expansions,\n    )(f)\n    f = click.option(\n        \"--user-fields\",\n        default=\",\".join(USER_FIELDS),\n        type=click.STRING,\n        help=\"Comma separated list of user fields to retrieve. Default is all available.\",\n        callback=_validate_expansions,\n    )(f)\n    f = click.option(\n        \"--tweet-fields\",\n        default=\",\".join(TWEET_FIELDS),\n        type=click.STRING,\n        is_eager=True,\n        help=\"Comma separated list of tweet fields to retrieve. 
Default is all available.\",\n        callback=_validate_expansions,\n    )(f)\n    f = click.option(\n        \"--expansions\",\n        default=\",\".join(EXPANSIONS),\n        type=click.STRING,\n        help=\"Comma separated list of expansions to retrieve. Default is all available.\",\n        callback=_validate_expansions,\n    )(f)\n    return f\n\n\ndef command_line_expansions_shortcuts(f):\n    \"\"\"\n    Decorator for specifying common fields and expansions presets\n    \"\"\"\n    f = click.option(\n        \"--minimal-fields\",\n        cls=MutuallyExclusiveOption,\n        mutually_exclusive=[\n            \"no_context_annotations\",\n            \"expansions\",\n            \"tweet_fields\",\n            \"user_fields\",\n            \"media_fields\",\n            \"poll_fields\",\n            \"place_fields\",\n            \"counts_only\",\n        ],\n        is_flag=True,\n        default=False,\n        is_eager=True,\n        help=\"By default twarc gets all available data. This option requests the minimal retrievable amount of data - only IDs and object references are retrieved. Setting this makes --max-results 500 the default.\",\n    )(f)\n    f = click.option(\n        \"--no-context-annotations\",\n        cls=MutuallyExclusiveOption,\n        mutually_exclusive=[\n            \"minimal_fields\",\n            \"expansions\",\n            \"tweet_fields\",\n            \"user_fields\",\n            \"media_fields\",\n            \"poll_fields\",\n            \"place_fields\",\n            \"counts_only\",\n        ],\n        is_flag=True,\n        default=False,\n        is_eager=True,\n        help=\"By default twarc gets all available data. This leaves out context annotations (Twitter API limits --max-results to 100 if these are requested). 
Setting this makes --max-results 500 the default.\",\n    )(f)\n    return f\n\n\ndef _process_expansions_shortcuts(kwargs):\n    # Override fields and expansions\n    if kwargs.pop(\"minimal_fields\", None):\n        kwargs[\n            \"expansions\"\n        ] = \"author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,attachments.poll_ids,attachments.media_keys,geo.place_id\"\n        kwargs[\n            \"tweet_fields\"\n        ] = \"id,conversation_id,author_id,in_reply_to_user_id,referenced_tweets,geo\"\n        kwargs[\n            \"user_fields\"\n        ] = \"id,username,name,pinned_tweet_id\"  # pinned_tweet_id is the only extra one, id,username,name are always returned.\n        kwargs[\"media_fields\"] = \"media_key\"\n        kwargs[\"poll_fields\"] = \"id\"\n        kwargs[\"place_fields\"] = \"id\"\n\n    if kwargs.pop(\"no_context_annotations\", None):\n        kwargs[\n            \"tweet_fields\"\n        ] = \"attachments,author_id,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld\"\n\n    return kwargs\n\n\ndef command_line_verbose_options(f):\n    \"\"\"\n    Decorator for specifying verbose and json output\n    \"\"\"\n    f = click.option(\n        \"--verbose\",\n        is_flag=True,\n        default=False,\n        help=\"Show all URLs and metadata.\",\n    )(f)\n    f = click.option(\n        \"--json-output\",\n        is_flag=True,\n        default=False,\n        help=\"Return the raw json content from the API.\",\n    )(f)\n    return f\n\n\n@twarc2.command(\"search\")\n@click.option(\n    \"--sort-order\",\n    type=click.Choice([\"recency\", \"relevancy\"]),\n    help='Filter tweets based on their date (\"recency\") (default) or based on their relevance as indicated by Twitter 
(\"relevancy\")',\n)\n@command_line_search_options\n@command_line_search_archive_options\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@command_line_progressbar_option\n@click.argument(\"query\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef search(\n    T,\n    query,\n    outfile,\n    **kwargs,\n):\n    \"\"\"\n    Search for tweets. For help on how to write a query see https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    return _search(\n        T,\n        query,\n        outfile,\n        **kwargs,\n    )\n\n\n@twarc2.command(\"counts\")\n@command_line_search_options\n@click.option(\n    \"--archive\",\n    is_flag=True,\n    default=False,\n    help=\"Count using the full archive (requires Academic Research track)\",\n)\n@click.option(\n    \"--granularity\",\n    default=\"hour\",\n    type=click.Choice([\"day\", \"hour\", \"minute\"], case_sensitive=False),\n    help=\"Aggregation level for counts. Can be one of: day, hour, minute. 
Default is hour.\",\n)\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of days of results to save (minimum is 30 days)\",\n)\n@click.option(\n    \"--text\",\n    is_flag=True,\n    default=False,\n    help=\"Output the counts as human readable text\",\n)\n@click.option(\"--csv\", is_flag=True, default=False, help=\"Output counts as CSV\")\n@command_line_progressbar_option\n@click.argument(\"query\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef counts(\n    T,\n    query,\n    outfile,\n    since_id,\n    until_id,\n    start_time,\n    end_time,\n    archive,\n    granularity,\n    limit,\n    text,\n    csv,\n    hide_progress,\n):\n    \"\"\"\n    Return counts of tweets matching a query.\n    \"\"\"\n    count = 0\n\n    # Make sure times are always in UTC, click sometimes doesn't add timezone:\n    if start_time is not None and start_time.tzinfo is None:\n        start_time = start_time.replace(tzinfo=timezone.utc)\n    if end_time is not None and end_time.tzinfo is None:\n        end_time = end_time.replace(tzinfo=timezone.utc)\n\n    if archive:\n        count_method = T.counts_all\n        # start time defaults to the beginning of Twitter to override the\n        # default of the last month. 
Only do this if start_time is not already\n        # specified and since_id/until_id aren't being used\n        if start_time is None and since_id is None and until_id is None:\n            start_time = datetime.datetime(2006, 3, 21, tzinfo=datetime.timezone.utc)\n    else:\n        count_method = T.counts_recent\n\n    if csv:\n        click.echo(f\"start,end,{granularity}_count\", file=outfile)\n\n    hide_progress = True if (outfile.name == \"<stdout>\") else hide_progress\n    total_tweets = 0\n\n    with TimestampProgressBar(\n        since_id, until_id, start_time, end_time, disable=hide_progress\n    ) as progress:\n        for result in count_method(\n            query,\n            since_id,\n            until_id,\n            start_time,\n            end_time,\n            granularity,\n        ):\n            # Count outputs:\n            if text:\n                for r in result[\"data\"]:\n                    total_tweets += r[\"tweet_count\"]\n                    click.echo(\n                        \"{start} - {end}: {tweet_count:,}\".format(**r), file=outfile\n                    )\n            elif csv:\n                for r in result[\"data\"]:\n                    click.echo(\n                        f'{r[\"start\"]},{r[\"end\"]},{r[\"tweet_count\"]}', file=outfile\n                    )\n            else:\n                _write(result, outfile)\n\n            # Progress and limits:\n            if len(result[\"data\"]) > 0:\n                progress.update_with_dates(\n                    result[\"data\"][0][\"start\"], result[\"data\"][-1][\"end\"]\n                )\n                progress.tweet_count += result[\"meta\"][\"total_tweet_count\"]\n            count += len(result[\"data\"])\n\n            if limit != 0 and count >= limit:\n                break\n            if text:\n                click.echo(\n                    click.style(\n                        \"\\nTotal Tweets: {:,}\\n\".format(total_tweets), fg=\"green\"\n           
         ),\n                    file=outfile,\n                )\n        else:\n            progress.early_stop = False\n\n\n@twarc2.command(\"tweet\")\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@click.option(\"--pretty\", is_flag=True, default=False, help=\"Pretty print the JSON\")\n@click.argument(\"tweet_id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef tweet(T, tweet_id, outfile, pretty, **kwargs):\n    \"\"\"\n    Look up a tweet using its tweet id or URL.\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    if \"https\" in tweet_id:\n        tweet_id = tweet_id.split(\"/\")[-1]\n    if not re.match(r\"^\\d+$\", tweet_id):\n        click.echo(click.style(\"Please enter a tweet URL or ID\", fg=\"red\"), err=True)\n    result = next(T.tweet_lookup([tweet_id], **kwargs))\n    _write(result, outfile, pretty=pretty)\n\n\n@twarc2.command(\"followers\")\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of followers to save. Increments of 1000 or --max-results if set.\",\n    type=int,\n)\n@click.option(\n    \"--max-results\",\n    default=1000,\n    help=\"Maximum number of users per page. 
Default is 1000.\",\n    type=int,\n)\n@command_line_progressbar_option\n@click.argument(\"user\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef followers(T, user, outfile, limit, max_results, hide_progress):\n    \"\"\"\n    Get the followers for a given user.\n    \"\"\"\n    user_id = None\n    lookup_total = 1\n\n    hide_progress = True if (outfile.name == \"<stdout>\") else hide_progress\n\n    if not hide_progress:\n        target_user = T._ensure_user(user)\n        user_id = target_user[\"id\"]\n        lookup_total = target_user[\"public_metrics\"][\"followers_count\"]\n\n    _write_with_progress(\n        func=T.followers,\n        user=user,\n        user_id=user_id,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=lookup_total,\n        max_results=max_results,\n    )\n\n\n@twarc2.command(\"following\")\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of friends to save. Increments of 1000 or --max-results if set.\",\n    type=int,\n)\n@click.option(\n    \"--max-results\",\n    default=1000,\n    help=\"Maximum number of users per page. 
Default is 1000.\",\n    type=int,\n)\n@command_line_progressbar_option\n@click.argument(\"user\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef following(T, user, outfile, limit, max_results, hide_progress):\n    \"\"\"\n    Get the users that a given user is following.\n    \"\"\"\n    user_id = None\n    lookup_total = 1\n\n    hide_progress = True if (outfile.name == \"<stdout>\") else hide_progress\n\n    if not hide_progress:\n        target_user = T._ensure_user(user)\n        user_id = target_user[\"id\"]\n        lookup_total = target_user[\"public_metrics\"][\"following_count\"]\n\n    _write_with_progress(\n        func=T.following,\n        user=user,\n        user_id=user_id,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=lookup_total,\n        max_results=max_results,\n    )\n\n\n@twarc2.command(\"liking-users\")\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of liking users to retrieve. Increments of 100 or --max-results if set.\",\n    type=int,\n)\n@click.option(\n    \"--max-results\",\n    default=100,\n    help=\"Maximum number of users (likes) per page. 
Default and maximum is 100.\",\n    type=int,\n)\n@command_line_progressbar_option\n@click.argument(\"tweet_id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef liking_users(T, tweet_id, outfile, limit, max_results, hide_progress):\n    \"\"\"\n    Get the users that liked a specific tweet.\n\n    Note that the progress bar is approximate.\n\n    \"\"\"\n    lookup_total = 1\n\n    if not re.match(r\"^\\d+$\", str(tweet_id)):\n        click.echo(click.style(\"Please enter a tweet ID\", fg=\"red\"), err=True)\n\n    hide_progress = True if (outfile.name == \"<stdout>\") else hide_progress\n\n    if not hide_progress:\n        # TODO: we could probably do this every time, and avoid doing any lookups\n        # for tweets that don't exist anymore.\n        target_tweet = list(T.tweet_lookup([tweet_id]))[0]\n        if \"data\" in target_tweet:\n            lookup_total = target_tweet[\"data\"][0][\"public_metrics\"][\"like_count\"]\n\n    _write_with_progress(\n        func=T.liking_users,\n        tweet_id=tweet_id,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=lookup_total,\n        max_results=max_results,\n    )\n\n\n@twarc2.command(\"retweeted-by\")\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of retweeting users to retrieve. Increments of 100 or --max-results if set.\",\n    type=int,\n)\n@click.option(\n    \"--max-results\",\n    default=100,\n    help=\"Maximum number of users (retweets) per page of results. 
Default and maximum is 100.\",\n    type=int,\n)\n@command_line_progressbar_option\n@click.argument(\"tweet_id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef retweeted_by(T, tweet_id, outfile, limit, max_results, hide_progress):\n    \"\"\"\n    Get the users that retweeted a specific tweet.\n\n    Note that the progress bar is approximate.\n\n    \"\"\"\n    lookup_total = 0\n\n    if not re.match(r\"^\\d+$\", str(tweet_id)):\n        click.echo(click.style(\"Please enter a tweet ID\", fg=\"red\"), err=True)\n\n    hide_progress = True if (outfile.name == \"<stdout>\") else hide_progress\n\n    if not hide_progress:\n        # TODO: we could probably do this every time, and avoid doing any lookups\n        # for tweets that don't exist anymore.\n        target_tweet = list(T.tweet_lookup([tweet_id]))[0]\n        if \"data\" in target_tweet:\n            lookup_total = target_tweet[\"data\"][0][\"public_metrics\"][\"retweet_count\"]\n\n    _write_with_progress(\n        func=T.retweeted_by,\n        tweet_id=tweet_id,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=lookup_total,\n        max_results=max_results,\n    )\n\n\n@twarc2.command(\"quotes\")\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of quote tweets to retrieve. Increments of 100 or --max-results if set.\",\n    type=int,\n)\n@click.option(\n    \"--max-results\",\n    default=100,\n    help=\"Maximum number of quote tweets per page of results. 
Default and maximum is 100.\",\n    type=int,\n)\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@command_line_progressbar_option\n@click.argument(\"tweet_id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef quotes(T, tweet_id, outfile, limit, max_results, hide_progress, **kwargs):\n    \"\"\"\n    Get the tweets that quote tweet the given tweet.\n\n    Note that the progress bar is approximate.\n    \"\"\"\n    count = 0\n    lookup_total = 0\n    kwargs = _process_expansions_shortcuts(kwargs)\n    # Also remove media poll and place from kwargs, these are not valid for this endpoint:\n    kwargs.pop(\"media_fields\", None)\n    kwargs.pop(\"poll_fields\", None)\n    kwargs.pop(\"place_fields\", None)\n\n    if not re.match(r\"^\\d+$\", str(tweet_id)):\n        click.echo(click.style(\"Please enter a tweet ID\", fg=\"red\"), err=True)\n\n    hide_progress = True if (outfile.name == \"<stdout>\") else hide_progress\n\n    if not hide_progress:\n        target_tweet = list(T.tweet_lookup([tweet_id]))[0]\n        if \"data\" in target_tweet:\n            lookup_total = target_tweet[\"data\"][0][\"public_metrics\"][\"quote_count\"]\n\n    _write_with_progress(\n        func=T.quotes,\n        tweet_id=tweet_id,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=lookup_total,\n        max_results=max_results,\n        **kwargs,\n    )\n\n\n@twarc2.command(\"liked-tweets\")\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of liked tweets to retrieve. Increments of 100 or --max-results if set.\",\n    type=int,\n)\n@click.option(\n    \"--max-results\",\n    default=100,\n    help=\"Maximum number of liked tweets per page of results. 
Default and maximum is 100.\",\n    type=int,\n)\n@command_line_progressbar_option\n@click.argument(\"user_id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef liked_tweets(T, user_id, outfile, limit, max_results, hide_progress):\n    \"\"\"\n    Get the tweets liked by a specific user_id.\n\n    Note that the progress bar is approximate.\n\n    \"\"\"\n\n    # NB: there doesn't appear to be any way to get the total count of likes\n    # a user has made, so the progress bar isn't very useful in this case...\n    _write_with_progress(\n        func=T.liked_tweets,\n        user_id=user_id,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=1,\n        max_results=max_results,\n    )\n\n\n@twarc2.command(\"sample\")\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@click.option(\"--limit\", default=0, help=\"Maximum number of tweets to save\")\n@click.argument(\"outfile\", type=click.File(\"a+\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef sample(T, outfile, limit, **kwargs):\n    \"\"\"\n    Fetch tweets from the sample stream.\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    count = 0\n    event = threading.Event()\n    click.echo(\n        click.style(\n            f\"Started a random sample stream, writing to {outfile.name}\\nCTRL+C to stop...\",\n            fg=\"green\",\n        ),\n        err=True,\n    )\n    for result in T.sample(event=event, **kwargs):\n        count += 1\n        if limit != 0 and count >= limit:\n            event.set()\n        _write(result, outfile)\n\n        if result and \"data\" in result:\n            log.info(\"archived %s\", 
result[\"data\"][\"id\"])\n\n\n@twarc2.command(\"hydrate\")\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@command_line_input_output_file_arguments\n@command_line_progressbar_option\n@click.pass_obj\n@cli_api_error\ndef hydrate(T, infile, outfile, hide_progress, **kwargs):\n    \"\"\"\n    Hydrate tweet ids.\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    with FileLineProgressBar(infile, outfile, disable=hide_progress) as progress:\n        for result in T.tweet_lookup(infile, **kwargs):\n            _write(result, outfile)\n            tweet_ids = [t[\"id\"] for t in result.get(\"data\", [])]\n            log.info(\"archived %s\", \",\".join(tweet_ids))\n            progress.update_with_result(result, error_resource_type=\"tweet\")\n\n\n@twarc2.command(\"dehydrate\")\n@click.option(\n    \"--id-type\",\n    default=\"tweets\",\n    type=click.Choice([\"tweets\", \"users\"], case_sensitive=False),\n    help=\"IDs to extract - either 'tweets' or 'users'.\",\n)\n@command_line_progressbar_option\n@command_line_input_output_file_arguments\n@cli_api_error\ndef dehydrate(infile, outfile, id_type, hide_progress):\n    \"\"\"\n    Extract tweet or user IDs from a dataset.\n    \"\"\"\n    if infile.name == outfile.name:\n        click.echo(\n            click.style(\n                f\"💔 Cannot extract files in-place, specify a different output file!\",\n                fg=\"red\",\n            ),\n            err=True,\n        )\n        return\n\n    with FileSizeProgressBar(infile, outfile, disable=hide_progress) as progress:\n        count = 0\n        unique_ids = set()\n        for line in infile:\n            count += 1\n            progress.update(len(line))\n\n            # ignore empty lines\n            line = line.strip()\n            if not line:\n                continue\n\n            try:\n                for tweet in ensure_flattened(json.loads(line)):\n                    if id_type == \"tweets\":\n       
                 click.echo(tweet[\"id\"], file=outfile)\n                        unique_ids.add(tweet[\"id\"])\n                    elif id_type == \"users\":\n                        click.echo(tweet[\"author_id\"], file=outfile)\n                        unique_ids.add(tweet[\"author_id\"])\n            except KeyError as e:\n                click.echo(\n                    f\"No {id_type} ID found in JSON data on line {count}\", err=True\n                )\n                break\n            except ValueError as e:\n                click.echo(f\"Unexpected JSON data on line {count}\", err=True)\n                break\n            except json.decoder.JSONDecodeError as e:\n                click.echo(f\"Invalid JSON on line {count}\", err=True)\n                break\n    click.echo(\n        f\"ℹ️  Parsed {len(unique_ids)} {id_type} IDs from {count} lines in {infile.name} file.\",\n        err=True,\n    )\n\n\n@twarc2.command(\"users\")\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@click.option(\"--usernames\", is_flag=True, default=False)\n@command_line_progressbar_option\n@command_line_input_output_file_arguments\n@click.pass_obj\n@cli_api_error\ndef users(T, infile, outfile, usernames, hide_progress, **kwargs):\n    \"\"\"\n    Get data for user ids or usernames.\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n    # Also remove media poll and place from kwargs, these are not valid for this endpoint:\n    kwargs.pop(\"media_fields\", None)\n    kwargs.pop(\"poll_fields\", None)\n    kwargs.pop(\"place_fields\", None)\n\n    with FileLineProgressBar(infile, outfile, disable=hide_progress) as progress:\n        for result in T.user_lookup(infile, usernames, **kwargs):\n            _write(result, outfile)\n            if usernames:\n                progress.update_with_result(\n                    result,\n                    field=\"username\",\n                    error_resource_type=\"user\",\n                    
error_parameter=\"usernames\",\n                )\n            else:\n                progress.update_with_result(result, error_resource_type=\"user\")\n\n\n@twarc2.command(\"user\")\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@click.argument(\"name-or-id\", type=click.Choice([\"name\", \"id\"]))\n@click.argument(\"user\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef user(T, name_or_id, user, outfile, **kwargs):\n    \"\"\"\n    Get the profile data for a single user by either username or ID.\n\n    To look up a user by ID:\n\n        twarc2 user id 12\n\n    To look up a user by username:\n\n        twarc2 user name jack\n\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n    # Also remove media poll and place from kwargs, these are not valid for this endpoint:\n    kwargs.pop(\"media_fields\", None)\n    kwargs.pop(\"poll_fields\", None)\n    kwargs.pop(\"place_fields\", None)\n\n    username = name_or_id == \"name\"\n\n    user_data = list(T.user_lookup([user], username, **kwargs))\n    _write(user_data, outfile)\n\n\n@twarc2.command(\"mentions\")\n@command_line_search_options\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@command_line_progressbar_option\n@click.argument(\"user_id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef mentions(T, user_id, outfile, hide_progress, **kwargs):\n    \"\"\"\n    Retrieve max of 800 of the most recent tweets mentioning the given user.\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    with tqdm(disable=hide_progress, total=800) as progress:\n        for result in T.mentions(user_id, **kwargs):\n            _write(result, outfile)\n            progress.update(len(result.get(\"data\", [])))\n        else:\n            if progress.n > 800:\n                progress.desc = f\"API limit reached with 
{progress.n} tweets\"\n                progress.n = 800\n            else:\n                progress.desc = f\"Set limit reached with {progress.n} tweets\"\n\n\n@twarc2.command(\"timeline\")\n@command_line_search_options\n@command_line_timelines_options\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@command_line_progressbar_option\n@click.option(\"--limit\", default=0, help=\"Maximum number of tweets to return\")\n@click.option(\n    \"--sort-order\",\n    type=click.Choice([\"recency\", \"relevancy\"]),\n    help='Filter tweets based on their date (\"recency\") (default) or based on their relevance as indicated by Twitter (\"relevancy\")',\n)\n@click.argument(\"user_id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef timeline(\n    T,\n    user_id,\n    outfile,\n    since_id,\n    until_id,\n    start_time,\n    end_time,\n    use_search,\n    limit,\n    exclude_retweets,\n    exclude_replies,\n    hide_progress,\n    sort_order,\n    **kwargs,\n):\n    \"\"\"\n    Retrieve recent tweets for the given user.\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    count = 0\n    user = T._ensure_user(user_id)  # It's possible to skip this to optimize more\n\n    if use_search or (start_time or end_time) or (since_id or until_id):\n        pbar = TimestampProgressBar\n\n        # Infer start time as the user created time if not using ids\n        if start_time is None and (since_id is None and until_id is None):\n            start_time = datetime.datetime.strptime(\n                user[\"created_at\"], \"%Y-%m-%dT%H:%M:%S.%fZ\"\n            )\n        # Infer since_id as user created time if using ids\n        if start_time is None and since_id is None:\n            infer_id = _millis2snowflake(\n                _date2millis(\n                    datetime.datetime.strptime(\n                        user[\"created_at\"], \"%Y-%m-%dT%H:%M:%S.%fZ\"\n       
             )\n                )\n            )\n            # Snowflake epoch is 1288834974657 so if older, just set it to \"1\"\n            since_id = infer_id if infer_id > 0 else 1\n\n        pbar_params = {\n            \"since_id\": since_id,\n            \"until_id\": until_id,\n            \"start_time\": start_time,\n            \"end_time\": end_time,\n            \"disable\": hide_progress,\n        }\n\n    else:\n        pbar = tqdm\n        pbar_params = {\n            \"disable\": hide_progress,\n            \"total\": user[\"public_metrics\"][\"tweet_count\"],\n        }\n\n    tweets = _timeline_tweets(\n        T,\n        use_search=use_search,\n        user_id=user_id,\n        since_id=since_id,\n        until_id=until_id,\n        start_time=start_time,\n        end_time=end_time,\n        exclude_retweets=exclude_retweets,\n        exclude_replies=exclude_replies,\n        sort_order=sort_order,\n        **kwargs,\n    )\n\n    with pbar(**pbar_params) as progress:\n        for result in tweets:\n            _write(result, outfile)\n\n            count += len(result[\"data\"])\n            if isinstance(progress, TimestampProgressBar):\n                progress.update_with_result(result)\n            else:\n                progress.update(len(result[\"data\"]))\n\n            if limit != 0 and count >= limit:\n                # Display message when stopped early\n                progress.desc = f\"Set --limit of {limit} reached\"\n                break\n        else:\n            if isinstance(progress, TimestampProgressBar):\n                progress.early_stop = False\n            if not use_search and user[\"public_metrics\"][\"tweet_count\"] > 3200:\n                progress.desc = f\"API limit of 3200 reached\"\n\n\n@twarc2.command(\"timelines\")\n@click.option(\"--limit\", default=0, help=\"Maximum number of tweets to return\")\n@click.option(\n    \"--timeline-limit\",\n    default=0,\n    help=\"Maximum number of tweets to return 
per-timeline\",\n)\n@click.option(\n    \"--sort-order\",\n    type=click.Choice([\"recency\", \"relevancy\"]),\n    help='Filter tweets based on their date (\"recency\") (default) or based on their relevance as indicated by Twitter (\"relevancy\")',\n)\n@command_line_search_options\n@command_line_timelines_options\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@command_line_progressbar_option\n@command_line_input_output_file_arguments\n@click.pass_obj\ndef timelines(\n    T,\n    infile,\n    outfile,\n    limit,\n    timeline_limit,\n    use_search,\n    sort_order,\n    hide_progress,\n    **kwargs,\n):\n    \"\"\"\n    Fetch the timelines of every user in an input source of tweets. If\n    the input is a line oriented text file of user ids or usernames that will\n    be used instead.\n\n    The infile can be:\n\n        - A file containing one user id per line (either quoted or unquoted)\n        - A JSONL file containing tweets collected in the Twitter API V2 format\n\n    \"\"\"\n    total_count = 0\n    line_count = 0\n    seen = set()\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    with FileLineProgressBar(infile, outfile, disable=hide_progress) as progress:\n        for line in infile:\n            progress.update()\n            line_count += 1\n            line = line.strip()\n            if line == \"\":\n                log.warn(\"skipping blank line on line %s\", line_count)\n                continue\n\n            users = None\n            try:\n                # assume this the line contains some tweet json\n                data = json.loads(line)\n\n                # if it parsed as a string or int assume it's a username\n                if isinstance(data, str) or isinstance(data, int):\n                    users = set([line])\n\n                # otherwise try to flatten the data and get the user ids\n                else:\n                    try:\n                        users = set([t[\"author\"][\"id\"] for t 
in ensure_flattened(data)])\n                    except (KeyError, ValueError):\n                        log.warn(\n                            \"ignored line %s which didn't contain users\", line_count\n                        )\n                        continue\n\n            except json.JSONDecodeError:\n                # maybe it's a single user?\n                users = set([line])\n\n            if users is None:\n                click.echo(\n                    click.style(\n                        f\"unable to find user or users on line {line_count}\",\n                        fg=\"red\",\n                    ),\n                    err=True,\n                )\n                break\n\n            for user in users:\n                # only process a given user once\n                if user in seen:\n                    log.info(\"already processed %s, skipping\", user)\n                    continue\n\n                # ignore what don't appear to be a username or user id since\n                # they can cause the Twitter API to throw a 400 error\n                if not re.match(r\"^((\\w{1,15})|(\\d+))$\", user):\n                    log.warn(\n                        'invalid username or user id \"%s\" on line %s', line, line_count\n                    )\n                    continue\n\n                seen.add(user)\n\n                tweets = _timeline_tweets(\n                    T,\n                    use_search=use_search,\n                    sort_order=sort_order,\n                    user_id=user,\n                    **kwargs,\n                )\n\n                timeline_count = 0\n                for response in tweets:\n                    _write(response, outfile)\n\n                    timeline_count += len(response[\"data\"])\n                    if timeline_limit != 0 and timeline_count >= timeline_limit:\n                        break\n\n                    total_count += len(response[\"data\"])\n                    if limit != 0 and 
total_count >= limit:\n                        return\n\n\ndef _timeline_tweets(\n    T,\n    use_search,\n    user_id,\n    since_id,\n    until_id,\n    start_time,\n    end_time,\n    exclude_retweets,\n    exclude_replies,\n    sort_order,\n    **kwargs,\n):\n    if use_search:\n        q = f\"from:{user_id}\"\n        if exclude_retweets and \"-is:retweet\" not in q:\n            q += \" -is:retweet\"\n        if exclude_replies and \"-is:reply\" not in q:\n            q += \" -is:reply\"\n        tweets = T.search_all(\n            query=q,\n            since_id=since_id,\n            until_id=until_id,\n            start_time=start_time,\n            end_time=end_time,\n            sort_order=sort_order,\n            **kwargs,\n        )\n    else:\n        tweets = T.timeline(\n            user=user_id,\n            since_id=since_id,\n            until_id=until_id,\n            start_time=start_time,\n            end_time=end_time,\n            exclude_retweets=exclude_retweets,\n            exclude_replies=exclude_replies,\n            **kwargs,\n        )\n    return tweets\n\n\n@twarc2.command(\"searches\")\n@command_line_search_options\n@command_line_search_archive_options\n@click.option(\n    \"--sort-order\",\n    type=click.Choice([\"recency\", \"relevancy\"]),\n    help='Filter tweets based on their date (\"recency\") (default) or based on their relevance as indicated by Twitter (\"relevancy\")',\n)\n@click.option(\n    \"--counts-only\",\n    is_flag=True,\n    default=False,\n    help=\"Only retrieve counts of tweets matching the search, not the tweets themselves. 
\"\n    \"outfile will be a CSV containing the counts for all of the queries in the input file.\",\n)\n@click.option(\n    \"--combine-queries\",\n    is_flag=True,\n    default=False,\n    help=\"\"\"Merge consecutive queries into a single OR query.\n    For example, if the three rows in your file are: banana, apple, pear\n    then a single query ((banana) OR (apple) OR (pear)) will be issued.\n    \"\"\",\n)\n@click.option(\n    \"--granularity\",\n    default=\"day\",\n    type=click.Choice([\"day\", \"hour\", \"minute\"], case_sensitive=False),\n    help=\"Aggregation level for counts (only used when --count-only is used). Can be one of: day, hour, minute. Default is day.\",\n)\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@command_line_progressbar_option\n@command_line_input_output_file_arguments\n@click.pass_obj\ndef searches(\n    T,\n    infile,\n    outfile,\n    limit,\n    max_results,\n    since_id,\n    until_id,\n    start_time,\n    end_time,\n    archive,\n    counts_only,\n    granularity,\n    combine_queries,\n    hide_progress,\n    sort_order,\n    **kwargs,\n):\n    \"\"\"\n    Execute each search in the input file, one at a time.\n\n    The infile must be a file containing one query per line. 
Each line will be\n    passed through directly to the Twitter API - unlike the timelines command\n    quotes will not be removed.\n\n    Input queries will be deduplicated - if the same literal query is present\n    in the file, it will still only be run once.\n\n    It is recommended that this command first be run with --counts-only, to\n    check that each of the queries is retrieving the volume of tweets\n    expected, and to avoid consuming quota unnecessarily.\n\n    \"\"\"\n    line_count = 0\n    seen = set()\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    # Make sure times are always in UTC, click sometimes doesn't add timezone:\n    if start_time is not None and start_time.tzinfo is None:\n        start_time = start_time.replace(tzinfo=timezone.utc)\n    if end_time is not None and end_time.tzinfo is None:\n        end_time = end_time.replace(tzinfo=timezone.utc)\n\n    # Standard API max query length\n    max_query_length = 512\n\n    # TODO: this duplicates existing logic in _search, but _search is too\n    # specific to be reused here.\n    if archive:\n        # start time defaults to the beginning of Twitter to override the\n        # default of the last month. 
Only do this if start_time is not already\n        # specified and since_id and until_id aren't being used\n        if start_time is None and since_id is None and until_id is None:\n            start_time = datetime.datetime(2006, 3, 21, tzinfo=datetime.timezone.utc)\n\n        # Academic track let's you use longer queries\n        max_query_length = 1024\n\n    if counts_only:\n        api_method = T.counts_all if archive else T.counts_recent\n        kwargs.pop(\"expansions\", None)\n        kwargs.pop(\"tweet_fields\", None)\n        kwargs.pop(\"user_fields\", None)\n        kwargs.pop(\"media_fields\", None)\n        kwargs.pop(\"poll_fields\", None)\n        kwargs.pop(\"place_fields\", None)\n        kwargs.pop(\"sort_order\", None)\n        kwargs = {\n            **kwargs,\n            **{\n                \"since_id\": since_id,\n                \"until_id\": until_id,\n                \"start_time\": start_time,\n                \"end_time\": end_time,\n                \"granularity\": granularity,\n            },\n        }\n\n        # Write the header for the CSV output\n        click.echo(f\"query,start,end,{granularity}_count\", file=outfile)\n\n    else:\n        api_method = T.search_all if archive else T.search_recent\n        kwargs = {\n            **kwargs,\n            **{\n                \"since_id\": since_id,\n                \"until_id\": until_id,\n                \"start_time\": start_time,\n                \"end_time\": end_time,\n                \"max_results\": max_results,\n                \"sort_order\": sort_order,\n            },\n        }\n\n    # TODO: Validate the queries are all valid length before beginning and report errors\n\n    # TODO: Needs an inputlines progress bar instead, as the queries are variable\n    # size.\n    with FileLineProgressBar(infile, outfile, disable=hide_progress) as progress:\n        merged_query = \"\"\n        extended_query = None\n        query = None\n\n        for query in infile:\n        
    query = query.strip()\n\n            progress.update(1)\n            line_count += 1\n\n            if query == \"\":\n                log.warn(\"skipping blank line on line %s\", line_count)\n                continue\n\n            if len(query) >= max_query_length:\n                log.warn(f\"skipping too long query {query} on line {line_count}\")\n                continue\n\n            if query in seen:\n                log.info(\"already processed %s, skipping\", query)\n                continue\n\n            seen.add(query)\n            retrieved = 0\n\n            if combine_queries and merged_query:\n                extended_query = f\"{merged_query} OR ({query})\"\n                # We've exceeded the limit, so now we can issue\n                # the merged query.\n                if len(extended_query) >= max_query_length:\n                    issue_query = merged_query\n                    merged_query = f\"({query})\"\n                else:\n                    # We haven't exceed the length yet, so accept the addon\n                    merged_query = extended_query\n                    continue\n\n            elif combine_queries:\n                merged_query = f\"({query})\"\n                continue\n\n            else:\n                # This is the normal case - we are not doing any combination.\n                issue_query = query\n\n            log.info(f'Beginning search for \"{issue_query}\"')\n\n            response = api_method(issue_query, **kwargs)\n\n            for result in response:\n                if counts_only:\n                    for r in result[\"data\"]:\n                        click.echo(\n                            f'{issue_query},{r[\"start\"]},{r[\"end\"]},{r[\"tweet_count\"]}',\n                            file=outfile,\n                        )\n\n                else:\n                    # Apply the limit if not counting\n                    _write(result, outfile)\n\n                    retrieved += 
len(result[\"data\"])\n\n                    if limit and (retrieved >= limit):\n                        break\n\n        # Make sure to process the final batch of queries if using the combined strategy\n        if combine_queries and (\n            merged_query == extended_query or merged_query == f\"({query})\"\n        ):\n            log.info(f'Beginning search for \"{merged_query}\"')\n            response = api_method(merged_query, **kwargs)\n\n            for result in response:\n                if counts_only:\n                    for r in result[\"data\"]:\n                        click.echo(\n                            f'{merged_query},{r[\"start\"]},{r[\"end\"]},{r[\"tweet_count\"]}',\n                            file=outfile,\n                        )\n\n                else:\n                    # Apply the limit if not counting\n                    _write(result, outfile)\n\n                    retrieved += len(result[\"data\"])\n\n                    if limit and (retrieved >= limit):\n                        break\n\n\n@twarc2.command(\"conversation\")\n@click.option(\n    \"--sort-order\",\n    type=click.Choice([\"recency\", \"relevancy\"]),\n    help='Filter tweets based on their date (\"recency\") (default) or based on their relevance as indicated by Twitter (\"relevancy\")',\n)\n@command_line_search_options\n@command_line_search_archive_options\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@command_line_progressbar_option\n@click.argument(\"tweet_id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef conversation(\n    T,\n    tweet_id,\n    outfile,\n    **kwargs,\n):\n    \"\"\"\n    Retrieve a conversation thread using the tweet id.\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    q = f\"conversation_id:{tweet_id}\"\n    return _search(\n        T,\n        q,\n        outfile,\n        **kwargs,\n    
)\n\n\n@twarc2.command(\"conversations\")\n@click.option(\n    \"--conversation-limit\",\n    default=0,\n    help=\"Maximum number of tweets to return per-conversation\",\n)\n@click.option(\n    \"--sort-order\",\n    type=click.Choice([\"recency\", \"relevancy\"]),\n    help='Filter tweets based on their date (\"recency\") (default) or based on their relevance as indicated by Twitter (\"relevancy\")',\n)\n@command_line_search_options\n@command_line_search_archive_options\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@command_line_progressbar_option\n@command_line_input_output_file_arguments\n@click.pass_obj\n@cli_api_error\ndef conversations(\n    T, infile, outfile, archive, limit, conversation_limit, hide_progress, **kwargs\n):\n    \"\"\"\n    Fetch the full conversation threads that the input tweets are a part of.\n    Alternatively the input can be a line oriented file of conversation ids.\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    # keep track of conversation ids that have been fetched so that they\n    # aren't fetched twice\n    seen = set()\n\n    # use the archive or recent search?\n    search = T.search_all if archive else T.search_recent\n\n    count = 0\n    stop = False\n\n    with FileLineProgressBar(infile, outfile, disable=hide_progress) as progress:\n        for line in infile:\n            progress.update()\n            conv_ids = []\n\n            # stop will get set when the total tweet limit has been met\n            if stop:\n                break\n\n            # get a specific conversation id\n            line = line.strip()\n            if line and re.match(r\"^\\d+$\", line):\n                if line in seen:\n                    continue\n                conv_ids = [line]\n\n            # generate all conversation_ids that are referenced in tweets input\n            elif line:\n\n                def f():\n                    for tweet in ensure_flattened(json.loads(line)):\n       
                 yield tweet.get(\"conversation_id\")\n\n                conv_ids = f()\n\n            # output results while paying attention to the set limits\n            conv_count = 0\n\n            for conv_id in conv_ids:\n                # skip conversations that were already fetched so they aren't fetched twice\n                if conv_id in seen:\n                    log.info(f\"already fetched conversation_id {conv_id}\")\n                    continue\n                seen.add(conv_id)\n\n                conv_count = 0\n\n                log.info(f\"fetching conversation {conv_id}\")\n                for result in search(f\"conversation_id:{conv_id}\", **kwargs):\n                    _write(result, outfile, False)\n\n                    count += len(result[\"data\"])\n                    if limit != 0 and count >= limit:\n                        log.info(f\"reached tweet limit of {limit}\")\n                        stop = True\n                        break\n\n                    conv_count += len(result[\"data\"])\n                    if conversation_limit != 0 and conv_count >= conversation_limit:\n                        log.info(f\"reached conversation limit {conversation_limit}\")\n                        break\n\n\n@twarc2.command(\"flatten\")\n@command_line_progressbar_option\n@command_line_input_output_file_arguments\n@cli_api_error\ndef flatten(infile, outfile, hide_progress):\n    \"\"\"\n    \"Flatten\" tweets, or move expansions inline with tweet objects and ensure\n    that each line of output is a single tweet.\n    \"\"\"\n    if infile.name == outfile.name:\n        click.echo(\n            click.style(\n                f\"💔 Cannot flatten files in-place, specify a different output file!\",\n                fg=\"red\",\n            ),\n            err=True,\n        )\n        return\n\n    with FileSizeProgressBar(infile, outfile, disable=hide_progress) as progress:\n        for line in infile:\n            for tweet in ensure_flattened(json.loads(line)):\n                _write(tweet, outfile, False)\n            
progress.update(len(line))\n\n\n@twarc2.command(\"places\")\n@click.option(\n    \"--type\",\n    \"search_type\",\n    type=click.Choice([\"name\", \"geo\", \"ip\"]),\n    default=\"name\",\n    help=\"How to search for places (defaults to name)\",\n)\n@click.option(\n    \"--granularity\",\n    type=click.Choice([\"neighborhood\", \"city\", \"admin\", \"country\"]),\n    default=\"neighborhood\",\n    help=\"What type of places to search for (defaults to neighborhood)\",\n)\n@click.option(\"--max-results\", type=int, help=\"Maximum results to return\")\n@click.option(\"--json\", is_flag=True, help=\"Output raw JSON response\")\n@click.argument(\"value\")\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef places(T, value, outfile, search_type, granularity, max_results, json):\n    \"\"\"\n    Search for places by place name, geo coordinates or ip address.\n    \"\"\"\n    params = {\"granularity\": granularity}\n\n    if search_type == \"name\":\n        params[\"query\"] = value\n    elif search_type == \"ip\":\n        params[\"ip\"] = value\n    elif search_type == \"geo\":\n        try:\n            lat, lon = list(map(float, value.split(\",\")))\n            params = {\"lat\": lat, \"lon\": lon}\n        except ValueError:\n            # value was not two comma separated numbers, so there is nothing to look up\n            click.echo(\"--geo must be lat,lon\", err=True)\n            return\n\n    if max_results:\n        params[\"max_results\"] = max_results\n\n    result = T.geo(**params)\n\n    if \"errors\" in result:\n        click.echo(_error_str(result[\"errors\"]), err=True)\n    elif json:\n        _write(result, outfile)\n    else:\n        for place in result[\"result\"][\"places\"]:\n            if granularity == \"country\":\n                line = \"{0} [id={1}]\".format(place[\"country\"], place[\"id\"])\n            else:\n                line = \"{0}, {1} [id={2}]\".format(\n                    place[\"full_name\"], place[\"country\"], place[\"id\"]\n                )\n            
click.echo(line)\n\n\n@twarc2.command(\"stream\")\n@click.option(\"--limit\", default=0, help=\"Maximum number of tweets to return\")\n@command_line_expansions_shortcuts\n@command_line_expansions_options\n@click.argument(\"outfile\", type=click.File(\"a+\"), default=\"-\")\n@click.pass_obj\n@cli_api_error\ndef stream(T, outfile, limit, **kwargs):\n    \"\"\"\n    Fetch tweets from the live stream.\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n    event = threading.Event()\n    count = 0\n    click.echo(click.style(f\"Started a stream with rules:\", fg=\"green\"), err=True)\n    _print_stream_rules(T)\n    click.echo(\n        click.style(f\"Writing to {outfile.name}\\nCTRL+C to stop...\", fg=\"green\"),\n        err=True,\n    )\n    for result in T.stream(event=event, **kwargs):\n        count += 1\n        if limit != 0 and count == limit:\n            log.info(f\"reached limit {limit}\")\n            event.set()\n        _write(result, outfile)\n\n        if result and \"data\" in result:\n            log.info(\"archived %s\", result[\"data\"][\"id\"])\n\n\n@twarc2.group()\n@click.pass_obj\ndef lists(T):\n    \"\"\"\n    Lists API support.\n    \"\"\"\n    pass\n\n\n@lists.command(\"lookup\")\n@click.argument(\"list_id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.option(\"--pretty\", is_flag=True, default=False, help=\"Pretty print the JSON\")\n@click.option(\n    \"--list-fields\",\n    default=\",\".join(LIST_FIELDS),\n    type=click.STRING,\n    is_eager=True,\n    help=\"Comma separated list of tweet fields to retrieve. 
Default is all available.\",\n    callback=_validate_expansions,\n)\n@click.pass_obj\n@cli_api_error\ndef lists_lookup(T, list_id, outfile, pretty, **kwargs):\n    \"\"\"\n    Look up a single list using its list id or URL.\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    if \"https\" in list_id:\n        list_id = list_id.split(\"/\")[-1]\n    if not re.match(r\"^\\d+$\", list_id):\n        click.echo(click.style(\"Please enter a List URL or ID\", fg=\"red\"), err=True)\n        return\n    result = T.list_lookup(list_id, **kwargs)\n    _write(result, outfile, pretty=pretty)\n\n\n@lists.command(\"bulk-lookup\")\n@command_line_input_output_file_arguments\n@command_line_progressbar_option\n@click.option(\n    \"--list-fields\",\n    default=\",\".join(LIST_FIELDS),\n    type=click.STRING,\n    is_eager=True,\n    help=\"Comma separated list of fields about a list to retrieve. Default is all available.\",\n    callback=_validate_expansions,\n)\n@click.pass_obj\n@cli_api_error\ndef lists_bulk_lookup(T, infile, outfile, hide_progress, **kwargs):\n    \"\"\"\n    Look up the details of many lists given a file of IDs or URLs.\n    \"\"\"\n\n    kwargs = _process_expansions_shortcuts(kwargs)\n\n    with FileLineProgressBar(infile, outfile, disable=hide_progress) as progress:\n        for list_id in infile:\n            progress.update()\n\n            if \"https\" in list_id:\n                list_id = list_id.split(\"/\")[-1]\n            if not re.match(r\"^\\d+$\", list_id):\n                click.echo(\n                    click.style(\n                        f\"Skipping invalid List URL or ID: {list_id.strip()}\",\n                        fg=\"red\",\n                    ),\n                    err=True,\n                )\n                continue\n            result = T.list_lookup(list_id.strip(), **kwargs)\n            _write(result, outfile)\n\n\n@lists.command(\"all\")\n@click.argument(\"user\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.option(\n    \"--list-fields\",\n    
default=\",\".join(LIST_FIELDS),\n    type=click.STRING,\n    is_eager=True,\n    help=\"Comma separated list of tweet fields to retrieve. Default is all available.\",\n    callback=_validate_expansions,\n)\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of lists to save. Default is all.\",\n    type=int,\n)\n@command_line_progressbar_option\n@click.pass_obj\n@cli_api_error\ndef lists_all(T, user, outfile, limit, hide_progress, **kwargs):\n    \"\"\"\n    Get all Lists that a user created or is subscribed to.\n\n    You can use the `owned` or `followed` command to get just the lists\n    created by the user, or just the lists followed by the user\n    respectively.\n\n    \"\"\"\n    kwargs = _process_expansions_shortcuts(kwargs)\n    _write_with_progress(\n        func=T.owned_lists,\n        user=user,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=1,\n        **kwargs,\n    )\n    _write_with_progress(\n        func=T.followed_lists,\n        user=user,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=1,\n        **kwargs,\n    )\n\n\n@lists.command(\"owned\")\n@click.argument(\"user\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.option(\n    \"--list-fields\",\n    default=\",\".join(LIST_FIELDS),\n    type=click.STRING,\n    is_eager=True,\n    help=\"Comma separated list of tweet fields to retrieve. Default is all available.\",\n    callback=_validate_expansions,\n)\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of lists to save. 
Default is all.\",\n    type=int,\n)\n@command_line_progressbar_option\n@click.pass_obj\n@cli_api_error\ndef lists_owned(T, user, outfile, limit, hide_progress, **kwargs):\n    \"\"\"\n    Get all Lists that a user created.\n    \"\"\"\n    kwargs = _process_expansions_shortcuts(kwargs)\n    _write_with_progress(\n        func=T.owned_lists,\n        user=user,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=1,\n        **kwargs,\n    )\n\n\n@lists.command(\"followed\")\n@click.argument(\"user\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.option(\n    \"--list-fields\",\n    default=\",\".join(LIST_FIELDS),\n    type=click.STRING,\n    is_eager=True,\n    help=\"Comma separated list of tweet fields to retrieve. Default is all available.\",\n    callback=_validate_expansions,\n)\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of lists to save. Default is all.\",\n    type=int,\n)\n@command_line_progressbar_option\n@click.pass_obj\n@cli_api_error\ndef lists_followed(T, user, outfile, limit, hide_progress, **kwargs):\n    \"\"\"\n    Get all Lists that a user is following.\n    \"\"\"\n    kwargs = _process_expansions_shortcuts(kwargs)\n    _write_with_progress(\n        func=T.followed_lists,\n        user=user,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=1,\n        **kwargs,\n    )\n\n\n@lists.command(\"memberships\")\n@click.argument(\"user\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.option(\n    \"--list-fields\",\n    default=\",\".join(LIST_FIELDS),\n    type=click.STRING,\n    is_eager=True,\n    help=\"Comma separated list of tweet fields to retrieve. Default is all available.\",\n    callback=_validate_expansions,\n)\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of lists to save. 
Default is all.\",\n    type=int,\n)\n@command_line_progressbar_option\n@click.pass_obj\n@cli_api_error\ndef lists_memberships(T, user, outfile, limit, hide_progress, **kwargs):\n    \"\"\"\n    Get all Lists that a user is a member of.\n    \"\"\"\n    kwargs = _process_expansions_shortcuts(kwargs)\n    lookup_total = 1\n\n    hide_progress = True if (outfile.name == \"<stdout>\") else hide_progress\n\n    if not hide_progress:\n        target_user = T._ensure_user(user)\n        lookup_total = target_user[\"public_metrics\"][\"listed_count\"]\n\n    _write_with_progress(\n        func=T.list_memberships,\n        user=user,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=lookup_total,\n        **kwargs,\n    )\n\n\n@lists.command(\"followers\")\n@click.argument(\"list-id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of users to save. 
Default is all.\",\n    type=int,\n)\n@command_line_expansions_options\n@command_line_progressbar_option\n@click.pass_obj\n@cli_api_error\ndef lists_followers(T, list_id, outfile, limit, hide_progress, **kwargs):\n    \"\"\"\n    Get all Users that are following (subscribed) to a list.\n    \"\"\"\n    kwargs = _process_expansions_shortcuts(kwargs)\n    # Also remove media poll and place from kwargs, these are not valid for this endpoint:\n    kwargs.pop(\"media_fields\", None)\n    kwargs.pop(\"poll_fields\", None)\n    kwargs.pop(\"place_fields\", None)\n\n    _list = ensure_flattened(T.list_lookup(list_id))[-1]\n    list_id = _list[\"id\"]\n    lookup_total = int(_list[\"follower_count\"])\n\n    _write_with_progress(\n        func=T.list_followers,\n        list_id=list_id,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=lookup_total,\n        **kwargs,\n    )\n\n\n@lists.command(\"members\")\n@click.argument(\"list-id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of users to save. 
Default is all.\",\n    type=int,\n)\n@command_line_expansions_options\n@command_line_progressbar_option\n@click.pass_obj\n@cli_api_error\ndef lists_members(T, list_id, outfile, limit, hide_progress, **kwargs):\n    \"\"\"\n    Get all Users that are members of a list.\n    \"\"\"\n    kwargs = _process_expansions_shortcuts(kwargs)\n    # Also remove media poll and place from kwargs, these are not valid for this endpoint:\n    kwargs.pop(\"media_fields\", None)\n    kwargs.pop(\"poll_fields\", None)\n    kwargs.pop(\"place_fields\", None)\n\n    _list = ensure_flattened(T.list_lookup(list_id))[-1]\n    list_id = _list[\"id\"]\n    lookup_total = int(_list[\"member_count\"])\n\n    _write_with_progress(\n        func=T.list_members,\n        list_id=list_id,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=lookup_total,\n        **kwargs,\n    )\n\n\n@lists.command(\"tweets\")\n@click.argument(\"list-id\", type=str)\n@click.argument(\"outfile\", type=click.File(\"w\"), default=\"-\")\n@click.option(\n    \"--limit\",\n    default=0,\n    help=\"Maximum number of tweets to save. 
Default and max is last 800.\",\n    type=int,\n)\n@command_line_expansions_options\n@command_line_progressbar_option\n@click.pass_obj\n@cli_api_error\ndef lists_tweets(T, list_id, outfile, limit, hide_progress, **kwargs):\n    \"\"\"\n    Get up to the most recent 800 tweets posted by members of a list.\n    \"\"\"\n    kwargs = _process_expansions_shortcuts(kwargs)\n    # Also remove media poll and place from kwargs, these are not valid for this endpoint:\n    kwargs.pop(\"media_fields\", None)\n    kwargs.pop(\"poll_fields\", None)\n    kwargs.pop(\"place_fields\", None)\n\n    _write_with_progress(\n        func=T.list_tweets,\n        list_id=list_id,\n        outfile=outfile,\n        limit=limit,\n        hide_progress=hide_progress,\n        progress_total=800,\n        **kwargs,\n    )\n\n\n@twarc2.group()\n@click.pass_obj\ndef stream_rules(T):\n    \"\"\"\n    List, add and delete rules for your stream.\n    \"\"\"\n    pass\n\n\n@stream_rules.command(\"list\")\n@click.option(\"--display-ids\", is_flag=True, help=\"display the rule ids\")\n@click.pass_obj\n@cli_api_error\ndef list_stream_rules(T, display_ids):\n    \"\"\"\n    List all the active stream rules.\n    \"\"\"\n    _print_stream_rules(T, display_ids)\n\n\ndef _print_stream_rules(T, display_ids=False):\n    \"\"\"\n    Output all the active stream rules\n    \"\"\"\n    result = T.get_stream_rules()\n    if \"data\" not in result or len(result[\"data\"]) == 0:\n        click.echo(\n            \"No rules yet. 
Add them with \"\n            + click.style(\"twarc2 stream-rules add\", bold=True),\n            err=True,\n        )\n    else:\n        count = 0\n        for rule in result[\"data\"]:\n            if count > 5:\n                count = 0\n            s = rule[\"value\"]\n            if \"tag\" in rule:\n                s += f\" (tag: {rule['tag']})\"\n            if display_ids:\n                s += f\" (id: {rule['id']})\"\n\n            click.echo(click.style(f\"☑  {s}\"), err=True)\n            count += 1\n\n\n@stream_rules.command(\"add\")\n@click.pass_obj\n@click.option(\"--tag\", type=str, help=\"a tag to help identify the rule\")\n@click.argument(\"value\", type=str)\n@cli_api_error\ndef add_stream_rule(T, value, tag):\n    \"\"\"\n    Create a new stream rule to match a value. Rules can be grouped with\n    optional tags.\n    \"\"\"\n    if tag:\n        rules = [{\"value\": value, \"tag\": tag}]\n    else:\n        rules = [{\"value\": value}]\n\n    results = T.add_stream_rules(rules)\n    if \"errors\" in results:\n        click.echo(_error_str(results[\"errors\"]), err=True)\n    else:\n        click.echo(click.style(f\"🚀  Added rule for \", fg=\"green\") + f'\"{value}\"')\n\n\n@stream_rules.command(\"delete\")\n@click.argument(\"value\")\n@click.pass_obj\n@cli_api_error\ndef delete_stream_rule(T, value):\n    \"\"\"\n    Delete the stream rule that matches a given value.\n    \"\"\"\n    # find the rule id\n    result = T.get_stream_rules()\n    if \"data\" not in result:\n        click.echo(click.style(\"💔  There are no rules to delete!\", fg=\"red\"), err=True)\n    else:\n        rule_id = None\n        for rule in result[\"data\"]:\n            if rule[\"value\"] == value:\n                rule_id = rule[\"id\"]\n                break\n        if not rule_id:\n            click.echo(\n                click.style(f'🙃  No rule could be found for \"{value}\"', fg=\"red\"),\n                err=True,\n            )\n        else:\n            
results = T.delete_stream_rule_ids([rule_id])\n            if \"errors\" in results:\n                click.echo(_error_str(results[\"errors\"]), err=True)\n            else:\n                click.echo(click.style(f\"🗑  Deleted stream rule for {value}\", fg=\"green\"))\n\n\n@stream_rules.command(\"delete-all\")\n@click.pass_obj\n@cli_api_error\ndef delete_all(T):\n    \"\"\"\n    Delete all stream rules!\n    \"\"\"\n    result = T.get_stream_rules()\n    if \"data\" not in result:\n        click.echo(click.style(\"💔  There are no rules to delete!\", fg=\"red\"), err=True)\n    else:\n        rule_ids = [r[\"id\"] for r in result[\"data\"]]\n        results = T.delete_stream_rule_ids(rule_ids)\n        click.echo(f\"🗑  Deleted {len(rule_ids)} rules.\")\n\n\n@twarc2.group()\n@click.pass_obj\ndef compliance_job(T):\n    \"\"\"\n    Create, retrieve and list batch compliance jobs for Tweets and Users.\n    \"\"\"\n    pass\n\n\n@compliance_job.command(\"list\")\n@click.argument(\n    \"job_type\",\n    required=False,\n    default=None,\n    type=click.Choice([\"tweets\", \"users\"], case_sensitive=False),\n)\n@click.option(\n    \"--status\",\n    default=None,\n    type=click.Choice(\n        [\"created\", \"in_progress\", \"complete\", \"failed\"], case_sensitive=False\n    ),\n    help=\"Filter by job status. Only one of 'created', 'in_progress', 'complete', 'failed' can be specified. 
If not set, returns all.\",\n)\n@command_line_verbose_options\n@click.pass_obj\n@cli_api_error\ndef compliance_job_list(T, job_type, status, verbose, json_output):\n    \"\"\"\n    Returns a list of compliance jobs by job type and status.\n    \"\"\"\n\n    if job_type:\n        job_result = T.compliance_job_list(job_type.lower(), status)\n        results = job_result[\"data\"] if \"data\" in job_result else []\n    else:\n        tweets_result = T.compliance_job_list(\"tweets\", status)\n        users_result = T.compliance_job_list(\"users\", status)\n        tweets_jobs = tweets_result[\"data\"] if \"data\" in tweets_result else []\n        users_jobs = users_result[\"data\"] if \"data\" in users_result else []\n        results = tweets_jobs + users_jobs\n\n    if json_output:\n        click.echo(json.dumps(results))\n        return\n\n    if len(results) == 0:\n        job_type_message = \"tweet or user\" if job_type is None else job_type\n        status_message = f' with Status \"{status}\"' if status else \"\"\n        click.echo(\n            click.style(\n                f\"🙃 There are no {job_type_message} compliance jobs{status_message}. 
To create a new job, see:\\n twarc2 compliance-job create --help\",\n                fg=\"red\",\n            ),\n            err=True,\n        )\n    else:\n        for job in results:\n            _print_compliance_job(job, verbose)\n\n\n@compliance_job.command(\"get\")\n@click.argument(\"job\")\n@command_line_verbose_options\n@click.pass_obj\n@cli_api_error\ndef compliance_job_get(T, job, verbose, json_output):\n    \"\"\"\n    Returns status and download information about the job ID.\n    \"\"\"\n    if json_output:\n        result = T.compliance_job_get(job)\n        click.echo(json.dumps(result))\n        return\n\n    job = _get_job(T, job)\n    if job is None:\n        return\n\n    _print_compliance_job(job, verbose)\n\n    # Ask to download if complete\n    if job[\"status\"] == \"complete\":\n        continue_download = input(\n            \"This job is complete, download it now into the current folder? [y or n]? \"\n        )\n        # startswith() treats an empty answer as \"no\" instead of raising IndexError\n        if continue_download.lower().startswith(\"y\"):\n            _download_job(job)\n\n\n@compliance_job.command(\"create\")\n@click.argument(\n    \"job_type\",\n    required=True,\n    type=click.Choice([\"tweets\", \"users\"], case_sensitive=False),\n)\n@click.argument(\"infile\", type=click.Path(), required=True)\n@click.argument(\"outfile\", type=click.Path(), required=False, default=None)\n@click.option(\"--job-name\", type=str, help=\"A name or tag to help identify the job.\")\n@click.option(\n    \"--wait/--no-wait\",\n    default=True,\n    help=\"Wait for the job to finish and download the results. 
Wait by default.\",\n)\n@command_line_progressbar_option\n@click.pass_obj\n@cli_api_error\ndef compliance_job_create(T, job_type, infile, outfile, job_name, wait, hide_progress):\n    \"\"\"\n    Create a new compliance job and upload tweet IDs.\n    \"\"\"\n\n    # Check for file contents:\n    with open(infile) as f:\n        try:\n            int(f.readline())\n        except:\n            click.echo(\n                click.style(\n                    f\"🙃 The file {infile} does not contain a list of IDs. Use:\",\n                    fg=\"red\",\n                ),\n                err=True,\n            )\n            click.echo(\n                click.style(\n                    f\" twarc2 dehydrate --id-type {job_type} {infile} output_ids.txt\",\n                ),\n                err=True,\n            )\n            click.echo(\n                click.style(\n                    f\"to create a file with {job_type} IDs.\",\n                    fg=\"red\",\n                ),\n                err=True,\n            )\n            return\n\n    # Create a job (not resumable right now):\n    _job = T.compliance_job_create(job_type, job_name)[\"data\"]\n\n    click.echo(\n        click.style(\n            f\"Created a new {job_type} job {_job['id']}. 
Uploading {infile}.\",\n            fg=\"yellow\",\n            bold=True,\n        ),\n        err=True,\n    )\n\n    # Upload the file\n    with open(infile, \"rb\") as f:\n        with tqdm(\n            total=os.stat(infile).st_size, unit=\"B\", unit_scale=True, unit_divisor=1024\n        ) as pbar:\n            wrapped_file = CallbackIOWrapper(pbar.update, f, \"read\")\n            requests.put(\n                _job[\"upload_url\"],\n                data=wrapped_file,\n                headers={\"Content-Type\": \"text/plain\"},\n            )\n\n    if wait:\n        if _wait_for_job(T, _job):\n            _download_job(_job, outfile, hide_progress)\n\n\n@compliance_job.command(\"download\")\n@click.argument(\"job\")\n@click.argument(\"outfile\", type=click.Path(), required=False, default=None)\n@click.option(\n    \"--wait/--no-wait\",\n    default=True,\n    help=\"Wait for the job to finish and download the results. Wait by default.\",\n)\n@command_line_progressbar_option\n@click.pass_obj\n@cli_api_error\ndef compliance_job_download(T, job, outfile, wait, hide_progress):\n    \"\"\"\n    Download the compliance job with the specified ID.\n    \"\"\"\n\n    _job = _get_job(T, job)\n    if _job is None:\n        click.echo(\n            click.style(\n                f\"Job {job} not found. List valid job IDs with 'twarc2 compliance-job list' or Retry submitting the job with 'twarc2 compliance-job create'\",\n                fg=\"red\",\n                bold=True,\n            ),\n            err=True,\n        )\n        return\n\n    if _job[\"status\"] == \"complete\":\n        _download_job(_job, outfile, hide_progress)\n    elif _job[\"status\"] == \"expired\" or _job[\"status\"] == \"failed\":\n        click.echo(\n            click.style(\n                f\"Job {_job['id']} is '{_job['status']}'. 
Retry submitting the job with 'twarc2 compliance-job create'\",\n                fg=\"red\",\n                bold=True,\n            ),\n            err=True,\n        )\n        return\n    else:\n        if not wait:\n            click.echo(\n                click.style(\n                    f\"Job {_job['id']} is '{_job['status']}'. Use:\\n twarc2 compliance-job get {_job['id']}\\nto get the status. Or run:\\n twarc2 compliance-job download {_job['id']}\\nto wait for the job to complete.\",\n                    fg=\"yellow\",\n                    bold=True,\n                ),\n                err=True,\n            )\n        else:\n            if _wait_for_job(T, _job):\n                _download_job(_job, outfile, hide_progress)\n\n\ndef _get_job(T, job):\n    \"\"\"\n    Retrieve a job from the API by ID\n    \"\"\"\n    result = T.compliance_job_get(job)\n    if \"data\" not in result:\n        click.echo(\n            click.style(\n                f\"Job {job} could not be found. List valid job IDs with 'twarc2 compliance-job list'\",\n                fg=\"red\",\n                bold=True,\n            ),\n            err=True,\n        )\n        return None\n    return result[\"data\"]\n\n\ndef _wait_for_job(T, job, hide_progress=False):\n    \"\"\"\n    Wait for the compliance job to complete\n    \"\"\"\n\n    if (\n        job is not None\n        and \"status\" in job\n        and (job[\"status\"] == \"failed\" or job[\"status\"] == \"expired\")\n    ):\n        click.echo(\n            click.style(\n                f\"Stopped waiting for job... Job status is {job['status']}\",\n                fg=\"red\",\n                bold=True,\n            )\n        )\n        return False\n\n    click.echo(\n        click.style(\n            f\"Waiting for job {job['id']} to complete. 
Press Ctrl+C to cancel.\",\n            fg=\"yellow\",\n            bold=True,\n        ),\n        err=True,\n    )\n\n    start_time = datetime.datetime.now(datetime.timezone.utc)\n    est_completion = (\n        datetime.datetime.strptime(\n            job[\"estimated_completion\"], \"%Y-%m-%dT%H:%M:%S.%fZ\"\n        ).replace(tzinfo=datetime.timezone.utc)\n        if \"estimated_completion\" in job\n        else start_time\n    )\n    seconds_wait = int((est_completion - start_time).total_seconds())\n    if seconds_wait <= 0:\n        click.echo(\n            click.style(\n                f\"Estimated completion time unknown, waiting 1 minute instead.\",\n                fg=\"yellow\",\n                bold=True,\n            ),\n            err=True,\n        )\n        seconds_wait = 60\n        est_completion = datetime.datetime.now(\n            datetime.timezone.utc\n        ) + datetime.timedelta(seconds=60)\n\n    with TimestampProgressBar(\n        since_id=None,\n        until_id=None,\n        start_time=start_time,\n        end_time=est_completion,\n        disable=hide_progress,\n        bar_format=\"{l_bar}{bar}| Waiting {n_time}/{total_time}{postfix}\",\n    ) as pbar:\n        while True:\n            try:\n                pbar.refresh()\n                pbar.reset()\n                for i in range(seconds_wait * 10):\n                    pbar.update(100)\n                    time.sleep(0.1)\n\n                job = _get_job(T, job[\"id\"])\n\n                if job is not None and \"status\" in job:\n                    total_wait = humanize.naturaldelta(\n                        datetime.datetime.now(datetime.timezone.utc) - start_time\n                    )\n                    pbar.set_postfix_str(\n                        f\"Job Status: {job['status']}. 
Waited for: {total_wait}\"\n                    )\n\n                    if job[\"status\"] == \"complete\":\n                        return True\n                    elif job[\"status\"] == \"in_progress\" or job[\"status\"] == \"created\":\n                        continue\n                    else:\n                        click.echo(\n                            click.style(\n                                f\"Stopped waiting for job... Job status is {job['status']}\",\n                                fg=\"red\",\n                                bold=True,\n                            )\n                        )\n                        return False\n                else:\n                    click.echo(\n                        click.style(\n                            f\"Stopped waiting for job... Failed to retrieve job from API.\",\n                            fg=\"red\",\n                            bold=True,\n                        ),\n                        err=True,\n                    )\n                    return False\n\n            except KeyboardInterrupt:\n                click.echo(\n                    click.style(\n                        \"Stopped waiting for job... Run the command again to continue waiting.\",\n                        fg=\"yellow\",\n                        bold=True,\n                    )\n                )\n                return False\n\n\ndef _download_job(job, outfile=None, hide_progress=False):\n    \"\"\"\n    Download the compliance job.\n    \"\"\"\n\n    click.echo(\n        click.style(\n            f\"Job {job['id']} is '{job['status']}'. 
Downloading Results...\",\n            fg=\"yellow\",\n            bold=True,\n        ),\n        err=True,\n    )\n\n    url = job[\"download_url\"]\n    if outfile is None:\n        outfile = f\"{job['type']}_compliance_{job['id']}.json\"\n\n    response = requests.get(url, stream=True)\n\n    with open(outfile, \"wb\") as fout:\n        with tqdm(\n            disable=hide_progress,\n            unit=\"B\",\n            unit_scale=True,\n            unit_divisor=1024,\n            miniters=1,\n            total=int(response.headers.get(\"content-length\", 0)),\n        ) as pbar:\n            pbar.set_postfix_str(outfile)\n            for chunk in response.iter_content(chunk_size=4096):\n                fout.write(chunk)\n                pbar.update(len(chunk))\n\n\ndef _print_compliance_job(job, verbose=False):\n    job_colour = \"yellow\"\n\n    if job[\"status\"] == \"expired\" or job[\"status\"] == \"failed\":\n        job_colour = \"red\"\n\n    if job[\"status\"] == \"complete\":\n        job_colour = \"green\"\n\n    time_now = datetime.datetime.now(datetime.timezone.utc)\n\n    upload_exp = time_now - datetime.datetime.strptime(\n        job[\"upload_expires_at\"], \"%Y-%m-%dT%H:%M:%S.%fZ\"\n    ).replace(tzinfo=datetime.timezone.utc)\n\n    download_exp = time_now - datetime.datetime.strptime(\n        job[\"download_expires_at\"], \"%Y-%m-%dT%H:%M:%S.%fZ\"\n    ).replace(tzinfo=datetime.timezone.utc)\n\n    failure = \"\"\n    if \"error\" in job:\n        failure = job[\"error\"]\n\n    job_name = job[\"name\"] if \"name\" in job else \"Job\"\n    click.echo(\n        click.style(\n            f\"📃 Type: \\\"{job['type']}\\\" ID: \\\"{job['id']}\\\" Name: \\\"{job_name}\\\" Status: \\\"{job['status']}\\\" {failure}\",\n            fg=job_colour,\n            bold=True,\n        ),\n        err=True,\n    )\n    if verbose:\n        click.echo(\n            click.style(f\"Created at: {job['created_at']}\"),\n            err=True,\n        )\n        
click.echo(\n            click.style(f\"Resumable: {job['resumable']}\"),\n            err=True,\n        )\n        upload_url = job[\"upload_url\"] if upload_exp.total_seconds() < 0 else \"Expired\"\n        click.echo(\n            click.style(\n                f\"Upload Expiry: {humanize.naturaltime(upload_exp)} URL: {upload_url}\"\n            ),\n            err=True,\n        )\n        download_url = (\n            job[\"download_url\"] if download_exp.total_seconds() < 0 else \"Expired\"\n        )\n        click.echo(\n            click.style(\n                f\"Download Expiry: {humanize.naturaltime(download_exp)} URL: {download_url}\"\n            ),\n            err=True,\n        )\n\n\ndef _rule_str(rule):\n    s = f\"id={rule['id']} value={rule['value']}\"\n    if \"tag\" in rule:\n        s += f\" tag={rule['tag']}\"\n    return s\n\n\ndef _error_str(errors):\n    # collapse all the error messages into a newline delimited red colored list\n    # the passed in errors can be single error object or a list of objects, each\n    # of which has an errors key that points to a list of error objects\n\n    if type(errors) != list or \"errors\" not in errors:\n        errors = [{\"errors\": errors}]\n\n    parts = []\n    for error in errors:\n        for part in error[\"errors\"]:\n            s = \"💣  \"\n            if \"message\" in part:\n                s += click.style(part[\"message\"], fg=\"red\")\n            elif \"title\" in part:\n                s += click.style(part[\"title\"], fg=\"red\")\n            else:\n                s = click.style(\"Unknown error\", fg=\"red\")\n            if \"type\" in part:\n                s += f\" see: {part['type']}\"\n            parts.append(s)\n\n    return click.style(\"\\n\".join(parts), fg=\"red\")\n\n\ndef _write(results, outfile, pretty=False):\n    indent = 2 if pretty else None\n    click.echo(json.dumps(results, indent=indent), file=outfile)\n\n\ndef _write_with_progress(\n    func, outfile, limit, 
hide_progress, progress_total=1, **kwargs\n):\n    \"\"\"\n    Get results page by page and write them out with a progress bar\n    \"\"\"\n    count = 0\n    hide_progress = True if (outfile.name == \"<stdout>\") else hide_progress\n\n    with tqdm(disable=hide_progress, total=progress_total) as progress:\n        results = func(**kwargs)\n        for result in results:\n            _write(result, outfile)\n            count += len(result.get(\"data\", []))\n            progress.update(len(result.get(\"data\", [])))\n            if limit != 0 and count >= limit:\n                # Display message when stopped early\n                progress.desc = f\"Set --limit of {limit} reached\"\n                break\n        # Finish the progress bar\n        progress.update(progress.total - progress.n)\n"
  },
  {
    "path": "src/twarc/config.py",
    "content": "import logging\nimport configobj\n\n# Adapted from click_config_file.configobj_provider so that we can store the\n# file path that the config was loaded from in order to log it later.\n\nlog = logging\n\n\nclass ConfigProvider:\n    def __init__(self):\n        self.file_path = None\n\n    def __call__(self, file_path, cmd_name):\n        self.file_path = file_path\n        return configobj.ConfigObj(file_path, unrepr=True)\n"
  },
  {
    "path": "src/twarc/decorators.py",
    "content": "import time\nimport logging\n\nfrom requests import HTTPError\nfrom requests.packages.urllib3.exceptions import ReadTimeoutError\nfrom requests.exceptions import ChunkedEncodingError, ReadTimeout, ContentDecodingError\n\nlog = logging.getLogger(\"twarc\")\n\n\ndef rate_limit(f):\n    \"\"\"\n    A decorator to handle rate limiting from the Twitter API. If\n    a rate limit error is encountered we will sleep until we can\n    issue the API call again.\n    \"\"\"\n\n    def new_f(*args, **kwargs):\n        errors = 0\n        while True:\n            resp = f(*args, **kwargs)\n            if resp.status_code == 200:\n                errors = 0\n                return resp\n            elif resp.status_code == 401:\n                # Hack to retain the original exception, but augment it with\n                # additional context for the user to interpret it. In a Python\n                # 3 only future we can raise a new exception of the same type\n                # with a new message from the old error.\n                try:\n                    resp.raise_for_status()\n                except HTTPError as e:\n                    message = (\n                        \"\\nThis is a protected or locked account, or\"\n                        + \" the credentials provided are no longer valid.\"\n                    )\n                    e.args = (e.args[0] + message,) + e.args[1:]\n                    log.warning(\"401 Authentication required for %s\", resp.url)\n                    raise\n            elif resp.status_code == 429:\n                try:\n                    reset = int(resp.headers[\"x-rate-limit-reset\"])\n                    now = time.time()\n                    seconds = reset - now + 10\n                except KeyError:\n                    # gnip endpoint doesn't have x-rate-limit-reset\n                    seconds = 2\n                if seconds < 1:\n                    seconds = 10\n                log.warning(\"rate limit 
exceeded: sleeping %s secs\", seconds)\n                time.sleep(seconds)\n            # Special case for Academic all archive search instability\n            # If we hit a 503 for that specific endpoint, we sleep for a shorter amount\n            # of time, and reduce the number of tweets per request.\n            elif resp.status_code == 503 and resp.url.startswith(\n                \"https://api.twitter.com/2/tweets/search/all\"\n            ):\n                errors += 1\n                if errors > 30:\n                    log.warning(\"too many errors from Twitter, giving up\")\n                    resp.raise_for_status()\n                # Shorter wait time than other endpoints for this specific case. Also\n                # on the first error, only wait for the single second required by the\n                # 1 request/s rate limit\n                seconds = max(1, 15 * (errors - 1))\n\n                # Backoff the number of results retrieved for this request.\n                old_page_size = kwargs[\"params\"][\"max_results\"]\n                kwargs[\"params\"][\"max_results\"] = max(50, old_page_size // 2)\n                log.warning(\n                    \"%s from Twitter search/all API, sleeping %s and backing off to %s tweets/page\",\n                    resp.status_code,\n                    seconds,\n                    kwargs[\"params\"][\"max_results\"],\n                )\n                time.sleep(seconds)\n            elif resp.status_code >= 500:\n                errors += 1\n                if errors > 30:\n                    log.warning(\"too many errors from Twitter, giving up\")\n                    resp.raise_for_status()\n                seconds = 60 * errors\n                log.warning(\n                    \"%s from Twitter API, sleeping %s\", resp.status_code, seconds\n                )\n                time.sleep(seconds)\n            else:\n                resp.raise_for_status()\n\n    return new_f\n\n\ndef 
catch_conn_reset(f):\n    \"\"\"\n    A decorator to handle connection reset errors even ones from pyOpenSSL\n    until https://github.com/edsu/twarc/issues/72 is resolved\n    It also handles ChunkedEncodingError which has been observed in the wild.\n    \"\"\"\n    try:\n        import OpenSSL\n\n        ConnectionError = OpenSSL.SSL.SysCallError\n    except:\n        ConnectionError = None\n\n    def new_f(self, *args, **kwargs):\n        # Only handle if pyOpenSSL is installed.\n        if ConnectionError:\n            try:\n                return f(self, *args, **kwargs)\n            except (ConnectionError, ChunkedEncodingError) as e:\n                log.warning(\"caught connection reset error: %s\", e)\n                self.connect()\n                return f(self, *args, **kwargs)\n        else:\n            return f(self, *args, **kwargs)\n\n    return new_f\n\n\ndef catch_timeout(f):\n    \"\"\"\n    A decorator to handle read timeouts from Twitter.\n    \"\"\"\n\n    def new_f(self, *args, **kwargs):\n        try:\n            return f(self, *args, **kwargs)\n        except (ReadTimeout, ReadTimeoutError) as e:\n            log.warning(\"caught read timeout: %s\", e)\n            self.connect()\n            return f(self, *args, **kwargs)\n\n    return new_f\n\n\ndef catch_gzip_errors(f):\n    \"\"\"\n    A decorator to handle gzip encoding errors which have been known to\n    happen during hydration.\n    \"\"\"\n\n    def new_f(self, *args, **kwargs):\n        try:\n            return f(self, *args, **kwargs)\n        except ContentDecodingError as e:\n            log.warning(\"caught gzip error: %s\", e)\n            self.connect()\n            return f(self, *args, **kwargs)\n\n    return new_f\n\n\ndef interruptible_sleep(t, event=None):\n    \"\"\"\n    Sleeps for a specified duration, optionally stopping early for event.\n\n    Returns True if interrupted\n    \"\"\"\n    log.info(\"sleeping %s\", t)\n\n    if event is None:\n        
time.sleep(t)\n        return False\n    else:\n        return not event.wait(t)\n\n\ndef filter_protected(f):\n    \"\"\"\n    filter_protected will filter out protected tweets and users unless\n    explicitly requested not to.\n    \"\"\"\n\n    def new_f(self, *args, **kwargs):\n        for obj in f(self, *args, **kwargs):\n            if self.protected == False:\n                if \"user\" in obj and obj[\"user\"][\"protected\"]:\n                    continue\n                elif \"protected\" in obj and obj[\"protected\"]:\n                    continue\n            yield obj\n\n    return new_f\n"
  },
  {
    "path": "src/twarc/decorators2.py",
    "content": "import os\nimport time\nimport click\nimport logging\nimport requests\n\nimport datetime\nimport humanize\nfrom tqdm.auto import tqdm\nfrom functools import wraps\n\n\nlog = logging.getLogger(\"twarc\")\n\n\ndef rate_limit(f, tries=30):\n    \"\"\"\n    A decorator to handle rate limiting from the Twitter v2 API. If\n    a rate limit error is encountered we will sleep until we can\n    issue the API call again.\n    \"\"\"\n\n    @wraps(f)\n    def new_f(*args, **kwargs):\n        errors = 0\n        while True:\n            resp = f(*args, **kwargs)\n            if resp.status_code in [200, 201]:\n                errors = 0\n                return resp\n            elif resp.status_code == 429:\n                # Check the headers, and try to infer why we're hitting the\n                # rate limit. Because the search/all endpoints also have a\n                # 1r/s rate limit that isn't obvious in the headers, we need\n                # to infer the reason for the rate limit. Note that this is\n                # included to help debug problems with multiple concurrent\n                # clients - this shouldn't be hit in normal of operation of a\n                # single twarc client.\n                remaining = int(resp.headers[\"x-rate-limit-remaining\"])\n\n                # If we have a 429 rate limit, but there are remaining calls for\n                # this endpoint, we've probably hit the 1r/s limit.\n                if remaining:\n                    log.warning(\n                        \"Hit the 1 request/second rate limit, sleeping for 10 seconds. 
\"\n                        \"This shouldn't happen with normal usage of twarc, and may indicate \"\n                        \"multiple clients interacting with the Twitter API at the \"\n                        \"same time.\"\n                    )\n                    time.sleep(10)\n                    continue\n\n                # Just a regular 15 minute window rate limit.\n                else:\n                    reset = int(resp.headers[\"x-rate-limit-reset\"])\n                    now = time.time()\n\n                    # The time to sleep depends on having an accurate system time,\n                    # so check to see if there's something really bad happening\n                    # to warn the user.\n                    target_sleep_seconds = reset - now\n\n                    # Never sleep longer than 15 minutes, as that is the basis for\n                    # all of the read time based rate limits in the Twitter API\n                    seconds = min(901, max(10, (target_sleep_seconds + 10)))\n\n                    if target_sleep_seconds >= 900:\n                        # If we need to sleep for more than a rate limit period, the\n                        # system clock could be wrong.\n                        log.warning(\n                            \"Detected overlong sleep interval - is your system clock accurate? \"\n                            \"An accurate system time is needed to calculate how long to sleep for, \"\n                            \"and data collection might be slowed. \"\n                            f\"The rate limit resets at {reset} and the current time is {now}.\"\n                        )\n                    elif target_sleep_seconds < 0:\n                        # If we need to sleep for negative time something weird might be up.\n                        log.warning(\n                            \"Detected negative sleep interval - is your system clock accurate? 
\"\n                            \"If your system time is running fast, rate limiting may not be \"\n                            \"effective. \"\n                            f\"The rate limit resets at {reset} and the current time is {now}.\"\n                        )\n\n                    log.warning(\"rate limit exceeded: sleeping %s secs\", seconds)\n                    time.sleep(seconds)\n\n            elif resp.status_code >= 500:\n                errors += 1\n                if errors > tries:\n                    log.warning(f\"too many errors ({tries}) from Twitter, giving up\")\n                    resp.raise_for_status()\n                seconds = errors**2\n                log.warning(\n                    \"caught %s from Twitter API, sleeping %s\", resp.status_code, seconds\n                )\n                time.sleep(seconds)\n            else:\n                log.error(\"Unexpected HTTP response: %s\", resp)\n                resp.raise_for_status()\n\n    return new_f\n\n\ndef catch_request_exceptions(f, tries=30):\n    \"\"\"\n    A decorator to handle all request exceptions. This decorator will catch\n    *any* request level error, reconnect and try again. It does not handle\n    HTTP protocol level errors (404, 500) etc.\n\n    It will try up to tries times consecutively before giving up. 
A successful\n    call to f will result in the try counter being reset to 0.\n    \"\"\"\n\n    # pyOpenSSL has been known to throw these connection errors that need to be\n    # caught separately: https://github.com/edsu/twarc/issues/72\n\n    try:\n        import OpenSSL\n\n        ConnectionError = OpenSSL.SSL.SysCallError\n    except:\n        ConnectionError = requests.exceptions.ConnectionError\n\n    @wraps(f)\n    def new_f(self, *args, **kwargs):\n        errors = 0\n        while errors < tries:\n            try:\n                resp = f(self, *args, **kwargs)\n                errors = 0\n                return resp\n            except (requests.exceptions.RequestException, ConnectionError) as e:\n                # don't catch any HTTP errors since those are handled separately\n                if isinstance(e, requests.exceptions.HTTPError):\n                    raise e\n\n                errors += 1\n                log.warning(\"caught requests exception: %s\", e)\n                # use >= so the last failed attempt re-raises instead of\n                # falling out of the loop and silently returning None\n                if errors >= tries:\n                    log.error(f\"giving up, too many request exceptions: {tries}\")\n                    raise e\n                seconds = errors**2\n                log.info(\"sleeping %s\", seconds)\n                time.sleep(seconds)\n                self.connect()\n\n    return new_f\n\n\ndef interruptible_sleep(t, event=None):\n    \"\"\"\n    Sleeps for a specified duration, optionally stopping early for event.\n\n    Returns True if interrupted\n    \"\"\"\n    log.info(\"sleeping %s\", t)\n\n    if event is None:\n        time.sleep(t)\n        return False\n    else:\n        return not event.wait(t)\n\n\nclass cli_api_error:\n    \"\"\"\n    A decorator to catch HTTP errors for the command line.\n    \"\"\"\n\n    def __init__(self, f):\n        self.f = f\n        # this is needed for click help docs to work properly\n        self.__doc__ = f.__doc__\n\n    def __call__(self, *args, **kwargs):\n        try:\n            return self.f(*args, 
**kwargs)\n        except requests.exceptions.HTTPError as e:\n            try:\n                result = e.response.json()\n                if \"errors\" in result:\n                    for error in result[\"errors\"]:\n                        msg = error.get(\"message\", \"Unknown error\")\n                elif \"title\" in result:\n                    msg = result[\"title\"]\n                else:\n                    msg = \"Unknown error\"\n            except ValueError:\n                msg = f\"Unable to parse {e.response.status_code} error as JSON: {e.response.text}\"\n        except InvalidAuthType as e:\n            msg = \"This command requires application authentication, try passing --app-auth\"\n        except ValueError as e:\n            msg = str(e)\n        click.echo(\n            click.style(\"⚡ \", fg=\"yellow\") + click.style(msg, fg=\"red\"), err=True\n        )\n\n\ndef requires_app_auth(f):\n    \"\"\"\n    Ensure that application authentication is set for calls that only work in that mode.\n    \"\"\"\n\n    @wraps(f)\n    def new_f(self, *args, **kwargs):\n        if self.auth_type != \"application\":\n            raise InvalidAuthType(\n                \"This endpoint only works with application authentication\"\n            )\n\n        else:\n            return f(self, *args, **kwargs)\n\n    return new_f\n\n\nclass InvalidAuthType(Exception):\n    \"\"\"\n    Raised when the endpoint called is not supported by the current auth type.\n    \"\"\"\n\n\nclass FileLineProgressBar(tqdm):\n    \"\"\"\n    A progress bar based on input file line count. 
Counts an input file by lines.\n    This tries to read the entire file and count newlines in a robust way.\n    \"\"\"\n\n    def __init__(self, infile, outfile, **kwargs):\n        disable = False if \"disable\" not in kwargs else kwargs[\"disable\"]\n        if infile is not None and (infile.name == \"<stdin>\"):\n            disable = True\n        if outfile is not None and (outfile.name == \"<stdout>\"):\n            disable = True\n        kwargs[\"disable\"] = disable\n        kwargs[\"miniters\"] = 1\n        kwargs[\n            \"bar_format\"\n        ] = \"{l_bar}{bar}| Processed {n_fmt}/{total_fmt} lines of input file [{elapsed}<{remaining}, {rate_fmt}{postfix}]\"\n\n        # Warn for large (> 1 GB) input files:\n        if not disable and (os.stat(infile.name).st_size / (1024 * 1024 * 1024)) > 1:\n            click.echo(\n                click.style(\n                    f\"Input File Size is {os.stat(infile.name).st_size / (1024*1024):.2f} MB, it may take a while to process. CTRL+C to stop.\",\n                    fg=\"yellow\",\n                    bold=True,\n                ),\n                err=True,\n            )\n\n        def total_lines():\n            with open(infile.name, \"r\", encoding=\"utf-8\", errors=\"ignore\") as f:\n                return sum(1 for _ in f)\n\n        kwargs[\"total\"] = total_lines() if not disable else 1\n        super().__init__(**kwargs)\n\n    def update_with_result(\n        self, result, field=\"id\", error_resource_type=None, error_parameter=\"ids\"\n    ):\n        \"\"\"\n        Update the progress bar appropriately, with a full API response. 
For convenience,\n        and drop in compatibility with FileSizeProgressBar otherwise use tqdm's update().\n        \"\"\"\n\n        try:\n            if \"data\" in result:\n                for item in result[\"data\"]:\n                    self.update()\n            if error_resource_type and \"errors\" in result:\n                for error in result[\"errors\"]:\n                    # Account for deleted data\n                    # Errors have very inconsistent format, missing fields for different types of errors...\n                    if (\n                        \"resource_type\" in error\n                        and error[\"resource_type\"] == error_resource_type\n                    ):\n                        if (\n                            \"parameter\" in error\n                            and error[\"parameter\"] == error_parameter\n                        ):\n                            self.update()\n                            # todo: hide or show this?\n                            # self.set_description(\n                            #    \"Errors encountered, results may be incomplete\"\n                            # )\n                        # print(error[\"value\"], error[\"resource_type\"], error[\"parameter\"])\n        except Exception as e:\n            log.error(f\"Failed to update progress bar: {e}\")\n\n\nclass FileSizeProgressBar(tqdm):\n    \"\"\"\n    An input file size based progress bar. 
Counts an input file in bytes.\n    This will also dig into the responses and add up the outputs to match the file size.\n    Overrides `disable` parameter if file is a pipe.\n    \"\"\"\n\n    def __init__(self, infile, outfile, **kwargs):\n        disable = False if \"disable\" not in kwargs else kwargs[\"disable\"]\n        if infile is not None and (infile.name == \"<stdin>\"):\n            disable = True\n        if outfile is not None and (outfile.name == \"<stdout>\"):\n            disable = True\n        kwargs[\"disable\"] = disable\n        kwargs[\"unit\"] = \"B\"\n        kwargs[\"unit_scale\"] = True\n        kwargs[\"unit_divisor\"] = 1024\n        kwargs[\"miniters\"] = 1\n        kwargs[\n            \"bar_format\"\n        ] = \"{l_bar}{bar}| Processed {n_fmt}/{total_fmt} of input file [{elapsed}<{remaining}, {rate_fmt}{postfix}]\"\n        kwargs[\"total\"] = os.stat(infile.name).st_size if not disable else 1\n        super().__init__(**kwargs)\n\n    def update_with_result(\n        self, result, field=\"id\", error_resource_type=None, error_parameter=\"ids\"\n    ):\n        \"\"\"\n        Update the progress bar appropriately, with a full API response. 
For convenience,\n        otherwise use tqdm's own update() method.\n        \"\"\"\n        try:\n            if \"data\" in result:\n                for item in result[\"data\"]:\n                    # Use the length of the id / name and a newline to match original file\n                    self.update(len(item[field]) + len(\"\\n\"))\n            if error_resource_type and \"errors\" in result:\n                for error in result[\"errors\"]:\n                    # Account for deleted data\n                    # Errors have very inconsistent format, missing fields for different types of errors...\n                    if (\n                        \"resource_type\" in error\n                        and error[\"resource_type\"] == error_resource_type\n                    ):\n                        if (\n                            \"parameter\" in error\n                            and error[\"parameter\"] == error_parameter\n                        ):\n                            self.update(len(error[\"value\"]) + len(\"\\n\"))\n                            # todo: hide or show this?\n                            # self.set_description(\n                            #    \"Errors encountered, results may be incomplete\"\n                            # )\n                        # print(error[\"value\"], error[\"resource_type\"], error[\"parameter\"])\n        except Exception as e:\n            log.error(f\"Failed to update progress bar: {e}\")\n\n\nclass TimestampProgressBar(tqdm):\n    \"\"\"\n    A Timestamp based progress bar. 
Counts timestamp ranges in milliseconds.\n    This can be used to display a progress bar for tweet ids and time ranges.\n    \"\"\"\n\n    def __init__(self, since_id, until_id, start_time, end_time, **kwargs):\n        self.early_stop = True\n        self.tweet_count = 0\n\n        disable = False if \"disable\" not in kwargs else kwargs[\"disable\"]\n        kwargs[\"disable\"] = disable\n\n        if start_time is None and (since_id is None and until_id is None):\n            start_time = datetime.datetime.now(\n                datetime.timezone.utc\n            ) - datetime.timedelta(days=7)\n        if end_time is None and (since_id is None and until_id is None):\n            end_time = datetime.datetime.now(\n                datetime.timezone.utc\n            ) - datetime.timedelta(seconds=30)\n\n        if since_id and not until_id:\n            until_id = _millis2snowflake(\n                _date2millis(datetime.datetime.now(datetime.timezone.utc))\n            )\n\n        if until_id and not since_id:\n            since_id = 1\n\n        total = (\n            _snowflake2millis(until_id) - _snowflake2millis(since_id)\n            if (since_id and until_id)\n            else _date2millis(end_time) - _date2millis(start_time)\n        )\n\n        kwargs[\"miniters\"] = 1\n        kwargs[\"total\"] = total\n        tweets_timeline_format = \"{l_bar}{bar}| Processed {n_time}/{total_time} [{elapsed}<{remaining}, {tweet_count} tweets total {postfix}]\"\n        kwargs[\"bar_format\"] = (\n            tweets_timeline_format\n            if \"bar_format\" not in kwargs\n            else kwargs[\"bar_format\"]\n        )\n        super().__init__(**kwargs)\n\n    def update_with_dates(self, start_span, end_span):\n        \"\"\"\n        Update the progress bar with a start and end time span.\n        \"\"\"\n        try:\n            if isinstance(start_span, str):\n                start_span = datetime.datetime.strptime(\n                    start_span, 
\"%Y-%m-%dT%H:%M:%S.%fZ\"\n                )\n            if isinstance(end_span, str):\n                end_span = datetime.datetime.strptime(end_span, \"%Y-%m-%dT%H:%M:%S.%fZ\")\n            n = _date2millis(end_span) - _date2millis(start_span)\n            if self.n + n > self.total:\n                self.n = self.total\n            else:\n                self.update(n)\n        except Exception as e:\n            log.error(f\"Failed to update progress bar: {e}\")\n\n    def update_with_result(self, result):\n        \"\"\"\n        Update progress bar based on snowflake ids from an API response.\n        \"\"\"\n        try:\n            newest_id = result[\"meta\"][\"newest_id\"]\n            oldest_id = result[\"meta\"][\"oldest_id\"]\n            n = _snowflake2millis(int(newest_id)) - _snowflake2millis(int(oldest_id))\n            self.update(n)\n            self.tweet_count += len(result[\"data\"])\n        except Exception as e:\n            log.error(f\"Failed to update progress bar: {e}\")\n\n    @property\n    def format_dict(self):\n        d = super(TimestampProgressBar, self).format_dict  # original format dict\n        tweets_per_second = int(self.tweet_count / d[\"elapsed\"] if d[\"elapsed\"] else 0)\n        n_time = humanize.naturaldelta(datetime.timedelta(seconds=int(d[\"n\"]) // 1000))\n        total_time = humanize.naturaldelta(\n            datetime.timedelta(seconds=int(d[\"total\"]) // 1000)\n        )\n        d.update(n_time=n_time)\n        d.update(total_time=total_time)\n        d.update(tweet_count=self.tweet_count)\n        d.update(tweets_per_second=tweets_per_second)\n        return d\n\n    def close(self):\n        if not self.early_stop:\n            # Finish the bar to 100% even if the last tweet ids do not cover the full time range\n            self.update(self.total - self.n)\n        super().close()\n\n\ndef _date2millis(dt):\n    return int(dt.timestamp() * 1000)\n\n\ndef _millis2date(ms):\n    return 
datetime.datetime.utcfromtimestamp(ms // 1000).replace(\n        microsecond=ms % 1000 * 1000\n    )\n\n\ndef _snowflake2millis(snowflake_id):\n    return (snowflake_id >> 22) + 1288834974657\n\n\ndef _millis2snowflake(ms):\n    return (int(ms) - 1288834974657) << 22\n"
  },
  {
    "path": "src/twarc/expansions.py",
    "content": "\"\"\"\nThis module contains a list of the known Twitter V2+ API expansions and fields\nfor each expansion, and a function flatten() for \"flattening\" a result set, \nincluding all expansions inline. \n\nensure_flattened() can be used in tweet processing programs that need to make \nsure that data is flattened.\n\"\"\"\n\nimport logging\nfrom itertools import chain\nfrom collections import defaultdict\n\nlog = logging.getLogger(\"twarc\")\n\nEXPANSIONS = [\n    \"author_id\",\n    \"in_reply_to_user_id\",\n    \"referenced_tweets.id\",\n    \"referenced_tweets.id.author_id\",\n    \"entities.mentions.username\",\n    \"attachments.poll_ids\",\n    \"attachments.media_keys\",\n    \"geo.place_id\",\n    \"edit_history_tweet_ids\",\n]\n\n\nUSER_FIELDS = [\n    \"created_at\",\n    \"description\",\n    \"entities\",\n    \"id\",\n    \"location\",\n    \"name\",\n    \"pinned_tweet_id\",\n    \"profile_image_url\",\n    \"protected\",\n    \"public_metrics\",\n    \"url\",\n    \"username\",\n    \"verified\",\n    \"verified_type\",\n    \"withheld\",\n]\n\nTWEET_FIELDS = [\n    \"attachments\",\n    \"author_id\",\n    \"context_annotations\",\n    \"conversation_id\",\n    \"created_at\",\n    \"entities\",\n    \"geo\",\n    \"id\",\n    \"in_reply_to_user_id\",\n    \"lang\",\n    \"public_metrics\",\n    # \"non_public_metrics\", # private\n    # \"organic_metrics\", # private\n    # \"promoted_metrics\", # private\n    \"text\",\n    \"possibly_sensitive\",\n    \"referenced_tweets\",\n    \"reply_settings\",\n    \"source\",\n    \"withheld\",\n    \"edit_controls\",\n    \"edit_history_tweet_ids\",\n]\n\nMEDIA_FIELDS = [\n    \"alt_text\",\n    \"duration_ms\",\n    \"height\",\n    \"media_key\",\n    \"preview_image_url\",\n    \"type\",\n    \"url\",\n    \"width\",\n    \"variants\",\n    # \"non_public_metrics\", # private\n    # \"organic_metrics\", # private\n    # \"promoted_metrics\", # private\n    
\"public_metrics\",\n]\n\nPOLL_FIELDS = [\"duration_minutes\", \"end_datetime\", \"id\", \"options\", \"voting_status\"]\n\nPLACE_FIELDS = [\n    \"contained_within\",\n    \"country\",\n    \"country_code\",\n    \"full_name\",\n    \"geo\",\n    \"id\",\n    \"name\",\n    \"place_type\",\n]\n\nLIST_FIELDS = [\n    \"id\",\n    \"name\",\n    \"owner_id\",\n    \"created_at\",\n    \"member_count\",\n    \"follower_count\",\n    \"private\",\n    \"description\",\n]\n\n\ndef extract_includes(response, expansion, _id=\"id\"):\n    if \"includes\" in response and expansion in response[\"includes\"]:\n        return defaultdict(\n            lambda: {},\n            {include[_id]: include for include in response[\"includes\"][expansion]},\n        )\n    else:\n        return defaultdict(lambda: {})\n\n\ndef flatten(response):\n    \"\"\"\n    Flatten an API response by moving all \"included\" entities inline with the\n    tweets they are referenced from. flatten expects an entire page response\n    from the API (data, includes, meta) and will raise a ValueError if what is\n    passed in does not appear to be an API response. It will return a list of\n    dictionaries where each dictionary represents a tweet. 
Empty objects will\n    be returned for things that are missing in includes, which can happen when\n    protected or deleted users or tweets are referenced.\n    \"\"\"\n\n    # Users extracted both by id and by username for expanding mentions\n    includes_users = defaultdict(\n        lambda: {},\n        {\n            **extract_includes(response, \"users\", \"id\"),\n            **extract_includes(response, \"users\", \"username\"),\n        },\n    )\n    # Media is by media_key, not id\n    includes_media = extract_includes(response, \"media\", \"media_key\")\n    includes_polls = extract_includes(response, \"polls\")\n    includes_places = extract_includes(response, \"places\")\n    # Tweets in includes will themselves be expanded\n    includes_tweets = extract_includes(response, \"tweets\")\n    # Errors are returned but unused here for now\n    includes_errors = extract_includes(response, \"errors\")\n\n    def expand_payload(payload):\n        \"\"\"\n        Recursively step through an object and sub objects and append extra data.\n        Can be applied to any tweet, list of tweets, sub object of tweet etc.\n        \"\"\"\n\n        # Don't try to expand on primitive values, return strings as is:\n        if isinstance(payload, (str, bool, int, float)):\n            return payload\n        # expand list items individually:\n        elif isinstance(payload, list):\n            payload = [expand_payload(item) for item in payload]\n            return payload\n        # Try to expand on dicts within dicts:\n        elif isinstance(payload, dict):\n            for key, value in payload.items():\n                payload[key] = expand_payload(value)\n\n        if \"author_id\" in payload:\n            payload[\"author\"] = includes_users[payload[\"author_id\"]]\n\n        if \"in_reply_to_user_id\" in payload:\n            payload[\"in_reply_to_user\"] = includes_users[payload[\"in_reply_to_user_id\"]]\n\n        if \"media_keys\" in payload:\n            
payload[\"media\"] = list(\n                includes_media[media_key] for media_key in payload[\"media_keys\"]\n            )\n\n        if \"poll_ids\" in payload and len(payload[\"poll_ids\"]) > 0:\n            poll_id = payload[\"poll_ids\"][-1]  # only ever 1 poll per tweet.\n            payload[\"poll\"] = includes_polls[poll_id]\n\n        if \"geo\" in payload and \"place_id\" in payload[\"geo\"]:\n            place_id = payload[\"geo\"][\"place_id\"]\n            payload[\"geo\"] = {**payload[\"geo\"], **includes_places[place_id]}\n\n        if \"mentions\" in payload:\n            payload[\"mentions\"] = list(\n                {**referenced_user, **includes_users[referenced_user[\"username\"]]}\n                for referenced_user in payload[\"mentions\"]\n            )\n\n        if \"referenced_tweets\" in payload:\n            payload[\"referenced_tweets\"] = list(\n                {**referenced_tweet, **includes_tweets[referenced_tweet[\"id\"]]}\n                for referenced_tweet in payload[\"referenced_tweets\"]\n            )\n\n        if \"pinned_tweet_id\" in payload:\n            payload[\"pinned_tweet\"] = includes_tweets[payload[\"pinned_tweet_id\"]]\n\n        return payload\n\n    # First expand the tweets in \"includes\", before processing actual result tweets:\n    for included_id, included_tweet in extract_includes(response, \"tweets\").items():\n        includes_tweets[included_id] = expand_payload(included_tweet)\n\n    # Now expand the list of tweets or an individual tweet in \"data\"\n    tweets = []\n    if \"data\" in response:\n        data = response[\"data\"]\n\n        if isinstance(data, list):\n            tweets = expand_payload(response[\"data\"])\n        elif isinstance(data, dict):\n            tweets = [expand_payload(response[\"data\"])]\n\n        # Add the __twarc metadata and matching rules to each tweet if it's a result set\n        if \"__twarc\" in response:\n            for tweet in tweets:\n                
tweet[\"__twarc\"] = response[\"__twarc\"]\n        if \"matching_rules\" in response:\n            for tweet in tweets:\n                tweet[\"matching_rules\"] = response[\"matching_rules\"]\n    else:\n        raise ValueError(f\"missing data stanza in response: {response}\")\n\n    return tweets\n\n\ndef ensure_flattened(data):\n    \"\"\"\n    Will ensure that the supplied data is \"flattened\". The input data can be a\n    response from the Twitter API, a list of tweet dictionaries, or a single tweet\n    dictionary. It will always return a list of tweet dictionaries. A ValueError\n    will be thrown if the supplied data is not recognizable or it cannot be\n    flattened.\n\n    ensure_flattened is designed for use in twarc plugins and other tweet\n    processing applications that want to operate on a stream of tweets, and\n    examine included entities like users and tweets without hunting and\n    pecking in the response data.\n    \"\"\"\n\n    # If it's a single response from the API, with data and includes, we flatten it:\n    if isinstance(data, dict) and \"data\" in data and \"includes\" in data:\n        return flatten(data)\n\n    # If it's a single response with data, but without includes:\n    elif isinstance(data, dict) and \"data\" in data and \"includes\" not in data:\n        # flatten() will still work, just with {} empty expansions, log a warning.\n        log.warning(f\"Unable to expand dictionary without includes: {data}\")\n        return flatten(data)\n\n    # If it's just an object with errors return an empty list\n    elif (\n        isinstance(data, dict)\n        and \"data\" not in data\n        and \"includes\" not in data\n        and \"errors\" in data\n    ):\n        return []\n\n    # If it's a single response and both \"includes\" and \"data\" are missing, it is already flattened\n    elif isinstance(data, dict) and \"data\" not in data and \"includes\" not in data:\n        return [data]\n\n    # If it's a list of objects 
(could be list of responses, or tweets, or users):\n    elif isinstance(data, list) and len(data) > 0 and isinstance(data[0], dict):\n        # A list of full responses: flatten each one and return a single list\n        if \"data\" in data[0] and \"includes\" in data[0]:\n            return list(chain.from_iterable([flatten(item) for item in data]))\n        # A list of responses without includes: log a warning, then flatten\n        # each one (with empty expansions) and return a single list\n        elif \"data\" in data[0] and \"includes\" not in data[0]:\n            log.warning(f\"Unable to expand dictionary without includes: {data[0]}\")\n            return list(chain.from_iterable([flatten(item) for item in data]))\n        # Return already flattened data as is\n        elif \"data\" not in data[0] and \"includes\" not in data[0]:\n            return data\n        # \"includes\" without \"data\" is not recognizable: raise instead of\n        # silently returning None\n        else:\n            raise ValueError(f\"Cannot flatten unrecognized data: {data}\")\n    # Unknown format, eg: list of lists, or primitive\n    else:\n        raise ValueError(f\"Cannot flatten unrecognized data: {data}\")\n"
  },
  {
    "path": "src/twarc/handshake.py",
    "content": "\"\"\"\nA function for asking the user for their Twitter API keys.\n\"\"\"\n\nimport requests\n\nfrom requests_oauthlib import OAuth1\nfrom urllib.parse import parse_qs\n\n\ndef handshake():\n    # Default empty keys\n    consumer_key = \"\"\n    consumer_secret = \"\"\n    access_token = \"\"\n    access_token_secret = \"\"\n\n    bearer_token = input(\n        \"Please enter your Bearer Token (leave blank to skip to API key configuration): \"\n    )\n\n    if bearer_token:\n        continue_adding = input(\n            \"(Optional) Add API keys and secrets for user mode authentication [y or n]? \"\n        )\n\n        # Save a config with just the bearer_token\n        if continue_adding.lower() != \"y\":\n            return {\"bearer_token\": bearer_token}\n        else:\n            \"Configure API keys and secrets.\"\n\n    consumer_key = input(\"Please enter your API key: \")\n    consumer_secret = input(\"Please enter your API secret: \")\n\n    # verify that the keys work to get the bearer token\n    url = \"https://api.twitter.com/oauth2/token\"\n    params = {\"grant_type\": \"client_credentials\"}\n    auth = requests.auth.HTTPBasicAuth(consumer_key, consumer_secret)\n    try:\n        resp = requests.post(url, params, auth=auth)\n        resp.raise_for_status()\n        result = resp.json()\n        bearer_token = result[\"access_token\"]\n    except Exception as e:\n        return None\n\n    answered = False\n    while not answered:\n        print(\n            \"\\nHow would you like twarc to obtain your user keys?\\n\\n1) generate access keys by visiting Twitter\\n2) manually enter your access token and secret\\n\"\n        )\n        answer = input(\"Please enter your choice [1 or 2] \")\n        if answer == \"1\":\n            answered = True\n            generate = True\n        elif answer == \"2\":\n            answered = True\n            generate = False\n\n    if generate:\n        request_token_url = 
\"https://api.twitter.com/oauth/request_token\"\n        oauth = OAuth1(consumer_key, client_secret=consumer_secret)\n        r = requests.post(url=request_token_url, auth=oauth)\n\n        credentials = parse_qs(r.text)\n        if not credentials:\n            print(\"\\nError: invalid credentials.\")\n            print(\n                \"Please check that you are copying and pasting correctly and try again.\\n\"\n            )\n            return\n\n        resource_owner_key = credentials.get(\"oauth_token\")[0]\n        resource_owner_secret = credentials.get(\"oauth_token_secret\")[0]\n\n        base_authorization_url = \"https://api.twitter.com/oauth/authorize\"\n        authorize_url = base_authorization_url + \"?oauth_token=\" + resource_owner_key\n        print(\n            \"\\nPlease log into Twitter and visit this URL in your browser:\\n%s\"\n            % authorize_url\n        )\n        verifier = input(\n            \"\\nAfter you have authorized the application please enter the displayed PIN: \"\n        )\n\n        access_token_url = \"https://api.twitter.com/oauth/access_token\"\n        oauth = OAuth1(\n            consumer_key,\n            client_secret=consumer_secret,\n            resource_owner_key=resource_owner_key,\n            resource_owner_secret=resource_owner_secret,\n            verifier=verifier,\n        )\n        r = requests.post(url=access_token_url, auth=oauth)\n        credentials = parse_qs(r.text)\n\n        if not credentials:\n            print(\"\\nError: invalid PIN\")\n            print(\"Please check that you entered the PIN correctly and try again.\\n\")\n            return\n\n        access_token = resource_owner_key = credentials.get(\"oauth_token\")[0]\n        access_token_secret = credentials.get(\"oauth_token_secret\")[0]\n\n        screen_name = credentials.get(\"screen_name\")[0]\n    else:\n        access_token = input(\"Enter your Access Token: \")\n        access_token_secret = input(\"Enter your 
Access Token Secret: \")\n        screen_name = \"default\"\n\n    return {\n        \"consumer_key\": consumer_key,\n        \"consumer_secret\": consumer_secret,\n        \"access_token\": access_token,\n        \"access_token_secret\": access_token_secret,\n        \"bearer_token\": bearer_token,\n    }\n"
  },
  {
    "path": "src/twarc/json2csv.py",
    "content": "#!/usr/bin/env python\n\nimport sys\n\nfrom dateutil.parser import parse as date_parse\nfrom six import string_types\n\nif sys.version_info[0] < 3:\n    try:\n        import unicodecsv as csv\n    except ImportError:\n        sys.exit(\"unicodecsv is required for python 2\")\nelse:\n    import csv\n\n\ndef get_headings():\n    return [\n        \"id\",\n        \"tweet_url\",\n        \"created_at\",\n        \"parsed_created_at\",\n        \"user_screen_name\",\n        \"text\",\n        \"tweet_type\",\n        \"coordinates\",\n        \"hashtags\",\n        \"media\",\n        \"urls\",\n        \"favorite_count\",\n        \"in_reply_to_screen_name\",\n        \"in_reply_to_status_id\",\n        \"in_reply_to_user_id\",\n        \"lang\",\n        \"place\",\n        \"possibly_sensitive\",\n        \"retweet_count\",\n        \"retweet_or_quote_id\",\n        \"retweet_or_quote_screen_name\",\n        \"retweet_or_quote_user_id\",\n        \"source\",\n        \"user_id\",\n        \"user_created_at\",\n        \"user_default_profile_image\",\n        \"user_description\",\n        \"user_favourites_count\",\n        \"user_followers_count\",\n        \"user_friends_count\",\n        \"user_listed_count\",\n        \"user_location\",\n        \"user_name\",\n        \"user_statuses_count\",\n        \"user_time_zone\",\n        \"user_urls\",\n        \"user_verified\",\n    ]\n\n\ndef get_row(t, excel=False):\n    get = t.get\n    user = t.get(\"user\").get\n    return [\n        get(\"id_str\"),\n        tweet_url(t),\n        get(\"created_at\"),\n        date_parse(get(\"created_at\")),\n        user(\"screen_name\"),\n        text(t) if not excel else clean_str(text(t)),\n        tweet_type(t),\n        coordinates(t),\n        hashtags(t),\n        media(t),\n        urls(t),\n        favorite_count(t),\n        get(\"in_reply_to_screen_name\"),\n        get(\"in_reply_to_status_id\"),\n        get(\"in_reply_to_user_id\"),\n        
get(\"lang\"),\n        place(t),\n        get(\"possibly_sensitive\"),\n        get(\"retweet_count\"),\n        retweet_id(t),\n        retweet_screen_name(t),\n        retweet_user_id(t),\n        get(\"source\"),\n        user(\"id_str\"),\n        user(\"created_at\"),\n        user(\"default_profile_image\"),\n        user(\"description\") if not excel else clean_str(user(\"description\")),\n        user(\"favourites_count\"),\n        user(\"followers_count\"),\n        user(\"friends_count\"),\n        user(\"listed_count\"),\n        user(\"location\") if not excel else clean_str(user(\"location\")),\n        user(\"name\") if not excel else clean_str(user(\"name\")),\n        user(\"statuses_count\"),\n        user(\"time_zone\"),\n        user_urls(t),\n        user(\"verified\"),\n    ]\n\n\ndef clean_str(string):\n    if isinstance(string, string_types):\n        return string.replace(\"\\n\", \" \").replace(\"\\r\", \"\")\n    return None\n\n\ndef text(t):\n    # need to look at original tweets for retweets for full text\n    if t.get(\"retweeted_status\"):\n        t = t.get(\"retweeted_status\")\n\n    if \"extended_tweet\" in t:\n        return t[\"extended_tweet\"][\"full_text\"]\n    elif \"full_text\" in t:\n        return t[\"full_text\"]\n    else:\n        return t[\"text\"]\n\n\ndef coordinates(t):\n    if \"coordinates\" in t and t[\"coordinates\"]:\n        return \"%f %f\" % tuple(t[\"coordinates\"][\"coordinates\"])\n    return None\n\n\ndef hashtags(t):\n    # If it's a retweet, the hashtags might be cutoff in the retweet object, so check\n    # the enclosed original tweet for the full list.\n    if \"retweeted_status\" in t:\n        hashtags = t[\"retweeted_status\"][\"entities\"][\"hashtags\"]\n    else:\n        hashtags = t[\"entities\"][\"hashtags\"]\n\n    return \" \".join(h[\"text\"] for h in hashtags)\n\n\ndef media(t):\n    if \"extended_entities\" in t and \"media\" in t[\"extended_entities\"]:\n        return \" 
\".join([h[\"media_url_https\"] for h in t[\"extended_entities\"][\"media\"]])\n    elif \"media\" in t[\"entities\"]:\n        return \" \".join([h[\"media_url_https\"] for h in t[\"entities\"][\"media\"]])\n    else:\n        return None\n\n\ndef urls(t):\n    return \" \".join([h[\"expanded_url\"] or \"\" for h in t[\"entities\"][\"urls\"]])\n\n\ndef place(t):\n    if \"place\" in t and t[\"place\"]:\n        return t[\"place\"][\"full_name\"]\n\n\ndef retweet_id(t):\n    if \"retweeted_status\" in t and t[\"retweeted_status\"]:\n        return t[\"retweeted_status\"][\"id_str\"]\n    elif \"quoted_status\" in t and t[\"quoted_status\"]:\n        return t[\"quoted_status\"][\"id_str\"]\n\n\ndef retweet_screen_name(t):\n    if \"retweeted_status\" in t and t[\"retweeted_status\"]:\n        return t[\"retweeted_status\"][\"user\"][\"screen_name\"]\n    elif \"quoted_status\" in t and t[\"quoted_status\"]:\n        return t[\"quoted_status\"][\"user\"][\"screen_name\"]\n\n\ndef retweet_user_id(t):\n    if \"retweeted_status\" in t and t[\"retweeted_status\"]:\n        return t[\"retweeted_status\"][\"user\"][\"id_str\"]\n    elif \"quoted_status\" in t and t[\"quoted_status\"]:\n        return t[\"quoted_status\"][\"user\"][\"id_str\"]\n\n\ndef favorite_count(t):\n    if \"retweeted_status\" in t and t[\"retweeted_status\"]:\n        return t[\"retweeted_status\"][\"favorite_count\"]\n    else:\n        return t[\"favorite_count\"]\n\n\ndef tweet_url(t):\n    return \"https://twitter.com/%s/status/%s\" % (t[\"user\"][\"screen_name\"], t[\"id_str\"])\n\n\ndef user_urls(t):\n    u = t.get(\"user\")\n    if not u:\n        return None\n    urls = []\n    if \"entities\" in u and \"url\" in u[\"entities\"] and \"urls\" in u[\"entities\"][\"url\"]:\n        for url in u[\"entities\"][\"url\"][\"urls\"]:\n            if url[\"expanded_url\"]:\n                urls.append(url[\"expanded_url\"])\n    return \" \".join(urls)\n\n\ndef tweet_type(t):\n    # Determine the type 
of a tweet\n    if t.get(\"in_reply_to_status_id\"):\n        return \"reply\"\n    if \"retweeted_status\" in t:\n        return \"retweet\"\n    if \"quoted_status\" in t:\n        return \"quote\"\n    return \"original\"\n"
  },
  {
    "path": "src/twarc/version.py",
    "content": "import platform\n\nversion = \"2.14.1\"\n\nuser_agent = f\"twarc/{version} ({platform.system()} {platform.machine()}) {platform.python_implementation()}/{platform.python_version()}\"\n"
  },
  {
    "path": "test_twarc.py",
    "content": "import os\nimport re\nimport json\nimport time\nimport dotenv\nimport pytest\nimport logging\nimport datetime\n\ndotenv.load_dotenv()\n\ntry:\n    from unittest.mock import patch, call, MagicMock  # Python 3\nexcept ImportError:\n    from mock import patch, call, MagicMock  # Python 2\n\nfrom requests_oauthlib import OAuth1Session\nimport requests\n\nimport twarc\nfrom twarc import json2csv\n\n\"\"\"\n\nYou will need to have these environment variables set to run these tests:\n\n* CONSUMER_KEY\n* CONSUMER_SECRET\n* ACCESS_TOKEN\n* ACCESS_TOKEN_SECRET\n\nTo run the premium tests, you will need to set the following environment variable:\n\nTWITTER_ENV\n\nTo run the gnip test, you will need to set the following environment variables:\n\nGNIP_ENV\nGNIP_ACCOUNT\nGNIP_USERNAME\nGNIP_PASSWORD\n\n\"\"\"\n\nlogging.basicConfig(filename=\"test.log\", level=logging.INFO)\nT = twarc.Twarc()\n\n\ndef test_search():\n    count = 0\n    for tweet in T.search(\"obama\"):\n        assert tweet[\"id_str\"]\n        count += 1\n        if count == 10:\n            break\n    assert count == 10\n\n\ndef test_search_max_pages():\n    tweets = list(T.search(\"obama\", max_pages=1))\n    assert 0 < len(tweets) <= 100\n    tweets = list(T.search(\"obama\", max_pages=2))\n    assert 100 < len(tweets) <= 200\n\n\ndef test_since_id():\n    for tweet in T.search(\"obama\"):\n        id = tweet[\"id_str\"]\n        break\n    assert id\n    time.sleep(5)\n    for tweet in T.search(\"obama\", since_id=id):\n        assert tweet[\"id_str\"] > id\n\n\ndef test_max_id():\n    for tweet in T.search(\"obama\"):\n        id = tweet[\"id_str\"]\n        break\n    assert id\n    time.sleep(5)\n    count = 0\n    for tweet in T.search(\"obama\", max_id=id):\n        count += 1\n        assert tweet[\"id_str\"] <= id\n        if count > 100:\n            break\n\n\ndef test_max_and_since_ids():\n    max_id = since_id = None\n    count = 0\n    for tweet in T.search(\"obama\"):\n        
count += 1\n        if not max_id:\n            max_id = tweet[\"id_str\"]\n        since_id = tweet[\"id_str\"]\n        if count > 500:\n            break\n    count = 0\n    for tweet in T.search(\"obama\", max_id=max_id, since_id=since_id):\n        count += 1\n        assert tweet[\"id_str\"] <= max_id\n        assert tweet[\"id_str\"] > since_id\n\n\ndef test_paging():\n    # pages are 100 tweets big so if we can get 500 paging is working\n    count = 0\n    for tweet in T.search(\"obama\"):\n        count += 1\n        if count == 500:\n            break\n    assert count == 500\n\n\ndef test_geocode():\n    # look for tweets from New York ; the search radius is larger than NYC\n    # so hopefully we'll find one from New York in the first 500?\n    count = 0\n    found = False\n\n    for tweet in T.search(None, geocode=\"40.7484,-73.9857,1mi\"):\n        if (tweet[\"place\"] or {}).get(\"name\") == \"Manhattan\":\n            found = True\n            break\n        if count > 500:\n            break\n        count += 1\n\n    assert found\n\n\n@pytest.mark.skip(reason=\"v1.1 filter API disabled March 2023\")\ndef test_track():\n    tweet = next(T.filter(track=\"obama\"))\n    json_str = json.dumps(tweet)\n\n    assert re.search(\"obama\", json_str, re.IGNORECASE)\n\n    # reconnect to close streaming connection for other tests\n    T.connect()\n\n\n@pytest.mark.skip(reason=\"v1.1 filter API disabled March 2023\")\ndef test_keepalive():\n    for event in T.filter(track=\"abcdefghiklmno\", record_keepalive=True):\n        if event == \"keep-alive\":\n            break\n\n    # reconnect to close streaming connection for other tests\n    T.connect()\n\n\n@pytest.mark.skip(reason=\"v1.1 filter API disabled March 2023\")\ndef test_follow():\n    user_ids = [\n        \"87818409\",  # @guardian\n        \"428333\",  # @cnnbrk\n        \"5402612\",  # @BBCBreaking\n        \"2467791\",  # @washingtonpost\n        \"1020058453\",  # @BuzzFeedNews\n        
\"23484039\",  # WSJbreakingnews\n        \"384438102\",  # ABCNewsLive\n        \"87416722\",  # SkyNewsBreak\n    ]\n    found = False\n\n    for tweet in T.filter(follow=\",\".join(user_ids)):\n        assert tweet[\"id_str\"]\n        if tweet[\"user\"][\"id_str\"] in user_ids:\n            found = True\n        elif tweet[\"in_reply_to_user_id_str\"] in user_ids:\n            found = True\n        elif tweet[\"retweeted_status\"][\"user\"][\"id_str\"] in user_ids:\n            found = True\n        elif (\n            \"quoted_status\" in tweet\n            and tweet[\"quoted_status\"][\"user\"][\"id_str\"] in user_ids\n        ):\n            found = True\n        break\n\n    if not found:\n        logging.warn(\"couldn't find user in response: %s\", json.dumps(tweet, indent=2))\n\n    assert found\n\n    # reconnect to close streaming connection for other tests\n    T.connect()\n\n\n@pytest.mark.skip(reason=\"v1.1 filter API disabled March 2023\")\ndef test_locations():\n    # look for tweets from New York ; the bounding box is larger than NYC\n    # so hopefully we'll find one from New York in the first 100?\n    count = 0\n    found = False\n\n    for tweet in T.filter(locations=\"-74,40,-73,41\"):\n        if tweet[\"place\"][\"name\"] == \"Manhattan\":\n            found = True\n            break\n        if count > 100:\n            break\n        count += 1\n\n    assert found\n\n    # reconnect to close streaming connection for other tests\n    T.connect()\n\n\n@pytest.mark.skip(reason=\"v1.1 filter API disabled March 2023\")\ndef test_languages():\n    count = 0\n    ok = True\n    langs = [\"fr\", \"es\"]\n    for tweet in T.filter(\"paris,madrid\", lang=langs):\n        if tweet[\"lang\"] not in langs:\n            ok = False\n            break\n        if count > 25:\n            break\n        count += 1\n\n    assert ok\n\n    # reconnect to close streaming connection for other tests\n    T.connect()\n\n\ndef test_timeline_by_user_id():\n    # 
looks for recent tweets and checks if tweets are of provided user_id\n    user_id = \"87818409\"\n\n    for tweet in T.timeline(user_id=user_id):\n        assert tweet[\"user\"][\"id_str\"] == user_id\n\n    # Make sure that passing an int user_id behaves as expected. Issue #235\n    user_id = 87818409\n\n    all_tweets = list(T.timeline(user_id=user_id))\n    assert len(all_tweets)\n\n    for tweet in all_tweets:\n        assert tweet[\"user\"][\"id\"] == user_id\n\n\ndef test_timeline_max_pages():\n    # looks for recent tweets and checks if tweets are of provided user_id\n    user_id = \"87818409\"\n\n    first_page = list(T.timeline(user_id=user_id, max_pages=1))\n    assert 0 < len(first_page) <= 200\n\n    all_pages = list(T.timeline(user_id=user_id))\n    assert len(all_pages) > len(first_page)\n\n\ndef test_timeline_by_screen_name():\n    # looks for recent tweets and checks if tweets are of provided screen_name\n    screen_name = \"guardian\"\n\n    for tweet in T.timeline(screen_name=screen_name):\n        assert tweet[\"user\"][\"screen_name\"].lower() == screen_name.lower()\n\n\ndef test_home_timeline():\n    found = False\n    for tweet in T.timeline():\n        found = True\n        break\n    assert found\n\n\ndef test_timeline_arg_handling():\n    # Confirm that only user_id *or* screen_name is valid for timeline\n    screen_name = \"guardian\"\n    user_id = \"87818409\"\n\n    with pytest.raises(ValueError):\n        for t in T.timeline(screen_name=screen_name, user_id=user_id):\n            pass\n\n\ndef test_timeline_with_since_id():\n    count = 0\n    tweet_id = None\n    for tweet in T.timeline(screen_name=\"guardian\"):\n        tweet_id = tweet[\"id_str\"]\n        count += 1\n        if count > 10:\n            break\n\n    tweets = list(T.timeline(screen_name=\"guardian\", since_id=tweet_id))\n    assert len(tweets) == 10\n\n\ndef test_trends_available():\n    # fetches all available trend regions and checks presence of likely member\n    
trends = T.trends_available()\n    worldwide = [t for t in trends if t[\"placeType\"][\"name\"] == \"Supername\"]\n    assert worldwide[0][\"name\"] == \"Worldwide\"\n\n\ndef test_trends_place():\n    # fetches recent trends for Amsterdam, WOEID 727232\n    trends = T.trends_place(727232)\n    assert len(list(trends[0][\"trends\"])) > 0\n\n\ndef test_trends_closest():\n    # fetches regions bounding the specified lat and lon\n    trends = T.trends_closest(38.883137, -76.990228)\n    assert len(list(trends)) > 0\n\n\ndef test_trends_place_exclude():\n    # fetches recent trends for Amsterdam, WOEID 727232, sans hashtags\n    trends = T.trends_place(727232, exclude=\"hashtags\")[0][\"trends\"]\n    hashtag_trends = [t for t in trends if t[\"name\"].startswith(\"#\")]\n    assert len(hashtag_trends) == 0\n\n\ndef test_follower_ids():\n    count = 0\n    for id in T.follower_ids(\"justinbieber\"):\n        count += 1\n        if count == 10001:\n            break\n    assert count == 10001\n\n\ndef test_follower_ids_with_user_id():\n    count = 0\n    for id in T.follower_ids(27260086):\n        count += 1\n        if count > 10001:\n            break\n    assert count > 10001\n\n\ndef test_follower_ids_max_pages():\n    ids = list(T.follower_ids(813286, max_pages=1))\n    assert 0 < len(ids) <= 5000\n    ids = list(T.follower_ids(813286, max_pages=2))\n    assert 5000 < len(ids) <= 10000\n\n\ndef test_friend_ids():\n    count = 0\n    for id in T.friend_ids(\"justinbieber\"):\n        count += 1\n        if count == 10001:\n            break\n    assert count == 10001\n\n\ndef test_friend_ids_with_user_id():\n    count = 0\n    for id in T.friend_ids(27260086):\n        count += 1\n        if count > 10001:\n            break\n    assert count > 10001\n\n\ndef test_friend_ids_max_pages():\n    ids = list(T.friend_ids(27260086, max_pages=1))\n    assert 0 < len(ids) <= 5000\n    ids = list(T.friend_ids(27260086, max_pages=2))\n    assert 5000 < len(ids) <= 
10000\n\n\ndef test_user_lookup_by_user_id():\n    # looks for the user with given user_id\n\n    user_ids = [\n        \"87818409\",  # @guardian\n        \"807095\",  # @nytimes\n        \"428333\",  # @cnnbrk\n        \"5402612\",  # @BBCBreaking\n        \"2467791\",  # @washingtonpost\n        \"1020058453\",  # @BuzzFeedNews\n        \"23484039\",  # WSJbreakingnews\n        \"384438102\",  # ABCNewsLive\n        \"87416722\",  # SkyNewsBreak\n    ]\n\n    uids = []\n\n    for user in T.user_lookup(ids=user_ids):\n        uids.append(user[\"id_str\"])\n\n    assert set(user_ids) == set(uids)\n\n\ndef test_user_lookup_by_screen_name():\n    # looks for the user with given screen_names\n    screen_names = [\n        \"guardian\",\n        \"nytimes\",\n        \"cnnbrk\",\n        \"BBCBreaking\",\n        \"washingtonpost\",\n        \"BuzzFeedNews\",\n        \"WSJbreakingnews\",\n        \"ABCNewsLive\",\n        \"SkyNewsBreak\",\n    ]\n\n    names = []\n\n    for user in T.user_lookup(ids=screen_names, id_type=\"screen_name\"):\n        names.append(user[\"screen_name\"].lower())\n\n    assert set(names) == set(map(lambda x: x.lower(), screen_names))\n\n\ndef test_tweet():\n    t = T.tweet(\"20\")\n    assert t[\"full_text\"] == \"just setting up my twttr\"\n\n\ndef test_dehydrate():\n    tweets = [\n        '{\"text\": \"test tweet 1\", \"id_str\": \"800000000000000000\"}',\n        '{\"text\": \"test tweet 2\", \"id_str\": \"800000000000000001\"}',\n    ]\n    ids = list(T.dehydrate(iter(tweets)))\n    assert len(ids) == 2\n    assert \"800000000000000000\" in ids\n    assert \"800000000000000001\" in ids\n\n\ndef test_hydrate():\n    ids = [\n        \"501064188211765249\",\n        \"501064196642340864\",\n        \"501064197632167936\",\n        \"501064196931330049\",\n        \"501064198005481472\",\n        \"501064198009655296\",\n        \"501064198059597824\",\n        \"501064198513000450\",\n        \"501064180468682752\",\n        
\"501064199142117378\",\n        \"501064171707170816\",\n        \"501064200186118145\",\n        \"501064200035516416\",\n        \"501064201041743872\",\n        \"501064201251880961\",\n        \"501064198973960192\",\n        \"501064201256071168\",\n        \"501064202027798529\",\n        \"501064202245521409\",\n        \"501064201503113216\",\n        \"501064202363359232\",\n        \"501064202295848960\",\n        \"501064202380115971\",\n        \"501064202904403970\",\n        \"501064203135102977\",\n        \"501064203508412416\",\n        \"501064203516407810\",\n        \"501064203546148864\",\n        \"501064203697156096\",\n        \"501064204191690752\",\n        \"501064204288540672\",\n        \"501064197396914176\",\n        \"501064194309906436\",\n        \"501064204989001728\",\n        \"501064204980592642\",\n        \"501064204661850113\",\n        \"501064205400039424\",\n        \"501064205089665024\",\n        \"501064206666702848\",\n        \"501064207274868736\",\n        \"501064197686296576\",\n        \"501064207623000064\",\n        \"501064207824351232\",\n        \"501064208083980290\",\n        \"501064208277319680\",\n        \"501064208398573568\",\n        \"501064202794971136\",\n        \"501064208789045248\",\n        \"501064209535614976\",\n        \"501064209551994881\",\n        \"501064141332029440\",\n        \"501064207387742210\",\n        \"501064210177331200\",\n        \"501064210395037696\",\n        \"501064210693230592\",\n        \"501064210840035329\",\n        \"501064211855069185\",\n        \"501064192024006657\",\n        \"501064200316125184\",\n        \"501064205642903552\",\n        \"501064212547137536\",\n        \"501064205382848512\",\n        \"501064213843169280\",\n        \"501064208562135042\",\n        \"501064214211870720\",\n        \"501064214467731457\",\n        \"501064215160172545\",\n        \"501064209648848896\",\n        \"501064215990648832\",\n        
\"501064216241897472\",\n        \"501064215759568897\",\n        \"501064211858870273\",\n        \"501064216522932227\",\n        \"501064216930160640\",\n        \"501064217667960832\",\n        \"501064211997274114\",\n        \"501064212303446016\",\n        \"501064213675012096\",\n        \"501064218343661568\",\n        \"501064213951823873\",\n        \"501064219467341824\",\n        \"501064219677044738\",\n        \"501064210080473088\",\n        \"501064220415229953\",\n        \"501064220847656960\",\n        \"501064222340423681\",\n        \"501064222772445187\",\n        \"501064222923440130\",\n        \"501064220121632768\",\n        \"501064222948593664\",\n        \"501064224936714240\",\n        \"501064225096499201\",\n        \"501064225142624256\",\n        \"501064225314185216\",\n        \"501064225926561794\",\n        \"501064226451259392\",\n        \"501064226816143361\",\n        \"501064227302674433\",\n        \"501064227344646144\",\n        \"501064227688558592\",\n        \"501064228288364546\",\n        \"501064228627705857\",\n        \"501064229764751360\",\n        \"501064229915729921\",\n        \"501064231304065026\",\n        \"501064231366983681\",\n        \"501064231387947008\",\n        \"501064231488200704\",\n        \"501064231941570561\",\n        \"501064232188665856\",\n        \"501064232449114112\",\n        \"501064232570724352\",\n        \"501064232700350464\",\n        \"501064233186893824\",\n        \"501064233438568450\",\n        \"501064233774510081\",\n        \"501064235107897344\",\n        \"619172347640201216\",\n        \"619172347275116548\",\n        \"619172341944332288\",\n        \"619172340891578368\",\n        \"619172338177843200\",\n        \"619172335426244608\",\n        \"619172332100284416\",\n        \"619172331592773632\",\n        \"619172331584376832\",\n        \"619172331399725057\",\n        \"619172328249757696\",\n        \"619172328149118976\",\n        
\"619172326886674432\",\n        \"619172324600745984\",\n        \"619172323447324672\",\n        \"619172321564098560\",\n        \"619172320880533504\",\n        \"619172320360333312\",\n        \"619172319047647232\",\n        \"619172314710609920\",\n        \"619172313846693890\",\n        \"619172312122814464\",\n        \"619172306338709504\",\n        \"619172304191401984\",\n        \"619172303654518784\",\n        \"619172302878408704\",\n        \"619172300689031168\",\n        \"619172298310840325\",\n        \"619172295966392320\",\n        \"619172293936291840\",\n        \"619172293680345089\",\n        \"619172285501456385\",\n        \"619172282183725056\",\n        \"619172281751711748\",\n        \"619172281294655488\",\n        \"619172278086070272\",\n        \"619172275741298688\",\n        \"619172274235535363\",\n        \"619172257789706240\",\n        \"619172257278111744\",\n        \"619172253075378176\",\n        \"619172242736308224\",\n        \"619172236134588416\",\n        \"619172235488718848\",\n        \"619172232120692736\",\n        \"619172227813126144\",\n        \"619172221349662720\",\n        \"619172216349917184\",\n        \"619172214475108352\",\n        \"619172209857327104\",\n        \"619172208452182016\",\n        \"619172208355749888\",\n        \"619172193730199552\",\n        \"619172193482768384\",\n        \"619172184922042368\",\n        \"619172182548049920\",\n        \"619172179960328192\",\n        \"619172175820357632\",\n        \"619172174872469504\",\n        \"619172173568053248\",\n        \"619172170233679872\",\n        \"619172165959708672\",\n        \"619172163912908801\",\n        \"619172162608463873\",\n        \"619172158741303297\",\n        \"619172157197819905\",\n        \"501064235175399425\",\n        \"501064235456401410\",\n        \"615973042443956225\",\n        \"618602288781860864\",\n    ]\n    count = 0\n    for tweet in T.hydrate(iter(ids)):\n        assert 
tweet[\"id_str\"]\n        count += 1\n    assert count > 80  # may need to adjust as these might get deleted\n\n\n@patch(\"twarc.client.OAuth1Session\", autospec=True)\ndef test_connection_error_get(oauth1session_class):\n    mock_oauth1session = MagicMock(spec=OAuth1Session)\n    mock_oauth1session.headers = {}\n    oauth1session_class.return_value = mock_oauth1session\n    mock_oauth1session.get.side_effect = requests.exceptions.ConnectionError\n    t = twarc.Twarc(\n        \"consumer_key\",\n        \"consumer_secret\",\n        \"access_token\",\n        \"access_token_secret\",\n        connection_errors=3,\n        validate_keys=False,\n    )\n    with pytest.raises(requests.exceptions.ConnectionError):\n        t.get(\"https://api.twitter.com\")\n\n    assert 3 == mock_oauth1session.get.call_count\n\n\n@patch(\"twarc.client.OAuth1Session\", autospec=True)\ndef test_connection_error_post(oauth1session_class):\n    mock_oauth1session = MagicMock(spec=OAuth1Session)\n    mock_oauth1session.headers = {}\n    oauth1session_class.return_value = mock_oauth1session\n    mock_oauth1session.post.side_effect = requests.exceptions.ConnectionError\n    t = twarc.Twarc(\n        \"consumer_key\",\n        \"consumer_secret\",\n        \"access_token\",\n        \"access_token_secret\",\n        connection_errors=2,\n        validate_keys=False,\n    )\n    with pytest.raises(requests.exceptions.ConnectionError):\n        t.post(\"https://api.twitter.com\")\n\n    assert 2 == mock_oauth1session.post.call_count\n\n\ndef test_http_error_sample():\n    t = twarc.Twarc(\n        \"consumer_key\",\n        \"consumer_secret\",\n        \"access_token\",\n        \"access_token_secret\",\n        http_errors=2,\n        validate_keys=False,\n    )\n    with pytest.raises(requests.exceptions.HTTPError):\n        next(t.sample())\n\n\n@pytest.mark.skip(reason=\"v1.1 filter API disabled March 2023\")\ndef test_http_error_filter():\n    t = twarc.Twarc(\n        
\"consumer_key\",\n        \"consumer_secret\",\n        \"access_token\",\n        \"access_token_secret\",\n        http_errors=3,\n        validate_keys=False,\n    )\n    with pytest.raises(requests.exceptions.HTTPError):\n        next(t.filter(track=\"test\"))\n\n\ndef test_retweets():\n    # hopefully there will continue to be more than 100 retweets of these\n    assert len(list(T.retweets([\"20\", \"21\"]))) > 100\n\n\ndef test_missing_retweets():\n    # this tweet doesn't exist and cannot have any retweets\n    assert len(list(T.retweets([\"795972820413140991\"]))) == 0\n\n\ndef test_oembed():\n    t = next(T.search(\"obama\"))\n    url = \"https://twitter.com/{}/status/{}\".format(\n        t[\"user\"][\"screen_name\"], t[\"id_str\"]\n    )\n    tweet_json = T.oembed(url)\n    assert url == tweet_json[\"url\"]\n\n\ndef test_oembed_params():\n    t = next(T.search(\"obama\"))\n    url = \"https://twitter.com/{}/status/{}\".format(\n        t[\"user\"][\"screen_name\"], t[\"id_str\"]\n    )\n    tweet_json = T.oembed(url, theme=\"dark\")\n    assert 'data-theme=\"dark\"' in tweet_json[\"html\"]\n\n\ndef test_replies():\n    # this test will look at trending hashtags, and do a search\n    # to find a popular tweet that uses it, and then makes a\n    # big assumption that someone must have responded to the tweet\n\n    # get the top hashtag that is trending\n    trends = T.trends_place(\"1\")[0][\"trends\"]\n    trends.sort(key=lambda a: a[\"tweet_volume\"] or 0, reverse=True)\n    top_hashtag = trends[0][\"name\"].strip(\"#\")\n\n    logging.info(\"top hashtag %s\" % top_hashtag)\n    tries = 0\n    for top_tweet in T.search(top_hashtag, result_type=\"popular\"):\n        logging.info(\"testing %s\" % top_tweet[\"id_str\"])\n\n        # get replies to the top tweet\n        replies = T.replies(top_tweet)\n\n        # the first tweet should be the base tweet, or the tweet that\n        # we are looking for replies to\n        me = next(replies)\n        assert 
me[\"id_str\"] == top_tweet[\"id_str\"]\n\n        try:\n            reply = next(replies)\n            assert reply[\"in_reply_to_status_id_str\"] == top_tweet[\"id_str\"]\n            break\n\n        except StopIteration:\n            pass  # didn't find a reply\n\n        tries += 1\n        if tries > 10:\n            break\n\n\ndef test_lists_members():\n    slug = \"bots\"\n    screen_name = \"edsu\"\n    members = list(T.list_members(slug=slug, owner_screen_name=screen_name))\n    assert len(members) > 0\n    assert members[0][\"screen_name\"]\n\n\ndef test_lists_members_owner_id():\n    slug = \"bots\"\n    owner_id = \"14331818\"\n    members = list(T.list_members(slug=slug, owner_id=owner_id))\n    assert len(members) > 0\n    assert members[0][\"screen_name\"]\n\n\ndef test_lists_list_id():\n    members = list(T.list_members(list_id=\"197880909\"))\n    assert len(members) > 0\n    assert members[0][\"screen_name\"]\n\n\ndef test_extended_compat():\n    t_compat = twarc.Twarc(tweet_mode=\"compat\")\n\n    assert \"full_text\" in next(T.search(\"obama\"))\n    assert \"text\" in next(t_compat.search(\"obama\"))\n\n    assert \"full_text\" in next(T.timeline(screen_name=\"BarackObama\"))\n    assert \"text\" in next(t_compat.timeline(screen_name=\"BarackObama\"))\n\n\ndef test_csv_retweet():\n    for tweet in T.search(\"obama\"):\n        if \"retweeted_status\" in tweet:\n            break\n    text = json2csv.text(tweet)\n    assert not text.startswith(\"RT @\")\n\n\ndef test_csv_retweet_hashtag():\n    toplevel_hashtags = 0\n    rt_hashtags = 0\n\n    for tweet in T.search(\"#auspol filter:nativeretweets filter:hashtags\"):\n        hashtag_rendered = json2csv.hashtags(tweet)\n        if hashtag_rendered:\n            hashtags = hashtag_rendered.split(\" \")\n        else:\n            hashtags = []\n\n        if len(hashtags) > len(tweet[\"entities\"][\"hashtags\"]):\n            break\n\n    else:\n        assert 
False\n\n\n@pytest.mark.skip(reason=\"v1.1 filter API disabled March 2023\")\ndef test_truncated_text():\n    for tweet in T.filter(\"tweet\"):\n        if tweet[\"truncated\"] == True:\n            break\n    assert tweet[\"text\"] != tweet[\"extended_tweet\"][\"full_text\"]\n    assert json2csv.text(tweet) == tweet[\"extended_tweet\"][\"full_text\"]\n\n\ndef test_invalid_credentials():\n    old_consumer_key = T.consumer_key\n\n    T.consumer_key = \"Definitely not a valid key\"\n    with pytest.raises(RuntimeError):\n        T.validate_keys()\n\n    T.consumer_key = old_consumer_key\n\n\ndef test_app_auth():\n    ta = twarc.Twarc(app_auth=True)\n    count = 0\n    for tweet in ta.search(\"obama\"):\n        assert tweet[\"id_str\"]\n        count += 1\n        if count == 10:\n            break\n    assert count == 10\n\n\n@pytest.mark.skipif(os.environ.get(\"TWITTER_ENV\") == None, reason=\"No environment\")\ndef test_premium_30day_search():\n    twitter_env = os.environ[\"TWITTER_ENV\"]\n    t = twarc.Twarc(app_auth=True)\n    now = datetime.date.today()\n    then = now - datetime.timedelta(days=14)\n\n    search = t.premium_search(\n        q=\"blacklivesmatter\",\n        product=\"30day\",\n        environment=twitter_env,\n        to_date=then,\n        sandbox=True,\n    )\n    tweet = next(search)\n    assert tweet\n\n\n@pytest.mark.skipif(os.environ.get(\"TWITTER_ENV\") == None, reason=\"No environment\")\ndef test_premium_fullarchive_search():\n    twitter_env = os.environ[\"TWITTER_ENV\"]\n    from_date = datetime.date(2013, 7, 1)\n    to_date = datetime.date(2013, 8, 1)\n    t = twarc.Twarc(app_auth=True)\n    search = t.premium_search(\n        q=\"blacklivesmatter\",\n        product=\"fullarchive\",\n        environment=twitter_env,\n        from_date=from_date,\n        to_date=to_date,\n        sandbox=True,\n    )\n\n    count = 0\n    for tweet in search:\n        created_at = datetime.datetime.strptime(\n            tweet[\"created_at\"], \"%a 
%b %d %H:%M:%S +0000 %Y\"\n        )\n        assert created_at.date() >= from_date\n        assert created_at.date() <= to_date\n        count += 1\n\n    assert count > 200\n\n\n@pytest.mark.skipif(os.environ.get(\"GNIP_ENV\") == None, reason=\"No gnip environment\")\ndef test_gnip_fullarchive_search():\n    twitter_env = os.environ[\"GNIP_ENV\"]\n    from_date = datetime.date(2013, 7, 1)\n    to_date = datetime.date(2013, 8, 1)\n    t = twarc.Twarc(gnip_auth=True)\n    search = t.premium_search(\n        q=\"blacklivesmatter\",\n        product=\"gnip_fullarchive\",\n        environment=twitter_env,\n        from_date=from_date,\n        to_date=to_date,\n        sandbox=True,\n    )\n\n    count = 0\n    for tweet in search:\n        created_at = datetime.datetime.strptime(\n            tweet[\"created_at\"], \"%a %b %d %H:%M:%S +0000 %Y\"\n        )\n        assert created_at.date() >= from_date\n        assert created_at.date() <= to_date\n        count += 1\n\n    assert count > 200\n"
  },
  {
    "path": "test_twarc2.py",
    "content": "import os\nimport pytz\nimport twarc\nimport dotenv\nimport pytest\nimport logging\nimport pathlib\nimport datetime\nimport threading\n\nfrom unittest import TestCase\nfrom twarc.version import version, user_agent\n\ndotenv.load_dotenv()\nconsumer_key = os.environ.get(\"CONSUMER_KEY\")\nconsumer_secret = os.environ.get(\"CONSUMER_SECRET\")\nbearer_token = os.environ.get(\"BEARER_TOKEN\")\naccess_token = os.environ.get(\"ACCESS_TOKEN\")\naccess_token_secret = os.environ.get(\"ACCESS_TOKEN_SECRET\")\n\ntest_data = pathlib.Path(\"test-data\")\nlogging.basicConfig(filename=\"test.log\", level=logging.INFO)\n\n# Implicitly test the constructor in application auth mode. This ensures that\n# the tests don't depend on test ordering, and allows using the pytest\n# functionality to only run a single test at a time.\n\nT = twarc.Twarc2(\n    consumer_key=consumer_key,\n    consumer_secret=consumer_secret,\n)\n\n\ndef test_version():\n    import setup\n\n    assert setup.version == version\n\n    assert user_agent\n    assert f\"twarc/{version}\" in user_agent\n\n\ndef test_auth_types_interaction():\n    \"\"\"\n    Test the various options for configuration work as expected.\n    \"\"\"\n\n    # 1. bearer_token auth -> app auth\n    tw = twarc.Twarc2(bearer_token=bearer_token)\n    assert tw.auth_type == \"application\"\n\n    for response in tw.user_lookup(range(1, 101)):\n        assert response[\"data\"]\n\n    tw.client.close()\n\n    # 2. consumer_keys\n    tw = twarc.Twarc2(consumer_key=consumer_key, consumer_secret=consumer_secret)\n    assert tw.auth_type == \"application\"\n\n    for response in tw.user_lookup(range(1, 101)):\n        assert response[\"data\"]\n\n    tw.client.close()\n\n    # 3. 
Full user auth\n    tw = twarc.Twarc2(\n        access_token=access_token,\n        access_token_secret=access_token_secret,\n        consumer_key=consumer_key,\n        consumer_secret=consumer_secret,\n    )\n    assert tw.auth_type == \"user\"\n\n    for response in tw.user_lookup(range(1, 101)):\n        assert response[\"data\"]\n\n    tw.client.close()\n\n    with pytest.raises(twarc.client2.InvalidAuthType):\n        tw.sample()\n\n\ndef test_sample():\n    # event to tell the filter stream to close\n    event = threading.Event()\n\n    for count, result in enumerate(T.sample(event=event)):\n        assert int(result[\"data\"][\"id\"])\n\n        # users are passed by reference and included as includes\n        user_id = result[\"data\"][\"author_id\"]\n        assert len(pick_id(user_id, result[\"includes\"][\"users\"])) == 1\n\n        if count > 10:\n            # close the sample\n            event.set()\n\n    assert count == 11\n\n\n@pytest.mark.parametrize(\"sort_order\", [\"recency\", \"relevancy\"])\ndef test_search_recent(sort_order):\n    found_tweets = 0\n    pages = 0\n\n    for response_page in T.search_recent(\"politics\", sort_order=sort_order):\n        pages += 1\n        tweets = response_page[\"data\"]\n        found_tweets += len(tweets)\n\n        if pages == 2:\n            break\n\n    assert 100 <= found_tweets <= 200\n\n\ndef test_counts_recent():\n    found_counts = 0\n\n    for response_page in T.counts_recent(\"twitter is:verified\", granularity=\"day\"):\n        counts = response_page[\"data\"]\n        found_counts += len(counts)\n        break\n\n    assert 7 <= found_counts <= 8\n\n\n@pytest.mark.skipif(\n    os.environ.get(\"SKIP_ACADEMIC_PRODUCT_TRACK\") is not None,\n    reason=\"No Academic Research Product Track access\",\n)\ndef test_counts_empty_page():\n    found_counts = 0\n\n    for response_page in T.counts_all(\n        \"beans\",\n        start_time=datetime.datetime(2006, 3, 21),\n        
end_time=datetime.datetime(2006, 6, 1),\n        granularity=\"day\",\n    ):\n        counts = response_page[\"data\"]\n        found_counts += len(counts)\n\n    assert found_counts == 72\n\n\ndef test_search_times():\n    found = False\n    now = datetime.datetime.now(tz=pytz.timezone(\"Australia/Melbourne\"))\n    # twitter api doesn't resolve microseconds so strip them for comparison\n    now = now.replace(microsecond=0)\n    end = now - datetime.timedelta(seconds=60)\n    start = now - datetime.timedelta(seconds=61)\n\n    for response_page in T.search_recent(\"tweet\", start_time=start, end_time=end):\n        for tweet in response_page[\"data\"]:\n            found = True\n            # convert created_at to datetime with utc timezone\n            dt = tweet[\"created_at\"].strip(\"Z\")\n            dt = datetime.datetime.fromisoformat(dt)\n            dt = dt.replace(tzinfo=datetime.timezone.utc)\n            assert dt >= start\n            assert dt <= end\n\n    assert found\n\n\ndef test_user_ids_lookup():\n    users_found = 0\n    users_not_found = 0\n\n    for response in T.user_lookup(range(1, 1000)):\n        for profile in response[\"data\"]:\n            users_found += 1\n\n        for error in response[\"errors\"]:\n            # Note that errors includes lookup of contained entitites within a\n            # tweet, so a pinned tweet that doesn't exist anymore results in an\n            # additional error entry, even if the profile is present.\n            if error[\"resource_type\"] == \"user\":\n                users_not_found += 1\n\n    assert users_found >= 1\n    assert users_found + users_not_found == 999\n\n\ndef test_usernames_lookup():\n    users_found = 0\n    usernames = [\"jack\", \"barackobama\", \"rihanna\"]\n    for response in T.user_lookup(usernames, usernames=True):\n        for profile in response[\"data\"]:\n            users_found += 1\n    assert users_found == 3\n\n\ndef test_tweet_lookup():\n    tweets_found = 0\n    
tweets_not_found = 0\n\n    for response in T.tweet_lookup(range(1000, 2000)):\n        for tweet in response[\"data\"]:\n            tweets_found += 1\n\n        for error in response[\"errors\"]:\n            # Note that errors includes lookup of contained entitites within a\n            # tweet, so a pinned tweet that doesn't exist anymore results in an\n            # additional error entry, even if the profile is present.\n            if error[\"resource_type\"] == \"tweet\":\n                tweets_not_found += 1\n\n    assert tweets_found >= 1\n    assert tweets_found + tweets_not_found == 1000\n\n\n# Alas, fetching the stream in GitHub action yields a 400 HTTP error\n# maybe this will go away since it used to work fine.\n\n\n@pytest.mark.skipif(\n    os.environ.get(\"GITHUB_ACTIONS\") != None,\n    reason=\"stream() seems to throw a 400 error under GitHub Actions?!\",\n)\ndef test_stream():\n    # remove any active stream rules\n    rules = T.get_stream_rules()\n    if \"data\" in rules and len(rules[\"data\"]) > 0:\n        rule_ids = [r[\"id\"] for r in rules[\"data\"]]\n        T.delete_stream_rule_ids(rule_ids)\n\n    # make sure they are empty\n    rules = T.get_stream_rules()\n    assert \"data\" not in rules\n\n    # add two rules\n    rules = T.add_stream_rules(\n        [{\"value\": \"hey\", \"tag\": \"twarc-test\"}, {\"value\": \"joe\", \"tag\": \"twarc-test\"}]\n    )\n    assert len(rules[\"data\"]) == 2\n\n    # make sure they are there\n    rules = T.get_stream_rules()\n    assert len(rules[\"data\"]) == 2\n\n    # these properties should be set\n    assert rules[\"data\"][0][\"id\"]\n    assert rules[\"data\"][0][\"tag\"] == \"twarc-test\"\n    assert rules[\"data\"][1][\"id\"]\n    assert rules[\"data\"][1][\"tag\"] == \"twarc-test\"\n\n    # the order of the values is not guaranteed\n    assert \"hey\" in [r[\"value\"] for r in rules[\"data\"]]\n    assert \"joe\" in [r[\"value\"] for r in rules[\"data\"]]\n\n    # collect some data\n    
event = threading.Event()\n    for count, result in enumerate(T.stream(event=event)):\n        assert result[\"data\"][\"id\"]\n        assert result[\"data\"][\"text\"]\n        assert len(result[\"matching_rules\"]) > 0\n        for rule in result[\"matching_rules\"]:\n            assert rule[\"id\"]\n            assert rule[\"tag\"] == \"twarc-test\"\n        if count > 25:\n            event.set()\n    assert count > 25\n\n    # delete the rules\n    rule_ids = [r[\"id\"] for r in rules[\"data\"]]\n    T.delete_stream_rule_ids(rule_ids)\n\n    # make sure they are gone\n    rules = T.get_stream_rules()\n    assert \"data\" not in rules\n\n\ndef test_timeline():\n    \"\"\"\n    Test the user timeline endpoints.\n\n    \"\"\"\n    # get @jack's first pages of tweets and mentions\n    found = 0\n    for pages, tweets in enumerate(T.timeline(12)):\n        found += len(tweets[\"data\"])\n        if pages == 3:\n            break\n    assert found >= 200\n\n    found = 0\n    for pages, tweets in enumerate(T.mentions(12)):\n        found += len(tweets[\"data\"])\n        if pages == 3:\n            break\n    assert found >= 200\n\n\ndef test_timeline_username():\n    \"\"\"\n    Test the user timeline endpoints with username.\n\n    \"\"\"\n\n    found = 0\n    for pages, tweets in enumerate(T.timeline(\"jack\")):\n        found += len(tweets[\"data\"])\n        if pages == 3:\n            break\n    assert found >= 200\n\n    found = 0\n    for pages, tweets in enumerate(T.mentions(\"jack\")):\n        found += len(tweets[\"data\"])\n        if pages == 3:\n            break\n    assert found >= 200\n\n\ndef test_missing_timeline():\n    results = T.timeline(1033441111677788160)\n    assert len(list(results)) == 0\n\n\ndef test_follows():\n    \"\"\"\n    Test followers and and following.\n\n    \"\"\"\n\n    found = 0\n    for pages, users in enumerate(T.following(12)):\n        pages += 1\n        found += len(users[\"data\"])\n        if pages == 2:\n          
  break\n    assert found >= 1000\n\n    found = 0\n    for pages, users in enumerate(T.followers(12)):\n        found += len(users[\"data\"])\n        if pages == 2:\n            break\n    assert found >= 1000\n\n\ndef test_follows_username():\n    \"\"\"\n    Test followers and and following by username.\n\n    \"\"\"\n\n    found = 0\n    for pages, users in enumerate(T.following(\"jack\")):\n        pages += 1\n        found += len(users[\"data\"])\n        if pages == 2:\n            break\n    assert found >= 1000\n\n    found = 0\n    for pages, users in enumerate(T.followers(\"jack\")):\n        found += len(users[\"data\"])\n        if pages == 2:\n            break\n    assert found >= 1000\n\n\ndef test_flattened():\n    \"\"\"\n    This test uses the search API to test response flattening. It will look\n    at each tweet to find evidence that all the expansions have worked. Once it\n    finds them all it stops. If it has retrieved 500 tweets and not found any\n    of the expansions it stops and assumes that something is not right. 
This\n    500 cutoff or the query may need to be adjusted based on experience.\n    \"\"\"\n    found_geo = False\n    found_in_reply_to_user = False\n    found_attachments_media = False\n    found_attachments_polls = False\n    found_entities_mentions = False\n    found_referenced_tweets = False\n\n    count = 0\n\n    for response in T.search_recent(\n        \"(vote poll has:hashtags has:mentions -is:retweet) OR (checked into has:images -is:retweet)\"\n    ):\n        # Search api always returns a response of tweets with metadata but flatten\n        # will put these in a list\n        tweets = twarc.expansions.flatten(response)\n        assert len(tweets) > 1\n\n        for tweet in tweets:\n            count += 1\n\n            assert \"id\" in tweet\n            logging.info(\"got search tweet #%s %s\", count, tweet[\"id\"])\n\n            author_id = tweet[\"author_id\"]\n            assert \"author\" in tweet\n            assert tweet[\"author\"][\"id\"] == author_id\n\n            if \"in_reply_to_user_id\" in tweet:\n                assert \"in_reply_to_user\" in tweet\n                found_in_reply_to_user = True\n\n            if \"attachments\" in tweet:\n                if \"media_keys\" in tweet[\"attachments\"]:\n                    assert \"media\" in tweet[\"attachments\"]\n                    assert tweet[\"attachments\"][\"media\"]\n                    assert tweet[\"attachments\"][\"media\"][0][\"width\"]\n                    found_attachments_media = True\n                if \"poll_ids\" in tweet[\"attachments\"]:\n                    assert \"poll\" in tweet[\"attachments\"]\n                    assert tweet[\"attachments\"][\"poll\"]\n                    found_attachments_polls = True\n\n            if \"geo\" in tweet:\n                assert tweet[\"geo\"][\"place_id\"]\n                assert tweet[\"geo\"][\"place_id\"] == tweet[\"geo\"][\"id\"]\n                found_geo = True\n\n            if \"entities\" in tweet and \"mentions\" 
in tweet[\"entities\"]:\n                assert tweet[\"entities\"][\"mentions\"][0][\"username\"]\n                found_entities_mentions = True\n\n            # need to ensure there are no errors because a referenced tweet\n            # might be protected or deleted in which case it would not have been\n            # included in the response and would not have been flattened\n            if \"errors\" not in response and \"referenced_tweets\" in tweet:\n                assert tweet[\"referenced_tweets\"][0][\"text\"]\n                found_referenced_tweets = True\n\n        if (\n            found_geo\n            and found_in_reply_to_user\n            and found_attachments_media\n            and found_attachments_polls\n            and found_entities_mentions\n            and found_referenced_tweets\n        ):\n            logging.info(\"found all expansions!\")\n        elif count > 10000:\n            logging.info(\"didn't find all expansions in 10000 tweets\")\n\n    assert found_geo, \"found geo\"\n    assert found_in_reply_to_user, \"found in_reply_to_user\"\n    assert found_attachments_media, \"found media\"\n    assert found_attachments_polls, \"found polls\"\n    assert found_entities_mentions, \"found mentions\"\n    assert found_referenced_tweets, \"found referenced tweets\"\n\n\ndef test_ensure_flattened():\n    resp = next(T.search_recent(\"twitter\", max_results=20))\n\n    # flatten a response\n    flat1 = twarc.expansions.ensure_flattened(resp)\n    assert isinstance(flat1, list)\n    assert len(flat1) > 1\n    assert \"author\" in flat1[0]\n\n    # flatten the flattened list\n    flat2 = twarc.expansions.ensure_flattened(flat1)\n    assert isinstance(flat2, list)\n    assert len(flat2) == len(flat1)\n    assert \"author\" in flat2[0]\n\n    # flatten a tweet object which will force it into a list\n    flat3 = twarc.expansions.ensure_flattened(flat2[0])\n    assert isinstance(flat3, list)\n    assert len(flat3) == 1\n\n    # flatten an 
object without includes:\n    # List of records, data is a dict:\n    flat4 = twarc.expansions.ensure_flattened([{\"data\": {\"fake\": \"tweet\"}}])\n    assert isinstance(flat4, list)\n    assert len(flat4) == 1\n    # 1 record, data is a dict:\n    flat5 = twarc.expansions.ensure_flattened({\"data\": {\"fake\": \"tweet\"}})\n    assert isinstance(flat5, list)\n    assert len(flat5) == 1\n    # List of records, data is a list:\n    flat6 = twarc.expansions.ensure_flattened([{\"data\": [{\"fake\": \"tweet\"}]}])\n    assert isinstance(flat6, list)\n    assert len(flat6) == 1\n    # 1 record, data is a list:\n    flat7 = twarc.expansions.ensure_flattened({\"data\": [{\"fake\": \"tweet\"}]})\n    assert isinstance(flat7, list)\n    assert len(flat7) == 1\n    TestCase().assertDictEqual(flat4[0], flat5[0])\n    TestCase().assertDictEqual(flat6[0], flat7[0])\n    TestCase().assertDictEqual(flat4[0], flat7[0])\n\n    resp.pop(\"includes\")\n    flat8 = twarc.expansions.ensure_flattened(resp)\n    assert len(flat8) > 1\n    # Flatten worked without includes, wrote empty object:\n    assert \"author\" in flat8[0]\n    TestCase().assertDictEqual(flat8[0][\"author\"], {})\n\n    # If there's some other type of data:\n    with pytest.raises(ValueError):\n        twarc.expansions.ensure_flattened([[{\"data\": {\"fake\": \"list_of_lists\"}}]])\n\n\ndef test_ensure_flattened_errors():\n    \"\"\"\n    Test that ensure_flattened doesn't return tweets for API responses that only contain errors.\n    \"\"\"\n    data = {\"errors\": [\"fake error\"]}\n    assert twarc.expansions.ensure_flattened(data) == []\n\n\ndef test_ensure_user_id():\n    \"\"\"\n    Test _ensure_user_id's ability to discriminate correctly between IDs and\n    screen names.\n    \"\"\"\n    # presumably IDs don't change\n    assert T._ensure_user_id(\"jack\") == \"12\"\n\n    # should hold for all users, even if the screen name exists\n    assert T._ensure_user_id(\"12\") == \"12\"\n\n    # this is a screen 
name but not an ID\n    # would help to find more \"stable\" example?\n    assert T._ensure_user_id(\"42069\") == \"17334495\"\n    # should 42069 passed as int return ID or screen name?\n\n    assert T._ensure_user_id(\"1033441111677788160\") == \"1033441111677788160\"\n    assert T._ensure_user_id(1033441111677788160) == \"1033441111677788160\"\n\n\ndef test_liking_users():\n    # This is one of @jack's tweets about the Twitter API\n    likes = T.liking_users(1460417326130421765)\n\n    like_count = 0\n\n    for page in likes:\n        assert \"data\" in page\n        # These should be user objects.\n        assert \"description\" in page[\"data\"][0]\n        like_count += len(page[\"data\"])\n        if like_count > 300:\n            break\n\n\ndef test_retweeted_by():\n    # This is one of @jack's tweets about the Twitter API\n    retweet_users = T.retweeted_by(1460417326130421765)\n\n    retweet_count = 0\n\n    for page in retweet_users:\n        assert \"data\" in page\n        # These should be user objects.\n        assert \"description\" in page[\"data\"][0]\n        retweet_count += len(page[\"data\"])\n        if retweet_count > 150:\n            break\n\n\ndef test_liked_tweets():\n    # What has @jack liked?\n    liked_tweets = T.liked_tweets(12)\n\n    like_count = 0\n\n    for page in liked_tweets:\n        assert \"data\" in page\n        # These should be tweet objects.\n        assert \"text\" in page[\"data\"][0]\n        like_count += len(page[\"data\"])\n        if like_count > 300:\n            break\n\n\ndef test_list_lookup():\n    parks_list = T.list_lookup(715919216927322112)\n    assert \"data\" in parks_list\n    assert parks_list[\"data\"][\"name\"] == \"National-parks\"\n\n\ndef test_list_members():\n    response = list(T.list_members(715919216927322112))\n    assert len(response) == 1\n    members = twarc.expansions.flatten(response[0])\n    assert len(members) == 8\n\n\ndef test_list_followers():\n    response = 
list(T.list_followers(715919216927322112))\n    assert len(response) >= 2\n    followers = twarc.expansions.flatten(response[0])\n    assert len(followers) > 50\n\n\ndef test_list_memberships():\n    response = list(T.list_memberships(\"64flavors\"))\n    assert len(response) == 1\n    lists = twarc.expansions.flatten(response[0])\n    assert len(lists) >= 9\n\n\ndef test_followed_lists():\n    response = list(T.followed_lists(\"nasa\"))\n    assert len(response) == 1\n    lists = twarc.expansions.flatten(response[0])\n    assert len(lists) >= 1\n\n\ndef test_owned_lists():\n    response = list(T.owned_lists(\"nasa\"))\n    assert len(response) >= 1\n    lists = twarc.expansions.flatten(response[0])\n    assert len(lists) >= 11\n\n\ndef test_list_tweets():\n    response = next(T.list_tweets(715919216927322112))\n    assert \"data\" in response\n    tweets = twarc.expansions.flatten(response)\n    assert len(tweets) >= 90\n\n\ndef test_user_lookup_non_existent():\n    with pytest.raises(ValueError):\n        # This user does not exist, and a value error should be raised\n        T._ensure_user(\"noasdfasdf\")\n\n\ndef test_twarc_metadata():\n    # With metadata (default)\n    event = threading.Event()\n    for i, response in enumerate(T.sample(event=event)):\n        assert \"__twarc\" in response\n        if i == 10:\n            event.set()\n\n    for response in T.tweet_lookup(range(1000, 2000)):\n        assert \"__twarc\" in response\n        assert \"__twarc\" in twarc.expansions.flatten(response)[0]\n\n    # Witout metadata\n    T.metadata = False\n    event = threading.Event()\n    for i, response in enumerate(T.sample(event=event)):\n        assert \"__twarc\" not in response\n        if i == 10:\n            event.set()\n\n    for response in T.tweet_lookup(range(1000, 2000)):\n        assert \"__twarc\" not in response\n\n    T.metadata = True\n\n\ndef test_docs_requirements():\n    \"\"\"\n    Make sure that the mkdocs requirements has everything that is 
in the\n    twarc requirements so the readthedocs build doesn't fail.\n    \"\"\"\n    twarc_reqs = set(open(\"requirements.txt\").read().split())\n    mkdocs_reqs = set(open(\"requirements-mkdocs.txt\").read().split())\n\n    assert twarc_reqs.issubset(mkdocs_reqs)\n\n\ndef test_geo():\n    print(T.geo(query=\"Silver Spring\"))\n\n\ndef pick_id(id, objects):\n    \"\"\"pick an object out of a list of objects using its id\"\"\"\n    return list(filter(lambda o: o[\"id\"] == id, objects))\n"
  },
  {
    "path": "utils/auth_timing.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nTwitter's rate limits allow App Auth contexts to search at 450 requests\nevery 15 minutes, and User Auth contexts at 180 requests per 15 minutes. \nThis script exercises both contexts and counts how tweets it is able to \nreceive. We should see a significant number more tweets coming back for App\nAuth.\n\nTypical output should look like:\n\n    app auth:  44999\n    user auth:  18000\n\nhttps://developer.twitter.com/en/docs/basics/rate-limits\n\"\"\"\n\nimport logging\nfrom twarc import Twarc\nfrom datetime import datetime\nfrom datetime import timedelta\n\nlogging.basicConfig(\n    filename=\"time_test.log\",\n    level=logging.INFO,\n    format=\"%(asctime)s %(levelname)s %(message)s\",\n)\n\n\ndef count_tweets(app_auth):\n    \"\"\"\n    Search for covid_19 in tweets using the given context and return the number\n    of tweets that were fetched in 10 minutes.\n    \"\"\"\n    count = 0\n    t = Twarc(app_auth=app_auth)\n    start = None\n    for tweet in t.search(\"covid_19\"):\n        # start the timer when we get the first tweet\n        if start is None:\n            start = datetime.now()\n        count += 1\n        if datetime.now() - start > timedelta(minutes=10):\n            break\n    t.client.close()\n    return count\n\n\nprint(\"app auth: \", count_tweets(app_auth=True))\nprint(\"user auth: \", count_tweets(app_auth=False))\n"
  },
  {
    "path": "utils/deduplicate.py",
    "content": "#!/usr/bin/env python\n\"\"\"\nGiven a JSON file, remove any tweets with duplicate IDs.\n\nOptionally, this will extract retweets. (That is, for a retweet\nuse tweet from retweeted_status and retweet.)\n\nExample usage:\nutils/deduplicate.py tweets.jsonl > tweets_deduped.jsonl\n\"\"\"\n\nfrom __future__ import print_function\nimport json\nimport fileinput\nimport argparse\n\n\ndef main(files, extract_retweets=False):\n    seen = {}\n    for line in fileinput.input(files=files):\n        tweet = json.loads(line)\n        if extract_retweets and \"retweeted_status\" in tweet:\n            tweet = tweet[\"retweeted_status\"]\n        id = tweet[\"id\"]\n        if id not in seen:\n            seen[id] = True\n            print(json.dumps(tweet))\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"--extract-retweets\", action=\"store_true\", help=\"Extract retweets\"\n    )\n    parser.add_argument(\n        \"files\",\n        metavar=\"FILE\",\n        nargs=\"*\",\n        help=\"files to read, if empty, stdin is used\",\n    )\n    args = parser.parse_args()\n\n    main(\n        args.files if len(args.files) > 0 else (\"-\",),\n        extract_retweets=args.extract_retweets,\n    )\n"
  },
  {
    "path": "utils/deleted.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nThis is a little utility that reads in tweets, rehydrates them, and only \noutputs the tweets JSON for tweets that are no longer available.\n\"\"\"\n\nimport json\nimport twarc\nimport fileinput\n\nt = twarc.Twarc()\n\n\ndef missing(tweets):\n    tweet_ids = [t[\"id_str\"] for t in tweets]\n    hydrated = t.hydrate(tweets)\n    hydrated_ids = [t[\"id_str\"] for t in hydrated]\n    missing_ids = tweet_ids - hydrated_ids\n    for t in tweets:\n        if t[\"id_str\"] in missing_ids:\n            yield t\n\n\ntweets = []\n\nfor line in fileinput.input():\n    t = json.loads(line)\n    tweets.append(t)\n    if len(tweets) > 100:\n        for t in missing(tweets):\n            print(json.dumps(t))\n        tweets = []\n\nif len(tweets) > 0:\n    for t in missing(tweets):\n        print(json.dumps(t))\n"
  },
  {
    "path": "utils/deleted_users.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nThis utility Will read in user ids, or tweet JSON data, and look up each\nuser_id. If the user no longer exists the user_id or tweet JSON will be written\nto stdout. If the user exists no output will be written. It acts like a filter\nto locate deleted accounts.\n\"\"\"\n\nimport re\nimport json\nimport twarc\nimport logging\nimport fileinput\n\nlogging.basicConfig(filename=\"deleted_users.log\", level=logging.INFO)\nt = twarc.Twarc()\n\nfor line in fileinput.input():\n    line = line.strip()\n\n    if re.match(\"^\\d+$\", line):\n        user_id = line\n    else:\n        tweet = json.loads(line)\n        user_id = tweet[\"user\"][\"id_str\"]\n    try:\n        user = next(t.user_lookup([user_id]))\n    except Exception as e:\n        print(line)\n"
  },
  {
    "path": "utils/deletes.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nThis program assumes that you are feeding it tweet JSON data for tweets\nthat have been deleted. It will use the metadata and the API to\nanalyze why each tweet appears to have been deleted.\n\nNote that lookups are based on user id, so may give different results than\nlooking up a user by screen name.\n\"\"\"\n\nimport json\nimport fileinput\nimport collections\nimport requests\nimport twarc\nimport argparse\nimport logging\n\nUSER_OK = \"USER_OK\"\nUSER_DELETED = \"USER_DELETED\"\nUSER_PROTECTED = \"USER_PROTECTED\"\nUSER_SUSPENDED = \"USER_SUSPENDED\"\nTWEET_OK = \"TWEET_OK\"\nTWEET_DELETED = \"TWEET_DELETED\"\n# You have been blocked by the user.\nTWEET_BLOCKED = \"TWEET_BLOCKED\"\nRETWEET_DELETED = \"RETWEET_DELETED\"\nORIGINAL_TWEET_DELETED = \"ORIGINAL_TWEET_DELETED\"\nORIGINAL_TWEET_BLOCKED = \"ORIGINAL_TWEET_BLOCKED\"\nORIGINAL_USER_DELETED = \"ORIGINAL_USER_DELETED\"\nORIGINAL_USER_PROTECTED = \"ORIGINAL_USER_PROTECTED\"\nORIGINAL_USER_SUSPENDED = \"ORIGINAL_USER_SUSPENDED\"\n\n# twarc instance\nt = None\n\n\ndef main(files, enhance_tweet=False, print_results=True, profile=None):\n    global t\n    if profile is not None:\n        t = twarc.Twarc(profile=profile)\n    else:\n        t = twarc.Twarc()\n\n    counts = collections.Counter()\n    for count, line in enumerate(fileinput.input(files=files)):\n        if count % 10000 == 0:\n            logging.info(\"processed {:,} tweets\".format(count))\n        tweet = json.loads(line)\n        result = examine(tweet)\n        if enhance_tweet:\n            tweet[\"delete_reason\"] = result\n            print(json.dumps(tweet))\n        else:\n            print(tweet_url(tweet), result)\n        counts[result] += 1\n    if print_results:\n        for result, count in counts.most_common():\n            print(result, count)\n\n\ndef examine(tweet):\n    user_status = get_user_status(tweet)\n    # Go with user status first (suspended, protected, deleted)\n    if 
user_status != USER_OK:\n        return user_status\n    else:\n        retweet = tweet.get(\"retweeted_status\", None)\n        tweet_status = get_tweet_status(tweet)\n\n        # If not a retweet and tweet deleted, then tweet deleted.\n\n        if tweet_status == TWEET_OK:\n            return TWEET_OK\n        elif retweet is None or tweet_status == TWEET_BLOCKED:\n            return tweet_status\n        else:\n            rt_status = examine(retweet)\n            if rt_status == USER_DELETED:\n                return ORIGINAL_USER_DELETED\n            elif rt_status == USER_PROTECTED:\n                return ORIGINAL_USER_PROTECTED\n            elif rt_status == USER_SUSPENDED:\n                return ORIGINAL_USER_SUSPENDED\n            elif rt_status == TWEET_DELETED:\n                return ORIGINAL_TWEET_DELETED\n            elif rt_status == TWEET_BLOCKED:\n                return ORIGINAL_TWEET_BLOCKED\n            elif rt_status == TWEET_OK:\n                return RETWEET_DELETED\n            else:\n                # Python 3 cannot raise a plain string, so wrap it in an exception\n                raise ValueError(\n                    \"Unexpected retweet status %s for %s\"\n                    % (rt_status, tweet[\"id_str\"])\n                )\n\n\nusers = {}\n\n\ndef get_user_status(tweet):\n    user_id = tweet[\"user\"][\"id_str\"]\n    if user_id in users:\n        return users[user_id]\n\n    url = \"https://api.twitter.com/1.1/users/show.json\"\n    params = {\"user_id\": user_id}\n\n    # USER_DELETED: 404 and {\"errors\": [{\"code\": 50, \"message\": \"User not found.\"}]}\n    # USER_PROTECTED: 200 and user object with \"protected\": true\n    # USER_SUSPENDED: 403 and {\"errors\":[{\"code\":63,\"message\":\"User has been suspended.\"}]}\n    result = USER_OK\n    try:\n        resp = t.get(url, params=params, allow_404=True)\n        user = resp.json()\n        if user[\"protected\"]:\n            result = USER_PROTECTED\n    except requests.exceptions.HTTPError as e:\n        try:\n            resp_json = e.response.json()\n        
except json.decoder.JSONDecodeError:\n            raise e\n        if e.response.status_code == 404 and has_error_code(resp_json, 50):\n            result = USER_DELETED\n        elif e.response.status_code == 403 and has_error_code(resp_json, 63):\n            result = USER_SUSPENDED\n        else:\n            raise e\n\n    users[user_id] = result\n    return result\n\n\ntweets = {}\n\n\ndef get_tweet_status(tweet):\n    id = tweet[\"id_str\"]\n    if id in tweets:\n        return tweets[id]\n    # USER_SUSPENDED: 403 and {\"errors\":[{\"code\":63,\"message\":\"User has been suspended.\"}]}\n    # USER_PROTECTED: 403 and {\"errors\":[{\"code\":179,\"message\":\"Sorry, you are not authorized to see this status.\"}]}\n    # TWEET_DELETED: 404 and {\"errors\":[{\"code\":144,\"message\":\"No status found with that ID.\"}]}\n    # or {\"errors\":[{\"code\":34,\"message\":\"Sorry, that page does not exist.\"}]}\n\n    url = \"https://api.twitter.com/1.1/statuses/show.json\"\n    params = {\"id\": id}\n\n    result = TWEET_OK\n    try:\n        t.get(url, params=params, allow_404=True)\n    except requests.exceptions.HTTPError as e:\n        try:\n            resp_json = e.response.json()\n        except json.decoder.JSONDecodeError:\n            raise e\n        if e.response.status_code == 404 and has_error_code(resp_json, (34, 144)):\n            result = TWEET_DELETED\n        elif e.response.status_code == 403 and has_error_code(resp_json, 63):\n            result = USER_SUSPENDED\n        elif e.response.status_code == 403 and has_error_code(resp_json, 179):\n            result = USER_PROTECTED\n        elif e.response.status_code == 401 and has_error_code(resp_json, 136):\n            result = TWEET_BLOCKED\n        else:\n            raise e\n\n    tweets[id] = result\n    return result\n\n\ndef tweet_url(tweet):\n    return \"https://twitter.com/%s/status/%s\" % (\n        tweet[\"user\"][\"screen_name\"],\n        tweet[\"id_str\"],\n    )\n\n\ndef 
has_error_code(resp, code):\n    if isinstance(code, int):\n        code = (code,)\n    for error in resp[\"errors\"]:\n        if error[\"code\"] in code:\n            return True\n    return False\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"--enhance\",\n        action=\"store_true\",\n        help=\"Enhance tweet with delete_reason and output enhanced tweet.\",\n    )\n    parser.add_argument(\n        \"--skip-results\",\n        action=\"store_true\",\n        help=\"Skip outputting delete reason summary\",\n    )\n    parser.add_argument(\"--profile\", help=\"The twarc API profile to use\")\n    parser.add_argument(\n        \"files\",\n        metavar=\"FILE\",\n        nargs=\"*\",\n        help=\"files to read, if empty, stdin is used\",\n    )\n    args = parser.parse_args()\n\n    main(\n        args.files if len(args.files) > 0 else (\"-\",),\n        enhance_tweet=args.enhance,\n        print_results=not args.skip_results and not args.enhance,\n        profile=args.profile,\n    )\n"
  },
  {
    "path": "utils/embeds.py",
    "content": "#!/usr/bin/env python\n\nfrom __future__ import print_function\nimport json\nimport fileinput\n\nfor line in fileinput.input():\n    tweet = json.loads(line)\n    if \"media\" in tweet[\"entities\"]:\n        for media in tweet[\"entities\"][\"media\"]:\n            print(media[\"media_url\"])\n"
  },
  {
    "path": "utils/emojis.py",
    "content": "#!/usr/bin/env python3\n\nimport re\nimport json\nimport fileinput\nimport collections\nimport optparse\n\nimport emoji\n\nopt_parser = optparse.OptionParser()\n\nopt_parser.add_option(\"-n\", \"--number\", dest=\"number\", type=\"int\", default=10)\noptions, args = opt_parser.parse_args()\ntweets = args\n\nnumber_of_emojis = options.number\ntweets = tweets.pop()\n\ncounts = collections.Counter()\n\nEMOJI_RE = emoji.get_emoji_regexp()\n\nfor line in open(tweets):\n    tweet = json.loads(line)\n    if \"full_text\" in tweet:\n        text = tweet[\"full_text\"]\n    else:\n        text = tweet[\"text\"]\n    for char in EMOJI_RE.findall(text):\n        counts[char] += 1\n\nfor char, count in counts.most_common(number_of_emojis):\n    print(\"%s %5i\" % (char, count))\n"
  },
  {
    "path": "utils/extractor.py",
    "content": "#!/usr/bin/env python3\nfrom datetime import datetime\nimport json\nimport os\nimport re\nimport argparse\nimport csv\nimport copy\nimport sys\nimport gzip\n\nstrptime = datetime.strptime\n\n\nclass attriObject:\n    \"\"\"Class object for attribute parser.\"\"\"\n\n    def __init__(self, string):\n        self.value = re.split(\":\", string)\n        self.title = self.value[-1]\n\n    def getElement(self, json_object):\n        found = [json_object]\n        for entry in self.value:\n            for index in range(len(found)):\n                try:\n                    found[index] = found[index][entry]\n                except (TypeError, KeyError):\n                    print(\n                        \"'{0}' is not a valid json entry.\".format(\":\".join(self.value))\n                    )\n                    sys.exit()\n\n                # If single search object is a list, search entire list. Error if nested lists.\n                if isinstance(found[index], list):\n                    if len(found) > 1:\n                        raise Exception(\n                            \"Extractor currently does not handle nested lists.\"\n                        )\n                    found = found[index]\n\n        return found\n\n\ndef tweets_files(string, path):\n    \"\"\"Iterates over json files in path.\"\"\"\n    for filename in os.listdir(path):\n        if re.match(string, filename) and \".jsonl\" in filename:\n            f = gzip.open if \".gz\" in filename else open\n            yield path + filename, f\n\n            Ellipsis\n\n\ndef parse(args):\n    with open(args.output, \"w+\", encoding=\"utf-8\") as output:\n        csv_writer = csv.writer(output, dialect=args.dialect)\n        csv_writer.writerow([a.title for a in args.attributes])\n        count = 0\n        tweets = set()\n\n        for filename, f in tweets_files(args.string, args.path):\n            print(\"parsing\", filename)\n            with f(filename, \"rb\") as data_file:\n 
               for line in data_file:\n                    try:\n                        json_object = json.loads(line.decode(\"utf-8\"))\n                    except ValueError:\n                        print(\"Error in\", filename, \"entry incomplete.\")\n                        continue\n\n                    # Check for duplicates\n                    identity = json_object[\"id\"]\n                    if identity in tweets:\n                        continue\n                    tweets.add(identity)\n\n                    # Check for time restrictions.\n                    if args.start or args.end:\n                        tweet_time = strptime(\n                            json_object[\"created_at\"], \"%a %b %d %H:%M:%S +0000 %Y\"\n                        )\n                        if args.start and args.start > tweet_time:\n                            continue\n                        if args.end and args.end < tweet_time:\n                            continue\n\n                    # Check for hashtag.\n                    if args.hashtag:\n                        for entity in json_object[\"entities\"][\"hashtags\"]:\n                            if entity[\"text\"].lower() == args.hashtag:\n                                break\n                        else:\n                            continue\n\n                    count += extract(json_object, args, csv_writer)\n\n        print(\"Searched\", len(tweets), \"tweets and recorded\", count, \"items.\")\n        print(\"largest id:\", max(tweets))\n\n\ndef extract(json_object, args, csv_writer):\n    \"\"\"Extract and write found attributes.\"\"\"\n    found = [[]]\n    for attribute in args.attributes:\n        item = attribute.getElement(json_object)\n        if len(item) == 0:\n            for row in found:\n                row.append(\"NA\")\n        else:\n            found1 = []\n            for value in item:\n                if value is None:\n                    value = \"NA\"\n                new = 
copy.deepcopy(found)\n                for row in new:\n                    row.append(value)\n                found1.extend(new)\n            found = found1\n\n    for row in found:\n        csv_writer.writerow(row)\n    return len(found)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser(description=\"Extracts attributes from tweets.\")\n    parser.add_argument(\n        \"attributes\",\n        nargs=\"*\",\n        help=\"Attributes to search for. Attributes nested inside other attributes should be separated by a colon. Example: user:screen_name, entities:hashtags:text.\",\n    )\n    parser.add_argument(\n        \"-dialect\",\n        default=\"excel\",\n        help=\"Sets dialect for csv output. Defaults to excel. See python module csv.list_dialects()\",\n    )\n    parser.add_argument(\n        \"-string\",\n        default=\"\",\n        help=\"Regular expression for files to parse. Defaults to empty string.\",\n    )\n    parser.add_argument(\n        \"-path\",\n        default=\"./\",\n        help=\"Optional path to folder containing tweets. Defaults to current folder.\",\n    )\n    parser.add_argument(\n        \"-output\",\n        default=\"output.csv\",\n        help=\"Optional file to output results. Defaults to output.csv.\",\n    )\n    parser.add_argument(\n        \"-start\", default=\"\", help=\"Define start date for tweets. Format (mm:dd:yyyy)\"\n    )\n    parser.add_argument(\n        \"-end\", default=\"\", help=\"Define end date for tweets. 
Format (mm:dd:yyyy)\"\n    )\n    parser.add_argument(\n        \"-hashtag\", default=\"\", help=\"Define a hashtag that must be in parsed tweets.\"\n    )\n    args = parser.parse_args()\n\n    if not args.path.endswith(\"/\"):\n        args.path += \"/\"\n\n    args.start = strptime(args.start, \"%m:%d:%Y\") if args.start else False\n    args.end = strptime(args.end, \"%m:%d:%Y\") if args.end else False\n    args.attributes = [attriObject(i) for i in args.attributes]\n    args.string = re.compile(args.string)\n    args.hashtag = args.hashtag.lower()\n\n    parse(args)\n"
  },
  {
    "path": "utils/filter_date.py",
    "content": "#!/usr/bin/env python\n\"\"\"\nGiven a minimum and/or maximum date, filter out all tweets after this date.\n\nFor example, if a hashtag was used for another event before the one you're\ninterested in, you can filter out the old ones.\n\nExample usage:\nutils/filter_date.py --mindate 1-may-2014 tweets.jsonl > filtered.jsonl\n\"\"\"\nfrom __future__ import print_function\n\nimport sys\nimport json\nimport fileinput\nimport argparse\nimport datetime\nfrom dateutil.parser import parse\n\n\ndef filter_input(mindate, maxdate, files):\n    mindate = parse(mindate) if mindate is not None else datetime.datetime.min\n    maxdate = parse(maxdate) if maxdate is not None else datetime.datetime.max\n\n    for line in fileinput.input(files):\n        tweet = json.loads(line)\n\n        created_at = parse(tweet[\"created_at\"])\n        created_at = created_at.replace(tzinfo=None)\n\n        if mindate < created_at and maxdate > created_at:\n            print(json.dumps(tweet))\n\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--mindate\", help=\"the minimum date\", default=None)\n    parser.add_argument(\"--maxdate\", help=\"the maximum date\", default=None)\n    parser.add_argument(\"files\", nargs=\"?\", default=[])\n    args = parser.parse_args()\n\n    filter_input(args.mindate, args.maxdate, args.files)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "utils/filter_users.py",
    "content": "#!/usr/bin/env python\n\"\"\"\nFilters tweets posted by a list of users.\n\nThe list is supplied in a file. The file can contain:\n* screen names\n* user ids\n* screen name,user id\n* user id,screen name\nwhere each appears on a separate line.\n\nWhen a user id is provided, it will be used. Otherwise, screen name\nwill be used.\n\nThere is also an option to filter by tweets NOT posted by the list of users.\n\"\"\"\n\nimport argparse\nimport fileinput\nimport json\nimport logging\n\n\ndef read_user_list_file(user_list_filepath):\n    screen_names = set()\n    user_ids = set()\n\n    with open(user_list_filepath) as f:\n        for count, line in enumerate(f):\n            split_line = line.rstrip(\"\\n\\r\").split(\",\")\n            if _is_header(count, split_line):\n                continue\n            if split_line[0].isdigit():\n                user_ids.add(split_line[0])\n            else:\n                screen_names.add(split_line[0])\n                if len(split_line) > 1 and split_line[1].isdigit():\n                    user_ids.add(split_line[1])\n\n    assert screen_names or user_ids\n    return user_ids, screen_names\n\n\ndef _is_header(count, split_line):\n    # If this is first line and there is more than one part and none are all digit, then a header\n    if count == 0:\n        for part in split_line:\n            if part.isdigit():\n                return False\n        return True\n    return False\n\n\ndef main(files, user_ids, screen_names, positive_match=True):\n    for count, line in enumerate(fileinput.input(files=files)):\n        try:\n            tweet = json.loads(line.rstrip(\"\\n\"))\n            match = False\n            if user_ids and tweet[\"user\"][\"id_str\"] in user_ids:\n                match = True\n            elif tweet[\"user\"][\"screen_name\"] in screen_names:\n                match = True\n\n            if not positive_match:\n                match = not match\n\n            if match:\n                
print(line.rstrip(\"\\n\"))\n\n            if count % 100000 == 0:\n                logging.info(\"processed {:,} tweets\".format(count))\n\n        except json.decoder.JSONDecodeError:\n            pass\n\n\nif __name__ == \"__main__\":\n    logging.basicConfig(\n        level=logging.INFO, format=\"%(asctime)s %(levelname)s %(message)s\"\n    )\n\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"--neg-match\", action=\"store_true\", help=\"Return tweets that do not match users\"\n    )\n    parser.add_argument(\n        \"user_list_file\", help=\"file containing list of users to filter tweets by\"\n    )\n    parser.add_argument(\n        \"tweet_files\",\n        metavar=\"FILE\",\n        nargs=\"*\",\n        help=\"file containing tweets to filter, if empty, \" \"stdin is used\",\n    )\n    args = parser.parse_args()\n    m_user_ids, m_screen_names = read_user_list_file(args.user_list_file)\n    main(\n        args.tweet_files if len(args.tweet_files) > 0 else (\"-\",),\n        m_user_ids,\n        m_screen_names,\n        positive_match=not args.neg_match,\n    )\n"
  },
  {
    "path": "utils/flakey.py",
    "content": "#!/usr/bin/env python3\n\n#\n# This program will read tweet ids (Snowflake IDs) from a file or a pipe and\n# write the tweet ids back out again with their extracted creation time\n# (RFC 3339) as csv.\n#\n# usage: flakey.py ids.txt > ids-times.csv\n#\n# For more about Snowflake IDs see:\n# https://ws-dl.blogspot.com/2019/08/2019-08-03-tweetedat-finding-tweet.html\n#\n\nimport fileinput\nfrom datetime import datetime\n\n\ndef id2time(tweet_id):\n    ms = (tweet_id >> 22) + 1288834974657\n    dt = datetime.utcfromtimestamp(ms // 1000)\n    return dt.replace(microsecond=ms % 1000 * 1000)\n\n\nprint(\"id,created_at\")\nfor line in fileinput.input():\n    tweet_id = int(line)\n    created_at = id2time(tweet_id).strftime(\"%Y-%m-%dT%H:%M:%S.%f\")[0:-3] + \"Z\"\n    print(\"{},{}\".format(tweet_id, created_at))\n"
  },
  {
    "path": "utils/foaf.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nThis is a utility for getting the friend-of-a-friend network for a \ngiven twitter user. It writes a sqlite database as it collects the data\n{user-id}.sqlite and once complete it exports that data to two csv files:\n\n* {user-id}.csv - the user id links\n* {user-id}-users.csv - metadata about the users keyed off their user id\n\n\"\"\"\n\nimport re\nimport csv\nimport sys\nimport twarc\nimport logging\nimport sqlite3\nimport argparse\nimport requests\n\nfrom dateutil.parser import parse as parse_datetime\n\nlogging.basicConfig(\n    filename=\"foaf.log\",\n    level=logging.INFO,\n    format=\"%(asctime)s %(levelname)s %(message)s\",\n)\n\n\ndef friendships(user_id, level=2):\n    \"\"\"\n    Pass in a user_id and you will be returned a generator of friendship\n    tuples (user_id, friend_id). By default it will return the friend\n    of a friend network (level=2), but you can expand this by settings the\n    level parameter to either another number. 
But beware, it could run for a\n    while!\n    \"\"\"\n\n    logging.info(\"getting friends for user %s\", user_id)\n    level -= 1\n    try:\n        count = 0\n        for friend_id in t.friend_ids(user_id):\n            count += 1\n            add_friendship(user_id, friend_id)\n            yield (user_id, friend_id)\n            if level > 0:\n                if not user_in_db(friend_id):\n                    yield from friendships(friend_id, level)\n                else:\n                    logging.info(\"already collected %s\", friend_id)\n            if count % 1000 == 0:\n                db.commit()\n\n    except requests.exceptions.HTTPError as e:\n        if e.response.status_code == 401:\n            logging.error(\"can't get friends for protected user %s\", user_id)\n        else:\n            raise (e)\n\n\ndef user_ids():\n    \"\"\"\n    Returns all the Twitter user_ids in the database.\n    \"\"\"\n    sql = \"\"\"\n        SELECT DISTINCT(user_id) AS user_id FROM friends\n        UNION\n        SELECT DISTINCT(friend_id) AS user_id FROM friends\n        \"\"\"\n    for result in db.execute(sql):\n        yield str(result[0])\n\n\ndef user_in_db(user_id):\n    \"\"\"\n    Checks to see if the user's friends have already been collected.\n    \"\"\"\n    results = db.execute(\"SELECT COUNT(*) FROM friends where user_id = ?\", [user_id])\n    return results.fetchone()[0] > 0\n\n\ndef add_friendship(user_id, friend_id):\n    \"\"\"\n    Add a friendship to the database.\n    \"\"\"\n    db.execute(\n        \"INSERT INTO friends (user_id, friend_id) VALUES (?, ?)\", [user_id, friend_id]\n    )\n\n\ndef add_user(u):\n    \"\"\"\n    Add a user to the database.\n    \"\"\"\n    db.execute(\n        \"\"\"\n        INSERT INTO users (\n          user_id,\n          screen_name,\n          name,\n          description,\n          location,\n          created,\n          statuses,\n          verified\n        ) \n        VALUES (?, ?, ?, ?, ?, ?, ?, ?)\n   
     \"\"\",\n        [\n            u[\"id\"],\n            u[\"screen_name\"],\n            u[\"name\"],\n            u[\"description\"],\n            u[\"location\"],\n            parse_datetime(u[\"created_at\"]).strftime(\"%Y-%m-%d %H:%M:%S\"),\n            u[\"statuses_count\"],\n            u[\"verified\"],\n        ],\n    )\n\n\n# get command line arguments\nparser = argparse.ArgumentParser(\"tweet.py\")\nparser.add_argument(\"user\", action=\"store\", help=\"user_id\")\nparser.add_argument(\n    \"--level\",\n    type=int,\n    action=\"store\",\n    default=2,\n    help=\"how far out into the social graph to follow\",\n)\nargs = parser.parse_args()\n\n# create twarc instance for querying Twitter\nt = twarc.Twarc()\n\n# get the seed user_id, potentially from their screen name\nif re.match(\"^\\d+$\", args.user):\n    seed_user_id = args.user\nelse:\n    seed_user_id = next(t.user_lookup([args.user]))[\"id_str\"]\n\n# setup sqlite db for storing information as it is collected\ndb = sqlite3.connect(f\"{seed_user_id}.sqlite3\")\ndb.execute(\n    \"\"\"\n    CREATE TABLE IF NOT EXISTS friends (\n      user_id INT,\n      friend_id INT,\n      PRIMARY KEY (user_id, friend_id)\n    )\n    \"\"\"\n)\ndb.execute(\n    \"\"\"\n    CREATE TABLE IF NOT EXISTS users (\n      user_id INT,\n      screen_name TEXT,\n      name TEXT,\n      description TEXT,\n      location TEXT,\n      created TEXT,\n      statuses INT,\n      verified TEXT,\n      PRIMARY KEY (user_id)\n    )\n    \"\"\"\n)\n\n# lookup friendship data\nfor friendship in friendships(seed_user_id, args.level):\n    print(\"%s,%s\" % friendship)\n\n# lookup user metadata\nfor user in t.user_lookup(user_ids()):\n    add_user(user)\n\ndb.commit()\n\n# write out friendships\nwith open(\"{}.csv\".format(seed_user_id), \"w\") as fh:\n    w = csv.writer(fh)\n    w.writerow([\"user_id\", \"friend_user_id\"])\n    for row in db.execute(\"SELECT * FROM friends\"):\n        w.writerow(row)\n\n# write out user data 
as csv\nwith open(\"{}-users.csv\".format(seed_user_id), \"w\") as fh:\n    w = csv.writer(fh)\n    w.writerow(\n        [\n            \"user_id\",\n            \"screen_name\",\n            \"name\",\n            \"description\",\n            \"location\",\n            \"created\",\n            \"statuses\",\n            \"verified\",\n        ]\n    )\n\n    sql = \"\"\"\n        SELECT user_id, screen_name, name, description,\n               location, created, statuses, verified\n        FROM users\n        \"\"\"\n\n    for row in db.execute(sql):\n        w.writerow(row)\n"
  },
  {
    "path": "utils/gender.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nfilters tweets based on a guess about the users gender\n\"\"\"\nfrom __future__ import print_function\n\nimport json\nimport optparse\nimport fileinput\nfrom genderator.detector import Detector, MALE, FEMALE, ANDROGYNOUS\n\nusage = \"usage: gender.py --gender [male|female|unknown] tweet_file *\"\nopt_parser = optparse.OptionParser(usage=usage)\nopt_parser.add_option(\n    \"-g\",\n    \"--gender\",\n    dest=\"gender\",\n    choices=[\"male\", \"female\", \"unknown\"],\n    action=\"store\",\n)\noptions, args = opt_parser.parse_args()\n\nif not options.gender:\n    opt_parser.error(\"must supply --gender\")\n\nd = Detector()\nfor line in fileinput.input(args):\n    line = line.strip()\n    tweet = json.loads(line)\n    name = tweet[\"user\"][\"name\"]\n    first_name = name.split(\" \")[0]\n    gender = d.getGender(first_name)\n    if options.gender == \"male\" and gender == MALE:\n        print(line.encode(\"utf-8\"))\n    elif options.gender == \"female\" and gender == FEMALE:\n        print(line.encode(\"utf-8\"))\n    elif options.gender == \"unknown\" and gender == ANDROGYNOUS:\n        print(line.encode(\"utf-8\"))\n"
  },
  {
    "path": "utils/geo.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nFilter tweets/retweets that have geocoding.\n\"\"\"\nfrom __future__ import print_function\n\nimport json\nimport fileinput\n\nfor line in fileinput.input():\n    tweet = json.loads(line)\n    if \"retweeted_status\" in tweet:\n        if tweet[\"retweeted_status\"][\"geo\"]:\n            print(json.dumps(tweet))\n    elif tweet[\"geo\"]:\n        print(json.dumps(tweet))\n"
  },
  {
    "path": "utils/geofilter.py",
    "content": "#!/usr/bin/env python\n\nfrom __future__ import print_function\n\nimport argparse\nimport json\nimport sys\n\nfrom shapely.geometry import shape\n\n\ndef process(line, has_coordinates=None, has_place=None, fence=None):\n    tweet = json.loads(line)\n\n    coordinates = tweet.get(\"coordinates\")\n    place = tweet.get(\"place\")\n\n    if any(\n        [\n            has_coordinates and not coordinates,\n            has_coordinates is False and coordinates,\n            has_place and not place,\n            has_place is False and place,\n        ]\n    ):\n        return\n\n    if fence and (coordinates or place):\n        if coordinates:\n            location = shape(coordinates)\n        else:\n            location = shape(place[\"bounding_box\"])\n\n        if not fence.contains(location):\n            return\n\n    print(line.strip(\"\\n\"))\n\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"infile\", nargs=\"?\", type=argparse.FileType(\"r\"), default=sys.stdin\n    )\n    parser.add_argument(\n        \"--yes-coordinates\", dest=\"has_coordinates\", action=\"store_true\"\n    )\n    parser.add_argument(\n        \"--no-coordinates\", dest=\"has_coordinates\", action=\"store_false\"\n    )\n    parser.add_argument(\"--yes-place\", dest=\"has_place\", action=\"store_true\")\n    parser.add_argument(\"--no-place\", dest=\"has_place\", action=\"store_false\")\n    parser.add_argument(\"--fence\", default=None, help=\"geojson file with geofence\")\n    parser.set_defaults(has_coordinates=None, has_place=None)\n    args = parser.parse_args()\n\n    fence = None\n    if args.fence:\n        with open(args.fence, \"r\") as f:\n            fence = shape(json.loads(f.read()))\n\n    for line in args.infile:\n        process(line, args.has_coordinates, args.has_place, fence)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "utils/geojson.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\ngeojson.py reads in tweets and writes out a corresponding geojson file for the\ntweets. Each feature will include the following properties:\n\n* twitter user name\n* twitter user screename\n* tweet creation time\n* tweet status text\n* profile image url\n* the tweet url\n\nBy default both Point and Polygon features will be included, depending on\nwhether the tweet includes a point or is assigned to a place with a bounding\nbox.\n\nOptionally you can convert bounding boxes to points with the --centroid\nparameter, and can also use --fuzz to randomly place the the point inside the\nbounding box.\n\"\"\"\n\nfrom __future__ import print_function\n\nimport json\nimport random\nimport argparse\nimport fileinput\nimport dateutil.parser\n\n\ndef text(t):\n    return (\n        t.get(\"full_text\") or t.get(\"extended_tweet\", {}).get(\"full_text\") or t[\"text\"]\n    ).replace(\"\\n\", \" \")\n\n\nparser = argparse.ArgumentParser()\n\nparser.add_argument(\n    \"-c\",\n    \"--centroid\",\n    dest=\"centroid\",\n    action=\"store_true\",\n    default=False,\n    help=\"store centroid instead of a bounding box\",\n)\n\nparser.add_argument(\n    \"-f\",\n    \"--fuzz\",\n    type=float,\n    dest=\"fuzz\",\n    default=0,\n    help=\"add a random lon and lat shift to bounding box centroids (0-0.1)\",\n)\n\nparser.add_argument(\n    \"files\", nargs=\"*\", default=(\"-\",), help=\"files to read, if empty, stdin is used\"\n)\n\n\nargs = parser.parse_args()\n\nfeatures = []\n\nfor line in fileinput.input(files=args.files):\n    tweet = json.loads(line)\n    t = dateutil.parser.parse(tweet[\"created_at\"])\n\n    f = {\n        \"type\": \"Feature\",\n        \"properties\": {\n            \"name\": tweet[\"user\"][\"name\"],\n            \"screen_name\": tweet[\"user\"][\"screen_name\"],\n            \"created_at\": t.isoformat(\"T\") + \"Z\",\n            \"text\": text(tweet),\n            \"profile_image_url\": 
tweet[\"user\"][\"profile_image_url\"],\n            \"url\": \"http://twitter.com/%s/status/%s\"\n            % (tweet[\"user\"][\"screen_name\"], tweet[\"id_str\"]),\n        },\n    }\n\n    if tweet[\"geo\"]:\n        f[\"geometry\"] = {\n            \"type\": \"Point\",\n            \"coordinates\": [\n                tweet[\"geo\"][\"coordinates\"][1],\n                tweet[\"geo\"][\"coordinates\"][0],\n            ],\n        }\n\n    elif tweet[\"place\"] and any(tweet[\"place\"][\"bounding_box\"]):\n        bbox = tweet[\"place\"][\"bounding_box\"][\"coordinates\"][0]\n\n        if args.centroid:\n            min_x = bbox[0][0]\n            min_y = bbox[0][1]\n            max_x = bbox[2][0]\n            max_y = bbox[2][1]\n\n            fuzz_x = args.fuzz * random.uniform(-1, 1)\n            fuzz_y = args.fuzz * random.uniform(-1, 1)\n\n            center_x = ((max_x + min_x) / 2.0) + fuzz_x\n            center_y = ((max_y + min_y) / 2.0) + fuzz_y\n\n            f[\"geometry\"] = {\"type\": \"Point\", \"coordinates\": [center_x, center_y]}\n\n        else:\n            f[\"geometry\"] = {\n                \"type\": \"Polygon\",\n                \"coordinates\": [[bbox[0], bbox[1], bbox[2], bbox[3], bbox[0]]],\n            }\n\n    if \"geometry\" in f:\n        features.append(f)\n\ngeojson = {\"type\": \"FeatureCollection\", \"features\": features}\nprint(json.dumps(geojson, indent=2))\n"
  },
  {
    "path": "utils/json2csv.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nA sample JSON to CSV program. Multivalued JSON properties are space delimited \nCSV columns. If you'd like it adjusted send a pull request!\n\"\"\"\n\nfrom twarc import json2csv\n\nimport os\nimport sys\nimport json\nimport codecs\nimport argparse\nimport fileinput\n\nif sys.version_info[0] < 3:\n    try:\n        import unicodecsv as csv\n    except ImportError:\n        sys.exit(\"unicodecsv is required for python 2\")\nelse:\n    import csv\n\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--output\", \"-o\", help=\"write output to file instead of stdout\")\n    parser.add_argument(\n        \"--split\",\n        \"-s\",\n        help=\"if writing to file, split into multiple files with this many lines per \"\n        \"file\",\n        type=int,\n        default=0,\n    )\n    parser.add_argument(\n        \"--extra-field\",\n        \"-e\",\n        help=\"extra fields to include. Provide a field name and a pointer to \"\n        \"the field. 
Example: -e verified user.verified\",\n        nargs=2,\n        action=\"append\",\n    )\n    parser.add_argument(\n        \"--excel\", \"-x\", help=\"create file compatible with Excel\", action=\"store_true\"\n    )\n    parser.add_argument(\n        \"files\",\n        metavar=\"FILE\",\n        nargs=\"*\",\n        help=\"files to read, if empty, stdin is used\",\n    )\n    args = parser.parse_args()\n\n    file_count = 1\n    csv_file = None\n    if args.output:\n        if args.split:\n            csv_file = codecs.open(\n                numbered_filepath(args.output, file_count), \"wb\", \"utf-8\"\n            )\n            file_count += 1\n        else:\n            csv_file = codecs.open(args.output, \"wb\", \"utf-8\")\n    else:\n        csv_file = sys.stdout\n    sheet = csv.writer(csv_file)\n\n    extra_headings = []\n    extra_fields = []\n    if args.extra_field:\n        for heading, field in args.extra_field:\n            extra_headings.append(heading)\n            extra_fields.append(field)\n\n    sheet.writerow(get_headings(extra_headings=extra_headings))\n\n    files = args.files if len(args.files) > 0 else (\"-\",)\n    for count, line in enumerate(\n        fileinput.input(files, openhook=fileinput.hook_encoded(\"utf-8\"))\n    ):\n        if args.split and count and count % args.split == 0:\n            csv_file.close()\n            csv_file = codecs.open(\n                numbered_filepath(args.output, file_count), \"wb\", \"utf-8\"\n            )\n            sheet = csv.writer(csv_file)\n            sheet.writerow(get_headings(extra_headings=extra_headings))\n            file_count += 1\n        tweet = json.loads(line)\n        sheet.writerow(get_row(tweet, extra_fields=extra_fields, excel=args.excel))\n\n\ndef numbered_filepath(filepath, num):\n    path, ext = os.path.splitext(filepath)\n    return os.path.join(\"{}-{:0>3}{}\".format(path, num, ext))\n\n\ndef get_headings(extra_headings=None):\n    fields = json2csv.get_headings()\n  
  if extra_headings:\n        fields.extend(extra_headings)\n    return fields\n\n\ndef get_row(t, extra_fields=None, excel=False):\n    row = json2csv.get_row(t, excel=excel)\n    if extra_fields:\n        for field in extra_fields:\n            row.append(extra_field(t, field))\n    return row\n\n\ndef extra_field(t, field_str):\n    obj = t\n    for field in field_str.split(\".\"):\n        if field in obj:\n            obj = obj[field]\n        else:\n            return None\n    return obj\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "utils/media2warc.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nThis utility extracts media urls from tweet jsonl.gz and  save them as warc records.\n\nWarcio (https://github.com/webrecorder/warcio) is a dependency and before you can use it you need to:\n% pip install warcio\n\nYou run it like this:\n% python media2warc.py /mnt/tweets/ferguson/tweets-0001.jsonl.gz /mnt/tweets/ferguson/tweets-0001.warc.gz\n\nThe input file will be checked for duplicate urls to avoid duplicates within the input file. Subsequent runs\nwill  be deduplicated using a sqlite db.  If an identical-payload-digest is found a revist record is created.\n\nThe script is able to fetch media resources in multiple threads (maximum 2) by passing --threads <int> (default to a single thread).\n\nPlease be careful modifying this script to use more than two threads since it can be interpreted as a DoS-attack.\n\n\"\"\"\n\nimport os\nimport gzip\nimport json\nimport time\nimport queue\nimport hashlib\nimport logging\nimport sqlite3\nimport argparse\nimport requests\nimport threading\n\nfrom datetime import timedelta\nfrom warcio.warcwriter import WARCWriter\nfrom warcio.statusandheaders import StatusAndHeaders\n\nq = queue.Queue()\nout_queue = queue.Queue()\nBLOCK_SIZE = 25600\n\n\nclass GetResource(threading.Thread):\n    def __init__(self, q):\n        threading.Thread.__init__(self)\n        self.q = q\n        self.rlock = threading.Lock()\n        self.out_queue = out_queue\n        self.d = Dedup()\n\n    def run(self):\n        while True:\n            host = self.q.get()\n\n            try:\n                r = requests.get(\n                    host, headers={\"Accept-Encoding\": \"identity\"}, stream=True\n                )\n                data = [r.raw.headers.items(), r.raw, host, r.status_code, r.reason]\n                print(data[2])\n                self.out_queue.put(data)\n                self.q.task_done()\n\n            except requests.exceptions.RequestException as e:\n                
logging.error(\"%s for %s\", e, host)\n                print(e)\n                self.q.task_done()\n                continue\n\n\nclass WriteWarc(threading.Thread):\n    def __init__(self, out_queue, warcfile):\n        threading.Thread.__init__(self)\n        self.out_queue = out_queue\n        self.lock = threading.Lock()\n        self.warcfile = warcfile\n        self.dedup = Dedup()\n\n    def run(self):\n        with open(self.warcfile, \"ab\") as output:\n            while True:\n                self.lock.acquire()\n                data = self.out_queue.get()\n                writer = WARCWriter(output, gzip=False)\n                headers_list = data[0]\n                http_headers = StatusAndHeaders(\n                    \"{} {}\".format(data[3], data[4]), headers_list, protocol=\"HTTP/1.0\"\n                )\n                record = writer.create_warc_record(\n                    data[2], \"response\", payload=data[1], http_headers=http_headers\n                )\n                h = hashlib.sha1()\n                h.update(record.raw_stream.read(BLOCK_SIZE))\n                if self.dedup.lookup(h.hexdigest()):\n                    record = writer.create_warc_record(\n                        data[2], \"revisit\", http_headers=http_headers\n                    )\n                    writer.write_record(record)\n                    self.out_queue.task_done()\n                    self.lock.release()\n                else:\n                    self.dedup.save(h.hexdigest(), data[2])\n                    record.raw_stream.seek(0)\n                    writer.write_record(record)\n                    self.out_queue.task_done()\n                    self.lock.release()\n\n\nclass Dedup:\n    \"\"\"\n    Stolen from warcprox\n    https://github.com/internetarchive/warcprox/blob/master/warcprox/dedup.py\n    \"\"\"\n\n    def __init__(self):\n        self.file = os.path.join(args.archive_dir, \"dedup.db\")\n\n    def start(self):\n        conn = 
sqlite3.connect(self.file)\n        conn.execute(\n            \"create table if not exists dedup (\"\n            \"  key varchar(300) primary key,\"\n            \"  value varchar(4000)\"\n            \");\"\n        )\n        conn.commit()\n        conn.close()\n\n    def save(self, digest_key, url):\n        conn = sqlite3.connect(self.file)\n        conn.execute(\n            \"insert or replace into dedup (key, value) values (?, ?)\", (digest_key, url)\n        )\n        conn.commit()\n        conn.close()\n\n    def lookup(self, digest_key, url=None):\n        result = False\n        conn = sqlite3.connect(self.file)\n        cursor = conn.execute(\"select value from dedup where key = ?\", (digest_key,))\n        result_tuple = cursor.fetchone()\n        conn.close()\n        if result_tuple:\n            result = True\n\n        return result\n\n\ndef parse_extended_entities(extended_entities_dict):\n    \"\"\"Parse media file URL:s form tweet data\n\n    :extended_entities_dict:\n    :returns: list of media file urls\n\n    \"\"\"\n    urls = []\n\n    if \"media\" in extended_entities_dict.keys():\n        for item in extended_entities_dict[\"media\"]:\n            # add static image\n            urls.append(item[\"media_url_https\"])\n\n            # add best quality video file\n            if \"video_info\" in item.keys():\n                max_bitrate = -1  # handle twitters occasional bitrate=0\n                video_url = None\n                for video in item[\"video_info\"][\"variants\"]:\n                    if \"bitrate\" in video.keys() and \"content_type\" in video.keys():\n                        if video[\"content_type\"] == \"video/mp4\":\n                            if int(video[\"bitrate\"]) > max_bitrate:\n                                max_bitrate = int(video[\"bitrate\"])\n                                video_url = video[\"url\"]\n\n                if not video_url:\n                    print(\"Error: No bitrate / content_type\")\n  
                  print(item[\"video_info\"])\n                else:\n                    urls.append(video_url)\n\n    return urls\n\n\ndef parse_binlinks_from_tweet(tweetdict):\n    \"\"\"Parse binary file url:s from a single tweet.\n\n    :tweetdict: json data dict for tweet\n    :returns: list of urls for media files\n\n    \"\"\"\n\n    urls = []\n\n    if \"user\" in tweetdict.keys():\n        urls.append(tweetdict[\"user\"][\"profile_image_url_https\"])\n        urls.append(tweetdict[\"user\"][\"profile_background_image_url_https\"])\n\n    if \"extended_entities\" in tweetdict.keys():\n        urls.extend(parse_extended_entities(tweetdict[\"extended_entities\"]))\n    return urls\n\n\ndef main():\n    start = time.time()\n    if not os.path.isdir(args.archive_dir):\n        os.mkdir(args.archive_dir)\n\n    logging.basicConfig(\n        filename=os.path.join(args.archive_dir, \"media_harvest.log\"),\n        level=logging.INFO,\n        format=\"%(asctime)s %(levelname)s %(message)s\",\n    )\n    logging.getLogger(__name__)\n    logging.info(\"Logging media harvest for %s\", args.tweet_file)\n\n    urls = []\n    d = Dedup()\n    d.start()\n    uniqueUrlCount = 0\n    duplicateUrlCount = 0\n\n    if args.tweet_file.endswith(\".gz\"):\n        tweetfile = gzip.open(args.tweet_file, \"r\")\n    else:\n        tweetfile = open(args.tweet_file, \"r\")\n\n    logging.info(\"Checking for duplicate urls\")\n\n    for line in tweetfile:\n        tweet = json.loads(line)\n        tweet_urls = parse_binlinks_from_tweet(tweet)\n        for url in tweet_urls:\n            if not url in urls:\n                urls.append(url)\n                q.put(url)\n                uniqueUrlCount += 1\n            else:\n                duplicateUrlCount += 1\n\n    logging.info(\n        \"Found %s total media urls %s unique and %s duplicates\",\n        uniqueUrlCount + duplicateUrlCount,\n        uniqueUrlCount,\n        duplicateUrlCount,\n    )\n\n    threads = 
int(args.threads)\n\n    if threads > 2:\n        threads = 2\n\n    for i in range(threads):\n        t = GetResource(q)\n        t.daemon = True\n        t.start()\n\n    wt = WriteWarc(out_queue, os.path.join(args.archive_dir, \"warc.warc\"))\n    wt.daemon = True\n    wt.start()\n\n    q.join()\n    out_queue.join()\n    logging.info(\n        \"Finished media harvest in %s\", str(timedelta(seconds=(time.time() - start)))\n    )\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser(\"archive\")\n    parser.add_argument(\n        \"tweet_file\", action=\"store\", help=\"a twitter jsonl.gz input file\"\n    )\n    parser.add_argument(\n        \"archive_dir\",\n        action=\"store\",\n        help=\"a directory where the resulting warc is stored\",\n    )\n    parser.add_argument(\n        \"--threads\",\n        action=\"store\",\n        default=1,\n        help=\"Number of threads that fetches media resources\",\n    )\n    args = parser.parse_args()\n    main()\n"
  },
  {
    "path": "utils/media_urls.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nPrint out the URLs of images uploaded to Twitter in a tweet json stream.\nUseful for piping to wget or curl to mass download. In Bash:\n\n% wget $(./utils/image_urls.py tweets.jsonl)\n\"\"\"\nfrom __future__ import print_function\n\nimport json\nimport fileinput\n\nfor line in fileinput.input(openhook=fileinput.hook_encoded(\"utf8\")):\n    tweet = json.loads(line)\n    id = tweet[\"id_str\"]\n\n    if \"media\" in tweet[\"entities\"]:\n        for media in tweet[\"entities\"][\"media\"]:\n            if media[\"type\"] == \"photo\":\n                print(id, media[\"media_url_https\"])\n\n    if \"extended_entities\" in tweet and \"media\" in tweet[\"extended_entities\"]:\n        for media in tweet[\"extended_entities\"][\"media\"]:\n            if media[\"type\"] == \"animated_gif\":\n                print(id, media[\"media_url_https\"])\n\n            if \"video_info\" in media:\n                for v in media[\"video_info\"][\"variants\"]:\n                    print(id, v[\"url\"])\n"
  },
  {
    "path": "utils/network.py",
    "content": "#!/usr/bin/env python\n\n# NOTE:\n#\n# This script has been ported to the twarc-network plugin for working\n# with data collected with twarc2. Please see\n# https://github.com/docnow/twarc-newtwork for details.\n#\n# ---\n#\n# build a reply, quote, retweet network from a file of tweets and write it\n# out as a gexf, dot, json or  html file. You will need to have networkx\n# installed and pydotplus if you want to use dot. The html presentation\n# uses d3 to display the network graph in your browser.\n#\n#   ./network.py tweets.jsonl network.html\n#\n# or\n#   ./network.py tweets.jsonl network.dot\n#\n# or\n#\n#  ./network.py tweets.jsonl network.gexf\n#\n# if you would rather have the network oriented around nodes that are users\n# instead of tweets use the --users flag\n#\n#  ./network.py --users tweets.jsonl network.gexf\n#\n# if you would rather have the network oriented around nodes that are hashtags\n# instead of tweets or users, use the --hashtags flag\n#\n# TODO: this is mostly here some someone can improve it :)\n\nimport sys\nimport json\nimport networkx\nimport optparse\nimport itertools\nimport time\n\nfrom networkx import nx_pydot\nfrom networkx.readwrite import json_graph\n\nusage = \"network.py tweets.jsonl graph.html\"\nopt_parser = optparse.OptionParser(usage=usage)\n\nopt_parser.add_option(\n    \"--retweets\", dest=\"retweets\", action=\"store_true\", help=\"include retweets\"\n)\n\nopt_parser.add_option(\n    \"--min_subgraph_size\",\n    dest=\"min_subgraph_size\",\n    type=\"int\",\n    help=\"remove any subgraphs with a size smaller than this number\",\n)\n\nopt_parser.add_option(\n    \"--max_subgraph_size\",\n    dest=\"max_subgraph_size\",\n    type=\"int\",\n    help=\"remove any subgraphs with a size larger than this number\",\n)\n\nopt_parser.add_option(\n    \"--users\",\n    dest=\"users\",\n    action=\"store_true\",\n    help=\"show user relations instead of tweet relations\",\n)\n\n\nopt_parser.add_option(\n    
\"--hashtags\",\n    dest=\"hashtags\",\n    action=\"store_true\",\n    help=\"show hashtag relations instead of tweet relations\",\n)\n\noptions, args = opt_parser.parse_args()\n\nif len(args) != 2:\n    opt_parser.error(\"must supply input and output file names\")\n\ntweets, output = args\n\nG = networkx.DiGraph()\n\n\ndef add(from_user, from_id, to_user, to_id, type, created_at=None):\n    \"adds a relation to the graph\"\n    # storing start_data will allow for timestamps for gephi timeline, where nodes will appear on screen at their start dataset\n    # and stay on forever after\n\n    if (options.users or options.hashtags) and to_user:\n        G.add_node(from_user, screen_name=from_user, start_date=created_at)\n        G.add_node(to_user, screen_name=to_user, start_date=created_at)\n\n        if G.has_edge(from_user, to_user):\n            weight = G[from_user][to_user][\"weight\"] + 1\n        else:\n            weight = 1\n        G.add_edge(from_user, to_user, type=type, weight=weight)\n\n    elif not options.users and to_id:\n        G.add_node(from_id, screen_name=from_user, type=type)\n        if to_user:\n            G.add_node(to_id, screen_name=to_user)\n        else:\n            G.add_node(to_id)\n        G.add_edge(from_id, to_id, type=type)\n\n\ndef to_json(g):\n    j = {\"nodes\": [], \"links\": []}\n    for node_id, node_attrs in g.nodes(True):\n        j[\"nodes\"].append(\n            {\n                \"id\": node_id,\n                \"type\": node_attrs.get(\"type\"),\n                \"screen_name\": node_attrs.get(\"screen_name\"),\n            }\n        )\n    for source, target, attrs in g.edges(data=True):\n        j[\"links\"].append(\n            {\"source\": source, \"target\": target, \"type\": attrs.get(\"type\")}\n        )\n    return j\n\n\nfor line in open(tweets):\n    try:\n        t = json.loads(line)\n    except:\n        continue\n    from_id = t[\"id_str\"]\n    from_user = t[\"user\"][\"screen_name\"]\n    
from_user_id = t[\"user\"][\"id_str\"]\n    to_user = None\n    to_id = None\n    # standardize raw created at date to dd/MM/yyyy HH:mm:ss\n    created_at_date = time.strftime(\n        \"%d/%m/%Y %H:%M:%S\",\n        time.strptime(t[\"created_at\"], \"%a %b %d %H:%M:%S +0000 %Y\"),\n    )\n\n    if options.users:\n        for u in t[\"entities\"].get(\"user_mentions\", []):\n            add(from_user, from_id, u[\"screen_name\"], None, \"reply\", created_at_date)\n\n    elif options.hashtags:\n        hashtags = t[\"entities\"].get(\"hashtags\", [])\n        hashtag_pairs = list(\n            itertools.combinations(hashtags, 2)\n        )  # list of all possible hashtag pairs\n        for u in hashtag_pairs:\n            # source hashtag: u[0]['text']\n            # target hashtag: u[1]['text']\n            add(\n                \"#\" + u[0][\"text\"],\n                None,\n                \"#\" + u[1][\"text\"],\n                None,\n                \"hashtag\",\n                created_at_date,\n            )\n\n    else:\n        if t.get(\"in_reply_to_status_id_str\"):\n            to_id = t[\"in_reply_to_status_id_str\"]\n            to_user = t[\"in_reply_to_screen_name\"]\n            add(from_user, from_id, to_user, to_id, \"reply\")\n\n        if t.get(\"quoted_status\"):\n            to_id = t[\"quoted_status\"][\"id_str\"]\n            to_user = t[\"quoted_status\"][\"user\"][\"screen_name\"]\n            to_user_id = t[\"quoted_status\"][\"user\"][\"id_str\"]\n            add(from_user, from_id, to_user, to_id, \"quote\")\n\n        if options.retweets and t.get(\"retweeted_status\"):\n            to_id = t[\"retweeted_status\"][\"id_str\"]\n            to_user = t[\"retweeted_status\"][\"user\"][\"screen_name\"]\n            to_user_id = t[\"retweeted_status\"][\"user\"][\"id_str\"]\n            add(from_user, from_id, to_user, to_id, \"retweet\")\n\nif options.min_subgraph_size or options.max_subgraph_size:\n    g_copy = G.copy()\n    for g in 
networkx.connected_component_subgraphs(G):\n        if options.min_subgraph_size and len(g) < options.min_subgraph_size:\n            g_copy.remove_nodes_from(g.nodes())\n        elif options.max_subgraph_size and len(g) > options.max_subgraph_size:\n            g_copy.remove_nodes_from(g.nodes())\n    G = g_copy\n\nif output.endswith(\".gexf\"):\n    networkx.write_gexf(G, output)\n\nelif output.endswith(\".gml\"):\n    networkx.write_gml(G, output)\n\nelif output.endswith(\".dot\"):\n    nx_pydot.write_dot(G, output)\n\nelif output.endswith(\".json\"):\n    json.dump(to_json(G), open(output, \"w\"), indent=2)\n\nelif output.endswith(\".html\"):\n    graph_data = json.dumps(to_json(G), indent=2)\n    html = (\n        \"\"\"<!DOCTYPE html>\n<meta charset=\"utf-8\">\n<script src=\"https://platform.twitter.com/widgets.js\"></script>\n<script src=\"https://d3js.org/d3.v4.min.js\"></script>\n<script src=\"https://code.jquery.com/jquery-3.1.1.min.js\"></script>\n<style>\n\n.links line {\n  stroke: #999;\n  stroke-opacity: 0.8;\n  stroke-width: 2px;\n}\n\nline.reply {\n  stroke: #999;\n}\n\nline.retweet {\n  stroke-dasharray: 5;\n}\n\nline.quote {\n  stroke-dasharray: 5;\n}\n\n.nodes circle {\n  stroke: red;\n  fill: red;\n  stroke-width: 1.5px;\n}\n\ncircle.retweet {\n  fill: white;\n  stroke: #999;\n}\n\ncircle.reply {\n  fill: #999;\n  stroke: #999;\n}\n\ncircle.quote {\n  fill: yellow;\n  stroke: yellow;\n}\n\n#graph {\n  width: 99vw;\n  height: 99vh;\n}\n\n#tweet {\n  position: absolute;\n  left: 100px;\n  top: 150px;\n}\n\n</style>\n<svg id=\"graph\"></svg>\n<div id=\"tweet\"></div>\n<script>\n\nvar width = $(window).width();\nvar height = $(window).height();\n\nvar svg = d3.select(\"svg\")\n    .attr(\"height\", height)\n    .attr(\"width\", width);\n\nvar color = d3.scaleOrdinal(d3.schemeCategory20c);\n\nvar simulation = d3.forceSimulation()\n    .velocityDecay(0.6)\n    .force(\"link\", d3.forceLink().id(function(d) { return d.id; }))\n    .force(\"charge\", 
d3.forceManyBody())\n    .force(\"center\", d3.forceCenter(width / 2, height / 2));\n\nvar graph = %s;\n\nvar link = svg.append(\"g\")\n    .attr(\"class\", \"links\")\n  .selectAll(\"line\")\n  .data(graph.links)\n  .enter().append(\"line\")\n    .attr(\"class\", function(d) { return d.type; });\n\nvar node = svg.append(\"g\")\n    .attr(\"class\", \"nodes\")\n  .selectAll(\"circle\")\n  .data(graph.nodes)\n  .enter().append(\"circle\")\n    .attr(\"r\", 5)\n    .attr(\"class\", function(d) { return d.type; })\n    .call(d3.drag()\n        .on(\"start\", dragstarted)\n        .on(\"drag\", dragged)\n        .on(\"end\", dragended));\n\nnode.append(\"title\")\n    .text(function(d) { return d.id; });\n\nnode.on(\"click\", function(d) {\n  $(\"#tweet\").empty();\n\n  var rect = this.getBoundingClientRect();\n  var paneHeight = d.type == \"retweet\" ? 50 : 200;\n  var paneWidth = d.type == \"retweet\" ? 75 : 500;\n\n  var left = rect.x - paneWidth / 2;\n  if (rect.y > height / 2) {\n    var top = rect.y - paneHeight;\n  } else {\n    var top = rect.y + 10;\n  }\n\n  var tweet = $(\"#tweet\");\n  tweet.css({left: left, top: top});\n\n  if (d.type == \"retweet\") {\n    twttr.widgets.createFollowButton(d.screen_name, tweet[0], {size: \"large\"});\n  } else {\n    twttr.widgets.createTweet(d.id, tweet[0], {conversation: \"none\"});\n  }\n\n  d3.event.stopPropagation();\n\n});\n\nsvg.on(\"click\", function(d) {\n  $(\"#tweet\").empty();\n});\n\nsimulation\n    .nodes(graph.nodes)\n    .on(\"tick\", ticked);\n\nsimulation.force(\"link\")\n    .links(graph.links);\n\nfunction ticked() {\n  link\n      .attr(\"x1\", function(d) { return d.source.x; })\n      .attr(\"y1\", function(d) { return d.source.y; })\n      .attr(\"x2\", function(d) { return d.target.x; })\n      .attr(\"y2\", function(d) { return d.target.y; });\n\n  node\n      .attr(\"cx\", function(d) { return d.x; })\n      .attr(\"cy\", function(d) { return d.y; });\n}\n\nfunction dragstarted(d) {\n  if 
(!d3.event.active) simulation.alphaTarget(0.3).restart();\n  d.fx = d.x;\n  d.fy = d.y;\n}\n\nfunction dragged(d) {\n  d.fx = d3.event.x;\n  d.fy = d3.event.y;\n}\n\nfunction dragended(d) {\n  if (!d3.event.active) simulation.alphaTarget(0);\n  d.fx = null;\n  d.fy = null;\n}\n\n</script>\n\"\"\"\n        % graph_data\n    )\n    open(output, \"w\").write(html)\n"
  },
  {
    "path": "utils/noretweets.py",
    "content": "#!/usr/bin/env python\n\"\"\"\nGiven a JSON file, remove any retweets.\n\nExample usage:\nutils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl\n\"\"\"\nfrom __future__ import print_function\nimport json\nimport fileinput\n\nfor line in fileinput.input():\n    tweet = json.loads(line)\n\n    if not \"retweeted_status\" in tweet:\n        print(json.dumps(tweet))\n"
  },
  {
    "path": "utils/oembeds.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\noembeds.py will read a stream of tweet JSON and augment .entities.urls with oembed\nmetadata for the URL. It uses the oembedders python module and a sqlite database \nto prevent multiple lookups for the same URL. Here's an example of how each URL\nstanza will be augmented:\n\n{\n  \"url\": \"https://t.co/ZX6cE5Xbti\",\n  \"expanded_url\": \"https://www.youtube.com/watch?v=ybvmu7kM8z0\",\n  \"display_url\": \"youtube.com/watch?v=ybvmu7…\",\n  \"indices\": [\n    106,\n    129\n  ],\n  \"oembed\": {\n    \"html\": \"<iframe width=\\\"480\\\" height=\\\"270\\\" src=\\\"https://www.youtube.com/embed/ybvmu7kM8z0?fea\nture=oembed\\\" frameborder=\\\"0\\\" allow=\\\"accelerometer; autoplay; encrypted-media; gyroscope; picture-\nin-picture\\\" allowfullscreen></iframe>\",\n    \"thumbnail_url\": \"https://i.ytimg.com/vi/ybvmu7kM8z0/hqdefault.jpg\",\n    \"thumbnail_height\": 360,\n    \"width\": 480,\n    \"thumbnail_width\": 480,\n    \"provider_url\": \"https://www.youtube.com/\",\n    \"type\": \"video\",\n    \"version\": \"1.0\",\n    \"title\": \"Obama knew\",\n    \"provider_name\": \"YouTube\",\n    \"author_url\": \"https://www.youtube.com/channel/UCAql2DyGU2un1Ei2nMYsqOA\",\n    \"author_name\": \"Donald J Trump\",\n    \"height\": 270\n  }\n}\n\nHopefully your URL won't be political propaganda from a tyrant like this one.\n\"\"\"\n\nimport json\nimport logging\nimport sqlite3\nimport fileinput\n\nfrom oembedders import embed\n\n\ndef main():\n    db = OEmbeds()\n    for line in fileinput.input():\n        tweet = json.loads(line)\n        for ent in tweet[\"entities\"][\"urls\"]:\n            url = ent.get(\"unshortened_url\") or ent[\"expanded_url\"]\n            if \"twitter.com\" in url:\n                continue\n            meta, exists = db.get(url)\n            if not exists:\n                try:\n                    meta = embed(url)\n                    db.put(url, meta)\n                except 
Exception as e:\n                    logging.warning(\"error while looking up %s: %s\", url, e)\n            if meta:\n                ent[\"oembed\"] = meta\n        print(json.dumps(tweet))\n\n\nclass OEmbeds:\n    def __init__(self, path=\"oembeds.db\"):\n        self.db = sqlite3.connect(path)\n        self.db.execute(\n            \"\"\"\n            CREATE table IF NOT EXISTS oembeds (\n              url text PRIMARY KEY,\n              oembed text NOT NULL\n            )\n            \"\"\"\n        )\n\n    def put(self, url, metadata):\n        s = json.dumps(metadata)\n        self.db.execute(\"INSERT INTO oembeds VALUES(?, ?)\", [url, s])\n        self.db.commit()\n\n    def get(self, url):\n        cursor = self.db.execute(\"SELECT oembed FROM oembeds WHERE url=?\", [url])\n        result = cursor.fetchone()\n        if result is not None:\n            return json.loads(result[0]), True\n        else:\n            return None, False\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "utils/remove_limit.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nUtility to remove limit warnings from Filter API output.\n\nIf --warnings was used, you will have the following in output:\n{\"limit\": {\"track\": 2530, \"timestamp_ms\": \"1482168932301\"}}\n\nThis utility removes any limit warnings from output.\n\nUsage:\n    remove_limit.py aleppo.jsonl > aleppo_no_warnings.jsonl\n\"\"\"\n\nfrom __future__ import print_function\nimport sys\nimport json\nimport fileinput\n\nlimitbreaker = '{\"limit\":{\"track\":'\nlimit_breaker = '{\"limit\": {\"track\":'\n\nfor line in fileinput.input():\n    if limitbreaker not in line and limit_breaker not in line:\n        print(json.dumps(line))\n"
  },
  {
    "path": "utils/retweets.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nPrints out the tweet ids and counts of most retweeted.\n\"\"\"\nfrom __future__ import print_function\n\nimport json\nimport optparse\nimport fileinput\n\nfrom collections import defaultdict\n\n\ndef main():\n    parser = optparse.OptionParser()\n    options, argv = parser.parse_args()\n\n    counts = defaultdict(int)\n    for line in fileinput.input(argv):\n        try:\n            tweet = json.loads(line)\n        except:\n            continue\n        if \"retweeted_status\" not in tweet:\n            continue\n\n        rt = tweet[\"retweeted_status\"]\n        id = rt[\"id_str\"]\n        count = rt[\"retweet_count\"]\n        if count > counts[id]:\n            counts[id] = count\n\n    for id in sorted(counts, key=counts.get, reverse=True):\n        print(\"{},{}\".format(id, counts[id]))\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "utils/search.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nFilter tweet JSON based on a regular expression to apply to the text of the \ntweet.\n\n    search.py <regex> file1\n\nOr if you want a case insensitive match:\n\n    search.py -i <regex> file1\n\n\"\"\"\n\nfrom __future__ import print_function\n\nimport re\nimport sys\nimport json\nimport argparse\nimport fileinput\n\nfrom twarc import json2csv\n\nif len(sys.argv) == 1:\n    sys.exit(\"usage: search.py <regex> file1 file2\")\n\nparser = argparse.ArgumentParser(description=\"filter tweets by regex\")\n\nparser.add_argument(\n    \"-i\", \"--ignore\", dest=\"ignore\", action=\"store_true\", help=\"ignore case\"\n)\n\nparser.add_argument(\"regex\")\n\nparser.add_argument(\n    \"files\",\n    metavar=\"FILE\",\n    nargs=\"*\",\n    default=[\"-\"],\n    help=\"files to read, if empty, stdin is used\",\n)\n\nargs = parser.parse_args()\n\nflags = 0\nif args.ignore:\n    flags = re.IGNORECASE\n\ntry:\n    regex = re.compile(args.regex, flags)\nexcept Exception as e:\n    sys.exit(\"error: regex failed to compile: {}\".format(e))\n\nfor line in fileinput.input(files=args.files):\n    tweet = json.loads(line)\n    text = json2csv.text(tweet)\n    if regex.search(text):\n        print(line, end=\"\")\n"
  },
  {
    "path": "utils/sensitive.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nFilter out tweets or retweets that Twitter thinks are sensitive (mostly porn).\n\"\"\"\nfrom __future__ import print_function\n\nimport json\nimport fileinput\n\nfor line in fileinput.input():\n    tweet = json.loads(line)\n    if \"possibly_sensitive\" in tweet and tweet[\"possibly_sensitive\"]:\n        pass\n    elif (\n        \"retweeted_status\" in tweet\n        and \"possibly_sensitive\" in tweet[\"retweeted_status\"]\n        and tweet[\"retweeted_status\"][\"possibly_sensitive\"]\n    ):\n        pass\n    else:\n        print(json.dumps(tweet))\n"
  },
  {
    "path": "utils/sort_by_id.py",
    "content": "#!/usr/bin/env python\n\"\"\"\nSort tweets by ID.\n\nTwitter IDs are generated in chronologically ascending order,\nso this is the same as sorting by date.\n\nExample usage:\nutils/sort_by_id.py tweets.jsonl > sorted.jsonl\n\"\"\"\nfrom __future__ import print_function\n\nimport json\nfrom operator import itemgetter\nimport fileinput\n\n\ntweets = []\nfor line in fileinput.input():\n    tweet = json.loads(line)\n    tweets.append(tweet)\n\ntweets = sorted(tweets, key=itemgetter(\"id\"))\n\nfor tweet in tweets:\n    print(json.dumps(tweet))\n\n# End of file\n"
  },
  {
    "path": "utils/source.py",
    "content": "#!/usr/bin/env python\n\"\"\"\nUtil to count which clients are most used.\n\nExample usage:\nutils/source.py tweets.jsonl > sources.html\n\"\"\"\nimport json\nimport fileinput\nfrom collections import defaultdict\n\nsummary = defaultdict(int)\nfor line in fileinput.input():\n    tweet = json.loads(line)\n\n    source = tweet[\"source\"]\n    summary[source] += 1\n\nsumsort = sorted(summary, key=summary.get, reverse=True)\n\nprint(\n    \"\"\"<!doctype html>\n<html>\n\n<head>\n  <meta charset=\"utf-8\">\n  <title>Twitter client sources</title>\n  <style>\n    body {\n      font-family: Arial, Helvetica, sans-serif;\n      font-size: 12pt;\n      margin-left: auto;\n      margin-right: auto;\n      width: 95%;\n    }\n\n    footer#page {\n      margin-top: 15px;\n      clear: both;\n      width: 100%;\n      text-align: center;\n      font-size: 20pt;\n      font-weight: heavy;\n    }\n\n    header {\n      text-align: center;\n      margin-bottom: 20px;\n    }\n\n  </style>\n</head>\n\n<body>\n\n  <header>\n  <h1>Twitter client sources</h1>\n  <em>created on the command line with <a href=\"https://github.com/DocNow/twarc\">twarc</a></em>\n  </header>\n\n  <table>\n\"\"\"\n)\n\nfor source in sumsort:\n    print(\"<tr><td>{}</td><td>{}</td></tr>\".format(source, summary[source]))\nprint(\n    \"\"\"\n\n\n</table>\n\n<footer id=\"page\">\n<hr>\n<br>\ncreated on the command line with <a href=\"https://github.com/DocNow/twarc\">twarc</a>.\n<br>\n<br>\n</footer>\n\n</body>\n</html>\"\"\"\n)\n\n# End of file\n"
  },
  {
    "path": "utils/tags.py",
    "content": "#!/usr/bin/env python\nfrom __future__ import print_function\n\nimport json\nimport fileinput\nimport collections\n\ncounts = collections.Counter()\nfor line in fileinput.input():\n    tweet = json.loads(line)\n    for tag in tweet[\"entities\"][\"hashtags\"]:\n        t = tag[\"text\"].lower()\n        counts[t] += 1\n\nfor tag, count in counts.most_common():\n    print(\"%5i %s\" % (count, tag))\n"
  },
  {
    "path": "utils/times.py",
    "content": "#!/usr/bin/env python\nfrom __future__ import print_function\n\nimport sys\nimport json\nimport optparse\nimport fileinput\nimport dateutil.parser\n\nfrom dateutil import tz\n\nto_zone = tz.tzlocal()\n\nopt_parser = optparse.OptionParser()\nopt_parser.add_option(\"-f\", \"--format\", dest=\"format\", default=\"%Y-%m-%d %H:%M:%S\")\nopt_parser.add_option(\"-l\", \"--local\", dest=\"local\", action=\"store_true\")\nopts, args = opt_parser.parse_args()\n\nfor line in fileinput.input(args):\n    try:\n        tweet = json.loads(line)\n        created_at = dateutil.parser.parse(tweet[\"created_at\"])\n        # convert to local time\n        if opts.local:\n            created_at = created_at.astimezone(to_zone)\n        print(created_at.strftime(opts.format))\n    except ValueError as e:\n        sys.stderr.write(\"uhoh: %s\\n\" % e)\n"
  },
  {
    "path": "utils/twarc-archive.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nThis little utility uses twarc to write Twitter search results to a directory\nof your choosing. It will use the previous results to determine when to stop\nsearching.\n\nSo for example if you want to search for tweets mentioning \"ferguson\" you can\nrun it:\n\n    % twarc-archive.py ferguson /mnt/tweets/ferguson\n\nThe first time you run this it will search twitter for tweets matching\n\"ferguson\" and write them to a file:\n\n    /mnt/tweets/ferguson/tweets-0001.jsonl.gz\n\nWhen you run the exact same command again:\n\n    % twarc-archive.py ferguson /mnt/tweets/ferguson\n\nit will get the first tweet id in tweets-0001.jsonl.gz and use it to write \nanother file which includes any new tweets since that tweet:\n\n    /mnt/tweets/ferguson/tweets-0002.jsonl.gz\n\nThis functionality was initially part of twarc.py itself, but has been split out\ninto a separate utility.\n\n\"\"\"\nfrom __future__ import print_function\n\nimport os\nimport re\nimport sys\nimport gzip\nimport json\nimport twarc\nimport logging\nimport argparse\n\narchive_file_fmt = \"tweets-%04i.jsonl.gz\"\narchive_file_pat = \"tweets-(\\d+).jsonl.gz$\"\n\n\ndef main():\n    config = os.path.join(os.path.expanduser(\"~\"), \".twarc\")\n    e = os.environ.get\n    parser = argparse.ArgumentParser(\"archive\")\n    parser.add_argument(\n        \"search\", action=\"store\", help=\"search for tweets matching a query\"\n    )\n    parser.add_argument(\n        \"archive_dir\", action=\"store\", help=\"a directory where results are stored\"\n    )\n    parser.add_argument(\n        \"--consumer_key\",\n        action=\"store\",\n        default=e(\"CONSUMER_KEY\"),\n        help=\"Twitter API consumer key\",\n    )\n    parser.add_argument(\n        \"--consumer_secret\",\n        action=\"store\",\n        default=e(\"CONSUMER_SECRET\"),\n        help=\"Twitter API consumer secret\",\n    )\n    parser.add_argument(\n        \"--access_token\",\n        
action=\"store\",\n        default=e(\"ACCESS_TOKEN\"),\n        help=\"Twitter API access key\",\n    )\n    parser.add_argument(\n        \"--access_token_secret\",\n        action=\"store\",\n        default=e(\"ACCESS_TOKEN_SECRET\"),\n        help=\"Twitter API access token secret\",\n    )\n    parser.add_argument(\"--profile\", action=\"store\", default=\"main\")\n    parser.add_argument(\n        \"-c\",\n        \"--config\",\n        default=config,\n        help=\"Config file containing Twitter keys and secrets. Overridden by environment config.\",\n    )\n    parser.add_argument(\n        \"--tweet_mode\",\n        action=\"store\",\n        default=\"extended\",\n        dest=\"tweet_mode\",\n        choices=[\"compat\", \"extended\"],\n        help=\"set tweet mode\",\n    )\n    parser.add_argument(\n        \"--twarc_command\",\n        action=\"store\",\n        default=\"search\",\n        choices=[\"search\", \"timeline\"],\n        help=\"select twarc command to be used for harvest, currently supports search and timeline\",\n    )\n\n    args = parser.parse_args()\n\n    if not os.path.isdir(args.archive_dir):\n        os.mkdir(args.archive_dir)\n\n    logging.basicConfig(\n        filename=os.path.join(args.archive_dir, \"archive.log\"),\n        level=logging.INFO,\n        format=\"%(asctime)s %(levelname)s %(message)s\",\n    )\n\n    lockfile = os.path.join(args.archive_dir, \"\") + \"lockfile\"\n    if not os.path.exists(lockfile):\n        pid = os.getpid()\n        lockfile_handle = open(lockfile, \"w\")\n        lockfile_handle.write(str(pid))\n        lockfile_handle.close()\n    else:\n        old_pid = \"unknown\"\n        with open(lockfile, \"r\") as lockfile_handle:\n            old_pid = lockfile_handle.read()\n\n        sys.exit(\n            \"Another twarc-archive.py process with pid \"\n            + old_pid\n            + \" is running. If the process is no longer active then it may have been interrupted. 
In that case remove the 'lockfile' in \"\n            + args.archive_dir\n            + \" and run the command again.\"\n        )\n\n    logging.info(\"logging search for %s to %s\", args.search, args.archive_dir)\n\n    t = twarc.Twarc(\n        consumer_key=args.consumer_key,\n        consumer_secret=args.consumer_secret,\n        access_token=args.access_token,\n        access_token_secret=args.access_token_secret,\n        profile=args.profile,\n        config=args.config,\n        tweet_mode=args.tweet_mode,\n    )\n\n    last_archive = get_last_archive(args.archive_dir)\n    if last_archive:\n        last_id = json.loads(next(gzip.open(last_archive, \"rt\")))[\"id_str\"]\n    else:\n        last_id = None\n\n    if args.twarc_command == \"search\":\n        tweets = t.search(args.search, since_id=last_id)\n    elif args.twarc_command == \"timeline\":\n        if re.match(\"^\\d+$\", args.search):\n            tweets = t.timeline(userid=args.search, since_id=last_id)\n        else:\n            tweets = t.timeline(screen_name=args.search, since_id=last_id)\n    else:\n        raise Exception(\"invalid twarc_command %s\" % args.twarc_command)\n\n    next_archive = get_next_archive(args.archive_dir)\n\n    # we only create the file if there are new tweets to save\n    # this prevents empty archive files\n    fh = None\n\n    for tweet in tweets:\n        if not fh:\n            fh = gzip.open(next_archive, \"wt\")\n        logging.info(\"archived %s\", tweet[\"id_str\"])\n        fh.write(json.dumps(tweet))\n        fh.write(\"\\n\")\n\n    if fh:\n        fh.close()\n    else:\n        logging.info(\"no new tweets found for %s\", args.search)\n\n    if os.path.exists(lockfile):\n        os.remove(lockfile)\n\n\ndef get_last_archive(archive_dir):\n    count = 0\n    for filename in os.listdir(archive_dir):\n        m = re.match(archive_file_pat, filename)\n        if m and int(m.group(1)) > count:\n            count = int(m.group(1))\n    if count != 0:\n       
 return os.path.join(archive_dir, archive_file_fmt % count)\n    else:\n        return None\n\n\ndef get_next_archive(archive_dir):\n    last_archive = get_last_archive(archive_dir)\n    if last_archive:\n        m = re.search(archive_file_pat, last_archive)\n        count = int(m.group(1)) + 1\n    else:\n        count = 1\n    return os.path.join(archive_dir, archive_file_fmt % count)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "utils/tweet.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nFetch a single tweet as JSON using its id.\n\"\"\"\nfrom __future__ import print_function\n\nimport os\nimport json\nimport twarc\nimport argparse\n\ne = os.environ.get\nparser = argparse.ArgumentParser(\"tweet.py\")\n\nparser.add_argument(\"tweet_id\", action=\"store\", help=\"Tweet ID\")\nparser.add_argument(\n    \"--consumer_key\",\n    action=\"store\",\n    default=e(\"CONSUMER_KEY\"),\n    help=\"Twitter API consumer key\",\n)\nparser.add_argument(\n    \"--consumer_secret\",\n    action=\"store\",\n    default=e(\"CONSUMER_SECRET\"),\n    help=\"Twitter API consumer secret\",\n)\nparser.add_argument(\n    \"--access_token\",\n    action=\"store\",\n    default=e(\"ACCESS_TOKEN\"),\n    help=\"Twitter API access key\",\n)\nparser.add_argument(\n    \"--access_token_secret\",\n    action=\"store\",\n    default=e(\"ACCESS_TOKEN_SECRET\"),\n    help=\"Twitter API access token secret\",\n)\nargs = parser.parse_args()\n\ntw = twarc.Twarc(\n    args.consumer_key, args.consumer_secret, args.access_token, args.access_token_secret\n)\ntweet = tw.get(\"https://api.twitter.com/1.1/statuses/show/%s.json\" % args.tweet_id)\n\nprint(json.dumps(tweet.json(), indent=2))\n"
  },
  {
    "path": "utils/tweet_compliance.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nSupports tweet compliance. See https://developer.twitter.com/en/docs/tweets/compliance/overview.\nThat is, providing the most recent version of a tweet or removing unavailable (deleted or protected)\ntweets.\n\nAlso useful for splitting out available tweets from unavailable tweets.\n\nExample usage: python tweet_compliance.py test.txt > test.json 2> test_delete.txt\n\nFor each tweet in a list of tweets or tweet ids provided by standard input or contained in files,\nlooks up the current tweet state.\n\nIf a tweet is not available and tweet ids are provided, the tweet id is\noutput to standard error.\n\nIf a tweet is not available and tweets are provided, the (deleted) tweet is output to standard error.\n\nOtherwise, the current tweet (i.e., the tweet retrieved from the API) is returned to standard out.\n\nOrdering is not guaranteed.\n\nRequires Twitter API keys provided in ~/.twarc or environment variables. (See twarc.py.)\n\"\"\"\nfrom __future__ import print_function\n\nimport json\nimport fileinput\nimport twarc\nimport sys\nimport logging\n\n# Send logging to file instead of STDERR.\nlogging.basicConfig(\n    filename=\"tweet_compliance.log\",\n    level=logging.INFO,\n    format=\"%(asctime)s %(levelname)s %(message)s\",\n)\n\nt = twarc.Twarc()\n\n\ndef process_tweets(tweets):\n    available_tweet_ids = set()\n    # Hydrate the tweets.\n    for tweet in t.hydrate(tweets.keys()):\n        # Keep track of the tweet ids of the tweets that are available.\n        available_tweet_ids.add(tweet[\"id_str\"])\n        # Print available tweets to STDOUT.\n        print(json.dumps(tweet))\n\n    # Find the unavailable tweets.\n    for tweet_id, tweet in tweets.items():\n        if tweet_id not in available_tweet_ids:\n            # Print tweet or tweet id to STDERR\n            if tweets[tweet_id]:\n                print(json.dumps(tweets[tweet_id]), file=sys.stderr)\n            else:\n                print(tweet_id, 
file=sys.stderr)\n\n\ntweets = {}\nfor line in (line.rstrip(\"\\n\") for line in fileinput.input()):\n    # Add tweet or None to tweet map.\n    tweet_id = line\n    tweet = None\n    if not line.isdigit():\n        tweet = json.loads(line)\n        tweet_id = tweet[\"id_str\"]\n    tweets[tweet_id] = tweet\n\n    # When get to 100, process the tweets.\n    if len(tweets) == 100:\n        process_tweets(tweets)\n        tweets.clear()\n\n# Process any remaining tweets.\nif tweets:\n    process_tweets(tweets)\n"
  },
  {
    "path": "utils/tweet_text.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nGiven a JSON file, return just the text of the tweet.\nExample usage:\nutils/tweet_text.py tweets.jsonl > tweets.txt\n\"\"\"\n\nfrom __future__ import print_function\nimport json\nimport fileinput\n\nfor line in fileinput.input():\n    tweet = json.loads(line)\n\n    if \"full_text\" in tweet:\n        print(tweet[\"full_text\"].encode(\"utf8\"))\n    else:\n        print(tweet[\"text\"].encode(\"utf8\"))\n"
  },
  {
    "path": "utils/tweet_urls.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nUsed in conjunction with retweet.py.\n\nPrints out the retweet count, and url of the retweeted tweet.\n\nTakes in the output from retweet.py\n\ntweet_urls.py retweets.jsonl > retweets.txt\n\"\"\"\nfrom __future__ import print_function\n\nimport sys\nimport json\nimport fileinput\n\nfor line in fileinput.input():\n    try:\n        tweet = json.loads(line)\n        tweet_id = tweet[\"id_str\"]\n        screen_name = tweet[\"user\"][\"screen_name\"]\n        retweet_count = tweet[\"retweet_count\"]\n        tweet_urls = \"https://twitter.com/%s/status/%s\" % (screen_name, tweet_id)\n        print(\"%d retweets of %s\" % (retweet_count, tweet_urls))\n    except Exception as e:\n        sys.stderr.write(\"uhoh: %s\\n\" % e)\n"
  },
  {
    "path": "utils/tweetometer.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nReads tweet or Twitter user JSON and outputs a CSV of when the user account was\ncreated, how many tweets they have sent to date, and their average tweets per\nhour. The unit of measurement can be changed to second, minute, day and year\nwith the --unit option.\n\"\"\"\n\nimport json\nimport optparse\nimport fileinput\nimport dateutil.parser\nfrom datetime import datetime, timezone\n\n\nop = optparse.OptionParser()\nop.add_option(\n    \"--unit\", choices=[\"second\", \"minute\", \"hour\", \"day\", \"year\"], default=\"hour\"\n)\n\nopts, args = op.parse_args()\n\nif opts.unit == \"second\":\n    div = 1\nelif opts.unit == \"minute\":\n    div = 60\nelif opts.unit == \"hour\":\n    div = 60 * 60\nelif opts.unit == \"day\":\n    div = 60 * 60 * 24\nelif opts.unit == \"year\":\n    div = 60 * 60 * 24 * 365\n\nnow = datetime.now(timezone.utc)\n\nprint(\"screen_name,tweets per %s\" % opts.unit)\n\nfor line in fileinput.input(args):\n    t = json.loads(line)\n    if \"user\" in t:\n        u = t[\"user\"]\n    elif \"screen_name\" in t:\n        u = t\n    else:\n        raise Exception(\"not a tweet or user JSON object\")\n\n    created_at = dateutil.parser.parse(u[\"created_at\"])\n    age = now - created_at\n    unit = age.total_seconds() / float(div)\n    total = u[\"statuses_count\"]\n    tweets_per_unit = total / unit\n    print(\"%s,%s,%s,%0.2f\" % (u[\"screen_name\"], total, created_at, tweets_per_unit))\n"
  },
  {
    "path": "utils/tweets.py",
    "content": "#!/usr/bin/env python\nfrom __future__ import print_function\n\nimport json\nimport fileinput\nimport dateutil.parser\n\nfor line in fileinput.input():\n    tweet = json.loads(line)\n    created_at = dateutil.parser.parse(tweet[\"created_at\"])\n    print(\n        (\n            \"[%s] @%s: %s (%s)\"\n            % (\n                created_at.strftime(\"%Y-%m-%d %H:%M:%S\"),\n                tweet[\"user\"][\"screen_name\"],\n                tweet[\"text\"],\n                tweet[\"id_str\"],\n            )\n        ).encode(\"utf8\")\n    )\n"
  },
  {
    "path": "utils/unshrtn.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nUnfortunately the \"expanded_url\" as supplied by Twitter aren't fully\nexpanded one hop past t.co.\n\nunshrtn.py will attempt to completely unshorten URLs and add them as the\n\"unshortened_url\" key to each url, and emit the tweet as JSON again on stdout.\n\nThis script starts 10 separate processes which talk to an instance of unshrtn\nthat is running:\n\n    http://github.com/edsu/unshrtn\n\n\"\"\"\n\nimport re\nimport json\nimport time\nimport logging\nimport argparse\nimport fileinput\nimport multiprocessing\nimport urllib.request, urllib.parse, urllib.error\n\n# number of urls to look up in parallel\nPOOL_SIZE = 10\nunshrtn_url = \"http://localhost:3000\"\nretries = 2\nwait = 15\n\nlogging.basicConfig(filename=\"unshorten.log\", level=logging.INFO)\n\n\ndef unshrtn_obj(obj):\n    \"\"\"Pass in an object and have all the object returned with additional\n    unshortened_url keys\n    \"\"\"\n    if type(obj) == list:\n        return list(map(unshrtn_obj, obj))\n    elif type(obj) != dict:\n        return obj\n\n    url = obj.get(\"expanded_url\") or obj.get(\"url\")\n    if not url or re.match(r\"^https?://(api.)?twitter.com/\", url):\n        return {k: unshrtn_obj(v) for k, v in obj.items()}\n\n    u = \"{}/?{}\".format(\n        unshrtn_url, urllib.parse.urlencode({\"url\": url.encode(\"utf8\")})\n    )\n    resp = None\n    for retry in range(0, retries):\n        try:\n            resp = json.loads(urllib.request.urlopen(u).read().decode(\"utf-8\"))\n            break\n        except Exception as e:\n            logging.error(\n                \"http error: %s when looking up %s. 
Try %s of %s\",\n                e,\n                url,\n                retry,\n                retries,\n            )\n            time.sleep(wait)\n\n    return {**obj, \"unshortened_url\": resp[\"long\"]}\n\n\ndef rewrite_line(line):\n    try:\n        data = json.loads(line)\n        return json.dumps(unshrtn_obj(data))\n    except Exception as e:\n        # garbage in, garbage out\n        logging.error(e)\n        return line\n\n\ndef main():\n    global unshrtn_url, retries, wait\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"--pool-size\",\n        help=\"number of urls to look up in parallel\",\n        default=POOL_SIZE,\n        type=int,\n    )\n    parser.add_argument(\n        \"--unshrtn\", help=\"url of the unshrtn service\", default=unshrtn_url\n    )\n    parser.add_argument(\n        \"--retries\",\n        help=\"number of time to retry if error from unshrtn service\",\n        default=retries,\n        type=int,\n    )\n    parser.add_argument(\n        \"--wait\",\n        help=\"number of seconds to wait between retries if error from unshrtn service\",\n        default=wait,\n        type=int,\n    )\n    parser.add_argument(\n        \"files\",\n        metavar=\"FILE\",\n        nargs=\"*\",\n        help=\"files to read, if empty, stdin is used\",\n    )\n    args = parser.parse_args()\n\n    unshrtn_url = args.unshrtn\n    retries = args.retries\n    wait = args.wait\n    pool = multiprocessing.Pool(args.pool_size)\n    for line in pool.imap_unordered(\n        rewrite_line,\n        fileinput.input(files=args.files if len(args.files) > 0 else (\"-\",)),\n    ):\n        if line != \"\\n\":\n            print(line)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "utils/urls.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nPrint out the URLs in a tweet json stream.\n\"\"\"\nfrom __future__ import print_function\n\nimport json\nimport fileinput\n\nfor line in fileinput.input():\n    tweet = json.loads(line)\n    for url in tweet[\"entities\"][\"urls\"]:\n        if \"unshortened_url\" in url:\n            print(url[\"unshortened_url\"])\n        elif url.get(\"expanded_url\"):\n            print(url[\"expanded_url\"])\n        elif url.get(\"url\"):\n            print(url[\"url\"])\n"
  },
  {
    "path": "utils/users.py",
    "content": "#!/usr/bin/env python\nfrom __future__ import print_function\n\nimport json\nimport fileinput\n\nfor line in fileinput.input():\n    tweet = json.loads(line)\n    print((\"%s [%s]\" % (tweet[\"user\"][\"name\"], tweet[\"user\"][\"screen_name\"])))\n"
  },
  {
    "path": "utils/validate.py",
    "content": "#!/usr/bin/env python\n\nimport sys\nimport json\nimport fileinput\n\nline_number = 0\n\nfor line in fileinput.input():\n    line_number += 1\n    try:\n        tweet = json.loads(line)\n    except Exception as e:\n        sys.stderr.write(\"invalid JSON (%s) line %s: %s\" % (e, line_number, line))\n"
  },
  {
    "path": "utils/wall.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"\nFeed wall.py your JSON and get a wall of tweets as HTML. If you want to get the\nwall in chronological order, a handy trick is:\n\n    % tail -r tweets.jsonl | ./wall.py > wall.html\n\n\"\"\"\n\nimport os\nimport re\nimport sys\nimport json\nimport requests\nimport fileinput\n\nAVATAR_DIR = \"img\"\n\n\ndef download_file(url):\n    local_filename = url.split(\"/\")[-1]\n    outfile = os.path.join(AVATAR_DIR, local_filename)\n    if not os.path.isfile(outfile):\n        r = requests.get(url, stream=True)\n        with open(outfile, \"wb\") as f:\n            for chunk in r.iter_content(chunk_size=1024):\n                if chunk:  # filter out keep-alive new chunks\n                    f.write(chunk)\n                    f.flush()\n    return local_filename\n\n\ndef text(t):\n    return (\n        t.get(\"full_text\") or t.get(\"extended_tweet\", {}).get(\"full_text\") or t[\"text\"]\n    ).replace(\"\\n\", \" \")\n\n\nprint(\n    \"\"\"<!doctype html>\n<html>\n\n<head>\n  <meta charset=\"utf-8\">\n  <title>twarc wall</title>\n  <style>\n    body {\n      font-family: Arial, Helvetica, sans-serif;\n      font-size: 12pt;\n      margin-left: auto;\n      margin-right: auto;\n      width: 95%;\n    }\n\n    article.tweet {\n      position: relative;\n      float: left;\n      border: thin #eeeeee solid;\n      margin: 10px;\n      width: 270px;\n      padding: 10px;\n      height: 220px;\n    }\n\n    .name {\n      font-weight: bold;\n    }\n\n    img.avatar {\n        vertical-align: middle;\n        float: left;\n        margin-right: 10px;\n        border-radius: 5px;\n        height: 45px;\n    }\n\n    .tweet footer {\n      position: absolute;\n      bottom: 5px;\n      left: 10px;\n      font-size: smaller;\n    }\n\n    .tweet a {\n      text-decoration: none;\n    }\n\n    .tweet .text {\n      height: 130px;\n      overflow: auto;\n    }\n\n    footer#page {\n      margin-top: 15px;\n      clear: both;\n      
width: 100%;\n      text-align: center;\n      font-size: 20pt;\n      font-weight: heavy;\n    }\n\n    header {\n      text-align: center;\n      margin-bottom: 20px;\n    }\n\n  </style>\n</head>\n\n<body>\n\n  <header>\n  <h1>Title Here</h1>\n  <em>created on the command line with <a href=\"https://github.com/DocNow/twarc\">twarc</a></em>\n  </header>\n\n  <div id=\"tweets\">\n\"\"\"\n)\n\n# Make avatar directory\nif not os.path.isdir(AVATAR_DIR):\n    os.makedirs(AVATAR_DIR)\n\n# Parse command-line args\nreverse = False\n# If args include --reverse, remove it first,\n# leaving file name(s) (if any) in args\nif len(sys.argv) > 1:\n    if sys.argv[1] == \"--reverse\" or sys.argv[1] == \"-r\":\n        reverse = True\n        del sys.argv[1]\n\nlines = fileinput.input()\nif reverse:\n    buffered_lines = []\n    for line in lines:\n        buffered_lines.append(line)\n    # Reverse list using slice\n    lines = buffered_lines[::-1]\n\nfor line in lines:\n    tweet = json.loads(line)\n\n    # Download avatar\n    url = tweet[\"user\"][\"profile_image_url\"]\n    filename = download_file(url)\n\n    t = {\n        \"created_at\": tweet[\"created_at\"],\n        \"name\": tweet[\"user\"][\"name\"],\n        \"username\": tweet[\"user\"][\"screen_name\"],\n        \"user_url\": \"https://twitter.com/\" + tweet[\"user\"][\"screen_name\"],\n        \"text\": text(tweet),\n        \"avatar\": AVATAR_DIR + \"/\" + filename,\n        \"url\": \"https://twitter.com/\"\n        + tweet[\"user\"][\"screen_name\"]\n        + \"/status/\"\n        + tweet[\"id_str\"],\n    }\n\n    if \"retweeted_status\" in tweet:\n        t[\"retweet_count\"] = tweet[\"retweeted_status\"].get(\"retweet_count\", 0)\n    else:\n        t[\"retweet_count\"] = tweet.get(\"retweet_count\", 0)\n\n    t[\"favorite_count\"] = tweet.get(\"favorite_count\", 0)\n    t[\"retweet_string\"] = \"retweet\" if t[\"retweet_count\"] == 1 else \"retweets\"\n    t[\"favorite_string\"] = \"like\" if 
t[\"favorite_count\"] == 1 else \"likes\"\n\n    # Insert links from the last URL to the first so earlier indices stay valid\n    urls = sorted(\n        tweet[\"entities\"][\"urls\"], key=lambda u: u[\"indices\"][0], reverse=True\n    )\n    for url in urls:\n        a = '<a href=\"%(expanded_url)s\">%(url)s</a>' % url\n        start, end = url[\"indices\"]\n        t[\"text\"] = t[\"text\"][0:start] + a + t[\"text\"][end:]\n\n    t[\"text\"] = re.sub(\n        \"@([A-Za-z0-9_]+)\", r'<a href=\"https://twitter.com/\\g<1>\">@\\g<1></a>', t[\"text\"]\n    )\n    t[\"text\"] = re.sub(\n        \" #([^ ]+)\",\n        r' <a href=\"https://twitter.com/search?q=%23\\g<1>&src=hash\">#\\g<1></a>',\n        t[\"text\"],\n    )\n\n    html = (\n        \"\"\"\n    <article class=\"tweet\">\n      <img class=\"avatar\" src=\"%(avatar)s\">\n      <a href=\"%(user_url)s\" class=\"name\">%(name)s</a><br>\n      <span class=\"username\">%(username)s</span><br>\n      <br>\n      <div class=\"text\">%(text)s</div><br>\n      <footer>\n      %(retweet_count)s %(retweet_string)s, %(favorite_count)s %(favorite_string)s<br>\n      <a href=\"%(url)s\"><time>%(created_at)s</time></a>\n      </footer>\n    </article>\n    \"\"\"\n        % t\n    )\n\n    print(html)\n\nprint(\n    \"\"\"\n\n</div>\n\n<footer id=\"page\">\n<hr>\n<br>\ncreated on the command line with <a href=\"https://github.com/DocNow/twarc\">twarc</a>.\n<br>\n<br>\n</footer>\n\n</body>\n</html>\"\"\"\n)\n"
  },
  {
    "path": "utils/wayback.py",
    "content": "#!/usr/bin/env python\n\n#\n# Reads a stream of tweets and checks to see if the tweet is archived at\n# Internet Archive and optionally requests SavePageNow save it.\n#\n# usage: ./wayback.py tweets.jsonl\n#\n# see ./wayback.py --help for details\n\nimport re\nimport json\nimport time\nimport requests\nimport optparse\nimport fileinput\n\n\ndef main(files, save, force_save, sleep):\n    count = 0\n    found_count = 0\n    for line in fileinput.input(files):\n        tweet = json.loads(line)\n        url = \"https://twitter.com/{}/status/{}\".format(\n            tweet[\"user\"][\"screen_name\"], tweet[\"id_str\"]\n        )\n        count += 1\n\n        found = lookup(url)\n        if found:\n            print(\"{} last archived at {}\".format(url, found))\n            found_count += 1\n        else:\n            print(\"{} not archived\".format(url))\n\n        if (not found and save) or force_save:\n            archive_url = savepagenow(url)\n            if archive_url:\n                print(\"saved {} as {}\".format(url, archive_url))\n            else:\n                print(\"save failed for {}\".format(url))\n\n        time.sleep(sleep)\n\n    print(\"\")\n    if count > 0:\n        print(\"{}/{} found\".format(found_count, count))\n\n\ndef lookup(url):\n    found = None\n    resp = requests.get(\"https://archive.org/wayback/available?url={}\".format(url))\n    if resp.status_code == 200:\n        result = resp.json()\n        if \"closest\" in result[\"archived_snapshots\"]:\n            found = timestamp(result[\"archived_snapshots\"][\"closest\"][\"timestamp\"])\n    return found\n\n\ndef savepagenow(url):\n    resp = requests.get(\"https://web.archive.org/save/\" + url)\n    if resp.status_code != 200 or \"content-location\" not in resp.headers:\n        return False\n    return \"https://web.archive.org\" + resp.headers[\"content-location\"]\n\n\ndef timestamp(s):\n    m = 
re.match(r\"^(\\d\\d\\d\\d)(\\d\\d)(\\d\\d)(\\d\\d)(\\d\\d)(\\d\\d)$\", s)\n    return \"{}-{}-{} {}:{}:{}\".format(*m.groups())\n\n\nif __name__ == \"__main__\":\n    usage = \"usage: %prog [options] tweets.jsonl\"\n    parser = optparse.OptionParser(usage)\n    parser.add_option(\n        \"--save\",\n        action=\"store_true\",\n        dest=\"save\",\n        help=\"Save tweet at Internet Archive if not archived\",\n    )\n    parser.add_option(\n        \"--force-save\",\n        action=\"store_true\",\n        dest=\"force_save\",\n        help=\"Always save at Internet Archive, whether it is archived already or not\",\n    )\n    parser.add_option(\n        \"--sleep\",\n        dest=\"sleep\",\n        type=\"int\",\n        default=1,\n        help=\"Time to sleep between requests to Internet Archive\",\n    )\n\n    (opts, args) = parser.parse_args()\n    main(args, save=opts.save, force_save=opts.force_save, sleep=opts.sleep)\n"
  },
  {
    "path": "utils/webarchives.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nA program to filter tweets that contain links to a web archive. At the moment it\nsupports archive.org and archive.is, but please add more if you want!\n\"\"\"\n\nimport json\nimport fileinput\n\narchives = [\"archive.is\", \"web.archive.org\", \"wayback.archive.org\"]\n\nfor line in fileinput.input():\n    tweet = json.loads(line)\n    for url in tweet[\"entities\"][\"urls\"]:\n        done = False\n        for host in archives:\n            if host in url[\"expanded_url\"]:\n                print(line, end=\"\")\n                done = True\n        # prevent outputting same data twice if it contains\n        # multiple archive urls\n        if done:\n            break\n"
  },
  {
    "path": "utils/wordcloud.py",
    "content": "#!/usr/bin/env python\n\nfrom __future__ import print_function\nimport re\nimport sys\nimport json\nimport fileinput\n\n\ndef main():\n    try:\n        from urllib import urlopen  # Python 2\n    except ImportError:\n        from urllib.request import urlopen  # Python 3\n\n    MAX_WORDS = 100\n\n    word_counts = {}\n    stop_words = set(\n        [\n            \"a\",\n            \"able\",\n            \"about\",\n            \"across\",\n            \"actually\",\n            \"after\",\n            \"against\",\n            \"agreed\",\n            \"all\",\n            \"almost\",\n            \"already\",\n            \"also\",\n            \"am\",\n            \"among\",\n            \"an\",\n            \"and\",\n            \"any\",\n            \"anyone\",\n            \"anyway\",\n            \"are\",\n            \"as\",\n            \"at\",\n            \"be\",\n            \"because\",\n            \"been\",\n            \"being\",\n            \"between\",\n            \"but\",\n            \"by\",\n            \"can\",\n            \"cannot\",\n            \"come\",\n            \"could\",\n            \"dear\",\n            \"did\",\n            \"do\",\n            \"does\",\n            \"either\",\n            \"else\",\n            \"ever\",\n            \"every\",\n            \"for\",\n            \"from\",\n            \"get\",\n            \"getting\",\n            \"got\",\n            \"had\",\n            \"has\",\n            \"have\",\n            \"he\",\n            \"her\",\n            \"here\",\n            \"hers\",\n            \"hey\",\n            \"hi\",\n            \"him\",\n            \"his\",\n            \"how\",\n            \"however\",\n            \"i\",\n            \"i'd\",\n            \"i'll\",\n            \"i'm\",\n            \"if\",\n            \"in\",\n            \"into\",\n            \"is\",\n            \"isnt\",\n            \"isn't\",\n            \"it\",\n            \"its\",\n     
       \"just\",\n            \"kind\",\n            \"last\",\n            \"latest\",\n            \"least\",\n            \"let\",\n            \"like\",\n            \"likely\",\n            \"look\",\n            \"make\",\n            \"may\",\n            \"me\",\n            \"might\",\n            \"more\",\n            \"most\",\n            \"must\",\n            \"my\",\n            \"neither\",\n            \"new\",\n            \"no\",\n            \"nor\",\n            \"not\",\n            \"now\",\n            \"of\",\n            \"off\",\n            \"often\",\n            \"on\",\n            \"only\",\n            \"or\",\n            \"other\",\n            \"our\",\n            \"out\",\n            \"over\",\n            \"own\",\n            \"part\",\n            \"piece\",\n            \"play\",\n            \"put\",\n            \"putting\",\n            \"rather\",\n            \"real\",\n            \"really\",\n            \"said\",\n            \"say\",\n            \"says\",\n            \"she\",\n            \"should\",\n            \"simply\",\n            \"since\",\n            \"so\",\n            \"some\",\n            \"than\",\n            \"thanks\",\n            \"that\",\n            \"that's\",\n            \"thats\",\n            \"the\",\n            \"their\",\n            \"them\",\n            \"then\",\n            \"there\",\n            \"these\",\n            \"they\",\n            \"they're\",\n            \"this\",\n            \"those\",\n            \"tis\",\n            \"to\",\n            \"too\",\n            \"try\",\n            \"twas\",\n            \"us\",\n            \"use\",\n            \"used\",\n            \"uses\",\n            \"via\",\n            \"wants\",\n            \"was\",\n            \"way\",\n            \"we\",\n            \"well\",\n            \"were\",\n            \"what\",\n            \"when\",\n            \"where\",\n            \"which\",\n            \"while\",\n     
       \"who\",\n            \"whom\",\n            \"why\",\n            \"will\",\n            \"with\",\n            \"would\",\n            \"yet\",\n            \"you\",\n            \"your\",\n            \"you're\",\n            \"youre\",\n        ]\n    )\n\n    for line in fileinput.input():\n        try:\n            tweet = json.loads(line)\n        except ValueError:\n            # skip lines that are not valid JSON\n            continue\n        for word in text(tweet).split(\" \"):\n            word = word.lower()\n            word = word.replace(\".\", \"\")\n            word = word.replace(\",\", \"\")\n            word = word.replace(\"...\", \"\")\n            word = word.replace(\"'\", \"\")\n            word = word.replace(\":\", \"\")\n            word = word.replace(\"(\", \"\")\n            word = word.replace(\")\", \"\")\n            if len(word) < 3:\n                continue\n            if len(word) > 15:\n                continue\n            if word in stop_words:\n                continue\n            if word[0] in [\"@\", \"#\"]:\n                continue\n            if re.match(\"https?\", word):\n                continue\n            if word.startswith(\"rt\"):\n                continue\n            if not re.match(\"^[a-z]\", word, re.IGNORECASE):\n                continue\n            word_counts[word] = word_counts.get(word, 0) + 1\n\n    sorted_words = list(word_counts.keys())\n    sorted_words.sort(key=lambda x: word_counts[x], reverse=True)\n    top_words = sorted_words[0:MAX_WORDS]\n\n    words = []\n    count_range = word_counts[top_words[0]] - word_counts[top_words[-1]] + 1\n    size_ratio = 100.0 / count_range\n    for word in top_words:\n        size = int(word_counts[word] * size_ratio) + 15\n        words.append({\"text\": word, \"size\": size})\n\n    wordcloud_js = urlopen(\n        \"https://raw.githubusercontent.com/jasondavies/d3-cloud/master/build/d3.layout.cloud.js\"\n    ).read()\n\n    output = \"\"\"<!DOCTYPE html>\n\t<html>\n\t<head>\n\t<meta 
charset=\"utf-8\">\n\t<title>twarc wordcloud</title>\n\t<script src=\"https://d3js.org/d3.v3.min.js\"></script>\n\t</head>\n\t<body>\n\t<script>\n\n\t  // embed Jason Davies' d3-cloud since it's not available in a CDN\n\t  %s\n\n\t  var fill = d3.scale.category20();\n\t  var words = %s\n\n\t  d3.layout.cloud().size([800, 800])\n\t\t  .words(words)\n\t\t  .rotate(function() { return ~~(Math.random() * 2) * 90; })\n\t\t  .font(\"Impact\")\n\t\t  .fontSize(function(d) { return d.size; })\n\t\t  .on(\"end\", draw)\n\t\t  .start();\n\n\t  function draw(words) {\n\t\td3.select(\"body\").append(\"svg\")\n\t\t\t.attr(\"width\", 1000)\n\t\t\t.attr(\"height\", 1000)\n\t\t  .append(\"g\")\n\t\t\t.attr(\"transform\", \"translate(400,400)\")\n\t\t  .selectAll(\"text\")\n\t\t\t.data(words)\n\t\t  .enter().append(\"text\")\n\t\t\t.style(\"font-size\", function(d) { return d.size + \"px\"; })\n\t\t\t.style(\"font-family\", \"Impact\")\n\t\t\t.style(\"fill\", function(d, i) { return fill(i); })\n\t\t\t.attr(\"text-anchor\", \"middle\")\n\t\t\t.attr(\"transform\", function(d) {\n\t\t\t  return \"translate(\" + [d.x, d.y] + \")rotate(\" + d.rotate + \")\";\n\t\t\t})\n\t\t\t.text(function(d) { return d.text; });\n\t  }\n\t</script>\n\t</body>\n\t</html>\n\t\"\"\" % (\n        wordcloud_js.decode(\"utf8\"),\n        json.dumps(words, indent=2),\n    )\n\n    sys.stdout.write(output)\n\n\ndef text(t):\n    if \"full_text\" in t:\n        return t[\"full_text\"]\n    return t[\"text\"]\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "utils/youtubedl.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nusage: youtubedl.py [-h] [--max-downloads MAX_DOWNLOADS]\n                    [--max-filesize MAX_FILESIZE] [--ignore-livestreams]\n                    [--download-dir DOWNLOAD_DIR] [--block BLOCK]\n                    [--timeout TIMEOUT]\n                    files\n\nDownload videos in Twitter JSON data.\n\npositional arguments:\n  files                 json files to parse\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --max-downloads MAX_DOWNLOADS\n                        max downloads per URL\n  --max-filesize MAX_FILESIZE\n                        max filesize to download (bytes)\n  --ignore-livestreams  ignore livestreams which may never end\n  --download-dir DOWNLOAD_DIR\n                        directory to download to\n  --block BLOCK         hostnames to block (repeatable)\n  --timeout TIMEOUT     timeout download after n seconds\n\"\"\"\n\nimport os\nimport sys\nimport json\nimport time\nimport argparse\nimport logging\nimport fileinput\nimport youtube_dl\nimport multiprocessing as mp\n\nfrom urllib.parse import urlparse\nfrom datetime import datetime, timedelta\nfrom youtube_dl.utils import match_filter_func\n\nparser = argparse.ArgumentParser(description=\"Download videos in Twitter JSON data.\")\nparser.add_argument(\"--max-downloads\", type=int, help=\"max downloads per URL\")\n\nparser.add_argument(\"--max-filesize\", type=int, help=\"max filesize to download (bytes)\")\n\nparser.add_argument(\n    \"--ignore-livestreams\",\n    action=\"store_true\",\n    default=False,\n    help=\"ignore livestreams which may never end\",\n)\n\nparser.add_argument(\n    \"--download-dir\", type=str, help=\"directory to download to\", default=\"youtubedl\"\n)\n\nparser.add_argument(\"--block\", action=\"append\", help=\"hostnames to block (repeatable)\")\n\nparser.add_argument(\n    \"--timeout\", type=int, default=0, help=\"timeout download after n 
seconds\"\n)\n\nparser.add_argument(\"files\", action=\"append\", help=\"json files to parse\")\n\n\ndef main():\n    args = parser.parse_args()\n\n    # make download directory\n    download_dir = args.download_dir\n    if not os.path.isdir(download_dir):\n        os.mkdir(download_dir)\n\n    # setup logger\n    log_file = \"{}/youtubedl.log\".format(download_dir)\n    logging.basicConfig(filename=log_file, level=logging.INFO)\n    log = logging.getLogger()\n\n    # setup youtube_dl config\n    ydl_opts = {\n        \"format\": \"best\",\n        \"logger\": log,\n        \"restrictfilenames\": True,\n        \"ignoreerrors\": True,\n        \"nooverwrites\": True,\n        \"writedescription\": True,\n        \"writeinfojson\": True,\n        \"writesubtitles\": True,\n        \"writeautomaticsub\": True,\n        \"outtmpl\": \"{}/%(extractor)s/%(id)s/%(title)s.%(ext)s\".format(download_dir),\n        \"download_archive\": \"{}/archive.txt\".format(download_dir),\n    }\n    if args.ignore_livestreams:\n        ydl_opts[\"matchfilter\"] = match_filter_func(\"!is_live\")\n    if args.max_downloads:\n        ydl_opts[\"max_downloads\"] = args.max_downloads\n    if args.max_filesize:\n        ydl_opts[\"max_filesize\"] = args.max_filesize\n\n    # keep track of domains to block\n    blocklist = []\n    if args.block:\n        blocklist = args.block\n\n    # read in existing mapping file to know which urls we can ignore\n    seen = set()\n    mapping_file = os.path.join(download_dir, \"mapping.tsv\")\n    if os.path.isfile(mapping_file):\n        for line in open(mapping_file):\n            url, path = line.split(\"\\t\")\n            log.info(\"found %s in %s\", url, mapping_file)\n            seen.add(url)\n\n    # loop through the tweets\n    results = open(mapping_file, \"a\")\n    for line in fileinput.input(args.files):\n        tweet = json.loads(line)\n        log.info(\"analyzing %s\", tweet[\"id_str\"])\n        for e in tweet[\"entities\"][\"urls\"]:\n  
          url = e.get(\"unshortened_url\") or e[\"expanded_url\"]\n\n            # see if we can skip this one\n            if not url:\n                continue\n            if url in seen:\n                log.info(\"already processed %s\", url)\n                continue\n            seen.add(url)\n\n            # check for blocks\n            uri = urlparse(url)\n            if uri.netloc in blocklist:\n                log.warning(\"%s in block list\", url)\n                continue\n\n            # set up a multiprocessing queue to manage the download with a timeout\n            log.info(\"processing %s\", url)\n            q = mp.Queue()\n            p = mp.Process(target=download, args=(url, q, ydl_opts, log))\n            p.start()\n\n            started = datetime.now()\n            while True:\n                # if we've exceeded the timeout terminate the process\n                if args.timeout and datetime.now() - started > timedelta(\n                    seconds=args.timeout\n                ):\n                    log.warning(\"reached timeout %s\", args.timeout)\n                    p.terminate()\n                    break\n                # if the process is done we can stop\n                elif not p.is_alive():\n                    break\n                # otherwise sleep and then check again\n                time.sleep(1)\n\n            # if the queue was empty there either wasn't a download or it timed out\n            if q.empty():\n                filename = \"\"\n            else:\n                filename = q.get()\n\n            p.join()\n\n            # write the result to the mapping file\n            results.write(\"{}\\t{}\\n\".format(url, filename))\n\n\ndef download(url, q, ydl_opts, log):\n    try:\n        ydl = youtube_dl.YoutubeDL(ydl_opts)\n        info = ydl.extract_info(url)\n        if info:\n            filename = ydl.prepare_filename(info)\n            log.info(\"downloaded %s as %s\", url, filename)\n        else:\n         
   filename = \"\"\n            log.warning(\"%s doesn't look like a video\", url)\n        # hand the result back to the parent process, which reads it from the queue\n        q.put(filename)\n    except youtube_dl.utils.MaxDownloadsReached:\n        log.warning(\n            \"only %s downloads per url allowed\", ydl_opts.get(\"max_downloads\")\n        )\n\n\nif __name__ == \"__main__\":\n    main()\n"
  }
]